[RFC PATCH v3 00/20] Zero copy Rx using io

public inbox for [email protected]
 help / color / mirror / Atom feed

* [RFC PATCH v3 00/20] Zero copy Rx using io_uring
@ 2023-12-19 21:03 David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper David Wei
                   ` (19 more replies)
  0 siblings, 20 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

This patchset is a proposal that adds zero copy network Rx to io_uring.
With it, userspace can register a region of host memory for receiving
data directly from a NIC using DMA, without needing a kernel to user
copy.

Full kernel tree including some out of tree BNXT changes:

https://github.com/spikeh/linux/tree/zcrx_sil

On the userspace side, support is added to both liburing and Netbench:

https://github.com/spikeh/liburing/tree/zcrx2
https://github.com/spikeh/netbench/tree/zcrx

Hardware support is added to the Broadcom BNXT driver. This patchset +
userspace code was tested on an Intel Xeon Platinum 8321HC CPU and
Broadcom BCM57504 NIC.

Early benchmarks using this prototype, with iperf3 as a load generator,
showed a ~50% reduction in overall system memory bandwidth as measured
using perf counters. Note that DDIO must be disabled on Intel systems.
Build Netbench using the modified liburing above.

This patchset is based on the work by Jonathan Lemon
<[email protected]>:
https://lore.kernel.org/io-uring/[email protected]/

Changes in RFC v3:
------------------

* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC
  pool added is now a backend for memory provider.
* We're also reusing ppiov infrastructure. The refcounting rules stay
  the same but it's shifted into ppiov->refcount. That lets us to
  flexibly manage buffer lifetimes without adding any extra code to the
  common networking paths. It'd also make it easier to support dmabufs
  and device memory in the future.
  * io_uring also knows about pages, and so ppiovs might unnecessarily
    break tools inspecting data, that can easily be solved later.

Many patches are not for upstream as they depend on work in progress,
namely from Mina:

* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra

Changes in RFC v2:
------------------

* Added copy fallback support if userspace memory allocated for ZC Rx
  runs out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will
  need to figure out what driver would be best for such testing
  functionality in the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific
  sockets with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be
  steered into the same hardware Rx queue.
* Added Netbench server/client support.

Known deficiencies that we will address in a future patchset:

* Proper test driver + selftests, maybe netdevsim.
* Revisiting userspace API.
* Multi-region support.
* Steering setup.
* Further optimisation work.
* ...and more.

If you would like to try out this patchset, build and run the kernel
tree then build Netbench using liburing, all from forks above.

Run setup.sh first:

https://gist.github.com/isilence/e6a28ce41a545a261566672104afa461

Then run the following commands:

sudo ip netns exec nsserv ./netbench --server_only 1 --v6 false \
    --rx "io_uring --provide_buffers 0 --use_zc 1 \
    --zc_pool_pages 16384 --zc_ifname ptp-serv" --use_port 9999

sudo ip netns exec nscl ./netbench --client_only 1 --v6 false \
    --tx "epoll --threads 1 --per_thread 1 --size 2800" \
    --host 10.10.10.20 --use_port 9999

David Wei (6):
  io_uring: add interface queue
  io_uring: add mmap support for shared ifq ringbuffers
  netdev: add XDP_SETUP_ZC_RX command
  io_uring: setup ZC for an Rx queue when registering an ifq
  io_uring: add ZC buf and pool
  io_uring: add io_recvzc request

Pavel Begunkov (14):
  net: page_pool: add ppiov mangling helper
  tcp: don't allow non-devmem originated ppiov
  net: page pool: rework ppiov life cycle
  net: enable napi_pp_put_page for ppiov
  net: page_pool: add ->scrub mem provider callback
  io_uring: separate header for exported net bits
  io_uring/zcrx: implement socket registration
  io_uring: implement pp memory provider for zc rx
  net: page pool: add io_uring memory provider
  net: execute custom callback from napi
  io_uring/zcrx: add copy fallback
  veth: add support for io_uring zc rx
  net: page pool: generalise ppiov dma address get
  bnxt: enable io_uring zc page pool

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  71 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   7 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   3 +
 drivers/net/veth.c                            | 211 +++-
 include/linux/io_uring.h                      |   6 -
 include/linux/io_uring/net.h                  |  31 +
 include/linux/io_uring_types.h                |   8 +
 include/linux/net.h                           |   2 +
 include/linux/netdevice.h                     |   6 +
 include/net/busy_poll.h                       |   7 +
 include/net/page_pool/helpers.h               |  27 +-
 include/net/page_pool/types.h                 |   4 +
 include/uapi/linux/io_uring.h                 |  61 ++
 io_uring/Makefile                             |   2 +-
 io_uring/io_uring.c                           |  24 +
 io_uring/net.c                                | 133 ++-
 io_uring/opdef.c                              |  16 +
 io_uring/uring_cmd.c                          |   1 +
 io_uring/zc_rx.c                              | 954 ++++++++++++++++++
 io_uring/zc_rx.h                              |  80 ++
 net/core/dev.c                                |  46 +
 net/core/page_pool.c                          |  68 +-
 net/core/skbuff.c                             |  28 +-
 net/ipv4/tcp.c                                |   7 +
 net/socket.c                                  |   3 +-
 25 files changed, 1737 insertions(+), 69 deletions(-)
 create mode 100644 include/linux/io_uring/net.h
 create mode 100644 io_uring/zc_rx.c
 create mode 100644 io_uring/zc_rx.h

-- 
2.39.3


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 23:22   ` Mina Almasry
  2023-12-19 21:03 ` [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov David Wei
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

NOT FOR UPSTREAM

The final version will depend on how ppiov looks like, but add a
convenience helper for now.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/helpers.h | 5 +++++
 net/core/page_pool.c            | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 95f4d579cbc4..92804c499833 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -86,6 +86,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 
 /* page_pool_iov support */
 
+static inline struct page *page_pool_mangle_ppiov(struct page_pool_iov *ppiov)
+{
+	return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+}
+
 static inline struct dmabuf_genpool_chunk_owner *
 page_pool_iov_owner(const struct page_pool_iov *ppiov)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c0bc62ee77c6..38eff947f679 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1074,7 +1074,7 @@ static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, (struct page *)ppiov,
 				   pool->pages_state_hold_cnt);
-	return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+	return page_pool_mangle_ppiov(ppiov);
 }
 
 static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 23:24   ` Mina Almasry
  2023-12-19 21:03 ` [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle David Wei
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

NOT FOR UPSTREAM

There will be more users of struct page_pool_iov, and ppiovs from one
subsystem must not be used by another. That should never happen for any
sane application, but we need to enforce it in case of bufs and/or
malicious users.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 net/ipv4/tcp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33a8bb63fbf5..9c6b18eebb5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2384,6 +2384,13 @@ static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
 			}
 
 			ppiov = skb_frag_page_pool_iov(frag);
+
+			/* Disallow non devmem owned buffers */
+			if (ppiov->pp->p.memory_provider != PP_MP_DMABUF_DEVMEM) {
+				err = -ENODEV;
+				goto out;
+			}
+
 			end = start + skb_frag_size(frag);
 			copy = end - offset;
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 23:35   ` Mina Almasry
  2023-12-19 21:03 ` [RFC PATCH v3 04/20] net: enable napi_pp_put_page for ppiov David Wei
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

NOT FOR UPSTREAM
The final version will depend on how the ppiov infra looks like

Page pool is tracking how many pages were allocated and returned, which
serves for refcounting the pool, and so every page/frag allocated should
eventually come back to the page pool via appropriate ways, e.g. by
calling page_pool_put_page().

When it comes to normal page pools (i.e. without memory providers
attached), it's fine to return a page when it's still refcounted by
somewhat in the stack, in which case we'll "detach" the page from the
pool and rely on page refcount for it to return back to the kernel.

Memory providers are different, at least ppiov based ones, they need
all their buffers to eventually return back, so apart from custom pp
->release handlers, we'll catch when someone puts down a ppiov and call
its memory provider to handle it, i.e. __page_pool_iov_free().

The first problem is that __page_pool_iov_free() hard coded devmem
handling, and other providers need a flexible way to specify their own
callbacks.

The second problem is that it doesn't go through the generic page pool
paths and so can't do the mentioned pp accounting right. And we can't
even safely rely on page_pool_put_page() to be called somewhere before
to do the pp refcounting, because then the page pool might get destroyed
and ppiov->pp would point to garbage.

The solution is to make the pp ->release callback to be responsible for
properly recycling its buffers, e.g. calling what was
__page_pool_iov_free() before in case of devmem.
page_pool_iov_put_many() will be returning buffers to the page pool.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/helpers.h | 15 ++++++++---
 net/core/page_pool.c            | 46 +++++++++++++++++----------------
 2 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 92804c499833..ef380ee8f205 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -137,15 +137,22 @@ static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
 	refcount_add(count, &ppiov->refcount);
 }
 
-void __page_pool_iov_free(struct page_pool_iov *ppiov);
+static inline bool page_pool_iov_sub_and_test(struct page_pool_iov *ppiov,
+					      unsigned int count)
+{
+	return refcount_sub_and_test(count, &ppiov->refcount);
+}
 
 static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
 					  unsigned int count)
 {
-	if (!refcount_sub_and_test(count, &ppiov->refcount))
-		return;
+	if (count > 1)
+		WARN_ON_ONCE(page_pool_iov_sub_and_test(ppiov, count - 1));
 
-	__page_pool_iov_free(ppiov);
+#ifdef CONFIG_PAGE_POOL
+	page_pool_put_defragged_page(ppiov->pp, page_pool_mangle_ppiov(ppiov),
+				     -1, false);
+#endif
 }
 
 /* page pool mm helpers */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 38eff947f679..ecf90a1ccabe 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -599,6 +599,16 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 	page_pool_set_dma_addr(page, 0);
 }
 
+static void page_pool_return_provider(struct page_pool *pool, struct page *page)
+{
+	int count;
+
+	if (pool->mp_ops->release_page(pool, page)) {
+		count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
+		trace_page_pool_state_release(pool, page, count);
+	}
+}
+
 /* Disconnects a page (from a page_pool).  API users can have a need
  * to disconnect a page (from a page_pool), to allow it to be used as
  * a regular page (that will eventually be returned to the normal
@@ -607,13 +617,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
-	bool put;
 
-	put = true;
-	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
-		put = pool->mp_ops->release_page(pool, page);
-	else
-		__page_pool_release_page_dma(pool, page);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) {
+		page_pool_return_provider(pool, page);
+		return;
+	}
+
+	__page_pool_release_page_dma(pool, page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -621,10 +631,8 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	if (put) {
-		page_pool_clear_pp_info(page);
-		put_page(page);
-	}
+	page_pool_clear_pp_info(page);
+	put_page(page);
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -1034,15 +1042,6 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-void __page_pool_iov_free(struct page_pool_iov *ppiov)
-{
-	if (ppiov->pp->mp_ops != &dmabuf_devmem_ops)
-		return;
-
-	netdev_free_devmem(ppiov);
-}
-EXPORT_SYMBOL_GPL(__page_pool_iov_free);
-
 /*** "Dmabuf devmem memory provider" ***/
 
 static int mp_dmabuf_devmem_init(struct page_pool *pool)
@@ -1093,9 +1092,12 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
 		return false;
 
 	ppiov = page_to_page_pool_iov(page);
-	page_pool_iov_put_many(ppiov, 1);
-	/* We don't want the page pool put_page()ing our page_pool_iovs. */
-	return false;
+
+	if (!page_pool_iov_sub_and_test(ppiov, 1))
+		return false;
+	netdev_free_devmem(ppiov);
+	/* tell page_pool that the ppiov is released */
+	return true;
 }
 
 const struct pp_memory_provider_ops dmabuf_devmem_ops = {
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 04/20] net: enable napi_pp_put_page for ppiov
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (2 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 05/20] net: page_pool: add ->scrub mem provider callback David Wei
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

NOT FOR UPSTREAM

Teach napi_pp_put_page() how to work with ppiov.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/helpers.h |  2 +-
 net/core/page_pool.c            |  3 ---
 net/core/skbuff.c               | 28 ++++++++++++++++------------
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index ef380ee8f205..aca3a52d0e22 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -381,7 +381,7 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
 	long ret;
 
 	if (page_is_page_pool_iov(page))
-		return -EINVAL;
+		return 0;
 
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ecf90a1ccabe..71af9835638e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -922,9 +922,6 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
 	struct page *page;
 
-	if (pool->destroy_cnt)
-		return;
-
 	/* Empty alloc cache, assume caller made sure this is
 	 * no-longer in use, and page_pool_alloc_pages() cannot be
 	 * call concurrently.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f44c53b0ca27..cf523d655f92 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -896,19 +896,23 @@ bool napi_pp_put_page(struct page *page, bool napi_safe)
 	bool allow_direct = false;
 	struct page_pool *pp;
 
-	page = compound_head(page);
-
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
+	if (page_is_page_pool_iov(page)) {
+		pp = page_to_page_pool_iov(page)->pp;
+	} else {
+		page = compound_head(page);
+
+		/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+		 * in order to preserve any existing bits, such as bit 0 for the
+		 * head page of compound page and bit 1 for pfmemalloc page, so
+		 * mask those bits for freeing side when doing below checking,
+		 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+		 * to avoid recycling the pfmemalloc page.
+		 */
+		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+			return false;
 
-	pp = page->pp;
+		pp = page->pp;
+	}
 
 	/* Allow direct recycle if we have reasons to believe that we are
 	 * in the same context as the consumer would run, so there's
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 05/20] net: page_pool: add ->scrub mem provider callback
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (3 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 04/20] net: enable napi_pp_put_page for ppiov David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 06/20] io_uring: separate header for exported net bits David Wei
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

page pool is now waiting for all ppiovs to return before destroying
itself, and for that to happen the memory provider might need to push
some buffers, flush caches and so on.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/types.h | 1 +
 net/core/page_pool.c          | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index a701310b9811..fd846cac9fb6 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -134,6 +134,7 @@ enum pp_memory_provider_type {
 struct pp_memory_provider_ops {
 	int (*init)(struct page_pool *pool);
 	void (*destroy)(struct page_pool *pool);
+	void (*scrub)(struct page_pool *pool);
 	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
 	bool (*release_page)(struct page_pool *pool, struct page *page);
 };
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 71af9835638e..9e3073d61a97 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -947,6 +947,8 @@ static int page_pool_release(struct page_pool *pool)
 {
 	int inflight;
 
+	if (pool->mp_ops && pool->mp_ops->scrub)
+		pool->mp_ops->scrub(pool);
 	page_pool_scrub(pool);
 	inflight = page_pool_inflight(pool);
 	if (!inflight)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 06/20] io_uring: separate header for exported net bits
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (4 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 05/20] net: page_pool: add ->scrub mem provider callback David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-20 16:01   ` Jens Axboe
  2023-12-19 21:03 ` [RFC PATCH v3 07/20] io_uring: add interface queue David Wei
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/linux/io_uring.h     |  6 ------
 include/linux/io_uring/net.h | 18 ++++++++++++++++++
 io_uring/uring_cmd.c         |  1 +
 net/socket.c                 |  2 +-
 4 files changed, 20 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/io_uring/net.h

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index d8fc93492dc5..88d9aae7681b 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -12,7 +12,6 @@ void __io_uring_cancel(bool cancel_all);
 void __io_uring_free(struct task_struct *tsk);
 void io_uring_unreg_ringfd(void);
 const char *io_uring_get_opcode(u8 opcode);
-int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 static inline void io_uring_files_cancel(void)
 {
@@ -49,11 +48,6 @@ static inline const char *io_uring_get_opcode(u8 opcode)
 {
 	return "";
 }
-static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
-				    unsigned int issue_flags)
-{
-	return -EOPNOTSUPP;
-}
 #endif
 
 #endif
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
new file mode 100644
index 000000000000..b58f39fed4d5
--- /dev/null
+++ b/include/linux/io_uring/net.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_IO_URING_NET_H
+#define _LINUX_IO_URING_NET_H
+
+struct io_uring_cmd;
+
+#if defined(CONFIG_IO_URING)
+int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
+
+#else
+static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
+				    unsigned int issue_flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+#endif
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 34030583b9b2..c98749eff5ce 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -3,6 +3,7 @@
 #include <linux/errno.h>
 #include <linux/file.h>
 #include <linux/io_uring/cmd.h>
+#include <linux/io_uring/net.h>
 #include <linux/security.h>
 #include <linux/nospec.h>
 
diff --git a/net/socket.c b/net/socket.c
index 3379c64217a4..d75246450a3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -88,7 +88,7 @@
 #include <linux/xattr.h>
 #include <linux/nospec.h>
 #include <linux/indirect_call_wrapper.h>
-#include <linux/io_uring.h>
+#include <linux/io_uring/net.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (5 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 06/20] io_uring: separate header for exported net bits David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-20 16:13   ` Jens Axboe
  2023-12-21 17:57   ` Willem de Bruijn
  2023-12-19 21:03 ` [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers David Wei
                   ` (12 subsequent siblings)
  19 siblings, 2 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: David Wei <[email protected]>

This patch introduces a new object in io_uring called an interface queue
(ifq) which contains:

* A pool region allocated by userspace and registered w/ io_uring where
  Rx data is written to.
* A net device and one specific Rx queue in it that will be configured
  for ZC Rx.
* A pair of shared ringbuffers w/ userspace, dubbed registered buf
  (rbuf) rings. Each entry contains a pool region id and an offset + len
  within that region. The kernel writes entries into the completion ring
  to tell userspace where RX data is relative to the start of a region.
  Userspace writes entries into the refill ring to tell the kernel when
  it is done with the data.

For now, each io_uring instance has a single ifq, and each ifq has a
single pool region associated with one Rx queue.

Add a new opcode to io_uring_register that sets up an ifq. Size and
offsets of shared ringbuffers are returned to userspace for it to mmap.
The implementation will be added in a later patch.

Signed-off-by: David Wei <[email protected]>
---
 include/linux/io_uring_types.h |   8 +++
 include/uapi/linux/io_uring.h  |  51 +++++++++++++++
 io_uring/Makefile              |   2 +-
 io_uring/io_uring.c            |  13 ++++
 io_uring/zc_rx.c               | 116 +++++++++++++++++++++++++++++++++
 io_uring/zc_rx.h               |  37 +++++++++++
 6 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 io_uring/zc_rx.c
 create mode 100644 io_uring/zc_rx.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index bebab36abce8..e87053b200f2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -38,6 +38,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_COMPAT		= (1 << 12),
 };
 
+struct io_zc_rx_ifq;
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -182,6 +184,10 @@ struct io_rings {
 	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
 };
 
+struct io_rbuf_ring {
+	struct io_uring		rq, cq;
+};
+
 struct io_restriction {
 	DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
 	DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
@@ -383,6 +389,8 @@ struct io_ring_ctx {
 	struct io_rsrc_data		*file_data;
 	struct io_rsrc_data		*buf_data;
 
+	struct io_zc_rx_ifq		*ifq;
+
 	/* protected by ->uring_lock */
 	struct list_head		rsrc_ref_list;
 	struct io_alloc_cache		rsrc_node_cache;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f1c16f817742..024a6f79323b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -558,6 +558,9 @@ enum {
 	/* register a range of fixed file slots for automatic slot allocation */
 	IORING_REGISTER_FILE_ALLOC_RANGE	= 25,
 
+	/* register a network interface queue for zerocopy */
+	IORING_REGISTER_ZC_RX_IFQ		= 26,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
@@ -750,6 +753,54 @@ enum {
 	SOCKET_URING_OP_SETSOCKOPT,
 };
 
+struct io_uring_rbuf_rqe {
+	__u32	off;
+	__u32	len;
+	__u16	region;
+	__u8	__pad[6];
+};
+
+struct io_uring_rbuf_cqe {
+	__u32	off;
+	__u32	len;
+	__u16	region;
+	__u8	sock;
+	__u8	flags;
+	__u8	__pad[2];
+};
+
+struct io_rbuf_rqring_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	rqes;
+	__u8	__pad[4];
+};
+
+struct io_rbuf_cqring_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	cqes;
+	__u8	__pad[4];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZC_RX_IFQ
+ */
+struct io_uring_zc_rx_ifq_reg {
+	__u32	if_idx;
+	/* hw rx descriptor ring id */
+	__u32	if_rxq_id;
+	__u32	region_id;
+	__u32	rq_entries;
+	__u32	cq_entries;
+	__u32	flags;
+	__u16	cpu;
+
+	__u32	mmap_sz;
+	struct io_rbuf_rqring_offsets rq_off;
+	struct io_rbuf_cqring_offsets cq_off;
+};
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index e5be47e4fc3b..6c4b4ed37a1f 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -8,6 +8,6 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
 					cancel.o kbuf.o rsrc.o rw.o opdef.o \
-					notif.o waitid.o
+					notif.o waitid.o zc_rx.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1d254f2c997d..7fff01d57e9e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -95,6 +95,7 @@
 #include "notif.h"
 #include "waitid.h"
 #include "futex.h"
+#include "zc_rx.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -2919,6 +2920,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		return;
 
 	mutex_lock(&ctx->uring_lock);
+	io_unregister_zc_rx_ifqs(ctx);
 	if (ctx->buf_data)
 		__io_sqe_buffers_unregister(ctx);
 	if (ctx->file_data)
@@ -3109,6 +3111,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 			io_cqring_overflow_kill(ctx);
 			mutex_unlock(&ctx->uring_lock);
 		}
+		if (ctx->ifq) {
+			mutex_lock(&ctx->uring_lock);
+			io_shutdown_zc_rx_ifqs(ctx);
+			mutex_unlock(&ctx->uring_lock);
+		}
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 			io_move_task_work_from_local(ctx);
@@ -4609,6 +4616,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_file_alloc_range(ctx, arg);
 		break;
+	case IORING_REGISTER_ZC_RX_IFQ:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_zc_rx_ifq(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
new file mode 100644
index 000000000000..5fc94cad5e3a
--- /dev/null
+++ b/io_uring/zc_rx.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+#if defined(CONFIG_PAGE_POOL)
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "zc_rx.h"
+
+static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq,
+				 struct io_uring_zc_rx_ifq_reg *reg)
+{
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
+	size_t off, size, rq_size, cq_size;
+	void *ptr;
+
+	off = sizeof(struct io_rbuf_ring);
+	rq_size = reg->rq_entries * sizeof(struct io_uring_rbuf_rqe);
+	cq_size = reg->cq_entries * sizeof(struct io_uring_rbuf_cqe);
+	size = off + rq_size + cq_size;
+	ptr = (void *) __get_free_pages(gfp, get_order(size));
+	if (!ptr)
+		return -ENOMEM;
+	ifq->ring = (struct io_rbuf_ring *)ptr;
+	ifq->rqes = (struct io_uring_rbuf_rqe *)((char *)ptr + off);
+	ifq->cqes = (struct io_uring_rbuf_cqe *)((char *)ifq->rqes + rq_size);
+	return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq)
+{
+	if (ifq->ring)
+		folio_put(virt_to_folio(ifq->ring));
+}
+
+static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+	struct io_zc_rx_ifq *ifq;
+
+	ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+	if (!ifq)
+		return NULL;
+
+	ifq->if_rxq_id = -1;
+	ifq->ctx = ctx;
+	return ifq;
+}
+
+static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+{
+	io_free_rbuf_ring(ifq);
+	kfree(ifq);
+}
+
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+	struct io_uring_zc_rx_ifq_reg reg;
+	struct io_zc_rx_ifq *ifq;
+	int ret;
+
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		return -EINVAL;
+	if (copy_from_user(&reg, arg, sizeof(reg)))
+		return -EFAULT;
+	if (ctx->ifq)
+		return -EBUSY;
+	if (reg.if_rxq_id == -1)
+		return -EINVAL;
+
+	ifq = io_zc_rx_ifq_alloc(ctx);
+	if (!ifq)
+		return -ENOMEM;
+
+	/* TODO: initialise network interface */
+
+	ret = io_allocate_rbuf_ring(ifq, &reg);
+	if (ret)
+		goto err;
+
+	/* TODO: map zc region and initialise zc pool */
+
+	ifq->rq_entries = reg.rq_entries;
+	ifq->cq_entries = reg.cq_entries;
+	ifq->if_rxq_id = reg.if_rxq_id;
+	ctx->ifq = ifq;
+
+	return 0;
+err:
+	io_zc_rx_ifq_free(ifq);
+	return ret;
+}
+
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+	struct io_zc_rx_ifq *ifq = ctx->ifq;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (!ifq)
+		return;
+
+	ctx->ifq = NULL;
+	io_zc_rx_ifq_free(ifq);
+}
+
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+	lockdep_assert_held(&ctx->uring_lock);
+}
+
+#endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
new file mode 100644
index 000000000000..aab57c1a4c5d
--- /dev/null
+++ b/io_uring/zc_rx.h
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+struct io_zc_rx_ifq {
+	struct io_ring_ctx		*ctx;
+	struct net_device		*dev;
+	struct io_rbuf_ring		*ring;
+	struct io_uring_rbuf_rqe 	*rqes;
+	struct io_uring_rbuf_cqe 	*cqes;
+	u32				rq_entries;
+	u32				cq_entries;
+
+	/* hw rx descriptor ring id */
+	u32				if_rxq_id;
+};
+
+#if defined(CONFIG_PAGE_POOL)
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zc_rx_ifq_reg __user *arg);
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (6 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 07/20] io_uring: add interface queue David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-20 16:13   ` Jens Axboe
  2023-12-19 21:03 ` [RFC PATCH v3 09/20] netdev: add XDP_SETUP_ZC_RX command David Wei
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: David Wei <[email protected]>

This patch adds mmap support for ifq rbuf rings. There are two rings and
a struct io_rbuf_ring that contains the head and tail ptrs into each
ring.

Just like the io_uring SQ/CQ rings, userspace issues a single mmap call
using the io_uring fd w/ magic offset IORING_OFF_RBUF_RING. An opaque
ptr is returned to userspace, which is then expected to use the offsets
returned in the registration struct to get access to the head/tail and
rings.

Signed-off-by: David Wei <[email protected]>
---
 include/uapi/linux/io_uring.h |  2 ++
 io_uring/io_uring.c           |  5 +++++
 io_uring/zc_rx.c              | 19 ++++++++++++++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024a6f79323b..839933e562e6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -428,6 +428,8 @@ enum {
 #define IORING_OFF_PBUF_RING		0x80000000ULL
 #define IORING_OFF_PBUF_SHIFT		16
 #define IORING_OFF_MMAP_MASK		0xf8000000ULL
+#define IORING_OFF_RBUF_RING		0x20000000ULL
+#define IORING_OFF_RBUF_SHIFT		16
 
 /*
  * Filled with the offset for mmap(2)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7fff01d57e9e..02d6d638bd65 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3516,6 +3516,11 @@ static void *io_uring_validate_mmap_request(struct file *file,
 			return ERR_PTR(-EINVAL);
 		break;
 		}
+	case IORING_OFF_RBUF_RING:
+		if (!ctx->ifq)
+			return ERR_PTR(-EINVAL);
+		ptr = ctx->ifq->ring;
+		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 5fc94cad5e3a..7e3e6f6d446b 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -61,6 +61,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 {
 	struct io_uring_zc_rx_ifq_reg reg;
 	struct io_zc_rx_ifq *ifq;
+	size_t ring_sz, rqes_sz, cqes_sz;
 	int ret;
 
 	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
@@ -87,8 +88,24 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	ifq->rq_entries = reg.rq_entries;
 	ifq->cq_entries = reg.cq_entries;
 	ifq->if_rxq_id = reg.if_rxq_id;
-	ctx->ifq = ifq;
 
+	ring_sz = sizeof(struct io_rbuf_ring);
+	rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries;
+	cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries;
+	reg.mmap_sz = ring_sz + rqes_sz + cqes_sz;
+	reg.rq_off.rqes = ring_sz;
+	reg.cq_off.cqes = ring_sz + rqes_sz;
+	reg.rq_off.head = offsetof(struct io_rbuf_ring, rq.head);
+	reg.rq_off.tail = offsetof(struct io_rbuf_ring, rq.tail);
+	reg.cq_off.head = offsetof(struct io_rbuf_ring, cq.head);
+	reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail);
+
+	if (copy_to_user(arg, &reg, sizeof(reg))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	ctx->ifq = ifq;
 	return 0;
 err:
 	io_zc_rx_ifq_free(ifq);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 09/20] netdev: add XDP_SETUP_ZC_RX command
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (7 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq David Wei
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: David Wei <[email protected]>

RFC ONLY, NOT FOR UPSTREAM
This will be replaced with a separate ndo callback or some other
mechanism in next patchset revisions.

This patch adds a new XDP_SETUP_ZC_RX command that will be used in a
later patch to enable or disable ZC RX for a specific RX queue.

We are open to suggestions on a better way of doing this. Google's TCP
devmem proposal sets up struct netdev_rx_queue which persists across
device reset, then expects userspace to use an out-of-band method (e.g.
ethtool) to reset the device, thus re-filling a hardware Rx queue.

Signed-off-by: David Wei <[email protected]>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a4bdc35c7d6f..5b4df0b6a6c0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1097,6 +1097,7 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
 	XDP_SETUP_XSK_POOL,
+	XDP_SETUP_ZC_RX,
 };
 
 struct bpf_prog_offload_ops;
@@ -1135,6 +1136,11 @@ struct netdev_bpf {
 			struct xsk_buff_pool *pool;
 			u16 queue_id;
 		} xsk;
+		/* XDP_SETUP_ZC_RX */
+		struct {
+			struct io_zc_rx_ifq *ifq;
+			u16 queue_id;
+		} zc_rx;
 	};
 };
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (8 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 09/20] netdev: add XDP_SETUP_ZC_RX command David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-20 16:06   ` Jens Axboe
  2023-12-19 21:03 ` [RFC PATCH v3 11/20] io_uring/zcrx: implement socket registration David Wei
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: David Wei <[email protected]>

This patch sets up ZC for an Rx queue in a net device when an ifq is
registered with io_uring. The Rx queue is specified in the registration
struct.

For now since there is only one ifq, its destruction is implicit during
io_uring cleanup.

Signed-off-by: David Wei <[email protected]>
---
 io_uring/zc_rx.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 7e3e6f6d446b..259e08a34ab2 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -4,6 +4,7 @@
 #include <linux/errno.h>
 #include <linux/mm.h>
 #include <linux/io_uring.h>
+#include <linux/netdevice.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -11,6 +12,34 @@
 #include "kbuf.h"
 #include "zc_rx.h"
 
+typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+
+static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq,
+			   u16 queue_id)
+{
+	struct netdev_bpf cmd;
+	bpf_op_t ndo_bpf;
+
+	ndo_bpf = dev->netdev_ops->ndo_bpf;
+	if (!ndo_bpf)
+		return -EINVAL;
+
+	cmd.command = XDP_SETUP_ZC_RX;
+	cmd.zc_rx.ifq = ifq;
+	cmd.zc_rx.queue_id = queue_id;
+	return ndo_bpf(dev, &cmd);
+}
+
+static int io_open_zc_rxq(struct io_zc_rx_ifq *ifq)
+{
+	return __io_queue_mgmt(ifq->dev, ifq, ifq->if_rxq_id);
+}
+
+static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq)
+{
+	return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id);
+}
+
 static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq,
 				 struct io_uring_zc_rx_ifq_reg *reg)
 {
@@ -52,6 +81,10 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
 
 static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
 {
+	if (ifq->if_rxq_id != -1)
+		io_close_zc_rxq(ifq);
+	if (ifq->dev)
+		dev_put(ifq->dev);
 	io_free_rbuf_ring(ifq);
 	kfree(ifq);
 }
@@ -77,18 +110,25 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	if (!ifq)
 		return -ENOMEM;
 
-	/* TODO: initialise network interface */
-
 	ret = io_allocate_rbuf_ring(ifq, &reg);
 	if (ret)
 		goto err;
 
+	ret = -ENODEV;
+	ifq->dev = dev_get_by_index(current->nsproxy->net_ns, reg.if_idx);
+	if (!ifq->dev)
+		goto err;
+
 	/* TODO: map zc region and initialise zc pool */
 
 	ifq->rq_entries = reg.rq_entries;
 	ifq->cq_entries = reg.cq_entries;
 	ifq->if_rxq_id = reg.if_rxq_id;
 
+	ret = io_open_zc_rxq(ifq);
+	if (ret)
+		goto err;
+
 	ring_sz = sizeof(struct io_rbuf_ring);
 	rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries;
 	cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries;
@@ -101,6 +141,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail);
 
 	if (copy_to_user(arg, &reg, sizeof(reg))) {
+		io_close_zc_rxq(ifq);
 		ret = -EFAULT;
 		goto err;
 	}
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 11/20] io_uring/zcrx: implement socket registration
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (9 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 12/20] io_uring: add ZC buf and pool David Wei
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

We want userspace to explicitly list all sockets it'll be using with a
particular zc ifq, so we can properly configure them, e.g. binding the
sockets to the corresponding interface and setting steering rules. We'll
also need it to better control ifq lifetime and for
termination / unregistration purposes.

TODO: remove zc_rx_idx from struct socket, which will fix zc_rx_idx
token init races and re-registration bug.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/linux/net.h           |  2 +
 include/uapi/linux/io_uring.h |  7 +++
 io_uring/io_uring.c           |  6 +++
 io_uring/net.c                | 20 ++++++++
 io_uring/zc_rx.c              | 89 +++++++++++++++++++++++++++++++++--
 io_uring/zc_rx.h              | 17 +++++++
 net/socket.c                  |  1 +
 7 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index c9b4a63791a4..867061a91d30 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -126,6 +126,8 @@ struct socket {
 	const struct proto_ops	*ops; /* Might change with IPV6_ADDRFORM or MPTCP. */
 
 	struct socket_wq	wq;
+
+	unsigned		zc_rx_idx;
 };
 
 /*
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 839933e562e6..f4ba58bce3bd 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -562,6 +562,7 @@ enum {
 
 	/* register a network interface queue for zerocopy */
 	IORING_REGISTER_ZC_RX_IFQ		= 26,
+	IORING_REGISTER_ZC_RX_SOCK		= 27,
 
 	/* this goes last */
 	IORING_REGISTER_LAST,
@@ -803,6 +804,12 @@ struct io_uring_zc_rx_ifq_reg {
 	struct io_rbuf_cqring_offsets cq_off;
 };
 
+struct io_uring_zc_rx_sock_reg {
+	__u32	sockfd;
+	__u32	zc_rx_ifq_idx;
+	__u32	__resv[2];
+};
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 02d6d638bd65..47859599469d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4627,6 +4627,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_zc_rx_ifq(ctx, arg);
 		break;
+	case IORING_REGISTER_ZC_RX_SOCK:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_zc_rx_sock(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/net.c b/io_uring/net.c
index 75d494dad7e2..454ba301ae6b 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
 #include "net.h"
 #include "notif.h"
 #include "rsrc.h"
+#include "zc_rx.h"
 
 #if defined(CONFIG_NET)
 struct io_shutdown {
@@ -955,6 +956,25 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+static __maybe_unused
+struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
+					struct socket *sock)
+{
+	unsigned token = READ_ONCE(sock->zc_rx_idx);
+	unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET;
+	unsigned sock_idx = token & IO_ZC_IFQ_IDX_MASK;
+	struct io_zc_rx_ifq *ifq;
+
+	if (ifq_idx)
+		return NULL;
+	ifq = req->ctx->ifq;
+	if (!ifq || sock_idx >= ifq->nr_sockets)
+		return NULL;
+	if (ifq->sockets[sock_idx] != req->file)
+		return NULL;
+	return ifq;
+}
+
 void io_send_zc_cleanup(struct io_kiocb *req)
 {
 	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 259e08a34ab2..06e2c54d3f3d 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -11,6 +11,7 @@
 #include "io_uring.h"
 #include "kbuf.h"
 #include "zc_rx.h"
+#include "rsrc.h"
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 
@@ -79,10 +80,31 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
 	return ifq;
 }
 
-static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+static void io_shutdown_ifq(struct io_zc_rx_ifq *ifq)
 {
-	if (ifq->if_rxq_id != -1)
+	int i;
+
+	if (!ifq)
+		return;
+
+	for (i = 0; i < ifq->nr_sockets; i++) {
+		if (ifq->sockets[i]) {
+			fput(ifq->sockets[i]);
+			ifq->sockets[i] = NULL;
+		}
+	}
+	ifq->nr_sockets = 0;
+
+	if (ifq->if_rxq_id != -1) {
 		io_close_zc_rxq(ifq);
+		ifq->if_rxq_id = -1;
+	}
+}
+
+static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+{
+	io_shutdown_ifq(ifq);
+
 	if (ifq->dev)
 		dev_put(ifq->dev);
 	io_free_rbuf_ring(ifq);
@@ -141,7 +163,6 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail);
 
 	if (copy_to_user(arg, &reg, sizeof(reg))) {
-		io_close_zc_rxq(ifq);
 		ret = -EFAULT;
 		goto err;
 	}
@@ -162,6 +183,8 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
 	if (!ifq)
 		return;
 
+	WARN_ON_ONCE(ifq->nr_sockets);
+
 	ctx->ifq = NULL;
 	io_zc_rx_ifq_free(ifq);
 }
@@ -169,6 +192,66 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
 void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
 {
 	lockdep_assert_held(&ctx->uring_lock);
+
+	io_shutdown_ifq(ctx->ifq);
+}
+
+int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
+			   struct io_uring_zc_rx_sock_reg __user *arg)
+{
+	struct io_uring_zc_rx_sock_reg sr;
+	struct io_zc_rx_ifq *ifq;
+	struct socket *sock;
+	struct file *file;
+	int ret = -EEXIST;
+	int idx;
+
+	if (copy_from_user(&sr, arg, sizeof(sr)))
+		return -EFAULT;
+	if (sr.__resv[0] || sr.__resv[1])
+		return -EINVAL;
+	if (sr.zc_rx_ifq_idx != 0 || !ctx->ifq)
+		return -EINVAL;
+
+	ifq = ctx->ifq;
+	if (ifq->nr_sockets >= ARRAY_SIZE(ifq->sockets))
+		return -EINVAL;
+
+	BUILD_BUG_ON(ARRAY_SIZE(ifq->sockets) > IO_ZC_IFQ_IDX_MASK);
+
+	file = fget(sr.sockfd);
+	if (!file)
+		return -EBADF;
+
+	if (io_file_need_scm(file)) {
+		fput(file);
+		return -EBADF;
+	}
+
+	sock = sock_from_file(file);
+	if (unlikely(!sock || !sock->sk)) {
+		fput(file);
+		return -ENOTSOCK;
+	}
+
+	idx = ifq->nr_sockets;
+	lock_sock(sock->sk);
+	if (!sock->zc_rx_idx) {
+		unsigned token;
+
+		token = idx + (sr.zc_rx_ifq_idx << IO_ZC_IFQ_IDX_OFFSET);
+		WRITE_ONCE(sock->zc_rx_idx, token);
+		ret = 0;
+	}
+	release_sock(sock->sk);
+
+	if (ret) {
+		fput(file);
+		return -EINVAL;
+	}
+	ifq->sockets[idx] = file;
+	ifq->nr_sockets++;
+	return 0;
 }
 
 #endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index aab57c1a4c5d..9257dda77e92 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -2,6 +2,13 @@
 #ifndef IOU_ZC_RX_H
 #define IOU_ZC_RX_H
 
+#include <linux/io_uring_types.h>
+#include <linux/skbuff.h>
+
+#define IO_ZC_MAX_IFQ_SOCKETS		16
+#define IO_ZC_IFQ_IDX_OFFSET		16
+#define IO_ZC_IFQ_IDX_MASK		((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
+
 struct io_zc_rx_ifq {
 	struct io_ring_ctx		*ctx;
 	struct net_device		*dev;
@@ -13,6 +20,9 @@ struct io_zc_rx_ifq {
 
 	/* hw rx descriptor ring id */
 	u32				if_rxq_id;
+
+	unsigned			nr_sockets;
+	struct file			*sockets[IO_ZC_MAX_IFQ_SOCKETS];
 };
 
 #if defined(CONFIG_PAGE_POOL)
@@ -20,6 +30,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 			  struct io_uring_zc_rx_ifq_reg __user *arg);
 void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
 void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
+int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
+			   struct io_uring_zc_rx_sock_reg __user *arg);
 #else
 static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 			  struct io_uring_zc_rx_ifq_reg __user *arg)
@@ -32,6 +44,11 @@ static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
 static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
 {
 }
+static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
+				struct io_uring_zc_rx_sock_reg __user *arg)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #endif
diff --git a/net/socket.c b/net/socket.c
index d75246450a3c..a9cef870309a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -637,6 +637,7 @@ struct socket *sock_alloc(void)
 
 	sock = SOCKET_I(inode);
 
+	sock->zc_rx_idx = 0;
 	inode->i_ino = get_next_ino();
 	inode->i_mode = S_IFSOCK | S_IRWXUGO;
 	inode->i_uid = current_fsuid();
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 12/20] io_uring: add ZC buf and pool
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (10 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 11/20] io_uring/zcrx: implement socket registration David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx David Wei
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: David Wei <[email protected]>

[TODO: REVIEW COMMIT MESSAGE]

This patch adds two objects:

* Zero copy buffer representation, holding a page, its mapped dma_addr,
  and a refcount for lifetime management.
* Zero copy pool, spiritually similar to page pool, that holds ZC bufs
  and hands them out to net devices.

Pool regions are registered w/ io_uring using the registered buffer API,
with a 1:1 mapping between region and nr_iovec in
io_uring_register_buffers. This does the heavy lifting of pinning and
chunking into bvecs into a struct io_mapped_ubuf for us.

For now as there is only one pool region per ifq, there is no separate
API for adding/removing regions yet and it is mapped implicitly during
ifq registration.

Signed-off-by: David Wei <[email protected]>
---
 include/linux/io_uring/net.h |   8 +++
 io_uring/zc_rx.c             | 135 ++++++++++++++++++++++++++++++++++-
 io_uring/zc_rx.h             |  15 ++++
 3 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
index b58f39fed4d5..d994d26116d0 100644
--- a/include/linux/io_uring/net.h
+++ b/include/linux/io_uring/net.h
@@ -2,8 +2,16 @@
 #ifndef _LINUX_IO_URING_NET_H
 #define _LINUX_IO_URING_NET_H
 
+#include <net/page_pool/types.h>
+
 struct io_uring_cmd;
 
+struct io_zc_rx_buf {
+	struct page_pool_iov	ppiov;
+	struct page		*page;
+	dma_addr_t		dma;
+};
+
 #if defined(CONFIG_IO_URING)
 int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 06e2c54d3f3d..1e656b481725 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/io_uring.h>
 #include <linux/netdevice.h>
+#include <linux/nospec.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -15,6 +16,11 @@
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 
+static inline struct device *netdev2dev(struct net_device *dev)
+{
+	return dev->dev.parent;
+}
+
 static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq,
 			   u16 queue_id)
 {
@@ -67,6 +73,129 @@ static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq)
 		folio_put(virt_to_folio(ifq->ring));
 }
 
+static int io_zc_rx_init_buf(struct device *dev, struct page *page, u16 pool_id,
+			     u32 pgid, struct io_zc_rx_buf *buf)
+{
+	dma_addr_t addr = 0;
+
+	/* Skip dma setup for devices that don't do any DMA transfers */
+	if (dev) {
+		addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE,
+					  DMA_BIDIRECTIONAL,
+					  DMA_ATTR_SKIP_CPU_SYNC);
+		if (dma_mapping_error(dev, addr))
+			return -ENOMEM;
+	}
+
+	buf->dma = addr;
+	buf->page = page;
+	refcount_set(&buf->ppiov.refcount, 0);
+	buf->ppiov.owner = NULL;
+	buf->ppiov.pp = NULL;
+	get_page(page);
+	return 0;
+}
+
+static void io_zc_rx_free_buf(struct device *dev, struct io_zc_rx_buf *buf)
+{
+	struct page *page = buf->page;
+
+	if (dev)
+		dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL,
+				     DMA_ATTR_SKIP_CPU_SYNC);
+	put_page(page);
+}
+
+static int io_zc_rx_map_pool(struct io_zc_rx_pool *pool,
+			     struct io_mapped_ubuf *imu,
+			     struct device *dev)
+{
+	struct io_zc_rx_buf *buf;
+	struct page *page;
+	int i, ret;
+
+	for (i = 0; i < imu->nr_bvecs; i++) {
+		page = imu->bvec[i].bv_page;
+		buf = &pool->bufs[i];
+		ret = io_zc_rx_init_buf(dev, page, pool->pool_id, i, buf);
+		if (ret)
+			goto err;
+
+		pool->freelist[i] = i;
+	}
+
+	pool->free_count = imu->nr_bvecs;
+	return 0;
+err:
+	while (i--) {
+		buf = &pool->bufs[i];
+		io_zc_rx_free_buf(dev, buf);
+	}
+	return ret;
+}
+
+static int io_zc_rx_create_pool(struct io_ring_ctx *ctx,
+				struct io_zc_rx_ifq *ifq,
+				u16 id)
+{
+	struct device *dev = netdev2dev(ifq->dev);
+	struct io_mapped_ubuf *imu;
+	struct io_zc_rx_pool *pool;
+	int nr_pages;
+	int ret;
+
+	if (ifq->pool)
+		return -EFAULT;
+
+	if (unlikely(id >= ctx->nr_user_bufs))
+		return -EFAULT;
+	id = array_index_nospec(id, ctx->nr_user_bufs);
+	imu = ctx->user_bufs[id];
+	if (imu->ubuf & ~PAGE_MASK || imu->ubuf_end & ~PAGE_MASK)
+		return -EFAULT;
+
+	ret = -ENOMEM;
+	nr_pages = imu->nr_bvecs;
+	pool = kvmalloc(struct_size(pool, freelist, nr_pages), GFP_KERNEL);
+	if (!pool)
+		goto err;
+
+	pool->bufs = kvmalloc_array(nr_pages, sizeof(*pool->bufs), GFP_KERNEL);
+	if (!pool->bufs)
+		goto err_buf;
+
+	ret = io_zc_rx_map_pool(pool, imu, dev);
+	if (ret)
+		goto err_map;
+
+	pool->ifq = ifq;
+	pool->pool_id = id;
+	pool->nr_bufs = nr_pages;
+	spin_lock_init(&pool->freelist_lock);
+	ifq->pool = pool;
+	return 0;
+err_map:
+	kvfree(pool->bufs);
+err_buf:
+	kvfree(pool);
+err:
+	return ret;
+}
+
+static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool)
+{
+	struct device *dev = netdev2dev(pool->ifq->dev);
+	struct io_zc_rx_buf *buf;
+
+	for (int i = 0; i < pool->nr_bufs; i++) {
+		buf = &pool->bufs[i];
+		io_zc_rx_free_buf(dev, buf);
+	}
+	kvfree(pool->bufs);
+	kvfree(pool);
+}
+
 static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
 {
 	struct io_zc_rx_ifq *ifq;
@@ -105,6 +234,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
 {
 	io_shutdown_ifq(ifq);
 
+	if (ifq->pool)
+		io_zc_rx_destroy_pool(ifq->pool);
 	if (ifq->dev)
 		dev_put(ifq->dev);
 	io_free_rbuf_ring(ifq);
@@ -141,7 +272,9 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
 	if (!ifq->dev)
 		goto err;
 
-	/* TODO: map zc region and initialise zc pool */
+	ret = io_zc_rx_create_pool(ctx, ifq, reg.region_id);
+	if (ret)
+		goto err;
 
 	ifq->rq_entries = reg.rq_entries;
 	ifq->cq_entries = reg.cq_entries;
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index 9257dda77e92..af1d865525d2 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -3,15 +3,30 @@
 #define IOU_ZC_RX_H
 
 #include <linux/io_uring_types.h>
+#include <linux/io_uring/net.h>
 #include <linux/skbuff.h>
 
 #define IO_ZC_MAX_IFQ_SOCKETS		16
 #define IO_ZC_IFQ_IDX_OFFSET		16
 #define IO_ZC_IFQ_IDX_MASK		((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
 
+struct io_zc_rx_pool {
+	struct io_zc_rx_ifq	*ifq;
+	struct io_zc_rx_buf	*bufs;
+	u32			nr_bufs;
+	u16			pool_id;
+
+	/* freelist */
+	spinlock_t		freelist_lock;
+	u32			free_count;
+	u32			freelist[];
+};
+
 struct io_zc_rx_ifq {
 	struct io_ring_ctx		*ctx;
 	struct net_device		*dev;
+	struct io_zc_rx_pool		*pool;
+
 	struct io_rbuf_ring		*ring;
 	struct io_uring_rbuf_rqe 	*rqes;
 	struct io_uring_rbuf_cqe 	*cqes;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (11 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 12/20] io_uring: add ZC buf and pool David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 23:44   ` Mina Almasry
  2023-12-21 19:36   ` Pavel Begunkov
  2023-12-19 21:03 ` [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider David Wei
                   ` (6 subsequent siblings)
  19 siblings, 2 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

We're adding a new pp memory provider to implement io_uring zerocopy
receive. It'll be "registered" in pp and used in later paches.

The typical life cycle of a buffer goes as follows: first it's allocated
to a driver with the initial refcount set to 1. The drivers fills it
with data, puts it into an skb and passes down the stack, where it gets
queued up to a socket. Later, a zc io_uring request will be receiving
data from the socket from a task context. At that point io_uring will
tell the userspace that this buffer has some data by posting an
appropriate completion. It'll also elevating the refcount by
IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading
the data. When the userspace is done with the buffer it should return it
back to io_uring by adding an entry to the buffer refill ring. When
necessary io_uring will poll the refill ring, compare references
including IO_ZC_RX_UREF and reuse the buffer.

Initally, all buffers are placed in a spinlock protected ->freelist.
It's a slow path stash, where buffers are considered to be unallocated
and not exposed to core page pool. On allocation, pp will first try
all its caches, and the ->alloc_pages callback if everything else
failed.

The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill
ring. The consumption from the ring is always done in the attached napi
context, so no additional synchronisation required. If that fails we'll
be getting buffers from the ->freelist.

Note: only ->freelist are considered unallocated for page pool, so we
only add pages_state_hold_cnt when allocating from there. Subsequently,
as page_pool_return_page() and others bump the ->pages_state_release_cnt
counter, io_pp_zc_release_page() can only use ->freelist, which is not a
problem as it's not a slow path.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/linux/io_uring/net.h |   5 +
 io_uring/zc_rx.c             | 204 +++++++++++++++++++++++++++++++++++
 io_uring/zc_rx.h             |   6 ++
 3 files changed, 215 insertions(+)

diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
index d994d26116d0..13244ae5fc4a 100644
--- a/include/linux/io_uring/net.h
+++ b/include/linux/io_uring/net.h
@@ -13,6 +13,11 @@ struct io_zc_rx_buf {
 };
 
 #if defined(CONFIG_IO_URING)
+
+#if defined(CONFIG_PAGE_POOL)
+extern const struct pp_memory_provider_ops io_uring_pp_zc_ops;
+#endif
+
 int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 #else
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 1e656b481725..ff1dac24ac40 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -6,6 +6,7 @@
 #include <linux/io_uring.h>
 #include <linux/netdevice.h>
 #include <linux/nospec.h>
+#include <trace/events/page_pool.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -387,4 +388,207 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
 	return 0;
 }
 
+static inline struct io_zc_rx_buf *io_iov_to_buf(struct page_pool_iov *iov)
+{
+	return container_of(iov, struct io_zc_rx_buf, ppiov);
+}
+
+static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool,
+				   struct io_zc_rx_buf *buf)
+{
+	return buf - pool->bufs;
+}
+
+static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
+{
+	refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount);
+}
+
+static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf)
+{
+	if (page_pool_iov_refcount(&buf->ppiov) < IO_ZC_RX_UREF)
+		return false;
+
+	return page_pool_iov_sub_and_test(&buf->ppiov, IO_ZC_RX_UREF);
+}
+
+static inline struct page *io_zc_buf_to_pp_page(struct io_zc_rx_buf *buf)
+{
+	return page_pool_mangle_ppiov(&buf->ppiov);
+}
+
+static inline void io_zc_add_pp_cache(struct page_pool *pp,
+				      struct io_zc_rx_buf *buf)
+{
+	refcount_set(&buf->ppiov.refcount, 1);
+	pp->alloc.cache[pp->alloc.count++] = io_zc_buf_to_pp_page(buf);
+}
+
+static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq)
+{
+	struct io_rbuf_ring *ring = ifq->ring;
+	u32 entries;
+
+	entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head;
+	return min(entries, ifq->rq_entries);
+}
+
+static void io_zc_rx_ring_refill(struct page_pool *pp,
+				 struct io_zc_rx_ifq *ifq)
+{
+	unsigned int entries = io_zc_rx_rqring_entries(ifq);
+	unsigned int mask = ifq->rq_entries - 1;
+	struct io_zc_rx_pool *pool = ifq->pool;
+
+	if (unlikely(!entries))
+		return;
+
+	while (entries--) {
+		unsigned int rq_idx = ifq->cached_rq_head++ & mask;
+		struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx];
+		u32 pgid = rqe->off / PAGE_SIZE;
+		struct io_zc_rx_buf *buf = &pool->bufs[pgid];
+
+		if (!io_zc_rx_put_buf_uref(buf))
+			continue;
+		io_zc_add_pp_cache(pp, buf);
+		if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL)
+			break;
+	}
+	smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head);
+}
+
+static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq)
+{
+	struct io_zc_rx_pool *pool = ifq->pool;
+
+	spin_lock_bh(&pool->freelist_lock);
+	while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+		struct io_zc_rx_buf *buf;
+		u32 pgid;
+
+		pgid = pool->freelist[--pool->free_count];
+		buf = &pool->bufs[pgid];
+
+		io_zc_add_pp_cache(pp, buf);
+		pp->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf),
+					   pp->pages_state_hold_cnt);
+	}
+	spin_unlock_bh(&pool->freelist_lock);
+}
+
+static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool,
+				 struct io_zc_rx_buf *buf)
+{
+	spin_lock_bh(&pool->freelist_lock);
+	pool->freelist[pool->free_count++] = io_buf_pgid(pool, buf);
+	spin_unlock_bh(&pool->freelist_lock);
+}
+
+static struct page *io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp)
+{
+	struct io_zc_rx_ifq *ifq = pp->mp_priv;
+
+	/* pp should already be ensuring that */
+	if (unlikely(pp->alloc.count))
+		goto out_return;
+
+	io_zc_rx_ring_refill(pp, ifq);
+	if (likely(pp->alloc.count))
+		goto out_return;
+
+	io_zc_rx_refill_slow(pp, ifq);
+	if (!pp->alloc.count)
+		return NULL;
+out_return:
+	return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_page(struct page_pool *pp, struct page *page)
+{
+	struct io_zc_rx_ifq *ifq = pp->mp_priv;
+	struct page_pool_iov *ppiov;
+
+	if (WARN_ON_ONCE(!page_is_page_pool_iov(page)))
+		return false;
+
+	ppiov = page_to_page_pool_iov(page);
+
+	if (!page_pool_iov_sub_and_test(ppiov, 1))
+		return false;
+
+	io_zc_rx_recycle_buf(ifq->pool, io_iov_to_buf(ppiov));
+	return true;
+}
+
+static void io_pp_zc_scrub(struct page_pool *pp)
+{
+	struct io_zc_rx_ifq *ifq = pp->mp_priv;
+	struct io_zc_rx_pool *pool = ifq->pool;
+	struct io_zc_rx_buf *buf;
+	int i;
+
+	for (i = 0; i < pool->nr_bufs; i++) {
+		buf = &pool->bufs[i];
+
+		if (io_zc_rx_put_buf_uref(buf)) {
+			/* just return it to the page pool, it'll clean it up */
+			refcount_set(&buf->ppiov.refcount, 1);
+			page_pool_iov_put_many(&buf->ppiov, 1);
+		}
+	}
+}
+
+static void io_zc_rx_init_pool(struct io_zc_rx_pool *pool,
+			       struct page_pool *pp)
+{
+	struct io_zc_rx_buf *buf;
+	int i;
+
+	for (i = 0; i < pool->nr_bufs; i++) {
+		buf = &pool->bufs[i];
+		buf->ppiov.pp = pp;
+	}
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+	struct io_zc_rx_ifq *ifq = pp->mp_priv;
+
+	if (!ifq)
+		return -EINVAL;
+	if (pp->p.order != 0)
+		return -EINVAL;
+	if (!pp->p.napi)
+		return -EINVAL;
+
+	io_zc_rx_init_pool(ifq->pool, pp);
+	percpu_ref_get(&ifq->ctx->refs);
+	ifq->pp = pp;
+	return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+	struct io_zc_rx_ifq *ifq = pp->mp_priv;
+	struct io_zc_rx_pool *pool = ifq->pool;
+
+	ifq->pp = NULL;
+
+	if (WARN_ON_ONCE(pool->free_count != pool->nr_bufs))
+		return;
+	percpu_ref_put(&ifq->ctx->refs);
+}
+
+const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
+	.alloc_pages		= io_pp_zc_alloc_pages,
+	.release_page		= io_pp_zc_release_page,
+	.init			= io_pp_zc_init,
+	.destroy		= io_pp_zc_destroy,
+	.scrub			= io_pp_zc_scrub,
+};
+EXPORT_SYMBOL(io_uring_pp_zc_ops);
+
+
 #endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index af1d865525d2..00d864700c67 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -10,6 +10,9 @@
 #define IO_ZC_IFQ_IDX_OFFSET		16
 #define IO_ZC_IFQ_IDX_MASK		((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
 
+#define IO_ZC_RX_UREF			0x10000
+#define IO_ZC_RX_KREF_MASK		(IO_ZC_RX_UREF - 1)
+
 struct io_zc_rx_pool {
 	struct io_zc_rx_ifq	*ifq;
 	struct io_zc_rx_buf	*bufs;
@@ -26,12 +29,15 @@ struct io_zc_rx_ifq {
 	struct io_ring_ctx		*ctx;
 	struct net_device		*dev;
 	struct io_zc_rx_pool		*pool;
+	struct page_pool		*pp;
 
 	struct io_rbuf_ring		*ring;
 	struct io_uring_rbuf_rqe 	*rqes;
 	struct io_uring_rbuf_cqe 	*cqes;
 	u32				rq_entries;
 	u32				cq_entries;
+	u32				cached_rq_head;
+	u32				cached_cq_tail;
 
 	/* hw rx descriptor ring id */
 	u32				if_rxq_id;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (12 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 23:39   ` Mina Almasry
  2023-12-19 21:03 ` [RFC PATCH v3 15/20] io_uring: add io_recvzc request David Wei
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

Allow creating a special io_uring pp memory providers, which will be for
implementing io_uring zerocopy receive.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/types.h | 1 +
 net/core/page_pool.c          | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index fd846cac9fb6..f54ee759e362 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -129,6 +129,7 @@ struct mem_provider;
 enum pp_memory_provider_type {
 	__PP_MP_NONE, /* Use system allocator directly */
 	PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */
+	PP_MP_IOU_ZCRX, /* io_uring zerocopy receive provider */
 };
 
 struct pp_memory_provider_ops {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9e3073d61a97..ebf5ff009d9d 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -21,6 +21,7 @@
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
 #include <linux/genalloc.h>
+#include <linux/io_uring/net.h>
 
 #include <trace/events/page_pool.h>
 
@@ -242,6 +243,11 @@ static int page_pool_init(struct page_pool *pool,
 	case PP_MP_DMABUF_DEVMEM:
 		pool->mp_ops = &dmabuf_devmem_ops;
 		break;
+#if defined(CONFIG_IO_URING)
+	case PP_MP_IOU_ZCRX:
+		pool->mp_ops = &io_uring_pp_zc_ops;
+		break;
+#endif
 	default:
 		err = -EINVAL;
 		goto free_ptr_ring;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (13 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-20 16:27   ` Jens Axboe
  2023-12-19 21:03 ` [RFC PATCH v3 16/20] net: execute custom callback from napi David Wei
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: David Wei <[email protected]>

This patch adds an io_uring opcode OP_RECV_ZC for doing ZC reads from a
socket that is set up for ZC Rx. The request reads skbs from a socket
where its page frags are tagged w/ a magic cookie in their page private
field. For each frag, entries are written into the ifq rbuf completion
ring, and the total number of bytes read is returned to user as an
io_uring completion event.

Multishot requests work. There is no need to specify provided buffers as
data is returned in  the ifq rbuf completion rings.

Userspace is expected to look into the ifq rbuf completion ring when it
receives an io_uring completion event.

The addr3 field is used to encode params in the following format:

  addr3 = (readlen << 32);

readlen is the max amount of data to read from the socket. ifq_id is the
interface queue id, and currently only 0 is supported.

Signed-off-by: David Wei <[email protected]>
---
 include/uapi/linux/io_uring.h |   1 +
 io_uring/net.c                | 119 ++++++++++++++++-
 io_uring/opdef.c              |  16 +++
 io_uring/zc_rx.c              | 240 +++++++++++++++++++++++++++++++++-
 io_uring/zc_rx.h              |   5 +
 5 files changed, 375 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f4ba58bce3bd..f57f394744fe 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -253,6 +253,7 @@ enum io_uring_op {
 	IORING_OP_FUTEX_WAIT,
 	IORING_OP_FUTEX_WAKE,
 	IORING_OP_FUTEX_WAITV,
+	IORING_OP_RECV_ZC,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/net.c b/io_uring/net.c
index 454ba301ae6b..7a2aadf6962c 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -71,6 +71,16 @@ struct io_sr_msg {
 	struct io_kiocb 		*notif;
 };
 
+struct io_recvzc {
+	struct file			*file;
+	unsigned			len;
+	unsigned			done_io;
+	unsigned			msg_flags;
+	u16				flags;
+
+	u32				datalen;
+};
+
 static inline bool io_check_multishot(struct io_kiocb *req,
 				      unsigned int issue_flags)
 {
@@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
 	unsigned int cflags;
 
 	cflags = io_put_kbuf(req, issue_flags);
-	if (msg->msg_inq && msg->msg_inq != -1)
+	if (msg && msg->msg_inq && msg->msg_inq != -1)
 		cflags |= IORING_CQE_F_SOCK_NONEMPTY;
 
 	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
@@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
 			io_recv_prep_retry(req);
 			/* Known not-empty or unknown state, retry */
 			if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
-			    msg->msg_inq == -1)
+			    (msg && msg->msg_inq == -1))
 				return false;
 			if (issue_flags & IO_URING_F_MULTISHOT)
 				*ret = IOU_ISSUE_SKIP_COMPLETE;
@@ -956,9 +966,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
-static __maybe_unused
-struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
-					struct socket *sock)
+static struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
+					      struct socket *sock)
 {
 	unsigned token = READ_ONCE(sock->zc_rx_idx);
 	unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET;
@@ -975,6 +984,106 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
 	return ifq;
 }
 
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+	u64 recvzc_cmd;
+
+	recvzc_cmd = READ_ONCE(sqe->addr3);
+	zc->datalen = recvzc_cmd >> 32;
+	if (recvzc_cmd & 0xffff)
+		return -EINVAL;
+	if (!(req->ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		return -EINVAL;
+	if (unlikely(sqe->file_index || sqe->addr2))
+		return -EINVAL;
+
+	zc->len = READ_ONCE(sqe->len);
+	zc->flags = READ_ONCE(sqe->ioprio);
+	if (zc->flags & ~(RECVMSG_FLAGS))
+		return -EINVAL;
+	zc->msg_flags = READ_ONCE(sqe->msg_flags);
+	if (zc->msg_flags & MSG_DONTWAIT)
+		req->flags |= REQ_F_NOWAIT;
+	if (zc->msg_flags & MSG_ERRQUEUE)
+		req->flags |= REQ_F_CLEAR_POLLIN;
+	if (zc->flags & IORING_RECV_MULTISHOT) {
+		if (zc->msg_flags & MSG_WAITALL)
+			return -EINVAL;
+		if (req->opcode == IORING_OP_RECV && zc->len)
+			return -EINVAL;
+		req->flags |= REQ_F_APOLL_MULTISHOT;
+	}
+
+#ifdef CONFIG_COMPAT
+	if (req->ctx->compat)
+		zc->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+	zc->done_io = 0;
+	return 0;
+}
+
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+	struct socket *sock;
+	unsigned flags;
+	int ret, min_ret = 0;
+	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct io_zc_rx_ifq *ifq;
+
+	if (issue_flags & IO_URING_F_UNLOCKED)
+		return -EAGAIN;
+
+	if (!(req->flags & REQ_F_POLLED) &&
+	    (zc->flags & IORING_RECVSEND_POLL_FIRST))
+		return -EAGAIN;
+
+	sock = sock_from_file(req->file);
+	if (unlikely(!sock))
+		return -ENOTSOCK;
+	ifq = io_zc_verify_sock(req, sock);
+	if (!ifq)
+		return -EINVAL;
+
+retry_multishot:
+	flags = zc->msg_flags;
+	if (force_nonblock)
+		flags |= MSG_DONTWAIT;
+	if (flags & MSG_WAITALL)
+		min_ret = zc->len;
+
+	ret = io_zc_rx_recv(ifq, sock, zc->datalen, flags);
+	if (ret < min_ret) {
+		if (ret == -EAGAIN && force_nonblock) {
+			if (issue_flags & IO_URING_F_MULTISHOT)
+				return IOU_ISSUE_SKIP_COMPLETE;
+			return -EAGAIN;
+		}
+		if (ret > 0 && io_net_retry(sock, flags)) {
+			zc->len -= ret;
+			zc->done_io += ret;
+			req->flags |= REQ_F_PARTIAL_IO;
+			return -EAGAIN;
+		}
+		if (ret == -ERESTARTSYS)
+			ret = -EINTR;
+		req_set_fail(req);
+	} else if ((flags & MSG_WAITALL) && (flags & (MSG_TRUNC | MSG_CTRUNC))) {
+		req_set_fail(req);
+	}
+
+	if (ret > 0)
+		ret += zc->done_io;
+	else if (zc->done_io)
+		ret = zc->done_io;
+
+	if (!io_recv_finish(req, &ret, 0, ret <= 0, issue_flags))
+		goto retry_multishot;
+
+	return ret;
+}
+
 void io_send_zc_cleanup(struct io_kiocb *req)
 {
 	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 799db44283c7..a90231566d09 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -35,6 +35,7 @@
 #include "rw.h"
 #include "waitid.h"
 #include "futex.h"
+#include "zc_rx.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -467,6 +468,18 @@ const struct io_issue_def io_issue_defs[] = {
 		.issue			= io_futexv_wait,
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_RECV_ZC] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
+		.ioprio			= 1,
+#if defined(CONFIG_NET)
+		.prep			= io_recvzc_prep,
+		.issue			= io_recvzc,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
@@ -704,6 +717,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_FUTEX_WAITV] = {
 		.name			= "FUTEX_WAITV",
 	},
+	[IORING_OP_RECV_ZC] = {
+		.name			= "RECV_ZC",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index ff1dac24ac40..acb70ca23150 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -6,6 +6,7 @@
 #include <linux/io_uring.h>
 #include <linux/netdevice.h>
 #include <linux/nospec.h>
+#include <net/tcp.h>
 #include <trace/events/page_pool.h>
 
 #include <uapi/linux/io_uring.h>
@@ -15,8 +16,20 @@
 #include "zc_rx.h"
 #include "rsrc.h"
 
+struct io_zc_rx_args {
+	struct io_zc_rx_ifq	*ifq;
+	struct socket		*sock;
+};
+
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 
+static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq)
+{
+	struct io_rbuf_ring *ring = ifq->ring;
+
+	return ifq->cached_cq_tail - READ_ONCE(ring->cq.head);
+}
+
 static inline struct device *netdev2dev(struct net_device *dev)
 {
 	return dev->dev.parent;
@@ -399,7 +412,7 @@ static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool,
 	return buf - pool->bufs;
 }
 
-static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
+static void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
 {
 	refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount);
 }
@@ -590,5 +603,230 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
 };
 EXPORT_SYMBOL(io_uring_pp_zc_ops);
 
+static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq)
+{
+	struct io_uring_rbuf_cqe *cqe;
+	unsigned int cq_idx, queued, free, entries;
+	unsigned int mask = ifq->cq_entries - 1;
+
+	cq_idx = ifq->cached_cq_tail & mask;
+	smp_rmb();
+	queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries);
+	free = ifq->cq_entries - queued;
+	entries = min(free, ifq->cq_entries - cq_idx);
+	if (!entries)
+		return NULL;
+
+	cqe = &ifq->cqes[cq_idx];
+	ifq->cached_cq_tail++;
+	return cqe;
+}
+
+static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
+			   int off, int len, unsigned sock_idx)
+{
+	off += skb_frag_off(frag);
+
+	if (likely(page_is_page_pool_iov(frag->bv_page))) {
+		struct io_uring_rbuf_cqe *cqe;
+		struct io_zc_rx_buf *buf;
+		struct page_pool_iov *ppiov;
+
+		ppiov = page_to_page_pool_iov(frag->bv_page);
+		if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX ||
+		    ppiov->pp->mp_priv != ifq)
+			return -EFAULT;
+
+		cqe = io_zc_get_rbuf_cqe(ifq);
+		if (!cqe)
+			return -ENOBUFS;
+
+		buf = io_iov_to_buf(ppiov);
+		io_zc_rx_get_buf_uref(buf);
+
+		cqe->region = 0;
+		cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
+		cqe->len = len;
+		cqe->sock = sock_idx;
+		cqe->flags = 0;
+	} else {
+		return -EOPNOTSUPP;
+	}
+
+	return len;
+}
+
+static int
+zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+	       unsigned int offset, size_t len)
+{
+	struct io_zc_rx_args *args = desc->arg.data;
+	struct io_zc_rx_ifq *ifq = args->ifq;
+	struct socket *sock = args->sock;
+	unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK;
+	struct sk_buff *frag_iter;
+	unsigned start, start_off;
+	int i, copy, end, off;
+	int ret = 0;
+
+	start = skb_headlen(skb);
+	start_off = offset;
+
+	if (offset < start)
+		return -EOPNOTSUPP;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		const skb_frag_t *frag;
+
+		WARN_ON(start > offset + len);
+
+		frag = &skb_shinfo(skb)->frags[i];
+		end = start + skb_frag_size(frag);
+
+		if (offset < end) {
+			copy = end - offset;
+			if (copy > len)
+				copy = len;
+
+			off = offset - start;
+			ret = zc_rx_recv_frag(ifq, frag, off, copy, sock_idx);
+			if (ret < 0)
+				goto out;
+
+			offset += ret;
+			len -= ret;
+			if (len == 0 || ret != copy)
+				goto out;
+		}
+		start = end;
+	}
+
+	skb_walk_frags(skb, frag_iter) {
+		WARN_ON(start > offset + len);
+
+		end = start + frag_iter->len;
+		if (offset < end) {
+			copy = end - offset;
+			if (copy > len)
+				copy = len;
+
+			off = offset - start;
+			ret = zc_rx_recv_skb(desc, frag_iter, off, copy);
+			if (ret < 0)
+				goto out;
+
+			offset += ret;
+			len -= ret;
+			if (len == 0 || ret != copy)
+				goto out;
+		}
+		start = end;
+	}
+
+out:
+	smp_store_release(&ifq->ring->cq.tail, ifq->cached_cq_tail);
+	if (offset == start_off)
+		return ret;
+	return offset - start_off;
+}
+
+static int io_zc_rx_tcp_read(struct io_zc_rx_ifq *ifq, struct sock *sk)
+{
+	struct io_zc_rx_args args = {
+		.ifq = ifq,
+		.sock = sk->sk_socket,
+	};
+	read_descriptor_t rd_desc = {
+		.count = 1,
+		.arg.data = &args,
+	};
+
+	return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb);
+}
+
+static int io_zc_rx_tcp_recvmsg(struct io_zc_rx_ifq *ifq, struct sock *sk,
+				unsigned int recv_limit,
+				int flags, int *addr_len)
+{
+	size_t used;
+	long timeo;
+	int ret;
+
+	ret = used = 0;
+
+	lock_sock(sk);
+
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	while (recv_limit) {
+		ret = io_zc_rx_tcp_read(ifq, sk);
+		if (ret < 0)
+			break;
+		if (!ret) {
+			if (used)
+				break;
+			if (sock_flag(sk, SOCK_DONE))
+				break;
+			if (sk->sk_err) {
+				ret = sock_error(sk);
+				break;
+			}
+			if (sk->sk_shutdown & RCV_SHUTDOWN)
+				break;
+			if (sk->sk_state == TCP_CLOSE) {
+				ret = -ENOTCONN;
+				break;
+			}
+			if (!timeo) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (!skb_queue_empty(&sk->sk_receive_queue))
+				break;
+			sk_wait_data(sk, &timeo, NULL);
+			if (signal_pending(current)) {
+				ret = sock_intr_errno(timeo);
+				break;
+			}
+			continue;
+		}
+		recv_limit -= ret;
+		used += ret;
+
+		if (!timeo)
+			break;
+		release_sock(sk);
+		lock_sock(sk);
+
+		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
+		    signal_pending(current))
+			break;
+	}
+	release_sock(sk);
+	/* TODO: handle timestamping */
+	return used ? used : ret;
+}
+
+int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock,
+		  unsigned int limit, unsigned int flags)
+{
+	struct sock *sk = sock->sk;
+	const struct proto *prot;
+	int addr_len = 0;
+	int ret;
+
+	if (flags & MSG_ERRQUEUE)
+		return -EOPNOTSUPP;
+
+	prot = READ_ONCE(sk->sk_prot);
+	if (prot->recvmsg != tcp_recvmsg)
+		return -EPROTONOSUPPORT;
+
+	sock_rps_record_flow(sk);
+
+	ret = io_zc_rx_tcp_recvmsg(ifq, sk, limit, flags, &addr_len);
+
+	return ret;
+}
 
 #endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index 00d864700c67..3e8f07e4b252 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -72,4 +72,9 @@ static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
 }
 #endif
 
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock,
+		  unsigned int limit, unsigned int flags);
+
 #endif
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 16/20] net: execute custom callback from napi
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (14 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 15/20] io_uring: add io_recvzc request David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 17/20] io_uring/zcrx: add copy fallback David Wei
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

Sometimes we want to access a napi protected resource from task
context like in the case of io_uring zc falling back to copy and
accessing the buffer ring. Add a helper function that allows to execute
a custom function from napi context by first stopping it similarly to
napi_busy_loop().

Experimental, needs much polishing and sharing bits with
napi_busy_loop().

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/busy_poll.h |  7 +++++++
 net/core/dev.c          | 46 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 4dabeb6c76d3..64238467e00a 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -47,6 +47,8 @@ bool sk_busy_loop_end(void *p, unsigned long start_time);
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void napi_execute(struct napi_struct *napi,
+		  void (*cb)(void *), void *cb_arg);
 
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline unsigned long net_busy_loop_on(void)
@@ -59,6 +61,11 @@ static inline bool sk_can_busy_loop(struct sock *sk)
 	return false;
 }
 
+static inline void napi_execute(struct napi_struct *napi,
+				void (*cb)(void *), void *cb_arg)
+{
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static inline unsigned long busy_loop_current_time(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index e55750c47245..2dd4f3846535 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6537,6 +6537,52 @@ void napi_busy_loop(unsigned int napi_id,
 }
 EXPORT_SYMBOL(napi_busy_loop);
 
+void napi_execute(struct napi_struct *napi,
+		  void (*cb)(void *), void *cb_arg)
+{
+	bool done = false;
+	unsigned long val;
+	void *have_poll_lock = NULL;
+
+	rcu_read_lock();
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+	for (;;) {
+		local_bh_disable();
+		val = READ_ONCE(napi->state);
+
+		/* If multiple threads are competing for this napi,
+		* we avoid dirtying napi->state as much as we can.
+		*/
+		if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+			  NAPIF_STATE_IN_BUSY_POLL))
+			goto restart;
+
+		if (cmpxchg(&napi->state, val,
+			   val | NAPIF_STATE_IN_BUSY_POLL |
+				 NAPIF_STATE_SCHED) != val)
+			goto restart;
+
+		have_poll_lock = netpoll_poll_lock(napi);
+		cb(cb_arg);
+		done = true;
+		gro_normal_list(napi);
+		local_bh_enable();
+		break;
+restart:
+		local_bh_enable();
+		if (unlikely(need_resched()))
+			break;
+		cpu_relax();
+	}
+	if (done)
+		busy_poll_stop(napi, have_poll_lock, false, 1);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	rcu_read_unlock();
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static void napi_hash_add(struct napi_struct *napi)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 17/20] io_uring/zcrx: add copy fallback
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (15 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 16/20] net: execute custom callback from napi David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 18/20] veth: add support for io_uring zc rx David Wei
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

Currently, if user fails to keep up with the network and doesn't refill
the buffer ring fast enough the NIC/driver will start dropping packets.
That might be too punishing. Add a fallback path, which would allow
drivers to allocate normal pages when there is starvation, then
zc_rx_recv_skb() we'll detect them and copy into the user specified
buffers, when they become available.

That should help with adoption and also help the user striking the right
balance allocating just the right amount of zerocopy buffers but also
being resilient to sudden surges in traffic.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 io_uring/zc_rx.c | 126 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 120 insertions(+), 6 deletions(-)

diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index acb70ca23150..f7d99d569885 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -6,6 +6,7 @@
 #include <linux/io_uring.h>
 #include <linux/netdevice.h>
 #include <linux/nospec.h>
+#include <net/busy_poll.h>
 #include <net/tcp.h>
 #include <trace/events/page_pool.h>
 
@@ -21,6 +22,11 @@ struct io_zc_rx_args {
 	struct socket		*sock;
 };
 
+struct io_zc_refill_data {
+	struct io_zc_rx_ifq *ifq;
+	struct io_zc_rx_buf *buf;
+};
+
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 
 static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq)
@@ -603,6 +609,39 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
 };
 EXPORT_SYMBOL(io_uring_pp_zc_ops);
 
+static void io_napi_refill(void *data)
+{
+	struct io_zc_refill_data *rd = data;
+	struct io_zc_rx_ifq *ifq = rd->ifq;
+	void *page;
+
+	if (WARN_ON_ONCE(!ifq->pp))
+		return;
+
+	page = page_pool_dev_alloc_pages(ifq->pp);
+	if (!page)
+		return;
+	if (WARN_ON_ONCE(!page_is_page_pool_iov(page)))
+		return;
+
+	rd->buf = io_iov_to_buf(page_to_page_pool_iov(page));
+}
+
+static struct io_zc_rx_buf *io_zc_get_buf_task_safe(struct io_zc_rx_ifq *ifq)
+{
+	struct io_zc_refill_data rd = {
+		.ifq = ifq,
+	};
+
+	napi_execute(ifq->pp->p.napi, io_napi_refill, &rd);
+	return rd.buf;
+}
+
+static inline void io_zc_return_rbuf_cqe(struct io_zc_rx_ifq *ifq)
+{
+	ifq->cached_cq_tail--;
+}
+
 static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq)
 {
 	struct io_uring_rbuf_cqe *cqe;
@@ -622,6 +661,51 @@ static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *
 	return cqe;
 }
 
+static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data,
+				unsigned int offset, size_t len,
+				unsigned sock_idx)
+{
+	size_t copy_size, copied = 0;
+	struct io_uring_rbuf_cqe *cqe;
+	struct io_zc_rx_buf *buf;
+	int ret = 0, off = 0;
+	u8 *vaddr;
+
+	do {
+		cqe = io_zc_get_rbuf_cqe(ifq);
+		if (!cqe) {
+			ret = -ENOBUFS;
+			break;
+		}
+		buf = io_zc_get_buf_task_safe(ifq);
+		if (!buf) {
+			io_zc_return_rbuf_cqe(ifq);
+			ret = -ENOMEM;
+			break;
+		}
+
+		vaddr = kmap_local_page(buf->page);
+		copy_size = min_t(size_t, PAGE_SIZE, len);
+		memcpy(vaddr, data + offset, copy_size);
+		kunmap_local(vaddr);
+
+		cqe->region = 0;
+		cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
+		cqe->len = copy_size;
+		cqe->flags = 0;
+		cqe->sock = sock_idx;
+
+		io_zc_rx_get_buf_uref(buf);
+		page_pool_iov_put_many(&buf->ppiov, 1);
+
+		offset += copy_size;
+		len -= copy_size;
+		copied += copy_size;
+	} while (offset < len);
+
+	return copied ? copied : ret;
+}
+
 static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
 			   int off, int len, unsigned sock_idx)
 {
@@ -650,7 +734,22 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
 		cqe->sock = sock_idx;
 		cqe->flags = 0;
 	} else {
-		return -EOPNOTSUPP;
+		struct page *page = skb_frag_page(frag);
+		u32 p_off, p_len, t, copied = 0;
+		u8 *vaddr;
+		int ret = 0;
+
+		skb_frag_foreach_page(frag, off, len,
+				      page, p_off, p_len, t) {
+			vaddr = kmap_local_page(page);
+			ret = zc_rx_copy_chunk(ifq, vaddr, p_off, p_len, sock_idx);
+			kunmap_local(vaddr);
+
+			if (ret < 0)
+				return copied ? copied : ret;
+			copied += ret;
+		}
+		len = copied;
 	}
 
 	return len;
@@ -665,15 +764,30 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 	struct socket *sock = args->sock;
 	unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK;
 	struct sk_buff *frag_iter;
-	unsigned start, start_off;
+	unsigned start, start_off = offset;
 	int i, copy, end, off;
 	int ret = 0;
 
-	start = skb_headlen(skb);
-	start_off = offset;
+	if (unlikely(offset < skb_headlen(skb))) {
+		ssize_t copied;
+		size_t to_copy;
 
-	if (offset < start)
-		return -EOPNOTSUPP;
+		to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+		copied = zc_rx_copy_chunk(ifq, skb->data, offset, to_copy,
+					  sock_idx);
+		if (copied < 0) {
+			ret = copied;
+			goto out;
+		}
+		offset += copied;
+		len -= copied;
+		if (!len)
+			goto out;
+		if (offset != skb_headlen(skb))
+			goto out;
+	}
+
+	start = skb_headlen(skb);
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		const skb_frag_t *frag;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 18/20] veth: add support for io_uring zc rx
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (16 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 17/20] io_uring/zcrx: add copy fallback David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get David Wei
  2023-12-19 21:03 ` [RFC PATCH v3 20/20] bnxt: enable io_uring zc page pool David Wei
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

NOT FOR UPSTREAM, TESTING ONLY.

Add io_uring zerocopy support for veth. It's not actually zerocopy, we
copy data in napi, which is early enough in the stack to be useful for
testing.

Note, we'll need some virtual dev support for testing, but that should
not be in the way of real workloads.

Signed-off-by: David Wei <[email protected]>
---
 drivers/net/veth.c | 211 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 205 insertions(+), 6 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 57efb3454c57..dd00e172979f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -26,6 +26,7 @@
 #include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
+#include <linux/io_uring/net.h>
 #include <net/page_pool/helpers.h>
 
 #define DRV_NAME	"veth"
@@ -75,6 +76,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct veth_rq		*rq;
 	unsigned int		requested_headroom;
+	bool			zc_installed;
 };
 
 struct veth_xdp_tx_bq {
@@ -335,9 +337,12 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
 					 const struct net_device *rcv,
 					 const struct sk_buff *skb)
 {
+	struct veth_priv *rcv_priv = netdev_priv(rcv);
+
 	return !(dev->features & NETIF_F_ALL_TSO) ||
 		(skb->destructor == sock_wfree &&
-		 rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD));
+		 rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)) ||
+		rcv_priv->zc_installed;
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -726,6 +731,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 	struct sk_buff *skb = *pskb;
 	u32 frame_sz;
 
+	if (WARN_ON_ONCE(1))
+		return -EFAULT;
+
 	if (skb_shared(skb) || skb_head_is_locked(skb) ||
 	    skb_shinfo(skb)->nr_frags ||
 	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
@@ -827,6 +835,90 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 	return -ENOMEM;
 }
 
+static noinline struct sk_buff *veth_iou_rcv_skb(struct veth_rq *rq,
+					struct sk_buff *skb)
+{
+	struct sk_buff *nskb;
+	u32 size, len, off, max_head_size;
+	struct page *page;
+	int ret, i, head_off;
+	void *vaddr;
+
+	/* Testing only, randomly send normal pages to test copy fallback */
+	if (ktime_get_ns() % 16 == 0)
+		return skb;
+
+	skb_prepare_for_gro(skb);
+	max_head_size = skb_headlen(skb);
+
+	rcu_read_lock();
+	nskb = napi_alloc_skb(&rq->xdp_napi, max_head_size);
+	if (!nskb)
+		goto drop;
+
+	skb_copy_header(nskb, skb);
+	skb_mark_for_recycle(nskb);
+
+	size = max_head_size;
+	if (skb_copy_bits(skb, 0, nskb->data, size)) {
+	consume_skb(nskb);
+		goto drop;
+	}
+	skb_put(nskb, size);
+	head_off = skb_headroom(nskb) - skb_headroom(skb);
+	skb_headers_offset_update(nskb, head_off);
+
+	/* Allocate paged area of new skb */
+	off = size;
+	len = skb->len - off;
+
+	for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
+		struct io_zc_rx_buf *buf;
+		void *ppage;
+
+		ppage = page_pool_dev_alloc_pages(rq->page_pool);
+		if (!ppage) {
+			consume_skb(nskb);
+			goto drop;
+		}
+		if (WARN_ON_ONCE(!page_is_page_pool_iov(ppage))) {
+			consume_skb(nskb);
+			goto drop;
+		}
+
+		buf = container_of(page_to_page_pool_iov(ppage),
+				   struct io_zc_rx_buf, ppiov);
+		page = buf->page;
+
+		if (WARN_ON_ONCE(buf->ppiov.pp != rq->page_pool))
+			goto drop;
+
+		size = min_t(u32, len, PAGE_SIZE);
+		skb_add_rx_frag(nskb, i, ppage, 0, size, PAGE_SIZE);
+
+		vaddr = kmap_atomic(page);
+		ret = skb_copy_bits(skb, off, vaddr, size);
+		kunmap_atomic(vaddr);
+
+		if (ret) {
+			consume_skb(nskb);
+			goto drop;
+		}
+		len -= size;
+		off += size;
+	}
+	rcu_read_unlock();
+
+	consume_skb(skb);
+	skb = nskb;
+	return skb;
+drop:
+	rcu_read_unlock();
+	kfree_skb(skb);
+	return NULL;
+}
+
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 					struct sk_buff *skb,
 					struct veth_xdp_tx_bq *bq,
@@ -970,8 +1062,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			/* ndo_start_xmit */
 			struct sk_buff *skb = ptr;
 
-			stats->xdp_bytes += skb->len;
-			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
+			if (rq->page_pool->p.memory_provider == PP_MP_IOU_ZCRX) {
+				skb = veth_iou_rcv_skb(rq, skb);
+			} else {
+				stats->xdp_bytes += skb->len;
+				skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
+			}
+
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
 					netif_receive_skb(skb);
@@ -1030,15 +1127,21 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	return done;
 }
 
-static int veth_create_page_pool(struct veth_rq *rq)
+static int veth_create_page_pool(struct veth_rq *rq, struct io_zc_rx_ifq *ifq)
 {
 	struct page_pool_params pp_params = {
 		.order = 0,
 		.pool_size = VETH_RING_SIZE,
 		.nid = NUMA_NO_NODE,
 		.dev = &rq->dev->dev,
+		.napi = &rq->xdp_napi,
 	};
 
+	if (ifq) {
+		pp_params.mp_priv = ifq;
+		pp_params.memory_provider = PP_MP_IOU_ZCRX;
+	}
+
 	rq->page_pool = page_pool_create(&pp_params);
 	if (IS_ERR(rq->page_pool)) {
 		int err = PTR_ERR(rq->page_pool);
@@ -1056,7 +1159,7 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
 	int err, i;
 
 	for (i = start; i < end; i++) {
-		err = veth_create_page_pool(&priv->rq[i]);
+		err = veth_create_page_pool(&priv->rq[i], NULL);
 		if (err)
 			goto err_page_pool;
 	}
@@ -1112,9 +1215,17 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 
 	for (i = start; i < end; i++) {
 		struct veth_rq *rq = &priv->rq[i];
+		void *ptr;
+		int nr = 0;
 
 		rq->rx_notify_masked = false;
-		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+
+		while ((ptr = ptr_ring_consume(&rq->xdp_ring))) {
+			veth_ptr_free(ptr);
+			nr++;
+		}
+
+		ptr_ring_cleanup(&rq->xdp_ring, NULL);
 	}
 
 	for (i = start; i < end; i++) {
@@ -1350,6 +1461,9 @@ static int veth_set_channels(struct net_device *dev,
 	struct net_device *peer;
 	int err;
 
+	if (priv->zc_installed)
+		return -EINVAL;
+
 	/* sanity check. Upper bounds are already enforced by the caller */
 	if (!ch->rx_count || !ch->tx_count)
 		return -EINVAL;
@@ -1427,6 +1541,8 @@ static int veth_open(struct net_device *dev)
 	struct net_device *peer = rtnl_dereference(priv->peer);
 	int err;
 
+	priv->zc_installed = false;
+
 	if (!peer)
 		return -ENOTCONN;
 
@@ -1604,6 +1720,84 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int __veth_iou_set(struct net_device *dev,
+			  struct netdev_bpf *xdp)
+{
+	bool napi_already_on = veth_gro_requested(dev) && (dev->flags & IFF_UP);
+	unsigned qid = xdp->zc_rx.queue_id;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+	struct veth_rq *rq;
+	int ret;
+
+	if (priv->_xdp_prog)
+		return -EINVAL;
+	if (qid >= dev->real_num_rx_queues)
+		return -EINVAL;
+	if (!(dev->flags & IFF_UP))
+		return -EOPNOTSUPP;
+	if (dev->real_num_rx_queues != 1)
+		return -EINVAL;
+	rq = &priv->rq[qid];
+
+	if (!xdp->zc_rx.ifq) {
+		if (!priv->zc_installed)
+			return -EINVAL;
+
+		veth_napi_del(dev);
+		priv->zc_installed = false;
+		if (!veth_gro_requested(dev) && netif_running(dev)) {
+			dev->features &= ~NETIF_F_GRO;
+			netdev_features_change(dev);
+		}
+		return 0;
+	}
+
+	if (priv->zc_installed)
+		return -EINVAL;
+
+	peer = rtnl_dereference(priv->peer);
+	peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+
+	ret = veth_create_page_pool(rq, xdp->zc_rx.ifq);
+	if (ret)
+		return ret;
+
+	ret = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (ret) {
+		page_pool_destroy(rq->page_pool);
+		rq->page_pool = NULL;
+		return ret;
+	}
+
+	priv->zc_installed = true;
+
+	if (!veth_gro_requested(dev)) {
+		/* user-space did not require GRO, but adding XDP
+		 * is supposed to get GRO working
+		 */
+		dev->features |= NETIF_F_GRO;
+		netdev_features_change(dev);
+	}
+	if (!napi_already_on) {
+		netif_napi_add(dev, &rq->xdp_napi, veth_poll);
+		napi_enable(&rq->xdp_napi);
+		rcu_assign_pointer(rq->napi, &rq->xdp_napi);
+	}
+	return 0;
+}
+
+static int veth_iou_set(struct net_device *dev,
+			struct netdev_bpf *xdp)
+{
+	int ret;
+
+	rtnl_lock();
+	ret = __veth_iou_set(dev, xdp);
+	rtnl_unlock();
+	return ret;
+}
+
 static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			struct netlink_ext_ack *extack)
 {
@@ -1613,6 +1807,9 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	unsigned int max_mtu;
 	int err;
 
+	if (priv->zc_installed)
+		return -EINVAL;
+
 	old_prog = priv->_xdp_prog;
 	priv->_xdp_prog = prog;
 	peer = rtnl_dereference(priv->peer);
@@ -1691,6 +1888,8 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
 		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_SETUP_ZC_RX:
+		return veth_iou_set(dev, xdp);
 	default:
 		return -EINVAL;
 	}
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (17 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 18/20] veth: add support for io_uring zc rx David Wei
@ 2023-12-19 21:03 ` David Wei
  2023-12-21 19:51   ` Mina Almasry
  2023-12-19 21:03 ` [RFC PATCH v3 20/20] bnxt: enable io_uring zc page pool David Wei
  19 siblings, 1 reply; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

io_uring pp memory provider doesn't have contiguous dma addresses,
implement page_pool_iov_dma_addr() via callbacks.

Note: it might be better to stash dma address into struct page_pool_iov.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/helpers.h | 5 +----
 include/net/page_pool/types.h   | 2 ++
 io_uring/zc_rx.c                | 8 ++++++++
 net/core/page_pool.c            | 9 +++++++++
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index aca3a52d0e22..10dba1f2aa0c 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -105,10 +105,7 @@ static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov)
 static inline dma_addr_t
 page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
 {
-	struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
-
-	return owner->base_dma_addr +
-	       ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
+	return ppiov->pp->mp_ops->ppiov_dma_addr(ppiov);
 }
 
 static inline unsigned long
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index f54ee759e362..1b9266835ab6 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -125,6 +125,7 @@ struct page_pool_stats {
 #endif
 
 struct mem_provider;
+struct page_pool_iov;
 
 enum pp_memory_provider_type {
 	__PP_MP_NONE, /* Use system allocator directly */
@@ -138,6 +139,7 @@ struct pp_memory_provider_ops {
 	void (*scrub)(struct page_pool *pool);
 	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
 	bool (*release_page)(struct page_pool *pool, struct page *page);
+	dma_addr_t (*ppiov_dma_addr)(const struct page_pool_iov *ppiov);
 };
 
 extern const struct pp_memory_provider_ops dmabuf_devmem_ops;
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index f7d99d569885..20fb89e6bad7 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -600,12 +600,20 @@ static void io_pp_zc_destroy(struct page_pool *pp)
 	percpu_ref_put(&ifq->ctx->refs);
 }
 
+static dma_addr_t io_pp_zc_ppiov_dma_addr(const struct page_pool_iov *ppiov)
+{
+	struct io_zc_rx_buf *buf = io_iov_to_buf((struct page_pool_iov *)ppiov);
+
+	return buf->dma;
+}
+
 const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
 	.alloc_pages		= io_pp_zc_alloc_pages,
 	.release_page		= io_pp_zc_release_page,
 	.init			= io_pp_zc_init,
 	.destroy		= io_pp_zc_destroy,
 	.scrub			= io_pp_zc_scrub,
+	.ppiov_dma_addr		= io_pp_zc_ppiov_dma_addr,
 };
 EXPORT_SYMBOL(io_uring_pp_zc_ops);
 
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ebf5ff009d9d..6586631ecc2e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1105,10 +1105,19 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
 	return true;
 }
 
+static dma_addr_t mp_dmabuf_devmem_ppiov_dma_addr(const struct page_pool_iov *ppiov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
+
+	return owner->base_dma_addr +
+	       ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
+}
+
 const struct pp_memory_provider_ops dmabuf_devmem_ops = {
 	.init			= mp_dmabuf_devmem_init,
 	.destroy		= mp_dmabuf_devmem_destroy,
 	.alloc_pages		= mp_dmabuf_devmem_alloc_pages,
 	.release_page		= mp_dmabuf_devmem_release_page,
+	.ppiov_dma_addr		= mp_dmabuf_devmem_ppiov_dma_addr,
 };
 EXPORT_SYMBOL(dmabuf_devmem_ops);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH v3 20/20] bnxt: enable io_uring zc page pool
  2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
                   ` (18 preceding siblings ...)
  2023-12-19 21:03 ` [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get David Wei
@ 2023-12-19 21:03 ` David Wei
  19 siblings, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-19 21:03 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry

From: Pavel Begunkov <[email protected]>

TESTING ONLY

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 71 +++++++++++++++++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  7 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  3 +
 3 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 039f8d995a26..d9fb8633f226 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -55,6 +55,7 @@
 #include <net/page_pool/helpers.h>
 #include <linux/align.h>
 #include <net/netdev_queues.h>
+#include <linux/io_uring/net.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -875,6 +876,25 @@ static inline u8 *__bnxt_alloc_rx_frag(struct bnxt *bp, dma_addr_t *mapping,
 	return data;
 }
 
+static inline struct page *bnxt_get_real_page(struct page *page)
+{
+	struct io_zc_rx_buf *buf;
+
+	if (page_is_page_pool_iov(page)) {
+		buf = container_of(page_to_page_pool_iov(page),
+				struct io_zc_rx_buf, ppiov);
+		page = buf->page;
+	}
+	return page;
+}
+
+static inline void *bnxt_get_page_address(struct page *frag)
+{
+	struct page *page = bnxt_get_real_page(frag);
+
+	return page_address(page);
+}
+
 int bnxt_alloc_rx_data(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
 		       u16 prod, gfp_t gfp)
 {
@@ -892,7 +912,7 @@ int bnxt_alloc_rx_data(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
 
 		mapping += bp->rx_dma_offset;
 		rx_buf->data = page;
-		rx_buf->data_ptr = page_address(page) + offset + bp->rx_offset;
+		rx_buf->data_ptr = bnxt_get_page_address(page) + offset + bp->rx_offset;
 	} else {
 		u8 *data = __bnxt_alloc_rx_frag(bp, &mapping, gfp);
 
@@ -954,8 +974,9 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
 
 	if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE)
 		page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp);
-	else
+	else {
 		page = __bnxt_alloc_rx_64k_page(bp, &mapping, rxr, gfp, &offset);
+	}
 
 	if (!page)
 		return -ENOMEM;
@@ -1079,6 +1100,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	skb_mark_for_recycle(skb);
+
 	skb_reserve(skb, bp->rx_offset);
 	__skb_put(skb, len);
 
@@ -1118,7 +1140,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 	}
 
 	skb_mark_for_recycle(skb);
-	off = (void *)data_ptr - page_address(page);
+	off = (void *)data_ptr - bnxt_get_page_address(page);
 	skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE);
 	memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
 	       payload + NET_IP_ALIGN);
@@ -2032,7 +2054,6 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 				goto next_rx;
 			}
 		} else {
-			skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr->page_pool, &xdp, rxcmp1);
 			if (!skb) {
 				/* we should be able to free the old skb here */
 				bnxt_xdp_buff_frags_free(rxr, &xdp);
@@ -3402,7 +3423,8 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
 }
 
 static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
-				   struct bnxt_rx_ring_info *rxr)
+				   struct bnxt_rx_ring_info *rxr,
+				   int qid)
 {
 	struct page_pool_params pp = { 0 };
 
@@ -3416,6 +3438,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	pp.max_len = PAGE_SIZE;
 	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 
+	if (bp->iou_ifq && qid == bp->iou_qid) {
+		pp.mp_priv = bp->iou_ifq;
+		pp.memory_provider = PP_MP_IOU_ZCRX;
+		pp.max_len = PAGE_SIZE;
+		pp.flags = 0;
+	}
+
 	rxr->page_pool = page_pool_create(&pp);
 	if (IS_ERR(rxr->page_pool)) {
 		int err = PTR_ERR(rxr->page_pool);
@@ -3442,7 +3471,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
 
 		ring = &rxr->rx_ring_struct;
 
-		rc = bnxt_alloc_rx_page_pool(bp, rxr);
+		rc = bnxt_alloc_rx_page_pool(bp, rxr, i);
 		if (rc)
 			return rc;
 
@@ -14347,6 +14376,36 @@ void bnxt_print_device_info(struct bnxt *bp)
 	pcie_print_link_status(bp->pdev);
 }
 
+int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp)
+{
+	unsigned ifq_idx = xdp->zc_rx.queue_id;
+
+	if (ifq_idx >= bp->rx_nr_rings)
+		return -EINVAL;
+	if (PAGE_SIZE != BNXT_RX_PAGE_SIZE)
+		return -EINVAL;
+
+	bnxt_rtnl_lock_sp(bp);
+	if (!!bp->iou_ifq == !!xdp->zc_rx.ifq) {
+		bnxt_rtnl_unlock_sp(bp);
+		return -EINVAL;
+	}
+	if (netif_running(bp->dev)) {
+		int rc;
+
+		bnxt_ulp_stop(bp);
+		bnxt_close_nic(bp, true, false);
+
+		bp->iou_qid = ifq_idx;
+		bp->iou_ifq = xdp->zc_rx.ifq;
+
+		rc = bnxt_open_nic(bp, true, false);
+		bnxt_ulp_start(bp, rc);
+	}
+	bnxt_rtnl_unlock_sp(bp);
+	return 0;
+}
+
 static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	struct net_device *dev;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index e31164e3b8fb..1003f9260805 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2342,6 +2342,10 @@ struct bnxt {
 #endif
 	u32			thermal_threshold_type;
 	enum board_idx		board_idx;
+
+	/* io_uring zerocopy */
+	void			*iou_ifq;
+	unsigned		iou_qid;
 };
 
 #define BNXT_NUM_RX_RING_STATS			8
@@ -2556,4 +2560,7 @@ int bnxt_get_port_parent_id(struct net_device *dev,
 void bnxt_dim_work(struct work_struct *work);
 int bnxt_hwrm_set_ring_coal(struct bnxt *bp, struct bnxt_napi *bnapi);
 void bnxt_print_device_info(struct bnxt *bp);
+
+int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp);
+
 #endif
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 4791f6a14e55..a3ae02c31ffc 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -466,6 +466,9 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_SETUP_PROG:
 		rc = bnxt_xdp_set(bp, xdp->prog);
 		break;
+	case XDP_SETUP_ZC_RX:
+		return bnxt_zc_rx(bp, xdp);
+		break;
 	default:
 		rc = -EINVAL;
 		break;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper
  2023-12-19 21:03 ` [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper David Wei
@ 2023-12-19 23:22   ` Mina Almasry
  2023-12-19 23:59     ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2023-12-19 23:22 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> NOT FOR UPSTREAM
>
> The final version will depend on how ppiov looks like, but add a
> convenience helper for now.
>

Thanks, this patch becomes unnecessary once you pull in the latest
version of our changes; you could use net_iov_to_netmem() added here:

https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

Not any kind of objection from me, just an FYI.

> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/net/page_pool/helpers.h | 5 +++++
>  net/core/page_pool.c            | 2 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 95f4d579cbc4..92804c499833 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -86,6 +86,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
>
>  /* page_pool_iov support */
>
> +static inline struct page *page_pool_mangle_ppiov(struct page_pool_iov *ppiov)
> +{
> +       return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
> +}
> +
>  static inline struct dmabuf_genpool_chunk_owner *
>  page_pool_iov_owner(const struct page_pool_iov *ppiov)
>  {
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index c0bc62ee77c6..38eff947f679 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -1074,7 +1074,7 @@ static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
>         pool->pages_state_hold_cnt++;
>         trace_page_pool_state_hold(pool, (struct page *)ppiov,
>                                    pool->pages_state_hold_cnt);
> -       return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
> +       return page_pool_mangle_ppiov(ppiov);
>  }
>
>  static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
> --
> 2.39.3
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov
  2023-12-19 21:03 ` [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov David Wei
@ 2023-12-19 23:24   ` Mina Almasry
  2023-12-20  1:29     ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2023-12-19 23:24 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> NOT FOR UPSTREAM
>
> There will be more users of struct page_pool_iov, and ppiovs from one
> subsystem must not be used by another. That should never happen for any
> sane application, but we need to enforce it in case of bufs and/or
> malicious users.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  net/ipv4/tcp.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33a8bb63fbf5..9c6b18eebb5b 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2384,6 +2384,13 @@ static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
>                         }
>
>                         ppiov = skb_frag_page_pool_iov(frag);
> +
> +                       /* Disallow non devmem owned buffers */
> +                       if (ppiov->pp->p.memory_provider != PP_MP_DMABUF_DEVMEM) {
> +                               err = -ENODEV;
> +                               goto out;
> +                       }
> +

Instead of this, I maybe recommend modifying the skb->dmabuf flag? My
mental model is that flag means all the frags in the skb are
specifically dmabuf, not general ppiovs or net_iovs. Is it possible to
add skb->io_uring or something?

If that bloats the skb headers, then maybe we need another place to
put this flag. Maybe the [page_pool|net]_iov should declare whether
it's dmabuf or otherwise, and we can check frag[0] and assume all
frags are the same as frag0.

But IMO the page pool internals should not leak into the
implementation of generic tcp stack functions.

>                         end = start + skb_frag_size(frag);
>                         copy = end - offset;
>
> --
> 2.39.3
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle
  2023-12-19 21:03 ` [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle David Wei
@ 2023-12-19 23:35   ` Mina Almasry
  2023-12-20  0:49     ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2023-12-19 23:35 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> NOT FOR UPSTREAM
> The final version will depend on how the ppiov infra looks like
>
> Page pool is tracking how many pages were allocated and returned, which
> serves for refcounting the pool, and so every page/frag allocated should
> eventually come back to the page pool via appropriate ways, e.g. by
> calling page_pool_put_page().
>
> When it comes to normal page pools (i.e. without memory providers
> attached), it's fine to return a page when it's still refcounted by
> somewhat in the stack, in which case we'll "detach" the page from the
> pool and rely on page refcount for it to return back to the kernel.
>
> Memory providers are different, at least ppiov based ones, they need
> all their buffers to eventually return back, so apart from custom pp
> ->release handlers, we'll catch when someone puts down a ppiov and call
> its memory provider to handle it, i.e. __page_pool_iov_free().
>
> The first problem is that __page_pool_iov_free() hard coded devmem
> handling, and other providers need a flexible way to specify their own
> callbacks.
>
> The second problem is that it doesn't go through the generic page pool
> paths and so can't do the mentioned pp accounting right. And we can't
> even safely rely on page_pool_put_page() to be called somewhere before
> to do the pp refcounting, because then the page pool might get destroyed
> and ppiov->pp would point to garbage.
>
> The solution is to make the pp ->release callback to be responsible for
> properly recycling its buffers, e.g. calling what was
> __page_pool_iov_free() before in case of devmem.
> page_pool_iov_put_many() will be returning buffers to the page pool.
>

Hmm this patch is working on top of slightly outdated code. I think
the correct solution here is to transition to using pp_ref_count for
refcounting the ppiovs/niovs. Once we do that, we no longer need
special refcounting for ppiovs, they're refcounted identically to
pages, makes the pp more maintainable, gives us some unified handling
of page pool refcounting, it becomes trivial to support fragmented
pages which require a pp_ref_count, and all the code in this patch can
go away.

I'm unsure if this patch is just because you haven't rebased to my
latest RFC (which is completely fine by me), or if you actually think
using pp_ref_count for refcounting is wrong and want us to go back to
the older model which required some custom handling for ppiov and
disabled frag support. I'm guessing it's the former, but please
correct if I'm wrong.

[1] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/net/page_pool/helpers.h | 15 ++++++++---
>  net/core/page_pool.c            | 46 +++++++++++++++++----------------
>  2 files changed, 35 insertions(+), 26 deletions(-)
>
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 92804c499833..ef380ee8f205 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -137,15 +137,22 @@ static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
>         refcount_add(count, &ppiov->refcount);
>  }
>
> -void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +static inline bool page_pool_iov_sub_and_test(struct page_pool_iov *ppiov,
> +                                             unsigned int count)
> +{
> +       return refcount_sub_and_test(count, &ppiov->refcount);
> +}
>
>  static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
>                                           unsigned int count)
>  {
> -       if (!refcount_sub_and_test(count, &ppiov->refcount))
> -               return;
> +       if (count > 1)
> +               WARN_ON_ONCE(page_pool_iov_sub_and_test(ppiov, count - 1));
>
> -       __page_pool_iov_free(ppiov);
> +#ifdef CONFIG_PAGE_POOL
> +       page_pool_put_defragged_page(ppiov->pp, page_pool_mangle_ppiov(ppiov),
> +                                    -1, false);
> +#endif
>  }
>
>  /* page pool mm helpers */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 38eff947f679..ecf90a1ccabe 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -599,6 +599,16 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
>         page_pool_set_dma_addr(page, 0);
>  }
>
> +static void page_pool_return_provider(struct page_pool *pool, struct page *page)
> +{
> +       int count;
> +
> +       if (pool->mp_ops->release_page(pool, page)) {
> +               count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
> +               trace_page_pool_state_release(pool, page, count);
> +       }
> +}
> +
>  /* Disconnects a page (from a page_pool).  API users can have a need
>   * to disconnect a page (from a page_pool), to allow it to be used as
>   * a regular page (that will eventually be returned to the normal
> @@ -607,13 +617,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
>  void page_pool_return_page(struct page_pool *pool, struct page *page)
>  {
>         int count;
> -       bool put;
>
> -       put = true;
> -       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> -               put = pool->mp_ops->release_page(pool, page);
> -       else
> -               __page_pool_release_page_dma(pool, page);
> +       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) {
> +               page_pool_return_provider(pool, page);
> +               return;
> +       }
> +
> +       __page_pool_release_page_dma(pool, page);
>
>         /* This may be the last page returned, releasing the pool, so
>          * it is not safe to reference pool afterwards.
> @@ -621,10 +631,8 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
>         count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>         trace_page_pool_state_release(pool, page, count);
>
> -       if (put) {
> -               page_pool_clear_pp_info(page);
> -               put_page(page);
> -       }
> +       page_pool_clear_pp_info(page);
> +       put_page(page);
>         /* An optimization would be to call __free_pages(page, pool->p.order)
>          * knowing page is not part of page-cache (thus avoiding a
>          * __page_cache_release() call).
> @@ -1034,15 +1042,6 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>  }
>  EXPORT_SYMBOL(page_pool_update_nid);
>
> -void __page_pool_iov_free(struct page_pool_iov *ppiov)
> -{
> -       if (ppiov->pp->mp_ops != &dmabuf_devmem_ops)
> -               return;
> -
> -       netdev_free_devmem(ppiov);
> -}
> -EXPORT_SYMBOL_GPL(__page_pool_iov_free);
> -
>  /*** "Dmabuf devmem memory provider" ***/
>
>  static int mp_dmabuf_devmem_init(struct page_pool *pool)
> @@ -1093,9 +1092,12 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
>                 return false;
>
>         ppiov = page_to_page_pool_iov(page);
> -       page_pool_iov_put_many(ppiov, 1);
> -       /* We don't want the page pool put_page()ing our page_pool_iovs. */
> -       return false;
> +
> +       if (!page_pool_iov_sub_and_test(ppiov, 1))
> +               return false;
> +       netdev_free_devmem(ppiov);
> +       /* tell page_pool that the ppiov is released */
> +       return true;
>  }
>
>  const struct pp_memory_provider_ops dmabuf_devmem_ops = {
> --
> 2.39.3
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider
  2023-12-19 21:03 ` [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider David Wei
@ 2023-12-19 23:39   ` Mina Almasry
  2023-12-20  0:04     ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2023-12-19 23:39 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Allow creating a special io_uring pp memory providers, which will be for
> implementing io_uring zerocopy receive.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>

For your non-RFC versions, I think maybe you want to do a patch by
patch make W=1. I suspect this patch would build fail, because the
next patch adds the io_uring_pp_zc_ops. You're likely skipping this
step because this is an RFC, which is fine.

> ---
>  include/net/page_pool/types.h | 1 +
>  net/core/page_pool.c          | 6 ++++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index fd846cac9fb6..f54ee759e362 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -129,6 +129,7 @@ struct mem_provider;
>  enum pp_memory_provider_type {
>         __PP_MP_NONE, /* Use system allocator directly */
>         PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */
> +       PP_MP_IOU_ZCRX, /* io_uring zerocopy receive provider */
>  };
>
>  struct pp_memory_provider_ops {
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 9e3073d61a97..ebf5ff009d9d 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -21,6 +21,7 @@
>  #include <linux/ethtool.h>
>  #include <linux/netdevice.h>
>  #include <linux/genalloc.h>
> +#include <linux/io_uring/net.h>
>
>  #include <trace/events/page_pool.h>
>
> @@ -242,6 +243,11 @@ static int page_pool_init(struct page_pool *pool,
>         case PP_MP_DMABUF_DEVMEM:
>                 pool->mp_ops = &dmabuf_devmem_ops;
>                 break;
> +#if defined(CONFIG_IO_URING)
> +       case PP_MP_IOU_ZCRX:
> +               pool->mp_ops = &io_uring_pp_zc_ops;
> +               break;
> +#endif
>         default:
>                 err = -EINVAL;
>                 goto free_ptr_ring;
> --
> 2.39.3
>


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx
  2023-12-19 21:03 ` [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx David Wei
@ 2023-12-19 23:44   ` Mina Almasry
  2023-12-20  0:39     ` Pavel Begunkov
  2023-12-21 19:36   ` Pavel Begunkov
  1 sibling, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2023-12-19 23:44 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> We're adding a new pp memory provider to implement io_uring zerocopy
> receive. It'll be "registered" in pp and used in later paches.
>
> The typical life cycle of a buffer goes as follows: first it's allocated
> to a driver with the initial refcount set to 1. The drivers fills it
> with data, puts it into an skb and passes down the stack, where it gets
> queued up to a socket. Later, a zc io_uring request will be receiving
> data from the socket from a task context. At that point io_uring will
> tell the userspace that this buffer has some data by posting an
> appropriate completion. It'll also elevating the refcount by
> IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading

After you rebase to the latest RFC, you will want to elevante the
[pp|n]iov->pp_ref_count, rather than the non-existent ppiov->refcount.
I do the same thing for devmem TCP.

> the data. When the userspace is done with the buffer it should return it
> back to io_uring by adding an entry to the buffer refill ring. When
> necessary io_uring will poll the refill ring, compare references
> including IO_ZC_RX_UREF and reuse the buffer.
>
> Initally, all buffers are placed in a spinlock protected ->freelist.
> It's a slow path stash, where buffers are considered to be unallocated
> and not exposed to core page pool. On allocation, pp will first try
> all its caches, and the ->alloc_pages callback if everything else
> failed.
>
> The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill
> ring. The consumption from the ring is always done in the attached napi
> context, so no additional synchronisation required. If that fails we'll
> be getting buffers from the ->freelist.
>
> Note: only ->freelist are considered unallocated for page pool, so we
> only add pages_state_hold_cnt when allocating from there. Subsequently,
> as page_pool_return_page() and others bump the ->pages_state_release_cnt
> counter, io_pp_zc_release_page() can only use ->freelist, which is not a
> problem as it's not a slow path.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/linux/io_uring/net.h |   5 +
>  io_uring/zc_rx.c             | 204 +++++++++++++++++++++++++++++++++++
>  io_uring/zc_rx.h             |   6 ++
>  3 files changed, 215 insertions(+)
>
> diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
> index d994d26116d0..13244ae5fc4a 100644
> --- a/include/linux/io_uring/net.h
> +++ b/include/linux/io_uring/net.h
> @@ -13,6 +13,11 @@ struct io_zc_rx_buf {
>  };
>
>  #if defined(CONFIG_IO_URING)
> +
> +#if defined(CONFIG_PAGE_POOL)
> +extern const struct pp_memory_provider_ops io_uring_pp_zc_ops;
> +#endif
> +
>  int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
>
>  #else
> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
> index 1e656b481725..ff1dac24ac40 100644
> --- a/io_uring/zc_rx.c
> +++ b/io_uring/zc_rx.c
> @@ -6,6 +6,7 @@
>  #include <linux/io_uring.h>
>  #include <linux/netdevice.h>
>  #include <linux/nospec.h>
> +#include <trace/events/page_pool.h>
>
>  #include <uapi/linux/io_uring.h>
>
> @@ -387,4 +388,207 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
>         return 0;
>  }
>
> +static inline struct io_zc_rx_buf *io_iov_to_buf(struct page_pool_iov *iov)
> +{
> +       return container_of(iov, struct io_zc_rx_buf, ppiov);
> +}
> +
> +static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool,
> +                                  struct io_zc_rx_buf *buf)
> +{
> +       return buf - pool->bufs;
> +}
> +
> +static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
> +{
> +       refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount);
> +}
> +
> +static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf)
> +{
> +       if (page_pool_iov_refcount(&buf->ppiov) < IO_ZC_RX_UREF)
> +               return false;
> +
> +       return page_pool_iov_sub_and_test(&buf->ppiov, IO_ZC_RX_UREF);
> +}
> +
> +static inline struct page *io_zc_buf_to_pp_page(struct io_zc_rx_buf *buf)
> +{
> +       return page_pool_mangle_ppiov(&buf->ppiov);
> +}
> +
> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
> +                                     struct io_zc_rx_buf *buf)
> +{
> +       refcount_set(&buf->ppiov.refcount, 1);
> +       pp->alloc.cache[pp->alloc.count++] = io_zc_buf_to_pp_page(buf);
> +}
> +
> +static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq)
> +{
> +       struct io_rbuf_ring *ring = ifq->ring;
> +       u32 entries;
> +
> +       entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head;
> +       return min(entries, ifq->rq_entries);
> +}
> +
> +static void io_zc_rx_ring_refill(struct page_pool *pp,
> +                                struct io_zc_rx_ifq *ifq)
> +{
> +       unsigned int entries = io_zc_rx_rqring_entries(ifq);
> +       unsigned int mask = ifq->rq_entries - 1;
> +       struct io_zc_rx_pool *pool = ifq->pool;
> +
> +       if (unlikely(!entries))
> +               return;
> +
> +       while (entries--) {
> +               unsigned int rq_idx = ifq->cached_rq_head++ & mask;
> +               struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx];
> +               u32 pgid = rqe->off / PAGE_SIZE;
> +               struct io_zc_rx_buf *buf = &pool->bufs[pgid];
> +
> +               if (!io_zc_rx_put_buf_uref(buf))
> +                       continue;
> +               io_zc_add_pp_cache(pp, buf);
> +               if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL)
> +                       break;
> +       }
> +       smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head);
> +}
> +
> +static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq)
> +{
> +       struct io_zc_rx_pool *pool = ifq->pool;
> +
> +       spin_lock_bh(&pool->freelist_lock);
> +       while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
> +               struct io_zc_rx_buf *buf;
> +               u32 pgid;
> +
> +               pgid = pool->freelist[--pool->free_count];
> +               buf = &pool->bufs[pgid];
> +
> +               io_zc_add_pp_cache(pp, buf);
> +               pp->pages_state_hold_cnt++;
> +               trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf),
> +                                          pp->pages_state_hold_cnt);
> +       }
> +       spin_unlock_bh(&pool->freelist_lock);
> +}
> +
> +static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool,
> +                                struct io_zc_rx_buf *buf)
> +{
> +       spin_lock_bh(&pool->freelist_lock);
> +       pool->freelist[pool->free_count++] = io_buf_pgid(pool, buf);
> +       spin_unlock_bh(&pool->freelist_lock);
> +}
> +
> +static struct page *io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp)
> +{
> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
> +
> +       /* pp should already be ensuring that */
> +       if (unlikely(pp->alloc.count))
> +               goto out_return;
> +
> +       io_zc_rx_ring_refill(pp, ifq);
> +       if (likely(pp->alloc.count))
> +               goto out_return;
> +
> +       io_zc_rx_refill_slow(pp, ifq);
> +       if (!pp->alloc.count)
> +               return NULL;
> +out_return:
> +       return pp->alloc.cache[--pp->alloc.count];
> +}
> +
> +static bool io_pp_zc_release_page(struct page_pool *pp, struct page *page)
> +{
> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
> +       struct page_pool_iov *ppiov;
> +
> +       if (WARN_ON_ONCE(!page_is_page_pool_iov(page)))
> +               return false;
> +
> +       ppiov = page_to_page_pool_iov(page);
> +
> +       if (!page_pool_iov_sub_and_test(ppiov, 1))
> +               return false;
> +
> +       io_zc_rx_recycle_buf(ifq->pool, io_iov_to_buf(ppiov));
> +       return true;
> +}
> +
> +static void io_pp_zc_scrub(struct page_pool *pp)
> +{
> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
> +       struct io_zc_rx_pool *pool = ifq->pool;
> +       struct io_zc_rx_buf *buf;
> +       int i;
> +
> +       for (i = 0; i < pool->nr_bufs; i++) {
> +               buf = &pool->bufs[i];
> +
> +               if (io_zc_rx_put_buf_uref(buf)) {
> +                       /* just return it to the page pool, it'll clean it up */
> +                       refcount_set(&buf->ppiov.refcount, 1);
> +                       page_pool_iov_put_many(&buf->ppiov, 1);
> +               }
> +       }
> +}
> +

I'm unsure about this. So scrub forcibly frees the pending data? Why
does this work? Can't the application want to read this data even
though the page_pool is destroyed?

AFAIK the page_pool being destroyed doesn't mean we can free the
pages/niovs in it. The niovs that were in it can be waiting on the
receive queue for the application to call recvmsg() on it. Does
io_uring work differently such that you're able to force-free the
ppiovs/niovs?

> +static void io_zc_rx_init_pool(struct io_zc_rx_pool *pool,
> +                              struct page_pool *pp)
> +{
> +       struct io_zc_rx_buf *buf;
> +       int i;
> +
> +       for (i = 0; i < pool->nr_bufs; i++) {
> +               buf = &pool->bufs[i];
> +               buf->ppiov.pp = pp;
> +       }
> +}
> +
> +static int io_pp_zc_init(struct page_pool *pp)
> +{
> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
> +
> +       if (!ifq)
> +               return -EINVAL;
> +       if (pp->p.order != 0)
> +               return -EINVAL;
> +       if (!pp->p.napi)
> +               return -EINVAL;
> +
> +       io_zc_rx_init_pool(ifq->pool, pp);
> +       percpu_ref_get(&ifq->ctx->refs);
> +       ifq->pp = pp;
> +       return 0;
> +}
> +
> +static void io_pp_zc_destroy(struct page_pool *pp)
> +{
> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
> +       struct io_zc_rx_pool *pool = ifq->pool;
> +
> +       ifq->pp = NULL;
> +
> +       if (WARN_ON_ONCE(pool->free_count != pool->nr_bufs))
> +               return;
> +       percpu_ref_put(&ifq->ctx->refs);
> +}
> +
> +const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
> +       .alloc_pages            = io_pp_zc_alloc_pages,
> +       .release_page           = io_pp_zc_release_page,
> +       .init                   = io_pp_zc_init,
> +       .destroy                = io_pp_zc_destroy,
> +       .scrub                  = io_pp_zc_scrub,
> +};
> +EXPORT_SYMBOL(io_uring_pp_zc_ops);
> +
> +
>  #endif
> diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
> index af1d865525d2..00d864700c67 100644
> --- a/io_uring/zc_rx.h
> +++ b/io_uring/zc_rx.h
> @@ -10,6 +10,9 @@
>  #define IO_ZC_IFQ_IDX_OFFSET           16
>  #define IO_ZC_IFQ_IDX_MASK             ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
>
> +#define IO_ZC_RX_UREF                  0x10000
> +#define IO_ZC_RX_KREF_MASK             (IO_ZC_RX_UREF - 1)
> +
>  struct io_zc_rx_pool {
>         struct io_zc_rx_ifq     *ifq;
>         struct io_zc_rx_buf     *bufs;
> @@ -26,12 +29,15 @@ struct io_zc_rx_ifq {
>         struct io_ring_ctx              *ctx;
>         struct net_device               *dev;
>         struct io_zc_rx_pool            *pool;
> +       struct page_pool                *pp;
>
>         struct io_rbuf_ring             *ring;
>         struct io_uring_rbuf_rqe        *rqes;
>         struct io_uring_rbuf_cqe        *cqes;
>         u32                             rq_entries;
>         u32                             cq_entries;
> +       u32                             cached_rq_head;
> +       u32                             cached_cq_tail;
>
>         /* hw rx descriptor ring id */
>         u32                             if_rxq_id;
> --
> 2.39.3
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper
  2023-12-19 23:22   ` Mina Almasry
@ 2023-12-19 23:59     ` Pavel Begunkov
  0 siblings, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-19 23:59 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern

On 12/19/23 23:22, Mina Almasry wrote:
> On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> NOT FOR UPSTREAM
>>
>> The final version will depend on how ppiov looks like, but add a
>> convenience helper for now.
>>
> 
> Thanks, this patch becomes unnecessary once you pull in the latest
> version of our changes; you could use net_iov_to_netmem() added here:
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> 
> Not any kind of objection from me, just an FYI.

Right, that's predicated, and that's why there are disclaimers
saying that it depends on your paches final form, and many of
such patches will get dropped as unnecessary.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider
  2023-12-19 23:39   ` Mina Almasry
@ 2023-12-20  0:04     ` Pavel Begunkov
  0 siblings, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20  0:04 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern

On 12/19/23 23:39, Mina Almasry wrote:
> On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Allow creating a special io_uring pp memory providers, which will be for
>> implementing io_uring zerocopy receive.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
> 
> For your non-RFC versions, I think maybe you want to do a patch by
> patch make W=1. I suspect this patch would build fail, because the
> next patch adds the io_uring_pp_zc_ops. You're likely skipping this
> step because this is an RFC, which is fine.

Hmm? io_uring_pp_zc_ops is added is Patch 13, used in Patch 14.
Compiles well.


>> ---
>>   include/net/page_pool/types.h | 1 +
>>   net/core/page_pool.c          | 6 ++++++
>>   2 files changed, 7 insertions(+)
>>
>> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
>> index fd846cac9fb6..f54ee759e362 100644
>> --- a/include/net/page_pool/types.h
>> +++ b/include/net/page_pool/types.h
>> @@ -129,6 +129,7 @@ struct mem_provider;
>>   enum pp_memory_provider_type {
>>          __PP_MP_NONE, /* Use system allocator directly */
>>          PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */
>> +       PP_MP_IOU_ZCRX, /* io_uring zerocopy receive provider */
>>   };
>>
>>   struct pp_memory_provider_ops {
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 9e3073d61a97..ebf5ff009d9d 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/ethtool.h>
>>   #include <linux/netdevice.h>
>>   #include <linux/genalloc.h>
>> +#include <linux/io_uring/net.h>
>>
>>   #include <trace/events/page_pool.h>
>>
>> @@ -242,6 +243,11 @@ static int page_pool_init(struct page_pool *pool,
>>          case PP_MP_DMABUF_DEVMEM:
>>                  pool->mp_ops = &dmabuf_devmem_ops;
>>                  break;
>> +#if defined(CONFIG_IO_URING)
>> +       case PP_MP_IOU_ZCRX:
>> +               pool->mp_ops = &io_uring_pp_zc_ops;
>> +               break;
>> +#endif
>>          default:
>>                  err = -EINVAL;
>>                  goto free_ptr_ring;
>> --
>> 2.39.3
>>
> 
> 
> --
> Thanks,
> Mina

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx
  2023-12-19 23:44   ` Mina Almasry
@ 2023-12-20  0:39     ` Pavel Begunkov
  0 siblings, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20  0:39 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern

On 12/19/23 23:44, Mina Almasry wrote:
> On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> We're adding a new pp memory provider to implement io_uring zerocopy
>> receive. It'll be "registered" in pp and used in later paches.
>>
>> The typical life cycle of a buffer goes as follows: first it's allocated
>> to a driver with the initial refcount set to 1. The drivers fills it
>> with data, puts it into an skb and passes down the stack, where it gets
>> queued up to a socket. Later, a zc io_uring request will be receiving
>> data from the socket from a task context. At that point io_uring will
>> tell the userspace that this buffer has some data by posting an
>> appropriate completion. It'll also elevating the refcount by
>> IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading
> 
> After you rebase to the latest RFC, you will want to elevante the
> [pp|n]iov->pp_ref_count, rather than the non-existent ppiov->refcount.
> I do the same thing for devmem TCP.
> 
>> the data. When the userspace is done with the buffer it should return it
>> back to io_uring by adding an entry to the buffer refill ring. When
>> necessary io_uring will poll the refill ring, compare references
>> including IO_ZC_RX_UREF and reuse the buffer.
>>
>> Initally, all buffers are placed in a spinlock protected ->freelist.
>> It's a slow path stash, where buffers are considered to be unallocated
>> and not exposed to core page pool. On allocation, pp will first try
>> all its caches, and the ->alloc_pages callback if everything else
>> failed.
>>
>> The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill
>> ring. The consumption from the ring is always done in the attached napi
>> context, so no additional synchronisation required. If that fails we'll
>> be getting buffers from the ->freelist.
>>
>> Note: only ->freelist are considered unallocated for page pool, so we
>> only add pages_state_hold_cnt when allocating from there. Subsequently,
>> as page_pool_return_page() and others bump the ->pages_state_release_cnt
>> counter, io_pp_zc_release_page() can only use ->freelist, which is not a
>> problem as it's not a slow path.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   include/linux/io_uring/net.h |   5 +
>>   io_uring/zc_rx.c             | 204 +++++++++++++++++++++++++++++++++++
>>   io_uring/zc_rx.h             |   6 ++
>>   3 files changed, 215 insertions(+)
>>
>> diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
>> index d994d26116d0..13244ae5fc4a 100644
>> --- a/include/linux/io_uring/net.h
>> +++ b/include/linux/io_uring/net.h
>> @@ -13,6 +13,11 @@ struct io_zc_rx_buf {
>>   };
>>
>>   #if defined(CONFIG_IO_URING)
>> +
>> +#if defined(CONFIG_PAGE_POOL)
>> +extern const struct pp_memory_provider_ops io_uring_pp_zc_ops;
>> +#endif
>> +
>>   int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>
>>   #else
>> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
>> index 1e656b481725..ff1dac24ac40 100644
>> --- a/io_uring/zc_rx.c
>> +++ b/io_uring/zc_rx.c
>> @@ -6,6 +6,7 @@
>>   #include <linux/io_uring.h>
>>   #include <linux/netdevice.h>
>>   #include <linux/nospec.h>
>> +#include <trace/events/page_pool.h>
>>
>>   #include <uapi/linux/io_uring.h>
>>
>> @@ -387,4 +388,207 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
>>          return 0;
>>   }
>>
>> +static inline struct io_zc_rx_buf *io_iov_to_buf(struct page_pool_iov *iov)
>> +{
>> +       return container_of(iov, struct io_zc_rx_buf, ppiov);
>> +}
>> +
>> +static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool,
>> +                                  struct io_zc_rx_buf *buf)
>> +{
>> +       return buf - pool->bufs;
>> +}
>> +
>> +static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
>> +{
>> +       refcount_add(IO_ZC_RX_UREF, &buf->ppiov.refcount);
>> +}
>> +
>> +static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf)
>> +{
>> +       if (page_pool_iov_refcount(&buf->ppiov) < IO_ZC_RX_UREF)
>> +               return false;
>> +
>> +       return page_pool_iov_sub_and_test(&buf->ppiov, IO_ZC_RX_UREF);
>> +}
>> +
>> +static inline struct page *io_zc_buf_to_pp_page(struct io_zc_rx_buf *buf)
>> +{
>> +       return page_pool_mangle_ppiov(&buf->ppiov);
>> +}
>> +
>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
>> +                                     struct io_zc_rx_buf *buf)
>> +{
>> +       refcount_set(&buf->ppiov.refcount, 1);
>> +       pp->alloc.cache[pp->alloc.count++] = io_zc_buf_to_pp_page(buf);
>> +}
>> +
>> +static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq)
>> +{
>> +       struct io_rbuf_ring *ring = ifq->ring;
>> +       u32 entries;
>> +
>> +       entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head;
>> +       return min(entries, ifq->rq_entries);
>> +}
>> +
>> +static void io_zc_rx_ring_refill(struct page_pool *pp,
>> +                                struct io_zc_rx_ifq *ifq)
>> +{
>> +       unsigned int entries = io_zc_rx_rqring_entries(ifq);
>> +       unsigned int mask = ifq->rq_entries - 1;
>> +       struct io_zc_rx_pool *pool = ifq->pool;
>> +
>> +       if (unlikely(!entries))
>> +               return;
>> +
>> +       while (entries--) {
>> +               unsigned int rq_idx = ifq->cached_rq_head++ & mask;
>> +               struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx];
>> +               u32 pgid = rqe->off / PAGE_SIZE;
>> +               struct io_zc_rx_buf *buf = &pool->bufs[pgid];
>> +
>> +               if (!io_zc_rx_put_buf_uref(buf))
>> +                       continue;
>> +               io_zc_add_pp_cache(pp, buf);
>> +               if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL)
>> +                       break;
>> +       }
>> +       smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head);
>> +}
>> +
>> +static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq)
>> +{
>> +       struct io_zc_rx_pool *pool = ifq->pool;
>> +
>> +       spin_lock_bh(&pool->freelist_lock);
>> +       while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
>> +               struct io_zc_rx_buf *buf;
>> +               u32 pgid;
>> +
>> +               pgid = pool->freelist[--pool->free_count];
>> +               buf = &pool->bufs[pgid];
>> +
>> +               io_zc_add_pp_cache(pp, buf);
>> +               pp->pages_state_hold_cnt++;
>> +               trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf),
>> +                                          pp->pages_state_hold_cnt);
>> +       }
>> +       spin_unlock_bh(&pool->freelist_lock);
>> +}
>> +
>> +static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool,
>> +                                struct io_zc_rx_buf *buf)
>> +{
>> +       spin_lock_bh(&pool->freelist_lock);
>> +       pool->freelist[pool->free_count++] = io_buf_pgid(pool, buf);
>> +       spin_unlock_bh(&pool->freelist_lock);
>> +}
>> +
>> +static struct page *io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp)
>> +{
>> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
>> +
>> +       /* pp should already be ensuring that */
>> +       if (unlikely(pp->alloc.count))
>> +               goto out_return;
>> +
>> +       io_zc_rx_ring_refill(pp, ifq);
>> +       if (likely(pp->alloc.count))
>> +               goto out_return;
>> +
>> +       io_zc_rx_refill_slow(pp, ifq);
>> +       if (!pp->alloc.count)
>> +               return NULL;
>> +out_return:
>> +       return pp->alloc.cache[--pp->alloc.count];
>> +}
>> +
>> +static bool io_pp_zc_release_page(struct page_pool *pp, struct page *page)
>> +{
>> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
>> +       struct page_pool_iov *ppiov;
>> +
>> +       if (WARN_ON_ONCE(!page_is_page_pool_iov(page)))
>> +               return false;
>> +
>> +       ppiov = page_to_page_pool_iov(page);
>> +
>> +       if (!page_pool_iov_sub_and_test(ppiov, 1))
>> +               return false;
>> +
>> +       io_zc_rx_recycle_buf(ifq->pool, io_iov_to_buf(ppiov));
>> +       return true;
>> +}
>> +
>> +static void io_pp_zc_scrub(struct page_pool *pp)
>> +{
>> +       struct io_zc_rx_ifq *ifq = pp->mp_priv;
>> +       struct io_zc_rx_pool *pool = ifq->pool;
>> +       struct io_zc_rx_buf *buf;
>> +       int i;
>> +
>> +       for (i = 0; i < pool->nr_bufs; i++) {
>> +               buf = &pool->bufs[i];
>> +
>> +               if (io_zc_rx_put_buf_uref(buf)) {
>> +                       /* just return it to the page pool, it'll clean it up */
>> +                       refcount_set(&buf->ppiov.refcount, 1);
>> +                       page_pool_iov_put_many(&buf->ppiov, 1);
>> +               }
>> +       }
>> +}
>> +
> 
> I'm unsure about this. So scrub forcibly frees the pending data? Why
> does this work? Can't the application want to read this data even
> though the page_pool is destroyed?

It only affects buffers that were given back to the userspace
and are still there. Even if it scrubs like that, the completions are
still visible to the user, there are pointing to correct buffers,
which are still supposed to be mapped well. Yes, we return them earlier
to the kernel, but since reallocation should not be possible at that
point the data in the buffers would stay correct. There might be
problems with copy fallback, but it'll need to improve anyway,
I'm keeping an eye on it

In any case, I'd say if page pool is destroyed while we're still
using it, I'd say something is going really wrong and the user should
terminate all of it, but that raises a question in what valid cases
the kernel might decide to reallocate pp (device reconfiguration),
and what the application is supposed to do about it.


> AFAIK the page_pool being destroyed doesn't mean we can free the
> pages/niovs in it. The niovs that were in it can be waiting on the
> receive queue for the application to call recvmsg() on it. Does
> io_uring work differently such that you're able to force-free the
> ppiovs/niovs?

If a buffer is used by the stack, the stack will hold a reference.
We're not just blindly freeing them here, only dropping the userspace
reference. An analogy for devmem would to remove all involved buffers
from dont_need xarrays (also dropping a reference IIRC).

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle
  2023-12-19 23:35   ` Mina Almasry
@ 2023-12-20  0:49     ` Pavel Begunkov
  0 siblings, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20  0:49 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern

On 12/19/23 23:35, Mina Almasry wrote:
> On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> NOT FOR UPSTREAM
>> The final version will depend on how the ppiov infra looks like
>>
>> Page pool is tracking how many pages were allocated and returned, which
>> serves for refcounting the pool, and so every page/frag allocated should
>> eventually come back to the page pool via appropriate ways, e.g. by
>> calling page_pool_put_page().
>>
>> When it comes to normal page pools (i.e. without memory providers
>> attached), it's fine to return a page when it's still refcounted by
>> somewhat in the stack, in which case we'll "detach" the page from the
>> pool and rely on page refcount for it to return back to the kernel.
>>
>> Memory providers are different, at least ppiov based ones, they need
>> all their buffers to eventually return back, so apart from custom pp
>> ->release handlers, we'll catch when someone puts down a ppiov and call
>> its memory provider to handle it, i.e. __page_pool_iov_free().
>>
>> The first problem is that __page_pool_iov_free() hard coded devmem
>> handling, and other providers need a flexible way to specify their own
>> callbacks.
>>
>> The second problem is that it doesn't go through the generic page pool
>> paths and so can't do the mentioned pp accounting right. And we can't
>> even safely rely on page_pool_put_page() to be called somewhere before
>> to do the pp refcounting, because then the page pool might get destroyed
>> and ppiov->pp would point to garbage.
>>
>> The solution is to make the pp ->release callback to be responsible for
>> properly recycling its buffers, e.g. calling what was
>> __page_pool_iov_free() before in case of devmem.
>> page_pool_iov_put_many() will be returning buffers to the page pool.
>>
> 
> Hmm this patch is working on top of slightly outdated code. I think> the correct solution here is to transition to using pp_ref_count for
> refcounting the ppiovs/niovs. Once we do that, we no longer need
> special refcounting for ppiovs, they're refcounted identically to
> pages, makes the pp more maintainable, gives us some unified handling
> of page pool refcounting, it becomes trivial to support fragmented
> pages which require a pp_ref_count, and all the code in this patch can
> go away.
> 
> I'm unsure if this patch is just because you haven't rebased to my
> latest RFC (which is completely fine by me), or if you actually think
> using pp_ref_count for refcounting is wrong and want us to go back to
> the older model which required some custom handling for ppiov and
> disabled frag support. I'm guessing it's the former, but please
> correct if I'm wrong.

Right, it's based on older patches, it'd be a fool's work keep
rebasing it while the code is still changing unless there is a
good reason for that.

I haven't taken a look at devmem v5, I definitely going to. IMHO,
this approach is versatile and clear, but if there is a better one,
I'm all for it.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov
  2023-12-19 23:24   ` Mina Almasry
@ 2023-12-20  1:29     ` Pavel Begunkov
  2024-01-02 16:11       ` Mina Almasry
  0 siblings, 1 reply; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20  1:29 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern

On 12/19/23 23:24, Mina Almasry wrote:
> On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> NOT FOR UPSTREAM
>>
>> There will be more users of struct page_pool_iov, and ppiovs from one
>> subsystem must not be used by another. That should never happen for any
>> sane application, but we need to enforce it in case of bufs and/or
>> malicious users.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   net/ipv4/tcp.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 33a8bb63fbf5..9c6b18eebb5b 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -2384,6 +2384,13 @@ static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
>>                          }
>>
>>                          ppiov = skb_frag_page_pool_iov(frag);
>> +
>> +                       /* Disallow non devmem owned buffers */
>> +                       if (ppiov->pp->p.memory_provider != PP_MP_DMABUF_DEVMEM) {
>> +                               err = -ENODEV;
>> +                               goto out;
>> +                       }
>> +
> 
> Instead of this, I maybe recommend modifying the skb->dmabuf flag? My
> mental model is that flag means all the frags in the skb are

That's a good point, we need to separate them, and I have it in my
todo list.

> specifically dmabuf, not general ppiovs or net_iovs. Is it possible to
> add skb->io_uring or something?

->io_uring flag is not feasible, converting ->devmem into a type
{page,devmem,iouring} is better but not great either.

> If that bloats the skb headers, then maybe we need another place to
> put this flag. Maybe the [page_pool|net]_iov should declare whether
> it's dmabuf or otherwise, and we can check frag[0] and assume all

ppiov->pp should be enough, either not mixing buffers from different
pools or comparing pp->ops or some pp->type.

> frags are the same as frag0.

I think I like this one the most. I think David Ahern mentioned
before, but would be nice having it on per frag basis and kill
->devmem flag. That would still stop collapsing if frags are
from different pools or so.

> But IMO the page pool internals should not leak into the
> implementation of generic tcp stack functions.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 06/20] io_uring: separate header for exported net bits
  2023-12-19 21:03 ` [RFC PATCH v3 06/20] io_uring: separate header for exported net bits David Wei
@ 2023-12-20 16:01   ` Jens Axboe
  0 siblings, 0 replies; 50+ messages in thread
From: Jens Axboe @ 2023-12-20 16:01 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/19/23 2:03 PM, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
> 
> We're exporting some io_uring bits to networking, e.g. for implementing
> a net callback for io_uring cmds, but we don't want to expose more than
> needed. Add a separate header for networking.

Reviewed-by: Jens Axboe <[email protected]>

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq
  2023-12-19 21:03 ` [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq David Wei
@ 2023-12-20 16:06   ` Jens Axboe
  2023-12-20 16:24     ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2023-12-20 16:06 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/19/23 2:03 PM, David Wei wrote:
> From: David Wei <[email protected]>
> 
> This patch sets up ZC for an Rx queue in a net device when an ifq is
> registered with io_uring. The Rx queue is specified in the registration
> struct.
> 
> For now since there is only one ifq, its destruction is implicit during
> io_uring cleanup.
> 
> Signed-off-by: David Wei <[email protected]>
> ---
>  io_uring/zc_rx.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
> index 7e3e6f6d446b..259e08a34ab2 100644
> --- a/io_uring/zc_rx.c
> +++ b/io_uring/zc_rx.c
> @@ -4,6 +4,7 @@
>  #include <linux/errno.h>
>  #include <linux/mm.h>
>  #include <linux/io_uring.h>
> +#include <linux/netdevice.h>
>  
>  #include <uapi/linux/io_uring.h>
>  
> @@ -11,6 +12,34 @@
>  #include "kbuf.h"
>  #include "zc_rx.h"
>  
> +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);

Let's get rid of this, since it isn't even typedef'ed on the networking
side. Doesn't really buy us anything, and it's only used once anyway.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-19 21:03 ` [RFC PATCH v3 07/20] io_uring: add interface queue David Wei
@ 2023-12-20 16:13   ` Jens Axboe
  2023-12-20 16:23     ` Pavel Begunkov
  2023-12-21  1:44     ` David Wei
  2023-12-21 17:57   ` Willem de Bruijn
  1 sibling, 2 replies; 50+ messages in thread
From: Jens Axboe @ 2023-12-20 16:13 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/19/23 2:03 PM, David Wei wrote:
> @@ -750,6 +753,54 @@ enum {
>  	SOCKET_URING_OP_SETSOCKOPT,
>  };
>  
> +struct io_uring_rbuf_rqe {
> +	__u32	off;
> +	__u32	len;
> +	__u16	region;
> +	__u8	__pad[6];
> +};
> +
> +struct io_uring_rbuf_cqe {
> +	__u32	off;
> +	__u32	len;
> +	__u16	region;
> +	__u8	sock;
> +	__u8	flags;
> +	__u8	__pad[2];
> +};

Looks like this leaves a gap? Should be __pad[4] or probably just __u32
__pad; For all of these, definitely worth thinking about if we'll ever
need more than the slight padding. Might not hurt to always leave 8
bytes extra, outside of the required padding.

> +struct io_rbuf_rqring_offsets {
> +	__u32	head;
> +	__u32	tail;
> +	__u32	rqes;
> +	__u8	__pad[4];
> +};

Ditto here, __u32 __pad;

> +struct io_rbuf_cqring_offsets {
> +	__u32	head;
> +	__u32	tail;
> +	__u32	cqes;
> +	__u8	__pad[4];
> +};

And here.

> +
> +/*
> + * Argument for IORING_REGISTER_ZC_RX_IFQ
> + */
> +struct io_uring_zc_rx_ifq_reg {
> +	__u32	if_idx;
> +	/* hw rx descriptor ring id */
> +	__u32	if_rxq_id;
> +	__u32	region_id;
> +	__u32	rq_entries;
> +	__u32	cq_entries;
> +	__u32	flags;
> +	__u16	cpu;
> +
> +	__u32	mmap_sz;
> +	struct io_rbuf_rqring_offsets rq_off;
> +	struct io_rbuf_cqring_offsets cq_off;
> +};

You have rq_off starting at a 48-bit offset here, don't think this is
going to work as it's uapi. You'd need padding to align it to 64-bits.

> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
> new file mode 100644
> index 000000000000..5fc94cad5e3a
> --- /dev/null
> +++ b/io_uring/zc_rx.c
> +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
> +			  struct io_uring_zc_rx_ifq_reg __user *arg)
> +{
> +	struct io_uring_zc_rx_ifq_reg reg;
> +	struct io_zc_rx_ifq *ifq;
> +	int ret;
> +
> +	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
> +		return -EINVAL;
> +	if (copy_from_user(&reg, arg, sizeof(reg)))
> +		return -EFAULT;
> +	if (ctx->ifq)
> +		return -EBUSY;
> +	if (reg.if_rxq_id == -1)
> +		return -EINVAL;
> +
> +	ifq = io_zc_rx_ifq_alloc(ctx);
> +	if (!ifq)
> +		return -ENOMEM;
> +
> +	/* TODO: initialise network interface */
> +
> +	ret = io_allocate_rbuf_ring(ifq, &reg);
> +	if (ret)
> +		goto err;
> +
> +	/* TODO: map zc region and initialise zc pool */
> +
> +	ifq->rq_entries = reg.rq_entries;
> +	ifq->cq_entries = reg.cq_entries;
> +	ifq->if_rxq_id = reg.if_rxq_id;
> +	ctx->ifq = ifq;

As these TODO's are removed in later patches, I think you should just
not include them to begin with. It reads more like notes to yourself,
doesn't really add anything to the series.

> +void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
> +{
> +	lockdep_assert_held(&ctx->uring_lock);
> +}

This is a bit odd?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers
  2023-12-19 21:03 ` [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers David Wei
@ 2023-12-20 16:13   ` Jens Axboe
  0 siblings, 0 replies; 50+ messages in thread
From: Jens Axboe @ 2023-12-20 16:13 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/19/23 2:03 PM, David Wei wrote:
> From: David Wei <[email protected]>
> 
> This patch adds mmap support for ifq rbuf rings. There are two rings and
> a struct io_rbuf_ring that contains the head and tail ptrs into each
> ring.
> 
> Just like the io_uring SQ/CQ rings, userspace issues a single mmap call
> using the io_uring fd w/ magic offset IORING_OFF_RBUF_RING. An opaque
> ptr is returned to userspace, which is then expected to use the offsets
> returned in the registration struct to get access to the head/tail and
> rings.

Reviewed-by: Jens Axboe <[email protected]>

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-20 16:13   ` Jens Axboe
@ 2023-12-20 16:23     ` Pavel Begunkov
  2023-12-21  1:44     ` David Wei
  1 sibling, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20 16:23 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/20/23 16:13, Jens Axboe wrote:
> On 12/19/23 2:03 PM, David Wei wrote:
>> @@ -750,6 +753,54 @@ enum {
>>   	SOCKET_URING_OP_SETSOCKOPT,
>>   };
>>   
>> +struct io_uring_rbuf_rqe {
>> +	__u32	off;
>> +	__u32	len;
>> +	__u16	region;
>> +	__u8	__pad[6];
>> +};
>> +
>> +struct io_uring_rbuf_cqe {
>> +	__u32	off;
>> +	__u32	len;
>> +	__u16	region;
>> +	__u8	sock;
>> +	__u8	flags;
>> +	__u8	__pad[2];
>> +};
> 
> Looks like this leaves a gap? Should be __pad[4] or probably just __u32
> __pad; For all of these, definitely worth thinking about if we'll ever
> need more than the slight padding. Might not hurt to always leave 8
> bytes extra, outside of the required padding.

Good catch, and that all should be paholed to ensure all of them
are fitted nicely.

FWIW, the format will also be revisited, e.g. max 256 sockets per
ifq is too restrictive, and most probably moved from a separate queue
into the CQ.


>> +struct io_rbuf_rqring_offsets {
>> +	__u32	head;
>> +	__u32	tail;
>> +	__u32	rqes;
>> +	__u8	__pad[4];
>> +};
> 
> Ditto here, __u32 __pad;
> 
>> +struct io_rbuf_cqring_offsets {
>> +	__u32	head;
>> +	__u32	tail;
>> +	__u32	cqes;
>> +	__u8	__pad[4];
>> +};
> 
> And here.
> 
>> +
>> +/*
>> + * Argument for IORING_REGISTER_ZC_RX_IFQ
>> + */
>> +struct io_uring_zc_rx_ifq_reg {
>> +	__u32	if_idx;
>> +	/* hw rx descriptor ring id */
>> +	__u32	if_rxq_id;
>> +	__u32	region_id;
>> +	__u32	rq_entries;
>> +	__u32	cq_entries;
>> +	__u32	flags;
>> +	__u16	cpu;
>> +
>> +	__u32	mmap_sz;
>> +	struct io_rbuf_rqring_offsets rq_off;
>> +	struct io_rbuf_cqring_offsets cq_off;
>> +};
> 
> You have rq_off starting at a 48-bit offset here, don't think this is
> going to work as it's uapi. You'd need padding to align it to 64-bits.
> 
>> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
>> new file mode 100644
>> index 000000000000..5fc94cad5e3a
>> --- /dev/null
>> +++ b/io_uring/zc_rx.c
>> +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
>> +			  struct io_uring_zc_rx_ifq_reg __user *arg)
>> +{
>> +	struct io_uring_zc_rx_ifq_reg reg;
>> +	struct io_zc_rx_ifq *ifq;
>> +	int ret;
>> +
>> +	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
>> +		return -EINVAL;
>> +	if (copy_from_user(&reg, arg, sizeof(reg)))
>> +		return -EFAULT;
>> +	if (ctx->ifq)
>> +		return -EBUSY;
>> +	if (reg.if_rxq_id == -1)
>> +		return -EINVAL;
>> +
>> +	ifq = io_zc_rx_ifq_alloc(ctx);
>> +	if (!ifq)
>> +		return -ENOMEM;
>> +
>> +	/* TODO: initialise network interface */
>> +
>> +	ret = io_allocate_rbuf_ring(ifq, &reg);
>> +	if (ret)
>> +		goto err;
>> +
>> +	/* TODO: map zc region and initialise zc pool */
>> +
>> +	ifq->rq_entries = reg.rq_entries;
>> +	ifq->cq_entries = reg.cq_entries;
>> +	ifq->if_rxq_id = reg.if_rxq_id;
>> +	ctx->ifq = ifq;
> 
> As these TODO's are removed in later patches, I think you should just
> not include them to begin with. It reads more like notes to yourself,
> doesn't really add anything to the series.
> 
>> +void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
>> +{
>> +	lockdep_assert_held(&ctx->uring_lock);
>> +}
> 
> This is a bit odd?

Oh, this chunk actually leaked here from my rebases, which is not
a big deal as it provides the interface and a later patch implements
it, but might be better to move it there in the first place.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq
  2023-12-20 16:06   ` Jens Axboe
@ 2023-12-20 16:24     ` Pavel Begunkov
  0 siblings, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20 16:24 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/20/23 16:06, Jens Axboe wrote:
> On 12/19/23 2:03 PM, David Wei wrote:
>> From: David Wei <[email protected]>
>>
>> This patch sets up ZC for an Rx queue in a net device when an ifq is
>> registered with io_uring. The Rx queue is specified in the registration
>> struct.
>>
>> For now since there is only one ifq, its destruction is implicit during
>> io_uring cleanup.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   io_uring/zc_rx.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
>> index 7e3e6f6d446b..259e08a34ab2 100644
>> --- a/io_uring/zc_rx.c
>> +++ b/io_uring/zc_rx.c
>> @@ -4,6 +4,7 @@
>>   #include <linux/errno.h>
>>   #include <linux/mm.h>
>>   #include <linux/io_uring.h>
>> +#include <linux/netdevice.h>
>>   
>>   #include <uapi/linux/io_uring.h>
>>   
>> @@ -11,6 +12,34 @@
>>   #include "kbuf.h"
>>   #include "zc_rx.h"
>>   
>> +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
> 
> Let's get rid of this, since it isn't even typedef'ed on the networking
> side. Doesn't really buy us anything, and it's only used once anyway.

That should naturally go away once we move from ndo_bpf

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-19 21:03 ` [RFC PATCH v3 15/20] io_uring: add io_recvzc request David Wei
@ 2023-12-20 16:27   ` Jens Axboe
  2023-12-20 17:04     ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2023-12-20 16:27 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/19/23 2:03 PM, David Wei wrote:
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 454ba301ae6b..7a2aadf6962c 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>  	unsigned int cflags;
>  
>  	cflags = io_put_kbuf(req, issue_flags);
> -	if (msg->msg_inq && msg->msg_inq != -1)
> +	if (msg && msg->msg_inq && msg->msg_inq != -1)
>  		cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>  
>  	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
> @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>  			io_recv_prep_retry(req);
>  			/* Known not-empty or unknown state, retry */
>  			if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
> -			    msg->msg_inq == -1)
> +			    (msg && msg->msg_inq == -1))
>  				return false;
>  			if (issue_flags & IO_URING_F_MULTISHOT)
>  				*ret = IOU_ISSUE_SKIP_COMPLETE;

These are a bit ugly, just pass in a dummy msg for this?

> +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> +	struct socket *sock;
> +	unsigned flags;
> +	int ret, min_ret = 0;
> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> +	struct io_zc_rx_ifq *ifq;

Eg
	struct msghdr dummy_msg;

	dummy_msg.msg_inq = -1;

which will eat some stack, but probably not really an issue.


> +	if (issue_flags & IO_URING_F_UNLOCKED)
> +		return -EAGAIN;

This seems odd, why? If we're called with IO_URING_F_UNLOCKED set, then
it's from io-wq. And returning -EAGAIN there will not do anything to
change that. Usually this check is done to lock if we don't have it
already, eg with io_ring_submit_unlock(). Maybe I'm missing something
here!

> @@ -590,5 +603,230 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
>  };
>  EXPORT_SYMBOL(io_uring_pp_zc_ops);
>  
> +static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq)
> +{
> +	struct io_uring_rbuf_cqe *cqe;
> +	unsigned int cq_idx, queued, free, entries;
> +	unsigned int mask = ifq->cq_entries - 1;
> +
> +	cq_idx = ifq->cached_cq_tail & mask;
> +	smp_rmb();
> +	queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries);
> +	free = ifq->cq_entries - queued;
> +	entries = min(free, ifq->cq_entries - cq_idx);
> +	if (!entries)
> +		return NULL;
> +
> +	cqe = &ifq->cqes[cq_idx];
> +	ifq->cached_cq_tail++;
> +	return cqe;
> +}

smp_rmb() here needs a good comment on what the matching smp_wmb() is,
and why it's needed. Or maybe it should be an smp_load_acquire()?

> +
> +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
> +			   int off, int len, unsigned sock_idx)
> +{
> +	off += skb_frag_off(frag);
> +
> +	if (likely(page_is_page_pool_iov(frag->bv_page))) {
> +		struct io_uring_rbuf_cqe *cqe;
> +		struct io_zc_rx_buf *buf;
> +		struct page_pool_iov *ppiov;
> +
> +		ppiov = page_to_page_pool_iov(frag->bv_page);
> +		if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX ||
> +		    ppiov->pp->mp_priv != ifq)
> +			return -EFAULT;
> +
> +		cqe = io_zc_get_rbuf_cqe(ifq);
> +		if (!cqe)
> +			return -ENOBUFS;
> +
> +		buf = io_iov_to_buf(ppiov);
> +		io_zc_rx_get_buf_uref(buf);
> +
> +		cqe->region = 0;
> +		cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
> +		cqe->len = len;
> +		cqe->sock = sock_idx;
> +		cqe->flags = 0;
> +	} else {
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return len;
> +}

I think this would read a lot better as:

	if (unlikely(!page_is_page_pool_iov(frag->bv_page)))
		return -EOPNOTSUPP;

	...
	return len;

> +zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
> +	       unsigned int offset, size_t len)
> +{
> +	struct io_zc_rx_args *args = desc->arg.data;
> +	struct io_zc_rx_ifq *ifq = args->ifq;
> +	struct socket *sock = args->sock;
> +	unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK;
> +	struct sk_buff *frag_iter;
> +	unsigned start, start_off;
> +	int i, copy, end, off;
> +	int ret = 0;
> +
> +	start = skb_headlen(skb);
> +	start_off = offset;
> +
> +	if (offset < start)
> +		return -EOPNOTSUPP;
> +
> +	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> +		const skb_frag_t *frag;
> +
> +		WARN_ON(start > offset + len);

This probably can't happen, but should it abort if it did?

> +
> +		frag = &skb_shinfo(skb)->frags[i];
> +		end = start + skb_frag_size(frag);
> +
> +		if (offset < end) {
> +			copy = end - offset;
> +			if (copy > len)
> +				copy = len;
> +
> +			off = offset - start;
> +			ret = zc_rx_recv_frag(ifq, frag, off, copy, sock_idx);
> +			if (ret < 0)
> +				goto out;
> +
> +			offset += ret;
> +			len -= ret;
> +			if (len == 0 || ret != copy)
> +				goto out;
> +		}
> +		start = end;
> +	}
> +
> +	skb_walk_frags(skb, frag_iter) {
> +		WARN_ON(start > offset + len);
> +
> +		end = start + frag_iter->len;
> +		if (offset < end) {
> +			copy = end - offset;
> +			if (copy > len)
> +				copy = len;
> +
> +			off = offset - start;
> +			ret = zc_rx_recv_skb(desc, frag_iter, off, copy);
> +			if (ret < 0)
> +				goto out;
> +
> +			offset += ret;
> +			len -= ret;
> +			if (len == 0 || ret != copy)
> +				goto out;
> +		}
> +		start = end;
> +	}
> +
> +out:
> +	smp_store_release(&ifq->ring->cq.tail, ifq->cached_cq_tail);
> +	if (offset == start_off)
> +		return ret;
> +	return offset - start_off;
> +}
> +
> +static int io_zc_rx_tcp_read(struct io_zc_rx_ifq *ifq, struct sock *sk)
> +{
> +	struct io_zc_rx_args args = {
> +		.ifq = ifq,
> +		.sock = sk->sk_socket,
> +	};
> +	read_descriptor_t rd_desc = {
> +		.count = 1,
> +		.arg.data = &args,
> +	};
> +
> +	return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb);
> +}
> +
> +static int io_zc_rx_tcp_recvmsg(struct io_zc_rx_ifq *ifq, struct sock *sk,
> +				unsigned int recv_limit,
> +				int flags, int *addr_len)
> +{
> +	size_t used;
> +	long timeo;
> +	int ret;
> +
> +	ret = used = 0;
> +
> +	lock_sock(sk);
> +
> +	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> +	while (recv_limit) {
> +		ret = io_zc_rx_tcp_read(ifq, sk);
> +		if (ret < 0)
> +			break;
> +		if (!ret) {
> +			if (used)
> +				break;
> +			if (sock_flag(sk, SOCK_DONE))
> +				break;
> +			if (sk->sk_err) {
> +				ret = sock_error(sk);
> +				break;
> +			}
> +			if (sk->sk_shutdown & RCV_SHUTDOWN)
> +				break;
> +			if (sk->sk_state == TCP_CLOSE) {
> +				ret = -ENOTCONN;
> +				break;
> +			}
> +			if (!timeo) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			if (!skb_queue_empty(&sk->sk_receive_queue))
> +				break;
> +			sk_wait_data(sk, &timeo, NULL);
> +			if (signal_pending(current)) {
> +				ret = sock_intr_errno(timeo);
> +				break;
> +			}
> +			continue;
> +		}
> +		recv_limit -= ret;
> +		used += ret;
> +
> +		if (!timeo)
> +			break;
> +		release_sock(sk);
> +		lock_sock(sk);
> +
> +		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> +		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
> +		    signal_pending(current))
> +			break;
> +	}
> +	release_sock(sk);
> +	/* TODO: handle timestamping */
> +	return used ? used : ret;
> +}
> +
> +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock,
> +		  unsigned int limit, unsigned int flags)
> +{
> +	struct sock *sk = sock->sk;
> +	const struct proto *prot;
> +	int addr_len = 0;
> +	int ret;
> +
> +	if (flags & MSG_ERRQUEUE)
> +		return -EOPNOTSUPP;
> +
> +	prot = READ_ONCE(sk->sk_prot);
> +	if (prot->recvmsg != tcp_recvmsg)
> +		return -EPROTONOSUPPORT;
> +
> +	sock_rps_record_flow(sk);
> +
> +	ret = io_zc_rx_tcp_recvmsg(ifq, sk, limit, flags, &addr_len);
> +
> +	return ret;
> +}


return io_zc_rx_tcp_recvmsg(ifq, sk, limit, flags, &addr_len);

and then you can remove 'int ret' as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-20 16:27   ` Jens Axboe
@ 2023-12-20 17:04     ` Pavel Begunkov
  2023-12-20 18:09       ` Jens Axboe
  0 siblings, 1 reply; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-20 17:04 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/20/23 16:27, Jens Axboe wrote:
> On 12/19/23 2:03 PM, David Wei wrote:
>> diff --git a/io_uring/net.c b/io_uring/net.c
>> index 454ba301ae6b..7a2aadf6962c 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>   	unsigned int cflags;
>>   
>>   	cflags = io_put_kbuf(req, issue_flags);
>> -	if (msg->msg_inq && msg->msg_inq != -1)
>> +	if (msg && msg->msg_inq && msg->msg_inq != -1)
>>   		cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>   
>>   	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>> @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>   			io_recv_prep_retry(req);
>>   			/* Known not-empty or unknown state, retry */
>>   			if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
>> -			    msg->msg_inq == -1)
>> +			    (msg && msg->msg_inq == -1))
>>   				return false;
>>   			if (issue_flags & IO_URING_F_MULTISHOT)
>>   				*ret = IOU_ISSUE_SKIP_COMPLETE;
> 
> These are a bit ugly, just pass in a dummy msg for this?
> 
>> +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> +	struct socket *sock;
>> +	unsigned flags;
>> +	int ret, min_ret = 0;
>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>> +	struct io_zc_rx_ifq *ifq;
> 
> Eg
> 	struct msghdr dummy_msg;
> 
> 	dummy_msg.msg_inq = -1;
> 
> which will eat some stack, but probably not really an issue.
> 
> 
>> +	if (issue_flags & IO_URING_F_UNLOCKED)
>> +		return -EAGAIN;
> 
> This seems odd, why? If we're called with IO_URING_F_UNLOCKED set, then

It's my addition, let me explain.

io_recvzc() -> io_zc_rx_recv() -> ... -> zc_rx_recv_frag()

This chain posts completions to a buffer completion queue, and
we don't want extra locking to share it with io-wq threads. In
some sense it's quite similar to the CQ locking, considering
we restrict zc to DEFER_TASKRUN. And doesn't change anything
anyway because multishot cannot post completions from io-wq
and are executed from the poll callback in task work.

> it's from io-wq. And returning -EAGAIN there will not do anything to

It will. It's supposed to just requeue for polling (it's not
IOPOLL to keep retrying -EAGAIN), just like multishots do.

Double checking the code, it can actually terminate the request,
which doesn't make much difference for us because multishots
should normally never end up in io-wq anyway, but I guess we
can improve it a liitle bit.

And it should also use IO_URING_F_IOWQ, forgot I split out
it from *F_UNLOCK.

> change that. Usually this check is done to lock if we don't have it
> already, eg with io_ring_submit_unlock(). Maybe I'm missing something
> here!
> 
>> @@ -590,5 +603,230 @@ const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
>>   };
>>   EXPORT_SYMBOL(io_uring_pp_zc_ops);
>>   
>> +static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq)
>> +{
>> +	struct io_uring_rbuf_cqe *cqe;
>> +	unsigned int cq_idx, queued, free, entries;
>> +	unsigned int mask = ifq->cq_entries - 1;
>> +
>> +	cq_idx = ifq->cached_cq_tail & mask;
>> +	smp_rmb();
>> +	queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries);
>> +	free = ifq->cq_entries - queued;
>> +	entries = min(free, ifq->cq_entries - cq_idx);
>> +	if (!entries)
>> +		return NULL;
>> +
>> +	cqe = &ifq->cqes[cq_idx];
>> +	ifq->cached_cq_tail++;
>> +	return cqe;
>> +}
> 
> smp_rmb() here needs a good comment on what the matching smp_wmb() is,
> and why it's needed. Or maybe it should be an smp_load_acquire()?
> 
>> +
>> +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
>> +			   int off, int len, unsigned sock_idx)
>> +{
>> +	off += skb_frag_off(frag);
>> +
>> +	if (likely(page_is_page_pool_iov(frag->bv_page))) {
>> +		struct io_uring_rbuf_cqe *cqe;
>> +		struct io_zc_rx_buf *buf;
>> +		struct page_pool_iov *ppiov;
>> +
>> +		ppiov = page_to_page_pool_iov(frag->bv_page);
>> +		if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX ||
>> +		    ppiov->pp->mp_priv != ifq)
>> +			return -EFAULT;
>> +
>> +		cqe = io_zc_get_rbuf_cqe(ifq);
>> +		if (!cqe)
>> +			return -ENOBUFS;
>> +
>> +		buf = io_iov_to_buf(ppiov);
>> +		io_zc_rx_get_buf_uref(buf);
>> +
>> +		cqe->region = 0;
>> +		cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
>> +		cqe->len = len;
>> +		cqe->sock = sock_idx;
>> +		cqe->flags = 0;
>> +	} else {
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	return len;
>> +}
> 
> I think this would read a lot better as:
> 
> 	if (unlikely(!page_is_page_pool_iov(frag->bv_page)))
> 		return -EOPNOTSUPP;

That's a bit of oracle coding, this branch is implemented in
a later patch.

> 
> 	...
> 	return len;
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-20 17:04     ` Pavel Begunkov
@ 2023-12-20 18:09       ` Jens Axboe
  2023-12-21 18:59         ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2023-12-20 18:09 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/20/23 10:04 AM, Pavel Begunkov wrote:
> On 12/20/23 16:27, Jens Axboe wrote:
>> On 12/19/23 2:03 PM, David Wei wrote:
>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>> index 454ba301ae6b..7a2aadf6962c 100644
>>> --- a/io_uring/net.c
>>> +++ b/io_uring/net.c
>>> @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>       unsigned int cflags;
>>>         cflags = io_put_kbuf(req, issue_flags);
>>> -    if (msg->msg_inq && msg->msg_inq != -1)
>>> +    if (msg && msg->msg_inq && msg->msg_inq != -1)
>>>           cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>>         if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>>> @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>               io_recv_prep_retry(req);
>>>               /* Known not-empty or unknown state, retry */
>>>               if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
>>> -                msg->msg_inq == -1)
>>> +                (msg && msg->msg_inq == -1))
>>>                   return false;
>>>               if (issue_flags & IO_URING_F_MULTISHOT)
>>>                   *ret = IOU_ISSUE_SKIP_COMPLETE;
>>
>> These are a bit ugly, just pass in a dummy msg for this?
>>
>>> +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>> +{
>>> +    struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> +    struct socket *sock;
>>> +    unsigned flags;
>>> +    int ret, min_ret = 0;
>>> +    bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>> +    struct io_zc_rx_ifq *ifq;
>>
>> Eg
>>     struct msghdr dummy_msg;
>>
>>     dummy_msg.msg_inq = -1;
>>
>> which will eat some stack, but probably not really an issue.
>>
>>
>>> +    if (issue_flags & IO_URING_F_UNLOCKED)
>>> +        return -EAGAIN;
>>
>> This seems odd, why? If we're called with IO_URING_F_UNLOCKED set, then
> 
> It's my addition, let me explain.
> 
> io_recvzc() -> io_zc_rx_recv() -> ... -> zc_rx_recv_frag()
> 
> This chain posts completions to a buffer completion queue, and
> we don't want extra locking to share it with io-wq threads. In
> some sense it's quite similar to the CQ locking, considering
> we restrict zc to DEFER_TASKRUN. And doesn't change anything
> anyway because multishot cannot post completions from io-wq
> and are executed from the poll callback in task work.
> 
>> it's from io-wq. And returning -EAGAIN there will not do anything to
> 
> It will. It's supposed to just requeue for polling (it's not
> IOPOLL to keep retrying -EAGAIN), just like multishots do.

It definitely needs a good comment, as it's highly non-obvious when
reading the code!

> Double checking the code, it can actually terminate the request,
> which doesn't make much difference for us because multishots
> should normally never end up in io-wq anyway, but I guess we
> can improve it a liitle bit.

Right, assumptions seems to be that -EAGAIN will lead to poll arm, which
seems a bit fragile.

> And it should also use IO_URING_F_IOWQ, forgot I split out
> it from *F_UNLOCK.

Yep, that'd be clearer.

>>> +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
>>> +               int off, int len, unsigned sock_idx)
>>> +{
>>> +    off += skb_frag_off(frag);
>>> +
>>> +    if (likely(page_is_page_pool_iov(frag->bv_page))) {
>>> +        struct io_uring_rbuf_cqe *cqe;
>>> +        struct io_zc_rx_buf *buf;
>>> +        struct page_pool_iov *ppiov;
>>> +
>>> +        ppiov = page_to_page_pool_iov(frag->bv_page);
>>> +        if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX ||
>>> +            ppiov->pp->mp_priv != ifq)
>>> +            return -EFAULT;
>>> +
>>> +        cqe = io_zc_get_rbuf_cqe(ifq);
>>> +        if (!cqe)
>>> +            return -ENOBUFS;
>>> +
>>> +        buf = io_iov_to_buf(ppiov);
>>> +        io_zc_rx_get_buf_uref(buf);
>>> +
>>> +        cqe->region = 0;
>>> +        cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
>>> +        cqe->len = len;
>>> +        cqe->sock = sock_idx;
>>> +        cqe->flags = 0;
>>> +    } else {
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +
>>> +    return len;
>>> +}
>>
>> I think this would read a lot better as:
>>
>>     if (unlikely(!page_is_page_pool_iov(frag->bv_page)))
>>         return -EOPNOTSUPP;
> 
> That's a bit of oracle coding, this branch is implemented in
> a later patch.

Oracle coding?

Each patch stands separately, there's no reason not to make this one as
clean as it can be. And an error case with the main bits inline is a lot
nicer imho than two separate indented parts. For the latter addition
instead of the -EOPNOTSUPP, would probably be nice to have it in a
separate function. Probably ditto for the page pool case here now, would
make the later patch simpler too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-20 16:13   ` Jens Axboe
  2023-12-20 16:23     ` Pavel Begunkov
@ 2023-12-21  1:44     ` David Wei
  1 sibling, 0 replies; 50+ messages in thread
From: David Wei @ 2023-12-21  1:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, netdev
  Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 2023-12-20 08:13, Jens Axboe wrote:
> On 12/19/23 2:03 PM, David Wei wrote:
>> @@ -750,6 +753,54 @@ enum {
>>  	SOCKET_URING_OP_SETSOCKOPT,
>>  };
>>  
>> +struct io_uring_rbuf_rqe {
>> +	__u32	off;
>> +	__u32	len;
>> +	__u16	region;
>> +	__u8	__pad[6];
>> +};
>> +
>> +struct io_uring_rbuf_cqe {
>> +	__u32	off;
>> +	__u32	len;
>> +	__u16	region;
>> +	__u8	sock;
>> +	__u8	flags;
>> +	__u8	__pad[2];
>> +};
> 
> Looks like this leaves a gap? Should be __pad[4] or probably just __u32
> __pad; For all of these, definitely worth thinking about if we'll ever
> need more than the slight padding. Might not hurt to always leave 8
> bytes extra, outside of the required padding.

Apologies, it's been a while since I last pahole'd these structs. We may
have added more fields later and reintroduced gaps.

> 
>> +struct io_rbuf_rqring_offsets {
>> +	__u32	head;
>> +	__u32	tail;
>> +	__u32	rqes;
>> +	__u8	__pad[4];
>> +};
> 
> Ditto here, __u32 __pad;
> 
>> +struct io_rbuf_cqring_offsets {
>> +	__u32	head;
>> +	__u32	tail;
>> +	__u32	cqes;
>> +	__u8	__pad[4];
>> +};
> 
> And here.
> 
>> +
>> +/*
>> + * Argument for IORING_REGISTER_ZC_RX_IFQ
>> + */
>> +struct io_uring_zc_rx_ifq_reg {
>> +	__u32	if_idx;
>> +	/* hw rx descriptor ring id */
>> +	__u32	if_rxq_id;
>> +	__u32	region_id;
>> +	__u32	rq_entries;
>> +	__u32	cq_entries;
>> +	__u32	flags;
>> +	__u16	cpu;
>> +
>> +	__u32	mmap_sz;
>> +	struct io_rbuf_rqring_offsets rq_off;
>> +	struct io_rbuf_cqring_offsets cq_off;
>> +};
> 
> You have rq_off starting at a 48-bit offset here, don't think this is
> going to work as it's uapi. You'd need padding to align it to 64-bits.

I will remove the io_rbuf_cqring in a future patchset which should
simplify things, but io_rbuf_rqring will stay. I'll make sure offsets
are 64-bit aligned.

> 
>> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
>> new file mode 100644
>> index 000000000000..5fc94cad5e3a
>> --- /dev/null
>> +++ b/io_uring/zc_rx.c
>> +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
>> +			  struct io_uring_zc_rx_ifq_reg __user *arg)
>> +{
>> +	struct io_uring_zc_rx_ifq_reg reg;
>> +	struct io_zc_rx_ifq *ifq;
>> +	int ret;
>> +
>> +	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
>> +		return -EINVAL;
>> +	if (copy_from_user(&reg, arg, sizeof(reg)))
>> +		return -EFAULT;
>> +	if (ctx->ifq)
>> +		return -EBUSY;
>> +	if (reg.if_rxq_id == -1)
>> +		return -EINVAL;
>> +
>> +	ifq = io_zc_rx_ifq_alloc(ctx);
>> +	if (!ifq)
>> +		return -ENOMEM;
>> +
>> +	/* TODO: initialise network interface */
>> +
>> +	ret = io_allocate_rbuf_ring(ifq, &reg);
>> +	if (ret)
>> +		goto err;
>> +
>> +	/* TODO: map zc region and initialise zc pool */
>> +
>> +	ifq->rq_entries = reg.rq_entries;
>> +	ifq->cq_entries = reg.cq_entries;
>> +	ifq->if_rxq_id = reg.if_rxq_id;
>> +	ctx->ifq = ifq;
> 
> As these TODO's are removed in later patches, I think you should just
> not include them to begin with. It reads more like notes to yourself,
> doesn't really add anything to the series.

Got it, will remove them.

> 
>> +void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
>> +{
>> +	lockdep_assert_held(&ctx->uring_lock);
>> +}
> 
> This is a bit odd?
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-19 21:03 ` [RFC PATCH v3 07/20] io_uring: add interface queue David Wei
  2023-12-20 16:13   ` Jens Axboe
@ 2023-12-21 17:57   ` Willem de Bruijn
  2023-12-30 16:25     ` Pavel Begunkov
  1 sibling, 1 reply; 50+ messages in thread
From: Willem de Bruijn @ 2023-12-21 17:57 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, magnus.karlsson, bjorn

David Wei wrote:
> From: David Wei <[email protected]>
> 
> This patch introduces a new object in io_uring called an interface queue
> (ifq) which contains:
> 
> * A pool region allocated by userspace and registered w/ io_uring where
>   Rx data is written to.
> * A net device and one specific Rx queue in it that will be configured
>   for ZC Rx.
> * A pair of shared ringbuffers w/ userspace, dubbed registered buf
>   (rbuf) rings. Each entry contains a pool region id and an offset + len
>   within that region. The kernel writes entries into the completion ring
>   to tell userspace where RX data is relative to the start of a region.
>   Userspace writes entries into the refill ring to tell the kernel when
>   it is done with the data.
> 
> For now, each io_uring instance has a single ifq, and each ifq has a
> single pool region associated with one Rx queue.
> 
> Add a new opcode to io_uring_register that sets up an ifq. Size and
> offsets of shared ringbuffers are returned to userspace for it to mmap.
> The implementation will be added in a later patch.
> 
> Signed-off-by: David Wei <[email protected]>

This is quite similar to AF_XDP, of course. Is it at all possible to
reuse all or some of that? If not, why not?

As a side effect, unification would also show a path of moving AF_XDP
from its custom allocator to the page_pool infra.

Related: what is the story wrt the process crashing while user memory
is posted to the NIC or present in the kernel stack.

SO_DEVMEM already demonstrates zerocopy into user buffers using usdma.
To a certain extent that and asyncronous I/O with iouring are two
independent goals. SO_DEVMEM imposes limitations on the stack because
it might hold opaque device mem. That is too strong for this case.

But for this iouring provider, is there anything ioring specific about
it beyond being user memory? If not, maybe just call it a umem
provider, and anticipate it being usable for AF_XDP in the future too?

Besides delivery up to the intended socket, packets may also end up
in other code paths, such as packet sockets or forwarding. All of
this is simpler with userspace backed buffers than with device mem.
But good to call out explicitly how this is handled. MSG_ZEROCOPY
makes a deep packet copy in unexpected code paths, for instance. To
avoid indefinite latency to buffer reclaim.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-20 18:09       ` Jens Axboe
@ 2023-12-21 18:59         ` Pavel Begunkov
  2023-12-21 21:32           ` Jens Axboe
  0 siblings, 1 reply; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-21 18:59 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/20/23 18:09, Jens Axboe wrote:
> On 12/20/23 10:04 AM, Pavel Begunkov wrote:
>> On 12/20/23 16:27, Jens Axboe wrote:
>>> On 12/19/23 2:03 PM, David Wei wrote:
>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>> index 454ba301ae6b..7a2aadf6962c 100644
>>>> --- a/io_uring/net.c
>>>> +++ b/io_uring/net.c
>>>> @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>>        unsigned int cflags;
>>>>          cflags = io_put_kbuf(req, issue_flags);
>>>> -    if (msg->msg_inq && msg->msg_inq != -1)
>>>> +    if (msg && msg->msg_inq && msg->msg_inq != -1)
>>>>            cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>>>          if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>>>> @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>>                io_recv_prep_retry(req);
>>>>                /* Known not-empty or unknown state, retry */
>>>>                if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
>>>> -                msg->msg_inq == -1)
>>>> +                (msg && msg->msg_inq == -1))
>>>>                    return false;
>>>>                if (issue_flags & IO_URING_F_MULTISHOT)
>>>>                    *ret = IOU_ISSUE_SKIP_COMPLETE;
>>>
>>> These are a bit ugly, just pass in a dummy msg for this?
>>>
>>>> +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> +    struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>> +    struct socket *sock;
>>>> +    unsigned flags;
>>>> +    int ret, min_ret = 0;
>>>> +    bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>>> +    struct io_zc_rx_ifq *ifq;
>>>
>>> Eg
>>>      struct msghdr dummy_msg;
>>>
>>>      dummy_msg.msg_inq = -1;
>>>
>>> which will eat some stack, but probably not really an issue.
>>>
>>>
>>>> +    if (issue_flags & IO_URING_F_UNLOCKED)
>>>> +        return -EAGAIN;
>>>
>>> This seems odd, why? If we're called with IO_URING_F_UNLOCKED set, then
>>
>> It's my addition, let me explain.
>>
>> io_recvzc() -> io_zc_rx_recv() -> ... -> zc_rx_recv_frag()
>>
>> This chain posts completions to a buffer completion queue, and
>> we don't want extra locking to share it with io-wq threads. In
>> some sense it's quite similar to the CQ locking, considering
>> we restrict zc to DEFER_TASKRUN. And doesn't change anything
>> anyway because multishot cannot post completions from io-wq
>> and are executed from the poll callback in task work.
>>
>>> it's from io-wq. And returning -EAGAIN there will not do anything to
>>
>> It will. It's supposed to just requeue for polling (it's not
>> IOPOLL to keep retrying -EAGAIN), just like multishots do.
> 
> It definitely needs a good comment, as it's highly non-obvious when
> reading the code!
> 
>> Double checking the code, it can actually terminate the request,
>> which doesn't make much difference for us because multishots
>> should normally never end up in io-wq anyway, but I guess we
>> can improve it a liitle bit.
> 
> Right, assumptions seems to be that -EAGAIN will lead to poll arm, which
> seems a bit fragile.

The main assumption is that io-wq will eventually leave the
request alone and push it somewhere else, either queuing for
polling or terminating, which is more than reasonable. I'd
add that it's rather insane for io-wq indefinitely spinning
on -EAGAIN, but it has long been fixed (for !IOPOLL).

As said, can be made a bit better, but it won't change anything
for real life execution, multishots would never end up there
after they start listening for poll events.

>> And it should also use IO_URING_F_IOWQ, forgot I split out
>> it from *F_UNLOCK.
> 
> Yep, that'd be clearer.

Not "clearer", but more correct. Even though it's not
a bug because deps between the flags.

>>>> +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
>>>> +               int off, int len, unsigned sock_idx)
>>>> +{
>>>> +    off += skb_frag_off(frag);
>>>> +
>>>> +    if (likely(page_is_page_pool_iov(frag->bv_page))) {
>>>> +        struct io_uring_rbuf_cqe *cqe;
>>>> +        struct io_zc_rx_buf *buf;
>>>> +        struct page_pool_iov *ppiov;
>>>> +
>>>> +        ppiov = page_to_page_pool_iov(frag->bv_page);
>>>> +        if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX ||
>>>> +            ppiov->pp->mp_priv != ifq)
>>>> +            return -EFAULT;
>>>> +
>>>> +        cqe = io_zc_get_rbuf_cqe(ifq);
>>>> +        if (!cqe)
>>>> +            return -ENOBUFS;
>>>> +
>>>> +        buf = io_iov_to_buf(ppiov);
>>>> +        io_zc_rx_get_buf_uref(buf);
>>>> +
>>>> +        cqe->region = 0;
>>>> +        cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
>>>> +        cqe->len = len;
>>>> +        cqe->sock = sock_idx;
>>>> +        cqe->flags = 0;
>>>> +    } else {
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +
>>>> +    return len;
>>>> +}
>>>
>>> I think this would read a lot better as:
>>>
>>>      if (unlikely(!page_is_page_pool_iov(frag->bv_page)))
>>>          return -EOPNOTSUPP;
>>
>> That's a bit of oracle coding, this branch is implemented in
>> a later patch.
> 
> Oracle coding?

I.e. knowing how later patches (should) look like.

> Each patch stands separately, there's no reason not to make this one as

They are not standalone, you cannot sanely develop anything not
thinking how and where it's used, otherwise you'd get a set of
functions full of sleeping but later used in irq context or just
unfittable into a desired framework. By extent, code often is
written while trying to look a step ahead. For example, first
patches don't push everything into io_uring.c just to wholesale
move it into zc_rx.c because of exceeding some size threshold.

> clean as it can be. And an error case with the main bits inline is a lot

I agree that it should be clean among all, but it _is_ clean
and readable, all else is stylistic nit picking. And maybe it's
just my opinion, but I also personally appreciate when a patch is
easy to review, which includes not restructuring all written before
with every patch, which also helps with back porting and other
developing aspects.

> nicer imho than two separate indented parts. For the latter addition
> instead of the -EOPNOTSUPP, would probably be nice to have it in a
> separate function. Probably ditto for the page pool case here now, would
> make the later patch simpler too.

If we'd need it in the future, we'll change it then, patches
stand separately, at least it's IMHO not needed in the current
series.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx
  2023-12-19 21:03 ` [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx David Wei
  2023-12-19 23:44   ` Mina Almasry
@ 2023-12-21 19:36   ` Pavel Begunkov
  1 sibling, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-21 19:36 UTC (permalink / raw)
  To: David Wei, io-uring, netdev
  Cc: Jens Axboe, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/19/23 21:03, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
> 
> We're adding a new pp memory provider to implement io_uring zerocopy
> receive. It'll be "registered" in pp and used in later paches.
> 
> The typical life cycle of a buffer goes as follows: first it's allocated
> to a driver with the initial refcount set to 1. The drivers fills it
> with data, puts it into an skb and passes down the stack, where it gets
> queued up to a socket. Later, a zc io_uring request will be receiving
> data from the socket from a task context. At that point io_uring will
> tell the userspace that this buffer has some data by posting an
> appropriate completion. It'll also elevating the refcount by
> IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading
> the data. When the userspace is done with the buffer it should return it
> back to io_uring by adding an entry to the buffer refill ring. When
> necessary io_uring will poll the refill ring, compare references
> including IO_ZC_RX_UREF and reuse the buffer.
> 
> Initally, all buffers are placed in a spinlock protected ->freelist.
> It's a slow path stash, where buffers are considered to be unallocated
> and not exposed to core page pool. On allocation, pp will first try
> all its caches, and the ->alloc_pages callback if everything else
> failed.
> 
> The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill
> ring. The consumption from the ring is always done in the attached napi
> context, so no additional synchronisation required. If that fails we'll
> be getting buffers from the ->freelist.
> 
> Note: only ->freelist are considered unallocated for page pool, so we
> only add pages_state_hold_cnt when allocating from there. Subsequently,
> as page_pool_return_page() and others bump the ->pages_state_release_cnt
> counter, io_pp_zc_release_page() can only use ->freelist, which is not a
> problem as it's not a slow path.
> 
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
...
> +static void io_zc_rx_ring_refill(struct page_pool *pp,
> +				 struct io_zc_rx_ifq *ifq)
> +{
> +	unsigned int entries = io_zc_rx_rqring_entries(ifq);
> +	unsigned int mask = ifq->rq_entries - 1;
> +	struct io_zc_rx_pool *pool = ifq->pool;
> +
> +	if (unlikely(!entries))
> +		return;
> +
> +	while (entries--) {
> +		unsigned int rq_idx = ifq->cached_rq_head++ & mask;
> +		struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx];
> +		u32 pgid = rqe->off / PAGE_SIZE;
> +		struct io_zc_rx_buf *buf = &pool->bufs[pgid];
> +
> +		if (!io_zc_rx_put_buf_uref(buf))
> +			continue;

It's worth to note that here we have to add a dma sync as per
discussions with page pool folks.

> +		io_zc_add_pp_cache(pp, buf);
> +		if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL)
> +			break;
> +	}
> +	smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head);
> +}
> +
> +static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq)
> +{
> +	struct io_zc_rx_pool *pool = ifq->pool;
> +
> +	spin_lock_bh(&pool->freelist_lock);
> +	while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
> +		struct io_zc_rx_buf *buf;
> +		u32 pgid;
> +
> +		pgid = pool->freelist[--pool->free_count];
> +		buf = &pool->bufs[pgid];
> +
> +		io_zc_add_pp_cache(pp, buf);
> +		pp->pages_state_hold_cnt++;
> +		trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf),
> +					   pp->pages_state_hold_cnt);
> +	}
> +	spin_unlock_bh(&pool->freelist_lock);
> +}
...

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get
  2023-12-19 21:03 ` [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get David Wei
@ 2023-12-21 19:51   ` Mina Almasry
  0 siblings, 0 replies; 50+ messages in thread
From: Mina Almasry @ 2023-12-21 19:51 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> io_uring pp memory provider doesn't have contiguous dma addresses,
> implement page_pool_iov_dma_addr() via callbacks.
>
> Note: it might be better to stash dma address into struct page_pool_iov.
>

This is the approach already taken in v1 & RFC v5. I suspect you'd be
able to take advantage when you rebase.

> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/net/page_pool/helpers.h | 5 +----
>  include/net/page_pool/types.h   | 2 ++
>  io_uring/zc_rx.c                | 8 ++++++++
>  net/core/page_pool.c            | 9 +++++++++
>  4 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index aca3a52d0e22..10dba1f2aa0c 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -105,10 +105,7 @@ static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov)
>  static inline dma_addr_t
>  page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
>  {
> -       struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
> -
> -       return owner->base_dma_addr +
> -              ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
> +       return ppiov->pp->mp_ops->ppiov_dma_addr(ppiov);
>  }
>
>  static inline unsigned long
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index f54ee759e362..1b9266835ab6 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -125,6 +125,7 @@ struct page_pool_stats {
>  #endif
>
>  struct mem_provider;
> +struct page_pool_iov;
>
>  enum pp_memory_provider_type {
>         __PP_MP_NONE, /* Use system allocator directly */
> @@ -138,6 +139,7 @@ struct pp_memory_provider_ops {
>         void (*scrub)(struct page_pool *pool);
>         struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
>         bool (*release_page)(struct page_pool *pool, struct page *page);
> +       dma_addr_t (*ppiov_dma_addr)(const struct page_pool_iov *ppiov);
>  };
>
>  extern const struct pp_memory_provider_ops dmabuf_devmem_ops;
> diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
> index f7d99d569885..20fb89e6bad7 100644
> --- a/io_uring/zc_rx.c
> +++ b/io_uring/zc_rx.c
> @@ -600,12 +600,20 @@ static void io_pp_zc_destroy(struct page_pool *pp)
>         percpu_ref_put(&ifq->ctx->refs);
>  }
>
> +static dma_addr_t io_pp_zc_ppiov_dma_addr(const struct page_pool_iov *ppiov)
> +{
> +       struct io_zc_rx_buf *buf = io_iov_to_buf((struct page_pool_iov *)ppiov);
> +
> +       return buf->dma;
> +}
> +
>  const struct pp_memory_provider_ops io_uring_pp_zc_ops = {
>         .alloc_pages            = io_pp_zc_alloc_pages,
>         .release_page           = io_pp_zc_release_page,
>         .init                   = io_pp_zc_init,
>         .destroy                = io_pp_zc_destroy,
>         .scrub                  = io_pp_zc_scrub,
> +       .ppiov_dma_addr         = io_pp_zc_ppiov_dma_addr,
>  };
>  EXPORT_SYMBOL(io_uring_pp_zc_ops);
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ebf5ff009d9d..6586631ecc2e 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -1105,10 +1105,19 @@ static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
>         return true;
>  }
>
> +static dma_addr_t mp_dmabuf_devmem_ppiov_dma_addr(const struct page_pool_iov *ppiov)
> +{
> +       struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
> +
> +       return owner->base_dma_addr +
> +              ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
> +}
> +
>  const struct pp_memory_provider_ops dmabuf_devmem_ops = {
>         .init                   = mp_dmabuf_devmem_init,
>         .destroy                = mp_dmabuf_devmem_destroy,
>         .alloc_pages            = mp_dmabuf_devmem_alloc_pages,
>         .release_page           = mp_dmabuf_devmem_release_page,
> +       .ppiov_dma_addr         = mp_dmabuf_devmem_ppiov_dma_addr,
>  };
>  EXPORT_SYMBOL(dmabuf_devmem_ops);
> --
> 2.39.3
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-21 18:59         ` Pavel Begunkov
@ 2023-12-21 21:32           ` Jens Axboe
  2023-12-30 21:15             ` Pavel Begunkov
  0 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2023-12-21 21:32 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/21/23 11:59 AM, Pavel Begunkov wrote:
> On 12/20/23 18:09, Jens Axboe wrote:
>> On 12/20/23 10:04 AM, Pavel Begunkov wrote:
>>> On 12/20/23 16:27, Jens Axboe wrote:
>>>> On 12/19/23 2:03 PM, David Wei wrote:
>>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>>> index 454ba301ae6b..7a2aadf6962c 100644
>>>>> --- a/io_uring/net.c
>>>>> +++ b/io_uring/net.c
>>>>> @@ -637,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>>>        unsigned int cflags;
>>>>>          cflags = io_put_kbuf(req, issue_flags);
>>>>> -    if (msg->msg_inq && msg->msg_inq != -1)
>>>>> +    if (msg && msg->msg_inq && msg->msg_inq != -1)
>>>>>            cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>>>>          if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>>>>> @@ -652,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>>>                io_recv_prep_retry(req);
>>>>>                /* Known not-empty or unknown state, retry */
>>>>>                if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
>>>>> -                msg->msg_inq == -1)
>>>>> +                (msg && msg->msg_inq == -1))
>>>>>                    return false;
>>>>>                if (issue_flags & IO_URING_F_MULTISHOT)
>>>>>                    *ret = IOU_ISSUE_SKIP_COMPLETE;
>>>>
>>>> These are a bit ugly, just pass in a dummy msg for this?
>>>>
>>>>> +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +{
>>>>> +    struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>>> +    struct socket *sock;
>>>>> +    unsigned flags;
>>>>> +    int ret, min_ret = 0;
>>>>> +    bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>>>> +    struct io_zc_rx_ifq *ifq;
>>>>
>>>> Eg
>>>>      struct msghdr dummy_msg;
>>>>
>>>>      dummy_msg.msg_inq = -1;
>>>>
>>>> which will eat some stack, but probably not really an issue.
>>>>
>>>>
>>>>> +    if (issue_flags & IO_URING_F_UNLOCKED)
>>>>> +        return -EAGAIN;
>>>>
>>>> This seems odd, why? If we're called with IO_URING_F_UNLOCKED set, then
>>>
>>> It's my addition, let me explain.
>>>
>>> io_recvzc() -> io_zc_rx_recv() -> ... -> zc_rx_recv_frag()
>>>
>>> This chain posts completions to a buffer completion queue, and
>>> we don't want extra locking to share it with io-wq threads. In
>>> some sense it's quite similar to the CQ locking, considering
>>> we restrict zc to DEFER_TASKRUN. And doesn't change anything
>>> anyway because multishot cannot post completions from io-wq
>>> and are executed from the poll callback in task work.
>>>
>>>> it's from io-wq. And returning -EAGAIN there will not do anything to
>>>
>>> It will. It's supposed to just requeue for polling (it's not
>>> IOPOLL to keep retrying -EAGAIN), just like multishots do.
>>
>> It definitely needs a good comment, as it's highly non-obvious when
>> reading the code!
>>
>>> Double checking the code, it can actually terminate the request,
>>> which doesn't make much difference for us because multishots
>>> should normally never end up in io-wq anyway, but I guess we
>>> can improve it a liitle bit.
>>
>> Right, assumptions seems to be that -EAGAIN will lead to poll arm, which
>> seems a bit fragile.
> 
> The main assumption is that io-wq will eventually leave the
> request alone and push it somewhere else, either queuing for
> polling or terminating, which is more than reasonable. I'd

But surely you don't want it terminated from here? It seems like a very
odd choice. As it stands, if you end up doing more than one loop, then
it won't arm poll and it'll get failed.

> add that it's rather insane for io-wq indefinitely spinning
> on -EAGAIN, but it has long been fixed (for !IOPOLL).

There's no other choice for polling, and it doesn't do it for
non-polling. The current logic makes sense - if you do a blocking
attempt and you get -EAGAIN, that's really the final result and you
cannot sanely retry for !IOPOLL in that case. Before we did poll arm for
io-wq, even the first -EAGAIN would've terminated it. Relying on -EAGAIN
from a blocking attempt to do anything but fail the request with -EAGAIN
res is pretty fragile and odd, I think that needs sorting out.

> As said, can be made a bit better, but it won't change anything
> for real life execution, multishots would never end up there
> after they start listening for poll events.

Right, probably won't ever be a thing for !multishot. As mentioned in my
original reply, it really just needs a comment explaining exactly what
it's doing and why it's fine.

>>> And it should also use IO_URING_F_IOWQ, forgot I split out
>>> it from *F_UNLOCK.
>>
>> Yep, that'd be clearer.
> 
> Not "clearer", but more correct. Even though it's not
> a bug because deps between the flags.

Both clearer and more correct, I would say.

>>>>> +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag,
>>>>> +               int off, int len, unsigned sock_idx)
>>>>> +{
>>>>> +    off += skb_frag_off(frag);
>>>>> +
>>>>> +    if (likely(page_is_page_pool_iov(frag->bv_page))) {
>>>>> +        struct io_uring_rbuf_cqe *cqe;
>>>>> +        struct io_zc_rx_buf *buf;
>>>>> +        struct page_pool_iov *ppiov;
>>>>> +
>>>>> +        ppiov = page_to_page_pool_iov(frag->bv_page);
>>>>> +        if (ppiov->pp->p.memory_provider != PP_MP_IOU_ZCRX ||
>>>>> +            ppiov->pp->mp_priv != ifq)
>>>>> +            return -EFAULT;
>>>>> +
>>>>> +        cqe = io_zc_get_rbuf_cqe(ifq);
>>>>> +        if (!cqe)
>>>>> +            return -ENOBUFS;
>>>>> +
>>>>> +        buf = io_iov_to_buf(ppiov);
>>>>> +        io_zc_rx_get_buf_uref(buf);
>>>>> +
>>>>> +        cqe->region = 0;
>>>>> +        cqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
>>>>> +        cqe->len = len;
>>>>> +        cqe->sock = sock_idx;
>>>>> +        cqe->flags = 0;
>>>>> +    } else {
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +
>>>>> +    return len;
>>>>> +}
>>>>
>>>> I think this would read a lot better as:
>>>>
>>>>      if (unlikely(!page_is_page_pool_iov(frag->bv_page)))
>>>>          return -EOPNOTSUPP;
>>>
>>> That's a bit of oracle coding, this branch is implemented in
>>> a later patch.
>>
>> Oracle coding?
> 
> I.e. knowing how later patches (should) look like.
> 
>> Each patch stands separately, there's no reason not to make this one as
> 
> They are not standalone, you cannot sanely develop anything not
> thinking how and where it's used, otherwise you'd get a set of
> functions full of sleeping but later used in irq context or just
> unfittable into a desired framework. By extent, code often is
> written while trying to look a step ahead. For example, first
> patches don't push everything into io_uring.c just to wholesale
> move it into zc_rx.c because of exceeding some size threshold.

Yes, this is how most patch series are - they will compile separately,
but obviously won't necessarily make sense or be functional until you
get to the end. But since you very much do have future knowledge in
these patches, there's no excuse for not making them interact with each
other better. Each patch should not pretend it doesn't know what comes
next in a series, if you can make a followup patch simpler with a tweak
to a previous patch, that is definitely a good idea.

And here, even the end result would be better imho without having

if (a) {
	big blob of stuff
} else {
	other blob of stuff
}

when it could just be

if (a)
	return big_blog_of_stuff();

return other_blog_of_stuff();

instead.

>> clean as it can be. And an error case with the main bits inline is a lot
> 
> I agree that it should be clean among all, but it _is_ clean
> and readable, all else is stylistic nit picking. And maybe it's
> just my opinion, but I also personally appreciate when a patch is
> easy to review, which includes not restructuring all written before
> with every patch, which also helps with back porting and other
> developing aspects.

But that's basically my point, it even makes followup patches simpler to
read as well. Is it stylistic? Certainly, I just prefer having the above
rather than two big indentations. But it also makes the followup patch
simpler and it's basically a one-liner change at that point, and a
bigger hunk of added code that's the new function that handles the new
case.

>> nicer imho than two separate indented parts. For the latter addition
>> instead of the -EOPNOTSUPP, would probably be nice to have it in a
>> separate function. Probably ditto for the page pool case here now, would
>> make the later patch simpler too.
> 
> If we'd need it in the future, we'll change it then, patches
> stand separately, at least it's IMHO not needed in the current
> series.

It's still an RFC series, please do change it for v4.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-21 17:57   ` Willem de Bruijn
@ 2023-12-30 16:25     ` Pavel Begunkov
  2023-12-31 22:25       ` Willem de Bruijn
  0 siblings, 1 reply; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-30 16:25 UTC (permalink / raw)
  To: Willem de Bruijn, David Wei, io-uring, netdev
  Cc: Jens Axboe, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry,
	magnus.karlsson, bjorn

On 12/21/23 17:57, Willem de Bruijn wrote:
> David Wei wrote:
>> From: David Wei <[email protected]>
>>
>> This patch introduces a new object in io_uring called an interface queue
>> (ifq) which contains:
>>
>> * A pool region allocated by userspace and registered w/ io_uring where
>>    Rx data is written to.
>> * A net device and one specific Rx queue in it that will be configured
>>    for ZC Rx.
>> * A pair of shared ringbuffers w/ userspace, dubbed registered buf
>>    (rbuf) rings. Each entry contains a pool region id and an offset + len
>>    within that region. The kernel writes entries into the completion ring
>>    to tell userspace where RX data is relative to the start of a region.
>>    Userspace writes entries into the refill ring to tell the kernel when
>>    it is done with the data.
>>
>> For now, each io_uring instance has a single ifq, and each ifq has a
>> single pool region associated with one Rx queue.
>>
>> Add a new opcode to io_uring_register that sets up an ifq. Size and
>> offsets of shared ringbuffers are returned to userspace for it to mmap.
>> The implementation will be added in a later patch.
>>
>> Signed-off-by: David Wei <[email protected]>
> 
> This is quite similar to AF_XDP, of course. Is it at all possible to
> reuse all or some of that? If not, why not?

Let me rather ask what do you have in mind for reuse? I'm not too
intimately familiar with xdp, but I don't see what we can take.

Queue formats will be different, there won't be a separate CQ
for zc all they will lend in the main io_uring CQ in next revisions.
io_uring also supports multiple sockets per zc ifq and other quirks
reflected in the uapi.

Receive has to work with generic sockets and skbs if we want
to be able to reuse the protocol stack. Queue allocation and
mapping is similar but that one thing that should be bound to
the API (i.e. io_uring vs af xdp) together with locking and
synchronisation. Wakeups are different as well.

And IIUC AF_XDP is still operates with raw packets quite early
in the stack, while io_uring completes from a syscall, that
would definitely make sync diverging a lot.

I don't see many opportunities here.

> As a side effect, unification would also show a path of moving AF_XDP
> from its custom allocator to the page_pool infra.

I assume it's about xsk_buff_alloc() and likes of it. I'm lacking
here, I it's much better to ask XDP guys what they think about
moving to pp, whether it's needed, etc. And if so, it'd likely
be easier to base it on raw page pool providers api than the io_uring
provider implementation, probably having some common helpers if
things come to that.

> Related: what is the story wrt the process crashing while user memory
> is posted to the NIC or present in the kernel stack.

Buffers are pinned by io_uring. If the process crashes closing the
ring, io_uring will release the pp provider and wait for all buffer
to come back before unpinning pages and freeing the rest. I.e.
it's not going to unpin before pp's ->destroy is called.

> SO_DEVMEM already demonstrates zerocopy into user buffers using usdma.
> To a certain extent that and asyncronous I/O with iouring are two
> independent goals. SO_DEVMEM imposes limitations on the stack because
> it might hold opaque device mem. That is too strong for this case.

Basing it onto ppiov simplifies refcounting a lot, with that we
don't need any dirty hacks nor adding any extra changes in the stack,
and I think it's aligned with the net stack goals. What I think
we can do on top is allowing ppiov's to optionally have pages
(via a callback ->get_page), and use it it in those rare cases
when someone has to peek at the payload.

> But for this iouring provider, is there anything ioring specific about
> it beyond being user memory? If not, maybe just call it a umem
> provider, and anticipate it being usable for AF_XDP in the future too?

Queue formats with a set of features, synchronisation, mostly
answered above, but I also think it should as easy to just have
a separate provider and reuse some code later if there is anything
to reuse.

> Besides delivery up to the intended socket, packets may also end up
> in other code paths, such as packet sockets or forwarding. All of
> this is simpler with userspace backed buffers than with device mem.
> But good to call out explicitly how this is handled. MSG_ZEROCOPY
> makes a deep packet copy in unexpected code paths, for instance. To
> avoid indefinite latency to buffer reclaim.

Yeah, that's concerning, I intend to add something for the sockets
we used, but there is nothing for truly unexpected paths. How devmem
handles it?

It's probably not a huge worry for now, I expect killing the
task/sockets should resolve dependencies, but would be great to find
such scenarios. I'd appreciate any pointers if you have some in mind.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 15/20] io_uring: add io_recvzc request
  2023-12-21 21:32           ` Jens Axboe
@ 2023-12-30 21:15             ` Pavel Begunkov
  0 siblings, 0 replies; 50+ messages in thread
From: Pavel Begunkov @ 2023-12-30 21:15 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring, netdev
  Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry

On 12/21/23 21:32, Jens Axboe wrote:
> On 12/21/23 11:59 AM, Pavel Begunkov wrote:
>> On 12/20/23 18:09, Jens Axboe wrote:
>>> On 12/20/23 10:04 AM, Pavel Begunkov wrote:
>>>> On 12/20/23 16:27, Jens Axboe wrote:
>>>>> On 12/19/23 2:03 PM, David Wei wrote:
>>>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>>>> index 454ba301ae6b..7a2aadf6962c 100644
>>>>>> --- a/io_uring/net.c
>>>>>> +++ b/io_uring/net.c
[...]
>>>>>> +    if (issue_flags & IO_URING_F_UNLOCKED)
>>>>>> +        return -EAGAIN;
>>>>>
>>>>> This seems odd, why? If we're called with IO_URING_F_UNLOCKED set, then
>>>>
>>>> It's my addition, let me explain.
>>>>
>>>> io_recvzc() -> io_zc_rx_recv() -> ... -> zc_rx_recv_frag()
>>>>
>>>> This chain posts completions to a buffer completion queue, and
>>>> we don't want extra locking to share it with io-wq threads. In
>>>> some sense it's quite similar to the CQ locking, considering
>>>> we restrict zc to DEFER_TASKRUN. And doesn't change anything
>>>> anyway because multishot cannot post completions from io-wq
>>>> and are executed from the poll callback in task work.
>>>>
>>>>> it's from io-wq. And returning -EAGAIN there will not do anything to
>>>>
>>>> It will. It's supposed to just requeue for polling (it's not
>>>> IOPOLL to keep retrying -EAGAIN), just like multishots do.
>>>
>>> It definitely needs a good comment, as it's highly non-obvious when
>>> reading the code!
>>>
>>>> Double checking the code, it can actually terminate the request,
>>>> which doesn't make much difference for us because multishots
>>>> should normally never end up in io-wq anyway, but I guess we
>>>> can improve it a liitle bit.
>>>
>>> Right, assumptions seems to be that -EAGAIN will lead to poll arm, which
>>> seems a bit fragile.
>>
>> The main assumption is that io-wq will eventually leave the
>> request alone and push it somewhere else, either queuing for
>> polling or terminating, which is more than reasonable. I'd
> 
> But surely you don't want it terminated from here? It seems like a very
> odd choice. As it stands, if you end up doing more than one loop, then
> it won't arm poll and it'll get failed.
>> add that it's rather insane for io-wq indefinitely spinning
>> on -EAGAIN, but it has long been fixed (for !IOPOLL).
> 
> There's no other choice for polling, and it doesn't do it for

zc rx is !IOPOLL, that's what I care about.

> non-polling. The current logic makes sense - if you do a blocking
> attempt and you get -EAGAIN, that's really the final result and you
> cannot sanely retry for !IOPOLL in that case. Before we did poll arm for
> io-wq, even the first -EAGAIN would've terminated it. Relying on -EAGAIN
> from a blocking attempt to do anything but fail the request with -EAGAIN
> res is pretty fragile and odd, I think that needs sorting out.
> 
>> As said, can be made a bit better, but it won't change anything
>> for real life execution, multishots would never end up there
>> after they start listening for poll events.
> 
> Right, probably won't ever be a thing for !multishot. As mentioned in my
> original reply, it really just needs a comment explaining exactly what
> it's doing and why it's fine.
> 
>>>> And it should also use IO_URING_F_IOWQ, forgot I split out
>>>> it from *F_UNLOCK.
>>>
>>> Yep, that'd be clearer.
>>
>> Not "clearer", but more correct. Even though it's not
>> a bug because deps between the flags.
> 
> Both clearer and more correct, I would say.
> 
[...]
>>>
>>> Oracle coding?
>>
>> I.e. knowing how later patches (should) look like.
>>
>>> Each patch stands separately, there's no reason not to make this one as
>>
>> They are not standalone, you cannot sanely develop anything not
>> thinking how and where it's used, otherwise you'd get a set of
>> functions full of sleeping but later used in irq context or just
>> unfittable into a desired framework. By extent, code often is
>> written while trying to look a step ahead. For example, first
>> patches don't push everything into io_uring.c just to wholesale
>> move it into zc_rx.c because of exceeding some size threshold.
> 
> Yes, this is how most patch series are - they will compile separately,
> but obviously won't necessarily make sense or be functional until you
> get to the end. But since you very much do have future knowledge in
> these patches, there's no excuse for not making them interact with each
> other better. Each patch should not pretend it doesn't know what comes

Which is exactly the reason why it is how it is.

> next in a series, if you can make a followup patch simpler with a tweak
> to a previous patch, that is definitely a good idea.
> 
> And here, even the end result would be better imho without having
> 
> if (a) {
> 	big blob of stuff
> } else {
> 	other blob of stuff
> }
> 
> when it could just be
> 
> if (a)
> 	return big_blog_of_stuff();
> 
> return other_blog_of_stuff();
> 
> instead.

That sounds like a good general advice, but the "blobs" are
not big nor expected to grow to require splitting, I can't say
it makes it any cleaner or simpler.

>>> clean as it can be. And an error case with the main bits inline is a lot
>>
>> I agree that it should be clean among all, but it _is_ clean
>> and readable, all else is stylistic nit picking. And maybe it's
>> just my opinion, but I also personally appreciate when a patch is
>> easy to review, which includes not restructuring all written before
>> with every patch, which also helps with back porting and other
>> developing aspects.
> 
> But that's basically my point, it even makes followup patches simpler to
> read as well. Is it stylistic? Certainly, I just prefer having the above
> rather than two big indentations. But it also makes the followup patch
> simpler
> and it's basically a one-liner change at that point, and a
> bigger hunk of added code that's the new function that handles the new
> case.
> 
>>> nicer imho than two separate indented parts. For the latter addition
>>> instead of the -EOPNOTSUPP, would probably be nice to have it in a
>>> separate function. Probably ditto for the page pool case here now, would
>>> make the later patch simpler too.
>>
>> If we'd need it in the future, we'll change it then, patches
>> stand separately, at least it's IMHO not needed in the current
>> series.
> 
> It's still an RFC series, please do change it for v4.

It's not my patch, but I don't view it as moving the patches in
any positive direction.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 07/20] io_uring: add interface queue
  2023-12-30 16:25     ` Pavel Begunkov
@ 2023-12-31 22:25       ` Willem de Bruijn
  0 siblings, 0 replies; 50+ messages in thread
From: Willem de Bruijn @ 2023-12-31 22:25 UTC (permalink / raw)
  To: Pavel Begunkov, Willem de Bruijn, David Wei, io-uring, netdev
  Cc: Jens Axboe, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry,
	magnus.karlsson, bjorn

Pavel Begunkov wrote:
> On 12/21/23 17:57, Willem de Bruijn wrote:
> > David Wei wrote:
> >> From: David Wei <[email protected]>
> >>
> >> This patch introduces a new object in io_uring called an interface queue
> >> (ifq) which contains:
> >>
> >> * A pool region allocated by userspace and registered w/ io_uring where
> >>    Rx data is written to.
> >> * A net device and one specific Rx queue in it that will be configured
> >>    for ZC Rx.
> >> * A pair of shared ringbuffers w/ userspace, dubbed registered buf
> >>    (rbuf) rings. Each entry contains a pool region id and an offset + len
> >>    within that region. The kernel writes entries into the completion ring
> >>    to tell userspace where RX data is relative to the start of a region.
> >>    Userspace writes entries into the refill ring to tell the kernel when
> >>    it is done with the data.
> >>
> >> For now, each io_uring instance has a single ifq, and each ifq has a
> >> single pool region associated with one Rx queue.
> >>
> >> Add a new opcode to io_uring_register that sets up an ifq. Size and
> >> offsets of shared ringbuffers are returned to userspace for it to mmap.
> >> The implementation will be added in a later patch.
> >>
> >> Signed-off-by: David Wei <[email protected]>
> > 
> > This is quite similar to AF_XDP, of course. Is it at all possible to
> > reuse all or some of that? If not, why not?
> 
> Let me rather ask what do you have in mind for reuse? I'm not too
> intimately familiar with xdp, but I don't see what we can take.

At a high level all points in this commit message:

	* A pool region allocated by userspace and registered w/ io_uring where
	  Rx data is written to.
	* A net device and one specific Rx queue in it that will be configured
	  for ZC Rx.
	* A pair of shared ringbuffers w/ userspace, dubbed registered buf
	  (rbuf) rings. Each entry contains a pool region id and an offset + len
	  within that region. The kernel writes entries into the completion ring
	  to tell userspace where RX data is relative to the start of a region.
	  Userspace writes entries into the refill ring to tell the kernel when
	  it is done with the data.

	For now, each io_uring instance has a single ifq, and each ifq has a
	single pool region associated with one Rx queue.

AF_XDP allows shared pools, but otherwise this sounds like the same
feature set.

> Queue formats will be different

I'd like to makes sure that this is for a reason. Not just divergence
because we did not consider reusing existing user/kernel queue formats.

> there won't be a separate CQ
> for zc all they will lend in the main io_uring CQ in next revisions.

Okay, that's different.

> io_uring also supports multiple sockets per zc ifq and other quirks
> reflected in the uapi.
> 
> Receive has to work with generic sockets and skbs if we want
> to be able to reuse the protocol stack. Queue allocation and
> mapping is similar but that one thing that should be bound to
> the API (i.e. io_uring vs af xdp) together with locking and
> synchronisation. Wakeups are different as well.
> 
> And IIUC AF_XDP is still operates with raw packets quite early
> in the stack, while io_uring completes from a syscall, that
> would definitely make sync diverging a lot.

The difference is in frame payload, not in the queue structure:
a fixed frame buffer pool plus sets of post + completion queues that
store a relative offset and length into that pool.

I don't intend to ask for the impossible, to be extra clear: If there
are reasons the structures need to be different, so be it. And no
intention to complicate development. Anything not ABI can be
refactored later, too, if overlap becomes clear. But for ABI it's
worth asking now whether these queue formats really are different for
a concrete reason.

> I don't see many opportunities here.
> 
> > As a side effect, unification would also show a path of moving AF_XDP
> > from its custom allocator to the page_pool infra.
> 
> I assume it's about xsk_buff_alloc() and likes of it. I'm lacking
> here, I it's much better to ask XDP guys what they think about
> moving to pp, whether it's needed, etc. And if so, it'd likely
> be easier to base it on raw page pool providers api than the io_uring
> provider implementation, probably having some common helpers if
> things come to that.

Fair enough, on giving it some more thought and reviewing a recent
use case of the AF_XDP allocation APIs including xsk_buff_alloc.

> 
> > Related: what is the story wrt the process crashing while user memory
> > is posted to the NIC or present in the kernel stack.
> 
> Buffers are pinned by io_uring. If the process crashes closing the
> ring, io_uring will release the pp provider and wait for all buffer
> to come back before unpinning pages and freeing the rest. I.e.
> it's not going to unpin before pp's ->destroy is called.

Great. That's how all page pools work iirc. There is some potential
concern with unbound delay until all buffers are recycled. But that
is not unique to the io_uring provider.

> > SO_DEVMEM already demonstrates zerocopy into user buffers using usdma.
> > To a certain extent that and asyncronous I/O with iouring are two
> > independent goals. SO_DEVMEM imposes limitations on the stack because
> > it might hold opaque device mem. That is too strong for this case.
> 
> Basing it onto ppiov simplifies refcounting a lot, with that we
> don't need any dirty hacks nor adding any extra changes in the stack,
> and I think it's aligned with the net stack goals.

Great to hear.

> What I think
> we can do on top is allowing ppiov's to optionally have pages
> (via a callback ->get_page), and use it it in those rare cases
> when someone has to peek at the payload.
> 
> > But for this iouring provider, is there anything ioring specific about
> > it beyond being user memory? If not, maybe just call it a umem
> > provider, and anticipate it being usable for AF_XDP in the future too?
> 
> Queue formats with a set of features, synchronisation, mostly
> answered above, but I also think it should as easy to just have
> a separate provider and reuse some code later if there is anything
> to reuse.
> 
> > Besides delivery up to the intended socket, packets may also end up
> > in other code paths, such as packet sockets or forwarding. All of
> > this is simpler with userspace backed buffers than with device mem.
> > But good to call out explicitly how this is handled. MSG_ZEROCOPY
> > makes a deep packet copy in unexpected code paths, for instance. To
> > avoid indefinite latency to buffer reclaim.
> 
> Yeah, that's concerning, I intend to add something for the sockets
> we used, but there is nothing for truly unexpected paths. How devmem
> handles it?

MSG_ZEROCOPY handles this by copying to regular kernel memory using
skb_orphan_frags_rx whenever a tx packet could get looped onto an rx
queue and thus held indefinitely. This is not allowed for MSG_ZEROCOPY
as it causes a potentially unbound latency before data can be reused
by the application. Called from __netif_receive_skb_core,
dev_queue_xmit_nit and a few others.

SO_DEVMEM does allow data to enter packet sockets, but instruments
each point that might reference memory to not do this. For instance:

	@@ -2156,7 +2156,7 @@  static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
			}
		}
	 
	-	snaplen = skb->len;
	+	snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb);

https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

Either approach could be extended to cover io_uring packets.

Multicast is a perhaps an interesting other receive case. I have not
given that much thought.

> It's probably not a huge worry for now, I expect killing the
> task/sockets should resolve dependencies, but would be great to find
> such scenarios. I'd appreciate any pointers if you have some in mind.
> 
> -- 
> Pavel Begunkov



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov
  2023-12-20  1:29     ` Pavel Begunkov
@ 2024-01-02 16:11       ` Mina Almasry
  0 siblings, 0 replies; 50+ messages in thread
From: Mina Almasry @ 2024-01-02 16:11 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern

On Tue, Dec 19, 2023 at 5:34 PM Pavel Begunkov <[email protected]> wrote:
>
> On 12/19/23 23:24, Mina Almasry wrote:
> > On Tue, Dec 19, 2023 at 1:04 PM David Wei <[email protected]> wrote:
> >>
> >> From: Pavel Begunkov <[email protected]>
> >>
> >> NOT FOR UPSTREAM
> >>
> >> There will be more users of struct page_pool_iov, and ppiovs from one
> >> subsystem must not be used by another. That should never happen for any
> >> sane application, but we need to enforce it in case of bufs and/or
> >> malicious users.
> >>
> >> Signed-off-by: Pavel Begunkov <[email protected]>
> >> Signed-off-by: David Wei <[email protected]>
> >> ---
> >>   net/ipv4/tcp.c | 7 +++++++
> >>   1 file changed, 7 insertions(+)
> >>
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index 33a8bb63fbf5..9c6b18eebb5b 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -2384,6 +2384,13 @@ static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
> >>                          }
> >>
> >>                          ppiov = skb_frag_page_pool_iov(frag);
> >> +
> >> +                       /* Disallow non devmem owned buffers */
> >> +                       if (ppiov->pp->p.memory_provider != PP_MP_DMABUF_DEVMEM) {
> >> +                               err = -ENODEV;
> >> +                               goto out;
> >> +                       }
> >> +
> >
> > Instead of this, I maybe recommend modifying the skb->dmabuf flag? My
> > mental model is that flag means all the frags in the skb are
>
> That's a good point, we need to separate them, and I have it in my
> todo list.
>
> > specifically dmabuf, not general ppiovs or net_iovs. Is it possible to
> > add skb->io_uring or something?
>
> ->io_uring flag is not feasible, converting ->devmem into a type
> {page,devmem,iouring} is better but not great either.
>
> > If that bloats the skb headers, then maybe we need another place to
> > put this flag. Maybe the [page_pool|net]_iov should declare whether
> > it's dmabuf or otherwise, and we can check frag[0] and assume all
>
> ppiov->pp should be enough, either not mixing buffers from different
> pools or comparing pp->ops or some pp->type.
>
> > frags are the same as frag0.
>
> I think I like this one the most. I think David Ahern mentioned
> before, but would be nice having it on per frag basis and kill
> ->devmem flag. That would still stop collapsing if frags are
> from different pools or so.
>

This sounds reasonable to me. I'll look into applying this change to
my next devmem TCP RFC, thanks.

> > But IMO the page pool internals should not leak into the
> > implementation of generic tcp stack functions.
>
> --
> Pavel Begunkov



-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-01-02 16:12 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-19 21:03 [RFC PATCH v3 00/20] Zero copy Rx using io_uring David Wei
2023-12-19 21:03 ` [RFC PATCH v3 01/20] net: page_pool: add ppiov mangling helper David Wei
2023-12-19 23:22   ` Mina Almasry
2023-12-19 23:59     ` Pavel Begunkov
2023-12-19 21:03 ` [RFC PATCH v3 02/20] tcp: don't allow non-devmem originated ppiov David Wei
2023-12-19 23:24   ` Mina Almasry
2023-12-20  1:29     ` Pavel Begunkov
2024-01-02 16:11       ` Mina Almasry
2023-12-19 21:03 ` [RFC PATCH v3 03/20] net: page pool: rework ppiov life cycle David Wei
2023-12-19 23:35   ` Mina Almasry
2023-12-20  0:49     ` Pavel Begunkov
2023-12-19 21:03 ` [RFC PATCH v3 04/20] net: enable napi_pp_put_page for ppiov David Wei
2023-12-19 21:03 ` [RFC PATCH v3 05/20] net: page_pool: add ->scrub mem provider callback David Wei
2023-12-19 21:03 ` [RFC PATCH v3 06/20] io_uring: separate header for exported net bits David Wei
2023-12-20 16:01   ` Jens Axboe
2023-12-19 21:03 ` [RFC PATCH v3 07/20] io_uring: add interface queue David Wei
2023-12-20 16:13   ` Jens Axboe
2023-12-20 16:23     ` Pavel Begunkov
2023-12-21  1:44     ` David Wei
2023-12-21 17:57   ` Willem de Bruijn
2023-12-30 16:25     ` Pavel Begunkov
2023-12-31 22:25       ` Willem de Bruijn
2023-12-19 21:03 ` [RFC PATCH v3 08/20] io_uring: add mmap support for shared ifq ringbuffers David Wei
2023-12-20 16:13   ` Jens Axboe
2023-12-19 21:03 ` [RFC PATCH v3 09/20] netdev: add XDP_SETUP_ZC_RX command David Wei
2023-12-19 21:03 ` [RFC PATCH v3 10/20] io_uring: setup ZC for an Rx queue when registering an ifq David Wei
2023-12-20 16:06   ` Jens Axboe
2023-12-20 16:24     ` Pavel Begunkov
2023-12-19 21:03 ` [RFC PATCH v3 11/20] io_uring/zcrx: implement socket registration David Wei
2023-12-19 21:03 ` [RFC PATCH v3 12/20] io_uring: add ZC buf and pool David Wei
2023-12-19 21:03 ` [RFC PATCH v3 13/20] io_uring: implement pp memory provider for zc rx David Wei
2023-12-19 23:44   ` Mina Almasry
2023-12-20  0:39     ` Pavel Begunkov
2023-12-21 19:36   ` Pavel Begunkov
2023-12-19 21:03 ` [RFC PATCH v3 14/20] net: page pool: add io_uring memory provider David Wei
2023-12-19 23:39   ` Mina Almasry
2023-12-20  0:04     ` Pavel Begunkov
2023-12-19 21:03 ` [RFC PATCH v3 15/20] io_uring: add io_recvzc request David Wei
2023-12-20 16:27   ` Jens Axboe
2023-12-20 17:04     ` Pavel Begunkov
2023-12-20 18:09       ` Jens Axboe
2023-12-21 18:59         ` Pavel Begunkov
2023-12-21 21:32           ` Jens Axboe
2023-12-30 21:15             ` Pavel Begunkov
2023-12-19 21:03 ` [RFC PATCH v3 16/20] net: execute custom callback from napi David Wei
2023-12-19 21:03 ` [RFC PATCH v3 17/20] io_uring/zcrx: add copy fallback David Wei
2023-12-19 21:03 ` [RFC PATCH v3 18/20] veth: add support for io_uring zc rx David Wei
2023-12-19 21:03 ` [RFC PATCH v3 19/20] net: page pool: generalise ppiov dma address get David Wei
2023-12-21 19:51   ` Mina Almasry
2023-12-19 21:03 ` [RFC PATCH v3 20/20] bnxt: enable io_uring zc page pool David Wei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox