* [RFC PATCH v4 01/16] net: generalise pp provider params passing
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 02/16] io_uring: delayed cqe commit David Wei
` (14 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
RFC only, not for upstream
Add a way to pass custom page pool parameters, but the final version
should converge with devmem.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/netdev_rx_queue.h | 3 +++
net/core/dev.c | 2 +-
net/core/page_pool.c | 3 +++
3 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 5dc35628633a..41f8c4e049bb 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -26,6 +26,9 @@ struct netdev_rx_queue {
*/
struct napi_struct *napi;
struct netdev_dmabuf_binding *binding;
+
+ const struct memory_provider_ops *pp_ops;
+ void *pp_private;
} ____cacheline_aligned_in_smp;
/*
diff --git a/net/core/dev.c b/net/core/dev.c
index 255a38cf59b1..2096ff57685a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2189,7 +2189,7 @@ int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
rxq = __netif_get_rx_queue(dev, rxq_idx);
- if (rxq->binding)
+ if (rxq->binding || rxq->pp_ops)
return -EEXIST;
err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 53039d2f8514..5d5b78878473 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -262,6 +262,9 @@ static int page_pool_init(struct page_pool *pool,
if (binding) {
pool->mp_ops = &dmabuf_devmem_ops;
pool->mp_priv = binding;
+ } else if (pool->p.queue && pool->p.queue->pp_ops) {
+ pool->mp_ops = pool->p.queue->pp_ops;
+ pool->mp_priv = pool->p.queue->pp_private;
}
if (pool->mp_ops) {
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 02/16] io_uring: delayed cqe commit
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
2024-03-12 21:44 ` [RFC PATCH v4 01/16] net: generalise pp provider params passing David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 03/16] net: page_pool: add ->scrub mem provider callback David Wei
` (13 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
RFC only, not for upstream
A stub patch that allows delaying and batching the final step of cqe
posting for aux cqes. A different version will be sent separately to upstream.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring_types.h | 1 +
io_uring/io_uring.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d8111d64812b..500772189fee 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -205,6 +205,7 @@ struct io_submit_state {
bool plug_started;
bool need_plug;
+ bool flush_cqes;
unsigned short submit_nr;
unsigned int cqes_count;
struct blk_plug plug;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cf2f514b7cc0..e44c2ef271b9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -176,7 +176,7 @@ static struct ctl_table kernel_io_uring_disabled_table[] = {
static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
{
if (!wq_list_empty(&ctx->submit_state.compl_reqs) ||
- ctx->submit_state.cqes_count)
+ ctx->submit_state.cqes_count || ctx->submit_state.flush_cqes)
__io_submit_flush_completions(ctx);
}
@@ -1598,6 +1598,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
io_free_batch_list(ctx, state->compl_reqs.first);
INIT_WQ_LIST(&state->compl_reqs);
}
+ ctx->submit_state.flush_cqes = false;
}
static unsigned io_cqring_events(struct io_ring_ctx *ctx)
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 03/16] net: page_pool: add ->scrub mem provider callback
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
2024-03-12 21:44 ` [RFC PATCH v4 01/16] net: generalise pp provider params passing David Wei
2024-03-12 21:44 ` [RFC PATCH v4 02/16] io_uring: delayed cqe commit David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 04/16] io_uring: separate header for exported net bits David Wei
` (12 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
The page pool now waits for all ppiovs to return before destroying
itself; for that to happen, the memory provider might need to push out
some buffers, flush caches and so on.
todo: we'll try to get by without it before the final release
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/types.h | 1 +
net/core/page_pool.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 096cd2455b2c..347837b83d36 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -134,6 +134,7 @@ struct memory_provider_ops {
void (*destroy)(struct page_pool *pool);
netmem_ref (*alloc_pages)(struct page_pool *pool, gfp_t gfp);
bool (*release_page)(struct page_pool *pool, netmem_ref netmem);
+ void (*scrub)(struct page_pool *pool);
};
extern const struct memory_provider_ops dmabuf_devmem_ops;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5d5b78878473..fc92e551ed13 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -984,6 +984,9 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
static void page_pool_scrub(struct page_pool *pool)
{
+ if (pool->mp_ops && pool->mp_ops->scrub)
+ pool->mp_ops->scrub(pool);
+
page_pool_empty_alloc_cache_once(pool);
pool->destroy_cnt++;
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 04/16] io_uring: separate header for exported net bits
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (2 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 03/16] net: page_pool: add ->scrub mem provider callback David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 05/16] io_uring: introduce interface queue David Wei
` (11 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring.h | 6 ------
include/linux/io_uring/net.h | 18 ++++++++++++++++++
io_uring/uring_cmd.c | 1 +
net/socket.c | 2 +-
4 files changed, 20 insertions(+), 7 deletions(-)
create mode 100644 include/linux/io_uring/net.h
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 68ed6697fece..e123d5e17b52 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -11,7 +11,6 @@ void __io_uring_cancel(bool cancel_all);
void __io_uring_free(struct task_struct *tsk);
void io_uring_unreg_ringfd(void);
const char *io_uring_get_opcode(u8 opcode);
-int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
bool io_is_uring_fops(struct file *file);
static inline void io_uring_files_cancel(void)
@@ -45,11 +44,6 @@ static inline const char *io_uring_get_opcode(u8 opcode)
{
return "";
}
-static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
- unsigned int issue_flags)
-{
- return -EOPNOTSUPP;
-}
static inline bool io_is_uring_fops(struct file *file)
{
return false;
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
new file mode 100644
index 000000000000..b58f39fed4d5
--- /dev/null
+++ b/include/linux/io_uring/net.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_IO_URING_NET_H
+#define _LINUX_IO_URING_NET_H
+
+struct io_uring_cmd;
+
+#if defined(CONFIG_IO_URING)
+int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
+
+#else
+static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
+ unsigned int issue_flags)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
+#endif
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 42f63adfa54a..0b504b33806c 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -3,6 +3,7 @@
#include <linux/errno.h>
#include <linux/file.h>
#include <linux/io_uring/cmd.h>
+#include <linux/io_uring/net.h>
#include <linux/security.h>
#include <linux/nospec.h>
#include <net/sock.h>
diff --git a/net/socket.c b/net/socket.c
index ed3df2f749bf..c69cd0e652b8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -88,7 +88,7 @@
#include <linux/xattr.h>
#include <linux/nospec.h>
#include <linux/indirect_call_wrapper.h>
-#include <linux/io_uring.h>
+#include <linux/io_uring/net.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 05/16] io_uring: introduce interface queue
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (3 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 04/16] io_uring: separate header for exported net bits David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 06/16] io_uring: add mmap support for shared ifq ringbuffers David Wei
` (10 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: David Wei <[email protected]>
This patch introduces a new object in io_uring called an interface queue
(ifq) which contains:
* A pool region allocated by userspace and registered w/ io_uring where
Rx data is transferred to via DMA.
* A net device and one specific Rx queue in it that will be configured
for ZC Rx.
* A new shared ringbuf w/ userspace called a refill ring. When userspace
is done with buffers holding Rx packet payloads, it writes entries into this
ring to tell the kernel that the buffers can be re-used by the NIC again.
Each entry in the refill ring is a struct io_uring_rbuf_rqe.
On the completion side, the main CQ ring is used to notify userspace of
recv()'d packets. Big CQEs (32 bytes) are required to support this, as
the upper 16 bytes are used by ZC Rx to store a feature specific struct
io_uring_rbuf_cqe.
Add two new struct types:
1. io_uring_rbuf_rqe - entry in refill ring
2. io_uring_rbuf_cqe - entry in upper 16 bytes of a big CQE
For now, each io_uring instance has a single ifq, and each ifq has a
single pool region associated with one Rx queue.
Add a new opcode and functions to setup and tear down an ifq. Size and
offsets of the shared refill ring are returned to userspace for it to
mmap in the registration struct io_uring_zc_rx_ifq_reg, similar to the
main SQ/CQ rings.
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring_types.h | 4 ++
include/uapi/linux/io_uring.h | 40 ++++++++++++
io_uring/Makefile | 3 +-
io_uring/io_uring.c | 7 +++
io_uring/register.c | 7 +++
io_uring/zc_rx.c | 109 +++++++++++++++++++++++++++++++++
io_uring/zc_rx.h | 35 +++++++++++
7 files changed, 204 insertions(+), 1 deletion(-)
create mode 100644 io_uring/zc_rx.c
create mode 100644 io_uring/zc_rx.h
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 500772189fee..27e750a02ea5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -39,6 +39,8 @@ enum io_uring_cmd_flags {
IO_URING_F_COMPAT = (1 << 12),
};
+struct io_zc_rx_ifq;
+
struct io_wq_work_node {
struct io_wq_work_node *next;
};
@@ -385,6 +387,8 @@ struct io_ring_ctx {
struct io_rsrc_data *file_data;
struct io_rsrc_data *buf_data;
+ struct io_zc_rx_ifq *ifq;
+
/* protected by ->uring_lock */
struct list_head rsrc_ref_list;
struct io_alloc_cache rsrc_node_cache;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7bd10201a02b..7b643fe420c5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -575,6 +575,9 @@ enum {
IORING_REGISTER_NAPI = 27,
IORING_UNREGISTER_NAPI = 28,
+ /* register a network interface queue for zerocopy */
+ IORING_REGISTER_ZC_RX_IFQ = 29,
+
/* this goes last */
IORING_REGISTER_LAST,
@@ -782,6 +785,43 @@ enum {
SOCKET_URING_OP_SETSOCKOPT,
};
+struct io_uring_rbuf_rqe {
+ __u32 off;
+ __u32 len;
+ __u16 region;
+ __u8 __pad[6];
+};
+
+struct io_uring_rbuf_cqe {
+ __u32 off;
+ __u32 len;
+ __u16 region;
+ __u8 __pad[6];
+};
+
+struct io_rbuf_rqring_offsets {
+ __u32 head;
+ __u32 tail;
+ __u32 rqes;
+ __u8 __pad[4];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZC_RX_IFQ
+ */
+struct io_uring_zc_rx_ifq_reg {
+ __u32 if_idx;
+ /* hw rx descriptor ring id */
+ __u32 if_rxq_id;
+ __u32 region_id;
+ __u32 rq_entries;
+ __u32 flags;
+ __u16 cpu;
+
+ __u32 mmap_sz;
+ struct io_rbuf_rqring_offsets rq_off;
+};
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 2e1d4e03799c..bb47231c611b 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -8,7 +8,8 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \
statx.o net.o msg_ring.o timeout.o \
sqpoll.o fdinfo.o tctx.o poll.o \
cancel.o kbuf.o rsrc.o rw.o opdef.o \
- notif.o waitid.o register.o truncate.o
+ notif.o waitid.o register.o truncate.o \
+ zc_rx.o
obj-$(CONFIG_IO_WQ) += io-wq.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e44c2ef271b9..5614c47cecd9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -95,6 +95,7 @@
#include "waitid.h"
#include "futex.h"
#include "napi.h"
+#include "zc_rx.h"
#include "timeout.h"
#include "poll.h"
@@ -2861,6 +2862,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
return;
mutex_lock(&ctx->uring_lock);
+ io_unregister_zc_rx_ifqs(ctx);
if (ctx->buf_data)
__io_sqe_buffers_unregister(ctx);
if (ctx->file_data)
@@ -3032,6 +3034,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
io_cqring_overflow_kill(ctx);
mutex_unlock(&ctx->uring_lock);
}
+ if (ctx->ifq) {
+ mutex_lock(&ctx->uring_lock);
+ io_shutdown_zc_rx_ifqs(ctx);
+ mutex_unlock(&ctx->uring_lock);
+ }
if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
io_move_task_work_from_local(ctx);
diff --git a/io_uring/register.c b/io_uring/register.c
index 99c37775f974..760f0b6a051c 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -27,6 +27,7 @@
#include "cancel.h"
#include "kbuf.h"
#include "napi.h"
+#include "zc_rx.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -563,6 +564,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_unregister_napi(ctx, arg);
break;
+ case IORING_REGISTER_ZC_RX_IFQ:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_zc_rx_ifq(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
new file mode 100644
index 000000000000..e6c33f94c086
--- /dev/null
+++ b/io_uring/zc_rx.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+#if defined(CONFIG_PAGE_POOL)
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "zc_rx.h"
+
+static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq,
+ struct io_uring_zc_rx_ifq_reg *reg)
+{
+ gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
+ size_t off, rq_size;
+ void *ptr;
+
+ off = sizeof(struct io_uring);
+ rq_size = reg->rq_entries * sizeof(struct io_uring_rbuf_rqe);
+ ptr = (void *) __get_free_pages(gfp, get_order(off + rq_size));
+ if (!ptr)
+ return -ENOMEM;
+ ifq->rq_ring = (struct io_uring *)ptr;
+ ifq->rqes = (struct io_uring_rbuf_rqe *)((char *)ptr + off);
+ return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq)
+{
+ if (ifq->rq_ring)
+ folio_put(virt_to_folio(ifq->rq_ring));
+}
+
+static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+ struct io_zc_rx_ifq *ifq;
+
+ ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+ if (!ifq)
+ return NULL;
+
+ ifq->if_rxq_id = -1;
+ ifq->ctx = ctx;
+ return ifq;
+}
+
+static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+{
+ io_free_rbuf_ring(ifq);
+ kfree(ifq);
+}
+
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+ struct io_uring_zc_rx_ifq_reg reg;
+ struct io_zc_rx_ifq *ifq;
+ int ret;
+
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
+ ctx->flags & IORING_SETUP_CQE32))
+ return -EINVAL;
+ if (copy_from_user(®, arg, sizeof(reg)))
+ return -EFAULT;
+ if (ctx->ifq)
+ return -EBUSY;
+ if (reg.if_rxq_id == -1)
+ return -EINVAL;
+
+ ifq = io_zc_rx_ifq_alloc(ctx);
+ if (!ifq)
+ return -ENOMEM;
+
+ ret = io_allocate_rbuf_ring(ifq, ®);
+ if (ret)
+ goto err;
+
+ ifq->rq_entries = reg.rq_entries;
+ ifq->if_rxq_id = reg.if_rxq_id;
+ ctx->ifq = ifq;
+
+ return 0;
+err:
+ io_zc_rx_ifq_free(ifq);
+ return ret;
+}
+
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+ struct io_zc_rx_ifq *ifq = ctx->ifq;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (!ifq)
+ return;
+
+ ctx->ifq = NULL;
+ io_zc_rx_ifq_free(ifq);
+}
+
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+ lockdep_assert_held(&ctx->uring_lock);
+}
+
+#endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
new file mode 100644
index 000000000000..35b019b275e0
--- /dev/null
+++ b/io_uring/zc_rx.h
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+struct io_zc_rx_ifq {
+ struct io_ring_ctx *ctx;
+ struct net_device *dev;
+ struct io_uring *rq_ring;
+ struct io_uring_rbuf_rqe *rqes;
+ u32 rq_entries;
+
+ /* hw rx descriptor ring id */
+ u32 if_rxq_id;
+};
+
+#if defined(CONFIG_PAGE_POOL)
+int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zc_rx_ifq_reg __user *arg);
+void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zc_rx_ifq_reg __user *arg)
+{
+ return -EOPNOTSUPP;
+}
+static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 06/16] io_uring: add mmap support for shared ifq ringbuffers
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (4 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 05/16] io_uring: introduce interface queue David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 07/16] netdev: add XDP_SETUP_ZC_RX command David Wei
` (9 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: David Wei <[email protected]>
This patch adds mmap support for ifq refill ring. Just like the io_uring
SQ/CQ rings, userspace issues a single mmap call using the io_uring fd
w/ magic offset IORING_OFF_RQ_RING. An opaque ptr is returned to
userspace, which is then expected to use the offsets returned in the
registration struct to get access to the head/tail and rings.
Signed-off-by: David Wei <[email protected]>
---
include/uapi/linux/io_uring.h | 2 ++
io_uring/io_uring.c | 5 +++++
io_uring/zc_rx.c | 15 ++++++++++++++-
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7b643fe420c5..a085ed60478f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -438,6 +438,8 @@ enum {
#define IORING_OFF_PBUF_RING 0x80000000ULL
#define IORING_OFF_PBUF_SHIFT 16
#define IORING_OFF_MMAP_MASK 0xf8000000ULL
+#define IORING_OFF_RQ_RING 0x20000000ULL
+#define IORING_OFF_RQ_SHIFT 16
/*
* Filled with the offset for mmap(2)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5614c47cecd9..280f2a2fd1fe 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3434,6 +3434,11 @@ static void *io_uring_validate_mmap_request(struct file *file,
return ERR_PTR(-EINVAL);
break;
}
+ case IORING_OFF_RQ_RING:
+ if (!ctx->ifq)
+ return ERR_PTR(-EINVAL);
+ ptr = ctx->ifq->rq_ring;
+ break;
default:
return ERR_PTR(-EINVAL);
}
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index e6c33f94c086..6987bb991418 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -58,6 +58,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
{
struct io_uring_zc_rx_ifq_reg reg;
struct io_zc_rx_ifq *ifq;
+ size_t ring_sz, rqes_sz;
int ret;
if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
@@ -80,8 +81,20 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
ifq->rq_entries = reg.rq_entries;
ifq->if_rxq_id = reg.if_rxq_id;
- ctx->ifq = ifq;
+ ring_sz = sizeof(struct io_uring);
+ rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries;
+ reg.mmap_sz = ring_sz + rqes_sz;
+ reg.rq_off.rqes = ring_sz;
+ reg.rq_off.head = offsetof(struct io_uring, head);
+ reg.rq_off.tail = offsetof(struct io_uring, tail);
+
+ if (copy_to_user(arg, ®, sizeof(reg))) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ ctx->ifq = ifq;
return 0;
err:
io_zc_rx_ifq_free(ifq);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 07/16] netdev: add XDP_SETUP_ZC_RX command
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (5 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 06/16] io_uring: add mmap support for shared ifq ringbuffers David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 08/16] io_uring: setup ZC for an Rx queue when registering an ifq David Wei
` (8 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: David Wei <[email protected]>
RFC only, not for upstream.
This will be replaced with a separate ndo callback or some other
mechanism in next patchset revisions.
This patch adds a new XDP_SETUP_ZC_RX command that will be used in a
later patch to enable or disable ZC RX for a specific RX queue.
We are open to suggestions on a better way of doing this. Google's TCP
devmem proposal sets up struct netdev_rx_queue which persists across
device reset, then expects userspace to use an out-of-band method (e.g.
ethtool) to reset the device, thus re-filling a hardware Rx queue.
Signed-off-by: David Wei <[email protected]>
---
include/linux/netdevice.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac7102118d68..699cce69a5a6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1009,6 +1009,7 @@ enum bpf_netdev_command {
BPF_OFFLOAD_MAP_ALLOC,
BPF_OFFLOAD_MAP_FREE,
XDP_SETUP_XSK_POOL,
+ XDP_SETUP_ZC_RX,
};
struct bpf_prog_offload_ops;
@@ -1047,6 +1048,11 @@ struct netdev_bpf {
struct xsk_buff_pool *pool;
u16 queue_id;
} xsk;
+ /* XDP_SETUP_ZC_RX */
+ struct {
+ struct io_zc_rx_ifq *ifq;
+ u16 queue_id;
+ } zc_rx;
};
};
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 08/16] io_uring: setup ZC for an Rx queue when registering an ifq
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (6 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 07/16] netdev: add XDP_SETUP_ZC_RX command David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 09/16] io_uring/zcrx: implement socket registration David Wei
` (7 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: David Wei <[email protected]>
RFC only, not for upstream
Just as with the previous patch, this will be migrated away from ndo_bpf.
This patch sets up ZC for an Rx queue in a net device when an ifq is
registered with io_uring. The Rx queue is specified in the registration
struct.
For now since there is only one ifq, its destruction is implicit during
io_uring cleanup.
Signed-off-by: David Wei <[email protected]>
---
io_uring/zc_rx.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 6987bb991418..521eeea04f9d 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -4,6 +4,7 @@
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/io_uring.h>
+#include <linux/netdevice.h>
#include <uapi/linux/io_uring.h>
@@ -11,6 +12,34 @@
#include "kbuf.h"
#include "zc_rx.h"
+typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+
+static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq,
+ u16 queue_id)
+{
+ struct netdev_bpf cmd;
+ bpf_op_t ndo_bpf;
+
+ ndo_bpf = dev->netdev_ops->ndo_bpf;
+ if (!ndo_bpf)
+ return -EINVAL;
+
+ cmd.command = XDP_SETUP_ZC_RX;
+ cmd.zc_rx.ifq = ifq;
+ cmd.zc_rx.queue_id = queue_id;
+ return ndo_bpf(dev, &cmd);
+}
+
+static int io_open_zc_rxq(struct io_zc_rx_ifq *ifq)
+{
+ return __io_queue_mgmt(ifq->dev, ifq, ifq->if_rxq_id);
+}
+
+static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq)
+{
+ return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id);
+}
+
static int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq,
struct io_uring_zc_rx_ifq_reg *reg)
{
@@ -49,6 +78,10 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
{
+ if (ifq->if_rxq_id != -1)
+ io_close_zc_rxq(ifq);
+ if (ifq->dev)
+ dev_put(ifq->dev);
io_free_rbuf_ring(ifq);
kfree(ifq);
}
@@ -79,9 +112,18 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
if (ret)
goto err;
+ ret = -ENODEV;
+ ifq->dev = dev_get_by_index(current->nsproxy->net_ns, reg.if_idx);
+ if (!ifq->dev)
+ goto err;
+
ifq->rq_entries = reg.rq_entries;
ifq->if_rxq_id = reg.if_rxq_id;
+ ret = io_open_zc_rxq(ifq);
+ if (ret)
+ goto err;
+
ring_sz = sizeof(struct io_uring);
rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries;
reg.mmap_sz = ring_sz + rqes_sz;
@@ -90,6 +132,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
reg.rq_off.tail = offsetof(struct io_uring, tail);
if (copy_to_user(arg, ®, sizeof(reg))) {
+ io_close_zc_rxq(ifq);
ret = -EFAULT;
goto err;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 09/16] io_uring/zcrx: implement socket registration
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (7 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 08/16] io_uring: setup ZC for an Rx queue when registering an ifq David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 10/16] io_uring: add zero copy buf representation and pool David Wei
` (6 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
We want userspace to explicitly list all sockets it'll be using with a
particular zc ifq, so we can properly configure them, e.g. binding the
sockets to the corresponding interface and setting steering rules. We'll
also need it to better control ifq lifetime and for
termination / unregistration purposes.
TODO: remove zc_rx_idx from struct socket, which will fix zc_rx_idx
token init races and re-registration bug.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/linux/net.h | 2 +
include/uapi/linux/io_uring.h | 7 +++
io_uring/net.c | 20 ++++++++
io_uring/register.c | 6 +++
io_uring/zc_rx.c | 91 +++++++++++++++++++++++++++++++++--
io_uring/zc_rx.h | 17 +++++++
net/socket.c | 1 +
7 files changed, 141 insertions(+), 3 deletions(-)
diff --git a/include/linux/net.h b/include/linux/net.h
index c9b4a63791a4..867061a91d30 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -126,6 +126,8 @@ struct socket {
const struct proto_ops *ops; /* Might change with IPV6_ADDRFORM or MPTCP. */
struct socket_wq wq;
+
+ unsigned zc_rx_idx;
};
/*
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a085ed60478f..26e945e6258d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -579,6 +579,7 @@ enum {
/* register a network interface queue for zerocopy */
IORING_REGISTER_ZC_RX_IFQ = 29,
+ IORING_REGISTER_ZC_RX_SOCK = 30,
/* this goes last */
IORING_REGISTER_LAST,
@@ -824,6 +825,12 @@ struct io_uring_zc_rx_ifq_reg {
struct io_rbuf_rqring_offsets rq_off;
};
+struct io_uring_zc_rx_sock_reg {
+ __u32 sockfd;
+ __u32 zc_rx_ifq_idx;
+ __u32 __resv[2];
+};
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/net.c b/io_uring/net.c
index 54dff492e064..1fa7c1fa6b5d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
#include "net.h"
#include "notif.h"
#include "rsrc.h"
+#include "zc_rx.h"
#if defined(CONFIG_NET)
struct io_shutdown {
@@ -1033,6 +1034,25 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
return ret;
}
+static __maybe_unused
+struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
+ struct socket *sock)
+{
+ unsigned token = READ_ONCE(sock->zc_rx_idx);
+ unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET;
+ unsigned sock_idx = token & IO_ZC_IFQ_IDX_MASK;
+ struct io_zc_rx_ifq *ifq;
+
+ if (ifq_idx)
+ return NULL;
+ ifq = req->ctx->ifq;
+ if (!ifq || sock_idx >= ifq->nr_sockets)
+ return NULL;
+ if (ifq->sockets[sock_idx] != req->file)
+ return NULL;
+ return ifq;
+}
+
void io_send_zc_cleanup(struct io_kiocb *req)
{
struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/register.c b/io_uring/register.c
index 760f0b6a051c..7f40403a1716 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -570,6 +570,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_zc_rx_ifq(ctx, arg);
break;
+ case IORING_REGISTER_ZC_RX_SOCK:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_zc_rx_sock(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 521eeea04f9d..77459c0fc14b 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -5,12 +5,15 @@
#include <linux/mm.h>
#include <linux/io_uring.h>
#include <linux/netdevice.h>
+#include <net/tcp.h>
+#include <net/af_unix.h>
#include <uapi/linux/io_uring.h>
#include "io_uring.h"
#include "kbuf.h"
#include "zc_rx.h"
+#include "rsrc.h"
typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
@@ -76,10 +79,31 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
return ifq;
}
-static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+static void io_shutdown_ifq(struct io_zc_rx_ifq *ifq)
{
- if (ifq->if_rxq_id != -1)
+ int i;
+
+ if (!ifq)
+ return;
+
+ for (i = 0; i < ifq->nr_sockets; i++) {
+ if (ifq->sockets[i]) {
+ fput(ifq->sockets[i]);
+ ifq->sockets[i] = NULL;
+ }
+ }
+ ifq->nr_sockets = 0;
+
+ if (ifq->if_rxq_id != -1) {
io_close_zc_rxq(ifq);
+ ifq->if_rxq_id = -1;
+ }
+}
+
+static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
+{
+ io_shutdown_ifq(ifq);
+
if (ifq->dev)
dev_put(ifq->dev);
io_free_rbuf_ring(ifq);
@@ -132,7 +156,6 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
reg.rq_off.tail = offsetof(struct io_uring, tail);
if (copy_to_user(arg, &reg, sizeof(reg))) {
- io_close_zc_rxq(ifq);
ret = -EFAULT;
goto err;
}
@@ -153,6 +176,8 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
if (!ifq)
return;
+ WARN_ON_ONCE(ifq->nr_sockets);
+
ctx->ifq = NULL;
io_zc_rx_ifq_free(ifq);
}
@@ -160,6 +185,66 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
{
lockdep_assert_held(&ctx->uring_lock);
+
+ io_shutdown_ifq(ctx->ifq);
+}
+
+int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
+ struct io_uring_zc_rx_sock_reg __user *arg)
+{
+ struct io_uring_zc_rx_sock_reg sr;
+ struct io_zc_rx_ifq *ifq;
+ struct socket *sock;
+ struct file *file;
+ int ret = -EEXIST;
+ int idx;
+
+ if (copy_from_user(&sr, arg, sizeof(sr)))
+ return -EFAULT;
+ if (sr.__resv[0] || sr.__resv[1])
+ return -EINVAL;
+ if (sr.zc_rx_ifq_idx != 0 || !ctx->ifq)
+ return -EINVAL;
+
+ ifq = ctx->ifq;
+ if (ifq->nr_sockets >= ARRAY_SIZE(ifq->sockets))
+ return -EINVAL;
+
+ BUILD_BUG_ON(ARRAY_SIZE(ifq->sockets) > IO_ZC_IFQ_IDX_MASK);
+
+ file = fget(sr.sockfd);
+ if (!file)
+ return -EBADF;
+
+ if (!!unix_get_socket(file)) {
+ fput(file);
+ return -EBADF;
+ }
+
+ sock = sock_from_file(file);
+ if (unlikely(!sock || !sock->sk)) {
+ fput(file);
+ return -ENOTSOCK;
+ }
+
+ idx = ifq->nr_sockets;
+ lock_sock(sock->sk);
+ if (!sock->zc_rx_idx) {
+ unsigned token;
+
+ token = idx + (sr.zc_rx_ifq_idx << IO_ZC_IFQ_IDX_OFFSET);
+ WRITE_ONCE(sock->zc_rx_idx, token);
+ ret = 0;
+ }
+ release_sock(sock->sk);
+
+ if (ret) {
+ fput(file);
+ return ret;
+ }
+ ifq->sockets[idx] = file;
+ ifq->nr_sockets++;
+ return 0;
}
#endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index 35b019b275e0..d7b8397d525f 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -2,6 +2,13 @@
#ifndef IOU_ZC_RX_H
#define IOU_ZC_RX_H
+#include <linux/io_uring_types.h>
+#include <linux/skbuff.h>
+
+#define IO_ZC_MAX_IFQ_SOCKETS 16
+#define IO_ZC_IFQ_IDX_OFFSET 16
+#define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
+
struct io_zc_rx_ifq {
struct io_ring_ctx *ctx;
struct net_device *dev;
@@ -11,6 +18,9 @@ struct io_zc_rx_ifq {
/* hw rx descriptor ring id */
u32 if_rxq_id;
+
+ unsigned nr_sockets;
+ struct file *sockets[IO_ZC_MAX_IFQ_SOCKETS];
};
#if defined(CONFIG_PAGE_POOL)
@@ -18,6 +28,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zc_rx_ifq_reg __user *arg);
void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
+int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
+ struct io_uring_zc_rx_sock_reg __user *arg);
#else
static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zc_rx_ifq_reg __user *arg)
@@ -30,6 +42,11 @@ static inline void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx)
static inline void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx)
{
}
+static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
+ struct io_uring_zc_rx_sock_reg __user *arg)
+{
+ return -EOPNOTSUPP;
+}
#endif
#endif
diff --git a/net/socket.c b/net/socket.c
index c69cd0e652b8..18181a4e0295 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -637,6 +637,7 @@ struct socket *sock_alloc(void)
sock = SOCKET_I(inode);
+ sock->zc_rx_idx = 0;
inode->i_ino = get_next_ino();
inode->i_mode = S_IFSOCK | S_IRWXUGO;
inode->i_uid = current_fsuid();
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 10/16] io_uring: add zero copy buf representation and pool
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (8 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 09/16] io_uring/zcrx: implement socket registration David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 11/16] io_uring: implement pp memory provider for zc rx David Wei
` (5 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: David Wei <[email protected]>
This patch adds two objects:
* Zero copy buffer representation, holding a page and a net_iov. The
page is needed as net_iov is designed for opaque device memory,
whereas we are backed by real pages.
* Zero copy pool, spiritually similar to page pool, that holds ZC bufs
and hands them out to net devices. This will be used as an
implementation of page pool memory provider.
Pool regions are registered w/ io_uring using the registered buffer API,
with a 1:1 mapping between region and nr_iovec in
io_uring_register_buffers. This does the heavy lifting of pinning and
chunking into bvecs into a struct io_mapped_ubuf for us.
For now as there is only one pool region per ifq, there is no separate
API for adding/removing regions yet and it is mapped implicitly during
ifq registration.
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring/net.h | 7 +++
io_uring/zc_rx.c | 110 +++++++++++++++++++++++++++++++++++
io_uring/zc_rx.h | 15 +++++
3 files changed, 132 insertions(+)
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
index b58f39fed4d5..05d5a6a97264 100644
--- a/include/linux/io_uring/net.h
+++ b/include/linux/io_uring/net.h
@@ -2,8 +2,15 @@
#ifndef _LINUX_IO_URING_NET_H
#define _LINUX_IO_URING_NET_H
+#include <net/page_pool/types.h>
+
struct io_uring_cmd;
+struct io_zc_rx_buf {
+ struct net_iov niov;
+ struct page *page;
+};
+
#if defined(CONFIG_IO_URING)
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 77459c0fc14b..326ae3fcc643 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -5,6 +5,7 @@
#include <linux/mm.h>
#include <linux/io_uring.h>
#include <linux/netdevice.h>
+#include <linux/nospec.h>
#include <net/tcp.h>
#include <net/af_unix.h>
@@ -66,6 +67,109 @@ static void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq)
folio_put(virt_to_folio(ifq->rq_ring));
}
+static int io_zc_rx_init_buf(struct page *page, struct io_zc_rx_buf *buf)
+{
+ memset(&buf->niov, 0, sizeof(buf->niov));
+ atomic_long_set(&buf->niov.pp_ref_count, 0);
+
+ buf->page = page;
+ get_page(page);
+ return 0;
+}
+
+static void io_zc_rx_free_buf(struct io_zc_rx_buf *buf)
+{
+ struct page *page = buf->page;
+
+ put_page(page);
+}
+
+static int io_zc_rx_init_pool(struct io_zc_rx_pool *pool,
+ struct io_mapped_ubuf *imu)
+{
+ struct io_zc_rx_buf *buf;
+ struct page *page;
+ int i, ret;
+
+ for (i = 0; i < imu->nr_bvecs; i++) {
+ page = imu->bvec[i].bv_page;
+ buf = &pool->bufs[i];
+ ret = io_zc_rx_init_buf(page, buf);
+ if (ret)
+ goto err;
+
+ pool->freelist[i] = i;
+ }
+
+ pool->free_count = imu->nr_bvecs;
+ return 0;
+err:
+ while (i--) {
+ buf = &pool->bufs[i];
+ io_zc_rx_free_buf(buf);
+ }
+ return ret;
+}
+
+static int io_zc_rx_create_pool(struct io_ring_ctx *ctx,
+ struct io_zc_rx_ifq *ifq,
+ u16 id)
+{
+ struct io_mapped_ubuf *imu;
+ struct io_zc_rx_pool *pool;
+ int nr_pages;
+ int ret;
+
+ if (ifq->pool)
+ return -EFAULT;
+
+ if (unlikely(id >= ctx->nr_user_bufs))
+ return -EFAULT;
+ id = array_index_nospec(id, ctx->nr_user_bufs);
+ imu = ctx->user_bufs[id];
+ if (imu->ubuf & ~PAGE_MASK || imu->ubuf_end & ~PAGE_MASK)
+ return -EFAULT;
+
+ ret = -ENOMEM;
+ nr_pages = imu->nr_bvecs;
+ pool = kvmalloc(struct_size(pool, freelist, nr_pages), GFP_KERNEL);
+ if (!pool)
+ goto err;
+
+ pool->bufs = kvmalloc_array(nr_pages, sizeof(*pool->bufs), GFP_KERNEL);
+ if (!pool->bufs)
+ goto err_buf;
+
+ ret = io_zc_rx_init_pool(pool, imu);
+ if (ret)
+ goto err_map;
+
+ pool->ifq = ifq;
+ pool->pool_id = id;
+ pool->nr_bufs = nr_pages;
+ spin_lock_init(&pool->freelist_lock);
+ ifq->pool = pool;
+ return 0;
+err_map:
+ kvfree(pool->bufs);
+err_buf:
+ kvfree(pool);
+err:
+ return ret;
+}
+
+static void io_zc_rx_free_pool(struct io_zc_rx_pool *pool)
+{
+ struct io_zc_rx_buf *buf;
+
+ for (int i = 0; i < pool->nr_bufs; i++) {
+ buf = &pool->bufs[i];
+ io_zc_rx_free_buf(buf);
+ }
+ kvfree(pool->bufs);
+ kvfree(pool);
+}
+
static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx)
{
struct io_zc_rx_ifq *ifq;
@@ -104,6 +208,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq)
{
io_shutdown_ifq(ifq);
+ if (ifq->pool)
+ io_zc_rx_free_pool(ifq->pool);
if (ifq->dev)
dev_put(ifq->dev);
io_free_rbuf_ring(ifq);
@@ -141,6 +247,10 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
if (!ifq->dev)
goto err;
+ ret = io_zc_rx_create_pool(ctx, ifq, reg.region_id);
+ if (ret)
+ goto err;
+
ifq->rq_entries = reg.rq_entries;
ifq->if_rxq_id = reg.if_rxq_id;
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index d7b8397d525f..466b2b8f9813 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -3,15 +3,30 @@
#define IOU_ZC_RX_H
#include <linux/io_uring_types.h>
+#include <linux/io_uring/net.h>
#include <linux/skbuff.h>
#define IO_ZC_MAX_IFQ_SOCKETS 16
#define IO_ZC_IFQ_IDX_OFFSET 16
#define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
+struct io_zc_rx_pool {
+ struct io_zc_rx_ifq *ifq;
+ struct io_zc_rx_buf *bufs;
+ u32 nr_bufs;
+ u16 pool_id;
+
+ /* freelist */
+ spinlock_t freelist_lock;
+ u32 free_count;
+ u32 freelist[];
+};
+
struct io_zc_rx_ifq {
struct io_ring_ctx *ctx;
struct net_device *dev;
+ struct io_zc_rx_pool *pool;
+
struct io_uring *rq_ring;
struct io_uring_rbuf_rqe *rqes;
u32 rq_entries;
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 11/16] io_uring: implement pp memory provider for zc rx
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (9 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 10/16] io_uring: add zero copy buf representation and pool David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 12/16] io_uring/zcrx: implement PP_FLAG_DMA_* handling David Wei
` (4 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Implement a new pp memory provider for io_uring zerocopy receive.
All buffers are backed by struct io_zc_rx_buf, which is a thin extension
of struct net_iov. Initially, all of them are unallocated and placed in
a spinlock protected ->freelist. Then, they will be allocated via
the ->alloc_pages callback, which sets refcount to 1.
Later, buffers would either be dropped by the net stack and recycled
back into page pool / released by ->release_page, or, more likely, get
transferred to the userspace by posting a corresponding CQE and
elevating refcount by IO_ZC_RX_UREF. When the user is done with a buffer,
it should be put into the refill ring.
Next time io_pp_zc_alloc_pages() runs it'll check the ring, put user
refs and ultimately grab buffers from there. That's done in the attached
napi context and so doesn't need any additional synchronisation. That is
the second hottest path after getting a buffer from the pp lockless cache.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring/net.h | 5 +
include/net/page_pool/types.h | 1 +
io_uring/zc_rx.c | 202 ++++++++++++++++++++++++++++++++++
io_uring/zc_rx.h | 5 +
net/core/page_pool.c | 2 +-
5 files changed, 214 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
index 05d5a6a97264..a225d7090b6b 100644
--- a/include/linux/io_uring/net.h
+++ b/include/linux/io_uring/net.h
@@ -12,6 +12,11 @@ struct io_zc_rx_buf {
};
#if defined(CONFIG_IO_URING)
+
+#if defined(CONFIG_PAGE_POOL)
+extern const struct memory_provider_ops io_uring_pp_zc_ops;
+#endif
+
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
#else
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 347837b83d36..9e91f2cdbe61 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -227,6 +227,7 @@ netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
struct page_pool *page_pool_create(const struct page_pool_params *params);
struct page_pool *page_pool_create_percpu(const struct page_pool_params *params,
int cpuid);
+void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
struct xdp_mem_info;
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 326ae3fcc643..b2507df121fb 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -8,6 +8,7 @@
#include <linux/nospec.h>
#include <net/tcp.h>
#include <net/af_unix.h>
+#include <trace/events/page_pool.h>
#include <uapi/linux/io_uring.h>
@@ -357,4 +358,205 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
return 0;
}
+static inline struct io_zc_rx_buf *io_niov_to_buf(struct net_iov *niov)
+{
+ return container_of(niov, struct io_zc_rx_buf, niov);
+}
+
+static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool,
+ struct io_zc_rx_buf *buf)
+{
+ return buf - pool->bufs;
+}
+
+static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
+{
+ atomic_long_add(IO_ZC_RX_UREF, &buf->niov.pp_ref_count);
+}
+
+static bool io_zc_rx_buf_put(struct io_zc_rx_buf *buf, int nr)
+{
+ return atomic_long_sub_and_test(nr, &buf->niov.pp_ref_count);
+}
+
+static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf)
+{
+ if (atomic_long_read(&buf->niov.pp_ref_count) < IO_ZC_RX_UREF)
+ return false;
+
+ return io_zc_rx_buf_put(buf, IO_ZC_RX_UREF);
+}
+
+static inline netmem_ref io_zc_buf_to_netmem(struct io_zc_rx_buf *buf)
+{
+ return net_iov_to_netmem(&buf->niov);
+}
+
+static inline void io_zc_add_pp_cache(struct page_pool *pp,
+ struct io_zc_rx_buf *buf)
+{
+ netmem_ref netmem = io_zc_buf_to_netmem(buf);
+
+ page_pool_set_pp_info(pp, netmem);
+ pp->alloc.cache[pp->alloc.count++] = netmem;
+}
+
+static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq)
+{
+ u32 entries;
+
+ entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
+ return min(entries, ifq->rq_entries);
+}
+
+static void io_zc_rx_ring_refill(struct page_pool *pp,
+ struct io_zc_rx_ifq *ifq)
+{
+ unsigned int entries = io_zc_rx_rqring_entries(ifq);
+ unsigned int mask = ifq->rq_entries - 1;
+ struct io_zc_rx_pool *pool = ifq->pool;
+
+ if (unlikely(!entries))
+ return;
+
+ while (entries--) {
+ unsigned int rq_idx = ifq->cached_rq_head++ & mask;
+ struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx];
+ u32 pgid = rqe->off / PAGE_SIZE;
+ struct io_zc_rx_buf *buf = &pool->bufs[pgid];
+
+ if (!io_zc_rx_put_buf_uref(buf))
+ continue;
+ io_zc_add_pp_cache(pp, buf);
+ if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL)
+ break;
+ }
+ smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
+}
+
+static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq)
+{
+ struct io_zc_rx_pool *pool = ifq->pool;
+
+ spin_lock_bh(&pool->freelist_lock);
+ while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+ struct io_zc_rx_buf *buf;
+ u32 pgid;
+
+ pgid = pool->freelist[--pool->free_count];
+ buf = &pool->bufs[pgid];
+
+ io_zc_add_pp_cache(pp, buf);
+ pp->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pp, io_zc_buf_to_netmem(buf),
+ pp->pages_state_hold_cnt);
+ }
+ spin_unlock_bh(&pool->freelist_lock);
+}
+
+static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool,
+ struct io_zc_rx_buf *buf)
+{
+ spin_lock_bh(&pool->freelist_lock);
+ pool->freelist[pool->free_count++] = io_buf_pgid(pool, buf);
+ spin_unlock_bh(&pool->freelist_lock);
+}
+
+static netmem_ref io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp)
+{
+ struct io_zc_rx_ifq *ifq = pp->mp_priv;
+
+ /* pp should already be ensuring that */
+ if (unlikely(pp->alloc.count))
+ goto out_return;
+
+ io_zc_rx_ring_refill(pp, ifq);
+ if (likely(pp->alloc.count))
+ goto out_return;
+
+ io_zc_rx_refill_slow(pp, ifq);
+ if (!pp->alloc.count)
+ return 0;
+out_return:
+ return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_page(struct page_pool *pp, netmem_ref netmem)
+{
+ struct io_zc_rx_ifq *ifq = pp->mp_priv;
+ struct io_zc_rx_buf *buf;
+ struct net_iov *niov;
+
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+ return false;
+
+ niov = netmem_to_net_iov(netmem);
+ buf = io_niov_to_buf(niov);
+
+ if (io_zc_rx_buf_put(buf, 1))
+ io_zc_rx_recycle_buf(ifq->pool, buf);
+ return false;
+}
+
+static void io_pp_zc_scrub(struct page_pool *pp)
+{
+ struct io_zc_rx_ifq *ifq = pp->mp_priv;
+ struct io_zc_rx_pool *pool = ifq->pool;
+ int i;
+
+ for (i = 0; i < pool->nr_bufs; i++) {
+ struct io_zc_rx_buf *buf = &pool->bufs[i];
+ int count;
+
+ if (!io_zc_rx_put_buf_uref(buf))
+ continue;
+ io_zc_rx_recycle_buf(pool, buf);
+
+ count = atomic_inc_return_relaxed(&pp->pages_state_release_cnt);
+ trace_page_pool_state_release(pp, io_zc_buf_to_netmem(buf), count);
+ }
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+ struct io_zc_rx_ifq *ifq = pp->mp_priv;
+
+ if (!ifq)
+ return -EINVAL;
+ if (pp->p.order != 0)
+ return -EINVAL;
+ if (!pp->p.napi)
+ return -EINVAL;
+ if (pp->p.flags & PP_FLAG_DMA_MAP)
+ return -EOPNOTSUPP;
+ if (pp->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ return -EOPNOTSUPP;
+
+ percpu_ref_get(&ifq->ctx->refs);
+ ifq->pp = pp;
+ return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+ struct io_zc_rx_ifq *ifq = pp->mp_priv;
+ struct io_zc_rx_pool *pool = ifq->pool;
+
+ ifq->pp = NULL;
+
+ if (WARN_ON_ONCE(pool->free_count != pool->nr_bufs))
+ return;
+ percpu_ref_put(&ifq->ctx->refs);
+}
+
+const struct memory_provider_ops io_uring_pp_zc_ops = {
+ .alloc_pages = io_pp_zc_alloc_pages,
+ .release_page = io_pp_zc_release_page,
+ .init = io_pp_zc_init,
+ .destroy = io_pp_zc_destroy,
+ .scrub = io_pp_zc_scrub,
+};
+EXPORT_SYMBOL(io_uring_pp_zc_ops);
+
+
#endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index 466b2b8f9813..c02bf8cabc6c 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -10,6 +10,9 @@
#define IO_ZC_IFQ_IDX_OFFSET 16
#define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1)
+#define IO_ZC_RX_UREF 0x10000
+#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1)
+
struct io_zc_rx_pool {
struct io_zc_rx_ifq *ifq;
struct io_zc_rx_buf *bufs;
@@ -26,10 +29,12 @@ struct io_zc_rx_ifq {
struct io_ring_ctx *ctx;
struct net_device *dev;
struct io_zc_rx_pool *pool;
+ struct page_pool *pp;
struct io_uring *rq_ring;
struct io_uring_rbuf_rqe *rqes;
u32 rq_entries;
+ u32 cached_rq_head;
/* hw rx descriptor ring id */
u32 if_rxq_id;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index fc92e551ed13..f83ddbb4ebd8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -460,7 +460,7 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
return false;
}
-static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
{
netmem_set_pp(netmem, pool);
netmem_or_pp_magic(netmem, PP_SIGNATURE);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 12/16] io_uring/zcrx: implement PP_FLAG_DMA_* handling
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (10 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 11/16] io_uring: implement pp memory provider for zc rx David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 13/16] io_uring: add io_recvzc request David Wei
` (3 subsequent siblings)
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
The patch implements support for PP_FLAG_DMA_MAP and
PP_FLAG_DMA_SYNC_DEV. Dma map buffers when creating a page pool if
needed, and unmap on tear down. Most of synching is done by page pool
apart from when we're grabbing buffers from the refill ring, in which
case we need to do it by hand.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zc_rx.c | 90 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 4 deletions(-)
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index b2507df121fb..4bd27eda4bc9 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -9,6 +9,7 @@
#include <net/tcp.h>
#include <net/af_unix.h>
#include <trace/events/page_pool.h>
+#include <net/page_pool/helpers.h>
#include <uapi/linux/io_uring.h>
@@ -72,6 +73,7 @@ static int io_zc_rx_init_buf(struct page *page, struct io_zc_rx_buf *buf)
{
memset(&buf->niov, 0, sizeof(buf->niov));
atomic_long_set(&buf->niov.pp_ref_count, 0);
+ page_pool_set_dma_addr_netmem(net_iov_to_netmem(&buf->niov), 0);
buf->page = page;
get_page(page);
@@ -392,12 +394,25 @@ static inline netmem_ref io_zc_buf_to_netmem(struct io_zc_rx_buf *buf)
return net_iov_to_netmem(&buf->niov);
}
+static inline void io_zc_sync_for_device(struct page_pool *pp,
+ netmem_ref netmem)
+{
+ if (pp->p.flags & PP_FLAG_DMA_SYNC_DEV) {
+ dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
+
+ dma_sync_single_range_for_device(pp->p.dev, dma_addr,
+ pp->p.offset, pp->p.max_len,
+ pp->p.dma_dir);
+ }
+}
+
static inline void io_zc_add_pp_cache(struct page_pool *pp,
struct io_zc_rx_buf *buf)
{
netmem_ref netmem = io_zc_buf_to_netmem(buf);
page_pool_set_pp_info(pp, netmem);
+ io_zc_sync_for_device(pp, netmem);
pp->alloc.cache[pp->alloc.count++] = netmem;
}
@@ -517,9 +532,71 @@ static void io_pp_zc_scrub(struct page_pool *pp)
}
}
+#define IO_PP_DMA_ATTRS (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
+
+static void io_pp_unmap_buf(struct io_zc_rx_buf *buf, struct page_pool *pp)
+{
+ netmem_ref netmem = net_iov_to_netmem(&buf->niov);
+ dma_addr_t dma = page_pool_get_dma_addr_netmem(netmem);
+
+ dma_unmap_page_attrs(pp->p.dev, dma, PAGE_SIZE << pp->p.order,
+ pp->p.dma_dir, IO_PP_DMA_ATTRS);
+ page_pool_set_dma_addr_netmem(netmem, 0);
+}
+
+static int io_pp_map_buf(struct io_zc_rx_buf *buf, struct page_pool *pp)
+{
+ netmem_ref netmem = net_iov_to_netmem(&buf->niov);
+ dma_addr_t dma_addr;
+ int ret;
+
+ dma_addr = dma_map_page_attrs(pp->p.dev, buf->page, 0,
+ PAGE_SIZE << pp->p.order, pp->p.dma_dir,
+ IO_PP_DMA_ATTRS);
+ ret = dma_mapping_error(pp->p.dev, dma_addr);
+ if (ret)
+ return ret;
+
+ if (WARN_ON_ONCE(page_pool_set_dma_addr_netmem(netmem, dma_addr))) {
+ dma_unmap_page_attrs(pp->p.dev, dma_addr,
+ PAGE_SIZE << pp->p.order, pp->p.dma_dir,
+ IO_PP_DMA_ATTRS);
+ return -EFAULT;
+ }
+
+ io_zc_sync_for_device(pp, netmem);
+ return 0;
+}
+
+static int io_pp_map_pool(struct io_zc_rx_pool *pool, struct page_pool *pp)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < pool->nr_bufs; i++) {
+ ret = io_pp_map_buf(&pool->bufs[i], pp);
+ if (ret)
+ break;
+ }
+
+ if (ret) {
+ while (i--)
+ io_pp_unmap_buf(&pool->bufs[i], pp);
+ }
+ return ret;
+}
+
+static void io_pp_unmap_pool(struct io_zc_rx_pool *pool, struct page_pool *pp)
+{
+ int i;
+
+ for (i = 0; i < pool->nr_bufs; i++)
+ io_pp_unmap_buf(&pool->bufs[i], pp);
+}
+
static int io_pp_zc_init(struct page_pool *pp)
{
struct io_zc_rx_ifq *ifq = pp->mp_priv;
+ int ret;
if (!ifq)
return -EINVAL;
@@ -527,10 +604,12 @@ static int io_pp_zc_init(struct page_pool *pp)
return -EINVAL;
if (!pp->p.napi)
return -EINVAL;
- if (pp->p.flags & PP_FLAG_DMA_MAP)
- return -EOPNOTSUPP;
- if (pp->p.flags & PP_FLAG_DMA_SYNC_DEV)
- return -EOPNOTSUPP;
+
+ if (pp->p.flags & PP_FLAG_DMA_MAP) {
+ ret = io_pp_map_pool(ifq->pool, pp);
+ if (ret)
+ return ret;
+ }
percpu_ref_get(&ifq->ctx->refs);
ifq->pp = pp;
@@ -542,6 +621,9 @@ static void io_pp_zc_destroy(struct page_pool *pp)
struct io_zc_rx_ifq *ifq = pp->mp_priv;
struct io_zc_rx_pool *pool = ifq->pool;
+ if (pp->p.flags & PP_FLAG_DMA_MAP)
+ io_pp_unmap_pool(ifq->pool, pp);
+
ifq->pp = NULL;
if (WARN_ON_ONCE(pool->free_count != pool->nr_bufs))
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (11 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 12/16] io_uring/zcrx: implement PP_FLAG_DMA_* handling David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-13 20:25 ` Jens Axboe
2024-03-12 21:44 ` [RFC PATCH v4 14/16] net: execute custom callback from napi David Wei
` (2 subsequent siblings)
15 siblings, 1 reply; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
Add an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that
is set up for ZC Rx. The request reads skbs from a socket. Completions
are posted into the main CQ for each page frag read.
Big CQEs (CQE32) are required as the OP_RECV_ZC specific metadata (ZC
region, offset, len) are stored in the extended 16 bytes as a
struct io_uring_rbuf_cqe.
For now there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data.
Multishot requests are also supported. The first time an io_recvzc
request completes, EAGAIN is returned which arms an async poll. Then, in
subsequent runs in task work, IOU_ISSUE_SKIP_COMPLETE is returned to
continue async polling.
Signed-off-by: David Wei <[email protected]>
---
include/uapi/linux/io_uring.h | 1 +
io_uring/io_uring.h | 10 ++
io_uring/net.c | 94 +++++++++++++++++-
io_uring/opdef.c | 16 +++
io_uring/zc_rx.c | 177 +++++++++++++++++++++++++++++++++-
io_uring/zc_rx.h | 11 +++
6 files changed, 302 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 26e945e6258d..ad2ec60b0390 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -256,6 +256,7 @@ enum io_uring_op {
IORING_OP_FUTEX_WAITV,
IORING_OP_FIXED_FD_INSTALL,
IORING_OP_FTRUNCATE,
+ IORING_OP_RECV_ZC,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 6426ee382276..cd1b3da96f62 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -180,6 +180,16 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
return io_get_cqe_overflow(ctx, ret, false);
}
+static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
+ struct io_uring_cqe **cqe_ret)
+{
+ io_lockdep_assert_cq_locked(ctx);
+
+ ctx->cq_extra++;
+ ctx->submit_state.flush_cqes = true;
+ return io_get_cqe(ctx, cqe_ret);
+}
+
static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
struct io_kiocb *req)
{
diff --git a/io_uring/net.c b/io_uring/net.c
index 1fa7c1fa6b5d..56172335387e 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -79,6 +79,12 @@ struct io_sr_msg {
*/
#define MULTISHOT_MAX_RETRY 32
+struct io_recvzc {
+ struct file *file;
+ unsigned msg_flags;
+ u16 flags;
+};
+
static inline bool io_check_multishot(struct io_kiocb *req,
unsigned int issue_flags)
{
@@ -695,7 +701,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
unsigned int cflags;
cflags = io_put_kbuf(req, issue_flags);
- if (msg->msg_inq && msg->msg_inq != -1)
+ if (msg && msg->msg_inq && msg->msg_inq != -1)
cflags |= IORING_CQE_F_SOCK_NONEMPTY;
if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
@@ -723,7 +729,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
goto enobufs;
/* Known not-empty or unknown state, retry */
- if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
+ if (cflags & IORING_CQE_F_SOCK_NONEMPTY || (msg && msg->msg_inq == -1)) {
if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
return false;
/* mshot retries exceeded, force a requeue */
@@ -1034,9 +1040,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
return ret;
}
-static __maybe_unused
-struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
- struct socket *sock)
+static struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
+ struct socket *sock)
{
unsigned token = READ_ONCE(sock->zc_rx_idx);
unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET;
@@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
return ifq;
}
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+
+ /* non-iopoll defer_taskrun only */
+ if (!req->ctx->task_complete)
+ return -EINVAL;
+ if (unlikely(sqe->file_index || sqe->addr2))
+ return -EINVAL;
+ if (READ_ONCE(sqe->len) || READ_ONCE(sqe->addr3))
+ return -EINVAL;
+
+ zc->flags = READ_ONCE(sqe->ioprio);
+ zc->msg_flags = READ_ONCE(sqe->msg_flags);
+
+ if (zc->msg_flags)
+ return -EINVAL;
+ if (zc->flags & ~RECVMSG_FLAGS)
+ return -EINVAL;
+ if (zc->flags & IORING_RECV_MULTISHOT)
+ req->flags |= REQ_F_APOLL_MULTISHOT;
+#ifdef CONFIG_COMPAT
+ if (req->ctx->compat)
+ zc->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+ return 0;
+}
+
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ struct io_zc_rx_ifq *ifq;
+ struct socket *sock;
+ int ret;
+
+ /*
+ * We're posting CQEs deeper in the stack, and to avoid taking CQ locks
+ * we serialise by having only the master thread modifying the CQ with
+ * DEFER_TASkRUN checked earlier and forbidding executing it from io-wq.
+ * That's similar to io_check_multishot() for multishot CQEs.
+ */
+ if (issue_flags & IO_URING_F_IOWQ)
+ return -EAGAIN;
+ if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_NONBLOCK)))
+ return -EAGAIN;
+ if (!(req->flags & REQ_F_POLLED) &&
+ (zc->flags & IORING_RECVSEND_POLL_FIRST))
+ return -EAGAIN;
+
+ sock = sock_from_file(req->file);
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+ ifq = io_zc_verify_sock(req, sock);
+ if (!ifq)
+ return -EINVAL;
+
+ ret = io_zc_rx_recv(req, ifq, sock, zc->msg_flags | MSG_DONTWAIT);
+ if (unlikely(ret <= 0)) {
+ if (ret == -EAGAIN) {
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_ISSUE_SKIP_COMPLETE;
+ return -EAGAIN;
+ }
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+
+ req_set_fail(req);
+ io_req_set_res(req, ret, 0);
+
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_STOP_MULTISHOT;
+ return IOU_OK;
+ }
+
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_ISSUE_SKIP_COMPLETE;
+ return -EAGAIN;
+}
+
void io_send_zc_cleanup(struct io_kiocb *req)
{
struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 9c080aadc5a6..78ec5197917e 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -36,6 +36,7 @@
#include "waitid.h"
#include "futex.h"
#include "truncate.h"
+#include "zc_rx.h"
static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
{
@@ -481,6 +482,18 @@ const struct io_issue_def io_issue_defs[] = {
.prep = io_ftruncate_prep,
.issue = io_ftruncate,
},
+ [IORING_OP_RECV_ZC] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollin = 1,
+ .ioprio = 1,
+#if defined(CONFIG_NET)
+ .prep = io_recvzc_prep,
+ .issue = io_recvzc,
+#else
+ .prep = io_eopnotsupp_prep,
+#endif
+ },
};
const struct io_cold_def io_cold_defs[] = {
@@ -722,6 +735,9 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_FTRUNCATE] = {
.name = "FTRUNCATE",
},
+ [IORING_OP_RECV_ZC] = {
+ .name = "RECV_ZC",
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index 4bd27eda4bc9..bb9251111735 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -6,10 +6,12 @@
#include <linux/io_uring.h>
#include <linux/netdevice.h>
#include <linux/nospec.h>
+
+#include <net/page_pool/helpers.h>
#include <net/tcp.h>
#include <net/af_unix.h>
+
#include <trace/events/page_pool.h>
-#include <net/page_pool/helpers.h>
#include <uapi/linux/io_uring.h>
@@ -18,6 +20,12 @@
#include "zc_rx.h"
#include "rsrc.h"
+struct io_zc_rx_args {
+ struct io_kiocb *req;
+ struct io_zc_rx_ifq *ifq;
+ struct socket *sock;
+};
+
typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq,
@@ -371,7 +379,7 @@ static inline unsigned io_buf_pgid(struct io_zc_rx_pool *pool,
return buf - pool->bufs;
}
-static __maybe_unused void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
+static void io_zc_rx_get_buf_uref(struct io_zc_rx_buf *buf)
{
atomic_long_add(IO_ZC_RX_UREF, &buf->niov.pp_ref_count);
}
@@ -640,5 +648,170 @@ const struct memory_provider_ops io_uring_pp_zc_ops = {
};
EXPORT_SYMBOL(io_uring_pp_zc_ops);
+static bool zc_rx_queue_cqe(struct io_kiocb *req, struct io_zc_rx_buf *buf,
+ struct io_zc_rx_ifq *ifq, int off, int len)
+{
+ struct io_uring_rbuf_cqe *rcqe;
+ struct io_uring_cqe *cqe;
+
+ if (!io_defer_get_uncommited_cqe(req->ctx, &cqe))
+ return false;
+
+ cqe->user_data = req->cqe.user_data;
+ cqe->res = 0;
+ cqe->flags = IORING_CQE_F_MORE;
+
+ rcqe = (struct io_uring_rbuf_cqe *)(cqe + 1);
+ rcqe->region = 0;
+ rcqe->off = io_buf_pgid(ifq->pool, buf) * PAGE_SIZE + off;
+ rcqe->len = len;
+ memset(rcqe->__pad, 0, sizeof(rcqe->__pad));
+ return true;
+}
+
+static int zc_rx_recv_frag(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
+ const skb_frag_t *frag, int off, int len)
+{
+ off += skb_frag_off(frag);
+
+ if (likely(skb_frag_is_net_iov(frag))) {
+ struct io_zc_rx_buf *buf;
+ struct net_iov *niov;
+
+ niov = netmem_to_net_iov(frag->netmem);
+ if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
+ niov->pp->mp_priv != ifq)
+ return -EFAULT;
+
+ buf = io_niov_to_buf(niov);
+ if (!zc_rx_queue_cqe(req, buf, ifq, off, len))
+ return -ENOSPC;
+ io_zc_rx_get_buf_uref(buf);
+ } else {
+ return -EOPNOTSUPP;
+ }
+
+ return len;
+}
+
+static int
+zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+ unsigned int offset, size_t len)
+{
+ struct io_zc_rx_args *args = desc->arg.data;
+ struct io_zc_rx_ifq *ifq = args->ifq;
+ struct io_kiocb *req = args->req;
+ struct sk_buff *frag_iter;
+ unsigned start, start_off;
+ int i, copy, end, off;
+ int ret = 0;
+
+ start = skb_headlen(skb);
+ start_off = offset;
+
+ if (offset < start)
+ return -EOPNOTSUPP;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ const skb_frag_t *frag;
+
+ if (WARN_ON(start > offset + len))
+ return -EFAULT;
+
+ frag = &skb_shinfo(skb)->frags[i];
+ end = start + skb_frag_size(frag);
+
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = zc_rx_recv_frag(req, ifq, frag, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+ skb_walk_frags(skb, frag_iter) {
+ if (WARN_ON(start > offset + len))
+ return -EFAULT;
+
+ end = start + frag_iter->len;
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = zc_rx_recv_skb(desc, frag_iter, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+out:
+ if (offset == start_off)
+ return ret;
+ return offset - start_off;
+}
+
+static int io_zc_rx_tcp_recvmsg(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
+ struct sock *sk, int flags)
+{
+ struct io_zc_rx_args args = {
+ .req = req,
+ .ifq = ifq,
+ .sock = sk->sk_socket,
+ };
+ read_descriptor_t rd_desc = {
+ .count = 1,
+ .arg.data = &args,
+ };
+ int ret;
+
+ lock_sock(sk);
+ ret = tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb);
+ if (ret <= 0) {
+ if (ret < 0 || sock_flag(sk, SOCK_DONE))
+ goto out;
+ if (sk->sk_err)
+ ret = sock_error(sk);
+ else if (sk->sk_shutdown & RCV_SHUTDOWN)
+ goto out;
+ else if (sk->sk_state == TCP_CLOSE)
+ ret = -ENOTCONN;
+ else
+ ret = -EAGAIN;
+ }
+out:
+ release_sock(sk);
+ return ret;
+}
+
+int io_zc_rx_recv(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
+ struct socket *sock, unsigned int flags)
+{
+ struct sock *sk = sock->sk;
+ const struct proto *prot = READ_ONCE(sk->sk_prot);
+
+ if (prot->recvmsg != tcp_recvmsg)
+ return -EPROTONOSUPPORT;
+
+ sock_rps_record_flow(sk);
+ return io_zc_rx_tcp_recvmsg(req, ifq, sk, flags);
+}
#endif
diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h
index c02bf8cabc6c..c14ea3cf544a 100644
--- a/io_uring/zc_rx.h
+++ b/io_uring/zc_rx.h
@@ -50,6 +50,8 @@ void io_unregister_zc_rx_ifqs(struct io_ring_ctx *ctx);
void io_shutdown_zc_rx_ifqs(struct io_ring_ctx *ctx);
int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
struct io_uring_zc_rx_sock_reg __user *arg);
+int io_zc_rx_recv(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
+ struct socket *sock, unsigned int flags);
#else
static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zc_rx_ifq_reg __user *arg)
@@ -67,6 +69,15 @@ static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx,
{
return -EOPNOTSUPP;
}
+
+static inline int io_zc_rx_recv(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
+ struct socket *sock, unsigned int flags)
+{
+ return -EOPNOTSUPP;
+}
#endif
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+
#endif
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-12 21:44 ` [RFC PATCH v4 13/16] io_uring: add io_recvzc request David Wei
@ 2024-03-13 20:25 ` Jens Axboe
2024-03-13 20:26 ` Pavel Begunkov
0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2024-03-13 20:25 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/12/24 3:44 PM, David Wei wrote:
> Add an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that
> is set up for ZC Rx. The request reads skbs from a socket. Completions
> are posted into the main CQ for each page frag read.
>
> Big CQEs (CQE32) are required as the OP_RECV_ZC specific metadata (ZC
> region, offset, len) are stored in the extended 16 bytes as a
> struct io_uring_rbuf_cqe.
>
> For now there is no limit as to how much work each OP_RECV_ZC request
> does. It will attempt to drain a socket of all available data.
>
> Multishot requests are also supported. The first time an io_recvzc
> request completes, EAGAIN is returned which arms an async poll. Then, in
> subsequent runs in task work, IOU_ISSUE_SKIP_COMPLETE is returned to
> continue async polling.
I'd probably drop that last paragraph, this is how all multishot
requests work and is implementation detail that need not go in the
commit message. Probably suffices just to say it supports multishot.
> @@ -695,7 +701,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
> unsigned int cflags;
>
> cflags = io_put_kbuf(req, issue_flags);
> - if (msg->msg_inq && msg->msg_inq != -1)
> + if (msg && msg->msg_inq && msg->msg_inq != -1)
> cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>
> if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
> @@ -723,7 +729,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
> goto enobufs;
>
> /* Known not-empty or unknown state, retry */
> - if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
> + if (cflags & IORING_CQE_F_SOCK_NONEMPTY || (msg && msg->msg_inq == -1)) {
> if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
> return false;
> /* mshot retries exceeded, force a requeue */
Maybe refactor this a bit so that you don't need to add these NULL
checks? That seems pretty fragile, hard to read, and should be doable
without extra checks.
> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
> return ifq;
> }
>
> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> +
> + /* non-iopoll defer_taskrun only */
> + if (!req->ctx->task_complete)
> + return -EINVAL;
What's the reasoning behind this?
> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> + struct io_zc_rx_ifq *ifq;
> + struct socket *sock;
> + int ret;
> +
> + /*
> + * We're posting CQEs deeper in the stack, and to avoid taking CQ locks
> + * we serialise by having only the master thread modifying the CQ with
> + * DEFER_TASKRUN checked earlier and forbidding executing it from io-wq.
> + * That's similar to io_check_multishot() for multishot CQEs.
> + */
> + if (issue_flags & IO_URING_F_IOWQ)
> + return -EAGAIN;
> + if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_NONBLOCK)))
> + return -EAGAIN;
If rebased on the current tree, does this go away?
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-13 20:25 ` Jens Axboe
@ 2024-03-13 20:26 ` Pavel Begunkov
2024-03-13 21:03 ` Jens Axboe
2024-03-14 16:14 ` Jens Axboe
0 siblings, 2 replies; 27+ messages in thread
From: Pavel Begunkov @ 2024-03-13 20:26 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/13/24 20:25, Jens Axboe wrote:
> On 3/12/24 3:44 PM, David Wei wrote:
>> Add an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that
>> is set up for ZC Rx. The request reads skbs from a socket. Completions
>> are posted into the main CQ for each page frag read.
>>
>> Big CQEs (CQE32) are required as the OP_RECV_ZC specific metadata (ZC
>> region, offset, len) are stored in the extended 16 bytes as a
>> struct io_uring_rbuf_cqe.
>>
>> For now there is no limit as to how much work each OP_RECV_ZC request
>> does. It will attempt to drain a socket of all available data.
>>
>> Multishot requests are also supported. The first time an io_recvzc
>> request completes, EAGAIN is returned which arms an async poll. Then, in
>> subsequent runs in task work, IOU_ISSUE_SKIP_COMPLETE is returned to
>> continue async polling.
>
> I'd probably drop that last paragraph, this is how all multishot
> requests work and is implementation detail that need not go in the
> commit message. Probably suffices just to say it supports multishot.
>
>> @@ -695,7 +701,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>> unsigned int cflags;
>>
>> cflags = io_put_kbuf(req, issue_flags);
>> - if (msg->msg_inq && msg->msg_inq != -1)
>> + if (msg && msg->msg_inq && msg->msg_inq != -1)
>> cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>
>> if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>> @@ -723,7 +729,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>> goto enobufs;
>>
>> /* Known not-empty or unknown state, retry */
>> - if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
>> + if (cflags & IORING_CQE_F_SOCK_NONEMPTY || (msg && msg->msg_inq == -1)) {
>> if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
>> return false;
>> /* mshot retries exceeded, force a requeue */
>
> Maybe refactor this a bit so that you don't need to add these NULL
> checks? That seems pretty fragile, hard to read, and should be doable
> without extra checks.
That chunk can be completely thrown away, we're not using
io_recv_finish() here anymore
>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>> return ifq;
>> }
>>
>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> +
>> + /* non-iopoll defer_taskrun only */
>> + if (!req->ctx->task_complete)
>> + return -EINVAL;
>
> What's the reasoning behind this?
CQ locking, see the comment a couple lines below
>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> + struct io_zc_rx_ifq *ifq;
>> + struct socket *sock;
>> + int ret;
>> +
>> + /*
>> + * We're posting CQEs deeper in the stack, and to avoid taking CQ locks
>> + * we serialise by having only the master thread modifying the CQ with
>> + * DEFER_TASKRUN checked earlier and forbidding executing it from io-wq.
>> + * That's similar to io_check_multishot() for multishot CQEs.
>> + */
This one ^^, though it doesn't read well, I should reword it for
next time.
>> + if (issue_flags & IO_URING_F_IOWQ)
>> + return -EAGAIN;
>> + if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_NONBLOCK)))
>> + return -EAGAIN;
>
> If rebased on the current tree, does this go away?
It's just a little behind not to have that change
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-13 20:26 ` Pavel Begunkov
@ 2024-03-13 21:03 ` Jens Axboe
2024-03-14 16:14 ` Jens Axboe
1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-03-13 21:03 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/13/24 2:26 PM, Pavel Begunkov wrote:
> On 3/13/24 20:25, Jens Axboe wrote:
>> On 3/12/24 3:44 PM, David Wei wrote:
>>> Add an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that
>>> is set up for ZC Rx. The request reads skbs from a socket. Completions
>>> are posted into the main CQ for each page frag read.
>>>
>>> Big CQEs (CQE32) are required as the OP_RECV_ZC specific metadata (ZC
>>> region, offset, len) are stored in the extended 16 bytes as a
>>> struct io_uring_rbuf_cqe.
>>>
>>> For now there is no limit as to how much work each OP_RECV_ZC request
>>> does. It will attempt to drain a socket of all available data.
>>>
>>> Multishot requests are also supported. The first time an io_recvzc
>>> request completes, EAGAIN is returned which arms an async poll. Then, in
>>> subsequent runs in task work, IOU_ISSUE_SKIP_COMPLETE is returned to
>>> continue async polling.
>>
>> I'd probably drop that last paragraph, this is how all multishot
>> requests work and is implementation detail that need not go in the
>> commit message. Probably suffices just to say it supports multishot.
>>
>>> @@ -695,7 +701,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>> unsigned int cflags;
>>> cflags = io_put_kbuf(req, issue_flags);
>>> - if (msg->msg_inq && msg->msg_inq != -1)
>>> + if (msg && msg->msg_inq && msg->msg_inq != -1)
>>> cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>> if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>>> @@ -723,7 +729,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>> goto enobufs;
>>> /* Known not-empty or unknown state, retry */
>>> - if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
>>> + if (cflags & IORING_CQE_F_SOCK_NONEMPTY || (msg && msg->msg_inq == -1)) {
>>> if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
>>> return false;
>>> /* mshot retries exceeded, force a requeue */
>>
>> Maybe refactor this a bit so that you don't need to add these NULL
>> checks? That seems pretty fragile, hard to read, and should be doable
>> without extra checks.
>
> That chunk can be completely thrown away, we're not using
> io_recv_finish() here anymore
>
>
>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>> return ifq;
>>> }
>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> +{
>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> +
>>> + /* non-iopoll defer_taskrun only */
>>> + if (!req->ctx->task_complete)
>>> + return -EINVAL;
>>
>> What's the reasoning behind this?
>
> CQ locking, see the comment a couple lines below
>
>
>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> + struct io_zc_rx_ifq *ifq;
>>> + struct socket *sock;
>>> + int ret;
>>> +
>>> + /*
>>> + * We're posting CQEs deeper in the stack, and to avoid taking CQ locks
>>> + * we serialise by having only the master thread modifying the CQ with
>>> + * DEFER_TASKRUN checked earlier and forbidding executing it from io-wq.
>>> + * That's similar to io_check_multishot() for multishot CQEs.
>>> + */
>
> This one ^^, though it doesn't read well, I should reword it for
> next time.
>
>>> + if (issue_flags & IO_URING_F_IOWQ)
>>> + return -EAGAIN;
>>> + if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_NONBLOCK)))
>>> + return -EAGAIN;
>>
>> If rebased on the current tree, does this go away?
>
> It's just a little behind not to have that change
>
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-13 20:26 ` Pavel Begunkov
2024-03-13 21:03 ` Jens Axboe
@ 2024-03-14 16:14 ` Jens Axboe
2024-03-15 17:34 ` Pavel Begunkov
1 sibling, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2024-03-14 16:14 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
(Apparently this went out without my comments attached, only one thing
worth noting so repeating that)
>>> @@ -695,7 +701,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>> unsigned int cflags;
>>> cflags = io_put_kbuf(req, issue_flags);
>>> - if (msg->msg_inq && msg->msg_inq != -1)
>>> + if (msg && msg->msg_inq && msg->msg_inq != -1)
>>> cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>> if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>>> @@ -723,7 +729,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>> goto enobufs;
>>> /* Known not-empty or unknown state, retry */
>>> - if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
>>> + if (cflags & IORING_CQE_F_SOCK_NONEMPTY || (msg && msg->msg_inq == -1)) {
>>> if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
>>> return false;
>>> /* mshot retries exceeded, force a requeue */
>>
>> Maybe refactor this a bit so that you don't need to add these NULL
>> checks? That seems pretty fragile, hard to read, and should be doable
>> without extra checks.
>
> That chunk can be completely thrown away, we're not using
> io_recv_finish() here anymore
OK good!
>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>> return ifq;
>>> }
>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> +{
>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> +
>>> + /* non-iopoll defer_taskrun only */
>>> + if (!req->ctx->task_complete)
>>> + return -EINVAL;
>>
>> What's the reasoning behind this?
>
> CQ locking, see the comment a couple lines below
My question here was more towards "is this something we want to do".
Maybe this is just a temporary work-around and it's nothing to discuss,
but I'm not sure we want to have opcodes only work on certain ring
setups.
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-14 16:14 ` Jens Axboe
@ 2024-03-15 17:34 ` Pavel Begunkov
2024-03-15 18:38 ` Jens Axboe
0 siblings, 1 reply; 27+ messages in thread
From: Pavel Begunkov @ 2024-03-15 17:34 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/14/24 16:14, Jens Axboe wrote:
[...]
>>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>> return ifq;
>>>> }
>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>> +{
>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>> +
>>>> + /* non-iopoll defer_taskrun only */
>>>> + if (!req->ctx->task_complete)
>>>> + return -EINVAL;
>>>
>>> What's the reasoning behind this?
>>
>> CQ locking, see the comment a couple lines below
>
> My question here was more towards "is this something we want to do".
> Maybe this is just a temporary work-around and it's nothing to discuss,
> but I'm not sure we want to have opcodes only work on certain ring
> setups.
I don't think it's that unreasonable restricting it. It's hard to
care about !DEFER_TASKRUN for net workloads, it makes CQE posting a bit
cleaner, and who knows where the single task part would become handy.
Thinking about ifq termination, which should better cancel and wait
for all corresponding zc requests, it's should be easier without
parallel threads. E.g. what if another thread is in the enter syscall
using ifq, or running task_work and not cancellable. Then apart
from (non-atomic) refcounting, we'd need to somehow wait for it,
doing wake ups on the zc side, and so on.
The CQ side is easy to support though, put conditional locking
around the posting like fill/post_cqe does with the todays
patchset.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-15 17:34 ` Pavel Begunkov
@ 2024-03-15 18:38 ` Jens Axboe
2024-03-15 23:52 ` Pavel Begunkov
0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2024-03-15 18:38 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/15/24 11:34 AM, Pavel Begunkov wrote:
> On 3/14/24 16:14, Jens Axboe wrote:
> [...]
>>>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>>> return ifq;
>>>>> }
>>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>> +{
>>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>>> +
>>>>> + /* non-iopoll defer_taskrun only */
>>>>> + if (!req->ctx->task_complete)
>>>>> + return -EINVAL;
>>>>
>>>> What's the reasoning behind this?
>>>
>>> CQ locking, see the comment a couple lines below
>>
>> My question here was more towards "is this something we want to do".
>> Maybe this is just a temporary work-around and it's nothing to discuss,
>> but I'm not sure we want to have opcodes only work on certain ring
>> setups.
>
> I don't think it's that unreasonable restricting it. It's hard to
> care about !DEFER_TASKRUN for net workloads, it makes CQE posting a bit
I think there's a distinction between "not reasonable to support because
it's complicated/impossible to do so", and "we prefer not to support
it". I agree, as a developer it's hard to care about !DEFER_TASKRUN for
networking workloads, but as a user, they will just setup a default
queue until they wise up. And maybe this can be a good thing in that
they'd be nudged toward DEFER_TASKRUN, but I can also see some head
scratching when something just returns (the worst of all error codes)
-EINVAL when they attempt to use it.
> cleaner, and who knows where the single task part would become handy.
But you can still take advantage of single task, since you know if
that's going to be true or not. It just can't be unconditional.
> Thinking about ifq termination, which should better cancel and wait
> for all corresponding zc requests, it's should be easier without
> parallel threads. E.g. what if another thread is in the enter syscall
> using ifq, or running task_work and not cancellable. Then apart
> from (non-atomic) refcounting, we'd need to somehow wait for it,
> doing wake ups on the zc side, and so on.
I don't know, not seeing a lot of strong arguments for making it
DEFER_TASKRUN only. My worry is that once we starting doing that, then
more will follow. And honestly I think that would be a shame.
For ifq termination, surely these things are referenced, and termination
would need to wait for the last reference to drop? And if that isn't an
expected condition (it should not be), then a percpu ref would suffice.
Nobody cares if the teardown side is more expensive, as long as the fast
path is efficient.
Dunno - anyway, for now let's just leave it as-is, it's just something
to consider once we get closer to a more finished patchset.
> The CQ side is easy to support though, put conditional locking
> around the posting like fill/post_cqe does with the todays
> patchset.
Yep, which is one of the reasons why I was hopeful this could go away!
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-15 18:38 ` Jens Axboe
@ 2024-03-15 23:52 ` Pavel Begunkov
2024-03-16 16:59 ` Jens Axboe
0 siblings, 1 reply; 27+ messages in thread
From: Pavel Begunkov @ 2024-03-15 23:52 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/15/24 18:38, Jens Axboe wrote:
> On 3/15/24 11:34 AM, Pavel Begunkov wrote:
>> On 3/14/24 16:14, Jens Axboe wrote:
>> [...]
>>>>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>>>> return ifq;
>>>>>> }
>>>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>> +{
>>>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>>>> +
>>>>>> + /* non-iopoll defer_taskrun only */
>>>>>> + if (!req->ctx->task_complete)
>>>>>> + return -EINVAL;
>>>>>
>>>>> What's the reasoning behind this?
>>>>
>>>> CQ locking, see the comment a couple lines below
>>>
>>> My question here was more towards "is this something we want to do".
>>> Maybe this is just a temporary work-around and it's nothing to discuss,
>>> but I'm not sure we want to have opcodes only work on certain ring
>>> setups.
>>
>> I don't think it's that unreasonable restricting it. It's hard to
>> care about !DEFER_TASKRUN for net workloads, it makes CQE posting a bit
>
> I think there's a distinction between "not reasonable to support because
> it's complicated/impossible to do so", and "we prefer not to support
> it". I agree, as a developer it's hard to care about !DEFER_TASKRUN for
> networking workloads, but as a user, they will just setup a default
> queue until they wise up. And maybe this can be a good thing in that
They'd still need to find a supported NIC and do all the other
setup; compared to that, it doesn't add much trouble. And my
usual argument is that io_uring is a low-level API: it's expected
that people interacting with it directly are experienced enough,
expect to spend some time making it right, and are likely library
devs.
> they'd be nudged toward DEFER_TASKRUN, but I can also see some head
> scratching when something just returns (the worst of all error codes)
> -EINVAL when they attempt to use it.
Yeah, we should try to find a better error code, and the check
should migrate to ifq registration.
>> cleaner, and who knows where the single task part would become handy.
>
> But you can still take advantage of single task, since you know if
> that's going to be true or not. It just can't be unconditional.
>
>> Thinking about ifq termination, which should better cancel and wait
>> for all corresponding zc requests, it's should be easier without
>> parallel threads. E.g. what if another thread is in the enter syscall
>> using ifq, or running task_work and not cancellable. Then apart
>> from (non-atomic) refcounting, we'd need to somehow wait for it,
>> doing wake ups on the zc side, and so on.
>
> I don't know, not seeing a lot of strong arguments for making it
> DEFER_TASKRUN only. My worry is that once we starting doing that, then
> more will follow. And honestly I think that would be a shame.
>
> For ifq termination, surely these things are referenced, and termination
> would need to wait for the last reference to drop? And if that isn't an
> expected condition (it should not be), then a percpu ref would suffice.
> Nobody cares if the teardown side is more expensive, as long as the fast
> path is efficient.
You can solve any of that, it's true, the question how much crap
you'd need to add in hot paths and diffstat wise. Just take a look
at what a nice function io_recvmsg() is together with its helpers
like io_recvmsg_multishot().
The biggest concern is optimisations and quirks that we can't
predict at the moment. DEFER_TASKRUN/SINGLE_ISSUER provide a simpler
model, I'd rather keep recvzc simple than having tens of conditional
optimisations with different execution flavours and contexts.
Especially, since it can be implemented later, wouldn't work the
other way around.
> Dunno - anyway, for now let's just leave it as-is, it's just something
> to consider once we get closer to a more finished patchset.
>
>> The CQ side is easy to support though, put conditional locking
>> around the posting like fill/post_cqe does with the todays
>> patchset.
>
> Yep, which is one of the reasons why I was hopeful this could go away!
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-15 23:52 ` Pavel Begunkov
@ 2024-03-16 16:59 ` Jens Axboe
2024-03-17 21:22 ` Pavel Begunkov
0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2024-03-16 16:59 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/15/24 5:52 PM, Pavel Begunkov wrote:
> On 3/15/24 18:38, Jens Axboe wrote:
>> On 3/15/24 11:34 AM, Pavel Begunkov wrote:
>>> On 3/14/24 16:14, Jens Axboe wrote:
>>> [...]
>>>>>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>>>>> return ifq;
>>>>>>> }
>>>>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> +{
>>>>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>>>>> +
>>>>>>> + /* non-iopoll defer_taskrun only */
>>>>>>> + if (!req->ctx->task_complete)
>>>>>>> + return -EINVAL;
>>>>>>
>>>>>> What's the reasoning behind this?
>>>>>
>>>>> CQ locking, see the comment a couple lines below
>>>>
>>>> My question here was more towards "is this something we want to do".
>>>> Maybe this is just a temporary work-around and it's nothing to discuss,
>>>> but I'm not sure we want to have opcodes only work on certain ring
>>>> setups.
>>>
>>> I don't think it's that unreasonable restricting it. It's hard to
>>> care about !DEFER_TASKRUN for net workloads, it makes CQE posting a bit
>>
>> I think there's a distinction between "not reasonable to support because
>> it's complicated/impossible to do so", and "we prefer not to support
>> it". I agree, as a developer it's hard to care about !DEFER_TASKRUN for
>> networking workloads, but as a user, they will just setup a default
>> queue until they wise up. And maybe this can be a good thing in that
>
> They'd still need to find a supported NIC and do all the other
> setup, comparably to that it doesn't add much trouble. And my
Hopefully down the line, it'll work on more NICs, and configuration will
be less of a nightmare than it is now.
> usual argument is that io_uring is a low-level api, it's expected
> that people interacting with it directly are experienced enough,
> expect to spend some time to make it right and likely library
> devs.
Have you seen some of the code that has gone in to libraries for
io_uring support? I have, and I don't think that statement is true at
all for that side.
It should work out of the box even with a naive approach, while the best
approach may require some knowledge. At least I think that's the sanest
stance on that.
>> they'd be nudged toward DEFER_TASKRUN, but I can also see some head
>> scratching when something just returns (the worst of all error codes)
>> -EINVAL when they attempt to use it.
>
> Yeah, we should try to find a better error code, and the check
> should migrate to ifq registration.
Wasn't really a jab at the code in question, just more that -EINVAL is
the ubiquitous error code for all kinds of things and it's hard to
diagnose in general for a user. You just have to start guessing...
>>> cleaner, and who knows where the single task part would become handy.
>>
>> But you can still take advantage of single task, since you know if
>> that's going to be true or not. It just can't be unconditional.
>>
>>> Thinking about ifq termination, which should better cancel and wait
>>> for all corresponding zc requests, it should be easier without
>>> parallel threads. E.g. what if another thread is in the enter syscall
>>> using ifq, or running task_work and not cancellable. Then apart
>>> from (non-atomic) refcounting, we'd need to somehow wait for it,
>>> doing wake ups on the zc side, and so on.
>>
>> I don't know, not seeing a lot of strong arguments for making it
>> DEFER_TASKRUN only. My worry is that once we start doing that, then
>> more will follow. And honestly I think that would be a shame.
>>
>> For ifq termination, surely these things are referenced, and termination
>> would need to wait for the last reference to drop? And if that isn't an
>> expected condition (it should not be), then a percpu ref would suffice.
>> Nobody cares if the teardown side is more expensive, as long as the fast
>> path is efficient.
>
> You can solve any of that, it's true, the question how much crap
> you'd need to add in hot paths and diffstat wise. Just take a look
> at what a nice function io_recvmsg() is together with its helpers
> like io_recvmsg_multishot().
That is true, and I guess my real question is "what would it look like
if we supported !DEFER_TASKRUN". Which I think is a valid question.
> The biggest concern is optimisations and quirks that we can't
> predict at the moment. DEFER_TASKRUN/SINGLE_ISSUER provide a simpler
> model, I'd rather keep recvzc simple than having tens of conditional
> optimisations with different execution flavours and contexts.
> Especially, since it can be implemented later, wouldn't work the
> other way around.
Yes me too, and I'd hate to have two variants just because of that. But
comparing to eg io_recv() and helpers, it's really not that bad. Hence
my question on how much would it take, and how nasty would it be, to
support !DEFER_TASKRUN.
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-16 16:59 ` Jens Axboe
@ 2024-03-17 21:22 ` Pavel Begunkov
2024-03-17 21:30 ` Jens Axboe
0 siblings, 1 reply; 27+ messages in thread
From: Pavel Begunkov @ 2024-03-17 21:22 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/16/24 16:59, Jens Axboe wrote:
> On 3/15/24 5:52 PM, Pavel Begunkov wrote:
>> On 3/15/24 18:38, Jens Axboe wrote:
>>> On 3/15/24 11:34 AM, Pavel Begunkov wrote:
>>>> On 3/14/24 16:14, Jens Axboe wrote:
>>>> [...]
>>>>>>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>>>>>> return ifq;
>>>>>>>> }
>>>>>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>>> +{
>>>>>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>>>>>> +
>>>>>>>> + /* non-iopoll defer_taskrun only */
>>>>>>>> + if (!req->ctx->task_complete)
>>>>>>>> + return -EINVAL;
>>>>>>>
>>>>>>> What's the reasoning behind this?
>>>>>>
>>>>>> CQ locking, see the comment a couple lines below
>>>>>
>>>>> My question here was more towards "is this something we want to do".
>>>>> Maybe this is just a temporary work-around and it's nothing to discuss,
>>>>> but I'm not sure we want to have opcodes only work on certain ring
>>>>> setups.
>>>>
>>>> I don't think it's that unreasonable restricting it. It's hard to
>>>> care about !DEFER_TASKRUN for net workloads, it makes CQE posting a bit
>>>
>>> I think there's a distinction between "not reasonable to support because
>>> it's complicated/impossible to do so", and "we prefer not to support
>>> it". I agree, as a developer it's hard to care about !DEFER_TASKRUN for
>>> networking workloads, but as a user, they will just setup a default
>>> queue until they wise up. And maybe this can be a good thing in that
>>
>> They'd still need to find a supported NIC and do all the other
>> setup, comparably to that it doesn't add much trouble. And my
>
> Hopefully down the line, it'll work on more NICs,
I wouldn't hope all necessary features will be seen in consumer
cards
> and configuration will be less of a nightmare than it is now.
I'm already assuming steering will be taken care by the kernel,
but you have to choose your nic, allocate an ifq, mmap a ring,
and then you're getting scattered chunks instead of
recv((void *)one_large_buffer);
My point is that it requires more involvement from user by design.
>> usual argument is that io_uring is a low-level api, it's expected
>> that people interacting with it directly are experienced enough,
>> expect to spend some time to make it right and likely library
>> devs.
>
> Have you seen some of the code that has gone in to libraries for
> io_uring support? I have, and I don't think that statement is true at
> all for that side.
Well, some implementations are crappy, some are ok, some are
learning and improving what they have.
>
> It should work out of the box even with a naive approach, while the best
> approach may require some knowledge. At least I think that's the sanest
> stance on that.
>
>>> they'd be nudged toward DEFER_TASKRUN, but I can also see some head
>>> scratching when something just returns (the worst of all error codes)
>>> -EINVAL when they attempt to use it.
>>
>> Yeah, we should try to find a better error code, and the check
>> should migrate to ifq registration.
>
> Wasn't really a jab at the code in question, just more that -EINVAL is
> the ubiquitous error code for all kinds of things and it's hard to
> diagnose in general for a user. You just have to start guessing...
>
>>>> cleaner, and who knows where the single task part would become handy.
>>>
>>> But you can still take advantage of single task, since you know if
>>> that's going to be true or not. It just can't be unconditional.
>>>
>>>> Thinking about ifq termination, which should better cancel and wait
>>>> for all corresponding zc requests, it should be easier without
>>>> parallel threads. E.g. what if another thread is in the enter syscall
>>>> using ifq, or running task_work and not cancellable. Then apart
>>>> from (non-atomic) refcounting, we'd need to somehow wait for it,
>>>> doing wake ups on the zc side, and so on.
>>>
>>> I don't know, not seeing a lot of strong arguments for making it
>>> DEFER_TASKRUN only. My worry is that once we start doing that, then
>>> more will follow. And honestly I think that would be a shame.
>>>
>>> For ifq termination, surely these things are referenced, and termination
>>> would need to wait for the last reference to drop? And if that isn't an
>>> expected condition (it should not be), then a percpu ref would suffice.
>>> Nobody cares if the teardown side is more expensive, as long as the fast
>>> path is efficient.
>>
>> You can solve any of that, it's true, the question how much crap
>> you'd need to add in hot paths and diffstat wise. Just take a look
>> at what a nice function io_recvmsg() is together with its helpers
>> like io_recvmsg_multishot().
>
> That is true, and I guess my real question is "what would it look like
> if we supported !DEFER_TASKRUN". Which I think is a valid question.
>
>> The biggest concern is optimisations and quirks that we can't
>> predict at the moment. DEFER_TASKRUN/SINGLE_ISSUER provide a simpler
>> model, I'd rather keep recvzc simple than having tens of conditional
>> optimisations with different execution flavours and contexts.
>> Especially, since it can be implemented later, wouldn't work the
>> other way around.
>
> Yes me too, and I'd hate to have two variants just because of that. But
> comparing to eg io_recv() and helpers, it's really not that bad. Hence
> my question on how much would it take, and how nasty would it be, to
> support !DEFER_TASKRUN.
It might look bearable... at first, but when it stops on that?
There will definitely be fixes and optimisations, whenever in my
mind it's something that is not even needed. I guess I'm too
traumatised by the amount of uapi binding features I wish I
could axe out and never see again.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH v4 13/16] io_uring: add io_recvzc request
2024-03-17 21:22 ` Pavel Begunkov
@ 2024-03-17 21:30 ` Jens Axboe
0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2024-03-17 21:30 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 3/17/24 3:22 PM, Pavel Begunkov wrote:
> On 3/16/24 16:59, Jens Axboe wrote:
>> On 3/15/24 5:52 PM, Pavel Begunkov wrote:
>>> On 3/15/24 18:38, Jens Axboe wrote:
>>>> On 3/15/24 11:34 AM, Pavel Begunkov wrote:
>>>>> On 3/14/24 16:14, Jens Axboe wrote:
>>>>> [...]
>>>>>>>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>>>>>>> return ifq;
>>>>>>>>> }
>>>>>>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>>>> +{
>>>>>>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>>>>>>> +
>>>>>>>>> + /* non-iopoll defer_taskrun only */
>>>>>>>>> + if (!req->ctx->task_complete)
>>>>>>>>> + return -EINVAL;
>>>>>>>>
>>>>>>>> What's the reasoning behind this?
>>>>>>>
>>>>>>> CQ locking, see the comment a couple lines below
>>>>>>
>>>>>> My question here was more towards "is this something we want to do".
>>>>>> Maybe this is just a temporary work-around and it's nothing to discuss,
>>>>>> but I'm not sure we want to have opcodes only work on certain ring
>>>>>> setups.
>>>>>
>>>>> I don't think it's that unreasonable restricting it. It's hard to
>>>>> care about !DEFER_TASKRUN for net workloads, it makes CQE posting a bit
>>>>
>>>> I think there's a distinction between "not reasonable to support because
>>>> it's complicated/impossible to do so", and "we prefer not to support
>>>> it". I agree, as a developer it's hard to care about !DEFER_TASKRUN for
>>>> networking workloads, but as a user, they will just setup a default
>>>> queue until they wise up. And maybe this can be a good thing in that
>>>
>>> They'd still need to find a supported NIC and do all the other
>>> setup, comparably to that it doesn't add much trouble. And my
>>
>> Hopefully down the line, it'll work on more NICs,
>
> I wouldn't hope all necessary features will be seen in consumer
> cards
Nah that would never be the case, but normal users aren't going to be
interested in zerocopy rx. If they are, then it's power users, and they
can pick an appropriate NIC rather than just rely on what's in their
laptop or desktop PC. But hopefully on the server production front,
there will be more NICs that support it. It's all driven by demand. If
it's a useful feature, then customers will ask for it.
>> and configuration will be less of a nightmare than it is now.
>
> I'm already assuming steering will be taken care by the kernel,
> but you have to choose your nic, allocate an ifq, mmap a ring,
> and then you're getting scattered chunks instead of
>
> recv((void *)one_large_buffer);
>
> My point is that it requires more involvement from user by design.
For sure, it's more complicated than non-zerocopy, that's unavoidable.
>>> usual argument is that io_uring is a low-level api, it's expected
>>> that people interacting with it directly are experienced enough,
>>> expect to spend some time to make it right and likely library
>>> devs.
>>
>> Have you seen some of the code that has gone in to libraries for
>> io_uring support? I have, and I don't think that statement is true at
>> all for that side.
>
> Well, some implementations are crappy, some are ok, some are
> learning and improving what they have.
Right, it wasn't a jab at them in general, it's natural to start
somewhere simple and then improve things as they go along. I don't
expect folks to have the level of knowledge of the internals that we do,
nor should they need to.
>> It should work out of the box even with a naive approach, while the best
>> approach may require some knowledge. At least I think that's the sanest
>> stance on that.
>>
>>>> they'd be nudged toward DEFER_TASKRUN, but I can also see some head
>>>> scratching when something just returns (the worst of all error codes)
>>>> -EINVAL when they attempt to use it.
>>>
>>> Yeah, we should try to find a better error code, and the check
>>> should migrate to ifq registration.
>>
>> Wasn't really a jab at the code in question, just more that -EINVAL is
>> the ubiquitous error code for all kinds of things and it's hard to
>> diagnose in general for a user. You just have to start guessing...
>>
>>>>> cleaner, and who knows where the single task part would become handy.
>>>>
>>>> But you can still take advantage of single task, since you know if
>>>> that's going to be true or not. It just can't be unconditional.
>>>>
>>>>> Thinking about ifq termination, which should better cancel and wait
>>>>> for all corresponding zc requests, it should be easier without
>>>>> parallel threads. E.g. what if another thread is in the enter syscall
>>>>> using ifq, or running task_work and not cancellable. Then apart
>>>>> from (non-atomic) refcounting, we'd need to somehow wait for it,
>>>>> doing wake ups on the zc side, and so on.
>>>>
>>>> I don't know, not seeing a lot of strong arguments for making it
>>>> DEFER_TASKRUN only. My worry is that once we start doing that, then
>>>> more will follow. And honestly I think that would be a shame.
>>>>
>>>> For ifq termination, surely these things are referenced, and termination
>>>> would need to wait for the last reference to drop? And if that isn't an
>>>> expected condition (it should not be), then a percpu ref would suffice.
>>>> Nobody cares if the teardown side is more expensive, as long as the fast
>>>> path is efficient.
>>>
>>> You can solve any of that, it's true, the question how much crap
>>> you'd need to add in hot paths and diffstat wise. Just take a look
>>> at what a nice function io_recvmsg() is together with its helpers
>>> like io_recvmsg_multishot().
>>
>> That is true, and I guess my real question is "what would it look like
>> if we supported !DEFER_TASKRUN". Which I think is a valid question.
>>
>>> The biggest concern is optimisations and quirks that we can't
>>> predict at the moment. DEFER_TASKRUN/SINGLE_ISSUER provide a simpler
>>> model, I'd rather keep recvzc simple than having tens of conditional
>>> optimisations with different execution flavours and contexts.
>>> Especially, since it can be implemented later, wouldn't work the
>>> other way around.
>>
>> Yes me too, and I'd hate to have two variants just because of that. But
>> comparing to eg io_recv() and helpers, it's really not that bad. Hence
>> my question on how much would it take, and how nasty would it be, to
>> support !DEFER_TASKRUN.
>
> It might look bearable... at first, but when it stops on that?
> There will definitely be fixes and optimisations, whenever in my
> mind it's something that is not even needed. I guess I'm too
> traumatised by the amount of uapi binding features I wish I
> could axe out and never see again.
But that's real world though, particularly for the kernel. We'd all love
to restart things from scratch, and sometimes that'd lead to something
better which then down the line inevitably you'd love to redo again.
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC PATCH v4 14/16] net: execute custom callback from napi
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (12 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 13/16] io_uring: add io_recvzc request David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 15/16] io_uring/zcrx: add copy fallback David Wei
2024-03-12 21:44 ` [RFC PATCH v4 16/16] veth: add support for io_uring zc rx David Wei
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Sometimes we want to access a napi protected resource from task
context like in the case of io_uring zc falling back to copy and
accessing the buffer ring. Add a helper function that allows to execute
a custom function from napi context by first stopping it similarly to
napi_busy_loop().
Experimental, needs much polishing and sharing bits with
napi_busy_loop().
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/busy_poll.h | 7 +++++++
net/core/dev.c | 46 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 9b09acac538e..9f4a40898118 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -47,6 +47,8 @@ bool sk_busy_loop_end(void *p, unsigned long start_time);
void napi_busy_loop(unsigned int napi_id,
bool (*loop_end)(void *, unsigned long),
void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void napi_execute(struct napi_struct *napi,
+ void (*cb)(void *), void *cb_arg);
void napi_busy_loop_rcu(unsigned int napi_id,
bool (*loop_end)(void *, unsigned long),
@@ -63,6 +65,11 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
}
+static inline void napi_execute(struct napi_struct *napi,
+ void (*cb)(void *), void *cb_arg)
+{
+}
+
#endif /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long busy_loop_current_time(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index 2096ff57685a..4de173667233 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6663,6 +6663,52 @@ void napi_busy_loop(unsigned int napi_id,
}
EXPORT_SYMBOL(napi_busy_loop);
+void napi_execute(struct napi_struct *napi,
+ void (*cb)(void *), void *cb_arg)
+{
+ bool done = false;
+ unsigned long val;
+ void *have_poll_lock = NULL;
+
+ rcu_read_lock();
+
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
+ for (;;) {
+ local_bh_disable();
+ val = READ_ONCE(napi->state);
+
+ /* If multiple threads are competing for this napi,
+ * we avoid dirtying napi->state as much as we can.
+ */
+ if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+ NAPIF_STATE_IN_BUSY_POLL))
+ goto restart;
+
+ if (cmpxchg(&napi->state, val,
+ val | NAPIF_STATE_IN_BUSY_POLL |
+ NAPIF_STATE_SCHED) != val)
+ goto restart;
+
+ have_poll_lock = netpoll_poll_lock(napi);
+ cb(cb_arg);
+ done = true;
+ gro_normal_list(napi);
+ local_bh_enable();
+ break;
+restart:
+ local_bh_enable();
+ if (unlikely(need_resched()))
+ break;
+ cpu_relax();
+ }
+ if (done)
+ busy_poll_stop(napi, have_poll_lock, false, 1);
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
+ rcu_read_unlock();
+}
+
#endif /* CONFIG_NET_RX_BUSY_POLL */
static void napi_hash_add(struct napi_struct *napi)
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 15/16] io_uring/zcrx: add copy fallback
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (13 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 14/16] net: execute custom callback from napi David Wei
@ 2024-03-12 21:44 ` David Wei
2024-03-12 21:44 ` [RFC PATCH v4 16/16] veth: add support for io_uring zc rx David Wei
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
Currently, if user fails to keep up with the network and doesn't refill
the buffer ring fast enough the NIC/driver will start dropping packets.
That might be too punishing. Add a fallback path, which would allow
drivers to allocate normal pages when there is starvation, then
zc_rx_recv_skb() we'll detect them and copy into the user specified
buffers, when they become available.
That should help with adoption and also help the user striking the right
balance allocating just the right amount of zerocopy buffers but also
being resilient to sudden surges in traffic.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zc_rx.c | 111 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 105 insertions(+), 6 deletions(-)
diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c
index bb9251111735..d5f49590e682 100644
--- a/io_uring/zc_rx.c
+++ b/io_uring/zc_rx.c
@@ -8,6 +8,7 @@
#include <linux/nospec.h>
#include <net/page_pool/helpers.h>
+#include <net/busy_poll.h>
#include <net/tcp.h>
#include <net/af_unix.h>
@@ -26,6 +27,11 @@ struct io_zc_rx_args {
struct socket *sock;
};
+struct io_zc_refill_data {
+ struct io_zc_rx_ifq *ifq;
+ struct io_zc_rx_buf *buf;
+};
+
typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq,
@@ -648,6 +654,34 @@ const struct memory_provider_ops io_uring_pp_zc_ops = {
};
EXPORT_SYMBOL(io_uring_pp_zc_ops);
+static void io_napi_refill(void *data)
+{
+ struct io_zc_refill_data *rd = data;
+ struct io_zc_rx_ifq *ifq = rd->ifq;
+ netmem_ref netmem;
+
+ if (WARN_ON_ONCE(!ifq->pp))
+ return;
+
+ netmem = page_pool_alloc_netmem(ifq->pp, GFP_ATOMIC | __GFP_NOWARN);
+ if (!netmem)
+ return;
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+ return;
+
+ rd->buf = io_niov_to_buf(netmem_to_net_iov(netmem));
+}
+
+static struct io_zc_rx_buf *io_zc_get_buf_task_safe(struct io_zc_rx_ifq *ifq)
+{
+ struct io_zc_refill_data rd = {
+ .ifq = ifq,
+ };
+
+ napi_execute(ifq->pp->p.napi, io_napi_refill, &rd);
+ return rd.buf;
+}
+
static bool zc_rx_queue_cqe(struct io_kiocb *req, struct io_zc_rx_buf *buf,
struct io_zc_rx_ifq *ifq, int off, int len)
{
@@ -669,6 +703,42 @@ static bool zc_rx_queue_cqe(struct io_kiocb *req, struct io_zc_rx_buf *buf,
return true;
}
+static ssize_t zc_rx_copy_chunk(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
+ void *data, unsigned int offset, size_t len)
+{
+ size_t copy_size, copied = 0;
+ struct io_zc_rx_buf *buf;
+ int ret = 0, off = 0;
+ u8 *vaddr;
+
+ do {
+ buf = io_zc_get_buf_task_safe(ifq);
+ if (!buf) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ vaddr = kmap_local_page(buf->page);
+ copy_size = min_t(size_t, PAGE_SIZE, len);
+ memcpy(vaddr, data + offset, copy_size);
+ kunmap_local(vaddr);
+
+ if (!zc_rx_queue_cqe(req, buf, ifq, off, copy_size)) {
+ napi_pp_put_page(net_iov_to_netmem(&buf->niov), false);
+ return -ENOSPC;
+ }
+
+ io_zc_rx_get_buf_uref(buf);
+ napi_pp_put_page(net_iov_to_netmem(&buf->niov), false);
+
+ offset += copy_size;
+ len -= copy_size;
+ copied += copy_size;
+ } while (offset < len);
+
+ return copied ? copied : ret;
+}
+
static int zc_rx_recv_frag(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
const skb_frag_t *frag, int off, int len)
{
@@ -688,7 +758,22 @@ static int zc_rx_recv_frag(struct io_kiocb *req, struct io_zc_rx_ifq *ifq,
return -ENOSPC;
io_zc_rx_get_buf_uref(buf);
} else {
- return -EOPNOTSUPP;
+ struct page *page = skb_frag_page(frag);
+ u32 p_off, p_len, t, copied = 0;
+ u8 *vaddr;
+ int ret = 0;
+
+ skb_frag_foreach_page(frag, off, len,
+ page, p_off, p_len, t) {
+ vaddr = kmap_local_page(page);
+ ret = zc_rx_copy_chunk(req, ifq, vaddr, p_off, p_len);
+ kunmap_local(vaddr);
+
+ if (ret < 0)
+ return copied ? copied : ret;
+ copied += ret;
+ }
+ len = copied;
}
return len;
@@ -702,15 +787,29 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
struct io_zc_rx_ifq *ifq = args->ifq;
struct io_kiocb *req = args->req;
struct sk_buff *frag_iter;
- unsigned start, start_off;
+ unsigned start, start_off = offset;
int i, copy, end, off;
int ret = 0;
- start = skb_headlen(skb);
- start_off = offset;
+ if (unlikely(offset < skb_headlen(skb))) {
+ ssize_t copied;
+ size_t to_copy;
- if (offset < start)
- return -EOPNOTSUPP;
+ to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+ copied = zc_rx_copy_chunk(req, ifq, skb->data, offset, to_copy);
+ if (copied < 0) {
+ ret = copied;
+ goto out;
+ }
+ offset += copied;
+ len -= copied;
+ if (!len)
+ goto out;
+ if (offset != skb_headlen(skb))
+ goto out;
+ }
+
+ start = skb_headlen(skb);
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
const skb_frag_t *frag;
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC PATCH v4 16/16] veth: add support for io_uring zc rx
2024-03-12 21:44 [RFC PATCH v4 00/16] Zero copy Rx using io_uring David Wei
` (14 preceding siblings ...)
2024-03-12 21:44 ` [RFC PATCH v4 15/16] io_uring/zcrx: add copy fallback David Wei
@ 2024-03-12 21:44 ` David Wei
15 siblings, 0 replies; 27+ messages in thread
From: David Wei @ 2024-03-12 21:44 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Not for upstream, testing only
Add io_uring zerocopy support for veth. It's not truly zerocopy, the
data is copied in napi, but that's early in the stack and so useful
for now for testing.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
drivers/net/veth.c | 214 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 208 insertions(+), 6 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 500b9dfccd08..b56e06113453 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -26,6 +26,8 @@
#include <linux/ptr_ring.h>
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
+#include <linux/io_uring/net.h>
+#include <net/netdev_rx_queue.h>
#include <net/page_pool/helpers.h>
#define DRV_NAME "veth"
@@ -67,6 +69,7 @@ struct veth_rq {
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
struct page_pool *page_pool;
+ struct netdev_rx_queue rq;
};
struct veth_priv {
@@ -75,6 +78,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ bool zc_installed;
};
struct veth_xdp_tx_bq {
@@ -335,9 +339,12 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
const struct net_device *rcv,
const struct sk_buff *skb)
{
+ struct veth_priv *rcv_priv = netdev_priv(rcv);
+
return !(dev->features & NETIF_F_ALL_TSO) ||
(skb->destructor == sock_wfree &&
- rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD));
+ rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)) ||
+ rcv_priv->zc_installed;
}
static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -726,6 +733,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
struct sk_buff *skb = *pskb;
u32 frame_sz;
+ if (WARN_ON_ONCE(1))
+ return -EFAULT;
+
if (skb_shared(skb) || skb_head_is_locked(skb) ||
skb_shinfo(skb)->nr_frags ||
skb_headroom(skb) < XDP_PACKET_HEADROOM) {
@@ -758,6 +768,90 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
return -ENOMEM;
}
+static noinline struct sk_buff *veth_iou_rcv_skb(struct veth_rq *rq,
+ struct sk_buff *skb)
+{
+ struct sk_buff *nskb;
+ u32 size, len, off, max_head_size;
+ struct page *page;
+ int ret, i, head_off;
+ void *vaddr;
+
+ /* Testing only, randomly send normal pages to test copy fallback */
+ if (ktime_get_ns() % 16 == 0)
+ return skb;
+
+ skb_prepare_for_gro(skb);
+ max_head_size = skb_headlen(skb);
+
+ rcu_read_lock();
+ nskb = napi_alloc_skb(&rq->xdp_napi, max_head_size);
+ if (!nskb)
+ goto drop;
+
+ skb_copy_header(nskb, skb);
+ skb_mark_for_recycle(nskb);
+
+ size = max_head_size;
+ if (skb_copy_bits(skb, 0, nskb->data, size)) {
+ consume_skb(nskb);
+ goto drop;
+ }
+ skb_put(nskb, size);
+ head_off = skb_headroom(nskb) - skb_headroom(skb);
+ skb_headers_offset_update(nskb, head_off);
+
+ /* Allocate paged area of new skb */
+ off = size;
+ len = skb->len - off;
+
+ for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
+ struct io_zc_rx_buf *buf;
+ netmem_ref netmem;
+
+ netmem = page_pool_alloc_netmem(rq->page_pool, GFP_ATOMIC | __GFP_NOWARN);
+ if (!netmem) {
+ consume_skb(nskb);
+ goto drop;
+ }
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem))) {
+ consume_skb(nskb);
+ goto drop;
+ }
+
+ buf = container_of(netmem_to_net_iov(netmem),
+ struct io_zc_rx_buf, niov);
+ page = buf->page;
+
+ if (WARN_ON_ONCE(buf->niov.pp != rq->page_pool))
+ goto drop;
+
+ size = min_t(u32, len, PAGE_SIZE);
+ skb_add_rx_frag_netmem(nskb, i, netmem, 0, size, PAGE_SIZE);
+
+ vaddr = kmap_atomic(page);
+ ret = skb_copy_bits(skb, off, vaddr, size);
+ kunmap_atomic(vaddr);
+
+ if (ret) {
+ consume_skb(nskb);
+ goto drop;
+ }
+ len -= size;
+ off += size;
+ }
+ rcu_read_unlock();
+
+ consume_skb(skb);
+ skb = nskb;
+ return skb;
+drop:
+ rcu_read_unlock();
+ kfree_skb(skb);
+ return NULL;
+}
+
+
static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
struct sk_buff *skb,
struct veth_xdp_tx_bq *bq,
@@ -901,8 +995,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
/* ndo_start_xmit */
struct sk_buff *skb = ptr;
- stats->xdp_bytes += skb->len;
- skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
+ if (rq->page_pool->mp_ops == &io_uring_pp_zc_ops) {
+ skb = veth_iou_rcv_skb(rq, skb);
+ } else {
+ stats->xdp_bytes += skb->len;
+ skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
+ }
+
if (skb) {
if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
netif_receive_skb(skb);
@@ -961,15 +1060,22 @@ static int veth_poll(struct napi_struct *napi, int budget)
return done;
}
-static int veth_create_page_pool(struct veth_rq *rq)
+static int veth_create_page_pool(struct veth_rq *rq, struct io_zc_rx_ifq *ifq)
{
struct page_pool_params pp_params = {
.order = 0,
.pool_size = VETH_RING_SIZE,
.nid = NUMA_NO_NODE,
.dev = &rq->dev->dev,
+ .napi = &rq->xdp_napi,
};
+ if (ifq) {
+ rq->rq.pp_private = ifq;
+ rq->rq.pp_ops = &io_uring_pp_zc_ops;
+ pp_params.queue = &rq->rq;
+ }
+
rq->page_pool = page_pool_create(&pp_params);
if (IS_ERR(rq->page_pool)) {
int err = PTR_ERR(rq->page_pool);
@@ -987,7 +1093,7 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
int err, i;
for (i = start; i < end; i++) {
- err = veth_create_page_pool(&priv->rq[i]);
+ err = veth_create_page_pool(&priv->rq[i], NULL);
if (err)
goto err_page_pool;
}
@@ -1043,9 +1149,17 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
for (i = start; i < end; i++) {
struct veth_rq *rq = &priv->rq[i];
+ void *ptr;
+ int nr = 0;
rq->rx_notify_masked = false;
- ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+
+ while ((ptr = ptr_ring_consume(&rq->xdp_ring))) {
+ veth_ptr_free(ptr);
+ nr++;
+ }
+
+ ptr_ring_cleanup(&rq->xdp_ring, NULL);
}
for (i = start; i < end; i++) {
@@ -1281,6 +1395,9 @@ static int veth_set_channels(struct net_device *dev,
struct net_device *peer;
int err;
+ if (priv->zc_installed)
+ return -EINVAL;
+
/* sanity check. Upper bounds are already enforced by the caller */
if (!ch->rx_count || !ch->tx_count)
return -EINVAL;
@@ -1358,6 +1475,8 @@ static int veth_open(struct net_device *dev)
struct net_device *peer = rtnl_dereference(priv->peer);
int err;
+ priv->zc_installed = false;
+
if (!peer)
return -ENOTCONN;
@@ -1536,6 +1655,84 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
rcu_read_unlock();
}
+/* Install (xdp->zc_rx.ifq != NULL) or remove (NULL) an io_uring
+ * zero-copy rx ifq on rx queue xdp->zc_rx.queue_id.  Must be called
+ * with RTNL held — see the veth_iou_set() wrapper.
+ */
+static int __veth_iou_set(struct net_device *dev,
+			  struct netdev_bpf *xdp)
+{
+	bool napi_already_on = veth_gro_requested(dev) && (dev->flags & IFF_UP);
+	unsigned qid = xdp->zc_rx.queue_id;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+	struct veth_rq *rq;
+	int ret;
+
+	/* Zero-copy rx is mutually exclusive with an attached XDP prog. */
+	if (priv->_xdp_prog)
+		return -EINVAL;
+	if (qid >= dev->real_num_rx_queues)
+		return -EINVAL;
+	if (!(dev->flags & IFF_UP))
+		return -EOPNOTSUPP;
+	/* Only single-queue devices are supported for now. */
+	if (dev->real_num_rx_queues != 1)
+		return -EINVAL;
+	rq = &priv->rq[qid];
+
+	if (!xdp->zc_rx.ifq) {
+		/* Removal request. */
+		if (!priv->zc_installed)
+			return -EINVAL;
+
+		veth_napi_del(dev);
+		priv->zc_installed = false;
+		if (!veth_gro_requested(dev) && netif_running(dev)) {
+			dev->features &= ~NETIF_F_GRO;
+			netdev_features_change(dev);
+		}
+		return 0;
+	}
+
+	if (priv->zc_installed)
+		return -EINVAL;
+
+	/* The peer can already be gone while the pair is being torn down;
+	 * dereferencing it unconditionally would oops (veth_open() makes
+	 * the same check and returns -ENOTCONN).
+	 */
+	peer = rtnl_dereference(priv->peer);
+	if (!peer)
+		return -ENOTCONN;
+	/* NOTE(review): presumably GSO is stripped because zc buffers
+	 * cannot carry software-GSO'ed super-packets — confirm; also not
+	 * restored on the removal path above.
+	 */
+	peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+
+	ret = veth_create_page_pool(rq, xdp->zc_rx.ifq);
+	if (ret)
+		return ret;
+
+	ret = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (ret) {
+		page_pool_destroy(rq->page_pool);
+		rq->page_pool = NULL;
+		/* Also undo the provider hooks veth_create_page_pool() set
+		 * on the rx queue, otherwise a retry would spuriously see
+		 * the queue as already bound (rxq->pp_ops check).
+		 */
+		rq->rq.pp_ops = NULL;
+		rq->rq.pp_private = NULL;
+		return ret;
+	}
+
+	priv->zc_installed = true;
+
+	if (!veth_gro_requested(dev)) {
+		/* user-space did not require GRO, but adding XDP
+		 * is supposed to get GRO working
+		 */
+		dev->features |= NETIF_F_GRO;
+		netdev_features_change(dev);
+	}
+	if (!napi_already_on) {
+		netif_napi_add(dev, &rq->xdp_napi, veth_poll);
+		napi_enable(&rq->xdp_napi);
+		rcu_assign_pointer(rq->napi, &rq->xdp_napi);
+	}
+	return 0;
+}
+
+/* RTNL-locked wrapper around __veth_iou_set(); entry point for the
+ * XDP_SETUP_ZC_RX command dispatched from veth_xdp().
+ */
+static int veth_iou_set(struct net_device *dev,
+			struct netdev_bpf *xdp)
+{
+	int ret;
+
+	rtnl_lock();
+	ret = __veth_iou_set(dev, xdp);
+	rtnl_unlock();
+	return ret;
+}
+
static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
struct netlink_ext_ack *extack)
{
@@ -1545,6 +1742,9 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
unsigned int max_mtu;
int err;
+ if (priv->zc_installed)
+ return -EINVAL;
+
old_prog = priv->_xdp_prog;
priv->_xdp_prog = prog;
peer = rtnl_dereference(priv->peer);
@@ -1623,6 +1823,8 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
switch (xdp->command) {
case XDP_SETUP_PROG:
return veth_xdp_set(dev, xdp->prog, xdp->extack);
+ case XDP_SETUP_ZC_RX:
+ return veth_iou_set(dev, xdp);
default:
return -EINVAL;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread