From: David Wei <[email protected]>
To: [email protected], [email protected]
Cc: Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
Jakub Kicinski <[email protected]>, Paolo Abeni <[email protected]>,
"David S. Miller" <[email protected]>,
Eric Dumazet <[email protected]>,
Jesper Dangaard Brouer <[email protected]>,
David Ahern <[email protected]>,
Mina Almasry <[email protected]>,
Stanislav Fomichev <[email protected]>,
Joe Damato <[email protected]>,
Pedro Tammela <[email protected]>
Subject: [PATCH net-next v10 21/22] net: add documentation for io_uring zcrx
Date: Wed, 8 Jan 2025 14:06:42 -0800
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

Add documentation for io_uring zero copy Rx that explains requirements
and the user API.

Signed-off-by: David Wei <[email protected]>
---
Documentation/networking/index.rst | 1 +
Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
2 files changed, 202 insertions(+)
create mode 100644 Documentation/networking/iou-zcrx.rst
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 058193ed2eeb..c64133d309bf 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -63,6 +63,7 @@ Contents:
gtp
ila
ioam6-sysctl
+ iou-zcrx
ip_dynaddr
ipsec
ip-sysctl
diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
new file mode 100644
index 000000000000..7f6b7c072b59
--- /dev/null
+++ b/Documentation/networking/iou-zcrx.rst
@@ -0,0 +1,201 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes the kernel-to-user copy
+on the network receive path, allowing packet data to be received directly into
+userspace memory. This feature differs from TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC, so the user must configure it.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as normal. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are at least two queues::
+
+ ethtool -L eth0 combined 2
+
+Enable header/data split::
+
+ ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+ ethtool -X eth0 equal 1
+
+Set up flow steering, bearing in mind that queues are 0-indexed::
+
+ ethtool -N eth0 flow-type tcp6 ... action 1
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+ IORING_SETUP_SINGLE_ISSUER
+ IORING_SETUP_DEFER_TASKRUN
+ IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+ void *area_ptr = mmap(NULL, area_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+ void *ring_ptr = mmap(NULL, ring_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+ size_t rq_entries = 4096;
+ size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+ /* align to page size */
+ ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+ struct io_uring_zcrx_area_reg area_reg = {
+ .addr = (__u64)(unsigned long)area_ptr,
+ .len = area_size,
+ .flags = 0,
+ };
+
+ struct io_uring_region_desc region_reg = {
+ .user_addr = (__u64)(unsigned long)ring_ptr,
+ .size = ring_size,
+ .flags = IORING_MEM_REGION_TYPE_USER,
+ };
+
+ struct io_uring_zcrx_ifq_reg reg = {
+ .if_idx = if_nametoindex("eth0"),
+ /* this is the HW queue with desired flow steered into it */
+ .if_rxq = 1,
+ .rq_entries = rq_entries,
+ .area_ptr = (__u64)(unsigned long)&area_reg,
+ .region_ptr = (__u64)(unsigned long)&region_reg,
+ };
+
+Register with kernel::
+
+ io_uring_register_ifq(ring, &reg);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+ struct io_uring_zcrx_rq refill_ring;
+
+ refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+ refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+ refill_ring.rqes =
+ (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+ refill_ring.rq_tail = 0;
+ refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(ring);
+ io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+ sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+ io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+ struct io_uring_cqe *cqe;
+ unsigned int count = 0;
+ unsigned int head;
+
+ io_uring_for_each_cqe(ring, head, cqe) {
+ struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+ unsigned char *data = area_ptr + (rcqe->off & ~IORING_ZCRX_AREA_MASK);
+ /* do something with the data */
+
+ count++;
+ }
+ io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+ struct io_uring_zcrx_rqe *rqe;
+ unsigned mask = refill_ring.ring_entries - 1;
+ rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+ unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
+ rqe->off = area_offset | area_reg.rq_area_token;
+ rqe->len = cqe->res;
+ IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
--
2.43.5