From: David Wei <[email protected]>
To: [email protected], [email protected]
Cc: Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
Jakub Kicinski <[email protected]>, Paolo Abeni <[email protected]>,
"David S. Miller" <[email protected]>,
Eric Dumazet <[email protected]>,
Jesper Dangaard Brouer <[email protected]>,
David Ahern <[email protected]>,
Mina Almasry <[email protected]>,
Stanislav Fomichev <[email protected]>,
Joe Damato <[email protected]>,
Pedro Tammela <[email protected]>
Subject: [PATCH net-next v9 19/20] net: add documentation for io_uring zcrx
Date: Tue, 17 Dec 2024 16:37:45 -0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
Add documentation for io_uring zero copy Rx that explains requirements
and the user API.
Signed-off-by: David Wei <[email protected]>
---
Documentation/networking/index.rst | 1 +
Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
2 files changed, 202 insertions(+)
create mode 100644 Documentation/networking/iou-zcrx.rst
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 46c178e564b3..d6ce9c5f179c 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -63,6 +63,7 @@ Contents:
gtp
ila
ioam6-sysctl
+ iou-zcrx
ip_dynaddr
ipsec
ip-sysctl
diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
new file mode 100644
index 000000000000..7f6b7c072b59
--- /dev/null
+++ b/Documentation/networking/iou-zcrx.rst
@@ -0,0 +1,201 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
+the network receive path, allowing packet data to be received directly into
+userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as e.g. DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC and it must be done by the user.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as normal. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are at least two queues::
+
+ ethtool -L eth0 combined 2
+
+Enable header/data split::
+
+ ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+ ethtool -X eth0 equal 1
+
+Set up flow steering, bearing in mind that queues are 0-indexed::
+
+ ethtool -N eth0 flow-type tcp6 ... action 1
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+ IORING_SETUP_SINGLE_ISSUER
+ IORING_SETUP_DEFER_TASKRUN
+ IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+ void *area_ptr = mmap(NULL, area_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+ void *ring_ptr = mmap(NULL, ring_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+ size_t rq_entries = 4096;
+ size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+ /* align to page size */
+ ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+ struct io_uring_zcrx_area_reg area_reg = {
+ .addr = (__u64)(unsigned long)area_ptr,
+ .len = area_size,
+ .flags = 0,
+ };
+
+ struct io_uring_region_desc region_reg = {
+ .user_addr = (__u64)(unsigned long)ring_ptr,
+ .size = ring_size,
+ .flags = IORING_MEM_REGION_TYPE_USER,
+ };
+
+ struct io_uring_zcrx_ifq_reg reg = {
+ .if_idx = if_nametoindex("eth0"),
+ /* this is the HW queue with desired flow steered into it */
+ .if_rxq = 1,
+ .rq_entries = rq_entries,
+ .area_ptr = (__u64)(unsigned long)&area_reg,
+ .region_ptr = (__u64)(unsigned long)®ion_reg,
+ };
+
+Register with kernel::
+
+ io_uring_register_ifq(ring, ®);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+ struct io_uring_zcrx_rq refill_ring;
+
+ refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+ refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+ refill_ring.rqes =
+ (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+ refill_ring.rq_tail = 0;
+ refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(ring);
+ io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+ sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+ io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+ struct io_uring_cqe *cqe;
+ unsigned int count = 0;
+ unsigned int head;
+
+ io_uring_for_each_cqe(ring, head, cqe) {
+ struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+ unsigned char *data = area_ptr + (rcqe->off & IORING_ZCRX_AREA_MASK);
+ /* do something with the data */
+
+ count++;
+ }
+ io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+ struct io_uring_zcrx_rqe *rqe;
+ unsigned mask = refill_ring.ring_entries - 1;
+ rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+ area_offset = rcqe->off & IORING_ZCRX_AREA_MASK;
+ rqe->off = area_offset | area_reg.rq_area_token;
+ rqe->len = cqe->res;
+ IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
--
2.43.5
next prev parent reply other threads:[~2024-12-18 0:38 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-18 0:37 [PATCH RESEND net-next v9 00/21] io_uring zero copy rx David Wei
2024-12-18 0:37 ` [PATCH net-next v9 01/20] net: page_pool: don't cast mp param to devmem David Wei
2024-12-20 22:04 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 02/20] net: prefix devmem specific helpers David Wei
2024-12-20 22:05 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 03/20] net: generalise net_iov chunk owners David Wei
2024-12-20 22:14 ` Jakub Kicinski
2024-12-21 0:50 ` Pavel Begunkov
2024-12-21 2:17 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 04/20] net: page_pool: create hooks for custom page providers David Wei
2024-12-18 0:37 ` [PATCH net-next v9 05/20] net: page_pool: add mp op for netlink reporting David Wei
2024-12-20 22:16 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 06/20] net: page_pool: add a mp hook to unregister_netdevice* David Wei
2024-12-20 22:18 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 07/20] net: prepare for non devmem TCP memory providers David Wei
2024-12-20 22:18 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 08/20] net: expose page_pool_{set,clear}_pp_info David Wei
2024-12-20 22:31 ` Jakub Kicinski
2024-12-21 1:07 ` Pavel Begunkov
2024-12-21 2:23 ` Jakub Kicinski
2024-12-18 0:37 ` [PATCH net-next v9 09/20] net: page_pool: introduce page_pool_mp_return_in_cache David Wei
2024-12-18 0:37 ` [PATCH net-next v9 10/20] io_uring/zcrx: add interface queue and refill queue David Wei
2024-12-18 0:37 ` [PATCH net-next v9 11/20] io_uring/zcrx: add io_zcrx_area David Wei
2024-12-18 0:37 ` [PATCH net-next v9 12/20] io_uring/zcrx: grab a net device David Wei
2024-12-18 0:37 ` [PATCH net-next v9 13/20] net: page pool: export page_pool_set_dma_addr_netmem() David Wei
2024-12-18 0:37 ` [PATCH net-next v9 14/20] io_uring/zcrx: dma-map area for the device David Wei
2024-12-20 22:38 ` Jakub Kicinski
2024-12-21 1:04 ` Pavel Begunkov
2024-12-18 0:37 ` [PATCH net-next v9 15/20] io_uring/zcrx: add io_recvzc request David Wei
2024-12-18 0:37 ` [PATCH net-next v9 16/20] io_uring/zcrx: set pp memory provider for an rx queue David Wei
2024-12-18 0:37 ` [PATCH net-next v9 17/20] io_uring/zcrx: throttle receive requests David Wei
2024-12-18 0:37 ` [PATCH net-next v9 18/20] io_uring/zcrx: add copy fallback David Wei
2024-12-18 0:37 ` David Wei [this message]
2024-12-18 0:37 ` [PATCH net-next v9 20/20] io_uring/zcrx: add selftest David Wei
2024-12-18 14:40 ` [PATCH RESEND net-next v9 00/21] io_uring zero copy rx Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox