From: David Wei <[email protected]>
To: [email protected], [email protected]
Cc: Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
Jakub Kicinski <[email protected]>, Paolo Abeni <[email protected]>,
"David S. Miller" <[email protected]>,
Eric Dumazet <[email protected]>,
Jesper Dangaard Brouer <[email protected]>,
David Ahern <[email protected]>,
Mina Almasry <[email protected]>
Subject: [RFC PATCH v3 00/20] Zero copy Rx using io_uring
Date: Tue, 19 Dec 2023 13:03:37 -0800
Message-ID: <[email protected]>
This patchset is a proposal that adds zero copy network Rx to io_uring.
With it, userspace can register a region of host memory for receiving
data directly from a NIC using DMA, without needing a kernel-to-user
copy.
Full kernel tree including some out of tree BNXT changes:
https://github.com/spikeh/linux/tree/zcrx_sil
On the userspace side, support is added to both liburing and Netbench:
https://github.com/spikeh/liburing/tree/zcrx2
https://github.com/spikeh/netbench/tree/zcrx
Hardware support is added to the Broadcom BNXT driver. This patchset +
userspace code was tested on an Intel Xeon Platinum 8321HC CPU and
Broadcom BCM57504 NIC.
Early benchmarks using this prototype, with iperf3 as a load generator,
showed a ~50% reduction in overall system memory bandwidth as measured
using perf counters. Note that DDIO must be disabled on Intel systems.
Build Netbench using the modified liburing above.
This patchset is based on the work by Jonathan Lemon
<[email protected]>:
https://lore.kernel.org/io-uring/[email protected]/
Changes in RFC v3:
------------------
* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC
pool added is now a backend for memory provider.
* We're also reusing ppiov infrastructure. The refcounting rules stay
the same but it's shifted into ppiov->refcount. That lets us
flexibly manage buffer lifetimes without adding any extra code to the
common networking paths. It'd also make it easier to support dmabufs
and device memory in the future.
* io_uring also knows about pages, so ppiovs might unnecessarily
break tools inspecting data; that can easily be solved later.
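To illustrate the refcounting point above, here is a minimal userspace sketch of a buffer whose lifetime is governed by a single reference count, so that the last holder to drop its reference is the one that recycles the buffer. This is not the kernel code: struct demo_iov and the demo_iov_*() helpers are hypothetical names invented for this example, and the real ppiov rules in the patches may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer descriptor; NOT struct page_pool_iov. */
struct demo_iov {
	atomic_int refcount;	/* one count per holder (driver, skb, userspace) */
	bool in_pool;		/* set once the buffer has been recycled */
};

/* Hand a fresh buffer to its first holder. */
static void demo_iov_init(struct demo_iov *iov)
{
	atomic_init(&iov->refcount, 1);
	iov->in_pool = false;
}

/* Take an extra reference, e.g. when the buffer is exposed to userspace. */
static void demo_iov_get(struct demo_iov *iov)
{
	atomic_fetch_add(&iov->refcount, 1);
}

/* Drop one reference; returns true if this was the last one. */
static bool demo_iov_put(struct demo_iov *iov)
{
	int old = atomic_fetch_sub(&iov->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1) {
		iov->in_pool = true;	/* last holder recycles the buffer */
		return true;
	}
	return false;
}
```

Because the recycle decision lives entirely in the put helper, no holder needs to know who else is still using the buffer, which is the property that keeps extra code out of the common paths.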
Many patches are not for upstream as they depend on work in progress,
namely from Mina:
* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra
Changes in RFC v2:
------------------
* Added copy fallback support if userspace memory allocated for ZC Rx
runs out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will
need to figure out what driver would be best for such testing
functionality in the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific
sockets with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be
steered into the same hardware Rx queue.
* Added Netbench server/client support.
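The copy fallback condition listed above can be sketched as a simple decision function. This is only an illustration under stated assumptions: struct demo_zc_pool, enum demo_rx_path and demo_rx_select() are hypothetical names, and the actual patches detect and handle these cases inside the kernel receive path rather than through an explicit selector like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of the decision; not a kernel type. */
enum demo_rx_path { DEMO_RX_ZEROCOPY, DEMO_RX_COPY };

/* Hypothetical view of the userspace memory registered for ZC Rx. */
struct demo_zc_pool {
	size_t free_bufs;	/* buffers still available in the region */
};

/*
 * Decide how a received payload reaches userspace. Falls back to a copy
 * if header splitting or flow steering failed, or if the pool ran out.
 */
static enum demo_rx_path demo_rx_select(struct demo_zc_pool *pool,
					bool hdr_split_ok, bool steering_ok)
{
	if (!hdr_split_ok || !steering_ok)
		return DEMO_RX_COPY;	/* payload did not land in ZC memory */
	if (pool->free_bufs == 0)
		return DEMO_RX_COPY;	/* registered memory is exhausted */

	pool->free_bufs--;		/* consume one zero copy buffer */
	return DEMO_RX_ZEROCOPY;
}
```

Note that the fallback path leaves the pool untouched, so buffers are only consumed when data is actually delivered without a copy.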
Known deficiencies that we will address in a future patchset:
* Proper test driver + selftests, maybe netdevsim.
* Revisiting userspace API.
* Multi-region support.
* Steering setup.
* Further optimisation work.
* ...and more.
If you would like to try out this patchset, build and run the kernel
tree, then build Netbench against liburing, all from the forks above.
Run setup.sh first:
https://gist.github.com/isilence/e6a28ce41a545a261566672104afa461
Then run the following commands:
sudo ip netns exec nsserv ./netbench --server_only 1 --v6 false \
--rx "io_uring --provide_buffers 0 --use_zc 1 \
--zc_pool_pages 16384 --zc_ifname ptp-serv" --use_port 9999
sudo ip netns exec nscl ./netbench --client_only 1 --v6 false \
--tx "epoll --threads 1 --per_thread 1 --size 2800" \
--host 10.10.10.20 --use_port 9999
David Wei (6):
io_uring: add interface queue
io_uring: add mmap support for shared ifq ringbuffers
netdev: add XDP_SETUP_ZC_RX command
io_uring: setup ZC for an Rx queue when registering an ifq
io_uring: add ZC buf and pool
io_uring: add io_recvzc request
Pavel Begunkov (14):
net: page_pool: add ppiov mangling helper
tcp: don't allow non-devmem originated ppiov
net: page pool: rework ppiov life cycle
net: enable napi_pp_put_page for ppiov
net: page_pool: add ->scrub mem provider callback
io_uring: separate header for exported net bits
io_uring/zcrx: implement socket registration
io_uring: implement pp memory provider for zc rx
net: page pool: add io_uring memory provider
net: execute custom callback from napi
io_uring/zcrx: add copy fallback
veth: add support for io_uring zc rx
net: page pool: generalise ppiov dma address get
bnxt: enable io_uring zc page pool
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 71 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 7 +
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 3 +
drivers/net/veth.c | 211 +++-
include/linux/io_uring.h | 6 -
include/linux/io_uring/net.h | 31 +
include/linux/io_uring_types.h | 8 +
include/linux/net.h | 2 +
include/linux/netdevice.h | 6 +
include/net/busy_poll.h | 7 +
include/net/page_pool/helpers.h | 27 +-
include/net/page_pool/types.h | 4 +
include/uapi/linux/io_uring.h | 61 ++
io_uring/Makefile | 2 +-
io_uring/io_uring.c | 24 +
io_uring/net.c | 133 ++-
io_uring/opdef.c | 16 +
io_uring/uring_cmd.c | 1 +
io_uring/zc_rx.c | 954 ++++++++++++++++++
io_uring/zc_rx.h | 80 ++
net/core/dev.c | 46 +
net/core/page_pool.c | 68 +-
net/core/skbuff.c | 28 +-
net/ipv4/tcp.c | 7 +
net/socket.c | 3 +-
25 files changed, 1737 insertions(+), 69 deletions(-)
create mode 100644 include/linux/io_uring/net.h
create mode 100644 io_uring/zc_rx.c
create mode 100644 io_uring/zc_rx.h
--
2.39.3