* [RFC v1 0/9] zero-copy RX for io_uring
@ 2022-10-07 21:17 Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 1/9] io_uring: add zctap ifq definition Jonathan Lemon
` (9 more replies)
0 siblings, 10 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
This series is an RFC for io_uring/zctap. This is an evolution of
the earlier zctap work, re-targeted to use io_uring as the userspace
API. The current code is intended to provide a zero-copy RX path for
upper-level networking protocols (aka TCP and UDP). The current draft
focuses on host-provided memory (not GPU memory).
This RFC contains the upper-level core code required for operation,
with the intent of soliciting feedback on the general API. This does
not contain the network driver side changes required for complete
operation. Also please note that as an RFC, there are some things
which are incomplete or in need of rework.
The intent is to use a network driver which provides header/data
splitting, so the frame header (which is processed by the networking
stack) does not reside in user memory.
The code is roughly working (in that it has successfully received
a TCP stream from a remote sender), but as an RFC, the intent is
to solicit feedback on the API and overall design. The current code
will also work with system pages, copying the data out to the
application - this is intended as a fallback/testing path.
High level description:
The application allocates a frame backing store, and provides this
to the kernel for use. An interface queue is requested from the
networking device, and incoming frames are deposited into the provided
memory region.
Responsibility for correctly steering incoming frames to the queue
is outside the scope of this work - it is assumed that the user
has set steering rules up separately.
Incoming frames are sent up the stack as skb's and eventually
land in the application's socket receive queue. This differs
from AF_XDP, which receives raw frames directly to userspace,
without protocol processing.
The RECV_ZC opcode then returns an iov[] style vector which points
to the data in userspace memory. When the application has completed
processing of the data, the buffer is returned back to the kernel
through a fill ring for reuse.
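To make the flow concrete, here is a sketch of what the application-side
consumption loop might look like with current liburing primitives.
Nothing here is prescribed by the series: consume(), zov_to_addr() and
zov_page_addr() are hypothetical application helpers, the fill ring is
assumed to be a provided-buffer ring registered under the ifq's
fill_bgid via IORING_REGISTER_PBUF_RING, and io_uring_zctap_iov and
RECV_ZC come from later patches in this series.

#include <liburing.h>

/* Hypothetical consumption loop; 'zov' is the metadata buffer that
 * was handed to the RECV_ZC sqe. */
static void handle_recv_zc(struct io_uring *ring,
                           struct io_uring_buf_ring *fill_ring,
                           struct io_uring_zctap_iov *zov,
                           unsigned int fill_entries)
{
        struct io_uring_cqe *cqe;
        int i, n;

        io_uring_wait_cqe(ring, &cqe);
        n = cqe->res > 0 ? cqe->res / (int)sizeof(*zov) : 0;

        for (i = 0; i < n; i++) {
                /* bgid/bid/off locate the data in the backing region */
                consume(zov_to_addr(&zov[i]), zov[i].len);

                /* Hand the page back for reuse; a real application
                 * recycles a page only after every fragment within it
                 * has been consumed.  4K pages assumed. */
                io_uring_buf_ring_add(fill_ring, zov_page_addr(&zov[i]),
                                      4096, zov[i].bid,
                                      io_uring_buf_ring_mask(fill_entries),
                                      i);
        }
        if (n)
                io_uring_buf_ring_advance(fill_ring, n);
        io_uring_cqe_seen(ring, cqe);
}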
Jonathan Lemon (9):
io_uring: add zctap ifq definition
netdevice: add SETUP_ZCTAP to the netdev_bpf structure
io_uring: add register ifq opcode
io_uring: add provide_ifq_region opcode
io_uring: Add io_uring zctap iov structure and helpers
io_uring: introduce reference tracking for user pages.
page_pool: add page allocation and free hooks.
io_uring: provide functions for the page_pool.
io_uring: add OP_RECV_ZC command.
include/linux/io_uring.h | 24 ++
include/linux/io_uring_types.h | 10 +
include/linux/netdevice.h | 6 +
include/net/page_pool.h | 6 +
include/uapi/linux/io_uring.h | 26 ++
io_uring/Makefile | 3 +-
io_uring/io_uring.c | 10 +
io_uring/kbuf.c | 13 +
io_uring/kbuf.h | 2 +
io_uring/net.c | 123 ++++++
io_uring/opdef.c | 23 +
io_uring/zctap.c | 749 +++++++++++++++++++++++++++++++++
io_uring/zctap.h | 20 +
net/core/page_pool.c | 41 +-
14 files changed, 1048 insertions(+), 8 deletions(-)
create mode 100644 io_uring/zctap.c
create mode 100644 io_uring/zctap.h
--
2.30.2
* [RFC v1 1/9] io_uring: add zctap ifq definition
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 2/9] netdevice: add SETUP_ZCTAP to the netdev_bpf structure Jonathan Lemon
` (8 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
Add the structure definition for io_zctap_ifq, for use by lower-level
networking hooks.
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/linux/io_uring_types.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..680fbf1f34e7 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -323,6 +323,7 @@ struct io_ring_ctx {
struct io_mapped_ubuf *dummy_ubuf;
struct io_rsrc_data *file_data;
struct io_rsrc_data *buf_data;
+ struct xarray zctap_ifq_xa;
struct delayed_work rsrc_put_work;
struct llist_head rsrc_put_llist;
@@ -578,4 +579,12 @@ struct io_overflow_cqe {
struct io_uring_cqe cqe;
};
+struct io_zctap_ifq {
+ struct net_device *dev;
+ struct io_ring_ctx *ctx;
+ u16 queue_id;
+ u16 id;
+ u16 fill_bgid;
+};
+
#endif
--
2.30.2
* [RFC v1 2/9] netdevice: add SETUP_ZCTAP to the netdev_bpf structure
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 1/9] io_uring: add zctap ifq definition Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 3/9] io_uring: add register ifq opcode Jonathan Lemon
` (7 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
This command requests that the networking device set up or tear down
an interface queue backed by a region of user-supplied memory.
The queue will be managed by io_uring.
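As an illustration, a driver's ndo_bpf handler might dispatch the new
command roughly as follows; the mydrv_* names are hypothetical and not
part of this patch:

#include <linux/netdevice.h>

/* mydrv_setup_zctap() would attach the RX queue to an ifq-backed
 * page_pool when ifq != NULL, or restore the default pool on
 * teardown; it may write the chosen queue id back via queue_id. */
static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
{
        switch (bpf->command) {
        case XDP_SETUP_ZCTAP:
                return mydrv_setup_zctap(dev, bpf->zct.ifq,
                                         &bpf->zct.queue_id);
        /* ... existing XDP_SETUP_PROG etc. cases ... */
        default:
                return -EINVAL;
        }
}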
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/linux/netdevice.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f42fc871c3b..49ecfc276411 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -979,6 +979,7 @@ enum bpf_netdev_command {
BPF_OFFLOAD_MAP_ALLOC,
BPF_OFFLOAD_MAP_FREE,
XDP_SETUP_XSK_POOL,
+ XDP_SETUP_ZCTAP,
};
struct bpf_prog_offload_ops;
@@ -1017,6 +1018,11 @@ struct netdev_bpf {
struct xsk_buff_pool *pool;
u16 queue_id;
} xsk;
+ /* XDP_SETUP_ZCTAP */
+ struct {
+ struct io_zctap_ifq *ifq;
+ u16 queue_id;
+ } zct;
};
};
--
2.30.2
* [RFC v1 3/9] io_uring: add register ifq opcode
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 1/9] io_uring: add zctap ifq definition Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 2/9] netdevice: add SETUP_ZCTAP to the netdev_bpf structure Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 4/9] io_uring: add provide_ifq_region opcode Jonathan Lemon
` (6 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
Add initial support for hooking zero-copy interface queues into
io_uring. This command requests a user-managed queue from the
specified network device.
This only includes the register opcode; unregistration is currently
done implicitly when the ring is removed.
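For illustration, userspace registration might look like the sketch
below (raw syscall form, since liburing has no wrapper for this yet;
ring_fd and fill_bgid are placeholders):

#include <net/if.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: request queue 0 on eth0, with the fill ring drawing from
 * buffer group 'fill_bgid'.  On success, the kernel writes the
 * assigned ifq_id (and possibly a different queue_id) back into req. */
struct io_uring_ifq_req req = {
        .ifindex   = if_nametoindex("eth0"),
        .queue_id  = 0,
        .fill_bgid = fill_bgid,
};
int ret = syscall(__NR_io_uring_register, ring_fd,
                  IORING_REGISTER_IFQ, &req, 1);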
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/uapi/linux/io_uring.h | 14 ++++
io_uring/Makefile | 3 +-
io_uring/io_uring.c | 10 +++
io_uring/zctap.c | 146 ++++++++++++++++++++++++++++++++++
io_uring/zctap.h | 11 +++
5 files changed, 183 insertions(+), 1 deletion(-)
create mode 100644 io_uring/zctap.c
create mode 100644 io_uring/zctap.h
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6b83177fd41d..bc5108d65c0a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -473,6 +473,9 @@ enum {
/* register a range of fixed file slots for automatic slot allocation */
IORING_REGISTER_FILE_ALLOC_RANGE = 25,
+ /* register a network ifq for zerocopy RX */
+ IORING_REGISTER_IFQ = 26,
+
/* this goes last */
IORING_REGISTER_LAST
};
@@ -649,6 +652,17 @@ struct io_uring_recvmsg_out {
__u32 flags;
};
+/*
+ * Argument for IORING_REGISTER_IFQ
+ */
+struct io_uring_ifq_req {
+ __u32 ifindex;
+ __u16 queue_id;
+ __u16 ifq_id;
+ __u16 fill_bgid;
+ __u16 __pad[3];
+};
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..9d87e2e45ef9 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \
openclose.o uring_cmd.o epoll.o \
statx.o net.o msg_ring.o timeout.o \
sqpoll.o fdinfo.o tctx.o poll.o \
- cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+ cancel.o kbuf.o rsrc.o rw.o opdef.o \
+ notif.o zctap.o
obj-$(CONFIG_IO_WQ) += io-wq.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index b9640ad5069f..8dd988b33af0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -91,6 +91,7 @@
#include "cancel.h"
#include "net.h"
#include "notif.h"
+#include "zctap.h"
#include "timeout.h"
#include "poll.h"
@@ -321,6 +322,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_WQ_LIST(&ctx->locked_free_list);
INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+ xa_init(&ctx->zctap_ifq_xa);
return ctx;
err:
kfree(ctx->dummy_ubuf);
@@ -2639,6 +2641,8 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
__io_cqring_overflow_flush(ctx, true);
xa_for_each(&ctx->personalities, index, creds)
io_unregister_personality(ctx, index);
+ xa_for_each(&ctx->zctap_ifq_xa, index, creds)
+ io_unregister_zctap_ifq(ctx, index);
if (ctx->rings)
io_poll_remove_all(ctx, NULL, true);
mutex_unlock(&ctx->uring_lock);
@@ -3839,6 +3843,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_file_alloc_range(ctx, arg);
break;
+ case IORING_REGISTER_IFQ:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_ifq(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
new file mode 100644
index 000000000000..41feb76b7a35
--- /dev/null
+++ b/io_uring/zctap.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+#include <linux/netdevice.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "zctap.h"
+
+static DEFINE_XARRAY_ALLOC1(io_zctap_ifq_xa);
+
+typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+
+static int __io_queue_mgmt(struct net_device *dev, struct io_zctap_ifq *ifq,
+ u16 *queue_id)
+{
+ struct netdev_bpf cmd;
+ bpf_op_t ndo_bpf;
+ int err;
+
+ ndo_bpf = dev->netdev_ops->ndo_bpf;
+ if (!ndo_bpf)
+ return -EINVAL;
+
+ cmd.command = XDP_SETUP_ZCTAP;
+ cmd.zct.ifq = ifq;
+ cmd.zct.queue_id = *queue_id;
+
+ err = ndo_bpf(dev, &cmd);
+ if (!err)
+ *queue_id = cmd.zct.queue_id;
+
+ return err;
+}
+
+static int io_open_zctap_ifq(struct io_zctap_ifq *ifq, u16 *queue_id)
+{
+ return __io_queue_mgmt(ifq->dev, ifq, queue_id);
+}
+
+static int io_close_zctap_ifq(struct io_zctap_ifq *ifq, u16 queue_id)
+{
+ return __io_queue_mgmt(ifq->dev, NULL, &queue_id);
+}
+
+static struct io_zctap_ifq *io_zctap_ifq_alloc(void)
+{
+ struct io_zctap_ifq *ifq;
+
+ ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+ if (!ifq)
+ return NULL;
+
+ ifq->queue_id = -1;
+ return ifq;
+}
+
+static void io_zctap_ifq_free(struct io_zctap_ifq *ifq)
+{
+ if (ifq->queue_id != -1)
+ io_close_zctap_ifq(ifq, ifq->queue_id);
+ if (ifq->dev)
+ dev_put(ifq->dev);
+ if (ifq->id)
+ xa_erase(&io_zctap_ifq_xa, ifq->id);
+ kfree(ifq);
+}
+
+int io_register_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_ifq_req __user *arg)
+{
+ struct io_uring_ifq_req req;
+ struct io_zctap_ifq *ifq;
+ int id, err;
+
+ if (copy_from_user(&req, arg, sizeof(req)))
+ return -EFAULT;
+
+ ifq = io_zctap_ifq_alloc();
+ if (!ifq)
+ return -ENOMEM;
+ ifq->ctx = ctx;
+ ifq->fill_bgid = req.fill_bgid;
+
+ err = -ENODEV;
+ ifq->dev = dev_get_by_index(&init_net, req.ifindex);
+ if (!ifq->dev)
+ goto out;
+
+ err = io_open_zctap_ifq(ifq, &req.queue_id);
+ if (err)
+ goto out;
+ ifq->queue_id = req.queue_id;
+
+ /* aka idr */
+ err = xa_alloc(&io_zctap_ifq_xa, &id, ifq,
+ XA_LIMIT(1, PAGE_SIZE - 1), GFP_KERNEL);
+ if (err)
+ goto out;
+ ifq->id = id;
+ req.ifq_id = id;
+
+ err = xa_err(xa_store(&ctx->zctap_ifq_xa, id, ifq, GFP_KERNEL));
+ if (err)
+ goto out;
+
+ if (copy_to_user(arg, &req, sizeof(req))) {
+ xa_erase(&ctx->zctap_ifq_xa, id);
+ err = -EFAULT;
+ goto out;
+ }
+
+ return 0;
+
+out:
+ io_zctap_ifq_free(ifq);
+ return err;
+}
+
+int io_unregister_zctap_ifq(struct io_ring_ctx *ctx, unsigned long index)
+{
+ struct io_zctap_ifq *ifq;
+
+ ifq = xa_erase(&ctx->zctap_ifq_xa, index);
+ if (!ifq)
+ return -EINVAL;
+
+ io_zctap_ifq_free(ifq);
+ return 0;
+}
+
+int io_unregister_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_ifq_req __user *arg)
+{
+ struct io_uring_ifq_req req;
+
+ if (copy_from_user(&req, arg, sizeof(req)))
+ return -EFAULT;
+
+ return io_unregister_zctap_ifq(ctx, req.ifq_id);
+}
diff --git a/io_uring/zctap.h b/io_uring/zctap.h
new file mode 100644
index 000000000000..bda15d218fe3
--- /dev/null
+++ b/io_uring/zctap.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZCTAP_H
+#define IOU_ZCTAP_H
+
+int io_register_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_ifq_req __user *arg);
+int io_unregister_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_ifq_req __user *arg);
+int io_unregister_zctap_ifq(struct io_ring_ctx *ctx, unsigned long index);
+
+#endif
--
2.30.2
* [RFC v1 4/9] io_uring: add provide_ifq_region opcode
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (2 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 3/9] io_uring: add register ifq opcode Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 5/9] io_uring: Add io_uring zctap iov structure and helpers Jonathan Lemon
` (5 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
This opcode takes part or all of a memory region that was previously
registered with io_uring, and assigns it as the backing store for
the specified ifq.
The entire region is registered, instead of providing individual
buffers, as this allows the hardware to select the optimal buffer
size for incoming packets.
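From userspace, providing a region might look like the sketch below.
There is no liburing wrapper yet, and ring, region, region_len, ifq_id
and buf_idx are placeholders; field usage mirrors what
io_provide_ifq_region_prep() in this patch reads. Note the request
rides on IOSQE_BUFFER_SELECT, and sqe->buf_group (a union with
buf_index) names the fixed buffer:

/* Sketch: hand a page-aligned slice of the fixed buffer registered
 * at index 'buf_idx' (via IORING_REGISTER_BUFFERS) to ifq 'ifq_id'. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

memset(sqe, 0, sizeof(*sqe));
sqe->opcode    = IORING_OP_PROVIDE_IFQ_REGION;
sqe->flags     = IOSQE_BUFFER_SELECT;   /* sets REQ_F_BUFFER_SELECT */
sqe->fd        = ifq_id;                /* looked up in zctap_ifq_xa */
sqe->addr      = (unsigned long)region; /* page aligned */
sqe->len       = region_len;            /* multiple of PAGE_SIZE */
sqe->buf_group = buf_idx;               /* becomes req->buf_index */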
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/linux/io_uring_types.h | 1 +
include/uapi/linux/io_uring.h | 1 +
io_uring/opdef.c | 9 ++++
io_uring/zctap.c | 96 ++++++++++++++++++++++++++++++++++
io_uring/zctap.h | 4 ++
5 files changed, 111 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 680fbf1f34e7..56257e8afd0a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -582,6 +582,7 @@ struct io_overflow_cqe {
struct io_zctap_ifq {
struct net_device *dev;
struct io_ring_ctx *ctx;
+ void *region; /* XXX relocate? */
u16 queue_id;
u16 id;
u16 fill_bgid;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index bc5108d65c0a..3b392f8270dc 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -206,6 +206,7 @@ enum io_uring_op {
IORING_OP_SOCKET,
IORING_OP_URING_CMD,
IORING_OP_SEND_ZC,
+ IORING_OP_PROVIDE_IFQ_REGION,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index c4dddd0fd709..bf28c43117c3 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@
#include "poll.h"
#include "cancel.h"
#include "rw.h"
+#include "zctap.h"
static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
{
@@ -488,6 +489,14 @@ const struct io_op_def io_op_defs[] = {
.prep = io_eopnotsupp_prep,
#endif
},
+ [IORING_OP_PROVIDE_IFQ_REGION] = {
+ .audit_skip = 1,
+ .iopoll = 1,
+ .buffer_select = 1,
+ .name = "PROVIDE_IFQ_REGION",
+ .prep = io_provide_ifq_region_prep,
+ .issue = io_provide_ifq_region,
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 41feb76b7a35..728f7c938b7b 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -6,11 +6,14 @@
#include <linux/mm.h>
#include <linux/io_uring.h>
#include <linux/netdevice.h>
+#include <linux/nospec.h>
#include <uapi/linux/io_uring.h>
#include "io_uring.h"
#include "zctap.h"
+#include "rsrc.h"
+#include "kbuf.h"
static DEFINE_XARRAY_ALLOC1(io_zctap_ifq_xa);
@@ -144,3 +147,96 @@ int io_unregister_ifq(struct io_ring_ctx *ctx,
return io_unregister_zctap_ifq(ctx, req.ifq_id);
}
+
+struct io_ifq_region {
+ struct file *file;
+ struct io_zctap_ifq *ifq;
+ __u64 addr;
+ __u32 len;
+ __u32 bgid;
+};
+
+struct ifq_region {
+ struct io_mapped_ubuf *imu;
+ u64 start;
+ u64 end;
+ int count;
+ int imu_idx;
+ int nr_pages;
+ struct page *page[];
+};
+
+int io_provide_ifq_region_prep(struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
+{
+ struct io_ifq_region *r = io_kiocb_to_cmd(req, struct io_ifq_region);
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_mapped_ubuf *imu;
+ u32 index;
+
+ if (!(req->flags & REQ_F_BUFFER_SELECT))
+ return -EINVAL;
+
+ r->addr = READ_ONCE(sqe->addr);
+ r->len = READ_ONCE(sqe->len);
+ index = READ_ONCE(sqe->fd);
+
+ if (!r->addr || r->addr & ~PAGE_MASK)
+ return -EFAULT;
+
+ if (!r->len || r->len & ~PAGE_MASK)
+ return -EFAULT;
+
+ r->ifq = xa_load(&ctx->zctap_ifq_xa, index);
+ if (!r->ifq)
+ return -EFAULT;
+
+ /* XXX for now, only allow one region per ifq. */
+ if (r->ifq->region)
+ return -EFAULT;
+
+ if (unlikely(req->buf_index >= ctx->nr_user_bufs))
+ return -EFAULT;
+ index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
+ imu = ctx->user_bufs[index];
+
+ if (r->addr < imu->ubuf || r->addr + r->len > imu->ubuf_end)
+ return -EFAULT;
+ req->imu = imu;
+
+ io_req_set_rsrc_node(req, ctx, 0);
+
+ return 0;
+}
+
+int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_ifq_region *r = io_kiocb_to_cmd(req, struct io_ifq_region);
+ struct ifq_region *ifr;
+ int i, idx, nr_pages;
+ struct page *page;
+
+ nr_pages = r->len >> PAGE_SHIFT;
+ idx = (r->addr - req->imu->ubuf) >> PAGE_SHIFT;
+
+ ifr = kvmalloc(struct_size(ifr, page, nr_pages), GFP_KERNEL);
+ if (!ifr)
+ return -ENOMEM;
+
+
+ ifr->nr_pages = nr_pages;
+ ifr->imu_idx = idx;
+ ifr->count = nr_pages;
+ ifr->imu = req->imu;
+ ifr->start = r->addr;
+ ifr->end = r->addr + r->len;
+
+ for (i = 0; i < nr_pages; i++, idx++) {
+ page = req->imu->bvec[idx].bv_page;
+ ifr->page[i] = page;
+ }
+
+ WRITE_ONCE(r->ifq->region, ifr);
+
+ return 0;
+}
diff --git a/io_uring/zctap.h b/io_uring/zctap.h
index bda15d218fe3..709c803220f4 100644
--- a/io_uring/zctap.h
+++ b/io_uring/zctap.h
@@ -8,4 +8,8 @@ int io_unregister_ifq(struct io_ring_ctx *ctx,
struct io_uring_ifq_req __user *arg);
int io_unregister_zctap_ifq(struct io_ring_ctx *ctx, unsigned long index);
+int io_provide_ifq_region_prep(struct io_kiocb *req,
+ const struct io_uring_sqe *sqe);
+int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags);
+
#endif
--
2.30.2
* [RFC v1 5/9] io_uring: Add io_uring zctap iov structure and helpers
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (3 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 4/9] io_uring: add provide_ifq_region opcode Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 6/9] io_uring: introduce reference tracking for user pages Jonathan Lemon
` (4 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
With networking zero-copy receive, the incoming data is placed
directly into user-supplied buffers. Instead of returning the
buffer address, return the buffer group id and buffer id, and
let the application figure out the base address.
Add helpers for storing and retrieving the encoding, which is
stored in the page_private field. This will be used in the
zerocopy RX routine, when handling pages from skb fragments.
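For illustration, the application-side decode might look like the
sketch below. It assumes, matching how this patch assigns page ids,
that bid counts pages from the start of the fixed buffer backing the
region, and that off is the byte offset within that page; 4K pages
are assumed for brevity:

/* Hypothetical application helper. */
static void *zov_to_addr(void *buf_base, /* base of the fixed buffer */
                         const struct io_uring_zctap_iov *zov)
{
        return (char *)buf_base +
               ((size_t)zov->bid << 12) + zov->off;
}

(Copy-fallback entries, introduced later in the series, would instead
be decoded against a buffer from the fallback buffer group.)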
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/uapi/linux/io_uring.h | 10 +++++++++
io_uring/zctap.c | 39 ++++++++++++++++++++++++++++++++++-
2 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3b392f8270dc..145d55280919 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -664,6 +664,16 @@ struct io_uring_ifq_req {
__u16 __pad[3];
};
+struct io_uring_zctap_iov {
+ __u32 off;
+ __u32 len;
+ __u16 bgid;
+ __u16 bid;
+ __u16 ifq_id;
+ __u16 resv;
+};
+
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 728f7c938b7b..58b4c5417650 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -19,6 +19,26 @@ static DEFINE_XARRAY_ALLOC1(io_zctap_ifq_xa);
typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+static u64 zctap_page_info(u16 region_id, u16 pgid, u16 ifq_id)
+{
+ return (u64)region_id << 32 | (u64)pgid << 16 | ifq_id;
+}
+
+static u16 zctap_page_region_id(const struct page *page)
+{
+ return (page_private(page) >> 32) & 0xffff;
+}
+
+static u16 zctap_page_id(const struct page *page)
+{
+ return (page_private(page) >> 16) & 0xffff;
+}
+
+static u16 zctap_page_ifq_id(const struct page *page)
+{
+ return page_private(page) & 0xffff;
+}
+
static int __io_queue_mgmt(struct net_device *dev, struct io_zctap_ifq *ifq,
u16 *queue_id)
{
@@ -213,8 +233,9 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_ifq_region *r = io_kiocb_to_cmd(req, struct io_ifq_region);
struct ifq_region *ifr;
- int i, idx, nr_pages;
+ int i, id, idx, nr_pages;
struct page *page;
+ u64 info;
nr_pages = r->len >> PAGE_SHIFT;
idx = (r->addr - req->imu->ubuf) >> PAGE_SHIFT;
@@ -231,12 +252,28 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
ifr->start = r->addr;
ifr->end = r->addr + r->len;
+ id = r->ifq->id;
for (i = 0; i < nr_pages; i++, idx++) {
page = req->imu->bvec[idx].bv_page;
+ if (PagePrivate(page))
+ goto out;
+ SetPagePrivate(page);
+ info = zctap_page_info(r->bgid, idx + i, id);
+ set_page_private(page, info);
ifr->page[i] = page;
}
WRITE_ONCE(r->ifq->region, ifr);
return 0;
+out:
+ while (i--) {
+ page = req->imu->bvec[idx + i].bv_page;
+ ClearPagePrivate(page);
+ set_page_private(page, 0);
+ }
+
+ kvfree(ifr);
+
+ return -EEXIST;
}
--
2.30.2
* [RFC v1 6/9] io_uring: introduce reference tracking for user pages.
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (4 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 5/9] io_uring: Add io_uring zctap iov structure and helpers Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 7/9] page_pool: add page allocation and free hooks Jonathan Lemon
` (3 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
This is currently a WIP.
If only part of a page is used by an skb fragment and then provided
to the user, the page should not be reused by the kernel until all of
its sub-page fragments are no longer in use. For example, if two
1500-byte frames land in the same page, the page can only be recycled
after both have been consumed.
If only full pages are used (and not sub-page fragments), then this
code shouldn't be needed.
Signed-off-by: Jonathan Lemon <[email protected]>
---
io_uring/zctap.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 58b4c5417650..9db3421fb9fa 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -183,9 +183,36 @@ struct ifq_region {
int count;
int imu_idx;
int nr_pages;
+ u8 *page_uref;
struct page *page[];
};
+static void io_add_page_uref(struct ifq_region *ifr, u16 pgid)
+{
+ if (WARN_ON(!ifr))
+ return;
+
+ if (WARN_ON(pgid < ifr->imu_idx))
+ return;
+
+ ifr->page_uref[pgid - ifr->imu_idx]++;
+}
+
+static bool io_put_page_last_uref(struct ifq_region *ifr, u64 addr)
+{
+ int idx;
+
+ if (WARN_ON(addr < ifr->start || addr > ifr->end))
+ return false;
+
+ idx = (addr - ifr->start) >> PAGE_SHIFT;
+
+ if (WARN_ON(!ifr->page_uref[idx]))
+ return false;
+
+ return --ifr->page_uref[idx] == 0;
+}
+
int io_provide_ifq_region_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -244,6 +271,11 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
if (!ifr)
return -ENOMEM;
+ ifr->page_uref = kvmalloc_array(nr_pages, sizeof(u8), GFP_KERNEL);
+ if (!ifr->page_uref) {
+ kvfree(ifr);
+ return -ENOMEM;
+ }
ifr->nr_pages = nr_pages;
ifr->imu_idx = idx;
@@ -261,6 +293,7 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
info = zctap_page_info(r->bgid, idx + i, id);
set_page_private(page, info);
ifr->page[i] = page;
+ ifr->page_uref[i] = 0;
}
WRITE_ONCE(r->ifq->region, ifr);
@@ -273,6 +306,7 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
set_page_private(page, 0);
}
+ kvfree(ifr->page_uref);
kvfree(ifr);
return -EEXIST;
--
2.30.2
* [RFC v1 7/9] page_pool: add page allocation and free hooks.
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (5 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 6/9] io_uring: introduce reference tracking for user pages Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 8/9] io_uring: provide functions for the page_pool Jonathan Lemon
` (2 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
In order to allow for user-allocated page backing, add hooks to the
page pool so pages can be obtained and released from a user-supplied
provider instead of the system page allocator.
skbs are marked with skb_mark_for_recycle() if they contain pages
belonging to a page pool, and put_page() will deliver the pages back
to the pool instead of freeing them to the system page allocator.
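As an illustration, a driver might wire these hooks up roughly as
follows; the mydrv_* names are hypothetical, and the io_zctap_ifq_*
provider functions come from the next patch in the series:

#include <net/page_pool.h>
#include <linux/io_uring.h>

/* nid/gfp are ignored since the backing pages are user-supplied;
 * init_arg doubles as the hook argument, as in this patch. */
static struct page *mydrv_pp_alloc(void *arg, int nid, gfp_t gfp,
                                   unsigned int order)
{
        return io_zctap_ifq_get_page(arg, order);
}

static unsigned long mydrv_pp_alloc_bulk(void *arg, gfp_t gfp, int nid,
                                         unsigned long nr_pages,
                                         struct page **page_array)
{
        return io_zctap_ifq_get_bulk(arg, nr_pages, page_array);
}

static void mydrv_pp_put_page(void *arg, struct page *page)
{
        io_zctap_ifq_put_page(arg, page);
}

static struct page_pool *mydrv_create_zctap_pool(struct io_zctap_ifq *ifq)
{
        struct page_pool_params pp = {
                .order       = 1,       /* io_zctap_ifq_get_page() expects 1 */
                .pool_size   = 256,
                .nid         = NUMA_NO_NODE,
                .init_arg    = ifq,
                .alloc_pages = mydrv_pp_alloc,
                .alloc_bulk  = mydrv_pp_alloc_bulk,
                .put_page    = mydrv_pp_put_page,
                /* plus .flags/.dev/.dma_dir for DMA mapping, as usual */
        };

        return page_pool_create(&pp);
}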
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/net/page_pool.h | 6 ++++++
net/core/page_pool.c | 41 ++++++++++++++++++++++++++++++++++-------
2 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 813c93499f20..85c8423f9a7e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -82,6 +82,12 @@ struct page_pool_params {
unsigned int offset; /* DMA addr offset */
void (*init_callback)(struct page *page, void *arg);
void *init_arg;
+ struct page *(*alloc_pages)(void *arg, int nid, gfp_t gfp,
+ unsigned int order);
+ unsigned long (*alloc_bulk)(void *arg, gfp_t gfp, int nid,
+ unsigned long nr_pages,
+ struct page **page_array);
+ void (*put_page)(void *arg, struct page *page);
};
#ifdef CONFIG_PAGE_POOL_STATS
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9b203d8660e4..21c6ee97bc7f 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -342,19 +342,47 @@ static void page_pool_clear_pp_info(struct page *page)
page->pp = NULL;
}
+/* hooks to either page provider or system page allocator */
+static void page_pool_mm_put_page(struct page_pool *pool, struct page *page)
+{
+ if (pool->p.put_page)
+ return pool->p.put_page(pool->p.init_arg, page);
+ put_page(page);
+}
+
+static unsigned long page_pool_mm_alloc_bulk(struct page_pool *pool,
+ gfp_t gfp,
+ unsigned long nr_pages)
+{
+ if (pool->p.alloc_bulk)
+ return pool->p.alloc_bulk(pool->p.init_arg, gfp,
+ pool->p.nid, nr_pages,
+ pool->alloc.cache);
+ return alloc_pages_bulk_array_node(gfp, pool->p.nid,
+ nr_pages, pool->alloc.cache);
+}
+
+static struct page *page_pool_mm_alloc(struct page_pool *pool, gfp_t gfp)
+{
+ if (pool->p.alloc_pages)
+ return pool->p.alloc_pages(pool->p.init_arg, pool->p.nid,
+ gfp, pool->p.order);
+ return alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+}
+
static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
gfp_t gfp)
{
struct page *page;
gfp |= __GFP_COMP;
- page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+ page = page_pool_mm_alloc(pool, gfp);
if (unlikely(!page))
return NULL;
if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
unlikely(!page_pool_dma_map(pool, page))) {
- put_page(page);
+ page_pool_mm_put_page(pool, page);
return NULL;
}
@@ -389,8 +417,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */
memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
- nr_pages = alloc_pages_bulk_array_node(gfp, pool->p.nid, bulk,
- pool->alloc.cache);
+ nr_pages = page_pool_mm_alloc_bulk(pool, gfp, bulk);
if (unlikely(!nr_pages))
return NULL;
@@ -401,7 +428,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
page = pool->alloc.cache[i];
if ((pp_flags & PP_FLAG_DMA_MAP) &&
unlikely(!page_pool_dma_map(pool, page))) {
- put_page(page);
+ page_pool_mm_put_page(pool, page);
continue;
}
@@ -501,7 +528,7 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page)
{
page_pool_release_page(pool, page);
- put_page(page);
+ page_pool_mm_put_page(pool, page);
/* An optimization would be to call __free_pages(page, pool->p.order)
* knowing page is not part of page-cache (thus avoiding a
* __page_cache_release() call).
@@ -593,7 +620,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
recycle_stat_inc(pool, released_refcnt);
/* Do not replace this with page_pool_return_page() */
page_pool_release_page(pool, page);
- put_page(page);
+ page_pool_mm_put_page(pool, page);
return NULL;
}
--
2.30.2
* [RFC v1 8/9] io_uring: provide functions for the page_pool.
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (6 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 7/9] page_pool: add page allocation and free hooks Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 9/9] io_uring: add OP_RECV_ZC command Jonathan Lemon
2022-10-10 7:37 ` [RFC v1 0/9] zero-copy RX for io_uring dust.li
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
These functions are called by the page_pool in order to refill the
pool with user-supplied pages, or to return excess pages from the
pool.
If no pages are present in the region cache, then an attempt is
made to obtain more pages from the interface fill queue.
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/linux/io_uring.h | 24 ++++++++++++
io_uring/kbuf.c | 13 +++++++
io_uring/kbuf.h | 2 +
io_uring/zctap.c | 82 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 121 insertions(+)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 4a2f6cc5a492..b92e65e0a469 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -37,6 +37,14 @@ void __io_uring_free(struct task_struct *tsk);
void io_uring_unreg_ringfd(void);
const char *io_uring_get_opcode(u8 opcode);
+struct io_zctap_ifq;
+struct page *io_zctap_ifq_get_page(struct io_zctap_ifq *ifq,
+ unsigned int order);
+unsigned long io_zctap_ifq_get_bulk(struct io_zctap_ifq *ifq,
+ unsigned long nr_pages,
+ struct page **page_array);
+bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page);
+
static inline void io_uring_files_cancel(void)
{
if (current->io_uring) {
@@ -80,6 +88,22 @@ static inline const char *io_uring_get_opcode(u8 opcode)
{
return "";
}
+static inline struct page *io_zctap_ifq_get_page(struct io_zctap_ifq *ifq,
+ unsigned int order)
+{
+ return NULL;
+}
+static inline unsigned long io_zctap_ifq_get_bulk(struct io_zctap_ifq *ifq,
+ unsigned long nr_pages,
+ struct page **page_array)
+{
+ return 0;
+}
+static inline bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page)
+{
+ return false;
+}
+
#endif
#endif
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 25cd724ade18..caae2755e3d5 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -188,6 +188,19 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
return ret;
}
+/* XXX May be called from the driver, in napi context. */
+u64 io_zctap_buffer(struct io_kiocb *req, size_t *len)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_buffer_list *bl;
+ void __user *ret = NULL;
+
+ bl = io_buffer_get_list(ctx, req->buf_index);
+ if (likely(bl))
+ ret = io_ring_buffer_select(req, len, bl, IO_URING_F_UNLOCKED);
+ return (u64)ret;
+}
+
static __cold int io_init_bl_list(struct io_ring_ctx *ctx)
{
int i;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 746fbf31a703..1379e0e9f870 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -50,6 +50,8 @@ unsigned int __io_put_kbuf(struct io_kiocb *req, unsigned issue_flags);
void io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags);
+u64 io_zctap_buffer(struct io_kiocb *req, size_t *len);
+
static inline void io_kbuf_recycle_ring(struct io_kiocb *req)
{
/*
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 9db3421fb9fa..8bebe7c36c82 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -311,3 +311,85 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
return -EEXIST;
}
+
+/* gets a user-supplied buffer from the fill queue */
+static struct page *io_zctap_get_buffer(struct io_zctap_ifq *ifq)
+{
+ struct io_kiocb req = {
+ .ctx = ifq->ctx,
+ .buf_index = ifq->fill_bgid,
+ };
+ struct io_mapped_ubuf *imu;
+ struct ifq_region *ifr;
+ size_t len;
+ u64 addr;
+ int idx;
+
+ len = 0;
+ ifr = ifq->region;
+ imu = ifr->imu;
+
+ addr = io_zctap_buffer(&req, &len);
+ if (!addr)
+ goto fail;
+
+ /* XXX poor man's implementation of io_import_fixed */
+
+ if (addr < ifr->start || addr + len > ifr->end)
+ goto fail;
+
+ idx = (addr - ifr->start) >> PAGE_SHIFT;
+
+ return imu->bvec[ifr->imu_idx + idx].bv_page;
+
+fail:
+ /* warn and just drop buffer */
+ WARN_RATELIMIT(1, "buffer addr %llx invalid", addr);
+ return NULL;
+}
+
+struct page *io_zctap_ifq_get_page(struct io_zctap_ifq *ifq,
+ unsigned int order)
+{
+ struct ifq_region *ifr = ifq->region;
+
+ if (WARN_RATELIMIT(order != 1, "order %d", order))
+ return NULL;
+
+ if (ifr->count)
+ return ifr->page[--ifr->count];
+
+ return io_zctap_get_buffer(ifq);
+}
+
+unsigned long io_zctap_ifq_get_bulk(struct io_zctap_ifq *ifq,
+ unsigned long nr_pages,
+ struct page **page_array)
+{
+ struct ifq_region *ifr = ifq->region;
+ int count;
+
+ count = min_t(unsigned long, nr_pages, ifr->count);
+ if (count) {
+ ifr->count -= count;
+ memcpy(page_array, &ifr->page[ifr->count],
+ count * sizeof(struct page *));
+ }
+
+ return count;
+}
+
+bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page)
+{
+ struct ifq_region *ifr = ifq->region;
+
+ /* if page is not usermapped, then throw an error */
+
+ /* sanity check - leak pages here if hit */
+ if (WARN_RATELIMIT(ifr->count >= ifr->nr_pages, "page overflow"))
+ return true;
+
+ ifr->page[ifr->count++] = page;
+
+ return true;
+}
--
2.30.2
* [RFC v1 9/9] io_uring: add OP_RECV_ZC command.
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (7 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 8/9] io_uring: provide functions for the page_pool Jonathan Lemon
@ 2022-10-07 21:17 ` Jonathan Lemon
2022-10-10 7:37 ` [RFC v1 0/9] zero-copy RX for io_uring dust.li
9 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-07 21:17 UTC (permalink / raw)
To: io-uring
This is still a WIP. The current code (temporarily) uses addr3
as a hack in order to leverage code in io_recvmsg_prep.
The recvzc opcode uses a metadata buffer either supplied directly
with buf/len, or indirectly from the buffer group. The expectation
is that this buffer is then filled with an array of io_uring_zctap_iov
structures, which point to the data in user-memory.
addr3 = (readlen << 32) | (copy_bgid << 16) | ctx->ifq_id;
The amount of returned data is limited by the number of iovs that
the metadata area can hold, and also the readlen parameter.
As a fallback (and for testing purposes), if the skb data is not
present in user memory (perhaps due to system misconfiguration), then
a separate buffer is obtained from the copy_bgid group and the data
is copied into user memory.
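For reference, issuing the opcode from userspace might look like the
sketch below, built directly from the addr3 encoding above; ring,
sockfd, readlen, copy_bgid and ifq_id are placeholders, and no
liburing wrapper exists yet:

/* Sketch: receive up to 'readlen' bytes of stream data, depositing
 * io_uring_zctap_iov records into the 'zov' array. */
struct io_uring_zctap_iov zov[64];
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_RECV_ZC;
sqe->fd     = sockfd;
sqe->addr   = (unsigned long)zov;  /* metadata buffer */
sqe->len    = sizeof(zov);         /* caps the iov count */
sqe->addr3  = ((__u64)readlen << 32) |
              ((__u64)copy_bgid << 16) | ifq_id;
io_uring_submit(&ring);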
Signed-off-by: Jonathan Lemon <[email protected]>
---
include/uapi/linux/io_uring.h | 1 +
io_uring/net.c | 123 ++++++++++++
io_uring/opdef.c | 14 ++
io_uring/zctap.c | 354 ++++++++++++++++++++++++++++++++++
io_uring/zctap.h | 5 +
5 files changed, 497 insertions(+)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 145d55280919..3c31a966687e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -207,6 +207,7 @@ enum io_uring_op {
IORING_OP_URING_CMD,
IORING_OP_SEND_ZC,
IORING_OP_PROVIDE_IFQ_REGION,
+ IORING_OP_RECV_ZC,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/net.c b/io_uring/net.c
index 60e392f7f2dc..89c57ad83a79 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
#include "net.h"
#include "notif.h"
#include "rsrc.h"
+#include "zctap.h"
#if defined(CONFIG_NET)
struct io_shutdown {
@@ -73,6 +74,14 @@ struct io_sendzc {
struct io_kiocb *notif;
};
+struct io_recvzc {
+ struct io_sr_msg sr;
+ struct io_zctap_ifq *ifq;
+ u32 datalen;
+ u16 ifq_id;
+ u16 copy_bgid;
+};
+
#define IO_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)
int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -879,6 +888,120 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
return ret;
}
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ u64 recvzc_cmd;
+ u16 ifq_id;
+
+ /* XXX hack so we can temporarily use io_recvmsg_prep */
+ recvzc_cmd = READ_ONCE(sqe->addr3);
+
+ ifq_id = recvzc_cmd & 0xffff;
+ zc->copy_bgid = (recvzc_cmd >> 16) & 0xffff;
+ zc->datalen = recvzc_cmd >> 32;
+
+ zc->ifq = xa_load(&req->ctx->zctap_ifq_xa, ifq_id);
+ if (!zc->ifq)
+ return -EINVAL;
+ if (zc->ifq->ctx != req->ctx)
+ return -EINVAL;
+
+ return io_recvmsg_prep(req, sqe);
+}
+
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ struct msghdr msg;
+ struct socket *sock;
+ struct iovec iov;
+ unsigned int cflags;
+ unsigned flags;
+ int ret, min_ret = 0;
+ bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+ size_t len = zc->sr.len;
+
+ if (!(req->flags & REQ_F_POLLED) &&
+ (zc->sr.flags & IORING_RECVSEND_POLL_FIRST))
+ return -EAGAIN;
+
+ sock = sock_from_file(req->file);
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+
+retry_multishot:
+ if (io_do_buffer_select(req)) {
+ void __user *buf;
+
+ buf = io_buffer_select(req, &len, issue_flags);
+ if (!buf)
+ return -ENOBUFS;
+ zc->sr.buf = buf;
+ }
+
+ ret = import_single_range(READ, zc->sr.buf, len, &iov, &msg.msg_iter);
+ if (unlikely(ret))
+ goto out_free;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = NULL;
+ msg.msg_get_inq = 1;
+ msg.msg_flags = 0;
+ msg.msg_controllen = 0;
+ msg.msg_iocb = NULL;
+ msg.msg_ubuf = NULL;
+
+ flags = zc->sr.msg_flags;
+ if (force_nonblock)
+ flags |= MSG_DONTWAIT;
+ if (flags & MSG_WAITALL)
+ min_ret = iov_iter_count(&msg.msg_iter);
+
+ ret = io_zctap_recv(zc->ifq, sock, &msg, flags, zc->datalen,
+ zc->copy_bgid);
+ if (ret < min_ret) {
+ if (ret == -EAGAIN && force_nonblock) {
+ if ((req->flags & IO_APOLL_MULTI_POLLED) == IO_APOLL_MULTI_POLLED) {
+ io_kbuf_recycle(req, issue_flags);
+ return IOU_ISSUE_SKIP_COMPLETE;
+ }
+
+ return -EAGAIN;
+ }
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+ if (ret > 0 && io_net_retry(sock, flags)) {
+ zc->sr.len -= ret;
+ zc->sr.buf += ret;
+ zc->sr.done_io += ret;
+ req->flags |= REQ_F_PARTIAL_IO;
+ return -EAGAIN;
+ }
+ req_set_fail(req);
+ } else if ((flags & MSG_WAITALL) && (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
+out_free:
+ req_set_fail(req);
+ }
+
+ if (ret > 0)
+ ret += zc->sr.done_io;
+ else if (zc->sr.done_io)
+ ret = zc->sr.done_io;
+ else
+ io_kbuf_recycle(req, issue_flags);
+
+ cflags = io_put_kbuf(req, issue_flags);
+ if (msg.msg_inq)
+ cflags |= IORING_CQE_F_SOCK_NONEMPTY;
+
+ if (!io_recv_finish(req, &ret, cflags, ret <= 0))
+ goto retry_multishot;
+
+ return ret;
+}
+
void io_sendzc_cleanup(struct io_kiocb *req)
{
struct io_sendzc *zc = io_kiocb_to_cmd(req, struct io_sendzc);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index bf28c43117c3..f3782e7b707b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -497,6 +497,20 @@ const struct io_op_def io_op_defs[] = {
.prep = io_provide_ifq_region_prep,
.issue = io_provide_ifq_region,
},
+ [IORING_OP_RECV_ZC] = {
+ .name = "RECV_ZC",
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollin = 1,
+ .buffer_select = 1,
+ .ioprio = 1,
+#if defined(CONFIG_NET)
+ .prep = io_recvzc_prep,
+ .issue = io_recvzc,
+#else
+ .prep = io_eopnotsupp_prep,
+#endif
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 8bebe7c36c82..1d334ac55c0b 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -7,6 +7,7 @@
#include <linux/io_uring.h>
#include <linux/netdevice.h>
#include <linux/nospec.h>
+#include <net/tcp.h>
#include <uapi/linux/io_uring.h>
@@ -393,3 +394,356 @@ bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page)
return true;
}
+
+static inline bool
+zctap_skb_ours(struct sk_buff *skb)
+{
+ return skb->pp_recycle;
+}
+
+struct zctap_read_desc {
+ struct iov_iter *iter;
+ struct ifq_region *ifr;
+ u32 iov_space;
+ u32 iov_limit;
+ u32 recv_limit;
+
+ struct io_kiocb req;
+ u8 *buf;
+ size_t offset;
+ size_t buflen;
+
+ struct io_zctap_ifq *ifq;
+ u16 ifq_id;
+ u16 copy_bgid; /* XXX move to register ifq? */
+};
+
+static int __zctap_get_user_buffer(struct zctap_read_desc *ztr, int len)
+{
+ if (!ztr->buflen) {
+ ztr->req = (struct io_kiocb) {
+ .ctx = ztr->ifq->ctx,
+ .buf_index = ztr->copy_bgid,
+ };
+
+ ztr->buf = (u8 *)io_zctap_buffer(&ztr->req, &ztr->buflen);
+ ztr->offset = 0;
+ }
+ return len > ztr->buflen ? ztr->buflen : len;
+}
+
+static int zctap_copy_data(struct zctap_read_desc *ztr, int len, u8 *kaddr)
+{
+ struct io_uring_zctap_iov zov;
+ u32 space;
+ int err;
+
+ space = ztr->iov_space + sizeof(zov);
+ if (space > ztr->iov_limit)
+ return 0;
+
+ len = __zctap_get_user_buffer(ztr, len);
+ if (!len)
+ return -ENOBUFS;
+
+ err = copy_to_user(ztr->buf + ztr->offset, kaddr, len);
+ if (err)
+ return -EFAULT;
+
+ zov = (struct io_uring_zctap_iov) {
+ .off = ztr->offset,
+ .len = len,
+ .bgid = ztr->copy_bgid,
+ .bid = ztr->req.buf_index,
+ .ifq_id = ztr->ifq_id,
+ };
+
+ if (copy_to_iter(&zov, sizeof(zov), ztr->iter) != sizeof(zov))
+ return -EFAULT;
+
+ ztr->offset += len;
+ ztr->buflen -= len;
+
+ ztr->iov_space = space;
+
+ return len;
+}
+
+static int zctap_copy_frag(struct zctap_read_desc *ztr, struct page *page,
+ int off, int len, int id,
+ struct io_uring_zctap_iov *zov)
+{
+ u8 *kaddr;
+ int err;
+
+ len = __zctap_get_user_buffer(ztr, len);
+ if (!len)
+ return -ENOBUFS;
+
+ if (id == 0) {
+ kaddr = kmap(page) + off;
+ err = copy_to_user(ztr->buf + ztr->offset, kaddr, len);
+ kunmap(page);
+ } else {
+ kaddr = page_address(page) + off;
+ err = copy_to_user(ztr->buf + ztr->offset, kaddr, len);
+ }
+
+ if (err)
+ return -EFAULT;
+
+ *zov = (struct io_uring_zctap_iov) {
+ .off = ztr->offset,
+ .len = len,
+ .bgid = ztr->copy_bgid,
+ .bid = ztr->req.buf_index,
+ .ifq_id = ztr->ifq_id,
+ };
+
+ ztr->offset += len;
+ ztr->buflen -= len;
+
+ return len;
+}
+
+static int zctap_recv_frag(struct zctap_read_desc *ztr,
+ const skb_frag_t *frag, int off, int len)
+{
+ struct io_uring_zctap_iov zov;
+ struct page *page;
+ int id, pgid;
+ u32 space;
+
+ space = ztr->iov_space + sizeof(zov);
+ if (space > ztr->iov_limit)
+ return 0;
+
+ page = skb_frag_page(frag);
+ id = zctap_page_ifq_id(page);
+ off += skb_frag_off(frag);
+
+ if (likely(id == ztr->ifq_id)) {
+ pgid = zctap_page_id(page);
+ io_add_page_uref(ztr->ifr, pgid);
+ zov = (struct io_uring_zctap_iov) {
+ .off = off,
+ .len = len,
+ .bgid = zctap_page_region_id(page),
+ .bid = pgid,
+ .ifq_id = id,
+ };
+ } else {
+ len = zctap_copy_frag(ztr, page, off, len, id, &zov);
+ if (len <= 0)
+ return len;
+ }
+
+ if (copy_to_iter(&zov, sizeof(zov), ztr->iter) != sizeof(zov))
+ return -EFAULT;
+
+ ztr->iov_space = space;
+
+ return len;
+}
+
+/* Our version of __skb_datagram_iter -- should work for UDP also. */
+static int
+zctap_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+ unsigned int offset, size_t len)
+{
+ struct zctap_read_desc *ztr = desc->arg.data;
+ unsigned start, start_off;
+ struct sk_buff *frag_iter;
+ int i, copy, end, ret = 0;
+
+ if (ztr->iov_space >= ztr->iov_limit) {
+ desc->count = 0;
+ return 0;
+ }
+ if (len > ztr->recv_limit)
+ len = ztr->recv_limit;
+
+ start = skb_headlen(skb);
+ start_off = offset;
+
+ if (offset < start) {
+ copy = start - offset;
+ if (copy > len)
+ copy = len;
+
+ /* copy out linear data */
+ ret = zctap_copy_data(ztr, copy, skb->data + offset);
+ if (ret < 0)
+ goto out;
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ const skb_frag_t *frag;
+
+ WARN_ON(start > offset + len);
+
+ frag = &skb_shinfo(skb)->frags[i];
+ end = start + skb_frag_size(frag);
+
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ ret = zctap_recv_frag(ztr, frag, offset - start, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+ skb_walk_frags(skb, frag_iter) {
+ WARN_ON(start > offset + len);
+
+ end = start + frag_iter->len;
+ if (offset < end) {
+ int off;
+
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = zctap_recv_skb(desc, frag_iter, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+out:
+ if (offset == start_off)
+ return ret;
+ return offset - start_off;
+}
+
+static int __io_zctap_tcp_read(struct sock *sk, struct zctap_read_desc *zrd)
+{
+ read_descriptor_t rd_desc = {
+ .arg.data = zrd,
+ .count = 1,
+ };
+
+ return tcp_read_sock(sk, &rd_desc, zctap_recv_skb);
+}
+
+static int io_zctap_tcp_recvmsg(struct sock *sk, struct zctap_read_desc *zrd,
+ int flags, int *addr_len)
+{
+ size_t used;
+ long timeo;
+ int ret;
+
+ ret = used = 0;
+
+ lock_sock(sk);
+
+ timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+ while (zrd->recv_limit) {
+ ret = __io_zctap_tcp_read(sk, zrd);
+ if (ret < 0)
+ break;
+ if (!ret) {
+ if (used)
+ break;
+ if (sock_flag(sk, SOCK_DONE))
+ break;
+ if (sk->sk_err) {
+ ret = sock_error(sk);
+ break;
+ }
+ if (sk->sk_shutdown & RCV_SHUTDOWN)
+ break;
+ if (sk->sk_state == TCP_CLOSE) {
+ ret = -ENOTCONN;
+ break;
+ }
+ if (!timeo) {
+ ret = -EAGAIN;
+ break;
+ }
+ if (!skb_queue_empty(&sk->sk_receive_queue))
+ break;
+ sk_wait_data(sk, &timeo, NULL);
+ if (signal_pending(current)) {
+ ret = sock_intr_errno(timeo);
+ break;
+ }
+ continue;
+ }
+ zrd->recv_limit -= ret;
+ used += ret;
+
+ if (!timeo)
+ break;
+ release_sock(sk);
+ lock_sock(sk);
+
+ if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
+ (sk->sk_shutdown & RCV_SHUTDOWN) ||
+ signal_pending(current))
+ break;
+ }
+
+ release_sock(sk);
+
+ /* XXX, handle timestamping */
+
+ if (used)
+ return used;
+
+ return ret;
+}
+
+int io_zctap_recv(struct io_zctap_ifq *ifq, struct socket *sock,
+ struct msghdr *msg, int flags, u32 datalen, u16 copy_bgid)
+{
+ struct sock *sk = sock->sk;
+ struct zctap_read_desc zrd = {
+ .iov_limit = msg_data_left(msg),
+ .recv_limit = datalen,
+ .iter = &msg->msg_iter,
+ .ifq = ifq,
+ .ifq_id = ifq->id,
+ .copy_bgid = copy_bgid,
+ .ifr = ifq->region,
+ };
+ const struct proto *prot;
+ int addr_len = 0;
+ int ret;
+
+ if (flags & MSG_ERRQUEUE)
+ return -EOPNOTSUPP;
+
+ prot = READ_ONCE(sk->sk_prot);
+ if (prot->recvmsg != tcp_recvmsg)
+ return -EPROTONOSUPPORT;
+
+ sock_rps_record_flow(sk);
+
+ ret = io_zctap_tcp_recvmsg(sk, &zrd, flags, &addr_len);
+ if (ret >= 0) {
+ msg->msg_namelen = addr_len;
+ ret = zrd.iov_space;
+ }
+ return ret;
+}
diff --git a/io_uring/zctap.h b/io_uring/zctap.h
index 709c803220f4..2c3e23a6a07a 100644
--- a/io_uring/zctap.h
+++ b/io_uring/zctap.h
@@ -12,4 +12,9 @@ int io_provide_ifq_region_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe);
int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_zctap_recv(struct io_zctap_ifq *ifq, struct socket *sock,
+ struct msghdr *msg, int flags, u32 datalen, u16 copy_bgid);
+
#endif
--
2.30.2
* Re: [RFC v1 0/9] zero-copy RX for io_uring
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
` (8 preceding siblings ...)
2022-10-07 21:17 ` [RFC v1 9/9] io_uring: add OP_RECV_ZC command Jonathan Lemon
@ 2022-10-10 7:37 ` dust.li
2022-10-10 19:34 ` Jonathan Lemon
9 siblings, 1 reply; 12+ messages in thread
From: dust.li @ 2022-10-10 7:37 UTC (permalink / raw)
To: Jonathan Lemon, io-uring
On Fri, Oct 07, 2022 at 02:17:04PM -0700, Jonathan Lemon wrote:
>This series is an RFC for io_uring/zctap. This is an evolution of
>the earlier zctap work, re-targeted to use io_uring as the userspace
>API. The current code is intended to provide a zero-copy RX path for
>upper-level networking protocols (aka TCP and UDP). The current draft
>focuses on host-provided memory (not GPU memory).
>
>This RFC contains the upper-level core code required for operation,
>with the intent of soliciting feedback on the general API. This does
>not contain the network driver side changes required for complete
>operation. Also please note that as an RFC, there are some things
>which are incomplete or in need of rework.
>
>The intent is to use a network driver which provides header/data
>splitting, so the frame header (which is processed by the networking
>stack) does not reside in user memory.
>
>The code is roughly working (in that it has successfully received
>a TCP stream from a remote sender), but as an RFC, the intent is
>to solicit feedback on the API and overall design. The current code
>will also work with system pages, copying the data out to the
>application - this is intended as a fallback/testing path.
>
>High level description:
>
>The application allocates a frame backing store, and provides this
>to the kernel for use. An interface queue is requested from the
>networking device, and incoming frames are deposited into the provided
>memory region.
>
>Responsibility for correctly steering incoming frames to the queue
>is outside the scope of this work - it is assumed that the user
>has set steering rules up separately.
>
>Incoming frames are sent up the stack as skb's and eventually
>land in the application's socket receive queue. This differs
>from AF_XDP, which receives raw frames directly to userspace,
>without protocol processing.
>
>The RECV_ZC opcode then returns an iov[] style vector which points
>to the data in userspace memory. When the application has completed
>processing of the data, the buffer is returned back to the kernel
>through a fill ring for reuse.
Interesting work! Any userspace demo and performance data?
>
>Jonathan Lemon (9):
> io_uring: add zctap ifq definition
> netdevice: add SETUP_ZCTAP to the netdev_bpf structure
> io_uring: add register ifq opcode
> io_uring: add provide_ifq_region opcode
> io_uring: Add io_uring zctap iov structure and helpers
> io_uring: introduce reference tracking for user pages.
> page_pool: add page allocation and free hooks.
> io_uring: provide functions for the page_pool.
> io_uring: add OP_RECV_ZC command.
>
> include/linux/io_uring.h | 24 ++
> include/linux/io_uring_types.h | 10 +
> include/linux/netdevice.h | 6 +
> include/net/page_pool.h | 6 +
> include/uapi/linux/io_uring.h | 26 ++
> io_uring/Makefile | 3 +-
> io_uring/io_uring.c | 10 +
> io_uring/kbuf.c | 13 +
> io_uring/kbuf.h | 2 +
> io_uring/net.c | 123 ++++++
> io_uring/opdef.c | 23 +
> io_uring/zctap.c | 749 +++++++++++++++++++++++++++++++++
> io_uring/zctap.h | 20 +
> net/core/page_pool.c | 41 +-
> 14 files changed, 1048 insertions(+), 8 deletions(-)
> create mode 100644 io_uring/zctap.c
> create mode 100644 io_uring/zctap.h
>
>--
>2.30.2
* Re: [RFC v1 0/9] zero-copy RX for io_uring
2022-10-10 7:37 ` [RFC v1 0/9] zero-copy RX for io_uring dust.li
@ 2022-10-10 19:34 ` Jonathan Lemon
0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Lemon @ 2022-10-10 19:34 UTC (permalink / raw)
To: dust.li, Jonathan Lemon, io-uring
On 10/10/22 12:37 AM, dust.li wrote:
> On Fri, Oct 07, 2022 at 02:17:04PM -0700, Jonathan Lemon wrote:
>> This series is an RFC for io_uring/zctap. This is an evolution of
>> the earlier zctap work, re-targeted to use io_uring as the userspace
>> API. The current code is intended to provide a zero-copy RX path for
>> upper-level networking protocols (aka TCP and UDP). The current draft
>> focuses on host-provided memory (not GPU memory).
>>
>> This RFC contains the upper-level core code required for operation,
>> with the intent of soliciting feedback on the general API. This does
>> not contain the network driver side changes required for complete
>> operation. Also please note that as an RFC, there are some things
>> which are incomplete or in need of rework.
>>
>> The intent is to use a network driver which provides header/data
>> splitting, so the frame header (which is processed by the networking
>> stack) does not reside in user memory.
>>
>> The code is roughly working (in that it has successfully received
>> a TCP stream from a remote sender), but as an RFC, the intent is
>> to solicit feedback on the API and overall design. The current code
>> will also work with system pages, copying the data out to the
>> application - this is intended as a fallback/testing path.
>>
>> High level description:
>>
>> The application allocates a frame backing store, and provides this
>> to the kernel for use. An interface queue is requested from the
>> networking device, and incoming frames are deposited into the provided
>> memory region.
>>
>> Responsibility for correctly steering incoming frames to the queue
>> is outside the scope of this work - it is assumed that the user
>> has set steering rules up separately.
>>
>> Incoming frames are sent up the stack as skb's and eventually
>> land in the application's socket receive queue. This differs
>> from AF_XDP, which receives raw frames directly to userspace,
>> without protocol processing.
>>
>> The RECV_ZC opcode then returns an iov[] style vector which points
>> to the data in userspace memory. When the application has completed
>> processing of the data, the buffer is returned back to the kernel
>> through a fill ring for reuse.
>
> Interesting work! Any userspace demo and performance data?
Coming soon! I'm hoping to get feedback on the overall API though.
Did you have any thoughts here?
--
Jonathan
Thread overview: 12+ messages
2022-10-07 21:17 [RFC v1 0/9] zero-copy RX for io_uring Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 1/9] io_uring: add zctap ifq definition Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 2/9] netdevice: add SETUP_ZCTAP to the netdev_bpf structure Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 3/9] io_uring: add register ifq opcode Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 4/9] io_uring: add provide_ifq_region opcode Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 5/9] io_uring: Add io_uring zctap iov structure and helpers Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 6/9] io_uring: introduce reference tracking for user pages Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 7/9] page_pool: add page allocation and free hooks Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 8/9] io_uring: provide functions for the page_pool Jonathan Lemon
2022-10-07 21:17 ` [RFC v1 9/9] io_uring: add OP_RECV_ZC command Jonathan Lemon
2022-10-10 7:37 ` [RFC v1 0/9] zero-copy RX for io_uring dust.li
2022-10-10 19:34 ` Jonathan Lemon