public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH v2 0/2] io_uring zc rx fixed len recvzc
@ 2025-02-21 20:51 David Wei
  2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
  2025-02-21 20:51 ` [PATCH v2 2/2] io_uring/zcrx: add selftest case for " David Wei
  0 siblings, 2 replies; 20+ messages in thread
From: David Wei @ 2025-02-21 20:51 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov

Currently only multishot recvzc requests are supported in io_uring zc
rx, but sometimes there is a need to do a single recvzc e.g. peeking at
some data in the socket.

In this series, add single shot recvzc request and a selftest for the
feature.

Changes in v2:
--------------
* Consistently use u32/unsigned int for len
* Remove nowait semantics, request will not complete until requested len
  has been received
* Always set REQ_F_APOLL_MULTISHOT
* Fix return codes from io_recvzc request
* Fix changing len if set to UINT_MAX in io_zcrx_recv_skb()
* Use read_desc->count

David Wei (2):
  io_uring/zcrx: add single shot recvzc
  io_uring/zcrx: add selftest case for single shot recvzc

 io_uring/net.c                                | 19 +++++++-
 io_uring/zcrx.c                               | 17 +++++---
 io_uring/zcrx.h                               |  2 +-
 .../selftests/drivers/net/hw/iou-zcrx.c       | 43 ++++++++++++++++---
 .../selftests/drivers/net/hw/iou-zcrx.py      | 27 +++++++++++-
 5 files changed, 92 insertions(+), 16 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-21 20:51 [PATCH v2 0/2] io_uring zc rx fixed len recvzc David Wei
@ 2025-02-21 20:51 ` David Wei
  2025-02-22  0:08   ` Jens Axboe
                     ` (3 more replies)
  2025-02-21 20:51 ` [PATCH v2 2/2] io_uring/zcrx: add selftest case for " David Wei
  1 sibling, 4 replies; 20+ messages in thread
From: David Wei @ 2025-02-21 20:51 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov

Currently only multishot recvzc requests are supported, but sometimes
there is a need to do a single recv e.g. peeking at some data in the
socket. Add single shot recvzc requests where IORING_RECV_MULTISHOT is
_not_ set and the sqe->len field is set to the number of bytes to read
N.

There could be multiple completions containing data, like the multishot
case, since N bytes could be split across multiple frags. This is
followed by a final completion with res and cflags both set to 0 that
indicate the completion of the request, or a -res that indicate an
error.

Signed-off-by: David Wei <[email protected]>
---
 io_uring/net.c  | 19 +++++++++++++++++--
 io_uring/zcrx.c | 17 ++++++++++++-----
 io_uring/zcrx.h |  2 +-
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 000dc70d08d0..cae34a24266c 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -94,6 +94,7 @@ struct io_recvzc {
 	struct file			*file;
 	unsigned			msg_flags;
 	u16				flags;
+	u32				len;
 	struct io_zcrx_ifq		*ifq;
 };
 
@@ -1241,7 +1242,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	unsigned ifq_idx;
 
 	if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
-		     sqe->len || sqe->addr3))
+		     sqe->addr3))
 		return -EINVAL;
 
 	ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
@@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	zc->ifq = req->ctx->ifq;
 	if (!zc->ifq)
 		return -EINVAL;
+	zc->len = READ_ONCE(sqe->len);
+	if (zc->len == UINT_MAX)
+		return -EINVAL;
+	/* UINT_MAX means no limit on readlen */
+	if (!zc->len)
+		zc->len = UINT_MAX;
 
 	zc->flags = READ_ONCE(sqe->ioprio);
 	zc->msg_flags = READ_ONCE(sqe->msg_flags);
@@ -1269,6 +1276,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+	bool limit = zc->len != UINT_MAX;
 	struct socket *sock;
 	int ret;
 
@@ -1281,7 +1289,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
 		return -ENOTSOCK;
 
 	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
-			   issue_flags);
+			   issue_flags, &zc->len);
 	if (unlikely(ret <= 0) && ret != -EAGAIN) {
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
@@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
 		return IOU_OK;
 	}
 
+	if (zc->len == 0) {
+		io_req_set_res(req, 0, 0);
+
+		if (issue_flags & IO_URING_F_MULTISHOT)
+			return IOU_STOP_MULTISHOT;
+		return IOU_OK;
+	}
 	if (issue_flags & IO_URING_F_MULTISHOT)
 		return IOU_ISSUE_SKIP_COMPLETE;
 	return -EAGAIN;
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index f2d326e18e67..74bca4e471bc 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -817,6 +817,7 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 	int i, copy, end, off;
 	int ret = 0;
 
+	len = min_t(size_t, len, desc->count);
 	if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
 		return -EAGAIN;
 
@@ -894,26 +895,32 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 out:
 	if (offset == start_off)
 		return ret;
+	if (desc->count != UINT_MAX)
+		desc->count -= (offset - start_off);
 	return offset - start_off;
 }
 
 static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 				struct sock *sk, int flags,
-				unsigned issue_flags)
+				unsigned issue_flags, unsigned int *outlen)
 {
+	unsigned int len = *outlen;
+	bool limit = len != UINT_MAX;
 	struct io_zcrx_args args = {
 		.req = req,
 		.ifq = ifq,
 		.sock = sk->sk_socket,
 	};
 	read_descriptor_t rd_desc = {
-		.count = 1,
+		.count = len,
 		.arg.data = &args,
 	};
 	int ret;
 
 	lock_sock(sk);
 	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
+	if (limit && ret)
+		*outlen = len - ret;
 	if (ret <= 0) {
 		if (ret < 0 || sock_flag(sk, SOCK_DONE))
 			goto out;
@@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 		ret = IOU_REQUEUE;
 	} else if (sock_flag(sk, SOCK_DONE)) {
 		/* Make it to retry until it finally gets 0. */
-		if (issue_flags & IO_URING_F_MULTISHOT)
+		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
 			ret = IOU_REQUEUE;
 		else
 			ret = -EAGAIN;
@@ -942,7 +949,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 
 int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 		 struct socket *sock, unsigned int flags,
-		 unsigned issue_flags)
+		 unsigned issue_flags, unsigned int *len)
 {
 	struct sock *sk = sock->sk;
 	const struct proto *prot = READ_ONCE(sk->sk_prot);
@@ -951,5 +958,5 @@ int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 		return -EPROTONOSUPPORT;
 
 	sock_rps_record_flow(sk);
-	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags);
+	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags, len);
 }
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index a16bdd921f03..1b4042dc48e2 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -46,7 +46,7 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
 void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
 int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 		 struct socket *sock, unsigned int flags,
-		 unsigned issue_flags);
+		 unsigned issue_flags, unsigned int *len);
 #else
 static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 					struct io_uring_zcrx_ifq_reg __user *arg)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/2] io_uring/zcrx: add selftest case for single shot recvzc
  2025-02-21 20:51 [PATCH v2 0/2] io_uring zc rx fixed len recvzc David Wei
  2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
@ 2025-02-21 20:51 ` David Wei
  1 sibling, 0 replies; 20+ messages in thread
From: David Wei @ 2025-02-21 20:51 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov

Add a selftest case to iou-zcrx where the sender sends 4x4K = 16K and
the receiver does 4x4K single shot recvzc. Validate that the requests
are successful and the data is not corrupted.

Signed-off-by: David Wei <[email protected]>
---
 .../selftests/drivers/net/hw/iou-zcrx.c       | 43 ++++++++++++++++---
 .../selftests/drivers/net/hw/iou-zcrx.py      | 27 +++++++++++-
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 5d04dd55ae55..c26b4180eddd 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -61,6 +61,9 @@ static int cfg_port = 8000;
 static int cfg_payload_len;
 static const char *cfg_ifname;
 static int cfg_queue_id = -1;
+static bool cfg_oneshot;
+static int cfg_oneshot_recvs;
+static int cfg_send_size = SEND_SIZE;
 static struct sockaddr_in6 cfg_addr;
 
 static char payload[SEND_SIZE] __attribute__((aligned(PAGE_SIZE)));
@@ -196,6 +199,17 @@ static void add_recvzc(struct io_uring *ring, int sockfd)
 	sqe->user_data = 2;
 }
 
+static void add_recvzc_oneshot(struct io_uring *ring, int sockfd, size_t len)
+{
+	struct io_uring_sqe *sqe;
+
+	sqe = io_uring_get_sqe(ring);
+
+	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, len, 0);
+	sqe->ioprio |= IORING_RECV_MULTISHOT;
+	sqe->user_data = 2;
+}
+
 static void process_accept(struct io_uring *ring, struct io_uring_cqe *cqe)
 {
 	if (cqe->res < 0)
@@ -204,7 +218,10 @@ static void process_accept(struct io_uring *ring, struct io_uring_cqe *cqe)
 		error(1, 0, "Unexpected second connection");
 
 	connfd = cqe->res;
-	add_recvzc(ring, connfd);
+	if (cfg_oneshot)
+		add_recvzc_oneshot(ring, connfd, PAGE_SIZE);
+	else
+		add_recvzc(ring, connfd);
 }
 
 static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
@@ -218,7 +235,7 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 	ssize_t n;
 	int i;
 
-	if (cqe->res == 0 && cqe->flags == 0) {
+	if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs == 0) {
 		stop = true;
 		return;
 	}
@@ -226,8 +243,14 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 	if (cqe->res < 0)
 		error(1, 0, "recvzc(): %d", cqe->res);
 
-	if (!(cqe->flags & IORING_CQE_F_MORE))
+	if (cfg_oneshot) {
+		if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs) {
+			add_recvzc_oneshot(ring, connfd, PAGE_SIZE);
+			cfg_oneshot_recvs--;
+		}
+	} else if (!(cqe->flags & IORING_CQE_F_MORE)) {
 		add_recvzc(ring, connfd);
+	}
 
 	rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
 
@@ -237,7 +260,7 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 
 	for (i = 0; i < n; i++) {
 		if (*(data + i) != payload[(received + i)])
-			error(1, 0, "payload mismatch");
+			error(1, 0, "payload mismatch at ", i);
 	}
 	received += n;
 
@@ -313,7 +336,7 @@ static void run_server(void)
 
 static void run_client(void)
 {
-	ssize_t to_send = SEND_SIZE;
+	ssize_t to_send = cfg_send_size;
 	ssize_t sent = 0;
 	ssize_t chunk, res;
 	int fd;
@@ -360,7 +383,7 @@ static void parse_opts(int argc, char **argv)
 		usage(argv[0]);
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "46sch:p:l:i:q:")) != -1) {
+	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:")) != -1) {
 		switch (c) {
 		case 's':
 			if (cfg_client)
@@ -387,6 +410,14 @@ static void parse_opts(int argc, char **argv)
 		case 'q':
 			cfg_queue_id = strtoul(optarg, NULL, 0);
 			break;
+		case 'o': {
+			cfg_oneshot = true;
+			cfg_oneshot_recvs = strtoul(optarg, NULL, 0);
+			break;
+		}
+		case 'z':
+			cfg_send_size = strtoul(optarg, NULL, 0);
+			break;
 		}
 	}
 
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
index ea0a346c3eff..d301d9b356f7 100755
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -34,14 +34,37 @@ def test_zcrx(cfg) -> None:
         raise KsftSkipEx('at least 2 combined channels required')
     rx_ring = _get_rx_ring_entries(cfg)
 
-    rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1}"
-    tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_v6} -p 9999 -l 12840"
+    try:
+        ethtool(f"-G {cfg.ifname} rx 64", host=cfg.remote)
+        ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote)
+        flow_rule_id = _set_flow_rule(cfg, combined_chans - 1)
+
+        rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1}"
+        tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_v6} -p 9999 -l 12840"
+        with bkg(rx_cmd, host=cfg.remote, exit_wait=True):
+            wait_port_listen(9999, proto="tcp", host=cfg.remote)
+            cmd(tx_cmd)
+    finally:
+        ethtool(f"-N {cfg.ifname} delete {flow_rule_id}", host=cfg.remote)
+        ethtool(f"-X {cfg.ifname} default", host=cfg.remote)
+        ethtool(f"-G {cfg.ifname} rx {rx_ring}", host=cfg.remote)
+
+
+def test_zcrx_oneshot(cfg) -> None:
+    cfg.require_v6()
+
+    combined_chans = _get_combined_channels(cfg)
+    if combined_chans < 2:
+        raise KsftSkipEx('at least 2 combined channels required')
+    rx_ring = _get_rx_ring_entries(cfg)
 
     try:
         ethtool(f"-G {cfg.ifname} rx 64", host=cfg.remote)
         ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote)
         flow_rule_id = _set_flow_rule(cfg, combined_chans - 1)
 
+        rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1} -o 4"
+        tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_v6} -p 9999 -l 4096 -z 16384"
         with bkg(rx_cmd, host=cfg.remote, exit_wait=True):
             wait_port_listen(9999, proto="tcp", host=cfg.remote)
             cmd(tx_cmd)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
@ 2025-02-22  0:08   ` Jens Axboe
  2025-02-22  1:01     ` Pavel Begunkov
  2025-02-23 22:39     ` David Wei
  2025-02-22  0:40   ` Pavel Begunkov
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2025-02-22  0:08 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Pavel Begunkov

Just a few minor drive-by nits.

> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	zc->ifq = req->ctx->ifq;
>  	if (!zc->ifq)
>  		return -EINVAL;
> +	zc->len = READ_ONCE(sqe->len);
> +	if (zc->len == UINT_MAX)
> +		return -EINVAL;
> +	/* UINT_MAX means no limit on readlen */
> +	if (!zc->len)
> +		zc->len = UINT_MAX;

Why not just make UINT_MAX allowed, meaning no limit? Would avoid two
branches here, and as far as I can tell not really change anything in
terms of API niceness.

> @@ -1269,6 +1276,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>  {
>  	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> +	bool limit = zc->len != UINT_MAX;
>  	struct socket *sock;
>  	int ret;

Doesn't seem to be used?

> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>  		return IOU_OK;
>  	}
>  
> +	if (zc->len == 0) {
> +		io_req_set_res(req, 0, 0);
> +
> +		if (issue_flags & IO_URING_F_MULTISHOT)
> +			return IOU_STOP_MULTISHOT;
> +		return IOU_OK;
> +	}
>  	if (issue_flags & IO_URING_F_MULTISHOT)
>  		return IOU_ISSUE_SKIP_COMPLETE;
>  	return -EAGAIN;

Might be cleaner as:

	ret = -EAGAIN;
	if (!zc->len) {
		io_req_set_res(req, 0, 0);
		ret = -IOU_OK;
	}
  	if (issue_flags & IO_URING_F_MULTISHOT)
  		return IOU_ISSUE_SKIP_COMPLETE;
  	return ret;

rather than duplicate the flag checking.


>  static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  				struct sock *sk, int flags,
> -				unsigned issue_flags)
> +				unsigned issue_flags, unsigned int *outlen)
>  {
> +	unsigned int len = *outlen;
> +	bool limit = len != UINT_MAX;
>  	struct io_zcrx_args args = {
>  		.req = req,
>  		.ifq = ifq,
>  		.sock = sk->sk_socket,
>  	};
>  	read_descriptor_t rd_desc = {
> -		.count = 1,
> +		.count = len,
>  		.arg.data = &args,
>  	};
>  	int ret;
>  
>  	lock_sock(sk);
>  	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
> +	if (limit && ret)
> +		*outlen = len - ret;
>  	if (ret <= 0) {
>  		if (ret < 0 || sock_flag(sk, SOCK_DONE))
>  			goto out;
> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  		ret = IOU_REQUEUE;
>  	} else if (sock_flag(sk, SOCK_DONE)) {
>  		/* Make it to retry until it finally gets 0. */
> -		if (issue_flags & IO_URING_F_MULTISHOT)
> +		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>  			ret = IOU_REQUEUE;
>  		else
>  			ret = -EAGAIN;

In the two above hunks, the limit checking feels a bit hackish. For
example, why is it related to multishot or not? I think the patch would
benefit from being split into separate patches for singleshot and length
support. Eg one that adds singleshot support, and then one that adds
length capping on top. That would make it much easier to reason about
hunks like the above one.

> @@ -942,7 +949,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  
>  int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  		 struct socket *sock, unsigned int flags,
> -		 unsigned issue_flags)
> +		 unsigned issue_flags, unsigned int *len)
>  {
>  	struct sock *sk = sock->sk;
>  	const struct proto *prot = READ_ONCE(sk->sk_prot);

Shouldn't recv support it too?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
  2025-02-22  0:08   ` Jens Axboe
@ 2025-02-22  0:40   ` Pavel Begunkov
  2025-02-22  0:52     ` Jens Axboe
  2025-02-23 22:43     ` David Wei
  2025-02-22  0:56   ` Pavel Begunkov
  2025-02-22  8:54   ` lizetao
  3 siblings, 2 replies; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-22  0:40 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 2/21/25 20:51, David Wei wrote:
> Currently only multishot recvzc requests are supported, but sometimes
> there is a need to do a single recv e.g. peeking at some data in the
> socket. Add single shot recvzc requests where IORING_RECV_MULTISHOT is
> _not_ set and the sqe->len field is set to the number of bytes to read
> N.

There is no oneshot, we need to change the message.


> There could be multiple completions containing data, like the multishot
> case, since N bytes could be split across multiple frags. This is
> followed by a final completion with res and cflags both set to 0 that
> indicate the completion of the request, or a -res that indicate an
> error.
> 
> Signed-off-by: David Wei <[email protected]>
> ---
>   io_uring/net.c  | 19 +++++++++++++++++--
>   io_uring/zcrx.c | 17 ++++++++++++-----
>   io_uring/zcrx.h |  2 +-
>   3 files changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 000dc70d08d0..cae34a24266c 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -94,6 +94,7 @@ struct io_recvzc {
>   	struct file			*file;
>   	unsigned			msg_flags;
>   	u16				flags;
> +	u32				len;
>   	struct io_zcrx_ifq		*ifq;
>   };
>   
> @@ -1241,7 +1242,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   	unsigned ifq_idx;
>   
>   	if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
> -		     sqe->len || sqe->addr3))
> +		     sqe->addr3))
>   		return -EINVAL;
>   
>   	ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   	zc->ifq = req->ctx->ifq;
>   	if (!zc->ifq)
>   		return -EINVAL;
> +	zc->len = READ_ONCE(sqe->len);
> +	if (zc->len == UINT_MAX)
> +		return -EINVAL;

The uapi gives u32, if we're using a special value it should
match the type. ~(u32)0

> +	/* UINT_MAX means no limit on readlen */
> +	if (!zc->len)
> +		zc->len = UINT_MAX;
>   
>   	zc->flags = READ_ONCE(sqe->ioprio);
>   	zc->msg_flags = READ_ONCE(sqe->msg_flags);
> @@ -1269,6 +1276,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>   {
>   	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> +	bool limit = zc->len != UINT_MAX;
>   	struct socket *sock;
>   	int ret;
>   
> @@ -1281,7 +1289,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>   		return -ENOTSOCK;
>   
>   	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
> -			   issue_flags);
> +			   issue_flags, &zc->len);
>   	if (unlikely(ret <= 0) && ret != -EAGAIN) {
>   		if (ret == -ERESTARTSYS)
>   			ret = -EINTR;
> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>   		return IOU_OK;
>   	}
>   
> +	if (zc->len == 0) {

If len hits zero we should always complete it, regardless
of errors the stack might have returned, so might be
cleaner if you do that check right after io_zcrx_recv().

> +		io_req_set_res(req, 0, 0);
> +
> +		if (issue_flags & IO_URING_F_MULTISHOT)
> +			return IOU_STOP_MULTISHOT;
> +		return IOU_OK;
> +	}
>   	if (issue_flags & IO_URING_F_MULTISHOT)
>   		return IOU_ISSUE_SKIP_COMPLETE;
>   	return -EAGAIN;
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index f2d326e18e67..74bca4e471bc 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -817,6 +817,7 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>   	int i, copy, end, off;
>   	int ret = 0;
>   
> +	len = min_t(size_t, len, desc->count);
>   	if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
>   		return -EAGAIN;
>   
> @@ -894,26 +895,32 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>   out:
>   	if (offset == start_off)
>   		return ret;
> +	if (desc->count != UINT_MAX)
> +		desc->count -= (offset - start_off);

I'd say just set desc->count to it's max value (size_t), and
never care about checking for limits after.

>   	return offset - start_off;
>   }
>   
>   static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>   				struct sock *sk, int flags,
> -				unsigned issue_flags)
> +				unsigned issue_flags, unsigned int *outlen)
>   {
> +	unsigned int len = *outlen;
> +	bool limit = len != UINT_MAX;
>   	struct io_zcrx_args args = {
>   		.req = req,
>   		.ifq = ifq,
>   		.sock = sk->sk_socket,
>   	};
>   	read_descriptor_t rd_desc = {
> -		.count = 1,
> +		.count = len,
>   		.arg.data = &args,
>   	};
>   	int ret;
>   
>   	lock_sock(sk);
>   	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
> +	if (limit && ret)
> +		*outlen = len - ret;
>   	if (ret <= 0) {
>   		if (ret < 0 || sock_flag(sk, SOCK_DONE))
>   			goto out;
> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>   		ret = IOU_REQUEUE;
>   	} else if (sock_flag(sk, SOCK_DONE)) {
>   		/* Make it to retry until it finally gets 0. */
> -		if (issue_flags & IO_URING_F_MULTISHOT)
> +		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>   			ret = IOU_REQUEUE;

And with earlier len check in net.c you don't need this change,
which feels wrong, as it's only here to circumvent some handling
in net.c, I assume

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  0:40   ` Pavel Begunkov
@ 2025-02-22  0:52     ` Jens Axboe
  2025-02-22  1:06       ` Pavel Begunkov
  2025-02-23 22:43     ` David Wei
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2025-02-22  0:52 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>       zc->ifq = req->ctx->ifq;
>>       if (!zc->ifq)
>>           return -EINVAL;
>> +    zc->len = READ_ONCE(sqe->len);
>> +    if (zc->len == UINT_MAX)
>> +        return -EINVAL;
> 
> The uapi gives u32, if we're using a special value it should
> match the type. ~(u32)0

Any syscall in Linux is capped at 2G anyway, so I think all of this
special meaning of ->len just needs to go away. Just ask for whatever
bytes you want, but yes more than 2G will not be supported anyway.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
  2025-02-22  0:08   ` Jens Axboe
  2025-02-22  0:40   ` Pavel Begunkov
@ 2025-02-22  0:56   ` Pavel Begunkov
  2025-02-23 22:44     ` David Wei
  2025-02-22  8:54   ` lizetao
  3 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-22  0:56 UTC (permalink / raw)
  To: David Wei, io-uring; +Cc: Jens Axboe

On 2/21/25 20:51, David Wei wrote:
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index f2d326e18e67..74bca4e471bc 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
...
>   static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>   				struct sock *sk, int flags,
> -				unsigned issue_flags)
> +				unsigned issue_flags, unsigned int *outlen)
>   {
> +	unsigned int len = *outlen;
> +	bool limit = len != UINT_MAX;
>   	struct io_zcrx_args args = {
>   		.req = req,
>   		.ifq = ifq,
>   		.sock = sk->sk_socket,
>   	};
>   	read_descriptor_t rd_desc = {
> -		.count = 1,
> +		.count = len,
>   		.arg.data = &args,
>   	};
>   	int ret;
>   
>   	lock_sock(sk);
>   	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
> +	if (limit && ret)
> +		*outlen = len - ret;

ret can be negative, the check will pass and the calculations
will turn it into something weird.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  0:08   ` Jens Axboe
@ 2025-02-22  1:01     ` Pavel Begunkov
  2025-02-22  1:07       ` Jens Axboe
  2025-02-23 22:39     ` David Wei
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-22  1:01 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 2/22/25 00:08, Jens Axboe wrote:
> Just a few minor drive-by nits.
> 
>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   	zc->ifq = req->ctx->ifq;
>>   	if (!zc->ifq)
>>   		return -EINVAL;
>> +	zc->len = READ_ONCE(sqe->len);
>> +	if (zc->len == UINT_MAX)
>> +		return -EINVAL;
>> +	/* UINT_MAX means no limit on readlen */
>> +	if (!zc->len)
>> +		zc->len = UINT_MAX;
> 
> Why not just make UINT_MAX allowed, meaning no limit? Would avoid two
> branches here, and as far as I can tell not really change anything in
> terms of API niceness.

I think 0 goes better as a special uapi value. It doesn't alter the
uapi, and commonly understood as "no limits", which is the opposite
to the other option, especially since UINT_MAX is not a max value for
an unlimited request, I'd easily expect it to drive more than 4GB.

  
>> @@ -1269,6 +1276,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>   {
>>   	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> +	bool limit = zc->len != UINT_MAX;
>>   	struct socket *sock;
>>   	int ret;
> 
> Doesn't seem to be used?
> 
>> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>   		return IOU_OK;
>>   	}
>>   
>> +	if (zc->len == 0) {
>> +		io_req_set_res(req, 0, 0);
>> +
>> +		if (issue_flags & IO_URING_F_MULTISHOT)
>> +			return IOU_STOP_MULTISHOT;
>> +		return IOU_OK;
>> +	}
>>   	if (issue_flags & IO_URING_F_MULTISHOT)
>>   		return IOU_ISSUE_SKIP_COMPLETE;
>>   	return -EAGAIN;
> 
> Might be cleaner as:
> 
> 	ret = -EAGAIN;
> 	if (!zc->len) {
> 		io_req_set_res(req, 0, 0);
> 		ret = -IOU_OK;
> 	}
>    	if (issue_flags & IO_URING_F_MULTISHOT)
>    		return IOU_ISSUE_SKIP_COMPLETE;
>    	return ret;
> 
> rather than duplicate the flag checking.

You missed IOU_STOP_MULTISHOT, but even without it separate error
codes for polling is already an utter mess to try be smart about it.
I'll try to clean it up, but that's orthogonal.

...
>> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>   		ret = IOU_REQUEUE;
>>   	} else if (sock_flag(sk, SOCK_DONE)) {
>>   		/* Make it to retry until it finally gets 0. */
>> -		if (issue_flags & IO_URING_F_MULTISHOT)
>> +		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>>   			ret = IOU_REQUEUE;
>>   		else
>>   			ret = -EAGAIN;
> 
> In the two above hunks, the limit checking feels a bit hackish. For
> example, why is it related to multishot or not? I think the patch would
> benefit from being split into separate patches for singleshot and length

I agree with the statement, but fwiw there are no single shots and
can't be that, the message is misleading.

> support. Eg one that adds singleshot support, and then one that adds
> length capping on top. That would make it much easier to reason about
> hunks like the above one.
> 
>> @@ -942,7 +949,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>   
>>   int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>   		 struct socket *sock, unsigned int flags,
>> -		 unsigned issue_flags)
>> +		 unsigned issue_flags, unsigned int *len)
>>   {
>>   	struct sock *sk = sock->sk;
>>   	const struct proto *prot = READ_ONCE(sk->sk_prot);
> 
> Shouldn't recv support it too?

There is no "msg" variant of the request, if that's what you mean.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  0:52     ` Jens Axboe
@ 2025-02-22  1:06       ` Pavel Begunkov
  2025-02-22  1:09         ` Jens Axboe
  2025-02-22  1:09         ` Pavel Begunkov
  0 siblings, 2 replies; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-22  1:06 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 2/22/25 00:52, Jens Axboe wrote:
>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>        zc->ifq = req->ctx->ifq;
>>>        if (!zc->ifq)
>>>            return -EINVAL;
>>> +    zc->len = READ_ONCE(sqe->len);
>>> +    if (zc->len == UINT_MAX)
>>> +        return -EINVAL;
>>
>> The uapi gives u32, if we're using a special value it should
>> match the type. ~(u32)0
> 
> Any syscall in Linux is capped at 2G anyway, so I think all of this

I don't see how it related, you don't have to have a weird
00111111b as a special value.

> special meaning of ->len just needs to go away. Just ask for whatever
> bytes you want, but yes more than 2G will not be supported anyway.

That's not the case here, the request does support more than 2G,
it's just spread across multiple CQEs, and the limit accounts
for multiple CQEs.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  1:01     ` Pavel Begunkov
@ 2025-02-22  1:07       ` Jens Axboe
  2025-02-23 22:35         ` David Wei
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2025-02-22  1:07 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 2/21/25 6:01 PM, Pavel Begunkov wrote:
> On 2/22/25 00:08, Jens Axboe wrote:
>> Just a few minor drive-by nits.
>>
>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>       zc->ifq = req->ctx->ifq;
>>>       if (!zc->ifq)
>>>           return -EINVAL;
>>> +    zc->len = READ_ONCE(sqe->len);
>>> +    if (zc->len == UINT_MAX)
>>> +        return -EINVAL;
>>> +    /* UINT_MAX means no limit on readlen */
>>> +    if (!zc->len)
>>> +        zc->len = UINT_MAX;
>>
>> Why not just make UINT_MAX allowed, meaning no limit? Would avoid two
>> branches here, and as far as I can tell not really change anything in
>> terms of API niceness.
> 
> I think 0 goes better as a special uapi value. It doesn't alter the
> uapi, and commonly understood as "no limits", which is the opposite
> to the other option, especially since UINT_MAX is not a max value for
> an unlimited request, I'd easily expect it to drive more than 4GB.

Yeah that's certainly better, and as you say also has the same (forced)
semantics for multishot.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  1:06       ` Pavel Begunkov
@ 2025-02-22  1:09         ` Jens Axboe
  2025-02-22  1:15           ` Pavel Begunkov
  2025-02-22  1:09         ` Pavel Begunkov
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2025-02-22  1:09 UTC (permalink / raw)
  To: Pavel Begunkov, David Wei, io-uring

On 2/21/25 6:06 PM, Pavel Begunkov wrote:
> On 2/22/25 00:52, Jens Axboe wrote:
>>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>        zc->ifq = req->ctx->ifq;
>>>>        if (!zc->ifq)
>>>>            return -EINVAL;
>>>> +    zc->len = READ_ONCE(sqe->len);
>>>> +    if (zc->len == UINT_MAX)
>>>> +        return -EINVAL;
>>>
>>> The uapi gives u32, if we're using a special value it should
>>> match the type. ~(u32)0
>>
>> Any syscall in Linux is capped at 2G anyway, so I think all of this
> 
> I don't see how it related, you don't have to have a weird
> 00111111b as a special value.
> 
>> special meaning of ->len just needs to go away. Just ask for whatever
>> bytes you want, but yes more than 2G will not be supported anyway.
> 
> That's not the case here, the request does support more than 2G,
> it's just spread across multiple CQEs, and the limit accounts
> for multiple CQEs.

All pretty moot if we just go with 0 as the "transfer whatever length
you want", as obviously each individual transfer will be limited anyway.
Which is the better choice than having some odd 4G value.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  1:06       ` Pavel Begunkov
  2025-02-22  1:09         ` Jens Axboe
@ 2025-02-22  1:09         ` Pavel Begunkov
  1 sibling, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-22  1:09 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 2/22/25 01:06, Pavel Begunkov wrote:
> On 2/22/25 00:52, Jens Axboe wrote:
>>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>        zc->ifq = req->ctx->ifq;
>>>>        if (!zc->ifq)
>>>>            return -EINVAL;
>>>> +    zc->len = READ_ONCE(sqe->len);
>>>> +    if (zc->len == UINT_MAX)
>>>> +        return -EINVAL;
>>>
>>> The uapi gives u32, if we're using a special value it should
>>> match the type. ~(u32)0
>>
>> Any syscall in Linux is capped at 2G anyway, so I think all of this
> 
> I don't see how it related, you don't have to have a weird
> 00111111b as a special value.
> 
>> special meaning of ->len just needs to go away. Just ask for whatever

And there will need to be some special value for that. Setting
the maximum limit to 4GB is reasonable, but when it's crunching
data without limits 4GB is nothing. IMHO not worth growing the
uapi field u32 -> u64.

>> bytes you want, but yes more than 2G will not be supported anyway.
> 
> That's not the case here, the request does support more than 2G,
> it's just spread across multiple CQEs, and the limit accounts
> for multiple CQEs.
> 

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  1:09         ` Jens Axboe
@ 2025-02-22  1:15           ` Pavel Begunkov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-22  1:15 UTC (permalink / raw)
  To: Jens Axboe, David Wei, io-uring

On 2/22/25 01:09, Jens Axboe wrote:
> On 2/21/25 6:06 PM, Pavel Begunkov wrote:
>> On 2/22/25 00:52, Jens Axboe wrote:
>>>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>         zc->ifq = req->ctx->ifq;
>>>>>         if (!zc->ifq)
>>>>>             return -EINVAL;
>>>>> +    zc->len = READ_ONCE(sqe->len);
>>>>> +    if (zc->len == UINT_MAX)
>>>>> +        return -EINVAL;
>>>>
>>>> The uapi gives u32, if we're using a special value it should
>>>> match the type. ~(u32)0
>>>
>>> Any syscall in Linux is capped at 2G anyway, so I think all of this
>>
>> I don't see how it related, you don't have to have a weird
>> 00111111b as a special value.
>>
>>> special meaning of ->len just needs to go away. Just ask for whatever
>>> bytes you want, but yes more than 2G will not be supported anyway.
>>
>> That's not the case here, the request does support more than 2G,
>> it's just spread across multiple CQEs, and the limit accounts
>> for multiple CQEs.
> 
> All pretty moot if we just go with 0 as the "transfer whatever length
> you want", as obviously each individual transfer will be limited anyway.
> Which is the better choice than having some odd 4G value.

I think so as well, and we'd need to check for 0 otherwise.
And it's probably fine to reserve a U32_MAX as above if it
has a chance to make (even if in the future) handling easier,
not like prep is performance sensitive.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
                     ` (2 preceding siblings ...)
  2025-02-22  0:56   ` Pavel Begunkov
@ 2025-02-22  8:54   ` lizetao
  2025-02-24  0:17     ` David Wei
  3 siblings, 1 reply; 20+ messages in thread
From: lizetao @ 2025-02-22  8:54 UTC (permalink / raw)
  To: David Wei; +Cc: Jens Axboe, Pavel Begunkov, [email protected]

Hi,

> -----Original Message-----
> From: David Wei <[email protected]>
> Sent: Saturday, February 22, 2025 4:52 AM
> To: [email protected]
> Cc: Jens Axboe <[email protected]>; Pavel Begunkov <[email protected]>
> Subject: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
> 
> Currently only multishot recvzc requests are supported, but sometimes there is
> a need to do a single recv e.g. peeking at some data in the socket. Add single
> shot recvzc requests where IORING_RECV_MULTISHOT is _not_ set and the sqe-
> >len field is set to the number of bytes to read N.
> 
> There could be multiple completions containing data, like the multishot case,
> since N bytes could be split across multiple frags. This is followed by a final
> completion with res and cflags both set to 0 that indicate the completion of the
> request, or a -res that indicate an error.
> 
> Signed-off-by: David Wei <[email protected]>
> ---
>  io_uring/net.c  | 19 +++++++++++++++++--  io_uring/zcrx.c | 17 ++++++++++++--
> ---  io_uring/zcrx.h |  2 +-
>  3 files changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/io_uring/net.c b/io_uring/net.c index 000dc70d08d0..cae34a24266c
> 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -94,6 +94,7 @@ struct io_recvzc {
>  	struct file			*file;
>  	unsigned			msg_flags;
>  	u16				flags;
> +	u32				len;
>  	struct io_zcrx_ifq		*ifq;
>  };
> 
> @@ -1241,7 +1242,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct
> io_uring_sqe *sqe)
>  	unsigned ifq_idx;
> 
>  	if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
> -		     sqe->len || sqe->addr3))
> +		     sqe->addr3))
>  		return -EINVAL;
> 
>  	ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx); @@ -1250,6 +1251,12 @@
> int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	zc->ifq = req->ctx->ifq;
>  	if (!zc->ifq)
>  		return -EINVAL;
> +	zc->len = READ_ONCE(sqe->len);
> +	if (zc->len == UINT_MAX)
> +		return -EINVAL;
> +	/* UINT_MAX means no limit on readlen */

What is the difference between unlimited read length and IO_URING_F_MULTISHOT?
In what specific scenarios would unlimited read length be used?

> +	if (!zc->len)
> +		zc->len = UINT_MAX;
> 
>  	zc->flags = READ_ONCE(sqe->ioprio);
>  	zc->msg_flags = READ_ONCE(sqe->msg_flags); @@ -1269,6 +1276,7
> @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)  {
>  	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> +	bool limit = zc->len != UINT_MAX;

It is not used.

>  	struct socket *sock;
>  	int ret;
> 
> @@ -1281,7 +1289,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int
> issue_flags)
>  		return -ENOTSOCK;
> 
>  	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
> -			   issue_flags);
> +			   issue_flags, &zc->len);
>  	if (unlikely(ret <= 0) && ret != -EAGAIN) {
>  		if (ret == -ERESTARTSYS)
>  			ret = -EINTR;
> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int
> issue_flags)
>  		return IOU_OK;
>  	}
> 
> +	if (zc->len == 0) {
> +		io_req_set_res(req, 0, 0);
> +
> +		if (issue_flags & IO_URING_F_MULTISHOT)
> +			return IOU_STOP_MULTISHOT;
> +		return IOU_OK;
> +	}
>  	if (issue_flags & IO_URING_F_MULTISHOT)
>  		return IOU_ISSUE_SKIP_COMPLETE;
>  	return -EAGAIN;
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index
> f2d326e18e67..74bca4e471bc 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -817,6 +817,7 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct
> sk_buff *skb,
>  	int i, copy, end, off;
>  	int ret = 0;
> 
> +	len = min_t(size_t, len, desc->count);
>  	if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
>  		return -EAGAIN;
> 
> @@ -894,26 +895,32 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct
> sk_buff *skb,
>  out:
>  	if (offset == start_off)
>  		return ret;
> +	if (desc->count != UINT_MAX)
> +		desc->count -= (offset - start_off);
>  	return offset - start_off;
>  }
> 
>  static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  				struct sock *sk, int flags,
> -				unsigned issue_flags)
> +				unsigned issue_flags, unsigned int *outlen)
>  {
> +	unsigned int len = *outlen;
> +	bool limit = len != UINT_MAX;
>  	struct io_zcrx_args args = {
>  		.req = req,
>  		.ifq = ifq,
>  		.sock = sk->sk_socket,
>  	};
>  	read_descriptor_t rd_desc = {
> -		.count = 1,
> +		.count = len,
>  		.arg.data = &args,
>  	};
>  	int ret;
> 
>  	lock_sock(sk);
>  	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
> +	if (limit && ret)
> +		*outlen = len - ret;
>  	if (ret <= 0) {
>  		if (ret < 0 || sock_flag(sk, SOCK_DONE))
>  			goto out;
> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req,
> struct io_zcrx_ifq *ifq,
>  		ret = IOU_REQUEUE;
>  	} else if (sock_flag(sk, SOCK_DONE)) {
>  		/* Make it to retry until it finally gets 0. */
> -		if (issue_flags & IO_URING_F_MULTISHOT)
> +		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>  			ret = IOU_REQUEUE;
>  		else
>  			ret = -EAGAIN;
> @@ -942,7 +949,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req,
> struct io_zcrx_ifq *ifq,
> 
>  int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  		 struct socket *sock, unsigned int flags,
> -		 unsigned issue_flags)
> +		 unsigned issue_flags, unsigned int *len)
>  {
>  	struct sock *sk = sock->sk;
>  	const struct proto *prot = READ_ONCE(sk->sk_prot); @@ -951,5
> +958,5 @@ int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>  		return -EPROTONOSUPPORT;
> 
>  	sock_rps_record_flow(sk);
> -	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags);
> +	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags, len);
>  }
> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h index
> a16bdd921f03..1b4042dc48e2 100644
> --- a/io_uring/zcrx.h
> +++ b/io_uring/zcrx.h
> @@ -46,7 +46,7 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);  void
> io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);  int io_zcrx_recv(struct io_kiocb
> *req, struct io_zcrx_ifq *ifq,
>  		 struct socket *sock, unsigned int flags,
> -		 unsigned issue_flags);
> +		 unsigned issue_flags, unsigned int *len);
>  #else
>  static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
>  					struct io_uring_zcrx_ifq_reg __user
> *arg)
> --
> 2.43.5
> 
> 

---
Li Zetao

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  1:07       ` Jens Axboe
@ 2025-02-23 22:35         ` David Wei
  2025-02-24 12:49           ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: David Wei @ 2025-02-23 22:35 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring

On 2025-02-21 17:07, Jens Axboe wrote:
> On 2/21/25 6:01 PM, Pavel Begunkov wrote:
>> On 2/22/25 00:08, Jens Axboe wrote:
>>> Just a few minor drive-by nits.
>>>
>>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>       zc->ifq = req->ctx->ifq;
>>>>       if (!zc->ifq)
>>>>           return -EINVAL;
>>>> +    zc->len = READ_ONCE(sqe->len);
>>>> +    if (zc->len == UINT_MAX)
>>>> +        return -EINVAL;
>>>> +    /* UINT_MAX means no limit on readlen */
>>>> +    if (!zc->len)
>>>> +        zc->len = UINT_MAX;
>>>
>>> Why not just make UINT_MAX allowed, meaning no limit? Would avoid two
>>> branches here, and as far as I can tell not really change anything in
>>> terms of API niceness.
>>
>> I think 0 goes better as a special uapi value. It doesn't alter the
>> uapi, and commonly understood as "no limits", which is the opposite
>> to the other option, especially since UINT_MAX is not a max value for
>> an unlimited request, I'd easily expect it to drive more than 4GB.
> 
> Yeah that's certainly better, and as you say also has the same (forced)
> semantics for multishot.
> 

I thought about using 0 originally, but needed a way to distinguish 0
meaning no limit vs a limited read hitting 0 and completing. I could
store a flag in the request at recvzc_prep() time as an alternative.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  0:08   ` Jens Axboe
  2025-02-22  1:01     ` Pavel Begunkov
@ 2025-02-23 22:39     ` David Wei
  1 sibling, 0 replies; 20+ messages in thread
From: David Wei @ 2025-02-23 22:39 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: Pavel Begunkov

On 2025-02-21 16:08, Jens Axboe wrote:
> Just a few minor drive-by nits.
> 
>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>  	zc->ifq = req->ctx->ifq;
>>  	if (!zc->ifq)
>>  		return -EINVAL;
>> +	zc->len = READ_ONCE(sqe->len);
>> +	if (zc->len == UINT_MAX)
>> +		return -EINVAL;
>> +	/* UINT_MAX means no limit on readlen */
>> +	if (!zc->len)
>> +		zc->len = UINT_MAX;
> 
> Why not just make UINT_MAX allowed, meaning no limit? Would avoid two
> branches here, and as far as I can tell not really change anything in
> terms of API niceness.
> 
>> @@ -1269,6 +1276,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>  int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>  {
>>  	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> +	bool limit = zc->len != UINT_MAX;
>>  	struct socket *sock;
>>  	int ret;
> 
> Doesn't seem to be used?

Oops, I'll remove it.

> 
>> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>  		return IOU_OK;
>>  	}
>>  
>> +	if (zc->len == 0) {
>> +		io_req_set_res(req, 0, 0);
>> +
>> +		if (issue_flags & IO_URING_F_MULTISHOT)
>> +			return IOU_STOP_MULTISHOT;
>> +		return IOU_OK;
>> +	}
>>  	if (issue_flags & IO_URING_F_MULTISHOT)
>>  		return IOU_ISSUE_SKIP_COMPLETE;
>>  	return -EAGAIN;
> 
> Might be cleaner as:
> 
> 	ret = -EAGAIN;
> 	if (!zc->len) {
> 		io_req_set_res(req, 0, 0);
> 		ret = -IOU_OK;
> 	}
>   	if (issue_flags & IO_URING_F_MULTISHOT)
>   		return IOU_ISSUE_SKIP_COMPLETE;
>   	return ret;
> 
> rather than duplicate the flag checking.
> 
> 
>>  static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  				struct sock *sk, int flags,
>> -				unsigned issue_flags)
>> +				unsigned issue_flags, unsigned int *outlen)
>>  {
>> +	unsigned int len = *outlen;
>> +	bool limit = len != UINT_MAX;
>>  	struct io_zcrx_args args = {
>>  		.req = req,
>>  		.ifq = ifq,
>>  		.sock = sk->sk_socket,
>>  	};
>>  	read_descriptor_t rd_desc = {
>> -		.count = 1,
>> +		.count = len,
>>  		.arg.data = &args,
>>  	};
>>  	int ret;
>>  
>>  	lock_sock(sk);
>>  	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
>> +	if (limit && ret)
>> +		*outlen = len - ret;
>>  	if (ret <= 0) {
>>  		if (ret < 0 || sock_flag(sk, SOCK_DONE))
>>  			goto out;
>> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  		ret = IOU_REQUEUE;
>>  	} else if (sock_flag(sk, SOCK_DONE)) {
>>  		/* Make it to retry until it finally gets 0. */
>> -		if (issue_flags & IO_URING_F_MULTISHOT)
>> +		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>>  			ret = IOU_REQUEUE;
>>  		else
>>  			ret = -EAGAIN;
> 
> In the two above hunks, the limit checking feels a bit hackish. For
> example, why is it related to multishot or not? I think the patch would
> benefit from being split into separate patches for singleshot and length
> support. Eg one that adds singleshot support, and then one that adds
> length capping on top. That would make it much easier to reason about
> hunks like the above one.
> 
>> @@ -942,7 +949,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  
>>  int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  		 struct socket *sock, unsigned int flags,
>> -		 unsigned issue_flags)
>> +		 unsigned issue_flags, unsigned int *len)
>>  {
>>  	struct sock *sk = sock->sk;
>>  	const struct proto *prot = READ_ONCE(sk->sk_prot);
> 
> Shouldn't recv support it too?
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  0:40   ` Pavel Begunkov
  2025-02-22  0:52     ` Jens Axboe
@ 2025-02-23 22:43     ` David Wei
  1 sibling, 0 replies; 20+ messages in thread
From: David Wei @ 2025-02-23 22:43 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe

On 2025-02-21 16:40, Pavel Begunkov wrote:
> On 2/21/25 20:51, David Wei wrote:
>> Currently only multishot recvzc requests are supported, but sometimes
>> there is a need to do a single recv e.g. peeking at some data in the
>> socket. Add single shot recvzc requests where IORING_RECV_MULTISHOT is
>> _not_ set and the sqe->len field is set to the number of bytes to read
>> N.
> 
> There is no oneshot, we need to change the message.

I'll make commit subject/descriptions consistent. There is no single
shot, only multishot with and without limits.

> 
> 
>> There could be multiple completions containing data, like the multishot
>> case, since N bytes could be split across multiple frags. This is
>> followed by a final completion with res and cflags both set to 0 that
>> indicate the completion of the request, or a -res that indicate an
>> error.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>   io_uring/net.c  | 19 +++++++++++++++++--
>>   io_uring/zcrx.c | 17 ++++++++++++-----
>>   io_uring/zcrx.h |  2 +-
>>   3 files changed, 30 insertions(+), 8 deletions(-)
>>
>> diff --git a/io_uring/net.c b/io_uring/net.c
>> index 000dc70d08d0..cae34a24266c 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -94,6 +94,7 @@ struct io_recvzc {
>>       struct file            *file;
>>       unsigned            msg_flags;
>>       u16                flags;
>> +    u32                len;
>>       struct io_zcrx_ifq        *ifq;
>>   };
>>   @@ -1241,7 +1242,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>       unsigned ifq_idx;
>>         if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
>> -             sqe->len || sqe->addr3))
>> +             sqe->addr3))
>>           return -EINVAL;
>>         ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>       zc->ifq = req->ctx->ifq;
>>       if (!zc->ifq)
>>           return -EINVAL;
>> +    zc->len = READ_ONCE(sqe->len);
>> +    if (zc->len == UINT_MAX)
>> +        return -EINVAL;
> 
> The uapi gives u32, if we're using a special value it should
> match the type. ~(u32)0
> 
>> +    /* UINT_MAX means no limit on readlen */
>> +    if (!zc->len)
>> +        zc->len = UINT_MAX;
>>         zc->flags = READ_ONCE(sqe->ioprio);
>>       zc->msg_flags = READ_ONCE(sqe->msg_flags);
>> @@ -1269,6 +1276,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>   {
>>       struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> +    bool limit = zc->len != UINT_MAX;
>>       struct socket *sock;
>>       int ret;
>>   @@ -1281,7 +1289,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>           return -ENOTSOCK;
>>         ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
>> -               issue_flags);
>> +               issue_flags, &zc->len);
>>       if (unlikely(ret <= 0) && ret != -EAGAIN) {
>>           if (ret == -ERESTARTSYS)
>>               ret = -EINTR;
>> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
>>           return IOU_OK;
>>       }
>>   +    if (zc->len == 0) {
> 
> If len hits zero we should always complete it, regardless
> of errors the stack might have returned, so might be
> cleaner if you do that check right after io_zcrx_recv().

Sounds good.

> 
>> +        io_req_set_res(req, 0, 0);
>> +
>> +        if (issue_flags & IO_URING_F_MULTISHOT)
>> +            return IOU_STOP_MULTISHOT;
>> +        return IOU_OK;
>> +    }
>>       if (issue_flags & IO_URING_F_MULTISHOT)
>>           return IOU_ISSUE_SKIP_COMPLETE;
>>       return -EAGAIN;
>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>> index f2d326e18e67..74bca4e471bc 100644
>> --- a/io_uring/zcrx.c
>> +++ b/io_uring/zcrx.c
>> @@ -817,6 +817,7 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>       int i, copy, end, off;
>>       int ret = 0;
>>   +    len = min_t(size_t, len, desc->count);
>>       if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
>>           return -EAGAIN;
>>   @@ -894,26 +895,32 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>   out:
>>       if (offset == start_off)
>>           return ret;
>> +    if (desc->count != UINT_MAX)
>> +        desc->count -= (offset - start_off);
> 
> I'd say just set desc->count to it's max value (size_t), and
> never care about checking for limits after.

True, we're limited by IO_SKBS_PER_CALL_LIMIT.

> 
>>       return offset - start_off;
>>   }
>>     static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>                   struct sock *sk, int flags,
>> -                unsigned issue_flags)
>> +                unsigned issue_flags, unsigned int *outlen)
>>   {
>> +    unsigned int len = *outlen;
>> +    bool limit = len != UINT_MAX;
>>       struct io_zcrx_args args = {
>>           .req = req,
>>           .ifq = ifq,
>>           .sock = sk->sk_socket,
>>       };
>>       read_descriptor_t rd_desc = {
>> -        .count = 1,
>> +        .count = len,
>>           .arg.data = &args,
>>       };
>>       int ret;
>>         lock_sock(sk);
>>       ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
>> +    if (limit && ret)
>> +        *outlen = len - ret;
>>       if (ret <= 0) {
>>           if (ret < 0 || sock_flag(sk, SOCK_DONE))
>>               goto out;
>> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>           ret = IOU_REQUEUE;
>>       } else if (sock_flag(sk, SOCK_DONE)) {
>>           /* Make it to retry until it finally gets 0. */
>> -        if (issue_flags & IO_URING_F_MULTISHOT)
>> +        if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>>               ret = IOU_REQUEUE;
> 
> And with earlier len check in net.c you don't need this change,
> which feels wrong, as it's only here to circumvent some handling
> in net.c, I assume
> 

Yeah, I don't think this is needed anymore, will remove.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  0:56   ` Pavel Begunkov
@ 2025-02-23 22:44     ` David Wei
  0 siblings, 0 replies; 20+ messages in thread
From: David Wei @ 2025-02-23 22:44 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe

On 2025-02-21 16:56, Pavel Begunkov wrote:
> On 2/21/25 20:51, David Wei wrote:
>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>> index f2d326e18e67..74bca4e471bc 100644
>> --- a/io_uring/zcrx.c
>> +++ b/io_uring/zcrx.c
> ...
>>   static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>                   struct sock *sk, int flags,
>> -                unsigned issue_flags)
>> +                unsigned issue_flags, unsigned int *outlen)
>>   {
>> +    unsigned int len = *outlen;
>> +    bool limit = len != UINT_MAX;
>>       struct io_zcrx_args args = {
>>           .req = req,
>>           .ifq = ifq,
>>           .sock = sk->sk_socket,
>>       };
>>       read_descriptor_t rd_desc = {
>> -        .count = 1,
>> +        .count = len,
>>           .arg.data = &args,
>>       };
>>       int ret;
>>         lock_sock(sk);
>>       ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
>> +    if (limit && ret)
>> +        *outlen = len - ret;
> 
> ret can be negative, the check will pass and the calculations
> will turn it into something weird.
> 

:facepalm:

ret is an int and if (ret) is any non-zero value... I'll be more
explicit here to say ret > 0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-22  8:54   ` lizetao
@ 2025-02-24  0:17     ` David Wei
  0 siblings, 0 replies; 20+ messages in thread
From: David Wei @ 2025-02-24  0:17 UTC (permalink / raw)
  To: lizetao; +Cc: Jens Axboe, Pavel Begunkov, [email protected]

On 2025-02-22 00:54, lizetao wrote:
> Hi,
> 
>> -----Original Message-----
>> From: David Wei <[email protected]>
>> Sent: Saturday, February 22, 2025 4:52 AM
>> To: [email protected]
>> Cc: Jens Axboe <[email protected]>; Pavel Begunkov <[email protected]>
>> Subject: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
>>
>> Currently only multishot recvzc requests are supported, but sometimes there is
>> a need to do a single recv e.g. peeking at some data in the socket. Add single
>> shot recvzc requests where IORING_RECV_MULTISHOT is _not_ set and the sqe-
>>> len field is set to the number of bytes to read N.
>>
>> There could be multiple completions containing data, like the multishot case,
>> since N bytes could be split across multiple frags. This is followed by a final
>> completion with res and cflags both set to 0 that indicate the completion of the
>> request, or a -res that indicate an error.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>  io_uring/net.c  | 19 +++++++++++++++++--  io_uring/zcrx.c | 17 ++++++++++++--
>> ---  io_uring/zcrx.h |  2 +-
>>  3 files changed, 30 insertions(+), 8 deletions(-)
>>
>> diff --git a/io_uring/net.c b/io_uring/net.c index 000dc70d08d0..cae34a24266c
>> 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -94,6 +94,7 @@ struct io_recvzc {
>>  	struct file			*file;
>>  	unsigned			msg_flags;
>>  	u16				flags;
>> +	u32				len;
>>  	struct io_zcrx_ifq		*ifq;
>>  };
>>
>> @@ -1241,7 +1242,7 @@ int io_recvzc_prep(struct io_kiocb *req, const struct
>> io_uring_sqe *sqe)
>>  	unsigned ifq_idx;
>>
>>  	if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
>> -		     sqe->len || sqe->addr3))
>> +		     sqe->addr3))
>>  		return -EINVAL;
>>
>>  	ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx); @@ -1250,6 +1251,12 @@
>> int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>  	zc->ifq = req->ctx->ifq;
>>  	if (!zc->ifq)
>>  		return -EINVAL;
>> +	zc->len = READ_ONCE(sqe->len);
>> +	if (zc->len == UINT_MAX)
>> +		return -EINVAL;
>> +	/* UINT_MAX means no limit on readlen */
> 
> What is the difference between unlimited read length and IO_URING_F_MULTISHOT?
> In what specific scenarios would unlimited read length be used?

I will clean up the terminology in the next revision. There isn't a
singleshot vs multishot as such. Rather, it's an unlimited multishot vs
a limited multishot. The unlimited multishot request would be a typical
multishot recv that will keep reading data from the socket. The limited
multishot request will read up to a limit N before completing the
request.

> 
>> +	if (!zc->len)
>> +		zc->len = UINT_MAX;
>>
>>  	zc->flags = READ_ONCE(sqe->ioprio);
>>  	zc->msg_flags = READ_ONCE(sqe->msg_flags); @@ -1269,6 +1276,7
>> @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)  {
>>  	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> +	bool limit = zc->len != UINT_MAX;
> 
> It is not used.
> 
>>  	struct socket *sock;
>>  	int ret;
>>
>> @@ -1281,7 +1289,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int
>> issue_flags)
>>  		return -ENOTSOCK;
>>
>>  	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
>> -			   issue_flags);
>> +			   issue_flags, &zc->len);
>>  	if (unlikely(ret <= 0) && ret != -EAGAIN) {
>>  		if (ret == -ERESTARTSYS)
>>  			ret = -EINTR;
>> @@ -1296,6 +1304,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int
>> issue_flags)
>>  		return IOU_OK;
>>  	}
>>
>> +	if (zc->len == 0) {
>> +		io_req_set_res(req, 0, 0);
>> +
>> +		if (issue_flags & IO_URING_F_MULTISHOT)
>> +			return IOU_STOP_MULTISHOT;
>> +		return IOU_OK;
>> +	}
>>  	if (issue_flags & IO_URING_F_MULTISHOT)
>>  		return IOU_ISSUE_SKIP_COMPLETE;
>>  	return -EAGAIN;
>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index
>> f2d326e18e67..74bca4e471bc 100644
>> --- a/io_uring/zcrx.c
>> +++ b/io_uring/zcrx.c
>> @@ -817,6 +817,7 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct
>> sk_buff *skb,
>>  	int i, copy, end, off;
>>  	int ret = 0;
>>
>> +	len = min_t(size_t, len, desc->count);
>>  	if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
>>  		return -EAGAIN;
>>
>> @@ -894,26 +895,32 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct
>> sk_buff *skb,
>>  out:
>>  	if (offset == start_off)
>>  		return ret;
>> +	if (desc->count != UINT_MAX)
>> +		desc->count -= (offset - start_off);
>>  	return offset - start_off;
>>  }
>>
>>  static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  				struct sock *sk, int flags,
>> -				unsigned issue_flags)
>> +				unsigned issue_flags, unsigned int *outlen)
>>  {
>> +	unsigned int len = *outlen;
>> +	bool limit = len != UINT_MAX;
>>  	struct io_zcrx_args args = {
>>  		.req = req,
>>  		.ifq = ifq,
>>  		.sock = sk->sk_socket,
>>  	};
>>  	read_descriptor_t rd_desc = {
>> -		.count = 1,
>> +		.count = len,
>>  		.arg.data = &args,
>>  	};
>>  	int ret;
>>
>>  	lock_sock(sk);
>>  	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
>> +	if (limit && ret)
>> +		*outlen = len - ret;
>>  	if (ret <= 0) {
>>  		if (ret < 0 || sock_flag(sk, SOCK_DONE))
>>  			goto out;
>> @@ -930,7 +937,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req,
>> struct io_zcrx_ifq *ifq,
>>  		ret = IOU_REQUEUE;
>>  	} else if (sock_flag(sk, SOCK_DONE)) {
>>  		/* Make it to retry until it finally gets 0. */
>> -		if (issue_flags & IO_URING_F_MULTISHOT)
>> +		if (!limit && (issue_flags & IO_URING_F_MULTISHOT))
>>  			ret = IOU_REQUEUE;
>>  		else
>>  			ret = -EAGAIN;
>> @@ -942,7 +949,7 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req,
>> struct io_zcrx_ifq *ifq,
>>
>>  int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  		 struct socket *sock, unsigned int flags,
>> -		 unsigned issue_flags)
>> +		 unsigned issue_flags, unsigned int *len)
>>  {
>>  	struct sock *sk = sock->sk;
>>  	const struct proto *prot = READ_ONCE(sk->sk_prot); @@ -951,5
>> +958,5 @@ int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
>>  		return -EPROTONOSUPPORT;
>>
>>  	sock_rps_record_flow(sk);
>> -	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags);
>> +	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags, len);
>>  }
>> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h index
>> a16bdd921f03..1b4042dc48e2 100644
>> --- a/io_uring/zcrx.h
>> +++ b/io_uring/zcrx.h
>> @@ -46,7 +46,7 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);  void
>> io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);  int io_zcrx_recv(struct io_kiocb
>> *req, struct io_zcrx_ifq *ifq,
>>  		 struct socket *sock, unsigned int flags,
>> -		 unsigned issue_flags);
>> +		 unsigned issue_flags, unsigned int *len);
>>  #else
>>  static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
>>  					struct io_uring_zcrx_ifq_reg __user
>> *arg)
>> --
>> 2.43.5
>>
>>
> 
> ---
> Li Zetao

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc
  2025-02-23 22:35         ` David Wei
@ 2025-02-24 12:49           ` Pavel Begunkov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2025-02-24 12:49 UTC (permalink / raw)
  To: David Wei, Jens Axboe, io-uring

On 2/23/25 22:35, David Wei wrote:
> On 2025-02-21 17:07, Jens Axboe wrote:
>> On 2/21/25 6:01 PM, Pavel Begunkov wrote:
>>> On 2/22/25 00:08, Jens Axboe wrote:
>>>> Just a few minor drive-by nits.
>>>>
>>>>> @@ -1250,6 +1251,12 @@ int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>        zc->ifq = req->ctx->ifq;
>>>>>        if (!zc->ifq)
>>>>>            return -EINVAL;
>>>>> +    zc->len = READ_ONCE(sqe->len);
>>>>> +    if (zc->len == UINT_MAX)
>>>>> +        return -EINVAL;
>>>>> +    /* UINT_MAX means no limit on readlen */
>>>>> +    if (!zc->len)
>>>>> +        zc->len = UINT_MAX;
>>>>
>>>> Why not just make UINT_MAX allowed, meaning no limit? Would avoid two
>>>> branches here, and as far as I can tell not really change anything in
>>>> terms of API niceness.
>>>
>>> I think 0 goes better as a special uapi value. It doesn't alter the
>>> uapi, and commonly understood as "no limits", which is the opposite
>>> to the other option, especially since UINT_MAX is not a max value for
>>> an unlimited request, I'd easily expect it to drive more than 4GB.
>>
>> Yeah that's certainly better, and as you say also has the same (forced)
>> semantics for multishot.
>>
> 
> I thought about using 0 originally, but needed a way to distinguish 0
> meaning no limit vs a limited read hitting 0 and completing. I could
> store a flag in the request at recvzc_prep() time as an alternative.

FWIW, either option is probably fine if it's for internal
handling and not uapi.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-02-24 12:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-21 20:51 [PATCH v2 0/2] io_uring zc rx fixed len recvzc David Wei
2025-02-21 20:51 ` [PATCH v2 1/2] io_uring/zcrx: add single shot recvzc David Wei
2025-02-22  0:08   ` Jens Axboe
2025-02-22  1:01     ` Pavel Begunkov
2025-02-22  1:07       ` Jens Axboe
2025-02-23 22:35         ` David Wei
2025-02-24 12:49           ` Pavel Begunkov
2025-02-23 22:39     ` David Wei
2025-02-22  0:40   ` Pavel Begunkov
2025-02-22  0:52     ` Jens Axboe
2025-02-22  1:06       ` Pavel Begunkov
2025-02-22  1:09         ` Jens Axboe
2025-02-22  1:15           ` Pavel Begunkov
2025-02-22  1:09         ` Pavel Begunkov
2025-02-23 22:43     ` David Wei
2025-02-22  0:56   ` Pavel Begunkov
2025-02-23 22:44     ` David Wei
2025-02-22  8:54   ` lizetao
2025-02-24  0:17     ` David Wei
2025-02-21 20:51 ` [PATCH v2 2/2] io_uring/zcrx: add selftest case for " David Wei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox