* [PATCHSET 0/3] Add cap for multishot recv receive size
From: Jens Axboe @ 2025-07-08 14:26 UTC
To: io-uring
Hi,
When using multishot receive to handle many simultaneous streams, a
fairness issue can occur. For each receive operation, io_uring will
keep retrying a request up to 32 times, as long as there's data
pending in the socket. Depending on data delivery timing, the amount
of data received per request can vary quite a bit. If the multishot
receive is using bundles as well, then each bundle can use up to 256
vectors of data. This is good for efficiency, but can skew fairness
between sockets.
Multishot recv does not currently support setting sqe->len; it'll
return -EINVAL if that is done. Add support for specifying the length
in the SQE, and have it apply as a per-invocation limit for each
receive. For example, if sr->len is set to 512k, then each multishot
invocation of this request will transfer at most 512k bytes.
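For illustration, here's a minimal liburing sketch of how such a
capped multishot recv could be armed, assuming this patchset is
applied. The helper names are standard liburing; 'bgid' and the 2048
cap are example values, and buffer ring registration is elided.

#include <errno.h>
#include <liburing.h>

/* Arm a bundled multishot recv, capped at 2048 bytes per invocation.
 * Assumes a provided buffer ring was registered under 'bgid'.
 */
static int arm_capped_mshot_recv(struct io_uring *ring, int sockfd,
				 unsigned short bgid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -ENOMEM;
	/* with this patchset, a non-zero len is the per-invocation cap */
	io_uring_prep_recv_multishot(sqe, sockfd, NULL, 2048, 0);
	sqe->buf_group = bgid;
	sqe->ioprio |= IORING_RECVSEND_BUNDLE;	/* optional bundling */
	io_uring_sqe_set_flags(sqe, IOSQE_BUFFER_SELECT);
	return io_uring_submit(ring);
}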
As an example, this test case sets up 4 streams and uses 32-byte
buffers for each stream. Each client will read 8k of data, or 256
buffers in total per stream. If the per-invocation limit isn't set, it
looks as follows:
axboe@m2max-kvm ~> ./recv-streams
bundle=1, mshot=1
Will receive 32768 bytes total
cqe res 8192 (bid=0, id=1)
cqe res 0 (bid=0, id=1)
id=1, done, 8192 bytes
rd switch, prev id=1, bytes=8192, total_bytes=8192
cqe res 8192 (bid=256, id=2)
cqe res 0 (bid=0, id=2)
id=2, done, 8192 bytes
rd switch, prev id=2, bytes=8192, total_bytes=8192
cqe res 8192 (bid=512, id=3)
cqe res 0 (bid=0, id=3)
id=3, done, 8192 bytes
rd switch, prev id=3, bytes=8192, total_bytes=8192
cqe res 8192 (bid=768, id=4)
id=4, done, 8192 bytes
where each stream will end up reading the full 8k before the next stream
is able to make any progress. With this patchset applied and sr->len
set to 2048, it looks like this instead:
axboe@m2max-kvm ~> ./recv-streams
bundle=1, mshot=1
Will receive 32768 bytes total
cqe res 2048 (bid=0, id=1)
rd switch, prev id=1, bytes=2048, total_bytes=2048
cqe res 2048 (bid=64, id=2)
rd switch, prev id=2, bytes=2048, total_bytes=2048
cqe res 2048 (bid=128, id=3)
rd switch, prev id=3, bytes=2048, total_bytes=2048
cqe res 2048 (bid=192, id=4)
rd switch, prev id=4, bytes=2048, total_bytes=2048
cqe res 2048 (bid=256, id=1)
rd switch, prev id=1, bytes=2048, total_bytes=4096
cqe res 2048 (bid=320, id=2)
rd switch, prev id=2, bytes=2048, total_bytes=4096
cqe res 2048 (bid=384, id=3)
rd switch, prev id=3, bytes=2048, total_bytes=4096
cqe res 2048 (bid=448, id=4)
rd switch, prev id=4, bytes=2048, total_bytes=4096
cqe res 2048 (bid=512, id=1)
rd switch, prev id=1, bytes=2048, total_bytes=6144
cqe res 2048 (bid=576, id=2)
rd switch, prev id=2, bytes=2048, total_bytes=6144
cqe res 2048 (bid=640, id=3)
rd switch, prev id=3, bytes=2048, total_bytes=6144
cqe res 2048 (bid=704, id=4)
rd switch, prev id=4, bytes=2048, total_bytes=6144
cqe res 2048 (bid=768, id=1)
rd switch, prev id=1, bytes=2048, total_bytes=8192
cqe res 2048 (bid=832, id=2)
rd switch, prev id=2, bytes=2048, total_bytes=8192
cqe res 2048 (bid=896, id=3)
rd switch, prev id=3, bytes=2048, total_bytes=8192
cqe res 2048 (bid=960, id=4)
id=4, done, 8192 bytes
where each stream gets to read 2k before switching to the next stream,
and this repeats until they've all read 8k of data. Note the buffer
IDs advancing in steps of 64: with 32-byte buffers, each 2048-byte
invocation consumes exactly 64 buffers.
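As a rough sketch, the provided buffer ring such a test would register
could look like the following. The 1024-entry count, 32-byte buffer
size, and group id 0 are assumptions inferred from the log above, not
the actual test code:

#include <liburing.h>

#define NR_BUFS		1024
#define BUF_SIZE	32

/* Register NR_BUFS 32-byte provided buffers in group 0 */
static struct io_uring_buf_ring *setup_buf_ring(struct io_uring *ring,
						unsigned char *pool)
{
	struct io_uring_buf_ring *br;
	int i, ret;

	br = io_uring_setup_buf_ring(ring, NR_BUFS, 0, 0, &ret);
	if (!br)
		return NULL;
	for (i = 0; i < NR_BUFS; i++)
		io_uring_buf_ring_add(br, pool + i * BUF_SIZE, BUF_SIZE, i,
				      io_uring_buf_ring_mask(NR_BUFS), i);
	io_uring_buf_ring_advance(br, NR_BUFS);
	return br;
}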
Patches 1 and 2 are prep patches; patch 3 implements the capping logic.
Can also be found here:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-recv-mshot-len
include/uapi/linux/io_uring.h | 9 ++++++
io_uring/net.c | 52 +++++++++++++++++++++++------------
2 files changed, 44 insertions(+), 17 deletions(-)
--
Jens Axboe
* [PATCH 1/3] io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
From: Jens Axboe @ 2025-07-08 14:26 UTC
To: io-uring; +Cc: Jens Axboe
There's plenty of space left; we just have to cleanly separate the
UAPI flags from the internal ones. This avoids needing to initialize
the internal flags separately at request prep time, or clear them
separately for request retries.
Add a mask for the UAPI flags so that a BUILD_BUG_ON() can be added if
there's ever any overlap. As of this commit, the UAPI uses the bottom
5 bits and the internal flags use the top two bits. This still leaves
room for 9 additional UAPI flags.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring.h | 9 +++++++++
io_uring/net.c | 29 ++++++++++++++++++-----------
2 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b8a0e70ee2fd..7c828fe944b1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -399,6 +399,15 @@ enum io_uring_op {
#define IORING_SEND_ZC_REPORT_USAGE (1U << 3)
#define IORING_RECVSEND_BUNDLE (1U << 4)
+/*
+ * Not immediately useful for application, just a mask of all the exposed flags
+ */
+#define IORING_RECVSEND_FLAGS_ALL (IORING_RECVSEND_POLL_FIRST | \
+ IORING_RECV_MULTISHOT | \
+ IORING_RECVSEND_FIXED_BUF | \
+ IORING_SEND_ZC_REPORT_USAGE | \
+ IORING_RECVSEND_BUNDLE)
+
/*
* cqe.res for IORING_CQE_F_NOTIF if
* IORING_SEND_ZC_REPORT_USAGE was requested
diff --git a/io_uring/net.c b/io_uring/net.c
index 43a43522f406..328301dc9a43 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -75,15 +75,21 @@ struct io_sr_msg {
u16 flags;
/* initialised and used only by !msg send variants */
u16 buf_group;
- unsigned short retry_flags;
void __user *msg_control;
/* used only for send zerocopy */
struct io_kiocb *notif;
};
+/*
+ * Can't overlap with the send/sendmsg or recv/recvmsg flags defined in
+ * the UAPI. Start high and work down.
+ */
enum sr_retry_flags {
- IO_SR_MSG_RETRY = 1,
- IO_SR_MSG_PARTIAL_MAP = 2,
+ IORING_RECV_RETRY = (1U << 15),
+ IORING_RECV_PARTIAL_MAP = (1U << 14),
+
+ IORING_RECV_RETRY_CLEAR = IORING_RECV_RETRY | IORING_RECV_PARTIAL_MAP,
+ IORING_RECV_INTERNAL = IORING_RECV_RETRY | IORING_RECV_PARTIAL_MAP,
};
/*
@@ -190,9 +196,12 @@ static inline void io_mshot_prep_retry(struct io_kiocb *req,
{
struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+ /* internal and external flags must not overlap */
+ BUILD_BUG_ON(IORING_RECVSEND_FLAGS_ALL & IORING_RECV_INTERNAL);
+
req->flags &= ~REQ_F_BL_EMPTY;
sr->done_io = 0;
- sr->retry_flags = 0;
+ sr->flags &= ~IORING_RECV_RETRY_CLEAR;
sr->len = 0; /* get from the provided buffer */
}
@@ -402,7 +411,6 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
sr->done_io = 0;
- sr->retry_flags = 0;
sr->len = READ_ONCE(sqe->len);
sr->flags = READ_ONCE(sqe->ioprio);
if (sr->flags & ~SENDMSG_FLAGS)
@@ -756,7 +764,6 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
sr->done_io = 0;
- sr->retry_flags = 0;
if (unlikely(sqe->file_index || sqe->addr2))
return -EINVAL;
@@ -828,7 +835,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
cflags |= io_put_kbufs(req, this_ret, io_bundle_nbufs(kmsg, this_ret),
issue_flags);
- if (sr->retry_flags & IO_SR_MSG_RETRY)
+ if (sr->flags & IORING_RECV_RETRY)
cflags = req->cqe.flags | (cflags & CQE_F_MASK);
/* bundle with no more immediate buffers, we're done */
if (req->flags & REQ_F_BL_EMPTY)
@@ -837,12 +844,13 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
* If more is available AND it was a full transfer, retry and
* append to this one
*/
- if (!sr->retry_flags && kmsg->msg.msg_inq > 1 && this_ret > 0 &&
+ if (!(sr->flags & IORING_RECV_INTERNAL) &&
+ kmsg->msg.msg_inq > 1 && this_ret > 0 &&
!iov_iter_count(&kmsg->msg.msg_iter)) {
req->cqe.flags = cflags & ~CQE_F_MASK;
sr->len = kmsg->msg.msg_inq;
sr->done_io += this_ret;
- sr->retry_flags |= IO_SR_MSG_RETRY;
+ sr->flags |= IORING_RECV_RETRY;
return false;
}
} else {
@@ -1088,7 +1096,7 @@ static int io_recv_buf_select(struct io_kiocb *req, struct io_async_msghdr *kmsg
req->flags |= REQ_F_NEED_CLEANUP;
}
if (arg.partial_map)
- sr->retry_flags |= IO_SR_MSG_PARTIAL_MAP;
+ sr->flags |= IORING_RECV_PARTIAL_MAP;
/* special case 1 vec, can be a fast path */
if (ret == 1) {
@@ -1283,7 +1291,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
int ret;
zc->done_io = 0;
- zc->retry_flags = 0;
if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
return -EINVAL;
--
2.50.0
* [PATCH 2/3] io_uring/net: use passed in 'len' in io_recv_buf_select()
From: Jens Axboe @ 2025-07-08 14:26 UTC
To: io-uring; +Cc: Jens Axboe
'len' is a pointer to the desired length; use that rather than
grabbing it from sr->len again. No functional changes in this patch,
but it prepares io_recv_buf_select() for being passed a value that
differs from sr->len.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/net.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 328301dc9a43..72276339e9e6 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1084,7 +1084,7 @@ static int io_recv_buf_select(struct io_kiocb *req, struct io_async_msghdr *kmsg
}
if (kmsg->msg.msg_inq > 1)
- arg.max_len = min_not_zero(sr->len, kmsg->msg.msg_inq);
+ arg.max_len = min_not_zero(*len, kmsg->msg.msg_inq);
ret = io_buffers_peek(req, &arg);
if (unlikely(ret < 0))
--
2.50.0
* [PATCH 3/3] io_uring/net: allow multishot receive per-invocation cap
From: Jens Axboe @ 2025-07-08 14:26 UTC
To: io-uring; +Cc: Jens Axboe
If an application is handling multiple receive streams using recv
multishot, then the retrying and buffer peeking done for multishot and
bundles can process too much data on one socket before moving on. This
isn't directly controllable by the application. By default, io_uring
will retry a recv MULTISHOT_MAX_RETRY (32) times, if the socket keeps
having data to receive. And if using bundles, each bundle peek will
potentially map up to PEEK_MAX_IMPORT (256) iovecs of data. Once these
limits are hit, a requeue operation is done, and the request is
retried after other pending requests have had time to execute.
Add support for capping the per-invocation receive length, after which
a requeue is considered for that receive. This is done by setting
sqe->len to the desired byte cap. For example, if it is set to 1024,
then each invocation of the receive will be requeued once 1024 bytes
have been received.
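For context, here's a hedged sketch of reaping these capped
completions from userspace. This is standard liburing CQE handling;
nothing here is new with this patch, beyond each cqe->res now being
bounded by the cap:

#include <liburing.h>

/* Drain CQEs from a capped multishot recv; each cqe->res is at most
 * the configured per-invocation cap.
 */
static void reap_completions(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	while (!io_uring_wait_cqe(ring, &cqe)) {
		unsigned int flags = cqe->flags;

		if (cqe->res > 0 && (flags & IORING_CQE_F_BUFFER)) {
			unsigned int bid = flags >> IORING_CQE_BUFFER_SHIFT;

			/* consume cqe->res bytes starting at buffer 'bid';
			 * with bundles, this may span several contiguous
			 * buffers
			 */
			(void)bid;
		}
		io_uring_cqe_seen(ring, cqe);
		/* requeues are transparent; a missing F_MORE means the
		 * multishot terminated and must be re-armed
		 */
		if (!(flags & IORING_CQE_F_MORE))
			break;
	}
}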
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/net.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 72276339e9e6..c96043c4e8ab 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -75,6 +75,7 @@ struct io_sr_msg {
u16 flags;
/* initialised and used only by !msg send variants */
u16 buf_group;
+ unsigned mshot_len;
void __user *msg_control;
/* used only for send zerocopy */
struct io_kiocb *notif;
@@ -87,9 +88,11 @@ struct io_sr_msg {
enum sr_retry_flags {
IORING_RECV_RETRY = (1U << 15),
IORING_RECV_PARTIAL_MAP = (1U << 14),
+ IORING_RECV_MSHOT_CAP = (1U << 13),
IORING_RECV_RETRY_CLEAR = IORING_RECV_RETRY | IORING_RECV_PARTIAL_MAP,
- IORING_RECV_INTERNAL = IORING_RECV_RETRY | IORING_RECV_PARTIAL_MAP,
+ IORING_RECV_INTERNAL = IORING_RECV_RETRY | IORING_RECV_PARTIAL_MAP |
+ IORING_RECV_MSHOT_CAP,
};
/*
@@ -202,7 +205,7 @@ static inline void io_mshot_prep_retry(struct io_kiocb *req,
req->flags &= ~REQ_F_BL_EMPTY;
sr->done_io = 0;
sr->flags &= ~IORING_RECV_RETRY_CLEAR;
- sr->len = 0; /* get from the provided buffer */
+ sr->len = sr->mshot_len;
}
static int io_net_import_vec(struct io_kiocb *req, struct io_async_msghdr *iomsg,
@@ -790,13 +793,14 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
sr->buf_group = req->buf_index;
req->buf_list = NULL;
}
+ sr->mshot_len = 0;
if (sr->flags & IORING_RECV_MULTISHOT) {
if (!(req->flags & REQ_F_BUFFER_SELECT))
return -EINVAL;
if (sr->msg_flags & MSG_WAITALL)
return -EINVAL;
- if (req->opcode == IORING_OP_RECV && sr->len)
- return -EINVAL;
+ if (req->opcode == IORING_OP_RECV)
+ sr->mshot_len = sr->len;
req->flags |= REQ_F_APOLL_MULTISHOT;
}
if (sr->flags & IORING_RECVSEND_BUNDLE) {
@@ -837,6 +841,8 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
issue_flags);
if (sr->flags & IORING_RECV_RETRY)
cflags = req->cqe.flags | (cflags & CQE_F_MASK);
+ if (sr->mshot_len && *ret >= sr->mshot_len)
+ sr->flags |= IORING_RECV_MSHOT_CAP;
/* bundle with no more immediate buffers, we're done */
if (req->flags & REQ_F_BL_EMPTY)
goto finish;
@@ -867,10 +873,13 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
io_mshot_prep_retry(req, kmsg);
/* Known not-empty or unknown state, retry */
if (cflags & IORING_CQE_F_SOCK_NONEMPTY || kmsg->msg.msg_inq < 0) {
- if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
+ if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY &&
+ !(sr->flags & IORING_RECV_MSHOT_CAP)) {
return false;
+ }
/* mshot retries exceeded, force a requeue */
sr->nr_multishot_loops = 0;
+ sr->flags &= ~IORING_RECV_MSHOT_CAP;
if (issue_flags & IO_URING_F_MULTISHOT)
*ret = IOU_REQUEUE;
}
@@ -1083,7 +1092,9 @@ static int io_recv_buf_select(struct io_kiocb *req, struct io_async_msghdr *kmsg
arg.mode |= KBUF_MODE_FREE;
}
- if (kmsg->msg.msg_inq > 1)
+ if (*len)
+ arg.max_len = *len;
+ else if (kmsg->msg.msg_inq > 1)
arg.max_len = min_not_zero(*len, kmsg->msg.msg_inq);
ret = io_buffers_peek(req, &arg);
--
2.50.0