* [PATCH] io_uring/net: don't fail linked ops when done_io > 0
@ 2026-02-26 22:03 Hannes Furmans
From: Hannes Furmans @ 2026-02-26 22:03 UTC
To: Jens Axboe; +Cc: io-uring, linux-kernel, stable, Hannes Furmans
When io_uring recv/send with MSG_WAITALL accumulates partial data
through done_io and then encounters an error or EOF, req_set_fail()
sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
with -ECANCELED, even though the user-visible result indicates success.
This manifests in two code paths:
1) Direct completion: io_recv/io_send fall through to req_set_fail()
when ret < min_ret, even if done_io > 0. The CQE shows done_io
(positive) but REQ_F_FAIL severs the link chain.
2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
request moves to io-wq. io_recv returns IOU_RETRY from the
MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
done_io but leaves REQ_F_FAIL set.
Fix this by:
- Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
- Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0
This makes MSG_WAITALL partial completions consistent with
non-MSG_WAITALL behavior, where positive results never sever the
IO_LINK chain.
Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
where the sender closes after partial data. The recv CQE shows positive
bytes but the linked write gets -ECANCELED.
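A minimal liburing sketch of that reproducer (illustrative only,
error handling elided):

  #include <liburing.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void)
  {
      struct io_uring ring;
      struct io_uring_sqe *sqe;
      struct io_uring_cqe *cqe;
      char buf[256];
      int sv[2], i;

      socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
      write(sv[1], buf, 64);   /* partial data... */
      close(sv[1]);            /* ...then the sender closes */

      io_uring_queue_init(4, &ring, 0);

      /* recv asks for 256 bytes with MSG_WAITALL, linked to a write */
      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_recv(sqe, sv[0], buf, sizeof(buf), MSG_WAITALL);
      sqe->flags |= IOSQE_IO_LINK;

      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_write(sqe, STDOUT_FILENO, buf, 64, 0);

      io_uring_submit(&ring);
      for (i = 0; i < 2; i++) {
          io_uring_wait_cqe(&ring, &cqe);
          /* observed: recv res=64 (positive), write res=-ECANCELED */
          fprintf(stderr, "cqe res=%d\n", cqe->res);
          io_uring_cqe_seen(&ring, cqe);
      }
      return 0;
  }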
Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
---
io_uring/net.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 8576c6cb2236..ebe51db34af8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -576,7 +576,8 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
}
io_req_msg_cleanup(req, issue_flags);
if (ret >= 0)
@@ -688,7 +689,8 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
}
if (ret >= 0)
ret += sr->done_io;
@@ -1074,7 +1076,8 @@ int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
req_set_fail(req);
}
@@ -1220,7 +1223,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
out_free:
req_set_fail(req);
@@ -1498,7 +1502,8 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!zc->done_io)
+ req_set_fail(req);
}
if (ret >= 0)
@@ -1570,7 +1575,8 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
}
if (ret >= 0)
@@ -1595,8 +1601,10 @@ void io_sendrecv_fail(struct io_kiocb *req)
{
struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
- if (sr->done_io)
+ if (sr->done_io) {
req->cqe.res = sr->done_io;
+ req->flags &= ~REQ_F_FAIL;
+ }
if ((req->flags & REQ_F_NEED_CLEANUP) &&
(req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
--
2.53.0
* Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
From: Stefan Metzmacher @ 2026-02-27 13:59 UTC
To: Hannes Furmans, Jens Axboe; +Cc: io-uring, linux-kernel, stable, Hannes Furmans
Hi Hannes,
On 26.02.26 23:03, Hannes Furmans wrote:
> When io_uring recv/send with MSG_WAITALL accumulates partial data
> through done_io and then encounters an error or EOF, req_set_fail()
> sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
> io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
> with -ECANCELED, even though the user-visible result indicates success.
>
> This manifests in two code paths:
>
> 1) Direct completion: io_recv/io_send fall through to req_set_fail()
> when ret < min_ret, even if done_io > 0. The CQE shows done_io
> (positive) but REQ_F_FAIL severs the link chain.
>
> 2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
> request moves to io-wq. io_recv returns IOU_RETRY from the
> MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
> io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
> done_io but leaves REQ_F_FAIL set.
>
> Fix this by:
> - Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
> io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
> - Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0
>
> This makes MSG_WAITALL partial completions consistent with
> non-MSG_WAITALL behavior, where positive results never sever the
> IO_LINK chain.
>
> Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
> where the sender closes after partial data. The recv CQE shows positive
> bytes but the linked write gets -ECANCELED.
>
> Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
That's by design: if a MSG_WAITALL call fails, it means
not all the data the caller expected arrived or was sent.
When there's a LINK after that, the linked operation likely
relies on all expected data having been processed! Otherwise
the message stream can get out of sync and cause corruption.
Let's assume I want to send a message header with
IO_SEND linked with an IO_SPLICE to send the payload.

If the IO_SEND returns short, the situation needs to be
recovered by the caller instead of letting the
IO_SPLICE feed more data to the socket.
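I.e. something like this (sketch; hdr, sock_fd, pipe_fd and
payload_len are whatever the protocol uses):

  /* header send linked to the payload splice; the splice must
   * not run if the header went out short */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_send(sqe, sock_fd, &hdr, sizeof(hdr), MSG_WAITALL);
  sqe->flags |= IOSQE_IO_LINK;

  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_splice(sqe, pipe_fd, -1, sock_fd, -1, payload_len, 0);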
So the current behavior is exactly what MSG_WAITALL
gives you. If you don't want that, why are you using it
at all?
metze
* Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
From: Hannes Furmans @ 2026-02-27 16:14 UTC
To: Stefan Metzmacher
Cc: Jens Axboe, io-uring, linux-kernel, stable, Hannes Furmans
Hi Stefan,
On 27.02.26 14:59, Stefan Metzmacher wrote:
> That's by design: if a MSG_WAITALL call fails, it means
> not all the data the caller expected arrived or was sent.
> When there's a LINK after that, the linked operation likely
> relies on all expected data having been processed! Otherwise
> the message stream can get out of sync and cause corruption.
You're right — a short MSG_WAITALL read should sever the IO_LINK
chain. The v1 patch was wrong to guard req_set_fail() on done_io > 0.
> Let's assume I want to send a message header with
> IO_SEND linked with an IO_SPLICE to send the payload.
>
> If the IO_SEND returns short, the situation needs to be
> recovered by the caller instead of letting the
> IO_SPLICE feed more data to the socket.
Agreed, the linked operation expects the complete data.
> So the current behavior is exactly what MSG_WAITALL
> gives you. If you don't want that, why are you using it
> at all?
The actual bug is narrower. I traced the root cause with kTLS.
When IORING_OP_RECV is used with MSG_WAITALL on a kTLS socket,
the recv completes successfully (ret >= min_ret, full requested
amount received). But kTLS calls put_cmsg(SOL_TLS,
TLS_GET_RECORD_TYPE) for the first record of each recvmsg call
(tls_sw.c:1843). Since io_recv sets up the msghdr with
msg_control=NULL and msg_controllen=0, put_cmsg sets MSG_CTRUNC.
Then io_recv hits the else-if branch:
} else if ((flags & MSG_WAITALL) &&
(msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
req_set_fail(req);
}
This sets REQ_F_FAIL on a fully successful recv. The CQE shows
the full byte count, but the linked write gets -ECANCELED.
I confirmed this with ftrace — the recv completes with
result=67108864 (exactly 64MB requested), then
io_uring_fail_link fires immediately after from an io-wq worker.
I also confirmed with a plain recvmsg debug tool that kTLS
returns msg_flags=0x88 (MSG_EOR | MSG_CTRUNC) on every call.
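The core of that debug tool is just this (sketch; fd is an
already-established kTLS socket):

  char buf[65536];
  struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
  struct msghdr msg = { 0 };  /* msg_control = NULL, controllen = 0:
                               * the same setup io_recv uses */
  ssize_t ret;

  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  ret = recvmsg(fd, &msg, MSG_WAITALL);
  printf("recvmsg: ret=%zd msg_flags=%#x\n", ret, msg.msg_flags);
  /* kTLS: prints msg_flags=0x88, i.e. MSG_EOR | MSG_CTRUNC */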
Your commit 0031275d119e says "For IORING_OP_RECVMSG we also
check for the MSG_TRUNC and MSG_CTRUNC flags" but the code
applies the check to IORING_OP_RECV as well. MSG_CTRUNC is
meaningful for IORING_OP_RECVMSG (user provides a cmsg buffer).
It's meaningless for IORING_OP_RECV which never has a cmsg
buffer.
I'll send a v2 that only removes MSG_CTRUNC from the io_recv
check.
Thanks,
Hannes
* Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
From: Stefan Metzmacher @ 2026-02-27 16:26 UTC
To: Hannes Furmans; +Cc: Jens Axboe, io-uring, linux-kernel, stable, Hannes Furmans
On 27.02.26 17:14, Hannes Furmans wrote:
> Hi Stefan,
>
> On 27.02.26 14:59, Stefan Metzmacher wrote:
>> That's by design: if a MSG_WAITALL call fails, it means
>> not all the data the caller expected arrived or was sent.
>> When there's a LINK after that, the linked operation likely
>> relies on all expected data having been processed! Otherwise
>> the message stream can get out of sync and cause corruption.
>
> You're right — a short MSG_WAITALL read should sever the IO_LINK
> chain. The v1 patch was wrong to guard req_set_fail() on done_io > 0.
>
>> Let's assume I want to send a message header with
>> IO_SEND linked with an IO_SPLICE to send the payload.
>>
>> If the IO_SEND returns short, the situation needs to be
>> recovered by the caller instead of letting the
>> IO_SPLICE feed more data to the socket.
>
> Agreed, the linked operation expects the complete data.
>
>> So the current behavior is exactly what MSG_WAITALL
>> gives you. If you don't want that, why are you using it
>> at all?
>
> The actual bug is narrower. I traced the root cause with kTLS.
>
> When IORING_OP_RECV is used with MSG_WAITALL on a kTLS socket,
> the recv completes successfully (ret >= min_ret, full requested
> amount received). But kTLS calls put_cmsg(SOL_TLS,
> TLS_GET_RECORD_TYPE) for the first record of each recvmsg call
> (tls_sw.c:1843). Since io_recv sets up the msghdr with
> msg_control=NULL and msg_controllen=0, put_cmsg sets MSG_CTRUNC.
>
> Then io_recv hits the else-if branch:
>
> } else if ((flags & MSG_WAITALL) &&
> (msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
> req_set_fail(req);
> }
>
> This sets REQ_F_FAIL on a fully successful recv. The CQE shows
> the full byte count, but the linked write gets -ECANCELED.
>
> I confirmed this with ftrace — the recv completes with
> result=67108864 (exactly 64MB requested), then
> io_uring_fail_link fires immediately after from an io-wq worker.
> I also confirmed with a plain recvmsg debug tool that kTLS
> returns msg_flags=0x88 (MSG_EOR | MSG_CTRUNC) on every call.
>
> Your commit 0031275d119e says "For IORING_OP_RECVMSG we also
> check for the MSG_TRUNC and MSG_CTRUNC flags" but the code
> applies the check to IORING_OP_RECV as well. MSG_CTRUNC is
> meaningful for IORING_OP_RECVMSG (user provides a cmsg buffer).
> It's meaningless for IORING_OP_RECV which never has a cmsg
> buffer.
>
> I'll send a v2 that only removes MSG_CTRUNC from the io_recv
> check.
Sounds good :-)
Thanks!
metze
* [PATCH v2] io_uring/net: don't check MSG_CTRUNC for IORING_OP_RECV
From: Hannes Furmans @ 2026-02-27 16:27 UTC
To: Jens Axboe
Cc: Stefan Metzmacher, io-uring, linux-kernel, stable, Hannes Furmans
IORING_OP_RECV sets up the msghdr with msg_control=NULL and
msg_controllen=0, as it has no cmsg support. Any socket layer that
calls put_cmsg() will find no buffer space and set MSG_CTRUNC in
msg_flags. This is expected — the caller didn't ask for control data.
However, io_recv checks:
if ((flags & MSG_WAITALL) && (msg_flags & (MSG_TRUNC | MSG_CTRUNC)))
req_set_fail(req);
This sets REQ_F_FAIL on a fully successful recv (ret >= min_ret) when
MSG_CTRUNC is set, which causes io_disarm_next() to cancel all linked
operations with -ECANCELED. The recv CQE shows the full requested byte
count, yet linked operations are cancelled.
This is triggered by kTLS, which calls put_cmsg(SOL_TLS,
TLS_GET_RECORD_TYPE) for the first record of each recvmsg call
in tls_record_content_type() (tls_sw.c), but it affects any
protocol that delivers cmsg data from the kernel side.
The MSG_CTRUNC check was introduced by commit 0031275d119e ("io_uring:
call req_set_fail_links() on short send[msg]()/recv[msg]() with
MSG_WAITALL") whose commit message states "For IORING_OP_RECVMSG we
also check for the MSG_TRUNC and MSG_CTRUNC flags", but the code
applied the check to IORING_OP_RECV as well. MSG_CTRUNC is meaningful
for IORING_OP_RECVMSG where the user provides a cmsg buffer —
truncation there means lost metadata. It is meaningless for
IORING_OP_RECV which never provides a cmsg buffer.
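For comparison, a recvmsg caller that does want the record type
passes a control buffer, along the lines of the example in
Documentation/networking/tls.rst (sketch):

  char buf[16384], cbuf[CMSG_SPACE(sizeof(unsigned char))];
  struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
  struct msghdr msg = { 0 };
  struct cmsghdr *cmsg;
  unsigned char record_type;

  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cbuf;
  msg.msg_controllen = sizeof(cbuf);
  recvmsg(fd, &msg, 0);
  cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg && cmsg->cmsg_level == SOL_TLS &&
      cmsg->cmsg_type == TLS_GET_RECORD_TYPE)
      record_type = *CMSG_DATA(cmsg);  /* MSG_CTRUNC not set */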
Remove MSG_CTRUNC from the io_recv check. The io_recvmsg check is
left unchanged as MSG_CTRUNC is meaningful there.
Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
---
v2: v1 incorrectly guarded req_set_fail() for all done_io > 0 cases.
Stefan Metzmacher correctly pointed out that short MSG_WAITALL
reads should still sever the link chain.
Root-caused via ftrace + msg_flags inspection on a real kTLS
connection (TLS 1.3, AES-128-GCM, S3 download):
ftrace shows io_uring_fail_link firing immediately after
io_uring_complete with result=67108864 (full 64MB), from io-wq:
iou-wrk-52242 io_uring_complete: req ..., result 67108864
iou-wrk-52242 io_uring_fail_link: opcode RECV, link ...
A debug recvmsg on the same kTLS socket shows:
recvmsg: ret=67108864 msg_flags=0x88 (MSG_EOR | MSG_CTRUNC)
MSG_CTRUNC is always set because kTLS calls put_cmsg() but
IORING_OP_RECV provides no cmsg buffer.
io_uring/net.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 8576c6cb2236..8baaf74e8f8d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1221,7 +1221,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
if (ret == -ERESTARTSYS)
ret = -EINTR;
req_set_fail(req);
- } else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
+ } else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & MSG_TRUNC)) {
out_free:
req_set_fail(req);
}
--
2.53.0