public inbox for io-uring@vger.kernel.org
* [PATCH] io_uring/net: don't fail linked ops when done_io > 0
@ 2026-02-26 22:03 Hannes Furmans
  2026-02-27 13:59 ` Stefan Metzmacher
  2026-02-27 16:27 ` [PATCH v2] io_uring/net: don't check MSG_CTRUNC for IORING_OP_RECV Hannes Furmans
  0 siblings, 2 replies; 5+ messages in thread
From: Hannes Furmans @ 2026-02-26 22:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-kernel, stable, Hannes Furmans

When io_uring recv/send with MSG_WAITALL accumulates partial data
through done_io and then encounters an error or EOF, req_set_fail()
sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
with -ECANCELED, even though the user-visible result indicates success.

This manifests in two code paths:

1) Direct completion: io_recv/io_send fall through to req_set_fail()
   when ret < min_ret, even if done_io > 0. The CQE shows done_io
   (positive) but REQ_F_FAIL severs the link chain.

2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
   request moves to io-wq. io_recv returns IOU_RETRY from the
   MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
   io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
   done_io but leaves REQ_F_FAIL set.

Fix this by:
- Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
  io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
- Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0

This makes MSG_WAITALL partial completions consistent with
non-MSG_WAITALL behavior, where positive results never sever the
IO_LINK chain.

Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
where the sender closes after partial data. The recv CQE shows positive
bytes but the linked write gets -ECANCELED.

Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
---
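The short-read condition behind the reproducer can be observed without io_uring. The following is only a minimal userspace sketch, assuming a Linux AF_UNIX stream socketpair and plain recv(); the helper name waitall_after_partial_send() is hypothetical and it does not exercise IO_LINK at all. It shows that MSG_WAITALL returns a positive but short count once the sender has closed after partial data.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): write `sent` bytes into one
 * end of a UNIX stream socketpair, close that end, then request `want`
 * bytes with MSG_WAITALL on the other end.
 * Returns recv()'s result, or -1 if the setup failed.
 */
ssize_t waitall_after_partial_send(size_t sent, size_t want)
{
	char payload[64];
	char buf[64];
	int sv[2];
	ssize_t n;

	if (sent > sizeof(payload) || want > sizeof(buf))
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	memset(payload, 'x', sizeof(payload));
	if (write(sv[0], payload, sent) != (ssize_t)sent) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	close(sv[0]);	/* the sender goes away after partial data */

	/* MSG_WAITALL stops early at EOF and reports what it collected. */
	n = recv(sv[1], buf, want, MSG_WAITALL);
	close(sv[1]);
	return n;
}
```

Asking for 8 bytes after only 4 were sent yields a short count, which is the case where the CQE result is positive while the request is still marked as failed.
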
 io_uring/net.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 8576c6cb2236..ebe51db34af8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -576,7 +576,8 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
-		req_set_fail(req);
+		if (!sr->done_io)
+			req_set_fail(req);
 	}
 	io_req_msg_cleanup(req, issue_flags);
 	if (ret >= 0)
@@ -688,7 +689,8 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
-		req_set_fail(req);
+		if (!sr->done_io)
+			req_set_fail(req);
 	}
 	if (ret >= 0)
 		ret += sr->done_io;
@@ -1074,7 +1076,8 @@ int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
-		req_set_fail(req);
+		if (!sr->done_io)
+			req_set_fail(req);
 	} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
 		req_set_fail(req);
 	}
@@ -1220,7 +1223,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
-		req_set_fail(req);
+		if (!sr->done_io)
+			req_set_fail(req);
 	} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
 out_free:
 		req_set_fail(req);
@@ -1498,7 +1502,8 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
-		req_set_fail(req);
+		if (!zc->done_io)
+			req_set_fail(req);
 	}
 
 	if (ret >= 0)
@@ -1570,7 +1575,8 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
-		req_set_fail(req);
+		if (!sr->done_io)
+			req_set_fail(req);
 	}
 
 	if (ret >= 0)
@@ -1595,8 +1601,10 @@ void io_sendrecv_fail(struct io_kiocb *req)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
 
-	if (sr->done_io)
+	if (sr->done_io) {
 		req->cqe.res = sr->done_io;
+		req->flags &= ~REQ_F_FAIL;
+	}
 
 	if ((req->flags & REQ_F_NEED_CLEANUP) &&
 	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
-- 
2.53.0



* Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
  2026-02-26 22:03 [PATCH] io_uring/net: don't fail linked ops when done_io > 0 Hannes Furmans
@ 2026-02-27 13:59 ` Stefan Metzmacher
  2026-02-27 16:14   ` Hannes Furmans
  2026-02-27 16:27 ` [PATCH v2] io_uring/net: don't check MSG_CTRUNC for IORING_OP_RECV Hannes Furmans
  1 sibling, 1 reply; 5+ messages in thread
From: Stefan Metzmacher @ 2026-02-27 13:59 UTC (permalink / raw)
  To: Hannes Furmans, Jens Axboe; +Cc: io-uring, linux-kernel, stable, Hannes Furmans

Hi Hannes,

On 26.02.26 at 23:03, Hannes Furmans wrote:
> When io_uring recv/send with MSG_WAITALL accumulates partial data
> through done_io and then encounters an error or EOF, req_set_fail()
> sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
> io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
> with -ECANCELED, even though the user-visible result indicates success.
> 
> This manifests in two code paths:
> 
> 1) Direct completion: io_recv/io_send fall through to req_set_fail()
>     when ret < min_ret, even if done_io > 0. The CQE shows done_io
>     (positive) but REQ_F_FAIL severs the link chain.
> 
> 2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
>     request moves to io-wq. io_recv returns IOU_RETRY from the
>     MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
>     io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
>     done_io but leaves REQ_F_FAIL set.
> 
> Fix this by:
> - Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
>    io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
> - Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0
> 
> This makes MSG_WAITALL partial completions consistent with
> non-MSG_WAITALL behavior, where positive results never sever the
> IO_LINK chain.
> 
> Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
> where the sender closes after partial data. The recv CQE shows positive
> bytes but the linked write gets -ECANCELED.
> 
> Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")

That's by design: if a MSG_WAITALL call fails, it means
not all data the caller expected arrived or was sent.
When there's a LINK after that, the linked operation likely
relies on all expected data being processed! Otherwise
the message stream can get out of sync and cause corruption.

Let's assume I want to send a message header with
IO_SEND linked with an IO_SPLICE to send the payload.

If IO_SEND returns short the situation needs to be
recovered by the caller instead of letting the
IO_SPLICE give more data to the socket.

So the current behavior is exactly what MSG_WAITALL
gives you. If you don't want that why are you using it
at all?

metze


* Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
  2026-02-27 13:59 ` Stefan Metzmacher
@ 2026-02-27 16:14   ` Hannes Furmans
  2026-02-27 16:26     ` Stefan Metzmacher
  0 siblings, 1 reply; 5+ messages in thread
From: Hannes Furmans @ 2026-02-27 16:14 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Jens Axboe, io-uring, linux-kernel, stable, Hannes Furmans

Hi Stefan,

On 27.02.26 at 14:59, Stefan Metzmacher wrote:
> That's by design: if a MSG_WAITALL call fails, it means
> not all data the caller expected arrived or was sent.
> When there's a LINK after that, the linked operation likely
> relies on all expected data being processed! Otherwise
> the message stream can get out of sync and cause corruption.

You're right — a short MSG_WAITALL read should sever the IO_LINK
chain. The v1 patch was wrong to guard req_set_fail() on done_io > 0.

> Let's assume I want to send a message header with
> IO_SEND linked with an IO_SPLICE to send the payload.
>
> If IO_SEND returns short the situation needs to be
> recovered by the caller instead of letting the
> IO_SPLICE give more data to the socket.

Agreed, the linked operation expects the complete data.

> So the current behavior is exactly what MSG_WAITALL
> gives you. If you don't want that why are you using it
> at all?

The actual bug is narrower. I traced the root cause with kTLS.

When IORING_OP_RECV is used with MSG_WAITALL on a kTLS socket,
the recv completes successfully (ret >= min_ret, full requested
amount received). But kTLS calls put_cmsg(SOL_TLS,
TLS_GET_RECORD_TYPE) for every first record of a recvmsg call
(tls_sw.c:1843). Since io_recv sets up the msghdr with
msg_control=NULL and msg_controllen=0, put_cmsg sets MSG_CTRUNC.

Then io_recv hits the else-if branch:

    } else if ((flags & MSG_WAITALL) &&
               (msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
        req_set_fail(req);
    }

This sets REQ_F_FAIL on a fully successful recv. The CQE shows
the full byte count, but the linked write gets -ECANCELED.

I confirmed this with ftrace — the recv completes with
result=67108864 (exactly 64MB requested), then
io_uring_fail_link fires immediately after from an io-wq worker.
I also confirmed with a plain recvmsg debug tool that kTLS
returns msg_flags=0x88 (MSG_EOR | MSG_CTRUNC) on every call.

Your commit 0031275d119e says "For IORING_OP_RECVMSG we also
check for the MSG_TRUNC and MSG_CTRUNC flags" but the code
applies the check to IORING_OP_RECV as well. MSG_CTRUNC is
meaningful for IORING_OP_RECVMSG (user provides a cmsg buffer).
It's meaningless for IORING_OP_RECV which never has a cmsg
buffer.

I'll send a v2 that only removes MSG_CTRUNC from the io_recv
check.

Thanks,
Hannes
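
The flag check described above can be modelled in isolation. This is a sketch only, not kernel code: recv_marks_failed() is a hypothetical stand-in for the else-if branch in io_recv that is reached once ret >= min_ret, with the truncation mask passed in so the current and the proposed v2 condition can be compared. The MSG_* constants come from <sys/socket.h> and are assumed to have their Linux values.

```c
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Hypothetical model of the post-success check in io_recv:
 *   flags      - the request's recv flags (may contain MSG_WAITALL)
 *   msg_flags  - flags reported back by the socket layer
 *   trunc_mask - which truncation flags count as a failure
 * Returns true when the request would be marked failed (REQ_F_FAIL),
 * which in turn cancels any linked requests.
 */
bool recv_marks_failed(int flags, int msg_flags, int trunc_mask)
{
	return (flags & MSG_WAITALL) && (msg_flags & trunc_mask);
}
```

With the msg_flags value 0x88 (MSG_EOR | MSG_CTRUNC) reported for kTLS, the mask MSG_TRUNC | MSG_CTRUNC marks the request as failed, while a mask of MSG_TRUNC alone does not; a real MSG_TRUNC still fails under either mask.
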


* Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
  2026-02-27 16:14   ` Hannes Furmans
@ 2026-02-27 16:26     ` Stefan Metzmacher
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Metzmacher @ 2026-02-27 16:26 UTC (permalink / raw)
  To: Hannes Furmans; +Cc: Jens Axboe, io-uring, linux-kernel, stable, Hannes Furmans

On 27.02.26 at 17:14, Hannes Furmans wrote:
> Hi Stefan,
> 
> On 27.02.26 at 14:59, Stefan Metzmacher wrote:
>> That's by design: if a MSG_WAITALL call fails, it means
>> not all data the caller expected arrived or was sent.
>> When there's a LINK after that, the linked operation likely
>> relies on all expected data being processed! Otherwise
>> the message stream can get out of sync and cause corruption.
> 
> You're right — a short MSG_WAITALL read should sever the IO_LINK
> chain. The v1 patch was wrong to guard req_set_fail() on done_io > 0.
> 
>> Let's assume I want to send a message header with
>> IO_SEND linked with an IO_SPLICE to send the payload.
>>
>> If IO_SEND returns short the situation needs to be
>> recovered by the caller instead of letting the
>> IO_SPLICE give more data to the socket.
> 
> Agreed, the linked operation expects the complete data.
> 
>> So the current behavior is exactly what MSG_WAITALL
>> gives you. If you don't want that why are you using it
>> at all?
> 
> The actual bug is narrower. I traced the root cause with kTLS.
> 
> When IORING_OP_RECV is used with MSG_WAITALL on a kTLS socket,
> the recv completes successfully (ret >= min_ret, full requested
> amount received). But kTLS calls put_cmsg(SOL_TLS,
> TLS_GET_RECORD_TYPE) for every first record of a recvmsg call
> (tls_sw.c:1843). Since io_recv sets up the msghdr with
> msg_control=NULL and msg_controllen=0, put_cmsg sets MSG_CTRUNC.
> 
> Then io_recv hits the else-if branch:
> 
>      } else if ((flags & MSG_WAITALL) &&
>                 (msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
>          req_set_fail(req);
>      }
> 
> This sets REQ_F_FAIL on a fully successful recv. The CQE shows
> the full byte count, but the linked write gets -ECANCELED.
> 
> I confirmed this with ftrace — the recv completes with
> result=67108864 (exactly 64MB requested), then
> io_uring_fail_link fires immediately after from an io-wq worker.
> I also confirmed with a plain recvmsg debug tool that kTLS
> returns msg_flags=0x88 (MSG_EOR | MSG_CTRUNC) on every call.
> 
> Your commit 0031275d119e says "For IORING_OP_RECVMSG we also
> check for the MSG_TRUNC and MSG_CTRUNC flags" but the code
> applies the check to IORING_OP_RECV as well. MSG_CTRUNC is
> meaningful for IORING_OP_RECVMSG (user provides a cmsg buffer).
> It's meaningless for IORING_OP_RECV which never has a cmsg
> buffer.
> 
> I'll send a v2 that only removes MSG_CTRUNC from the io_recv
> check.

Sounds good :-)

Thanks!
metze


* [PATCH v2] io_uring/net: don't check MSG_CTRUNC for IORING_OP_RECV
  2026-02-26 22:03 [PATCH] io_uring/net: don't fail linked ops when done_io > 0 Hannes Furmans
  2026-02-27 13:59 ` Stefan Metzmacher
@ 2026-02-27 16:27 ` Hannes Furmans
  1 sibling, 0 replies; 5+ messages in thread
From: Hannes Furmans @ 2026-02-27 16:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Stefan Metzmacher, io-uring, linux-kernel, stable, Hannes Furmans

IORING_OP_RECV sets up the msghdr with msg_control=NULL and
msg_controllen=0, as it has no cmsg support. Any socket layer that
calls put_cmsg() will find no buffer space and set MSG_CTRUNC in
msg_flags. This is expected — the caller didn't ask for control data.

However, io_recv checks:

    if ((flags & MSG_WAITALL) && (msg_flags & (MSG_TRUNC | MSG_CTRUNC)))
        req_set_fail(req);

This sets REQ_F_FAIL on a fully successful recv (ret >= min_ret) when
MSG_CTRUNC is set, which causes io_disarm_next() to cancel all linked
operations with -ECANCELED. The recv CQE shows the full requested byte
count, yet linked operations are cancelled.

This is triggered by kTLS, which calls put_cmsg(SOL_TLS,
TLS_GET_RECORD_TYPE) for every record in tls_record_content_type()
(tls_sw.c), but it affects any protocol that delivers cmsg data on
the kernel side.

The MSG_CTRUNC check was introduced by commit 0031275d119e ("io_uring:
call req_set_fail_links() on short send[msg]()/recv[msg]() with
MSG_WAITALL") whose commit message states "For IORING_OP_RECVMSG we
also check for the MSG_TRUNC and MSG_CTRUNC flags", but the code
applied the check to IORING_OP_RECV as well. MSG_CTRUNC is meaningful
for IORING_OP_RECVMSG where the user provides a cmsg buffer —
truncation there means lost metadata. It is meaningless for
IORING_OP_RECV which never provides a cmsg buffer.

Remove MSG_CTRUNC from the io_recv check. The io_recvmsg check is
left unchanged as MSG_CTRUNC is meaningful there.

Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
---
v2: v1 incorrectly guarded req_set_fail() for all done_io > 0 cases.
    Stefan Metzmacher correctly pointed out that short MSG_WAITALL
    reads should still sever the link chain.

    Root-caused via ftrace + msg_flags inspection on a real kTLS
    connection (TLS 1.3, AES-128-GCM, S3 download):

    ftrace shows io_uring_fail_link firing immediately after
    io_uring_complete with result=67108864 (full 64MB), from io-wq:

      iou-wrk-52242 io_uring_complete: req ..., result 67108864
      iou-wrk-52242 io_uring_fail_link: opcode RECV, link ...

    A debug recvmsg on the same kTLS socket shows:

      recvmsg: ret=67108864 msg_flags=0x88 (MSG_EOR | MSG_CTRUNC)

    MSG_CTRUNC is always set because kTLS calls put_cmsg() but
    IORING_OP_RECV provides no cmsg buffer.

 io_uring/net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 8576c6cb2236..8baaf74e8f8d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1221,7 +1221,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
-	} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
+	} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & MSG_TRUNC)) {
 out_free:
 		req_set_fail(req);
 	}
-- 
2.53.0


