* [PATCH AUTOSEL 6.17] io_uring/zcrx: check all niovs filled with dma addresses
From: Sasha Levin @ 2025-10-09 15:54 UTC
To: patches, stable; +Cc: Pavel Begunkov, Jens Axboe, Sasha Levin, io-uring
From: Pavel Begunkov <asml.silence@gmail.com>
[ Upstream commit d7ae46b454eb05e3df0d46c2ac9c61416a4d9057 ]
Add a warning if io_populate_area_dma() can't fill in all net_iovs, it
should never happen.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it changes
  - Adds a post-loop invariant check in `io_populate_area_dma()` to
    ensure every `net_iov` in the area received a valid DMA address. If
    not, it emits a one-time warning and fails the mapping with
    `-EFAULT`:
    - New logic: if not every niov was filled, `WARN_ON_ONCE()` fires
      and the function returns `-EFAULT`.
    - Before: the function always returned 0, even if it didn't
      populate all niovs.
  - This is a small, localized change to `io_uring/zcrx.c` that does not
    alter APIs or structures and only affects the zcrx receive path. A
    reconstructed sketch of the patched function follows below.
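For orientation, here is a reconstructed sketch of the patched function,
simplified from `io_uring/zcrx.c` (offset handling elided; details vary
between trees, so treat this as illustrative rather than the literal
source):

```c
static int io_populate_area_dma(struct io_zcrx_ifq *ifq,
				struct io_zcrx_area *area,
				struct sg_table *sgt, unsigned long off)
{
	struct scatterlist *sg;
	unsigned int niov_idx = 0, i;

	for_each_sgtable_dma_sg(sgt, sg, i) {
		dma_addr_t dma = sg_dma_address(sg);
		unsigned long sg_len = sg_dma_len(sg);

		/* hand out one page-sized DMA address per net_iov */
		while (sg_len && niov_idx < area->nia.num_niovs) {
			struct net_iov *niov = &area->nia.niovs[niov_idx];

			if (net_mp_niov_set_dma_addr(niov, dma))
				return -EFAULT;
			sg_len -= PAGE_SIZE;
			dma += PAGE_SIZE;
			niov_idx++;
		}
	}

	/* the new invariant: the SG table must have covered every niov */
	if (WARN_ON_ONCE(niov_idx != area->nia.num_niovs))
		return -EFAULT;
	return 0;
}
```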
- Why it matters (bug/risk being fixed)
  - Today, `io_populate_area_dma()` returns success unconditionally
    after walking the SG table, even if fewer DMA addresses were written
    than `area->nia.num_niovs`. See the unconditional return at
    `io_uring/zcrx.c:78`.
  - On success, `io_zcrx_map_area()` marks the area as mapped (sets
    `area->is_mapped = true`), which lets the page_pool memory provider
    start using these entries on the assumption that every per-`net_iov`
    DMA address is valid: `io_uring/zcrx.c:277` and
    `io_uring/zcrx.c:290-293`.
  - DMA addresses are later consumed in sync paths (e.g.,
    `io_zcrx_sync_for_device()`, sketched below), which fetch them via
    `page_pool_get_dma_addr_netmem()`: `io_uring/zcrx.c:304-306`.
  - If some `net_iov`s remained uninitialized (DMA address 0 or stale),
    the NIC could be programmed with an invalid DMA address. That is a
    correctness and potential security issue (device DMA to the wrong
    address). The new check prevents entering that state by failing
    early instead of silently accepting a partially initialized mapping.
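The consumption path referenced above looks roughly like this (a
paraphrase of `io_zcrx_sync_for_device()` with the
CONFIG_HAS_DMA/CONFIG_DMA_NEED_SYNC guards elided; field names follow
the page_pool API):

```c
static void io_zcrx_sync_for_device(const struct page_pool *pool,
				    struct net_iov *niov)
{
	dma_addr_t dma_addr;

	if (!dma_dev_need_sync(pool->p.dev))
		return;

	/* a zero or stale address stored here is handed to the device */
	dma_addr = page_pool_get_dma_addr_netmem(net_iov_to_netmem(niov));
	__dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset,
				     PAGE_SIZE, pool->p.dma_dir);
}
```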
- Context and call sites
  - `io_populate_area_dma()` is used by both umem and dmabuf mappings:
    - The dmabuf mapping path calls into `io_populate_area_dma()` at
      `io_uring/zcrx.c:159-161`. Upstream already validates that the SG
      total size matches the area length (`io_uring/zcrx.c:139-146`),
      but the new invariant check still hardens against any mismatch in
      the subsequent per-page DMA population loop.
    - The umem mapping path invokes it after `dma_map_sgtable()`:
      `io_uring/zcrx.c:270-275`.
  - On error, both paths propagate the error, and `io_zcrx_map_area()`
    will not set `area->is_mapped = true` (`io_uring/zcrx.c:290-293`),
    preventing the page_pool consumer from using invalid entries (see
    the sketch below). This aligns with the function's existing error
    checks (e.g., a `net_mp_niov_set_dma_addr()` failure already returns
    `-EFAULT` at `io_uring/zcrx.c:71-72`).
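Putting the call sites together, the propagation looks roughly like this
(a simplified sketch of `io_zcrx_map_area()`; locking and
version-specific details are elided):

```c
static int io_zcrx_map_area(struct io_zcrx_ifq *ifq, struct io_zcrx_area *area)
{
	int ret;

	if (area->mem.is_dmabuf)
		ret = io_zcrx_map_area_dmabuf(ifq, area); /* -> io_populate_area_dma() */
	else
		ret = io_zcrx_map_area_umem(ifq, area);   /* -> io_populate_area_dma() */

	/*
	 * With the fix, a partially populated area can no longer reach this
	 * point with ret == 0, so is_mapped is never set over invalid entries.
	 */
	if (ret == 0)
		area->is_mapped = true;
	return ret;
}
```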
- Stable backport criteria assessment
  - Fixes a real bug class (silent partial initialization of DMA
    addresses) that can lead to invalid device DMA targets. Even though
    the author says it "should never happen," enforcing the invariant
    eliminates a dangerous failure mode rather than allowing undefined
    behavior.
  - The change is small and tightly scoped to zcrx; it adds one
    `WARN_ON_ONCE()` and a conditional `-EFAULT` return. No interface,
    ABI, or architectural changes.
  - Callers already handle negative returns and abort the mapping
    cleanly, so regression risk is minimal and clearly bounded to
    failing early when the invariant is violated.
  - No new features; pure correctness/hardening.
  - Touches io_uring zcrx, not core scheduling or memory subsystems;
    the impact is contained.
- Additional code references supporting the assessment
  - Unconditional success today: `io_uring/zcrx.c:52-79` (function body,
    unconditional `return 0` at `io_uring/zcrx.c:78`).
  - Caller sets `is_mapped` only on success: `io_uring/zcrx.c:277-293`.
  - DMA address is used later for device sync:
    `io_uring/zcrx.c:304-306`.
  - dmabuf SG size is checked already: `io_uring/zcrx.c:139-146`.
Conclusion: This is low-risk correctness hardening that prevents a
silent and potentially serious failure mode. It satisfies the stable
rules for a minimal, contained bugfix. Backporting to stable trees that
include `io_uring/zcrx` and `io_populate_area_dma()` is advisable.
io_uring/zcrx.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e0..39d1ef52a57b1 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -75,6 +75,9 @@ static int io_populate_area_dma(struct io_zcrx_ifq *ifq,
 			niov_idx++;
 		}
 	}
+
+	if (WARN_ON_ONCE(niov_idx != area->nia.num_niovs))
+		return -EFAULT;
 	return 0;
 }
--
2.51.0
* [PATCH AUTOSEL 6.17-6.12] io_uring/zctx: check chained notif contexts
From: Sasha Levin @ 2025-10-09 15:55 UTC
To: patches, stable; +Cc: Pavel Begunkov, Jens Axboe, Sasha Levin, io-uring
From: Pavel Begunkov <asml.silence@gmail.com>
[ Upstream commit ab3ea6eac5f45669b091309f592c4ea324003053 ]
Send zc only links ubuf_info for requests coming from the same context.
There are some ambiguous syz reports, so let's check the assumption on
notification completion.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fd527d8638203fe0f1c5ff06ff2e1d8fd68f831b.1755179962.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## BACKPORT ANALYSIS
**Answer: YES**
This commit should be backported to stable kernel trees that contain the
notification stacking feature (v6.10+).
### DETAILED ANALYSIS
#### **Bug Context and History**
1. **Feature Introduction (v6.10, April 2024):** Commit 6fe4220912d19
introduced notification stacking for io_uring zero-copy sends. This
feature allows multiple io_uring send requests to link their
notifications together, optimizing skb usage.
2. **The Original Bug:** In `io_link_skb()` at line 88 of
io_uring/notif.c:
```c
prev_nd = container_of(prev_uarg, struct io_notif_data, uarg);
prev_notif = cmd_to_io_kiocb(nd); // BUG! Should be prev_nd

/* make sure all noifications can be finished in the same task_work */
if (unlikely(notif->ctx != prev_notif->ctx ||
	     notif->tctx != prev_notif->tctx))
	return -EEXIST;
```
   The bug uses `nd` instead of `prev_nd`, causing the validation check
   to compare a notification's context with **itself** instead of with
   the previous notification's context. This renders the safety check
   useless (see the sketch of the corrected lines after this list).
3. **Bug Window:** The bug existed from v6.10 (April 2024) until
September 2025 - approximately **16 months**.
4. **This Commit's Purpose:** Adds a defensive runtime check in
`io_notif_tw_complete()` to catch cross-context notification chains
during completion:
```c
struct io_ring_ctx *ctx = notif->ctx;

lockdep_assert_held(&ctx->uring_lock);

do {
	notif = cmd_to_io_kiocb(nd);
	if (WARN_ON_ONCE(ctx != notif->ctx))
		return; // Abort to prevent corruption
```
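For reference, the separate root-cause fix (commit 2c139a47eff8d,
discussed under "Defense in Depth" below) amounts to using the previous
notification's data in the `io_link_skb()` lines quoted in item 2; a
sketch of the corrected code:

```c
prev_nd = container_of(prev_uarg, struct io_notif_data, uarg);
prev_notif = cmd_to_io_kiocb(prev_nd); /* fixed: was cmd_to_io_kiocb(nd) */
```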
#### **Security and Stability Implications**
The commit message states: "There are some ambiguous syz reports" -
indicating syzkaller found crashes related to this issue.
**What can go wrong when notifications from different contexts get
chained:**
1. **Use-After-Free:** If one io_ring_ctx is destroyed while
notifications from it are still chained with another context:
- Line 27-28 accesses `notif->ctx->user` for memory accounting
- Line 32 calls `io_req_task_complete(notif, tw)` which may access
freed context
2. **Lock Ordering Violations:** Line 18 adds
`lockdep_assert_held(&ctx->uring_lock)` assuming all notifications
use the same lock. Cross-context chains violate this assumption,
potentially causing deadlocks.
3. **Memory Corruption:** The `__io_unaccount_mem()` call at line 27
operates on freed memory if `notif->ctx` was destroyed.
4. **Task Context Violations:** All notifications must complete in the
same task_work (line 92-93 check in io_link_skb), but the broken
validation allowed violations.
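To make the hazard concrete, here is a paraphrased sketch of the
completion loop (approximating io_uring/notif.c; starred lines
dereference `notif->ctx` and would touch freed memory if a notification
from a torn-down ring were chained in):

```c
do {
	notif = cmd_to_io_kiocb(nd);

	if (unlikely(nd->zc_report) && (nd->zc_copied || !nd->zc_used))
		notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;

	if (nd->account_pages && notif->ctx->user) {		/* (*) accounting */
		__io_unaccount_mem(notif->ctx->user, nd->account_pages);
		nd->account_pages = 0;
	}

	nd = nd->next;
	io_req_task_complete(notif, tw);			/* (*) completion */
} while (nd);
```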
#### **Why This Should Be Backported**
1. **Prevents Real Crashes:** Ambiguous syzkaller reports suggest this
   causes real issues in production kernels.
2. **Defense in Depth:** Even though the root cause was fixed separately
   (commit 2c139a47eff8d, September 2025), this check provides:
   - Protection against any other code paths that might violate the
     invariant
   - Early detection with WARN_ON_ONCE for debugging
   - A safe failure mode (early return) instead of memory corruption
3. **Minimal Risk:** The change adds only 5 lines:
   - 1 variable declaration
   - 1 lockdep assertion
   - 3 lines for the safety check
   - No functional changes to normal operation
   - The check should never trigger after the io_link_skb fix
4. **Small and Contained:** Affects only `io_notif_tw_complete()` in
   io_uring/notif.c.
5. **Stable Tree Rules Compliance:**
   - Fixes an important bug (potential use-after-free, memory
     corruption)
   - Minimal and obvious change
   - Already being backported by maintainers (commit e776dd834cbfa
     observed in tree)
6. **Affected Versions:** Only kernels v6.10+ that have notification
   stacking. Earlier kernels don't have the vulnerable code.
#### **Code Change Analysis**
**Lines added:**
- **Line 17:** `struct io_ring_ctx *ctx = notif->ctx;` - caches the
  expected context
- **Line 19:** `lockdep_assert_held(&ctx->uring_lock);` - verifies the
  lock is held
- **Lines 24-25:** the WARN_ON_ONCE check and early return if a context
  mismatch is detected
The check is placed inside the `do-while` loop that iterates through
chained notifications, ensuring each notification in the chain belongs
to the same context as the first one.
**Worst case scenario without this fix:**
- Attacker or buggy application creates notifications from multiple
contexts
- Due to io_link_skb bug, they get chained together
- One context gets destroyed (ring fd closed)
- Completion path accesses freed memory → crash or exploitable condition
### RECOMMENDATION
**BACKPORT: YES** to all stable trees containing v6.10+
This is a critical defensive fix that prevents use-after-free and memory
corruption bugs in io_uring zero-copy notification handling. The fix is
minimal, low-risk, and addresses a real issue found by syzkaller that
existed for over a year in production kernels.
io_uring/notif.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/io_uring/notif.c b/io_uring/notif.c
index ea9c0116cec2d..d8ba1165c9494 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -14,10 +14,15 @@ static const struct ubuf_info_ops io_ubuf_ops;
 static void io_notif_tw_complete(struct io_kiocb *notif, io_tw_token_t tw)
 {
 	struct io_notif_data *nd = io_notif_to_data(notif);
+	struct io_ring_ctx *ctx = notif->ctx;
+
+	lockdep_assert_held(&ctx->uring_lock);
 
 	do {
 		notif = cmd_to_io_kiocb(nd);
 
+		if (WARN_ON_ONCE(ctx != notif->ctx))
+			return;
 		lockdep_assert(refcount_read(&nd->uarg.refcnt) == 0);
 
 		if (unlikely(nd->zc_report) && (nd->zc_copied || !nd->zc_used))
--
2.51.0
* [PATCH AUTOSEL 6.17-6.16] io_uring/rsrc: respect submitter_task in io_register_clone_buffers()
From: Sasha Levin @ 2025-10-09 15:55 UTC
To: patches, stable; +Cc: Caleb Sander Mateos, Jens Axboe, Sasha Levin, io-uring
From: Caleb Sander Mateos <csander@purestorage.com>
[ Upstream commit 2f076a453f75de691a081c89bce31b530153d53b ]
io_ring_ctx's enabled with IORING_SETUP_SINGLE_ISSUER are only allowed
a single task submitting to the ctx. Although the documentation only
mentions this restriction applying to io_uring_enter() syscalls,
commit d7cce96c449e ("io_uring: limit registration w/ SINGLE_ISSUER")
extends it to io_uring_register(). Ensuring only one task interacts
with the io_ring_ctx will be important to allow this task to avoid
taking the uring_lock.
There is, however, one gap in these checks: io_register_clone_buffers()
may take the uring_lock on a second (source) io_ring_ctx, but
__io_uring_register() only checks the current thread against the
*destination* io_ring_ctx's submitter_task. Fail the
IORING_REGISTER_CLONE_BUFFERS with -EEXIST if the source io_ring_ctx has
a registered submitter_task other than the current task.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
**Why Backport**
- `io_uring/register.c:646` already enforces
  `IORING_SETUP_SINGLE_ISSUER` by rejecting registrations issued by any
  task other than `ctx->submitter_task`, but
  `io_register_clone_buffers()` still grabs `src_ctx->uring_lock`
  without checking `src_ctx->submitter_task` (see the current flow in
  `io_uring/rsrc.c:1296-1305`). This lets a non-owner thread interact
  with the source ring, breaking the SINGLE_ISSUER contract introduced
  by d7cce96c449e.
- The patch closes that remaining gap by validating
  `src_ctx->submitter_task == current` immediately after
  `lock_two_rings(ctx, src_ctx)` in `io_uring/rsrc.c`, returning
  `-EEXIST` when another task tries to clone buffers (illustrated
  below). That keeps all SINGLE_ISSUER rings consistent with the
  locking and lockdep assumptions documented in
  `io_uring/io_uring.h:136-144`.
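A minimal user-space illustration of the new behavior, assuming
liburing's `io_uring_clone_buffers()` wrapper (a sketch under those
assumptions, not a regression test; error handling abbreviated):

```c
#include <liburing.h>
#include <pthread.h>
#include <stdio.h>

static struct io_uring src_ring, dst_ring;

static void *clone_from_second_thread(void *arg)
{
	/*
	 * src_ring was set up with IORING_SETUP_SINGLE_ISSUER by the main
	 * thread. With the fix, cloning from a different task now fails
	 * with -EEXIST; previously it silently took src_ring's uring_lock.
	 */
	int ret = io_uring_clone_buffers(&dst_ring, &src_ring);

	printf("clone from second thread: %d (expect -EEXIST)\n", ret);
	return NULL;
}

int main(void)
{
	pthread_t t;

	io_uring_queue_init(8, &src_ring, IORING_SETUP_SINGLE_ISSUER);
	io_uring_queue_init(8, &dst_ring, 0);

	pthread_create(&t, NULL, clone_from_second_thread, NULL);
	pthread_join(t, NULL);

	io_uring_queue_exit(&src_ring);
	io_uring_queue_exit(&dst_ring);
	return 0;
}
```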
**Impact Without Fix**
- A second task can still take `src_ctx->uring_lock` through cloning,
undermining the guarantee that only the designated submitter ever
touches that ring. Upcoming optimizations that skip `uring_lock` for
the submitter rely on this guarantee; leaving the hole risks future
functional regressions or lockdep splats once those changes land.
- Even today, the gap lets another thread stall a SINGLE_ISSUER ring by
holding its lock via `IORING_REGISTER_CLONE_BUFFERS`, which
contradicts users’ expectations after enabling SINGLE_ISSUER.
**Risk & Scope**
- Change is tiny and self-contained (one extra guard plus an early exit)
with no data structure churn or ABI impact. Rings that are not flagged
SINGLE_ISSUER have `submitter_task == NULL`, so behaviour is
unchanged; legitimate same-thread clones still succeed.
**Backport Notes**
- Needs to go only into trees that already contain the clone-buffer
support (`7cc2a6eadcd7` / `636119af94f2f`) and the SINGLE_ISSUER
registration gating (`d7cce96c449e`). No further prerequisites
identified.
io_uring/rsrc.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f75f5e43fa4aa..e1e5f0fb0f56d 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1299,10 +1299,17 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
 	if (src_ctx != ctx) {
 		mutex_unlock(&ctx->uring_lock);
 		lock_two_rings(ctx, src_ctx);
+
+		if (src_ctx->submitter_task &&
+		    src_ctx->submitter_task != current) {
+			ret = -EEXIST;
+			goto out;
+		}
 	}
 
 	ret = io_clone_buffers(ctx, src_ctx, &buf);
 
+out:
 	if (src_ctx != ctx)
 		mutex_unlock(&src_ctx->uring_lock);
--
2.51.0
* [PATCH AUTOSEL 6.17-6.16] io_uring/zcrx: account niov arrays to cgroup
From: Sasha Levin @ 2025-10-09 15:55 UTC
To: patches, stable; +Cc: Pavel Begunkov, Jens Axboe, Sasha Levin, io-uring
From: Pavel Begunkov <asml.silence@gmail.com>
[ Upstream commit 31bf77dcc3810e08bcc7d15470e92cdfffb7f7f1 ]
net_iov / freelist / etc. arrays can be quite long, make sure they're
accounted.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it does: Switches three kvmalloc_array() allocations used by
  io_uring zcrx areas from GFP_KERNEL to GFP_KERNEL_ACCOUNT so their
  memory is charged to the creating task's memcg. Specifically:
  - `area->nia.niovs` allocation: io_uring/zcrx.c:425
  - `area->freelist` allocation: io_uring/zcrx.c:430
  - `area->user_refs` allocation: io_uring/zcrx.c:435
Why it matters
- Fixes unaccounted kernel memory: These arrays can be very large (one
  entry per page of the registered area), so without GFP_KERNEL_ACCOUNT
  a cgroup can allocate significant kernel memory that is never charged
  to its memcg, breaking containment and potentially causing host
  memory pressure (see the back-of-envelope numbers below). The commit
  addresses this explicitly: "arrays can be quite long, make sure
  they're accounted."
- Brings consistency with existing accounting in the same path: The user
  memory backing the area is already accounted to the memcg via
  `sg_alloc_table_from_pages(..., GFP_KERNEL_ACCOUNT)`
  (io_uring/zcrx.c:196) and to the io_uring context via
  `io_account_mem()` (io_uring/zcrx.c:205). Accounting these control
  arrays aligns with that design and closes a loophole where only the
  large page backing was charged while the (potentially multi-MiB)
  array metadata was not.
- Scope is tiny and contained: The change is three flag substitutions
  within `io_zcrx_create_area()` and has no API/ABI or behavioral
  changes beyond proper memcg charging. No architectural changes; hot
  paths are unaffected (this is a registration-time allocation).
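Rough numbers for scale, assuming 4 KiB pages and an assumed ~48-byte
`struct net_iov` (the real sizes depend on kernel version and config):

```c
/*
 * Back-of-envelope for a single 1 GiB zcrx area:
 *
 *   nr_iovs   = 1 GiB / 4 KiB   = 262,144 entries
 *   niovs     = 262,144 * ~48 B ~= 12 MiB  (struct net_iov)
 *   freelist  = 262,144 *   4 B  =  1 MiB  (u32 index per entry)
 *   user_refs = 262,144 *   4 B  =  1 MiB  (atomic_t per entry)
 *
 * That is ~14 MiB of kernel memory per registration that, before this
 * patch, was never charged to the registering task's memcg.
 */
```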
Risk assessment
- Low regression risk: Uses a long-standing flag (`GFP_KERNEL_ACCOUNT`)
  already used in this file for the data path (io_uring/zcrx.c:196). The
  only behavioral change is that allocations will now fail earlier with
  `-ENOMEM` if a cgroup's limits would be exceeded, which is the desired
  and correct behavior for an accounting fix.
- No ordering dependencies: The patch doesn't rely on recent refactors;
  the affected allocations exist in v6.15-v6.17 and are currently done
  with `GFP_KERNEL`. The change applies cleanly to those stable series
  where `io_uring/zcrx.c` is present.
Stable tree fit
- Fixes a real bug affecting users: memcg under-accounting in a new but
shipped subsystem (zcrx is present since v6.15).
- Minimal, localized, and low risk: Three flag changes in one function.
- No feature additions or architectural changes: Pure accounting fix.
- Consistent with stable policy: Similar accounting fixes are regularly
accepted; related earlier work in this area explicitly targeted stable
(e.g., “io_uring/zcrx: account area memory” carries a `Cc:
stable@vger.kernel.org`, complementing this change).
Conclusion
- Backporting will prevent unaccounted kernel memory growth from zcrx
area metadata, aligning with memcg expectations and improving
containment with negligible risk.
io_uring/zcrx.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 39d1ef52a57b1..5928544cd1687 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -426,17 +426,17 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
 
 	ret = -ENOMEM;
 	area->nia.niovs = kvmalloc_array(nr_iovs, sizeof(area->nia.niovs[0]),
-					 GFP_KERNEL | __GFP_ZERO);
+					 GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!area->nia.niovs)
 		goto err;
 
 	area->freelist = kvmalloc_array(nr_iovs, sizeof(area->freelist[0]),
-					GFP_KERNEL | __GFP_ZERO);
+					GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!area->freelist)
 		goto err;
 
 	area->user_refs = kvmalloc_array(nr_iovs, sizeof(area->user_refs[0]),
-					 GFP_KERNEL | __GFP_ZERO);
+					 GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!area->user_refs)
 		goto err;
--
2.51.0