public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 6.17] io_uring/zcrx: check all niovs filled with dma addresses
       [not found] <20251009155752.773732-1-sashal@kernel.org>
@ 2025-10-09 15:54 ` Sasha Levin
  2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.12] io_uring/zctx: check chained notif contexts Sasha Levin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2025-10-09 15:54 UTC (permalink / raw)
  To: patches, stable; +Cc: Pavel Begunkov, Jens Axboe, Sasha Levin, io-uring

From: Pavel Begunkov <asml.silence@gmail.com>

[ Upstream commit d7ae46b454eb05e3df0d46c2ac9c61416a4d9057 ]

Add a warning if io_populate_area_dma() can't fill in all net_iovs, it
should never happen.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

- What it changes
  - Adds a post-loop invariant check in `io_populate_area_dma()` to
    ensure every `net_iov` in the area got a valid DMA address. If not,
    it emits a one-time warning and fails the mapping with `-EFAULT`:
    - New logic: “if not all niovs filled → WARN_ON_ONCE + return
      -EFAULT”
    - Before: the function always returned 0 even if it didn’t populate
      all niovs.
  - This is a small, localized change to `io_uring/zcrx.c` that does not
    alter APIs or structures and only affects the zcrx receive path; a
    simplified sketch of the resulting function follows.
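
  For reference, a simplified sketch of the function after the change
  (reconstructed from the diff at the end of this mail and the file
  references below; the upstream loop also handles a starting offset,
  elided here, so this is illustrative rather than verbatim):

  ```c
  static int io_populate_area_dma(struct io_zcrx_ifq *ifq,
                                  struct io_zcrx_area *area,
                                  struct sg_table *sgt)
  {
          unsigned int niov_idx = 0, i;
          struct scatterlist *sg;

          for_each_sgtable_dma_sg(sgt, sg, i) {
                  dma_addr_t dma = sg_dma_address(sg);
                  unsigned long len = sg_dma_len(sg);

                  /* hand out one page-sized DMA range per net_iov */
                  while (len && niov_idx < area->nia.num_niovs) {
                          if (net_mp_niov_set_dma_addr(
                                          &area->nia.niovs[niov_idx], dma))
                                  return -EFAULT;
                          dma += PAGE_SIZE;
                          len -= PAGE_SIZE;
                          niov_idx++;
                  }
          }

          /* the added invariant: the SG table must cover every niov */
          if (WARN_ON_ONCE(niov_idx != area->nia.num_niovs))
                  return -EFAULT;
          return 0;
  }
  ```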

- Why it matters (bug/risk being fixed)
  - Today, `io_populate_area_dma()` returns success unconditionally
    after walking the SG table, even if fewer DMA addresses were written
    than `area->nia.num_niovs`. See unconditional return in
    `io_uring/zcrx.c:78`.
  - On success, `io_zcrx_map_area()` marks the area as mapped (sets
    `area->is_mapped = true`), which enables the page_pool memory
    provider to start using these entries, assuming per-`net_iov` DMA
    addresses are valid:
    - `io_uring/zcrx.c:277` and `io_uring/zcrx.c:290-293`
    - DMA addresses are later consumed in sync paths (e.g.,
      `io_zcrx_sync_for_device()`), which fetches them via
      `page_pool_get_dma_addr_netmem()`: `io_uring/zcrx.c:304-306`.
  - If some `net_iov`s remained uninitialized (DMA address 0 or stale),
    the NIC could be programmed with an invalid DMA address. That is a
    correctness and potential security issue (device DMA to the wrong
    address). The new check prevents entering that state by failing
    early instead of silently accepting a partially-initialized mapping;
    a sketch of the consumer side follows.
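
  To make the exposure concrete, here is a hedged sketch of the consumer
  side (the upstream `io_zcrx_sync_for_device()` additionally checks
  whether the device needs syncing at all and applies the pool's buffer
  offset; the helper below is illustrative, not the verbatim function):

  ```c
  static void sync_niov_for_device(struct device *dev,
                                   struct net_iov *niov,
                                   enum dma_data_direction dir)
  {
          /* fetches whatever io_populate_area_dma() stored for this
           * niov; if the slot was never filled, dma is zero or stale */
          dma_addr_t dma =
                  page_pool_get_dma_addr_netmem(net_iov_to_netmem(niov));

          dma_sync_single_for_device(dev, dma, PAGE_SIZE, dir);
  }
  ```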

- Context and call sites
  - `io_populate_area_dma()` is used by both umem and dmabuf mappings:
    - dmabuf mapping calls into `io_populate_area_dma()` at
      `io_uring/zcrx.c:159-161`. Upstream already validates the SG total
      size matches the area length (`io_uring/zcrx.c:139-146`), but the
      new invariant check still hardens against any mismatch in the
      subsequent per-page DMA population loop.
    - umem mapping path invokes it after `dma_map_sgtable()`:
      `io_uring/zcrx.c:270-275`.
  - On error, both paths propagate the error, and `io_zcrx_map_area()`
    will not set `area->is_mapped = true` (`io_uring/zcrx.c:290-293`),
    preventing the page_pool consumer from using invalid entries. This
    aligns with other existing error checks in the function (e.g.,
    `net_mp_niov_set_dma_addr()` failure already returns `-EFAULT` at
    `io_uring/zcrx.c:71-72`); the caller's success/failure handling is
    sketched below.
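
  The caller's success/failure handling, sketched under the references
  above (locking is elided and the names of the two per-path helpers are
  assumptions abbreviated from `io_zcrx_map_area()`):

  ```c
  static int io_zcrx_map_area(struct io_zcrx_ifq *ifq,
                              struct io_zcrx_area *area)
  {
          int ret;

          if (area->mem.is_dmabuf)
                  ret = io_zcrx_map_area_dmabuf(ifq, area);
          else
                  ret = io_zcrx_map_area_umem(ifq, area);

          /* is_mapped is set only when every niov got a DMA address,
           * keeping the page_pool provider away from a partially
           * initialized area */
          if (ret == 0)
                  area->is_mapped = true;
          return ret;
  }
  ```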

- Stable backport criteria assessment
  - Fixes a real bug class (silent partial initialization of DMA
    addresses) that can lead to invalid device DMA targets. Even if the
    author says it “should never happen,” enforcing the invariant
    eliminates a dangerous failure mode rather than allowing undefined
    behavior.
  - Change is small and tightly scoped to zcrx; it adds one WARN_ON_ONCE
    and a conditional `-EFAULT` return. No interface, ABI, or
    architectural changes.
  - Callers already handle negative returns and will abort mapping
    cleanly, so regression risk is minimal and clearly bounded to
    failing early when an invariant is violated.
  - No new features; pure correctness/hardening.
  - Touches io_uring zcrx, not core scheduling or memory subsystems;
    impact is contained.

- Additional code references supporting the assessment
  - Unconditional success today: `io_uring/zcrx.c:52-79` (function body,
    unconditional `return 0` at `io_uring/zcrx.c:78`).
  - Caller sets `is_mapped` only on success: `io_uring/zcrx.c:277-293`.
  - DMA address is used later for device sync:
    `io_uring/zcrx.c:304-306`.
  - dmabuf SG size checked already: `io_uring/zcrx.c:139-146`.

Conclusion: This is a low-risk, correctness hardening that prevents a
silent and potentially serious failure mode. It satisfies stable rules
for a minimal, contained bugfix. Backport to stable trees that include
`io_uring/zcrx` and `io_populate_area_dma()` is advisable.

 io_uring/zcrx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e0..39d1ef52a57b1 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -75,6 +75,9 @@ static int io_populate_area_dma(struct io_zcrx_ifq *ifq,
 			niov_idx++;
 		}
 	}
+
+	if (WARN_ON_ONCE(niov_idx != area->nia.num_niovs))
+		return -EFAULT;
 	return 0;
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 6.17-6.12] io_uring/zctx: check chained notif contexts
       [not found] <20251009155752.773732-1-sashal@kernel.org>
  2025-10-09 15:54 ` [PATCH AUTOSEL 6.17] io_uring/zcrx: check all niovs filled with dma addresses Sasha Levin
@ 2025-10-09 15:55 ` Sasha Levin
  2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.16] io_uring/rsrc: respect submitter_task in io_register_clone_buffers() Sasha Levin
  2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.16] io_uring/zcrx: account niov arrays to cgroup Sasha Levin
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2025-10-09 15:55 UTC (permalink / raw)
  To: patches, stable; +Cc: Pavel Begunkov, Jens Axboe, Sasha Levin, io-uring

From: Pavel Begunkov <asml.silence@gmail.com>

[ Upstream commit ab3ea6eac5f45669b091309f592c4ea324003053 ]

Send zc only links ubuf_info for requests coming from the same context.
There are some ambiguous syz reports, so let's check the assumption on
notification completion.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fd527d8638203fe0f1c5ff06ff2e1d8fd68f831b.1755179962.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## BACKPORT ANALYSIS

**Answer: YES**

This commit should be backported to stable kernel trees that contain the
notification stacking feature (v6.10+).

### DETAILED ANALYSIS

#### **Bug Context and History**

1. **Feature Introduction (v6.10, April 2024):** Commit 6fe4220912d19
   introduced notification stacking for io_uring zero-copy sends. This
   feature allows multiple io_uring send requests to link their
   notifications together, optimizing skb usage.

2. **The Original Bug:** In `io_link_skb()` at line 88 of
   io_uring/notif.c:
  ```c
  prev_nd = container_of(prev_uarg, struct io_notif_data, uarg);
  prev_notif = cmd_to_io_kiocb(nd);  // BUG! Should be prev_nd

  /* make sure all noifications can be finished in the same task_work */
  if (unlikely(notif->ctx != prev_notif->ctx ||
               notif->tctx != prev_notif->tctx))
          return -EEXIST;
  ```

  The bug uses `nd` instead of `prev_nd`, causing the validation check
  to compare a notification's context with **itself** instead of with
  the previous notification's context. This renders the safety check
  useless.

3. **Bug Window:** The bug existed from v6.10 (April 2024) until
   September 2025 - approximately **16 months**.

4. **This Commit's Purpose:** Adds a defensive runtime check in
   `io_notif_tw_complete()` to catch cross-context notification chains
   during completion:
  ```c
  struct io_ring_ctx *ctx = notif->ctx;

  lockdep_assert_held(&ctx->uring_lock);

  do {
          notif = cmd_to_io_kiocb(nd);

          if (WARN_ON_ONCE(ctx != notif->ctx))
                  return;  // Abort to prevent corruption
  ```

#### **Security and Stability Implications**

The commit message notes "some ambiguous syz reports", indicating that
syzkaller produced reports plausibly related to this issue.

**What can go wrong when notifications from different contexts get
chained:**

1. **Use-After-Free:** If one io_ring_ctx is destroyed while
   notifications from it are still chained with another context:
   - Line 27-28 accesses `notif->ctx->user` for memory accounting
   - Line 32 calls `io_req_task_complete(notif, tw)` which may access
     freed context

2. **Lock Ordering Violations:** Line 18 adds
   `lockdep_assert_held(&ctx->uring_lock)` assuming all notifications
   use the same lock. Cross-context chains violate this assumption,
   potentially causing deadlocks.

3. **Memory Corruption:** The `__io_unaccount_mem()` call at line 27
   operates on freed memory if `notif->ctx` was destroyed.

4. **Task Context Violations:** All notifications must complete in the
   same task_work (the check at lines 92-93 of io_link_skb), but the
   broken validation allowed this invariant to be violated; a sketch of
   the full completion walk follows.
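
A hedged sketch of the full completion walk with the new guard (the
chain advance via `nd->next` is assumed from the v6.10 stacking code;
the diff at the end of this mail shows only the added lines in
context):

```c
static void io_notif_tw_complete(struct io_kiocb *notif, io_tw_token_t tw)
{
        struct io_notif_data *nd = io_notif_to_data(notif);
        struct io_ring_ctx *ctx = notif->ctx;   /* ctx of the chain head */

        lockdep_assert_held(&ctx->uring_lock);

        do {
                notif = cmd_to_io_kiocb(nd);

                /* every link must belong to the head's ctx; bail out
                 * rather than account/complete against a foreign and
                 * possibly freed ring */
                if (WARN_ON_ONCE(ctx != notif->ctx))
                        return;

                /* read the next link before completing (and freeing)
                 * the current notification */
                nd = nd->next;
                io_req_task_complete(notif, tw);
        } while (nd);
}
```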

#### **Why This Should Be Backported**

1. **Prevents Real Crashes:** Syzkaller reports suggest this can cause
   real issues in production kernels.

2. **Defense in Depth:** Even though the root cause was fixed separately
   (commit 2c139a47eff8d, September 2025), this check provides:
   - Protection against any other code paths that might violate the
     invariant
   - Early detection with WARN_ON_ONCE for debugging
   - Safe failure mode (early return) instead of memory corruption

3. **Minimal Risk:** The change adds only 5 lines:
   - 1 variable declaration
   - 1 lockdep assertion
   - 3 lines for the safety check
   - No functional changes to normal operation
   - The check should never trigger after the io_link_skb fix

4. **Small and Contained:** Affects only `io_notif_tw_complete()` in
   io_uring/notif.c

5. **Stable Tree Rules Compliance:**
   - Fixes important bug (potential use-after-free, memory corruption)
   - Minimal and obvious change
   - Already being backported by maintainers (commit e776dd834cbfa
     observed in tree)

6. **Affected Versions:** Only kernels v6.10+ that have notification
   stacking. Earlier kernels don't have the vulnerable code.

#### **Code Change Analysis**

**Lines added:**
- **Line 17:** `struct io_ring_ctx *ctx = notif->ctx;` - Cache the
  expected context
- **Line 19:** `lockdep_assert_held(&ctx->uring_lock);` - Verify lock is
  held
- **Lines 24-25:** `WARN_ON_ONCE` check and early return if a context
  mismatch is detected

The check is placed inside the `do-while` loop that iterates through
chained notifications, ensuring each notification in the chain belongs
to the same context as the first one.

**Worst case scenario without this fix:**
- Attacker or buggy application creates notifications from multiple
  contexts
- Due to io_link_skb bug, they get chained together
- One context gets destroyed (ring fd closed)
- Completion path accesses freed memory → crash or exploitable condition

### RECOMMENDATION

**BACKPORT: YES** to all stable trees based on v6.10 or later

This is a critical defensive fix that prevents use-after-free and memory
corruption bugs in io_uring zero-copy notification handling. The fix is
minimal, low-risk, and addresses a real issue found by syzkaller that
existed for over a year in production kernels.

 io_uring/notif.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/io_uring/notif.c b/io_uring/notif.c
index ea9c0116cec2d..d8ba1165c9494 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -14,10 +14,15 @@ static const struct ubuf_info_ops io_ubuf_ops;
 static void io_notif_tw_complete(struct io_kiocb *notif, io_tw_token_t tw)
 {
 	struct io_notif_data *nd = io_notif_to_data(notif);
+	struct io_ring_ctx *ctx = notif->ctx;
+
+	lockdep_assert_held(&ctx->uring_lock);
 
 	do {
 		notif = cmd_to_io_kiocb(nd);
 
+		if (WARN_ON_ONCE(ctx != notif->ctx))
+			return;
 		lockdep_assert(refcount_read(&nd->uarg.refcnt) == 0);
 
 		if (unlikely(nd->zc_report) && (nd->zc_copied || !nd->zc_used))
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 6.17-6.16] io_uring/rsrc: respect submitter_task in io_register_clone_buffers()
       [not found] <20251009155752.773732-1-sashal@kernel.org>
  2025-10-09 15:54 ` [PATCH AUTOSEL 6.17] io_uring/zcrx: check all niovs filled with dma addresses Sasha Levin
  2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.12] io_uring/zctx: check chained notif contexts Sasha Levin
@ 2025-10-09 15:55 ` Sasha Levin
  2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.16] io_uring/zcrx: account niov arrays to cgroup Sasha Levin
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2025-10-09 15:55 UTC (permalink / raw)
  To: patches, stable; +Cc: Caleb Sander Mateos, Jens Axboe, Sasha Levin, io-uring

From: Caleb Sander Mateos <csander@purestorage.com>

[ Upstream commit 2f076a453f75de691a081c89bce31b530153d53b ]

io_ring_ctx's enabled with IORING_SETUP_SINGLE_ISSUER are only allowed
a single task submitting to the ctx. Although the documentation only
mentions this restriction applying to io_uring_enter() syscalls,
commit d7cce96c449e ("io_uring: limit registration w/ SINGLE_ISSUER")
extends it to io_uring_register(). Ensuring only one task interacts
with the io_ring_ctx will be important to allow this task to avoid
taking the uring_lock.
There is, however, one gap in these checks: io_register_clone_buffers()
may take the uring_lock on a second (source) io_ring_ctx, but
__io_uring_register() only checks the current thread against the
*destination* io_ring_ctx's submitter_task. Fail the
IORING_REGISTER_CLONE_BUFFERS with -EEXIST if the source io_ring_ctx has
a registered submitter_task other than the current task.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES
**Why Backport**
- `io_uring/register.c:646` already enforces
  `IORING_SETUP_SINGLE_ISSUER` by rejecting registrations issued by any
  task other than `ctx->submitter_task`, but
  `io_register_clone_buffers()` still grabs `src_ctx->uring_lock`
  without checking `src_ctx->submitter_task` (see current flow in
  `io_uring/rsrc.c:1296-1305`). This lets a non-owner thread interact
  with the source ring, breaking the SINGLE_ISSUER contract introduced
  by d7cce96c449e.
- The patch closes that remaining gap by validating `src_ctx->submitter_task
  == current` immediately after `lock_two_rings(ctx, src_ctx)` in
  `io_uring/rsrc.c`, and returning `-EEXIST` when another task tries to
  clone buffers. That keeps all rings with SINGLE_ISSUER consistent with
  the locking and lockdep assumptions documented in
  `io_uring/io_uring.h:136-144`.

**Impact Without Fix**
- A second task can still take `src_ctx->uring_lock` through cloning,
  undermining the guarantee that only the designated submitter ever
  touches that ring. Upcoming optimizations that skip `uring_lock` for
  the submitter rely on this guarantee; leaving the hole risks future
  functional regressions or lockdep splats once those changes land.
- Even today, the gap lets another thread stall a SINGLE_ISSUER ring by
  holding its lock via `IORING_REGISTER_CLONE_BUFFERS`, which contradicts
  what users expect after enabling SINGLE_ISSUER (illustrated below).
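
A minimal userspace illustration of the gap (hypothetical test program;
assumes liburing's `io_uring_clone_buffers()` wrapper for
`IORING_REGISTER_CLONE_BUFFERS`):

```c
#include <liburing.h>
#include <pthread.h>
#include <stdio.h>

static struct io_uring src;     /* owned by the main thread */

static void *foreign_thread(void *arg)
{
        struct io_uring dst;
        int ret;

        io_uring_queue_init(8, &dst, 0);
        /*
         * src was created with IORING_SETUP_SINGLE_ISSUER by another
         * task. With the fix, the clone fails with -EEXIST before
         * doing any work on src; without it, this thread could take
         * the source ctx's uring_lock in the kernel.
         */
        ret = io_uring_clone_buffers(&dst, &src);
        printf("clone from foreign SINGLE_ISSUER ring: %d\n", ret);
        io_uring_queue_exit(&dst);
        return NULL;
}

int main(void)
{
        pthread_t t;

        io_uring_queue_init(8, &src, IORING_SETUP_SINGLE_ISSUER);
        pthread_create(&t, NULL, foreign_thread, NULL);
        pthread_join(&t, NULL);
        io_uring_queue_exit(&src);
        return 0;
}
```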

**Risk & Scope**
- Change is tiny and self-contained (one extra guard plus an early exit)
  with no data structure churn or ABI impact. Rings that are not flagged
  SINGLE_ISSUER have `submitter_task == NULL`, so behaviour is
  unchanged; legitimate same-thread clones still succeed.

**Backport Notes**
- Needs to go only into trees that already contain the clone-buffer
  support (`7cc2a6eadcd7` / `636119af94f2f`) and the SINGLE_ISSUER
  registration gating (`d7cce96c449e`). No further prerequisites
  identified.

 io_uring/rsrc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f75f5e43fa4aa..e1e5f0fb0f56d 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1299,10 +1299,17 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
 	if (src_ctx != ctx) {
 		mutex_unlock(&ctx->uring_lock);
 		lock_two_rings(ctx, src_ctx);
+
+		if (src_ctx->submitter_task &&
+		    src_ctx->submitter_task != current) {
+			ret = -EEXIST;
+			goto out;
+		}
 	}
 
 	ret = io_clone_buffers(ctx, src_ctx, &buf);
 
+out:
 	if (src_ctx != ctx)
 		mutex_unlock(&src_ctx->uring_lock);
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 6.17-6.16] io_uring/zcrx: account niov arrays to cgroup
       [not found] <20251009155752.773732-1-sashal@kernel.org>
                   ` (2 preceding siblings ...)
  2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.16] io_uring/rsrc: respect submitter_task in io_register_clone_buffers() Sasha Levin
@ 2025-10-09 15:55 ` Sasha Levin
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2025-10-09 15:55 UTC (permalink / raw)
  To: patches, stable; +Cc: Pavel Begunkov, Jens Axboe, Sasha Levin, io-uring

From: Pavel Begunkov <asml.silence@gmail.com>

[ Upstream commit 31bf77dcc3810e08bcc7d15470e92cdfffb7f7f1 ]

net_iov / freelist / etc. arrays can be quite long, make sure they're
accounted.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

- What it does: Switches three kvmalloc_array() allocations used by
  io_uring zcrx areas from GFP_KERNEL to GFP_KERNEL_ACCOUNT so their
  memory is charged to the creating task’s memcg. Specifically:
  - `area->nia.niovs` allocation: io_uring/zcrx.c:425
  - `area->freelist` allocation: io_uring/zcrx.c:430
  - `area->user_refs` allocation: io_uring/zcrx.c:435

Why it matters
- Fixes unaccounted kernel memory: These arrays can be very large (one
  entry per page of the registered area). Without GFP_KERNEL_ACCOUNT, a
  cgroup can allocate significant kernel memory that is not charged to
  its memcg, breaking containment and potentially causing host memory
  pressure. The commit explicitly addresses this: “arrays can be quite
  long, make sure they're accounted.”
- Brings consistency with existing accounting in the same path: The user
  memory backing the area is already accounted to memcg via
  `sg_alloc_table_from_pages(..., GFP_KERNEL_ACCOUNT)`
  (io_uring/zcrx.c:196) and to the io_uring context via
  `io_account_mem()` (io_uring/zcrx.c:205). Accounting these control
  arrays aligns with that design and closes a loophole where only the
   big page backing was charged but the (potentially multi-MiB) array
   metadata was not; rough numbers follow this list.
- Scope is tiny and contained: The change is three flag substitutions
  within `io_zcrx_create_area()` and has no API/ABI or behavioral
  changes beyond proper memcg charging. No architectural changes; hot
  paths are unaffected (this is registration-time allocation).
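
Back-of-the-envelope scale (illustrative; exact entry sizes depend on
the struct layouts): a 1 GiB area with 4 KiB pages yields 262,144
niovs. At roughly 32-64 bytes per `struct net_iov` that is ~8-16 MiB
for `nia.niovs` alone, plus about 1 MiB each for the freelist and
`user_refs` arrays if their entries are 4 bytes, none of which was
previously charged to the creating cgroup.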

Risk assessment
- Low regression risk: Uses a long-standing flag (`GFP_KERNEL_ACCOUNT`)
  already used in this file for the data path (io_uring/zcrx.c:196). The
  only behavioral change is that allocations will now fail earlier with
   `-ENOMEM` if a cgroup’s limits would be exceeded, which is the desired
   and correct behavior for an accounting fix.
- No ordering dependencies: The patch doesn’t rely on recent refactors;
  the affected allocations exist in v6.15–v6.17 and are currently done
  with `GFP_KERNEL`. The change applies cleanly to those stable series
  where `io_uring/zcrx.c` is present.

Stable tree fit
- Fixes a real bug affecting users: memcg under-accounting in a new but
  shipped subsystem (zcrx is present since v6.15).
- Minimal, localized, and low risk: Three flag changes in one function.
- No feature additions or architectural changes: Pure accounting fix.
- Consistent with stable policy: Similar accounting fixes are regularly
  accepted; related earlier work in this area explicitly targeted stable
  (e.g., “io_uring/zcrx: account area memory” carries a `Cc:
  stable@vger.kernel.org`, complementing this change).

Conclusion
- Backporting will prevent unaccounted kernel memory growth from zcrx
  area metadata, aligning with memcg expectations and improving
  containment with negligible risk.

 io_uring/zcrx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 39d1ef52a57b1..5928544cd1687 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -426,17 +426,17 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
 
 	ret = -ENOMEM;
 	area->nia.niovs = kvmalloc_array(nr_iovs, sizeof(area->nia.niovs[0]),
-					 GFP_KERNEL | __GFP_ZERO);
+					 GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!area->nia.niovs)
 		goto err;
 
 	area->freelist = kvmalloc_array(nr_iovs, sizeof(area->freelist[0]),
-					GFP_KERNEL | __GFP_ZERO);
+					GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!area->freelist)
 		goto err;
 
 	area->user_refs = kvmalloc_array(nr_iovs, sizeof(area->user_refs[0]),
-					GFP_KERNEL | __GFP_ZERO);
+					GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!area->user_refs)
 		goto err;
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-10-09 16:00 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20251009155752.773732-1-sashal@kernel.org>
2025-10-09 15:54 ` [PATCH AUTOSEL 6.17] io_uring/zcrx: check all niovs filled with dma addresses Sasha Levin
2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.12] io_uring/zctx: check chained notif contexts Sasha Levin
2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.16] io_uring/rsrc: respect submitter_task in io_register_clone_buffers() Sasha Levin
2025-10-09 15:55 ` [PATCH AUTOSEL 6.17-6.16] io_uring/zcrx: account niov arrays to cgroup Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox