public inbox for io-uring@vger.kernel.org
* [PATCH] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass via compound page accounting
@ 2025-12-18  2:59 Yuhao Jiang
  2026-01-09  3:02 ` Yuhao Jiang
  0 siblings, 1 reply; 5+ messages in thread
From: Yuhao Jiang @ 2025-12-18  2:59 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, linux-kernel, Yuhao Jiang, stable

When multiple registered buffers share the same compound page, only the
first buffer accounts for the memory via io_buffer_account_pin(). The
subsequent buffers skip accounting since headpage_already_acct() returns
true.

When the first buffer is unregistered, the accounting is decremented,
but the compound page remains pinned by the remaining buffers. This
creates a state where pinned memory is not properly accounted against
RLIMIT_MEMLOCK.
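
The imbalance described above can be reproduced in a small user-space
model. This is only a sketch: every name in it (model_ctx, model_register,
model_unregister_old, and so on) is hypothetical and not kernel code, each
buffer is assumed to pin exactly one compound page, and "accounted" stands
in for the counter that is checked against RLIMIT_MEMLOCK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define MODEL_MAX_BUFS 8

struct model_buf {
	bool registered;
	int hpage_id;             /* which compound page this buffer pins */
	unsigned long acct_pages; /* pages this buffer charged at registration */
};

struct model_ctx {
	struct model_buf bufs[MODEL_MAX_BUFS];
	unsigned long accounted;  /* stand-in for the pinned-page counter */
};

/* Same idea as headpage_already_acct(): is the page charged by someone? */
static bool model_hpage_registered(const struct model_ctx *ctx, int hpage_id)
{
	for (size_t i = 0; i < MODEL_MAX_BUFS; i++)
		if (ctx->bufs[i].registered && ctx->bufs[i].hpage_id == hpage_id)
			return true;
	return false;
}

/* Only the first buffer on a compound page charges for it. */
static void model_register(struct model_ctx *ctx, size_t slot, int hpage_id,
			   unsigned long hpage_pages)
{
	assert(slot < MODEL_MAX_BUFS);
	assert(!ctx->bufs[slot].registered);

	ctx->bufs[slot].acct_pages =
		model_hpage_registered(ctx, hpage_id) ? 0 : hpage_pages;
	ctx->bufs[slot].hpage_id = hpage_id;
	ctx->bufs[slot].registered = true;
	ctx->accounted += ctx->bufs[slot].acct_pages;
}

/* Pre-patch behaviour: subtract whatever this buffer charged, no more checks. */
static void model_unregister_old(struct model_ctx *ctx, size_t slot)
{
	assert(slot < MODEL_MAX_BUFS);
	assert(ctx->bufs[slot].registered);

	ctx->accounted -= ctx->bufs[slot].acct_pages;
	ctx->bufs[slot].registered = false;
}
```

Registering slots 0 and 1 on the same 512-page compound page and then
unregistering slot 0 leaves "accounted" at zero in this model while slot 1
still pins the page.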

On systems with HugeTLB pages pre-allocated, an unprivileged user can
exploit this to pin memory beyond RLIMIT_MEMLOCK by cycling buffer
registrations. The bypass amount is proportional to the number of
available huge pages, potentially allowing gigabytes of memory to be
pinned while the kernel accounting shows near-zero.

Fix this by recalculating the actual pages to unaccount when unmapping
a buffer. For regular pages, always unaccount. For compound pages, only
unaccount if no other registered buffer references the same compound
page. This ensures the accounting persists until the last buffer
referencing the compound page is released.
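
The rule in the paragraph above can be sketched in user space as follows.
This is a simplified model, not the patch itself: the names (sketch_ctx,
sketch_calc_unaccount, and so on) are hypothetical, each buffer is assumed
to pin exactly one compound page, and there is no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define SKETCH_MAX_BUFS 8

struct sketch_buf {
	bool registered;
	int hpage_id;              /* which compound page this buffer pins */
};

struct sketch_ctx {
	struct sketch_buf bufs[SKETCH_MAX_BUFS];
	unsigned long accounted;   /* stand-in for the pinned-page counter */
	unsigned long hpage_pages; /* base pages per compound page */
};

/* Does any registered buffer other than 'self' reference this page? */
static bool sketch_other_user(const struct sketch_ctx *ctx, size_t self,
			      int hpage_id)
{
	for (size_t i = 0; i < SKETCH_MAX_BUFS; i++) {
		if (i == self || !ctx->bufs[i].registered)
			continue;
		if (ctx->bufs[i].hpage_id == hpage_id)
			return true;
	}
	return false;
}

/* Charge only when no other buffer already pins the page. */
static void sketch_register(struct sketch_ctx *ctx, size_t slot, int hpage_id)
{
	assert(slot < SKETCH_MAX_BUFS);
	assert(!ctx->bufs[slot].registered);

	if (!sketch_other_user(ctx, slot, hpage_id))
		ctx->accounted += ctx->hpage_pages;
	ctx->bufs[slot].hpage_id = hpage_id;
	ctx->bufs[slot].registered = true;
}

/* Recompute at unmap time: uncharge only if this is the last user. */
static unsigned long sketch_calc_unaccount(const struct sketch_ctx *ctx,
					   size_t slot)
{
	if (sketch_other_user(ctx, slot, ctx->bufs[slot].hpage_id))
		return 0;
	return ctx->hpage_pages;
}

static void sketch_unregister(struct sketch_ctx *ctx, size_t slot)
{
	assert(slot < SKETCH_MAX_BUFS);
	assert(ctx->bufs[slot].registered);

	ctx->accounted -= sketch_calc_unaccount(ctx, slot);
	ctx->bufs[slot].registered = false;
}
```

In this model the charge for a shared page survives until the last buffer
that references it is unregistered, at the cost of a scan over all slots
on every unregister.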

Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Fixes: 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages")
Cc: stable@vger.kernel.org
Signed-off-by: Yuhao Jiang <danisjiang@gmail.com>
---
 io_uring/rsrc.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index a63474b331bf..dcf2340af5a2 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -139,15 +139,80 @@ static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 		kvfree(imu);
 }
 
+/*
+ * Calculate pages to unaccount when unmapping a buffer. Regular pages are
+ * always counted. Compound pages are only counted if no other registered
+ * buffer references them, ensuring accounting persists until the last user.
+ */
+static unsigned long io_buffer_calc_unaccount(struct io_ring_ctx *ctx,
+					      struct io_mapped_ubuf *imu)
+{
+	struct page *last_hpage = NULL;
+	unsigned long acct = 0;
+	unsigned int i;
+
+	for (i = 0; i < imu->nr_bvecs; i++) {
+		struct page *page = imu->bvec[i].bv_page;
+		struct page *hpage;
+		unsigned int j;
+
+		if (!PageCompound(page)) {
+			acct++;
+			continue;
+		}
+
+		hpage = compound_head(page);
+		if (hpage == last_hpage)
+			continue;
+		last_hpage = hpage;
+
+		/* Check if we already processed this hpage earlier in this buffer */
+		for (j = 0; j < i; j++) {
+			if (PageCompound(imu->bvec[j].bv_page) &&
+			    compound_head(imu->bvec[j].bv_page) == hpage)
+				goto next_hpage;
+		}
+
+		/* Only unaccount if no other buffer references this page */
+		for (j = 0; j < ctx->buf_table.nr; j++) {
+			struct io_rsrc_node *node = ctx->buf_table.nodes[j];
+			struct io_mapped_ubuf *other;
+			unsigned int k;
+
+			if (!node)
+				continue;
+			other = node->buf;
+			if (other == imu)
+				continue;
+
+			for (k = 0; k < other->nr_bvecs; k++) {
+				struct page *op = other->bvec[k].bv_page;
+
+				if (!PageCompound(op))
+					continue;
+				if (compound_head(op) == hpage)
+					goto next_hpage;
+			}
+		}
+		acct += page_size(hpage) >> PAGE_SHIFT;
+next_hpage:
+		;
+	}
+	return acct;
+}
+
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 {
+	unsigned long acct;
+
 	if (unlikely(refcount_read(&imu->refs) > 1)) {
 		if (!refcount_dec_and_test(&imu->refs))
 			return;
 	}
 
-	if (imu->acct_pages)
-		io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
+	acct = io_buffer_calc_unaccount(ctx, imu);
+	if (acct)
+		io_unaccount_mem(ctx->user, ctx->mm_account, acct);
 	imu->release(imu->priv);
 	io_free_imu(ctx, imu);
 }
-- 
2.34.1



* Re: [PATCH] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass via compound page accounting
  2025-12-18  2:59 [PATCH] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass via compound page accounting Yuhao Jiang
@ 2026-01-09  3:02 ` Yuhao Jiang
  2026-01-13 19:44   ` Pavel Begunkov
  0 siblings, 1 reply; 5+ messages in thread
From: Yuhao Jiang @ 2026-01-09  3:02 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, linux-kernel, stable

Hi Jens, Pavel, and all,

Just a gentle follow-up on this patch.
Please let me know if there are any concerns or if changes are needed.

Thanks for your time.

Best regards,
Yuhao Jiang



* Re: [PATCH] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass via compound page accounting
  2026-01-09  3:02 ` Yuhao Jiang
@ 2026-01-13 19:44   ` Pavel Begunkov
  2026-01-14 14:10     ` Pavel Begunkov
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Begunkov @ 2026-01-13 19:44 UTC (permalink / raw)
  To: Yuhao Jiang, Jens Axboe; +Cc: io-uring, linux-kernel

On 1/9/26 03:02, Yuhao Jiang wrote:
> Hi Jens, Pavel, and all,
> 
> Just a gentle follow-up on this patch below.
> Please let me know if there are any concerns or if changes are needed.

I'm pretty sure this will break with buffer sharing / cloning. I'd
be tempted to remove all this cross-buffer accounting logic
and overestimate it; the current accounting is not sane.
Otherwise, it'll likely need some proxy object shared between
buffers or some other overly complicated solution.
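
One possible shape for such a shared proxy object is sketched below, purely
as an illustration. It is not a design proposed in this thread: struct
hpage_acct and both helpers are hypothetical names, and the sketch is
single-threaded, whereas kernel code would need locking or atomics plus a
way to look the object up from a page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical proxy: one object per pinned compound page, shared by every
 * buffer (clones included) that references the page.
 */
struct hpage_acct {
	unsigned long nr_pages; /* base pages charged for the compound page */
	unsigned int users;     /* buffers currently referencing the page */
};

/* Take a reference; returns the pages to charge (non-zero for first user). */
static unsigned long hpage_acct_get(struct hpage_acct *ha)
{
	return ha->users++ == 0 ? ha->nr_pages : 0;
}

/* Drop a reference; returns the pages to uncharge (non-zero for last user). */
static unsigned long hpage_acct_put(struct hpage_acct *ha)
{
	assert(ha->users > 0);
	return --ha->users == 0 ? ha->nr_pages : 0;
}
```

With a reference count per compound page, the charge is applied once and
dropped once regardless of how many buffers share the page, and no buffer
table walk is needed at unregister time.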


> Thanks for your time.
> 
> Best regards,
> Yuhao Jiang
> 
> On Wed, Dec 17, 2025 at 9:00 PM Yuhao Jiang <danisjiang@gmail.com> wrote:
>>
>> When multiple registered buffers share the same compound page, only the
>> first buffer accounts for the memory via io_buffer_account_pin(). The
>> subsequent buffers skip accounting since headpage_already_acct() returns
>> true.
>>
>> When the first buffer is unregistered, the accounting is decremented,
>> but the compound page remains pinned by the remaining buffers. This
>> creates a state where pinned memory is not properly accounted against
>> RLIMIT_MEMLOCK.
>>
>> On systems with HugeTLB pages pre-allocated, an unprivileged user can
>> exploit this to pin memory beyond RLIMIT_MEMLOCK by cycling buffer
>> registrations. The bypass amount is proportional to the number of
>> available huge pages, potentially allowing gigabytes of memory to be
>> pinned while the kernel accounting shows near-zero.
>>
>> Fix this by recalculating the actual pages to unaccount when unmapping
>> a buffer. For regular pages, always unaccount. For compound pages, only
>> unaccount if no other registered buffer references the same compound
>> page. This ensures the accounting persists until the last buffer
>> referencing the compound page is released.
>>
>> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
>> Fixes: 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages")

That's not the right commit. The accounting is ancient; the blame
should land somewhere around the first commits that added registered
buffers.

-- 
Pavel Begunkov



* Re: [PATCH] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass via compound page accounting
  2026-01-13 19:44   ` Pavel Begunkov
@ 2026-01-14 14:10     ` Pavel Begunkov
  2026-01-14 20:59       ` Yuhao Jiang
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Begunkov @ 2026-01-14 14:10 UTC (permalink / raw)
  To: Yuhao Jiang, Jens Axboe; +Cc: io-uring, linux-kernel

On 1/13/26 19:44, Pavel Begunkov wrote:
> On 1/9/26 03:02, Yuhao Jiang wrote:
>> Hi Jens, Pavel, and all,
>>
>> Just a gentle follow-up on this patch below.
>> Please let me know if there are any concerns or if changes are needed.
> 
> I'm pretty sure this will break with buffer sharing / cloning. I'd
> be tempted to remove all this cross-buffer accounting logic
> and overestimate it; the current accounting is not sane.
> Otherwise, it'll likely need some proxy object shared between
> buffers or some other overly complicated solution.

Another way would be to double account cloned buffers and then
have your patch, which combines overaccounting with the ugliness
of full buffer table walks.

>> On Wed, Dec 17, 2025 at 9:00 PM Yuhao Jiang <danisjiang@gmail.com> wrote:
>>>
>>> When multiple registered buffers share the same compound page, only the
>>> first buffer accounts for the memory via io_buffer_account_pin(). The
>>> subsequent buffers skip accounting since headpage_already_acct() returns
>>> true.
>>>
>>> When the first buffer is unregistered, the accounting is decremented,
>>> but the compound page remains pinned by the remaining buffers. This
>>> creates a state where pinned memory is not properly accounted against
>>> RLIMIT_MEMLOCK.
>>>
>>> On systems with HugeTLB pages pre-allocated, an unprivileged user can
>>> exploit this to pin memory beyond RLIMIT_MEMLOCK by cycling buffer
>>> registrations. The bypass amount is proportional to the number of
>>> available huge pages, potentially allowing gigabytes of memory to be
>>> pinned while the kernel accounting shows near-zero.
>>>
>>> Fix this by recalculating the actual pages to unaccount when unmapping
>>> a buffer. For regular pages, always unaccount. For compound pages, only
>>> unaccount if no other registered buffer references the same compound
>>> page. This ensures the accounting persists until the last buffer
>>> referencing the compound page is released.
>>>
>>> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
>>> Fixes: 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages")
> 
> That's not the right commit. The accounting is ancient; the blame
> should land somewhere around the first commits that added registered
> buffers.

Turns out it came just a bit later:

commit de2939388be564836b06f0f06b3787bdedaed822
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Sep 17 16:19:16 2020 -0600

     io_uring: improve registered buffer accounting for huge pages

-- 
Pavel Begunkov



* Re: [PATCH] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass via compound page accounting
  2026-01-14 14:10     ` Pavel Begunkov
@ 2026-01-14 20:59       ` Yuhao Jiang
  0 siblings, 0 replies; 5+ messages in thread
From: Yuhao Jiang @ 2026-01-14 20:59 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-kernel

Thanks for the review. I see the issues with buffer sharing/cloning and
the accounting concerns you pointed out. I'll rework this accordingly
and send a v2, and also fix the Fixes tag.

Best regards,
Yuhao Jiang

