public inbox for [email protected]
 help / color / mirror / Atom feed
From: David Hildenbrand <[email protected]>
To: Matthew Wilcox <[email protected]>, Vlastimil Babka <[email protected]>
Cc: Andrew Morton <[email protected]>,
	Mike Kravetz <[email protected]>,
	Luis Chamberlain <[email protected]>,
	Oscar Salvador <[email protected]>, Jens Axboe <[email protected]>,
	[email protected], [email protected]
Subject: Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
Date: Wed, 13 Dec 2023 13:23:05 +0100	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 08.12.23 19:31, Matthew Wilcox wrote:
> On Fri, Dec 08, 2023 at 06:54:19PM +0100, Vlastimil Babka wrote:
>> On 8/16/23 17:11, Matthew Wilcox (Oracle) wrote:
>>> We can use a bit in page[1].flags to indicate that this folio belongs
>>> to hugetlb instead of using a value in page[1].dtors.  That lets
>>> folio_test_hugetlb() become an inline function like it should be.
>>> We can also get rid of NULL_COMPOUND_DTOR.
>>>
>>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>>
>> I think this (as commit 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")) is
>> causing the bug reported by Luis here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=218227
> 
> Luis, please stop using bugzilla.  If you'd sent email like a normal
> kernel developer, I'd've seen this bug earlier.
> 
>>> page:000000009006bf10 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x3f8a0 pfn:0x1035c0
>>> flags: 0x17fffc000000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> page_type: 0xffffff7f(buddy)
>>> raw: 017fffc000000000 ffffe704c422f808 ffffe704c41ac008 0000000000000000
>>> raw: 000000000003f8a0 0000000000000005 00000000ffffff7f 0000000000000000
>>> page dumped because: VM_BUG_ON_PAGE(n > 0 && !((__builtin_constant_p(PG_head) && __builtin_constant_p((uintptr_t)(&page->flags) != (uintptr_t)((vo>
>>> ------------[ cut here ]------------
>>> kernel BUG at include/linux/page-flags.h:314!
>>> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>> CPU: 6 PID: 2435641 Comm: md5sum Not tainted 6.6.0-rc5 #2
>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>> RIP: 0010:folio_flags+0x65/0x70
>>> Code: a8 40 74 de 48 8b 47 48 a8 01 74 d6 48 83 e8 01 48 39 c7 75 bd eb cb 48 8b 07 a8 40 75 c8 48 c7 c6 d8 a7 c3 89 e8 3b c7 fa ff <0f> 0b 66 0f >
>>> RSP: 0018:ffffad51c0bfb7a8 EFLAGS: 00010246
>>> RAX: 000000000000015f RBX: ffffe704c40d7000 RCX: 0000000000000000
>>> RDX: 0000000000000000 RSI: ffffffff89be8040 RDI: 00000000ffffffff
>>> RBP: 0000000000103600 R08: 0000000000000000 R09: ffffad51c0bfb658
>>> R10: 0000000000000003 R11: ffffffff89eacb08 R12: 0000000000000035
>>> R13: ffffe704c40d7000 R14: 0000000000000000 R15: ffffad51c0bfb930
>>> FS:  00007f350c51b740(0000) GS:ffff9b62fbd80000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000555860919508 CR3: 00000001217fe002 CR4: 0000000000770ee0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> PKRU: 55555554
>>> Call Trace:
>>>   <TASK>
>>>   ? die+0x32/0x80
>>>   ? do_trap+0xd6/0x100
>>>   ? folio_flags+0x65/0x70
>>>   ? do_error_trap+0x6a/0x90
>>>   ? folio_flags+0x65/0x70
>>>   ? exc_invalid_op+0x4c/0x60
>>>   ? folio_flags+0x65/0x70
>>>   ? asm_exc_invalid_op+0x16/0x20
>>>   ? folio_flags+0x65/0x70
>>>   ? folio_flags+0x65/0x70
>>>   PageHuge+0x67/0x80
>>>   isolate_migratepages_block+0x1c5/0x13b0
>>>   compact_zone+0x746/0xfc0
>>>   compact_zone_order+0xbb/0x100
>>>   try_to_compact_pages+0xf0/0x2f0
>>>   __alloc_pages_direct_compact+0x78/0x210
>>>   __alloc_pages_slowpath.constprop.0+0xac1/0xdb0
>>>   __alloc_pages+0x218/0x240
>>>   folio_alloc+0x17/0x50
>>
>> It's because PageHuge() now does folio_test_hugetlb() which is documented to
>> assume caller holds a reference, but the test in
>> isolate_migratepages_block() doesn't. The code is there from 2021 by Oscar,
>> perhaps it could be changed to take a reference (and e.g. only test
>> PageHead() before), but it will be a bit involved as the
>> isolate_or_dissolve_huge_page() it calls has some logic based on the
>> refcount being zero/non-zero as well. Oscar, what do you think?
>> Also I wonder if any of the the other existing PageHuge() callers are also
>> affected because they might be doing so without a reference.
> 
> I don't think the warning is actually wrong!  We're living very
> dangerously here as PageHuge() could have returned a false positive
> before this change [1].  Then we assume that compound_nr() returns a
> consistent result (and compound_order() might, for example, return a
> value larger than 63, leading to UB).

All this code is racy (as spelled out in some of the code comments), and the
assumption is that races are tolerable. In the worst case, isolation fails on
races.

There is some code in there that sanitizes compound_order() return values :

		/*
		 * Regardless of being on LRU, compound pages such as THP and
		 * hugetlbfs are not to be compacted unless we are attempting
		 * an allocation much larger than the huge page size (eg CMA).
		 * We can potentially save a lot of iterations if we skip them
		 * at once. The check is racy, but we can consider only valid
		 * values and the only danger is skipping too much.
		 */
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order <= MAX_ORDER)) {
				low_pfn += (1UL << order) - 1;
				nr_scanned += (1UL << order) - 1;
			}
			goto isolate_fail;
		}

At least isolate_or_dissolve_huge_page() looks like it wants to deal with
concurrent dissolving of hugetlb folios properly. See below, if it is
actually correct.

> 
> I think there's a place for a folio_test_hugetlb_unsafe(), but that
> would only remove the warning, and do nothing to fix all the unsafe
> usage.  The hugetlb handling code in isolate_migratepages_block()
> doesn't seem to have any understanding that it's working with pages
> that can change under it.  

Staring at the code (once again), I think we might miss to sanitize
the compound_nr() return value; but I'm not sure if that is really
problematic; We can get out-of-memory low_pfn either way when we're
operating at the end of memory, memory holes etc.

> I can have a go at fixing it up, but maybe
> Oscar would prefer to do it?
> 
> [1] terribly unlikely, but consider what happens if the page starts out
> as a non-hugetlb compound allocation.  Then it is freed and reallocated;
> the new owner of the second page could have put enough information
> into it that tricked PageHuge() into returning true.  Then we call

I think that got more likely by using a pageflag :)

> isolate_or_dissolve_huge_page() which takes a lock, but doesn't take
> a refcount.  Now it gets a bogus hstate and things go downhill from
> there ...)

hugetlb folios can have a refocunt of 0, in which case they are considered
"free". The recount handling at the end of isolate_or_dissolve_huge_page()
gives a hint that this is how hugetlb operates.

So grabbing references as you're used to from !hugetlb code doesn't work.

That function needs some care, to deal with what you described. Maybe
something like the following (assuming page_folio() is fine):


bool is_hugetlb_folio = false;


/*
  * Especially for free hugetlb folios, we cannot use the recount
  * to stabilize. While holding the hugetlb_lock, no hugetlb folios can
  * be dissolved.
  */
spin_lock_irq(&hugetlb_lock);
folio = page_folio(page);

if (folio_test_hugetlb_unsafe(folio)) {
	/*
          * We might have a false positive. Make sure the folio hasn't
          * changed in the meantime and still is a hugetlb folio.
	 */
	smp_rmb();
	if (folio == page_folio(page) &&
	    folio_test_hugetlb_unsafe(folio))
		is_hugetlb_folio = true;
}

if (is_hugetlb_folio)
	h = folio_hstate(folio);
spin_unlock_irq(&hugetlb_lock);
if (!is_hugetlb_folio)
	return 0;


But devil is in the detail (could someone trick us twice?).

isolate_hugetlb() performs similar lockless checks under
hugetlb_lock, and likely we should just unify them once we figure
out the right thing to do.

-- 
Cheers,

David / dhildenb


  reply	other threads:[~2023-12-13 12:23 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 01/13] io_uring: Stop calling free_compound_page() Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 02/13] mm: Call free_huge_page() directly Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio() Matthew Wilcox (Oracle)
2023-08-16 20:14   ` Sidhartha Kumar
2023-08-17  3:31   ` Yanteng Si
2023-08-16 15:11 ` [PATCH v2 04/13] mm: Convert free_transhuge_folio() to folio_undo_large_rmappable() Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 05/13] mm; Convert prep_transhuge_page() to folio_prep_large_rmappable() Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 06/13] mm: Remove free_compound_page() and the compound_page_dtors array Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
2023-08-16 21:45   ` Sidhartha Kumar
2023-08-22  3:13   ` Mike Kravetz
2023-08-22  3:32     ` Matthew Wilcox
2023-08-22 17:19       ` Mike Kravetz
2023-12-08 17:54   ` Vlastimil Babka
2023-12-08 18:31     ` Matthew Wilcox
2023-12-13 12:23       ` David Hildenbrand [this message]
2023-08-16 15:11 ` [PATCH v2 08/13] mm: Add large_rmappable page flag Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 09/13] mm: Rearrange page flags Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 10/13] mm: Free up a word in the first tail page Matthew Wilcox (Oracle)
2023-08-22 23:17   ` Mike Kravetz
2023-08-23  0:29     ` Matthew Wilcox
2023-08-16 15:11 ` [PATCH v2 11/13] mm: Remove folio_test_transhuge() Matthew Wilcox (Oracle)
2023-08-16 15:12 ` [PATCH v2 12/13] mm: Add tail private fields to struct folio Matthew Wilcox (Oracle)
2023-08-16 15:21   ` David Hildenbrand
2023-08-16 15:12 ` [PATCH v2 13/13] mm: Convert split_huge_pages_pid() to use a folio Matthew Wilcox (Oracle)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox