public inbox for [email protected]
* [PATCH v2 00/13] Remove _folio_dtor and _folio_order
@ 2023-08-16 15:11 Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 01/13] io_uring: Stop calling free_compound_page() Matthew Wilcox (Oracle)
                   ` (12 more replies)
  0 siblings, 13 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Well, this patch series has got completely out of hand.  What started
out as "Let's try to optimise the freeing path a bit" turned into
"Hey, I can free up an entire word in the first tail page", and
grew "Oh, while we're here, let's rename a bunch of things".

No detailed changelog from v1 because _so much_ changed.

Matthew Wilcox (Oracle) (13):
  io_uring: Stop calling free_compound_page()
  mm: Call free_huge_page() directly
  mm: Convert free_huge_page() to free_huge_folio()
  mm: Convert free_transhuge_folio() to folio_undo_large_rmappable()
  mm: Convert prep_transhuge_page() to folio_prep_large_rmappable()
  mm: Remove free_compound_page() and the compound_page_dtors array
  mm: Remove HUGETLB_PAGE_DTOR
  mm: Add large_rmappable page flag
  mm: Rearrange page flags
  mm: Free up a word in the first tail page
  mm: Remove folio_test_transhuge()
  mm: Add tail private fields to struct folio
  mm: Convert split_huge_pages_pid() to use a folio

 .../admin-guide/kdump/vmcoreinfo.rst          | 14 +--
 Documentation/mm/hugetlbfs_reserv.rst         | 14 +--
 .../zh_CN/mm/hugetlbfs_reserv.rst             |  4 +-
 include/linux/huge_mm.h                       |  6 +-
 include/linux/hugetlb.h                       |  3 +-
 include/linux/mm.h                            | 39 +-------
 include/linux/mm_types.h                      | 19 +++-
 include/linux/page-flags.h                    | 60 ++++++++----
 io_uring/io_uring.c                           |  6 +-
 io_uring/kbuf.c                               |  6 +-
 kernel/crash_core.c                           |  4 +-
 mm/huge_memory.c                              | 51 +++++-----
 mm/hugetlb.c                                  | 97 ++++++-------------
 mm/internal.h                                 |  5 +-
 mm/khugepaged.c                               |  2 +-
 mm/memcontrol.c                               |  2 +-
 mm/mempolicy.c                                | 15 +--
 mm/page_alloc.c                               | 41 +++-----
 18 files changed, 161 insertions(+), 227 deletions(-)

-- 
2.40.1



* [PATCH v2 01/13] io_uring: Stop calling free_compound_page()
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 02/13] mm: Call free_huge_page() directly Matthew Wilcox (Oracle)
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm,
	David Hildenbrand

folio_put() is the standard way to write this, and it's not appreciably
slower.  This is an enabling patch for removing free_compound_page()
entirely.
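
To see why this is equivalent, here is a simplified sketch of what
folio_put() looks like in include/linux/mm.h at this point: it performs
the same refcount drop the open-coded version spelled out, then takes
the generic compound-aware free path.

	static inline void folio_put(struct folio *folio)
	{
		if (folio_put_testzero(folio))	/* ref count dropped to zero? */
			__folio_put(folio);	/* generic free path */
	}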

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.c | 6 +-----
 io_uring/kbuf.c     | 6 +-----
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index dcf5fc7d2820..a5b9b5de7aff 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2664,14 +2664,10 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 static void io_mem_free(void *ptr)
 {
-	struct page *page;
-
 	if (!ptr)
 		return;
 
-	page = virt_to_head_page(ptr);
-	if (put_page_testzero(page))
-		free_compound_page(page);
+	folio_put(virt_to_folio(ptr));
 }
 
 static void io_pages_free(struct page ***pages, int npages)
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 2f0181521c98..556f4df25b0f 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -218,11 +218,7 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 	if (bl->is_mapped) {
 		i = bl->buf_ring->tail - bl->head;
 		if (bl->is_mmap) {
-			struct page *page;
-
-			page = virt_to_head_page(bl->buf_ring);
-			if (put_page_testzero(page))
-				free_compound_page(page);
+			folio_put(virt_to_folio(bl->buf_ring));
 			bl->buf_ring = NULL;
 			bl->is_mmap = 0;
 		} else if (bl->buf_nr_pages) {
-- 
2.40.1



* [PATCH v2 02/13] mm: Call free_huge_page() directly
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 01/13] io_uring: Stop calling free_compound_page() Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio() Matthew Wilcox (Oracle)
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Indirect calls are expensive, thanks to Spectre.  Call free_huge_page()
directly if the folio belongs to hugetlb.
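
Illustrative sketch only (hypothetical names, not the patch itself):
under retpolines, an indirect call through a destructor table costs
considerably more than a well-predicted test plus a direct call.

	struct obj;
	typedef void (*dtor_fn)(struct obj *);
	extern dtor_fn dtor_table[];
	extern bool obj_is_hugetlb(struct obj *o);
	extern void free_hugetlb(struct obj *o);
	extern unsigned int obj_dtor_id(struct obj *o);

	void destroy(struct obj *o)
	{
		if (obj_is_hugetlb(o)) {	/* direct call: no retpoline */
			free_hugetlb(o);
			return;
		}
		dtor_table[obj_dtor_id(o)](o);	/* indirect call: retpoline cost */
	}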

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/hugetlb.h | 3 ++-
 mm/page_alloc.c         | 8 +++++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0a393bc02f25..5a1dfaffbd80 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,6 +26,8 @@ typedef struct { unsigned long pd; } hugepd_t;
 #define __hugepd(x) ((hugepd_t) { (x) })
 #endif
 
+void free_huge_page(struct page *page);
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include <linux/mempolicy.h>
@@ -165,7 +167,6 @@ int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 				bool *migratable_cleared);
 void folio_putback_active_hugetlb(struct folio *folio);
 void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int reason);
-void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fe9ff917850..548c8016190b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -287,9 +287,6 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
 static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
 	[NULL_COMPOUND_DTOR] = NULL,
 	[COMPOUND_PAGE_DTOR] = free_compound_page,
-#ifdef CONFIG_HUGETLB_PAGE
-	[HUGETLB_PAGE_DTOR] = free_huge_page,
-#endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	[TRANSHUGE_PAGE_DTOR] = free_transhuge_page,
 #endif
@@ -622,6 +619,11 @@ void destroy_large_folio(struct folio *folio)
 {
 	enum compound_dtor_id dtor = folio->_folio_dtor;
 
+	if (folio_test_hugetlb(folio)) {
+		free_huge_page(&folio->page);
+		return;
+	}
+
 	VM_BUG_ON_FOLIO(dtor >= NR_COMPOUND_DTORS, folio);
 	compound_page_dtors[dtor](&folio->page);
 }
-- 
2.40.1



* [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio()
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 01/13] io_uring: Stop calling free_compound_page() Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 02/13] mm: Call free_huge_page() directly Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 20:14   ` Sidhartha Kumar
  2023-08-17  3:31   ` Yanteng Si
  2023-08-16 15:11 ` [PATCH v2 04/13] mm: Convert free_transhuge_folio() to folio_undo_large_rmappable() Matthew Wilcox (Oracle)
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm,
	Yanteng Si

Pass a folio instead of the head page to save a few instructions.
Update the documentation, at least in English.
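
A sketch of the instructions being saved: with the old signature, every
call had to re-derive the folio, i.e. a compound_head() load and mask
that the new entry point avoids.  A compatibility wrapper would look
like this (hypothetical; the series removes the old name instead):

	void free_huge_page(struct page *page)
	{
		struct folio *folio = page_folio(page);	/* compound_head() lookup */

		free_huge_folio(folio);
	}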

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Yanteng Si <[email protected]>
---
 Documentation/mm/hugetlbfs_reserv.rst         | 14 +++---
 .../zh_CN/mm/hugetlbfs_reserv.rst             |  4 +-
 include/linux/hugetlb.h                       |  2 +-
 mm/hugetlb.c                                  | 48 +++++++++----------
 mm/page_alloc.c                               |  2 +-
 5 files changed, 34 insertions(+), 36 deletions(-)

diff --git a/Documentation/mm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst
index d9c2b0f01dcd..4914fbf07966 100644
--- a/Documentation/mm/hugetlbfs_reserv.rst
+++ b/Documentation/mm/hugetlbfs_reserv.rst
@@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages).
 Freeing Huge Pages
 ==================
 
-Huge page freeing is performed by the routine free_huge_page().  This routine
-is the destructor for hugetlbfs compound pages.  As a result, it is only
-passed a pointer to the page struct.  When a huge page is freed, reservation
-accounting may need to be performed.  This would be the case if the page was
-associated with a subpool that contained reserves, or the page is being freed
-on an error path where a global reserve count must be restored.
+Huge pages are freed by free_huge_folio().  It is only passed a pointer
+to the folio as it is called from the generic MM code.  When a huge page
+is freed, reservation accounting may need to be performed.  This would
+be the case if the page was associated with a subpool that contained
+reserves, or the page is being freed on an error path where a global
+reserve count must be restored.
 
 The page->private field points to any subpool associated with the page.
 If the PagePrivate flag is set, it indicates the global reserve count should
@@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge
 page is allocated but before it is instantiated.  In this case, the page
 allocation has consumed the reservation and made the appropriate subpool,
 reservation map and global count adjustments.  If the page is freed at this
-time (before instantiation and clearing of PagePrivate), then free_huge_page
+time (before instantiation and clearing of PagePrivate), then free_huge_folio
 will increment the global reservation count.  However, the reservation map
 indicates the reservation was consumed.  This resulting inconsistent state
 will cause the 'leak' of a reserved huge page.  The global reserve count will
diff --git a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
index b7a0544224ad..0f7e7fb5ca8c 100644
--- a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
+++ b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
@@ -219,7 +219,7 @@ vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_
 释放巨页
 ========
 
-巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
+巨页释放是由函数free_huge_folio()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
 递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保
 留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。
 
@@ -387,7 +387,7 @@ region_count()在解除私有巨页映射时被调用。在私有映射中,预
 
 然而,有几种情况是,在一个巨页被分配后,但在它被实例化之前,就遇到了错误。在这种情况下,
 页面分配已经消耗了预留,并进行了适当的子池、预留映射和全局计数调整。如果页面在这个时候被释放
-(在实例化和清除PagePrivate之前),那么free_huge_page将增加全局预留计数。然而,预留映射
+(在实例化和清除PagePrivate之前),那么free_huge_folio将增加全局预留计数。然而,预留映射
 显示报留被消耗了。这种不一致的状态将导致预留的巨页的 “泄漏” 。全局预留计数将比它原本的要高,
 并阻止分配一个预先分配的页面。
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5a1dfaffbd80..5b2626063f4f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,7 +26,7 @@ typedef struct { unsigned long pd; } hugepd_t;
 #define __hugepd(x) ((hugepd_t) { (x) })
 #endif
 
-void free_huge_page(struct page *page);
+void free_huge_folio(struct folio *folio);
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e327a5a7602c..086eb51bf845 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1706,10 +1706,10 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 	zeroed = folio_put_testzero(folio);
 	if (unlikely(!zeroed))
 		/*
-		 * It is VERY unlikely soneone else has taken a ref on
-		 * the page.  In this case, we simply return as the
-		 * hugetlb destructor (free_huge_page) will be called
-		 * when this other ref is dropped.
+		 * It is VERY unlikely someone else has taken a ref
+		 * on the folio.  In this case, we simply return as
+		 * free_huge_folio() will be called when this other ref
+		 * is dropped.
 		 */
 		return;
 
@@ -1875,13 +1875,12 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
-void free_huge_page(struct page *page)
+void free_huge_folio(struct folio *folio)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
 	 * compound page destructor.
 	 */
-	struct folio *folio = page_folio(page);
 	struct hstate *h = folio_hstate(folio);
 	int nid = folio_nid(folio);
 	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
@@ -1936,7 +1935,7 @@ void free_huge_page(struct page *page)
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
 		update_and_free_hugetlb_folio(h, folio, true);
 	} else {
-		arch_clear_hugepage_flags(page);
+		arch_clear_hugepage_flags(&folio->page);
 		enqueue_hugetlb_folio(h, folio);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
@@ -2246,7 +2245,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 		folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
 					nodes_allowed, node_alloc_noretry);
 		if (folio) {
-			free_huge_page(&folio->page); /* free it into the hugepage allocator */
+			free_huge_folio(folio); /* free it into the hugepage allocator */
 			return 1;
 		}
 	}
@@ -2429,13 +2428,13 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
 	 * if we would end up overcommiting the surpluses. Abuse
-	 * temporary page to workaround the nasty free_huge_page
+	 * temporary page to workaround the nasty free_huge_folio
 	 * codeflow
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		folio_set_hugetlb_temporary(folio);
 		spin_unlock_irq(&hugetlb_lock);
-		free_huge_page(&folio->page);
+		free_huge_folio(folio);
 		return NULL;
 	}
 
@@ -2547,8 +2546,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	__must_hold(&hugetlb_lock)
 {
 	LIST_HEAD(surplus_list);
-	struct folio *folio;
-	struct page *page, *tmp;
+	struct folio *folio, *tmp;
 	int ret;
 	long i;
 	long needed, allocated;
@@ -2608,21 +2606,21 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	ret = 0;
 
 	/* Free the needed pages to the hugetlb pool */
-	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
+	list_for_each_entry_safe(folio, tmp, &surplus_list, lru) {
 		if ((--needed) < 0)
 			break;
 		/* Add the page to the hugetlb allocator */
-		enqueue_hugetlb_folio(h, page_folio(page));
+		enqueue_hugetlb_folio(h, folio);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);
 
 	/*
 	 * Free unnecessary surplus pages to the buddy allocator.
-	 * Pages have no ref count, call free_huge_page directly.
+	 * Pages have no ref count, call free_huge_folio directly.
 	 */
-	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
-		free_huge_page(page);
+	list_for_each_entry_safe(folio, tmp, &surplus_list, lru)
+		free_huge_folio(folio);
 	spin_lock_irq(&hugetlb_lock);
 
 	return ret;
@@ -2836,11 +2834,11 @@ static long vma_del_reservation(struct hstate *h,
  * 2) No reservation was in place for the page, so hugetlb_restore_reserve is
  *    not set.  However, alloc_hugetlb_folio always updates the reserve map.
  *
- * In case 1, free_huge_page later in the error path will increment the
- * global reserve count.  But, free_huge_page does not have enough context
+ * In case 1, free_huge_folio later in the error path will increment the
+ * global reserve count.  But, free_huge_folio does not have enough context
  * to adjust the reservation map.  This case deals primarily with private
  * mappings.  Adjust the reserve map here to be consistent with global
- * reserve count adjustments to be made by free_huge_page.  Make sure the
+ * reserve count adjustments to be made by free_huge_folio.  Make sure the
  * reserve map indicates there is a reservation present.
  *
  * In case 2, simply undo reserve map modifications done by alloc_hugetlb_folio.
@@ -2856,7 +2854,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 			 * Rare out of memory condition in reserve map
 			 * manipulation.  Clear hugetlb_restore_reserve so
 			 * that global reserve count will not be incremented
-			 * by free_huge_page.  This will make it appear
+			 * by free_huge_folio.  This will make it appear
 			 * as though the reservation for this folio was
 			 * consumed.  This may prevent the task from
 			 * faulting in the folio at a later time.  This
@@ -3232,7 +3230,7 @@ static void __init gather_bootmem_prealloc(void)
 		if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
 			WARN_ON(folio_test_reserved(folio));
 			prep_new_hugetlb_folio(h, folio, folio_nid(folio));
-			free_huge_page(page); /* add to the hugepage allocator */
+			free_huge_folio(folio); /* add to the hugepage allocator */
 		} else {
 			/* VERY unlikely inflated ref count on a tail page */
 			free_gigantic_folio(folio, huge_page_order(h));
@@ -3264,7 +3262,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 					&node_states[N_MEMORY], NULL);
 			if (!folio)
 				break;
-			free_huge_page(&folio->page); /* free it into the hugepage allocator */
+			free_huge_folio(folio); /* free it into the hugepage allocator */
 		}
 		cond_resched();
 	}
@@ -3542,7 +3540,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	while (count > persistent_huge_pages(h)) {
 		/*
 		 * If this allocation races such that we no longer need the
-		 * page, free_huge_page will handle it by freeing the page
+		 * page, free_huge_folio will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
@@ -3658,7 +3656,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 			prep_compound_page(subpage, target_hstate->order);
 		folio_change_private(inner_folio, NULL);
 		prep_new_hugetlb_folio(target_hstate, inner_folio, nid);
-		free_huge_page(subpage);
+		free_huge_folio(inner_folio);
 	}
 	mutex_unlock(&target_hstate->resize_lock);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 548c8016190b..b569fd5562aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -620,7 +620,7 @@ void destroy_large_folio(struct folio *folio)
 	enum compound_dtor_id dtor = folio->_folio_dtor;
 
 	if (folio_test_hugetlb(folio)) {
-		free_huge_page(&folio->page);
+		free_huge_folio(folio);
 		return;
 	}
 
-- 
2.40.1



* [PATCH v2 04/13] mm: Convert free_transhuge_folio() to folio_undo_large_rmappable()
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio() Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 05/13] mm: Convert prep_transhuge_page() to folio_prep_large_rmappable() Matthew Wilcox (Oracle)
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Indirect calls are expensive, thanks to Spectre.  Test for
TRANSHUGE_PAGE_DTOR and destroy the folio appropriately.  Move the
free_compound_page() call into destroy_large_folio() to simplify later
patches.
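
The rewritten function below uses a common lockless-check-then-lock
shape; a generic sketch of that pattern, with hypothetical 'item' and
'queue' names:

	if (data_race(list_empty(&item->list)))	/* unlocked fast path */
		return;

	spin_lock_irqsave(&queue->lock, flags);
	if (!list_empty(&item->list)) {		/* re-check under the lock */
		queue->len--;
		list_del(&item->list);
	}
	spin_unlock_irqrestore(&queue->lock, flags);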

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/huge_mm.h |  2 --
 include/linux/mm.h      |  2 --
 mm/huge_memory.c        | 22 +++++++++++-----------
 mm/internal.h           |  2 ++
 mm/page_alloc.c         |  9 ++++++---
 5 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..f351c3f9d58b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,8 +144,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
 void prep_transhuge_page(struct page *page);
-void free_transhuge_page(struct page *page);
-
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 19493d6a2bb8..6c338b65b86b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1281,9 +1281,7 @@ enum compound_dtor_id {
 #ifdef CONFIG_HUGETLB_PAGE
 	HUGETLB_PAGE_DTOR,
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	TRANSHUGE_PAGE_DTOR,
-#endif
 	NR_COMPOUND_DTORS,
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8480728fa220..9598bbe6c792 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2779,10 +2779,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	return ret;
 }
 
-void free_transhuge_page(struct page *page)
+void folio_undo_large_rmappable(struct folio *folio)
 {
-	struct folio *folio = (struct folio *)page;
-	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+	struct deferred_split *ds_queue;
 	unsigned long flags;
 
 	/*
@@ -2790,15 +2789,16 @@ void free_transhuge_page(struct page *page)
 	 * deferred_list. If folio is not in deferred_list, it's safe
 	 * to check without acquiring the split_queue_lock.
 	 */
-	if (data_race(!list_empty(&folio->_deferred_list))) {
-		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-		if (!list_empty(&folio->_deferred_list)) {
-			ds_queue->split_queue_len--;
-			list_del(&folio->_deferred_list);
-		}
-		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	if (data_race(list_empty(&folio->_deferred_list)))
+		return;
+
+	ds_queue = get_deferred_split_queue(folio);
+	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	if (!list_empty(&folio->_deferred_list)) {
+		ds_queue->split_queue_len--;
+		list_del(&folio->_deferred_list);
 	}
-	free_compound_page(page);
+	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
 
 void deferred_split_folio(struct folio *folio)
diff --git a/mm/internal.h b/mm/internal.h
index 5a03bc4782a2..1e98c867f0de 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -413,6 +413,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }
 
+void folio_undo_large_rmappable(struct folio *folio);
+
 static inline void prep_compound_head(struct page *page, unsigned int order)
 {
 	struct folio *folio = (struct folio *)page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b569fd5562aa..0dbc2ecdefa5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -287,9 +287,6 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
 static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
 	[NULL_COMPOUND_DTOR] = NULL,
 	[COMPOUND_PAGE_DTOR] = free_compound_page,
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	[TRANSHUGE_PAGE_DTOR] = free_transhuge_page,
-#endif
 };
 
 int min_free_kbytes = 1024;
@@ -624,6 +621,12 @@ void destroy_large_folio(struct folio *folio)
 		return;
 	}
 
+	if (folio_test_transhuge(folio) && dtor == TRANSHUGE_PAGE_DTOR) {
+		folio_undo_large_rmappable(folio);
+		free_compound_page(&folio->page);
+		return;
+	}
+
 	VM_BUG_ON_FOLIO(dtor >= NR_COMPOUND_DTORS, folio);
 	compound_page_dtors[dtor](&folio->page);
 }
-- 
2.40.1



* [PATCH v2 05/13] mm: Convert prep_transhuge_page() to folio_prep_large_rmappable()
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 04/13] mm: Convert free_transhuge_folio() to folio_undo_large_rmappable() Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 06/13] mm: Remove free_compound_page() and the compound_page_dtors array Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Match folio_undo_large_rmappable(), and move the casting from page to
folio into the callers (which they were largely doing anyway).
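
A usage sketch of the caller-side pattern this establishes (hypothetical
call site): a folio and its head page share an address, so the cast is
free and a NULL page stays a NULL folio.

	struct page *page = alloc_pages(gfp | __GFP_COMP, order);
	struct folio *folio = (struct folio *)page;

	if (folio && order > 1)	/* _deferred_list lives in the second
				 * tail page, so order must be >= 2 */
		folio_prep_large_rmappable(folio);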

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/huge_mm.h |  4 ++--
 mm/huge_memory.c        |  4 +---
 mm/khugepaged.c         |  2 +-
 mm/mempolicy.c          | 15 ++++++++-------
 mm/page_alloc.c         |  7 ++++---
 5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f351c3f9d58b..6d812b8856c8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -143,7 +143,7 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
-void prep_transhuge_page(struct page *page);
+void folio_prep_large_rmappable(struct folio *folio);
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
@@ -283,7 +283,7 @@ static inline bool hugepage_vma_check(struct vm_area_struct *vma,
 	return false;
 }
 
-static inline void prep_transhuge_page(struct page *page) {}
+static inline void folio_prep_large_rmappable(struct folio *folio) {}
 
 #define transparent_hugepage_flags 0UL
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9598bbe6c792..04664e6918c1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -577,10 +577,8 @@ struct deferred_split *get_deferred_split_queue(struct folio *folio)
 }
 #endif
 
-void prep_transhuge_page(struct page *page)
+void folio_prep_large_rmappable(struct folio *folio)
 {
-	struct folio *folio = (struct folio *)page;
-
 	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
 	INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bb76a5d454de..a8e0eca2cd1e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -896,7 +896,7 @@ static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node,
 		return false;
 	}
 
-	prep_transhuge_page(*hpage);
+	folio_prep_large_rmappable((struct folio *)*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return true;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c53f8beeb507..4afbb67ccf27 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2189,9 +2189,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 		mpol_cond_put(pol);
 		gfp |= __GFP_COMP;
 		page = alloc_page_interleave(gfp, order, nid);
-		if (page && order > 1)
-			prep_transhuge_page(page);
 		folio = (struct folio *)page;
+		if (folio && order > 1)
+			folio_prep_large_rmappable(folio);
 		goto out;
 	}
 
@@ -2202,9 +2202,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 		gfp |= __GFP_COMP;
 		page = alloc_pages_preferred_many(gfp, order, node, pol);
 		mpol_cond_put(pol);
-		if (page && order > 1)
-			prep_transhuge_page(page);
 		folio = (struct folio *)page;
+		if (folio && order > 1)
+			folio_prep_large_rmappable(folio);
 		goto out;
 	}
 
@@ -2300,10 +2300,11 @@ EXPORT_SYMBOL(alloc_pages);
 struct folio *folio_alloc(gfp_t gfp, unsigned order)
 {
 	struct page *page = alloc_pages(gfp | __GFP_COMP, order);
+	struct folio *folio = (struct folio *)page;
 
-	if (page && order > 1)
-		prep_transhuge_page(page);
-	return (struct folio *)page;
+	if (folio && order > 1)
+		folio_prep_large_rmappable(folio);
+	return folio;
 }
 EXPORT_SYMBOL(folio_alloc);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0dbc2ecdefa5..5ee4dc9318b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4548,10 +4548,11 @@ struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
 {
 	struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
 			preferred_nid, nodemask);
+	struct folio *folio = (struct folio *)page;
 
-	if (page && order > 1)
-		prep_transhuge_page(page);
-	return (struct folio *)page;
+	if (folio && order > 1)
+		folio_prep_large_rmappable(folio);
+	return folio;
 }
 EXPORT_SYMBOL(__folio_alloc);
 
-- 
2.40.1



* [PATCH v2 06/13] mm: Remove free_compound_page() and the compound_page_dtors array
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 05/13] mm: Convert prep_transhuge_page() to folio_prep_large_rmappable() Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

The only remaining destructor is free_compound_page().  Inline it
into destroy_large_folio() and remove the array it used to live in.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/mm.h | 10 ----------
 mm/page_alloc.c    | 24 +++++-------------------
 2 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6c338b65b86b..7b800d1298dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1267,14 +1267,6 @@ void folio_copy(struct folio *dst, struct folio *src);
 
 unsigned long nr_free_buffer_pages(void);
 
-/*
- * Compound pages have a destructor function.  Provide a
- * prototype for that function and accessor functions.
- * These are _only_ valid on the head of a compound page.
- */
-typedef void compound_page_dtor(struct page *);
-
-/* Keep the enum in sync with compound_page_dtors array in mm/page_alloc.c */
 enum compound_dtor_id {
 	NULL_COMPOUND_DTOR,
 	COMPOUND_PAGE_DTOR,
@@ -1327,8 +1319,6 @@ static inline unsigned long thp_size(struct page *page)
 	return PAGE_SIZE << thp_order(page);
 }
 
-void free_compound_page(struct page *page);
-
 #ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ee4dc9318b7..9638fdddf065 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -284,11 +284,6 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
 #endif
 };
 
-static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
-	[NULL_COMPOUND_DTOR] = NULL,
-	[COMPOUND_PAGE_DTOR] = free_compound_page,
-};
-
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
@@ -587,19 +582,13 @@ static inline void free_the_page(struct page *page, unsigned int order)
  * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
  * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
  *
- * The first tail page's ->compound_dtor holds the offset in array of compound
- * page destructors. See compound_page_dtors.
+ * The first tail page's ->compound_dtor describes how to destroy the
+ * compound page.
  *
  * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
  */
 
-void free_compound_page(struct page *page)
-{
-	mem_cgroup_uncharge(page_folio(page));
-	free_the_page(page, compound_order(page));
-}
-
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
@@ -621,14 +610,11 @@ void destroy_large_folio(struct folio *folio)
 		return;
 	}
 
-	if (folio_test_transhuge(folio) && dtor == TRANSHUGE_PAGE_DTOR) {
+	if (folio_test_transhuge(folio) && dtor == TRANSHUGE_PAGE_DTOR)
 		folio_undo_large_rmappable(folio);
-		free_compound_page(&folio->page);
-		return;
-	}
 
-	VM_BUG_ON_FOLIO(dtor >= NR_COMPOUND_DTORS, folio);
-	compound_page_dtors[dtor](&folio->page);
+	mem_cgroup_uncharge(folio);
+	free_the_page(&folio->page, folio_order(folio));
 }
 
 static inline void set_buddy_order(struct page *page, unsigned int order)
-- 
2.40.1



* [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 06/13] mm: Remove free_compound_page() and the compound_page_dtors array Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 21:45   ` Sidhartha Kumar
                     ` (2 more replies)
  2023-08-16 15:11 ` [PATCH v2 08/13] mm: Add large_rmappable page flag Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  12 siblings, 3 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

We can use a bit in page[1].flags to indicate that this folio belongs
to hugetlb instead of using a value in page[1].compound_dtor.  That lets
folio_test_hugetlb() become an inline function like it should be.
We can also get rid of NULL_COMPOUND_DTOR.
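
The PF_SECOND accessors added below operate on the first tail page's
flags word; a simplified sketch of what the generated setter amounts to:

	static inline void folio_set_hugetlb(struct folio *folio)
	{
		/* folio_flags(folio, 1) resolves to &page[1].flags */
		set_bit(PG_hugetlb, folio_flags(folio, 1));
	}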

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 .../admin-guide/kdump/vmcoreinfo.rst          | 10 +---
 include/linux/mm.h                            |  4 --
 include/linux/page-flags.h                    | 43 ++++++++++++----
 kernel/crash_core.c                           |  2 +-
 mm/hugetlb.c                                  | 49 +++----------------
 mm/page_alloc.c                               |  2 +-
 6 files changed, 43 insertions(+), 67 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index c18d94fa6470..baa1c355741d 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -325,8 +325,8 @@ NR_FREE_PAGES
 On linux-2.6.21 or later, the number of free pages is in
 vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
 
-PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
-------------------------------------------------------------------------------
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
+-----------------------------------------------------------------------------------------
 
 Page attributes. These flags are used to filter various unnecessary for
 dumping pages.
@@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
 dumping pages.
 
 
-HUGETLB_PAGE_DTOR
------------------
-
-The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
-excludes these pages.
-
 x86_64
 ======
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b800d1298dc..642f5fe5860e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1268,11 +1268,7 @@ void folio_copy(struct folio *dst, struct folio *src);
 unsigned long nr_free_buffer_pages(void);
 
 enum compound_dtor_id {
-	NULL_COMPOUND_DTOR,
 	COMPOUND_PAGE_DTOR,
-#ifdef CONFIG_HUGETLB_PAGE
-	HUGETLB_PAGE_DTOR,
-#endif
 	TRANSHUGE_PAGE_DTOR,
 	NR_COMPOUND_DTORS,
 };
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..aeecf0cf1456 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -171,15 +171,6 @@ enum pageflags {
 	/* Remapped by swiotlb-xen. */
 	PG_xen_remapped = PG_owner_priv_1,
 
-#ifdef CONFIG_MEMORY_FAILURE
-	/*
-	 * Compound pages. Stored in first tail page's flags.
-	 * Indicates that at least one subpage is hwpoisoned in the
-	 * THP.
-	 */
-	PG_has_hwpoisoned = PG_error,
-#endif
-
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
 
@@ -190,6 +181,15 @@ enum pageflags {
 	/* For self-hosted memmap pages */
 	PG_vmemmap_self_hosted = PG_owner_priv_1,
 #endif
+
+	/*
+	 * Flags only valid for compound pages.  Stored in first tail page's
+	 * flags word.
+	 */
+
+	/* At least one page in this folio has the hwpoison flag set */
+	PG_has_hwpoisoned = PG_error,
+	PG_hugetlb = PG_active,
 };
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
@@ -812,7 +812,23 @@ static inline void ClearPageCompound(struct page *page)
 
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
-bool folio_test_hugetlb(struct folio *folio);
+SETPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
+CLEARPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
+
+/**
+ * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
+ * @folio: The folio to test.
+ *
+ * Context: Any context.  Caller should have a reference on the folio to
+ * prevent it from being turned into a tail page.
+ * Return: True for hugetlbfs folios, false for anon folios or folios
+ * belonging to other filesystems.
+ */
+static inline bool folio_test_hugetlb(struct folio *folio)
+{
+	return folio_test_large(folio) &&
+		test_bit(PG_hugetlb, folio_flags(folio, 1));
+}
 #else
 TESTPAGEFLAG_FALSE(Huge, hugetlb)
 #endif
@@ -1040,6 +1056,13 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
 #define PAGE_FLAGS_CHECK_AT_PREP	\
 	((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
 
+/*
+ * Flags stored in the second page of a compound page.  They may overlap
+ * the CHECK_AT_FREE flags above, so need to be cleared.
+ */
+#define PAGE_FLAGS_SECOND						\
+	(1UL << PG_has_hwpoisoned	| 1UL << PG_hugetlb)
+
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
 /**
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 90ce1dfd591c..dd5f87047d06 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -490,7 +490,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 #define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
 	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
-	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+	VMCOREINFO_NUMBER(PG_hugetlb);
 #define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
 	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 086eb51bf845..389490f100b0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1585,25 +1585,7 @@ static inline void __clear_hugetlb_destructor(struct hstate *h,
 {
 	lockdep_assert_held(&hugetlb_lock);
 
-	/*
-	 * Very subtle
-	 *
-	 * For non-gigantic pages set the destructor to the normal compound
-	 * page dtor.  This is needed in case someone takes an additional
-	 * temporary ref to the page, and freeing is delayed until they drop
-	 * their reference.
-	 *
-	 * For gigantic pages set the destructor to the null dtor.  This
-	 * destructor will never be called.  Before freeing the gigantic
-	 * page destroy_compound_gigantic_folio will turn the folio into a
-	 * simple group of pages.  After this the destructor does not
-	 * apply.
-	 *
-	 */
-	if (hstate_is_gigantic(h))
-		folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
-	else
-		folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+	folio_clear_hugetlb(folio);
 }
 
 /*
@@ -1690,7 +1672,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 		h->surplus_huge_pages_node[nid]++;
 	}
 
-	folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR);
+	folio_set_hugetlb(folio);
 	folio_change_private(folio, NULL);
 	/*
 	 * We have to set hugetlb_vmemmap_optimized again as above
@@ -1814,9 +1796,8 @@ static void free_hpage_workfn(struct work_struct *work)
 		/*
 		 * The VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio) in
 		 * folio_hstate() is going to trigger because a previous call to
-		 * remove_hugetlb_folio() will call folio_set_compound_dtor
-		 * (folio, NULL_COMPOUND_DTOR), so do not use folio_hstate()
-		 * directly.
+		 * remove_hugetlb_folio() will clear the hugetlb bit, so do
+		 * not use folio_hstate() directly.
 		 */
 		h = size_to_hstate(page_size(page));
 
@@ -1955,7 +1936,7 @@ static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	hugetlb_vmemmap_optimize(h, &folio->page);
 	INIT_LIST_HEAD(&folio->lru);
-	folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR);
+	folio_set_hugetlb(folio);
 	hugetlb_set_folio_subpool(folio, NULL);
 	set_hugetlb_cgroup(folio, NULL);
 	set_hugetlb_cgroup_rsvd(folio, NULL);
@@ -2070,28 +2051,10 @@ int PageHuge(struct page *page)
 	if (!PageCompound(page))
 		return 0;
 	folio = page_folio(page);
-	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
+	return folio_test_hugetlb(folio);
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
-/**
- * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
- * @folio: The folio to test.
- *
- * Context: Any context.  Caller should have a reference on the folio to
- * prevent it from being turned into a tail page.
- * Return: True for hugetlbfs folios, false for anon folios or folios
- * belonging to other filesystems.
- */
-bool folio_test_hugetlb(struct folio *folio)
-{
-	if (!folio_test_large(folio))
-		return false;
-
-	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
-}
-EXPORT_SYMBOL_GPL(folio_test_hugetlb);
-
 /*
  * Find and lock address space (mapping) in write mode.
  *
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9638fdddf065..f8e276de4fd5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1122,7 +1122,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
 		if (compound)
-			ClearPageHasHWPoisoned(page);
+			page[1].flags &= ~PAGE_FLAGS_SECOND;
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_page_prepare(page, page + i);
-- 
2.40.1



* [PATCH v2 08/13] mm: Add large_rmappable page flag
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 09/13] mm: Rearrange page flags Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Stored in the first tail page's flags, this flag replaces the destructor.
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.
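
A lifecycle sketch tying the flag together (function names are from this
series; the call sites are abbreviated):

	folio_prep_large_rmappable(folio);	/* allocation: set the bit in the
						 * first tail page's flags */
	/* ... folio in use ... */
	destroy_large_folio(folio);		/* free: folio_test_large_rmappable()
						 * gates folio_undo_large_rmappable() */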

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 Documentation/admin-guide/kdump/vmcoreinfo.rst |  4 ++--
 include/linux/mm.h                             | 13 -------------
 include/linux/mm_types.h                       |  2 --
 include/linux/page-flags.h                     |  7 ++++++-
 kernel/crash_core.c                            |  1 -
 mm/huge_memory.c                               |  4 ++--
 mm/internal.h                                  |  1 -
 mm/page_alloc.c                                |  7 +------
 8 files changed, 11 insertions(+), 28 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index baa1c355741d..3bd38ac0e7de 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -141,8 +141,8 @@ nodemask_t
 The size of a nodemask_t type. Used to compute the number of online
 nodes.
 
-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
--------------------------------------------------------------------------------------------------
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+----------------------------------------------------------------------------------
 
 User-space tools compute their values based on the offset of these
 variables. The variables are used when excluding unnecessary pages.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 642f5fe5860e..cf0ae8c51d7f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1267,19 +1267,6 @@ void folio_copy(struct folio *dst, struct folio *src);
 
 unsigned long nr_free_buffer_pages(void);
 
-enum compound_dtor_id {
-	COMPOUND_PAGE_DTOR,
-	TRANSHUGE_PAGE_DTOR,
-	NR_COMPOUND_DTORS,
-};
-
-static inline void folio_set_compound_dtor(struct folio *folio,
-		enum compound_dtor_id compound_dtor)
-{
-	VM_BUG_ON_FOLIO(compound_dtor >= NR_COMPOUND_DTORS, folio);
-	folio->_folio_dtor = compound_dtor;
-}
-
 void destroy_large_folio(struct folio *folio);
 
 /* Returns the number of bytes in this potentially compound page. */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index da538ff68953..d45a2b8041e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -282,7 +282,6 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
  * @_refcount: Do not access this member directly.  Use folio_ref_count()
  *    to find how many references there are to this folio.
  * @memcg_data: Memory Control Group data.
- * @_folio_dtor: Which destructor to use for this folio.
  * @_folio_order: Do not use directly, call folio_order().
  * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
  * @_nr_pages_mapped: Do not use directly, call folio_mapcount().
@@ -336,7 +335,6 @@ struct folio {
 			unsigned long _flags_1;
 			unsigned long _head_1;
 	/* public: */
-			unsigned char _folio_dtor;
 			unsigned char _folio_order;
 			atomic_t _entire_mapcount;
 			atomic_t _nr_pages_mapped;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index aeecf0cf1456..732d13c708e7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -190,6 +190,7 @@ enum pageflags {
 	/* At least one page in this folio has the hwpoison flag set */
 	PG_has_hwpoisoned = PG_error,
 	PG_hugetlb = PG_active,
+	PG_large_rmappable = PG_workingset, /* anon or file-backed */
 };
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
@@ -806,6 +807,9 @@ static inline void ClearPageCompound(struct page *page)
 	BUG_ON(!PageHead(page));
 	ClearPageHead(page);
 }
+PAGEFLAG(LargeRmappable, large_rmappable, PF_SECOND)
+#else
+TESTPAGEFLAG_FALSE(LargeRmappable, large_rmappable)
 #endif
 
 #define PG_head_mask ((1UL << PG_head))
@@ -1061,7 +1065,8 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
  * the CHECK_AT_FREE flags above, so need to be cleared.
  */
 #define PAGE_FLAGS_SECOND						\
-	(1UL << PG_has_hwpoisoned	| 1UL << PG_hugetlb)
+	(1UL << PG_has_hwpoisoned	| 1UL << PG_hugetlb |		\
+	 1UL << PG_large_rmappable)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index dd5f87047d06..934dd86e19f5 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -455,7 +455,6 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(page, lru);
 	VMCOREINFO_OFFSET(page, _mapcount);
 	VMCOREINFO_OFFSET(page, private);
-	VMCOREINFO_OFFSET(folio, _folio_dtor);
 	VMCOREINFO_OFFSET(folio, _folio_order);
 	VMCOREINFO_OFFSET(page, compound_head);
 	VMCOREINFO_OFFSET(pglist_data, node_zones);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 04664e6918c1..c721f7ec5b6a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -581,7 +581,7 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
 	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
 	INIT_LIST_HEAD(&folio->_deferred_list);
-	folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR);
+	folio_set_large_rmappable(folio);
 }
 
 static inline bool is_transparent_hugepage(struct page *page)
@@ -593,7 +593,7 @@ static inline bool is_transparent_hugepage(struct page *page)
 
 	folio = page_folio(page);
 	return is_huge_zero_page(&folio->page) ||
-	       folio->_folio_dtor == TRANSHUGE_PAGE_DTOR;
+		folio_test_large_rmappable(folio);
 }
 
 static unsigned long __thp_get_unmapped_area(struct file *filp,
diff --git a/mm/internal.h b/mm/internal.h
index 1e98c867f0de..9dc7629ffbc9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -419,7 +419,6 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 {
 	struct folio *folio = (struct folio *)page;
 
-	folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
 	folio_set_order(folio, order);
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8e276de4fd5..81b1c7e3a28b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -582,9 +582,6 @@ static inline void free_the_page(struct page *page, unsigned int order)
  * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
  * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
  *
- * The first tail page's ->compound_dtor describes how to destroy the
- * compound page.
- *
  * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
  */
@@ -603,14 +600,12 @@ void prep_compound_page(struct page *page, unsigned int order)
 
 void destroy_large_folio(struct folio *folio)
 {
-	enum compound_dtor_id dtor = folio->_folio_dtor;
-
 	if (folio_test_hugetlb(folio)) {
 		free_huge_folio(folio);
 		return;
 	}
 
-	if (folio_test_transhuge(folio) && dtor == TRANSHUGE_PAGE_DTOR)
+	if (folio_test_large_rmappable(folio))
 		folio_undo_large_rmappable(folio);
 
 	mem_cgroup_uncharge(folio);
-- 
2.40.1



* [PATCH v2 09/13] mm: Rearrange page flags
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 08/13] mm: Add large_rmappable page flag Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:11 ` [PATCH v2 10/13] mm: Free up a word in the first tail page Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Move PG_writeback into the bottom byte so that it can share a byte with
PG_waiters, which a later patch will make use of.  Move PG_head into the
bottom byte as well, to match where 'order' moves in the next patch.
PG_active and PG_workingset move into the second byte to make room for
them.

By putting PG_head in bit 6, we ensure that it is cleared by assigning
the folio order to the bottom byte of the first tail page (since the
order cannot be larger than 63).
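
A worked example of the bit 6 argument (a sketch; order 9 is the x86-64
PMD order):

	/* 9 = 0b00001001: bits 0-5 carry the order, while bits 6 (PG_head)
	 * and 7 (PG_waiters) are written as zero, since any order <= 63
	 * fits in the low six bits. */
	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | 9;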

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/page-flags.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 732d13c708e7..b452fba9bc71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -99,13 +99,15 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
+	PG_writeback,		/* Page is under writeback */
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
+	PG_head,		/* Must be in bit 6 */
+	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
 	PG_active,
 	PG_workingset,
-	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
 	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
@@ -113,8 +115,6 @@ enum pageflags {
 	PG_reserved,
 	PG_private,		/* If pagecache, has fs-private data */
 	PG_private_2,		/* If pagecache, has fs aux data */
-	PG_writeback,		/* Page is under writeback */
-	PG_head,		/* A head page */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
-- 
2.40.1



* [PATCH v2 10/13] mm: Free up a word in the first tail page
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (8 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 09/13] mm: Rearrange page flags Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-22 23:17   ` Mike Kravetz
  2023-08-16 15:11 ` [PATCH v2 11/13] mm: Remove folio_test_transhuge() Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Store the folio order in the low byte of the flags word in the first
tail page.  This frees up the word that was being used to store the
order and dtor bytes previously.
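
The encode/decode pair, pulled out of the diff below as a sketch:

	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;	/* folio_set_order() */
	order = folio->_flags_1 & 0xff;				/* folio_order() */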

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/mm.h         | 10 +++++-----
 include/linux/mm_types.h   |  3 +--
 include/linux/page-flags.h |  7 ++++---
 kernel/crash_core.c        |  1 -
 mm/internal.h              |  2 +-
 5 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf0ae8c51d7f..85568e2b2556 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1028,7 +1028,7 @@ struct inode;
  * compound_order() can be called without holding a reference, which means
  * that niceties like page_folio() don't work.  These callers should be
  * prepared to handle wild return values.  For example, PG_head may be
- * set before _folio_order is initialised, or this may be a tail page.
+ * set before the order is initialised, or this may be a tail page.
  * See compaction.c for some good examples.
  */
 static inline unsigned int compound_order(struct page *page)
@@ -1037,7 +1037,7 @@ static inline unsigned int compound_order(struct page *page)
 
 	if (!test_bit(PG_head, &folio->flags))
 		return 0;
-	return folio->_folio_order;
+	return folio->_flags_1 & 0xff;
 }
 
 /**
@@ -1053,7 +1053,7 @@ static inline unsigned int folio_order(struct folio *folio)
 {
 	if (!folio_test_large(folio))
 		return 0;
-	return folio->_folio_order;
+	return folio->_flags_1 & 0xff;
 }
 
 #include <linux/huge_mm.h>
@@ -2025,7 +2025,7 @@ static inline long folio_nr_pages(struct folio *folio)
 #ifdef CONFIG_64BIT
 	return folio->_folio_nr_pages;
 #else
-	return 1L << folio->_folio_order;
+	return 1L << (folio->_flags_1 & 0xff);
 #endif
 }
 
@@ -2043,7 +2043,7 @@ static inline unsigned long compound_nr(struct page *page)
 #ifdef CONFIG_64BIT
 	return folio->_folio_nr_pages;
 #else
-	return 1L << folio->_folio_order;
+	return 1L << (folio->_flags_1 & 0xff);
 #endif
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d45a2b8041e0..659c7b84726c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -282,7 +282,6 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
  * @_refcount: Do not access this member directly.  Use folio_ref_count()
  *    to find how many references there are to this folio.
  * @memcg_data: Memory Control Group data.
- * @_folio_order: Do not use directly, call folio_order().
  * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
  * @_nr_pages_mapped: Do not use directly, call folio_mapcount().
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
@@ -334,8 +333,8 @@ struct folio {
 		struct {
 			unsigned long _flags_1;
 			unsigned long _head_1;
+			unsigned long _folio_avail;
 	/* public: */
-			unsigned char _folio_order;
 			atomic_t _entire_mapcount;
 			atomic_t _nr_pages_mapped;
 			atomic_t _pincount;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b452fba9bc71..5b466e619f71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -184,7 +184,8 @@ enum pageflags {
 
 	/*
 	 * Flags only valid for compound pages.  Stored in first tail page's
-	 * flags word.
+	 * flags word.  Cannot use the first 8 flags or any flag marked as
+	 * PF_ANY.
 	 */
 
 	/* At least one page in this folio has the hwpoison flag set */
@@ -1065,8 +1066,8 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
  * the CHECK_AT_FREE flags above, so need to be cleared.
  */
 #define PAGE_FLAGS_SECOND						\
-	(1UL << PG_has_hwpoisoned	| 1UL << PG_hugetlb |		\
-	 1UL << PG_large_rmappable)
+	(0xffUL /* order */		| 1UL << PG_has_hwpoisoned |	\
+	 1UL << PG_hugetlb		| 1UL << PG_large_rmappable)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 934dd86e19f5..693445e1f7f6 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -455,7 +455,6 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(page, lru);
 	VMCOREINFO_OFFSET(page, _mapcount);
 	VMCOREINFO_OFFSET(page, private);
-	VMCOREINFO_OFFSET(folio, _folio_order);
 	VMCOREINFO_OFFSET(page, compound_head);
 	VMCOREINFO_OFFSET(pglist_data, node_zones);
 	VMCOREINFO_OFFSET(pglist_data, nr_zones);
diff --git a/mm/internal.h b/mm/internal.h
index 9dc7629ffbc9..5c777b6779fa 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -407,7 +407,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 	if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
 		return;
 
-	folio->_folio_order = order;
+	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
 #ifdef CONFIG_64BIT
 	folio->_folio_nr_pages = 1U << order;
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 11/13] mm: Remove folio_test_transhuge()
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (9 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 10/13] mm: Free up a word in the first tail page Matthew Wilcox (Oracle)
@ 2023-08-16 15:11 ` Matthew Wilcox (Oracle)
  2023-08-16 15:12 ` [PATCH v2 12/13] mm: Add tail private fields to struct folio Matthew Wilcox (Oracle)
  2023-08-16 15:12 ` [PATCH v2 13/13] mm: Convert split_huge_pages_pid() to use a folio Matthew Wilcox (Oracle)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

This function is misleading; people think it means "Is this a THP?",
when all it actually does is check whether this is a large folio.
Remove it; the one remaining user should have been checking whether
the folio is PMD-sized.
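
Roughly, the distinction (a sketch assuming x86-64 with 4KB pages,
where HPAGE_PMD_ORDER is 9):

	/* True for any folio with order > 0, e.g. a 16KB folio. */
	folio_test_large(folio);

	/* True only when the folio is large enough to be mapped by a
	 * PMD, i.e. folio_order(folio) >= HPAGE_PMD_ORDER. */
	folio_test_pmd_mappable(folio);

so the NR_ANON_THPS counters below only move for PMD-sized folios.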

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/page-flags.h | 5 -----
 mm/memcontrol.c            | 2 +-
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5b466e619f71..e3ca17e95bbf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -853,11 +853,6 @@ static inline int PageTransHuge(struct page *page)
 	return PageHead(page);
 }
 
-static inline bool folio_test_transhuge(struct folio *folio)
-{
-	return folio_test_head(folio);
-}
-
 /*
  * PageTransCompound returns true for both transparent huge pages
  * and hugetlbfs pages, so it should only be called when it's known
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35d7e66ab032..67bda1ceedbe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5765,7 +5765,7 @@ static int mem_cgroup_move_account(struct page *page,
 		if (folio_mapped(folio)) {
 			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-			if (folio_test_transhuge(folio)) {
+			if (folio_test_pmd_mappable(folio)) {
 				__mod_lruvec_state(from_vec, NR_ANON_THPS,
 						   -nr_pages);
 				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 12/13] mm: Add tail private fields to struct folio
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (10 preceding siblings ...)
  2023-08-16 15:11 ` [PATCH v2 11/13] mm: Remove folio_test_transhuge() Matthew Wilcox (Oracle)
@ 2023-08-16 15:12 ` Matthew Wilcox (Oracle)
  2023-08-16 15:21   ` David Hildenbrand
  2023-08-16 15:12 ` [PATCH v2 13/13] mm: Convert split_huge_pages_pid() to use a folio Matthew Wilcox (Oracle)
  12 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Because THP_SWAP uses page->private for each page, we must not use
the folio fields which overlap that field for anything else.  We
avoid the conflict on 32-bit systems by disallowing THP_SWAP there.
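
The layout constraint can be asserted at compile time, in the same
spirit as the FOLIO_MATCH checks below (a standalone sketch with
grossly simplified stand-in structures, not the real definitions):

	#include <assert.h>
	#include <stddef.h>

	struct page_sketch {
		unsigned long flags;
		unsigned long compound_head;
		unsigned long private;
	};

	struct folio_tail_sketch {
		unsigned long _flags_1;
		unsigned long _head_1;
		unsigned long _private_1;	/* must overlay private */
	};

	static_assert(offsetof(struct folio_tail_sketch, _private_1) ==
		      offsetof(struct page_sketch, private),
		      "_private_1 must line up with page->private");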

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 include/linux/mm_types.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 659c7b84726c..3880b3f2e321 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -340,8 +340,11 @@ struct folio {
 			atomic_t _pincount;
 #ifdef CONFIG_64BIT
 			unsigned int _folio_nr_pages;
-#endif
+			/* 4 byte gap here */
 	/* private: the union with struct page is transitional */
+			/* Fix THP_SWAP to not use tail->private */
+			unsigned long _private_1;
+#endif
 		};
 		struct page __page_1;
 	};
@@ -362,6 +365,9 @@ struct folio {
 	/* public: */
 			struct list_head _deferred_list;
 	/* private: the union with struct page is transitional */
+			unsigned long _avail_2a;
+			/* Fix THP_SWAP to not use tail->private */
+			unsigned long _private_2a;
 		};
 		struct page __page_2;
 	};
@@ -386,12 +392,18 @@ FOLIO_MATCH(memcg_data, memcg_data);
 			offsetof(struct page, pg) + sizeof(struct page))
 FOLIO_MATCH(flags, _flags_1);
 FOLIO_MATCH(compound_head, _head_1);
+#ifdef CONFIG_64BIT
+FOLIO_MATCH(private, _private_1);
+#endif
 #undef FOLIO_MATCH
 #define FOLIO_MATCH(pg, fl)						\
 	static_assert(offsetof(struct folio, fl) ==			\
 			offsetof(struct page, pg) + 2 * sizeof(struct page))
 FOLIO_MATCH(flags, _flags_2);
 FOLIO_MATCH(compound_head, _head_2);
+FOLIO_MATCH(flags, _flags_2a);
+FOLIO_MATCH(compound_head, _head_2a);
+FOLIO_MATCH(private, _private_2a);
 #undef FOLIO_MATCH
 
 /*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 13/13] mm: Convert split_huge_pages_pid() to use a folio
  2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
                   ` (11 preceding siblings ...)
  2023-08-16 15:12 ` [PATCH v2 12/13] mm: Add tail private fields to struct folio Matthew Wilcox (Oracle)
@ 2023-08-16 15:12 ` Matthew Wilcox (Oracle)
  12 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-16 15:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox (Oracle), Jens Axboe, io-uring, linux-mm

Replaces five calls to compound_head() with one.
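
The shape of the loop body after the conversion (a condensed sketch
of the pattern, not the complete function):

	struct folio *folio = page_folio(page);	/* compound_head() once */

	if (!folio_trylock(folio))	/* the folio_* helpers need no */
		goto next;		/* further head page lookups */
	if (!split_folio(folio))
		split++;
	folio_unlock(folio);
next:
	folio_put(folio);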

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
 mm/huge_memory.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c721f7ec5b6a..4ffc78edaf26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -584,14 +584,11 @@ void folio_prep_large_rmappable(struct folio *folio)
 	folio_set_large_rmappable(folio);
 }
 
-static inline bool is_transparent_hugepage(struct page *page)
+static inline bool is_transparent_hugepage(struct folio *folio)
 {
-	struct folio *folio;
-
-	if (!PageCompound(page))
+	if (!folio_test_large(folio))
 		return false;
 
-	folio = page_folio(page);
 	return is_huge_zero_page(&folio->page) ||
 		folio_test_large_rmappable(folio);
 }
@@ -3015,6 +3012,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 	for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct page *page;
+		struct folio *folio;
 
 		if (!vma)
 			break;
@@ -3031,22 +3029,23 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (IS_ERR_OR_NULL(page))
 			continue;
 
-		if (!is_transparent_hugepage(page))
+		folio = page_folio(page);
+		if (!is_transparent_hugepage(folio))
 			goto next;
 
 		total++;
-		if (!can_split_folio(page_folio(page), NULL))
+		if (!can_split_folio(folio, NULL))
 			goto next;
 
-		if (!trylock_page(page))
+		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_huge_page(page))
+		if (!split_folio(folio))
 			split++;
 
-		unlock_page(page);
+		folio_unlock(folio);
 next:
-		put_page(page);
+		folio_put(folio);
 		cond_resched();
 	}
 	mmap_read_unlock(mm);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 12/13] mm: Add tail private fields to struct folio
  2023-08-16 15:12 ` [PATCH v2 12/13] mm: Add tail private fields to struct folio Matthew Wilcox (Oracle)
@ 2023-08-16 15:21   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2023-08-16 15:21 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton; +Cc: Jens Axboe, io-uring, linux-mm

On 16.08.23 17:12, Matthew Wilcox (Oracle) wrote:
> Because THP_SWAP uses page->private for each page, we must not use
> the folio fields which overlap that field for anything else.  We
> avoid the conflict on 32-bit systems by disallowing THP_SWAP there.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
>   include/linux/mm_types.h | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 659c7b84726c..3880b3f2e321 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -340,8 +340,11 @@ struct folio {
>   			atomic_t _pincount;
>   #ifdef CONFIG_64BIT
>   			unsigned int _folio_nr_pages;
> -#endif
> +			/* 4 byte gap here */
>   	/* private: the union with struct page is transitional */
> +			/* Fix THP_SWAP to not use tail->private */
> +			unsigned long _private_1;

If you could give

https://lkml.kernel.org/r/[email protected]

a quick glimpse to see if anything important jumps at you, that would be 
great.

That really should be fixed.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio()
  2023-08-16 15:11 ` [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio() Matthew Wilcox (Oracle)
@ 2023-08-16 20:14   ` Sidhartha Kumar
  2023-08-17  3:31   ` Yanteng Si
  1 sibling, 0 replies; 26+ messages in thread
From: Sidhartha Kumar @ 2023-08-16 20:14 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton
  Cc: Jens Axboe, io-uring, linux-mm, Yanteng Si

On 8/16/23 8:11 AM, Matthew Wilcox (Oracle) wrote:
> Pass a folio instead of the head page to save a few instructions.
> Update the documentation, at least in English.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Yanteng Si <[email protected]>
> ---
>   Documentation/mm/hugetlbfs_reserv.rst         | 14 +++---
>   .../zh_CN/mm/hugetlbfs_reserv.rst             |  4 +-
>   include/linux/hugetlb.h                       |  2 +-
>   mm/hugetlb.c                                  | 48 +++++++++----------
>   mm/page_alloc.c                               |  2 +-
>   5 files changed, 34 insertions(+), 36 deletions(-)
> 

Reviewed-by: Sidhartha Kumar <[email protected]>

> diff --git a/Documentation/mm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst
> index d9c2b0f01dcd..4914fbf07966 100644
> --- a/Documentation/mm/hugetlbfs_reserv.rst
> +++ b/Documentation/mm/hugetlbfs_reserv.rst
> @@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages).
>   Freeing Huge Pages
>   ==================
>   
> -Huge page freeing is performed by the routine free_huge_page().  This routine
> -is the destructor for hugetlbfs compound pages.  As a result, it is only
> -passed a pointer to the page struct.  When a huge page is freed, reservation
> -accounting may need to be performed.  This would be the case if the page was
> -associated with a subpool that contained reserves, or the page is being freed
> -on an error path where a global reserve count must be restored.
> +Huge pages are freed by free_huge_folio().  It is only passed a pointer
> +to the folio as it is called from the generic MM code.  When a huge page
> +is freed, reservation accounting may need to be performed.  This would
> +be the case if the page was associated with a subpool that contained
> +reserves, or the page is being freed on an error path where a global
> +reserve count must be restored.
>   
>   The page->private field points to any subpool associated with the page.
>   If the PagePrivate flag is set, it indicates the global reserve count should
> @@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge
>   page is allocated but before it is instantiated.  In this case, the page
>   allocation has consumed the reservation and made the appropriate subpool,
>   reservation map and global count adjustments.  If the page is freed at this
> -time (before instantiation and clearing of PagePrivate), then free_huge_page
> +time (before instantiation and clearing of PagePrivate), then free_huge_folio
>   will increment the global reservation count.  However, the reservation map
>   indicates the reservation was consumed.  This resulting inconsistent state
>   will cause the 'leak' of a reserved huge page.  The global reserve count will
> diff --git a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
> index b7a0544224ad..0f7e7fb5ca8c 100644
> --- a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
> +++ b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
> @@ -219,7 +219,7 @@ vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_
>   释放巨页
>   ========
>   
> -巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
> +巨页释放是由函数free_huge_folio()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
>   递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保
>   留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。
>   
> @@ -387,7 +387,7 @@ region_count()在解除私有巨页映射时被调用。在私有映射中,预
>   
>   然而,有几种情况是,在一个巨页被分配后,但在它被实例化之前,就遇到了错误。在这种情况下,
>   页面分配已经消耗了预留,并进行了适当的子池、预留映射和全局计数调整。如果页面在这个时候被释放
> -(在实例化和清除PagePrivate之前),那么free_huge_page将增加全局预留计数。然而,预留映射
> +(在实例化和清除PagePrivate之前),那么free_huge_folio将增加全局预留计数。然而,预留映射
>   显示报留被消耗了。这种不一致的状态将导致预留的巨页的 “泄漏” 。全局预留计数将比它原本的要高,
>   并阻止分配一个预先分配的页面。
>   
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 5a1dfaffbd80..5b2626063f4f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,7 +26,7 @@ typedef struct { unsigned long pd; } hugepd_t;
>   #define __hugepd(x) ((hugepd_t) { (x) })
>   #endif
>   
> -void free_huge_page(struct page *page);
> +void free_huge_folio(struct folio *folio);
>   
>   #ifdef CONFIG_HUGETLB_PAGE
>   
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e327a5a7602c..086eb51bf845 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1706,10 +1706,10 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>   	zeroed = folio_put_testzero(folio);
>   	if (unlikely(!zeroed))
>   		/*
> -		 * It is VERY unlikely soneone else has taken a ref on
> -		 * the page.  In this case, we simply return as the
> -		 * hugetlb destructor (free_huge_page) will be called
> -		 * when this other ref is dropped.
> +		 * It is VERY unlikely soneone else has taken a ref
> +		 * on the folio.  In this case, we simply return as
> +		 * free_huge_folio() will be called when this other ref
> +		 * is dropped.
>   		 */
>   		return;
>   
> @@ -1875,13 +1875,12 @@ struct hstate *size_to_hstate(unsigned long size)
>   	return NULL;
>   }
>   
> -void free_huge_page(struct page *page)
> +void free_huge_folio(struct folio *folio)
>   {
>   	/*
>   	 * Can't pass hstate in here because it is called from the
>   	 * compound page destructor.
>   	 */
> -	struct folio *folio = page_folio(page);
>   	struct hstate *h = folio_hstate(folio);
>   	int nid = folio_nid(folio);
>   	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
> @@ -1936,7 +1935,7 @@ void free_huge_page(struct page *page)
>   		spin_unlock_irqrestore(&hugetlb_lock, flags);
>   		update_and_free_hugetlb_folio(h, folio, true);
>   	} else {
> -		arch_clear_hugepage_flags(page);
> +		arch_clear_hugepage_flags(&folio->page);
>   		enqueue_hugetlb_folio(h, folio);
>   		spin_unlock_irqrestore(&hugetlb_lock, flags);
>   	}
> @@ -2246,7 +2245,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>   		folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
>   					nodes_allowed, node_alloc_noretry);
>   		if (folio) {
> -			free_huge_page(&folio->page); /* free it into the hugepage allocator */
> +			free_huge_folio(folio); /* free it into the hugepage allocator */
>   			return 1;
>   		}
>   	}
> @@ -2429,13 +2428,13 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>   	 * We could have raced with the pool size change.
>   	 * Double check that and simply deallocate the new page
>   	 * if we would end up overcommiting the surpluses. Abuse
> -	 * temporary page to workaround the nasty free_huge_page
> +	 * temporary page to workaround the nasty free_huge_folio
>   	 * codeflow
>   	 */
>   	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>   		folio_set_hugetlb_temporary(folio);
>   		spin_unlock_irq(&hugetlb_lock);
> -		free_huge_page(&folio->page);
> +		free_huge_folio(folio);
>   		return NULL;
>   	}
>   
> @@ -2547,8 +2546,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   	__must_hold(&hugetlb_lock)
>   {
>   	LIST_HEAD(surplus_list);
> -	struct folio *folio;
> -	struct page *page, *tmp;
> +	struct folio *folio, *tmp;
>   	int ret;
>   	long i;
>   	long needed, allocated;
> @@ -2608,21 +2606,21 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   	ret = 0;
>   
>   	/* Free the needed pages to the hugetlb pool */
> -	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> +	list_for_each_entry_safe(folio, tmp, &surplus_list, lru) {
>   		if ((--needed) < 0)
>   			break;
>   		/* Add the page to the hugetlb allocator */
> -		enqueue_hugetlb_folio(h, page_folio(page));
> +		enqueue_hugetlb_folio(h, folio);
>   	}
>   free:
>   	spin_unlock_irq(&hugetlb_lock);
>   
>   	/*
>   	 * Free unnecessary surplus pages to the buddy allocator.
> -	 * Pages have no ref count, call free_huge_page directly.
> +	 * Pages have no ref count, call free_huge_folio directly.
>   	 */
> -	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
> -		free_huge_page(page);
> +	list_for_each_entry_safe(folio, tmp, &surplus_list, lru)
> +		free_huge_folio(folio);
>   	spin_lock_irq(&hugetlb_lock);
>   
>   	return ret;
> @@ -2836,11 +2834,11 @@ static long vma_del_reservation(struct hstate *h,
>    * 2) No reservation was in place for the page, so hugetlb_restore_reserve is
>    *    not set.  However, alloc_hugetlb_folio always updates the reserve map.
>    *
> - * In case 1, free_huge_page later in the error path will increment the
> - * global reserve count.  But, free_huge_page does not have enough context
> + * In case 1, free_huge_folio later in the error path will increment the
> + * global reserve count.  But, free_huge_folio does not have enough context
>    * to adjust the reservation map.  This case deals primarily with private
>    * mappings.  Adjust the reserve map here to be consistent with global
> - * reserve count adjustments to be made by free_huge_page.  Make sure the
> + * reserve count adjustments to be made by free_huge_folio.  Make sure the
>    * reserve map indicates there is a reservation present.
>    *
>    * In case 2, simply undo reserve map modifications done by alloc_hugetlb_folio.
> @@ -2856,7 +2854,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>   			 * Rare out of memory condition in reserve map
>   			 * manipulation.  Clear hugetlb_restore_reserve so
>   			 * that global reserve count will not be incremented
> -			 * by free_huge_page.  This will make it appear
> +			 * by free_huge_folio.  This will make it appear
>   			 * as though the reservation for this folio was
>   			 * consumed.  This may prevent the task from
>   			 * faulting in the folio at a later time.  This
> @@ -3232,7 +3230,7 @@ static void __init gather_bootmem_prealloc(void)
>   		if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
>   			WARN_ON(folio_test_reserved(folio));
>   			prep_new_hugetlb_folio(h, folio, folio_nid(folio));
> -			free_huge_page(page); /* add to the hugepage allocator */
> +			free_huge_folio(folio); /* add to the hugepage allocator */
>   		} else {
>   			/* VERY unlikely inflated ref count on a tail page */
>   			free_gigantic_folio(folio, huge_page_order(h));
> @@ -3264,7 +3262,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>   					&node_states[N_MEMORY], NULL);
>   			if (!folio)
>   				break;
> -			free_huge_page(&folio->page); /* free it into the hugepage allocator */
> +			free_huge_folio(folio); /* free it into the hugepage allocator */
>   		}
>   		cond_resched();
>   	}
> @@ -3542,7 +3540,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	while (count > persistent_huge_pages(h)) {
>   		/*
>   		 * If this allocation races such that we no longer need the
> -		 * page, free_huge_page will handle it by freeing the page
> +		 * page, free_huge_folio will handle it by freeing the page
>   		 * and reducing the surplus.
>   		 */
>   		spin_unlock_irq(&hugetlb_lock);
> @@ -3658,7 +3656,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
>   			prep_compound_page(subpage, target_hstate->order);
>   		folio_change_private(inner_folio, NULL);
>   		prep_new_hugetlb_folio(target_hstate, inner_folio, nid);
> -		free_huge_page(subpage);
> +		free_huge_folio(inner_folio);
>   	}
>   	mutex_unlock(&target_hstate->resize_lock);
>   
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 548c8016190b..b569fd5562aa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -620,7 +620,7 @@ void destroy_large_folio(struct folio *folio)
>   	enum compound_dtor_id dtor = folio->_folio_dtor;
>   
>   	if (folio_test_hugetlb(folio)) {
> -		free_huge_page(&folio->page);
> +		free_huge_folio(folio);
>   		return;
>   	}
>   


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
@ 2023-08-16 21:45   ` Sidhartha Kumar
  2023-08-22  3:13   ` Mike Kravetz
  2023-12-08 17:54   ` Vlastimil Babka
  2 siblings, 0 replies; 26+ messages in thread
From: Sidhartha Kumar @ 2023-08-16 21:45 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton; +Cc: Jens Axboe, io-uring, linux-mm

On 8/16/23 8:11 AM, Matthew Wilcox (Oracle) wrote:
> We can use a bit in page[1].flags to indicate that this folio belongs
> to hugetlb instead of using a value in page[1].dtors.  That lets
> folio_test_hugetlb() become an inline function like it should be.
> We can also get rid of NULL_COMPOUND_DTOR.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
>   .../admin-guide/kdump/vmcoreinfo.rst          | 10 +---
>   include/linux/mm.h                            |  4 --
>   include/linux/page-flags.h                    | 43 ++++++++++++----
>   kernel/crash_core.c                           |  2 +-
>   mm/hugetlb.c                                  | 49 +++----------------
>   mm/page_alloc.c                               |  2 +-
>   6 files changed, 43 insertions(+), 67 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index c18d94fa6470..baa1c355741d 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -325,8 +325,8 @@ NR_FREE_PAGES
>   On linux-2.6.21 or later, the number of free pages is in
>   vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
>   
> -PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
> -------------------------------------------------------------------------------
> +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
> +-----------------------------------------------------------------------------------------
>   
>   Page attributes. These flags are used to filter various unnecessary for
>   dumping pages.
> @@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
>   dumping pages.
>   
>   
> -HUGETLB_PAGE_DTOR
> ------------------
> -
> -The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
> -excludes these pages.
> -
>   x86_64
>   ======
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b800d1298dc..642f5fe5860e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1268,11 +1268,7 @@ void folio_copy(struct folio *dst, struct folio *src);
>   unsigned long nr_free_buffer_pages(void);
>   
>   enum compound_dtor_id {
> -	NULL_COMPOUND_DTOR,
>   	COMPOUND_PAGE_DTOR,
> -#ifdef CONFIG_HUGETLB_PAGE
> -	HUGETLB_PAGE_DTOR,
> -#endif
>   	TRANSHUGE_PAGE_DTOR,
>   	NR_COMPOUND_DTORS,
>   };
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 92a2063a0a23..aeecf0cf1456 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -171,15 +171,6 @@ enum pageflags {
>   	/* Remapped by swiotlb-xen. */
>   	PG_xen_remapped = PG_owner_priv_1,
>   
> -#ifdef CONFIG_MEMORY_FAILURE
> -	/*
> -	 * Compound pages. Stored in first tail page's flags.
> -	 * Indicates that at least one subpage is hwpoisoned in the
> -	 * THP.
> -	 */
> -	PG_has_hwpoisoned = PG_error,
> -#endif
> -
>   	/* non-lru isolated movable page */
>   	PG_isolated = PG_reclaim,
>   
> @@ -190,6 +181,15 @@ enum pageflags {
>   	/* For self-hosted memmap pages */
>   	PG_vmemmap_self_hosted = PG_owner_priv_1,
>   #endif
> +
> +	/*
> +	 * Flags only valid for compound pages.  Stored in first tail page's
> +	 * flags word.
> +	 */
> +
> +	/* At least one page in this folio has the hwpoison flag set */
> +	PG_has_hwpoisoned = PG_error,
> +	PG_hugetlb = PG_active,
>   };
>   
>   #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
> @@ -812,7 +812,23 @@ static inline void ClearPageCompound(struct page *page)
>   
>   #ifdef CONFIG_HUGETLB_PAGE
>   int PageHuge(struct page *page);
> -bool folio_test_hugetlb(struct folio *folio);
> +SETPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
> +CLEARPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
> +
> +/**
> + * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
> + * @folio: The folio to test.
> + *
> + * Context: Any context.  Caller should have a reference on the folio to
> + * prevent it from being turned into a tail page.
> + * Return: True for hugetlbfs folios, false for anon folios or folios
> + * belonging to other filesystems.
> + */
> +static inline bool folio_test_hugetlb(struct folio *folio)
> +{
> +	return folio_test_large(folio) &&
> +		test_bit(PG_hugetlb, folio_flags(folio, 1));
> +}
>   #else
>   TESTPAGEFLAG_FALSE(Huge, hugetlb)
>   #endif
> @@ -1040,6 +1056,13 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
>   #define PAGE_FLAGS_CHECK_AT_PREP	\
>   	((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
>   
> +/*
> + * Flags stored in the second page of a compound page.  They may overlap
> + * the CHECK_AT_FREE flags above, so need to be cleared.
> + */
> +#define PAGE_FLAGS_SECOND						\
> +	(1UL << PG_has_hwpoisoned	| 1UL << PG_hugetlb)
> +
>   #define PAGE_FLAGS_PRIVATE				\
>   	(1UL << PG_private | 1UL << PG_private_2)
>   /**
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 90ce1dfd591c..dd5f87047d06 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -490,7 +490,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   #define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
>   	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>   #ifdef CONFIG_HUGETLB_PAGE
> -	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> +	VMCOREINFO_NUMBER(PG_hugetlb);
>   #define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
>   	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>   #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 086eb51bf845..389490f100b0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1585,25 +1585,7 @@ static inline void __clear_hugetlb_destructor(struct hstate *h,
>   {
>   	lockdep_assert_held(&hugetlb_lock);
>   
> -	/*
> -	 * Very subtle
> -	 *
> -	 * For non-gigantic pages set the destructor to the normal compound
> -	 * page dtor.  This is needed in case someone takes an additional
> -	 * temporary ref to the page, and freeing is delayed until they drop
> -	 * their reference.
> -	 *
> -	 * For gigantic pages set the destructor to the null dtor.  This
> -	 * destructor will never be called.  Before freeing the gigantic
> -	 * page destroy_compound_gigantic_folio will turn the folio into a
> -	 * simple group of pages.  After this the destructor does not
> -	 * apply.
> -	 *
> -	 */
> -	if (hstate_is_gigantic(h))
> -		folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> -	else
> -		folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
> +	folio_clear_hugetlb(folio);
>   }
>   
>   /*
> @@ -1690,7 +1672,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>   		h->surplus_huge_pages_node[nid]++;
>   	}
>   
> -	folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR);
> +	folio_set_hugetlb(folio);
>   	folio_change_private(folio, NULL);
>   	/*
>   	 * We have to set hugetlb_vmemmap_optimized again as above
> @@ -1814,9 +1796,8 @@ static void free_hpage_workfn(struct work_struct *work)
>   		/*
>   		 * The VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio) in
>   		 * folio_hstate() is going to trigger because a previous call to
> -		 * remove_hugetlb_folio() will call folio_set_compound_dtor
> -		 * (folio, NULL_COMPOUND_DTOR), so do not use folio_hstate()
> -		 * directly.
> +		 * remove_hugetlb_folio() will clear the hugetlb bit, so do
> +		 * not use folio_hstate() directly.
>   		 */
>   		h = size_to_hstate(page_size(page));
>   
> @@ -1955,7 +1936,7 @@ static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
>   {
>   	hugetlb_vmemmap_optimize(h, &folio->page);
>   	INIT_LIST_HEAD(&folio->lru);
> -	folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR);

Hi Matthew,

In __prep_compound_gigantic_folio() there is still the comment:
/* we rely on prep_new_hugetlb_folio to set the destructor */
Should that be modified to reference the hugetlb flag?

Thanks,
Sid

> +	folio_set_hugetlb(folio);
>   	hugetlb_set_folio_subpool(folio, NULL);
>   	set_hugetlb_cgroup(folio, NULL);
>   	set_hugetlb_cgroup_rsvd(folio, NULL);
> @@ -2070,28 +2051,10 @@ int PageHuge(struct page *page)
>   	if (!PageCompound(page))
>   		return 0;
>   	folio = page_folio(page);
> -	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
> +	return folio_test_hugetlb(folio);
>   }
>   EXPORT_SYMBOL_GPL(PageHuge);
>   
> -/**
> - * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
> - * @folio: The folio to test.
> - *
> - * Context: Any context.  Caller should have a reference on the folio to
> - * prevent it from being turned into a tail page.
> - * Return: True for hugetlbfs folios, false for anon folios or folios
> - * belonging to other filesystems.
> - */
> -bool folio_test_hugetlb(struct folio *folio)
> -{
> -	if (!folio_test_large(folio))
> -		return false;
> -
> -	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
> -}
> -EXPORT_SYMBOL_GPL(folio_test_hugetlb);
> -
>   /*
>    * Find and lock address space (mapping) in write mode.
>    *
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9638fdddf065..f8e276de4fd5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1122,7 +1122,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>   		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
>   
>   		if (compound)
> -			ClearPageHasHWPoisoned(page);
> +			page[1].flags &= ~PAGE_FLAGS_SECOND;
>   		for (i = 1; i < (1 << order); i++) {
>   			if (compound)
>   				bad += free_tail_page_prepare(page, page + i);


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio()
  2023-08-16 15:11 ` [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio() Matthew Wilcox (Oracle)
  2023-08-16 20:14   ` Sidhartha Kumar
@ 2023-08-17  3:31   ` Yanteng Si
  1 sibling, 0 replies; 26+ messages in thread
From: Yanteng Si @ 2023-08-17  3:31 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton; +Cc: Jens Axboe, io-uring, linux-mm


On 2023/8/16 23:11, Matthew Wilcox (Oracle) wrote:
> Pass a folio instead of the head page to save a few instructions.
> Update the documentation, at least in English.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Yanteng Si <[email protected]>
> ---
>   Documentation/mm/hugetlbfs_reserv.rst         | 14 +++---
>   .../zh_CN/mm/hugetlbfs_reserv.rst             |  4 +-
>   include/linux/hugetlb.h                       |  2 +-
>   mm/hugetlb.c                                  | 48 +++++++++----------
>   mm/page_alloc.c                               |  2 +-
>   5 files changed, 34 insertions(+), 36 deletions(-)
>
> diff --git a/Documentation/mm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst
> index d9c2b0f01dcd..4914fbf07966 100644
> --- a/Documentation/mm/hugetlbfs_reserv.rst
> +++ b/Documentation/mm/hugetlbfs_reserv.rst
> @@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages).
>   Freeing Huge Pages
>   ==================
>   
> -Huge page freeing is performed by the routine free_huge_page().  This routine
> -is the destructor for hugetlbfs compound pages.  As a result, it is only
> -passed a pointer to the page struct.  When a huge page is freed, reservation
> -accounting may need to be performed.  This would be the case if the page was
> -associated with a subpool that contained reserves, or the page is being freed
> -on an error path where a global reserve count must be restored.
> +Huge pages are freed by free_huge_folio().  It is only passed a pointer
> +to the folio as it is called from the generic MM code.  When a huge page
> +is freed, reservation accounting may need to be performed.  This would
> +be the case if the page was associated with a subpool that contained
> +reserves, or the page is being freed on an error path where a global
> +reserve count must be restored.
>   
>   The page->private field points to any subpool associated with the page.
>   If the PagePrivate flag is set, it indicates the global reserve count should
> @@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge
>   page is allocated but before it is instantiated.  In this case, the page
>   allocation has consumed the reservation and made the appropriate subpool,
>   reservation map and global count adjustments.  If the page is freed at this
> -time (before instantiation and clearing of PagePrivate), then free_huge_page
> +time (before instantiation and clearing of PagePrivate), then free_huge_folio
>   will increment the global reservation count.  However, the reservation map
>   indicates the reservation was consumed.  This resulting inconsistent state
>   will cause the 'leak' of a reserved huge page.  The global reserve count will
> diff --git a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
> index b7a0544224ad..0f7e7fb5ca8c 100644
> --- a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
> +++ b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
> @@ -219,7 +219,7 @@ vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_
>   释放巨页
>   ========
>   
> -巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
> +巨页释放是由函数free_huge_folio()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
>   递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保
>   留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。
@@ -219,9 +219,10 @@ vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_
  释放巨页
  ========

-巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
-递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保
-留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。
+巨页由free_huge_folio()释放。从公共MM代码中调用free_huge_folio()时,只会传递一个folio
+(物理连续、虚拟连续的2^n次的PAGE_SIZE的一些bytes的集合,当然这个n也是允许是0的)指针。
+当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保留的子池相关联,或者该页在错
+误路径上被释放,必须恢复全局预留计数,就会出现这种情况。

  page->private字段指向与该页相关的任何子池。如果PagePrivate标志被设置,它表明全局预留计数

  应该被调整(关于如何设置这些标志的信息,请参见



Thanks,

Yanteng

>   
> @@ -387,7 +387,7 @@ region_count()在解除私有巨页映射时被调用。在私有映射中,预
>   
>   然而,有几种情况是,在一个巨页被分配后,但在它被实例化之前,就遇到了错误。在这种情况下,
>   页面分配已经消耗了预留,并进行了适当的子池、预留映射和全局计数调整。如果页面在这个时候被释放
> -(在实例化和清除PagePrivate之前),那么free_huge_page将增加全局预留计数。然而,预留映射
> +(在实例化和清除PagePrivate之前),那么free_huge_folio将增加全局预留计数。然而,预留映射
>   显示报留被消耗了。这种不一致的状态将导致预留的巨页的 “泄漏” 。全局预留计数将比它原本的要高,
>   并阻止分配一个预先分配的页面。
>   
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 5a1dfaffbd80..5b2626063f4f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,7 +26,7 @@ typedef struct { unsigned long pd; } hugepd_t;
>   #define __hugepd(x) ((hugepd_t) { (x) })
>   #endif
>   
> -void free_huge_page(struct page *page);
> +void free_huge_folio(struct folio *folio);
>   
>   #ifdef CONFIG_HUGETLB_PAGE
>   
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e327a5a7602c..086eb51bf845 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1706,10 +1706,10 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>   	zeroed = folio_put_testzero(folio);
>   	if (unlikely(!zeroed))
>   		/*
> -		 * It is VERY unlikely soneone else has taken a ref on
> -		 * the page.  In this case, we simply return as the
> -		 * hugetlb destructor (free_huge_page) will be called
> -		 * when this other ref is dropped.
> +		 * It is VERY unlikely soneone else has taken a ref
> +		 * on the folio.  In this case, we simply return as
> +		 * free_huge_folio() will be called when this other ref
> +		 * is dropped.
>   		 */
>   		return;
>   
> @@ -1875,13 +1875,12 @@ struct hstate *size_to_hstate(unsigned long size)
>   	return NULL;
>   }
>   
> -void free_huge_page(struct page *page)
> +void free_huge_folio(struct folio *folio)
>   {
>   	/*
>   	 * Can't pass hstate in here because it is called from the
>   	 * compound page destructor.
>   	 */
> -	struct folio *folio = page_folio(page);
>   	struct hstate *h = folio_hstate(folio);
>   	int nid = folio_nid(folio);
>   	struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
> @@ -1936,7 +1935,7 @@ void free_huge_page(struct page *page)
>   		spin_unlock_irqrestore(&hugetlb_lock, flags);
>   		update_and_free_hugetlb_folio(h, folio, true);
>   	} else {
> -		arch_clear_hugepage_flags(page);
> +		arch_clear_hugepage_flags(&folio->page);
>   		enqueue_hugetlb_folio(h, folio);
>   		spin_unlock_irqrestore(&hugetlb_lock, flags);
>   	}
> @@ -2246,7 +2245,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>   		folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
>   					nodes_allowed, node_alloc_noretry);
>   		if (folio) {
> -			free_huge_page(&folio->page); /* free it into the hugepage allocator */
> +			free_huge_folio(folio); /* free it into the hugepage allocator */
>   			return 1;
>   		}
>   	}
> @@ -2429,13 +2428,13 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>   	 * We could have raced with the pool size change.
>   	 * Double check that and simply deallocate the new page
>   	 * if we would end up overcommiting the surpluses. Abuse
> -	 * temporary page to workaround the nasty free_huge_page
> +	 * temporary page to workaround the nasty free_huge_folio
>   	 * codeflow
>   	 */
>   	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>   		folio_set_hugetlb_temporary(folio);
>   		spin_unlock_irq(&hugetlb_lock);
> -		free_huge_page(&folio->page);
> +		free_huge_folio(folio);
>   		return NULL;
>   	}
>   
> @@ -2547,8 +2546,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   	__must_hold(&hugetlb_lock)
>   {
>   	LIST_HEAD(surplus_list);
> -	struct folio *folio;
> -	struct page *page, *tmp;
> +	struct folio *folio, *tmp;
>   	int ret;
>   	long i;
>   	long needed, allocated;
> @@ -2608,21 +2606,21 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>   	ret = 0;
>   
>   	/* Free the needed pages to the hugetlb pool */
> -	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> +	list_for_each_entry_safe(folio, tmp, &surplus_list, lru) {
>   		if ((--needed) < 0)
>   			break;
>   		/* Add the page to the hugetlb allocator */
> -		enqueue_hugetlb_folio(h, page_folio(page));
> +		enqueue_hugetlb_folio(h, folio);
>   	}
>   free:
>   	spin_unlock_irq(&hugetlb_lock);
>   
>   	/*
>   	 * Free unnecessary surplus pages to the buddy allocator.
> -	 * Pages have no ref count, call free_huge_page directly.
> +	 * Pages have no ref count, call free_huge_folio directly.
>   	 */
> -	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
> -		free_huge_page(page);
> +	list_for_each_entry_safe(folio, tmp, &surplus_list, lru)
> +		free_huge_folio(folio);
>   	spin_lock_irq(&hugetlb_lock);
>   
>   	return ret;
> @@ -2836,11 +2834,11 @@ static long vma_del_reservation(struct hstate *h,
>    * 2) No reservation was in place for the page, so hugetlb_restore_reserve is
>    *    not set.  However, alloc_hugetlb_folio always updates the reserve map.
>    *
> - * In case 1, free_huge_page later in the error path will increment the
> - * global reserve count.  But, free_huge_page does not have enough context
> + * In case 1, free_huge_folio later in the error path will increment the
> + * global reserve count.  But, free_huge_folio does not have enough context
>    * to adjust the reservation map.  This case deals primarily with private
>    * mappings.  Adjust the reserve map here to be consistent with global
> - * reserve count adjustments to be made by free_huge_page.  Make sure the
> + * reserve count adjustments to be made by free_huge_folio.  Make sure the
>    * reserve map indicates there is a reservation present.
>    *
>    * In case 2, simply undo reserve map modifications done by alloc_hugetlb_folio.
> @@ -2856,7 +2854,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>   			 * Rare out of memory condition in reserve map
>   			 * manipulation.  Clear hugetlb_restore_reserve so
>   			 * that global reserve count will not be incremented
> -			 * by free_huge_page.  This will make it appear
> +			 * by free_huge_folio.  This will make it appear
>   			 * as though the reservation for this folio was
>   			 * consumed.  This may prevent the task from
>   			 * faulting in the folio at a later time.  This
> @@ -3232,7 +3230,7 @@ static void __init gather_bootmem_prealloc(void)
>   		if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
>   			WARN_ON(folio_test_reserved(folio));
>   			prep_new_hugetlb_folio(h, folio, folio_nid(folio));
> -			free_huge_page(page); /* add to the hugepage allocator */
> +			free_huge_folio(folio); /* add to the hugepage allocator */
>   		} else {
>   			/* VERY unlikely inflated ref count on a tail page */
>   			free_gigantic_folio(folio, huge_page_order(h));
> @@ -3264,7 +3262,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>   					&node_states[N_MEMORY], NULL);
>   			if (!folio)
>   				break;
> -			free_huge_page(&folio->page); /* free it into the hugepage allocator */
> +			free_huge_folio(folio); /* free it into the hugepage allocator */
>   		}
>   		cond_resched();
>   	}
> @@ -3542,7 +3540,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	while (count > persistent_huge_pages(h)) {
>   		/*
>   		 * If this allocation races such that we no longer need the
> -		 * page, free_huge_page will handle it by freeing the page
> +		 * page, free_huge_folio will handle it by freeing the page
>   		 * and reducing the surplus.
>   		 */
>   		spin_unlock_irq(&hugetlb_lock);
> @@ -3658,7 +3656,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
>   			prep_compound_page(subpage, target_hstate->order);
>   		folio_change_private(inner_folio, NULL);
>   		prep_new_hugetlb_folio(target_hstate, inner_folio, nid);
> -		free_huge_page(subpage);
> +		free_huge_folio(inner_folio);
>   	}
>   	mutex_unlock(&target_hstate->resize_lock);
>   
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 548c8016190b..b569fd5562aa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -620,7 +620,7 @@ void destroy_large_folio(struct folio *folio)
>   	enum compound_dtor_id dtor = folio->_folio_dtor;
>   
>   	if (folio_test_hugetlb(folio)) {
> -		free_huge_page(&folio->page);
> +		free_huge_folio(folio);
>   		return;
>   	}
>   


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
  2023-08-16 21:45   ` Sidhartha Kumar
@ 2023-08-22  3:13   ` Mike Kravetz
  2023-08-22  3:32     ` Matthew Wilcox
  2023-12-08 17:54   ` Vlastimil Babka
  2 siblings, 1 reply; 26+ messages in thread
From: Mike Kravetz @ 2023-08-22  3:13 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: Andrew Morton, Jens Axboe, io-uring, linux-mm

On 08/16/23 16:11, Matthew Wilcox (Oracle) wrote:
> We can use a bit in page[1].flags to indicate that this folio belongs
> to hugetlb instead of using a value in page[1].dtors.  That lets
> folio_test_hugetlb() become an inline function like it should be.
> We can also get rid of NULL_COMPOUND_DTOR.

Not 100% sure yet, but I suspect this patch/series causes the following
BUG in today's linux-next.  I can do a deeper dive tomorrow.

# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

[  352.631099] page:ffffea0007a30000 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1e8c00
[  352.633674] head:ffffea0007a30000 order:8 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  352.636021] flags: 0x200000000000040(head|node=0|zone=2)
[  352.637424] page_type: 0xffffffff()
[  352.638619] raw: 0200000000000040 ffffc90003bb3d98 ffffc90003bb3d98 0000000000000000
[  352.640689] raw: 0000000000000001 0000000000000008 00000000ffffffff 0000000000000000
[  352.642882] page dumped because: VM_BUG_ON_PAGE(compound && compound_order(page) != order)
[  352.644671] ------------[ cut here ]------------
[  352.645746] kernel BUG at mm/page_alloc.c:1101!
[  352.647284] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  352.649286] CPU: 2 PID: 894 Comm: bash Not tainted 6.5.0-rc7-next-20230821-dirty #178
[  352.651245] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  352.653349] RIP: 0010:free_unref_page_prepare+0x3c4/0x3e0
[  352.654731] Code: ff 0f 00 00 75 d4 48 8b 03 a8 40 74 cd 48 8b 43 48 a8 01 74 c5 48 8d 78 ff eb bf 48 c7 c6 10 90 22 82 48 89 df e8 fc e1 fc ff <0f> 0b 48 c7 c6 20 2e 22 82 e8 ee e1 fc ff 0f 0b 66 66 2e 0f 1f 84
[  352.660080] RSP: 0018:ffffc90003bb3d08 EFLAGS: 00010246
[  352.661457] RAX: 000000000000004e RBX: ffffea0007a30000 RCX: 0000000000000000
[  352.663119] RDX: 0000000000000000 RSI: ffffffff8224cc56 RDI: 00000000ffffffff
[  352.664697] RBP: 00000000001e8c00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  352.666191] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: 0000000000000009
[  352.667612] R13: 00000000001e8c00 R14: 0000000000000000 R15: 0000000000000000
[  352.669033] FS:  00007f18fc3a0740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[  352.670654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.671678] CR2: 000055b1b1f16018 CR3: 0000000302e5a002 CR4: 0000000000370ee0
[  352.672936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  352.674215] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  352.675399] Call Trace:
[  352.675907]  <TASK>
[  352.676400]  ? die+0x32/0x80
[  352.676974]  ? do_trap+0xd6/0x100
[  352.677691]  ? free_unref_page_prepare+0x3c4/0x3e0
[  352.678591]  ? do_error_trap+0x6a/0x90
[  352.679343]  ? free_unref_page_prepare+0x3c4/0x3e0
[  352.680245]  ? exc_invalid_op+0x49/0x60
[  352.680961]  ? free_unref_page_prepare+0x3c4/0x3e0
[  352.681851]  ? asm_exc_invalid_op+0x16/0x20
[  352.682607]  ? free_unref_page_prepare+0x3c4/0x3e0
[  352.683518]  ? free_unref_page_prepare+0x3c4/0x3e0
[  352.684387]  free_unref_page+0x34/0x160
[  352.685176]  ? _raw_spin_lock_irq+0x19/0x40
[  352.686641]  set_max_huge_pages+0x281/0x370
[  352.687621]  nr_hugepages_store_common+0x91/0xf0
[  352.688634]  kernfs_fop_write_iter+0x108/0x1f0
[  352.689619]  vfs_write+0x207/0x400
[  352.690423]  ksys_write+0x63/0xe0
[  352.691198]  do_syscall_64+0x37/0x90
[  352.691938]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  352.692847] RIP: 0033:0x7f18fc494e87
[  352.693481] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  352.696304] RSP: 002b:00007ffff7597318 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  352.697539] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f18fc494e87
[  352.698790] RDX: 0000000000000002 RSI: 000055b1b1ed3620 RDI: 0000000000000001
[  352.700238] RBP: 000055b1b1ed3620 R08: 000000000000000a R09: 00007f18fc52c0c0
[  352.701662] R10: 00007f18fc52bfc0 R11: 0000000000000246 R12: 0000000000000002
[  352.703112] R13: 00007f18fc568520 R14: 0000000000000002 R15: 00007f18fc568720
[  352.704547]  </TASK>
[  352.705101] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec 9p snd_hwdep joydev snd_hda_core netfs snd_seq snd_seq_device snd_pcm snd_timer 9pnet_virtio snd soundcore virtio_balloon 9pnet virtio_net net_failover virtio_console virtio_blk failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  352.713435] ---[ end trace 0000000000000000 ]---
[  352.714725] RIP: 0010:free_unref_page_prepare+0x3c4/0x3e0
[  352.717293] Code: ff 0f 00 00 75 d4 48 8b 03 a8 40 74 cd 48 8b 43 48 a8 01 74 c5 48 8d 78 ff eb bf 48 c7 c6 10 90 22 82 48 89 df e8 fc e1 fc ff <0f> 0b 48 c7 c6 20 2e 22 82 e8 ee e1 fc ff 0f 0b 66 66 2e 0f 1f 84
[  352.721235] RSP: 0018:ffffc90003bb3d08 EFLAGS: 00010246
[  352.722388] RAX: 000000000000004e RBX: ffffea0007a30000 RCX: 0000000000000000
[  352.723909] RDX: 0000000000000000 RSI: ffffffff8224cc56 RDI: 00000000ffffffff
[  352.725408] RBP: 00000000001e8c00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  352.726875] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: 0000000000000009
[  352.728352] R13: 00000000001e8c00 R14: 0000000000000000 R15: 0000000000000000
[  352.729821] FS:  00007f18fc3a0740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[  352.731511] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.732723] CR2: 000055b1b1f16018 CR3: 0000000302e5a002 CR4: 0000000000370ee0
[  352.734199] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  352.735681] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-08-22  3:13   ` Mike Kravetz
@ 2023-08-22  3:32     ` Matthew Wilcox
  2023-08-22 17:19       ` Mike Kravetz
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-08-22  3:32 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Andrew Morton, Jens Axboe, io-uring, linux-mm

On Mon, Aug 21, 2023 at 08:13:00PM -0700, Mike Kravetz wrote:
> On 08/16/23 16:11, Matthew Wilcox (Oracle) wrote:
> > We can use a bit in page[1].flags to indicate that this folio belongs
> > to hugetlb instead of using a value in page[1].dtors.  That lets
> > folio_test_hugetlb() become an inline function like it should be.
> > We can also get rid of NULL_COMPOUND_DTOR.
> 
> Not 100% sure yet, but I suspect this patch/series causes the following
> BUG in today's linux-next.  I can do a deeper dive tomorrow.
> 
> # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> # echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 
> [  352.631099] page:ffffea0007a30000 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1e8c00
> [  352.633674] head:ffffea0007a30000 order:8 entire_mapcount:0 nr_pages_mapped:0 pincount:0

order 8?  Well, that's exciting.  This is clearly x86, so it should be
order 9.  Did we mistakenly clear bit 0 of tail->flags?

Oh.  Oh yes, we do.

__update_and_free_hugetlb_folio:

        for (i = 0; i < pages_per_huge_page(h); i++) {
                subpage = folio_page(folio, i);
                subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
                                1 << PG_referenced | 1 << PG_dirty |
                                1 << PG_active | 1 << PG_private |
                                1 << PG_writeback);
        }

locked		PF_NO_TAIL
error		PF_NO_TAIL
referenced	PF_HEAD
dirty		PF_HEAD
active		PF_HEAD
private		PF_ANY
writeback	PF_NO_TAIL

So it seems to me there's no way to set any of these bits other than
PG_private.  And, well, you control PG_private in hugetlbfs.  Do you
ever set it on tail pages?

I think you can remove this entire loop and be happy?
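
For the record, the arithmetic of the corruption (assuming PG_locked
is bit 0 of page->flags, with the order now in the low byte of the
first tail page's flags):

	order 9 = 0b0000_1001		/* 2MB folio with 4KB pages */
	9 & ~(1 << PG_locked) = 0b0000_1000 = order 8

which is exactly the bogus order:8 in the dump above.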


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-08-22  3:32     ` Matthew Wilcox
@ 2023-08-22 17:19       ` Mike Kravetz
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2023-08-22 17:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, Jens Axboe, io-uring, linux-mm

On 08/22/23 04:32, Matthew Wilcox wrote:
> On Mon, Aug 21, 2023 at 08:13:00PM -0700, Mike Kravetz wrote:
> > On 08/16/23 16:11, Matthew Wilcox (Oracle) wrote:
> > > We can use a bit in page[1].flags to indicate that this folio belongs
> > > to hugetlb instead of using a value in page[1].dtors.  That lets
> > > folio_test_hugetlb() become an inline function like it should be.
> > > We can also get rid of NULL_COMPOUND_DTOR.
> > 
> > Not 100% sure yet, but I suspect this patch/series causes the following
> > BUG in today's linux-next.  I can do a deeper dive tomorrow.
> > 
> > # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > # echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > 
> > [  352.631099] page:ffffea0007a30000 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1e8c00
> > [  352.633674] head:ffffea0007a30000 order:8 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> 
> order 8?  Well, that's exciting.  This is clearly x86, so it should be
> order 9.  Did we mistakenly clear bit 0 of tail->flags?
> 
> Oh.  Oh yes, we do.
> 
> __update_and_free_hugetlb_folio:
> 
>         for (i = 0; i < pages_per_huge_page(h); i++) {
>                 subpage = folio_page(folio, i);
>                 subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
>                                 1 << PG_referenced | 1 << PG_dirty |
>                                 1 << PG_active | 1 << PG_private |
>                                 1 << PG_writeback);
>         }
> 
> locked		PF_NO_TAIL
> error		PF_NO_TAIL
> referenced	PF_HEAD
> dirty		PF_HEAD
> active		PF_HEAD
> private		PF_ANY
> writeback	PF_NO_TAIL
> 
> So it seems to me there's no way to set any of these bits other than
> PG_private.  And, well, you control PG_private in hugetlbfs.  Do you
> ever set it on tail pages?

Nope

> 
> I think you can remove this entire loop and be happy?

I believe you are correct, although the loop clears flags on the head
page as well as the tail pages.  So I think we still need to clear
referenced, dirty and active, as well as private, on the head page.
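
Something along these lines, untested and only lightly thought through,
might be enough for the head page (flag names as in the current loop):

	/*
	 * Hypothetical sketch: only the head page can have these bits
	 * set, so clear them on the folio (i.e. the head page) instead
	 * of looping over every subpage.
	 */
	folio->flags &= ~(1UL << PG_referenced | 1UL << PG_dirty |
			  1UL << PG_active | 1UL << PG_private);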

Will take a look today.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 10/13] mm: Free up a word in the first tail page
  2023-08-16 15:11 ` [PATCH v2 10/13] mm: Free up a word in the first tail page Matthew Wilcox (Oracle)
@ 2023-08-22 23:17   ` Mike Kravetz
  2023-08-23  0:29     ` Matthew Wilcox
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Kravetz @ 2023-08-22 23:17 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: Andrew Morton, Jens Axboe, io-uring, linux-mm

On 08/16/23 16:11, Matthew Wilcox (Oracle) wrote:
> Store the folio order in the low byte of the flags word in the first
> tail page.  This frees up the word that was being used to store the
> order and dtor bytes previously.

hugetlb manually creates and destroys compound pages.  As such it makes
assumptions about struct page layout.  This breaks hugetlb.  The following
will fix the breakage.

The hugetlb code is quite fragile when changes like this are made.  I am
open to suggestions on how we can make this more robust.  Perhaps start
with a simple set of APIs to create_folio from a set of contiguous pages
and destroy a folio?
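
For illustration, a sketch with made-up names:

	/* Hypothetical: build a folio from physically contiguous pages. */
	struct folio *folio_create(struct page *page, unsigned int order);

	/* Hypothetical: tear a folio back down into individual pages. */
	void folio_destroy(struct folio *folio);
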
-- 
Mike Kravetz

From 8d8aa4486a4119f6d694b423b2f68161b4e7432c Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Tue, 22 Aug 2023 15:30:43 -0700
Subject: [PATCH] hugetlb: clear flags in tail pages that will be freed
 individually

Signed-off-by: Mike Kravetz <[email protected]>
---
 mm/hugetlb.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a82c3104337e..cbc25826c9b0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1484,6 +1484,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
 
 	for (i = 1; i < nr_pages; i++) {
 		p = folio_page(folio, i);
+		p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE;
 		p->mapping = NULL;
 		clear_compound_head(p);
 		if (!demote)
@@ -1702,8 +1703,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 static void __update_and_free_hugetlb_folio(struct hstate *h,
 						struct folio *folio)
 {
-	int i;
-	struct page *subpage;
 	bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
@@ -1745,14 +1744,6 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 		spin_unlock_irq(&hugetlb_lock);
 	}
 
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		subpage = folio_page(folio, i);
-		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
-				1 << PG_referenced | 1 << PG_dirty |
-				1 << PG_active | 1 << PG_private |
-				1 << PG_writeback);
-	}
-
 	/*
 	 * Non-gigantic pages demoted from CMA allocated gigantic pages
 	 * need to be given back to CMA in free_gigantic_folio.
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 10/13] mm: Free up a word in the first tail page
  2023-08-22 23:17   ` Mike Kravetz
@ 2023-08-23  0:29     ` Matthew Wilcox
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox @ 2023-08-23  0:29 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Andrew Morton, Jens Axboe, io-uring, linux-mm

On Tue, Aug 22, 2023 at 04:17:41PM -0700, Mike Kravetz wrote:
> On 08/16/23 16:11, Matthew Wilcox (Oracle) wrote:
> > Store the folio order in the low byte of the flags word in the first
> > tail page.  This frees up the word that was being used to store the
> > order and dtor bytes previously.
> 
> hugetlb manually creates and destroys compound pages.  As such it makes
> assumptions about struct page layout.  This breaks hugetlb.  The following
> will fix the breakage.
> 
> The hugetlb code is quite fragile when changes like this are made.  I am
> open to suggestions on how we can make this more robust.  Perhaps start
> with a simple set of APIs to create_folio from a set of contiguous pages
> and destroy a folio?

Thanks; those changes look good to me.

I don't have any _immediate_ plans to make your life easier.  When
we get to a memdesc world, it'll all get much easier.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
  2023-08-16 21:45   ` Sidhartha Kumar
  2023-08-22  3:13   ` Mike Kravetz
@ 2023-12-08 17:54   ` Vlastimil Babka
  2023-12-08 18:31     ` Matthew Wilcox
  2 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2023-12-08 17:54 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton, Mike Kravetz,
	Luis Chamberlain, Oscar Salvador
  Cc: Jens Axboe, io-uring, linux-mm

On 8/16/23 17:11, Matthew Wilcox (Oracle) wrote:
> We can use a bit in page[1].flags to indicate that this folio belongs
> to hugetlb instead of using a value in page[1].dtors.  That lets
> folio_test_hugetlb() become an inline function like it should be.
> We can also get rid of NULL_COMPOUND_DTOR.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

I think this (as commit 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")) is
causing the bug reported by Luis here:
https://bugzilla.kernel.org/show_bug.cgi?id=218227

> page:000000009006bf10 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x3f8a0 pfn:0x1035c0
> flags: 0x17fffc000000000(node=0|zone=2|lastcpupid=0x1ffff)
> page_type: 0xffffff7f(buddy)
> raw: 017fffc000000000 ffffe704c422f808 ffffe704c41ac008 0000000000000000
> raw: 000000000003f8a0 0000000000000005 00000000ffffff7f 0000000000000000
> page dumped because: VM_BUG_ON_PAGE(n > 0 && !((__builtin_constant_p(PG_head) && __builtin_constant_p((uintptr_t)(&page->flags) != (uintptr_t)((vo>
> ------------[ cut here ]------------
> kernel BUG at include/linux/page-flags.h:314!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 6 PID: 2435641 Comm: md5sum Not tainted 6.6.0-rc5 #2
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:folio_flags+0x65/0x70
> Code: a8 40 74 de 48 8b 47 48 a8 01 74 d6 48 83 e8 01 48 39 c7 75 bd eb cb 48 8b 07 a8 40 75 c8 48 c7 c6 d8 a7 c3 89 e8 3b c7 fa ff <0f> 0b 66 0f >
> RSP: 0018:ffffad51c0bfb7a8 EFLAGS: 00010246
> RAX: 000000000000015f RBX: ffffe704c40d7000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff89be8040 RDI: 00000000ffffffff
> RBP: 0000000000103600 R08: 0000000000000000 R09: ffffad51c0bfb658
> R10: 0000000000000003 R11: ffffffff89eacb08 R12: 0000000000000035
> R13: ffffe704c40d7000 R14: 0000000000000000 R15: ffffad51c0bfb930
> FS:  00007f350c51b740(0000) GS:ffff9b62fbd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000555860919508 CR3: 00000001217fe002 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  ? die+0x32/0x80
>  ? do_trap+0xd6/0x100
>  ? folio_flags+0x65/0x70
>  ? do_error_trap+0x6a/0x90
>  ? folio_flags+0x65/0x70
>  ? exc_invalid_op+0x4c/0x60
>  ? folio_flags+0x65/0x70
>  ? asm_exc_invalid_op+0x16/0x20
>  ? folio_flags+0x65/0x70
>  ? folio_flags+0x65/0x70
>  PageHuge+0x67/0x80
>  isolate_migratepages_block+0x1c5/0x13b0
>  ? __pv_queued_spin_lock_slowpath+0x16c/0x370
>  compact_zone+0x746/0xfc0
>  compact_zone_order+0xbb/0x100
>  try_to_compact_pages+0xf0/0x2f0
>  __alloc_pages_direct_compact+0x78/0x210
>  __alloc_pages_slowpath.constprop.0+0xac1/0xdb0
>  ? prepare_alloc_pages.constprop.0+0xff/0x1b0
>  __alloc_pages+0x218/0x240
>  folio_alloc+0x17/0x50
>  page_cache_ra_order+0x15a/0x340
>  filemap_get_pages+0x136/0x6c0
>  ? update_load_avg+0x7e/0x780
>  ? current_time+0x2b/0xd0
>  filemap_read+0xce/0x340
>  ? do_sched_setscheduler+0x111/0x1b0
>  ? nohz_balance_exit_idle+0x16/0xc0
>  ? trigger_load_balance+0x302/0x370
>  ? preempt_count_add+0x47/0xa0
>  xfs_file_buffered_read+0x52/0xd0 [xfs]
>  xfs_file_read_iter+0x73/0xe0 [xfs]
>  vfs_read+0x1b1/0x300
>  ksys_read+0x63/0xe0
>  do_syscall_64+0x38/0x90
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> RIP: 0033:0x7f350c615a5d
> Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 60 0a 00 e8 99 08 02 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 3b 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 >
> RSP: 002b:00007ffca3ef5108 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 000055ff140712f0 RCX: 00007f350c615a5d
> RDX: 0000000000008000 RSI: 000055ff140714d0 RDI: 0000000000000003
> RBP: 00007f350c6ee600 R08: 0000000000000900 R09: 000000000b9bc05b
> R10: 000055ff140794d0 R11: 0000000000000246 R12: 000055ff140714d0
> R13: 0000000000008000 R14: 0000000000000a68 R15: 00007f350c6edd00
>  </TASK>
> Modules linked in: dm_zero dm_thin_pool dm_persistent_data dm_bio_prison sd_mod sg scsi_mod scsi_common dm_snapshot dm_bufio dm_flakey xfs sunrpc >
> ---[ end trace 0000000000000000 ]---

It's because PageHuge() now does folio_test_hugetlb() which is documented to
assume caller holds a reference, but the test in
isolate_migratepages_block() doesn't. The code is there from 2021 by Oscar,
perhaps it could be changed to take a reference (and e.g. only test
PageHead() before), but it will be a bit involved as the
isolate_or_dissolve_huge_page() it calls has some logic based on the
refcount being zero/non-zero as well. Oscar, what do you think?
Also I wonder if any of the other existing PageHuge() callers are also
affected because they might be doing so without a reference.
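
A rough sketch of that idea (hypothetical and untested, and it does not
cover the refcount-zero case isolate_or_dissolve_huge_page() relies on):

	/*
	 * Pin the page so PageHuge() sees a stable folio.  This cannot
	 * work for free hugetlb folios, which sit at refcount zero.
	 */
	if (PageHead(page) && get_page_unless_zero(page)) {
		bool is_huge = PageHuge(page);

		put_page(page);
		if (is_huge) {
			/* hugetlb isolation path goes here */
		}
	}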

(keeping the rest of patch for new Cc's)

> ---
>  .../admin-guide/kdump/vmcoreinfo.rst          | 10 +---
>  include/linux/mm.h                            |  4 --
>  include/linux/page-flags.h                    | 43 ++++++++++++----
>  kernel/crash_core.c                           |  2 +-
>  mm/hugetlb.c                                  | 49 +++----------------
>  mm/page_alloc.c                               |  2 +-
>  6 files changed, 43 insertions(+), 67 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index c18d94fa6470..baa1c355741d 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -325,8 +325,8 @@ NR_FREE_PAGES
>  On linux-2.6.21 or later, the number of free pages is in
>  vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
>  
> -PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
> -------------------------------------------------------------------------------
> +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
> +-----------------------------------------------------------------------------------------
>  
>  Page attributes. These flags are used to filter various unnecessary for
>  dumping pages.
> @@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
>  dumping pages.
>  
>  
> -HUGETLB_PAGE_DTOR
> ------------------
> -
> -The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
> -excludes these pages.
> -
>  x86_64
>  ======
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b800d1298dc..642f5fe5860e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1268,11 +1268,7 @@ void folio_copy(struct folio *dst, struct folio *src);
>  unsigned long nr_free_buffer_pages(void);
>  
>  enum compound_dtor_id {
> -	NULL_COMPOUND_DTOR,
>  	COMPOUND_PAGE_DTOR,
> -#ifdef CONFIG_HUGETLB_PAGE
> -	HUGETLB_PAGE_DTOR,
> -#endif
>  	TRANSHUGE_PAGE_DTOR,
>  	NR_COMPOUND_DTORS,
>  };
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 92a2063a0a23..aeecf0cf1456 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -171,15 +171,6 @@ enum pageflags {
>  	/* Remapped by swiotlb-xen. */
>  	PG_xen_remapped = PG_owner_priv_1,
>  
> -#ifdef CONFIG_MEMORY_FAILURE
> -	/*
> -	 * Compound pages. Stored in first tail page's flags.
> -	 * Indicates that at least one subpage is hwpoisoned in the
> -	 * THP.
> -	 */
> -	PG_has_hwpoisoned = PG_error,
> -#endif
> -
>  	/* non-lru isolated movable page */
>  	PG_isolated = PG_reclaim,
>  
> @@ -190,6 +181,15 @@ enum pageflags {
>  	/* For self-hosted memmap pages */
>  	PG_vmemmap_self_hosted = PG_owner_priv_1,
>  #endif
> +
> +	/*
> +	 * Flags only valid for compound pages.  Stored in first tail page's
> +	 * flags word.
> +	 */
> +
> +	/* At least one page in this folio has the hwpoison flag set */
> +	PG_has_hwpoisoned = PG_error,
> +	PG_hugetlb = PG_active,
>  };
>  
>  #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
> @@ -812,7 +812,23 @@ static inline void ClearPageCompound(struct page *page)
>  
>  #ifdef CONFIG_HUGETLB_PAGE
>  int PageHuge(struct page *page);
> -bool folio_test_hugetlb(struct folio *folio);
> +SETPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
> +CLEARPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
> +
> +/**
> + * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
> + * @folio: The folio to test.
> + *
> + * Context: Any context.  Caller should have a reference on the folio to
> + * prevent it from being turned into a tail page.
> + * Return: True for hugetlbfs folios, false for anon folios or folios
> + * belonging to other filesystems.
> + */
> +static inline bool folio_test_hugetlb(struct folio *folio)
> +{
> +	return folio_test_large(folio) &&
> +		test_bit(PG_hugetlb, folio_flags(folio, 1));
> +}
>  #else
>  TESTPAGEFLAG_FALSE(Huge, hugetlb)
>  #endif
> @@ -1040,6 +1056,13 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
>  #define PAGE_FLAGS_CHECK_AT_PREP	\
>  	((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
>  
> +/*
> + * Flags stored in the second page of a compound page.  They may overlap
> + * the CHECK_AT_FREE flags above, so need to be cleared.
> + */
> +#define PAGE_FLAGS_SECOND						\
> +	(1UL << PG_has_hwpoisoned	| 1UL << PG_hugetlb)
> +
>  #define PAGE_FLAGS_PRIVATE				\
>  	(1UL << PG_private | 1UL << PG_private_2)
>  /**
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 90ce1dfd591c..dd5f87047d06 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -490,7 +490,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>  #define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
>  	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>  #ifdef CONFIG_HUGETLB_PAGE
> -	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> +	VMCOREINFO_NUMBER(PG_hugetlb);
>  #define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
>  	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>  #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 086eb51bf845..389490f100b0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1585,25 +1585,7 @@ static inline void __clear_hugetlb_destructor(struct hstate *h,
>  {
>  	lockdep_assert_held(&hugetlb_lock);
>  
> -	/*
> -	 * Very subtle
> -	 *
> -	 * For non-gigantic pages set the destructor to the normal compound
> -	 * page dtor.  This is needed in case someone takes an additional
> -	 * temporary ref to the page, and freeing is delayed until they drop
> -	 * their reference.
> -	 *
> -	 * For gigantic pages set the destructor to the null dtor.  This
> -	 * destructor will never be called.  Before freeing the gigantic
> -	 * page destroy_compound_gigantic_folio will turn the folio into a
> -	 * simple group of pages.  After this the destructor does not
> -	 * apply.
> -	 *
> -	 */
> -	if (hstate_is_gigantic(h))
> -		folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> -	else
> -		folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
> +	folio_clear_hugetlb(folio);
>  }
>  
>  /*
> @@ -1690,7 +1672,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>  		h->surplus_huge_pages_node[nid]++;
>  	}
>  
> -	folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR);
> +	folio_set_hugetlb(folio);
>  	folio_change_private(folio, NULL);
>  	/*
>  	 * We have to set hugetlb_vmemmap_optimized again as above
> @@ -1814,9 +1796,8 @@ static void free_hpage_workfn(struct work_struct *work)
>  		/*
>  		 * The VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio) in
>  		 * folio_hstate() is going to trigger because a previous call to
> -		 * remove_hugetlb_folio() will call folio_set_compound_dtor
> -		 * (folio, NULL_COMPOUND_DTOR), so do not use folio_hstate()
> -		 * directly.
> +		 * remove_hugetlb_folio() will clear the hugetlb bit, so do
> +		 * not use folio_hstate() directly.
>  		 */
>  		h = size_to_hstate(page_size(page));
>  
> @@ -1955,7 +1936,7 @@ static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
>  {
>  	hugetlb_vmemmap_optimize(h, &folio->page);
>  	INIT_LIST_HEAD(&folio->lru);
> -	folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR);
> +	folio_set_hugetlb(folio);
>  	hugetlb_set_folio_subpool(folio, NULL);
>  	set_hugetlb_cgroup(folio, NULL);
>  	set_hugetlb_cgroup_rsvd(folio, NULL);
> @@ -2070,28 +2051,10 @@ int PageHuge(struct page *page)
>  	if (!PageCompound(page))
>  		return 0;
>  	folio = page_folio(page);
> -	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
> +	return folio_test_hugetlb(folio);
>  }
>  EXPORT_SYMBOL_GPL(PageHuge);
>  
> -/**
> - * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
> - * @folio: The folio to test.
> - *
> - * Context: Any context.  Caller should have a reference on the folio to
> - * prevent it from being turned into a tail page.
> - * Return: True for hugetlbfs folios, false for anon folios or folios
> - * belonging to other filesystems.
> - */
> -bool folio_test_hugetlb(struct folio *folio)
> -{
> -	if (!folio_test_large(folio))
> -		return false;
> -
> -	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
> -}
> -EXPORT_SYMBOL_GPL(folio_test_hugetlb);
> -
>  /*
>   * Find and lock address space (mapping) in write mode.
>   *
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9638fdddf065..f8e276de4fd5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1122,7 +1122,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
>  
>  		if (compound)
> -			ClearPageHasHWPoisoned(page);
> +			page[1].flags &= ~PAGE_FLAGS_SECOND;
>  		for (i = 1; i < (1 << order); i++) {
>  			if (compound)
>  				bad += free_tail_page_prepare(page, page + i);


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-12-08 17:54   ` Vlastimil Babka
@ 2023-12-08 18:31     ` Matthew Wilcox
  2023-12-13 12:23       ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-12-08 18:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mike Kravetz, Luis Chamberlain, Oscar Salvador,
	Jens Axboe, io-uring, linux-mm

On Fri, Dec 08, 2023 at 06:54:19PM +0100, Vlastimil Babka wrote:
> On 8/16/23 17:11, Matthew Wilcox (Oracle) wrote:
> > We can use a bit in page[1].flags to indicate that this folio belongs
> > to hugetlb instead of using a value in page[1].dtors.  That lets
> > folio_test_hugetlb() become an inline function like it should be.
> > We can also get rid of NULL_COMPOUND_DTOR.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> 
> I think this (as commit 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")) is
> causing the bug reported by Luis here:
> https://bugzilla.kernel.org/show_bug.cgi?id=218227

Luis, please stop using bugzilla.  If you'd sent email like a normal
kernel developer, I'd've seen this bug earlier.

> > page:000000009006bf10 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x3f8a0 pfn:0x1035c0
> > flags: 0x17fffc000000000(node=0|zone=2|lastcpupid=0x1ffff)
> > page_type: 0xffffff7f(buddy)
> > raw: 017fffc000000000 ffffe704c422f808 ffffe704c41ac008 0000000000000000
> > raw: 000000000003f8a0 0000000000000005 00000000ffffff7f 0000000000000000
> > page dumped because: VM_BUG_ON_PAGE(n > 0 && !((__builtin_constant_p(PG_head) && __builtin_constant_p((uintptr_t)(&page->flags) != (uintptr_t)((vo>
> > ------------[ cut here ]------------
> > kernel BUG at include/linux/page-flags.h:314!
> > invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 6 PID: 2435641 Comm: md5sum Not tainted 6.6.0-rc5 #2
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > RIP: 0010:folio_flags+0x65/0x70
> > Code: a8 40 74 de 48 8b 47 48 a8 01 74 d6 48 83 e8 01 48 39 c7 75 bd eb cb 48 8b 07 a8 40 75 c8 48 c7 c6 d8 a7 c3 89 e8 3b c7 fa ff <0f> 0b 66 0f >
> > RSP: 0018:ffffad51c0bfb7a8 EFLAGS: 00010246
> > RAX: 000000000000015f RBX: ffffe704c40d7000 RCX: 0000000000000000
> > RDX: 0000000000000000 RSI: ffffffff89be8040 RDI: 00000000ffffffff
> > RBP: 0000000000103600 R08: 0000000000000000 R09: ffffad51c0bfb658
> > R10: 0000000000000003 R11: ffffffff89eacb08 R12: 0000000000000035
> > R13: ffffe704c40d7000 R14: 0000000000000000 R15: ffffad51c0bfb930
> > FS:  00007f350c51b740(0000) GS:ffff9b62fbd80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000555860919508 CR3: 00000001217fe002 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> >  <TASK>
> >  ? die+0x32/0x80
> >  ? do_trap+0xd6/0x100
> >  ? folio_flags+0x65/0x70
> >  ? do_error_trap+0x6a/0x90
> >  ? folio_flags+0x65/0x70
> >  ? exc_invalid_op+0x4c/0x60
> >  ? folio_flags+0x65/0x70
> >  ? asm_exc_invalid_op+0x16/0x20
> >  ? folio_flags+0x65/0x70
> >  ? folio_flags+0x65/0x70
> >  PageHuge+0x67/0x80
> >  isolate_migratepages_block+0x1c5/0x13b0
> >  compact_zone+0x746/0xfc0
> >  compact_zone_order+0xbb/0x100
> >  try_to_compact_pages+0xf0/0x2f0
> >  __alloc_pages_direct_compact+0x78/0x210
> >  __alloc_pages_slowpath.constprop.0+0xac1/0xdb0
> >  __alloc_pages+0x218/0x240
> >  folio_alloc+0x17/0x50
> 
> It's because PageHuge() now does folio_test_hugetlb() which is documented to
> assume caller holds a reference, but the test in
> isolate_migratepages_block() doesn't. The code is there from 2021 by Oscar,
> perhaps it could be changed to take a reference (and e.g. only test
> PageHead() before), but it will be a bit involved as the
> isolate_or_dissolve_huge_page() it calls has some logic based on the
> refcount being zero/non-zero as well. Oscar, what do you think?
> Also I wonder if any of the other existing PageHuge() callers are also
> affected because they might be doing so without a reference.

I don't think the warning is actually wrong!  We're living very
dangerously here as PageHuge() could have returned a false positive
before this change [1].  Then we assume that compound_nr() returns a
consistent result (and compound_order() might, for example, return a
value larger than 63, leading to UB).

I think there's a place for a folio_test_hugetlb_unsafe(), but that
would only remove the warning, and do nothing to fix all the unsafe
usage.  The hugetlb handling code in isolate_migratepages_block()
doesn't seem to have any understanding that it's working with pages
that can change under it.  I can have a go at fixing it up, but maybe
Oscar would prefer to do it?
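
For the record, the _unsafe variant itself would be trivial; a sketch
(the name is made up, obviously):

	/*
	 * Like folio_test_hugetlb(), but reads the first tail page's
	 * flags directly instead of via folio_flags(), so it skips the
	 * sanity check that fires when the compound page changes under us.
	 */
	static inline bool folio_test_hugetlb_unsafe(struct folio *folio)
	{
		return folio_test_large(folio) &&
			test_bit(PG_hugetlb, &folio_page(folio, 1)->flags);
	}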

[1] terribly unlikely, but consider what happens if the page starts out
as a non-hugetlb compound allocation.  Then it is freed and reallocated;
the new owner of the second page could have put enough information
into it that tricked PageHuge() into returning true.  Then we call
isolate_or_dissolve_huge_page() which takes a lock, but doesn't take
a refcount.  Now it gets a bogus hstate and things go downhill from
there ...)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR
  2023-12-08 18:31     ` Matthew Wilcox
@ 2023-12-13 12:23       ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2023-12-13 12:23 UTC (permalink / raw)
  To: Matthew Wilcox, Vlastimil Babka
  Cc: Andrew Morton, Mike Kravetz, Luis Chamberlain, Oscar Salvador,
	Jens Axboe, io-uring, linux-mm

On 08.12.23 19:31, Matthew Wilcox wrote:
> On Fri, Dec 08, 2023 at 06:54:19PM +0100, Vlastimil Babka wrote:
>> On 8/16/23 17:11, Matthew Wilcox (Oracle) wrote:
>>> We can use a bit in page[1].flags to indicate that this folio belongs
>>> to hugetlb instead of using a value in page[1].dtors.  That lets
>>> folio_test_hugetlb() become an inline function like it should be.
>>> We can also get rid of NULL_COMPOUND_DTOR.
>>>
>>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>>
>> I think this (as commit 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")) is
>> causing the bug reported by Luis here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=218227
> 
> Luis, please stop using bugzilla.  If you'd sent email like a normal
> kernel developer, I'd've seen this bug earlier.
> 
>>> page:000000009006bf10 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x3f8a0 pfn:0x1035c0
>>> flags: 0x17fffc000000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> page_type: 0xffffff7f(buddy)
>>> raw: 017fffc000000000 ffffe704c422f808 ffffe704c41ac008 0000000000000000
>>> raw: 000000000003f8a0 0000000000000005 00000000ffffff7f 0000000000000000
>>> page dumped because: VM_BUG_ON_PAGE(n > 0 && !((__builtin_constant_p(PG_head) && __builtin_constant_p((uintptr_t)(&page->flags) != (uintptr_t)((vo>
>>> ------------[ cut here ]------------
>>> kernel BUG at include/linux/page-flags.h:314!
>>> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>> CPU: 6 PID: 2435641 Comm: md5sum Not tainted 6.6.0-rc5 #2
>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>> RIP: 0010:folio_flags+0x65/0x70
>>> Code: a8 40 74 de 48 8b 47 48 a8 01 74 d6 48 83 e8 01 48 39 c7 75 bd eb cb 48 8b 07 a8 40 75 c8 48 c7 c6 d8 a7 c3 89 e8 3b c7 fa ff <0f> 0b 66 0f >
>>> RSP: 0018:ffffad51c0bfb7a8 EFLAGS: 00010246
>>> RAX: 000000000000015f RBX: ffffe704c40d7000 RCX: 0000000000000000
>>> RDX: 0000000000000000 RSI: ffffffff89be8040 RDI: 00000000ffffffff
>>> RBP: 0000000000103600 R08: 0000000000000000 R09: ffffad51c0bfb658
>>> R10: 0000000000000003 R11: ffffffff89eacb08 R12: 0000000000000035
>>> R13: ffffe704c40d7000 R14: 0000000000000000 R15: ffffad51c0bfb930
>>> FS:  00007f350c51b740(0000) GS:ffff9b62fbd80000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000555860919508 CR3: 00000001217fe002 CR4: 0000000000770ee0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> PKRU: 55555554
>>> Call Trace:
>>>   <TASK>
>>>   ? die+0x32/0x80
>>>   ? do_trap+0xd6/0x100
>>>   ? folio_flags+0x65/0x70
>>>   ? do_error_trap+0x6a/0x90
>>>   ? folio_flags+0x65/0x70
>>>   ? exc_invalid_op+0x4c/0x60
>>>   ? folio_flags+0x65/0x70
>>>   ? asm_exc_invalid_op+0x16/0x20
>>>   ? folio_flags+0x65/0x70
>>>   ? folio_flags+0x65/0x70
>>>   PageHuge+0x67/0x80
>>>   isolate_migratepages_block+0x1c5/0x13b0
>>>   compact_zone+0x746/0xfc0
>>>   compact_zone_order+0xbb/0x100
>>>   try_to_compact_pages+0xf0/0x2f0
>>>   __alloc_pages_direct_compact+0x78/0x210
>>>   __alloc_pages_slowpath.constprop.0+0xac1/0xdb0
>>>   __alloc_pages+0x218/0x240
>>>   folio_alloc+0x17/0x50
>>
>> It's because PageHuge() now does folio_test_hugetlb() which is documented to
>> assume caller holds a reference, but the test in
>> isolate_migratepages_block() doesn't. The code is there from 2021 by Oscar,
>> perhaps it could be changed to take a reference (and e.g. only test
>> PageHead() before), but it will be a bit involved as the
>> isolate_or_dissolve_huge_page() it calls has some logic based on the
>> refcount being zero/non-zero as well. Oscar, what do you think?
>> Also I wonder if any of the other existing PageHuge() callers are also
>> affected because they might be doing so without a reference.
> 
> I don't think the warning is actually wrong!  We're living very
> dangerously here as PageHuge() could have returned a false positive
> before this change [1].  Then we assume that compound_nr() returns a
> consistent result (and compound_order() might, for example, return a
> value larger than 63, leading to UB).

All this code is racy (as spelled out in some of the code comments), and the
assumption is that races are tolerable. In the worst case, isolation fails on
races.

There is some code in there that sanitizes compound_order() return values:

		/*
		 * Regardless of being on LRU, compound pages such as THP and
		 * hugetlbfs are not to be compacted unless we are attempting
		 * an allocation much larger than the huge page size (eg CMA).
		 * We can potentially save a lot of iterations if we skip them
		 * at once. The check is racy, but we can consider only valid
		 * values and the only danger is skipping too much.
		 */
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order <= MAX_ORDER)) {
				low_pfn += (1UL << order) - 1;
				nr_scanned += (1UL << order) - 1;
			}
			goto isolate_fail;
		}

At least isolate_or_dissolve_huge_page() looks like it wants to deal with
concurrent dissolving of hugetlb folios properly.  See below for whether it
is actually correct.

> 
> I think there's a place for a folio_test_hugetlb_unsafe(), but that
> would only remove the warning, and do nothing to fix all the unsafe
> usage.  The hugetlb handling code in isolate_migratepages_block()
> doesn't seem to have any understanding that it's working with pages
> that can change under it.  

Staring at the code (once again), I think we might be missing sanitization
of the compound_nr() return value, but I'm not sure that is really
problematic: we can end up with a low_pfn past the end of memory either
way when we're operating at the end of memory, over memory holes etc.
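
Something mirroring the compound_order() clamp quoted above might do
(untested sketch):

	/*
	 * Hypothetical: only trust the racy compound_nr() value if it
	 * corresponds to a valid order, the same way compound_order()
	 * is clamped above.
	 */
	const unsigned long nr = compound_nr(page);

	if (likely(nr <= 1UL << MAX_ORDER))
		low_pfn += nr - 1;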

> I can have a go at fixing it up, but maybe
> Oscar would prefer to do it?
> 
> [1] terribly unlikely, but consider what happens if the page starts out
> as a non-hugetlb compound allocation.  Then it is freed and reallocated;
> the new owner of the second page could have put enough information
> into it that tricked PageHuge() into returning true.  Then we call

I think that got more likely by using a pageflag :)

> isolate_or_dissolve_huge_page() which takes a lock, but doesn't take
> a refcount.  Now it gets a bogus hstate and things go downhill from
> there ...)

hugetlb folios can have a refcount of 0, in which case they are considered
"free".  The refcount handling at the end of isolate_or_dissolve_huge_page()
gives a hint that this is how hugetlb operates.

So grabbing references as you're used to from !hugetlb code doesn't work.

That function needs some care, to deal with what you described. Maybe
something like the following (assuming page_folio() is fine):


bool is_hugetlb_folio = false;


/*
 * Especially for free hugetlb folios, we cannot use the refcount
 * to stabilize.  While holding the hugetlb_lock, no hugetlb folios can
 * be dissolved.
 */
spin_lock_irq(&hugetlb_lock);
folio = page_folio(page);

if (folio_test_hugetlb_unsafe(folio)) {
	/*
	 * We might have a false positive.  Make sure the folio hasn't
	 * changed in the meantime and still is a hugetlb folio.
	 */
	smp_rmb();
	if (folio == page_folio(page) &&
	    folio_test_hugetlb_unsafe(folio))
		is_hugetlb_folio = true;
}

if (is_hugetlb_folio)
	h = folio_hstate(folio);
spin_unlock_irq(&hugetlb_lock);
if (!is_hugetlb_folio)
	return 0;


But the devil is in the detail (could someone trick us twice?).

isolate_hugetlb() performs similar lockless checks under
hugetlb_lock, and likely we should just unify them once we figure
out the right thing to do.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-12-13 12:23 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-16 15:11 [PATCH v2 00/13] Remove _folio_dtor and _folio_order Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 01/13] io_uring: Stop calling free_compound_page() Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 02/13] mm: Call free_huge_page() directly Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 03/13] mm: Convert free_huge_page() to free_huge_folio() Matthew Wilcox (Oracle)
2023-08-16 20:14   ` Sidhartha Kumar
2023-08-17  3:31   ` Yanteng Si
2023-08-16 15:11 ` [PATCH v2 04/13] mm: Convert free_transhuge_folio() to folio_undo_large_rmappable() Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 05/13] mm; Convert prep_transhuge_page() to folio_prep_large_rmappable() Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 06/13] mm: Remove free_compound_page() and the compound_page_dtors array Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 07/13] mm: Remove HUGETLB_PAGE_DTOR Matthew Wilcox (Oracle)
2023-08-16 21:45   ` Sidhartha Kumar
2023-08-22  3:13   ` Mike Kravetz
2023-08-22  3:32     ` Matthew Wilcox
2023-08-22 17:19       ` Mike Kravetz
2023-12-08 17:54   ` Vlastimil Babka
2023-12-08 18:31     ` Matthew Wilcox
2023-12-13 12:23       ` David Hildenbrand
2023-08-16 15:11 ` [PATCH v2 08/13] mm: Add large_rmappable page flag Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 09/13] mm: Rearrange page flags Matthew Wilcox (Oracle)
2023-08-16 15:11 ` [PATCH v2 10/13] mm: Free up a word in the first tail page Matthew Wilcox (Oracle)
2023-08-22 23:17   ` Mike Kravetz
2023-08-23  0:29     ` Matthew Wilcox
2023-08-16 15:11 ` [PATCH v2 11/13] mm: Remove folio_test_transhuge() Matthew Wilcox (Oracle)
2023-08-16 15:12 ` [PATCH v2 12/13] mm: Add tail private fields to struct folio Matthew Wilcox (Oracle)
2023-08-16 15:21   ` David Hildenbrand
2023-08-16 15:12 ` [PATCH v2 13/13] mm: Convert split_huge_pages_pid() to use a folio Matthew Wilcox (Oracle)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox