Commit Graph

1010 Commits

Author SHA1 Message Date
Rafael Aquini f2b730364a mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
JIRA: https://issues.redhat.com/browse/RHEL-84184
NOTE: this patch can also be found (by kerneloscope) on a dead parent branch
      in the upstream mainline tree as:
      commit 1390a3334a48ecac5175865fd433d55eec255db8
Conflicts:
  * include/linux/page-flags.h: RHEL-only hunk is required here to avoid breaking
    the build for kernel variants that disable CONFIG_TRANSPARENT_HUGEPAGE but
    keep CONFIG_HUGETLBFS enabled (-rt). This is because RHEL-9 misses upstream
    v6.10 commit 85edc15a4c60 ("mm: remove folio_prep_large_rmappable()") along
    with its accompanying series which are irrelevant for this backport work;

This patch is a backport of the following upstream commit:
commit f708f6970cc9d6bac71da45c129482092e710537
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 9 20:04:33 2024 +0800

    mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio

    A kernel crash was observed when migrating hugetlb folio:

    BUG: kernel NULL pointer dereference, address: 0000000000000008
    PGD 0 P4D 0
    Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
    CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66
    RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0
    RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097
    RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0
    RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868
    R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0
    R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001
    FS:  00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0
    Call Trace:
     <TASK>
     __folio_migrate_mapping+0x59e/0x950
     __migrate_folio.constprop.0+0x5f/0x120
     move_to_new_folio+0xfd/0x250
     migrate_pages+0x383/0xd70
     soft_offline_page+0x2ab/0x7f0
     soft_offline_page_store+0x52/0x90
     kernfs_fop_write_iter+0x12c/0x1d0
     vfs_write+0x380/0x540
     ksys_write+0x64/0xe0
     do_syscall_64+0xb9/0x1d0
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7f5b3a514887
    RSP: 002b:00007ffe138fce68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f5b3a514887
    RDX: 000000000000000c RSI: 0000556ab809ee10 RDI: 0000000000000001
    RBP: 0000556ab809ee10 R08: 00007f5b3a5d1460 R09: 000000007fffffff
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
    R13: 00007f5b3a61b780 R14: 00007f5b3a617600 R15: 00007f5b3a616a00

    It's because a hugetlb folio is passed to __folio_undo_large_rmappable()
    unexpectedly.  The large_rmappable flag has been imperceptibly set on
    hugetlb folios since commit f6a8dd98a2ce ("hugetlb: convert
    alloc_buddy_hugetlb_folio to use a folio").  Then commit be9581ea8c05
    ("mm: fix crashes from deferred split racing folio migration") made
    folio_migrate_mapping() call folio_undo_large_rmappable(), triggering the
    bug.  Fix this issue by clearing the large_rmappable flag for hugetlb
    folios.  They don't need that flag set anyway.
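
    A minimal sketch of the fix's shape (hedged: helper and call-site names
    follow the description above, not necessarily the exact upstream diff):

        /*
         * When a freshly allocated buddy folio is turned into a hugetlb
         * folio, drop the large_rmappable flag so the migration path can
         * never hand a hugetlb folio to folio_undo_large_rmappable().
         */
        static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
        {
                __folio_set_hugetlb(folio);
                folio_clear_large_rmappable(folio);  /* hugetlb never needs it */
                INIT_LIST_HEAD(&folio->lru);
        }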

    Link: https://lkml.kernel.org/r/20240709120433.4136700-1-linmiaohe@huawei.com
    Fixes: f6a8dd98a2ce ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio")
    Fixes: be9581ea8c05 ("mm: fix crashes from deferred split racing folio migration")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:03 -04:00
Rafael Aquini 71fd027f70 mm: clear uffd-wp PTE/PMD state on mremap()
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-80529
CVE: CVE-2025-21696

This patch is a backport of the following upstream commit:
commit 0cef0bb836e3cfe00f08f9606c72abd72fe78ca3
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Tue Jan 7 14:47:52 2025 +0000

    mm: clear uffd-wp PTE/PMD state on mremap()

    When mremap()ing a memory region previously registered with userfaultfd as
    write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in
    flag clearing leads to a mismatch between the vma flags (which have
    uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp
    cleared).  This mismatch causes a subsequent mprotect(PROT_WRITE) to
    trigger a warning in page_table_check_pte_flags() due to setting the pte
    to writable while uffd-wp is still set.

    Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
    such mremap() so that the values are consistent with the existing clearing
    of VM_UFFD_WP.  Be careful to clear the logical flag regardless of its
    physical form; a PTE bit, a swap PTE bit, or a PTE marker.  Cover PTE,
    huge PMD and hugetlb paths.
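
    A hedged sketch of the clearing idea (the helper name is illustrative;
    the PTE-marker case needs its own marker rewrite and is elided here):

        static pte_t pte_clear_uffd_wp_any(pte_t pte)
        {
                if (pte_present(pte))
                        return pte_clear_uffd_wp(pte);     /* PTE bit */
                /*
                 * Non-present: swap and migration entries carry the bit in
                 * a different position; PTE markers embed it in the marker
                 * value and are handled separately.
                 */
                return pte_swp_clear_uffd_wp(pte);         /* swap PTE bit */
        }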

    Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
    Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
    Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/
    Fixes: 63b2d4174c ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:03 -04:00
CKI Backport Bot 6e1725c74e mm: hugetlb: avoid fallback for specific node allocation of 1G pages
JIRA: https://issues.redhat.com/browse/RHEL-78971

commit 6d7bc938adca9024a6b51cf55d9b0542b653b69c
Author: Luiz Capitulino <luizcap@redhat.com>
Date:   Mon Feb 10 22:48:56 2025 -0500

    mm: hugetlb: avoid fallback for specific node allocation of 1G pages

    When using the HugeTLB kernel command-line to allocate 1G pages from a
    specific node, such as:

       default_hugepagesz=1G hugepages=1:1

    If node 1 happens to not have enough memory for the requested number of 1G
    pages, the allocation falls back to other nodes.  A quick way to reproduce
    this is by creating a KVM guest with a memory-less node and trying to
    allocate one 1G page from it.  Instead of failing, the allocation will
    fall back to other nodes.

    This defeats the purpose of node-specific allocation.  Also, specific node
    allocation of 2M pages doesn't have this behavior: the allocation will
    just fail for the pages it can't satisfy.

    This issue happens because HugeTLB calls memblock_alloc_try_nid_raw() for
    1G boot-time allocation as this function falls back to other nodes if the
    allocation can't be satisfied.  Use memblock_alloc_exact_nid_raw()
    instead, which ensures that the allocation will only be satisfied from the
    specified node.
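
    The change amounts to swapping the allocator call in the boot-time 1G
    allocation path, roughly (a sketch based on the description above):

        /* before: may silently fall back to another node */
        m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
                                       0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);

        /* after: returns NULL instead of falling back */
        m = memblock_alloc_exact_nid_raw(huge_page_size(h), huge_page_size(h),
                                         0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);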

    Link: https://lkml.kernel.org/r/20250211034856.629371-1-luizcap@redhat.com
    Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
    Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
    Acked-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
    Cc: Frank van der Linden <fvdl@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2025-02-24 14:23:10 +00:00
Aristeu Rozanski 53820ed2e7 hugetlb: prioritize surplus allocation from current node
JIRA: https://issues.redhat.com/browse/RHEL-68966
Tested: by me

commit d0f14f7ee0e2d5df447d54487ae0c3aee5a7208f
Author: Koichiro Den <koichiro.den@canonical.com>
Date:   Thu Dec 5 01:55:03 2024 +0900

    hugetlb: prioritize surplus allocation from current node

    Previously, surplus allocations triggered by mmap were typically made from
    the node where the process was running.  On a page fault, the area was
    reliably dequeued from the hugepage_freelists for that node.  However,
    since commit 003af997c8a9 ("hugetlb: force allocating surplus hugepages on
    mempolicy allowed nodes"), dequeue_hugetlb_folio_vma() may fall back to
    other nodes unnecessarily even if there is no MPOL_BIND policy, causing
    folios to be dequeued from nodes other than the current one.

    Also, allocating from the node where the current process is running is
    likely to result in a performance win, as mmap-ing processes often touch
    the area not so long after allocation.  This change minimizes surprises
    for users relying on the previous behavior while maintaining the benefit
    introduced by the commit.

    So, prioritize the node the current process is running on when possible.
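
    Roughly, the surplus allocation order becomes (a hedged sketch of the
    approach, not necessarily the exact diff):

        /* try the local node first, with no fallback allowed ... */
        folio = alloc_surplus_hugetlb_folio(h, gfp_mask | __GFP_THISNODE,
                                            numa_node_id(), NULL);
        /* ... and only then let the mempolicy-allowed mask take over */
        if (!folio)
                folio = alloc_surplus_hugetlb_folio(h, gfp_mask,
                                                    numa_node_id(), nodemask);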

    Link: https://lkml.kernel.org/r/20241204165503.628784-1-koichiro.den@canonical.com
    Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
    Acked-by: Aristeu Rozanski <aris@ruivo.org>
    Cc: Aristeu Rozanski <aris@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2025-01-29 14:40:37 -05:00
Rafael Aquini e9a02ddb46 mm/hugetlb: fix possible recursive locking detected warning
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 667574e873b5f77a220b2a93329689f36fb56d5d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Jul 12 11:13:14 2024 +0800

    mm/hugetlb: fix possible recursive locking detected warning

    When trying to demote 1G hugetlb folios, a lockdep warning is observed:

    ============================================
    WARNING: possible recursive locking detected
    6.10.0-rc6-00452-ga4d0275fa660-dirty #79 Not tainted
    --------------------------------------------
    bash/710 is trying to acquire lock:
    ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460

    but task is already holding lock:
    ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&h->resize_lock);
      lock(&h->resize_lock);

     *** DEADLOCK ***

     May be due to missing lock nesting notation

    4 locks held by bash/710:
     #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
     #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
     #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
     #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

    stack backtrace:
    CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty #79
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x68/0xa0
     __lock_acquire+0x10f2/0x1ca0
     lock_acquire+0xbe/0x2d0
     __mutex_lock+0x6d/0x400
     demote_store+0x244/0x460
     kernfs_fop_write_iter+0x12c/0x1d0
     vfs_write+0x380/0x540
     ksys_write+0x64/0xe0
     do_syscall_64+0xb9/0x1d0
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7fa61db14887
    RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
    RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
    RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
    R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
     </TASK>

    Lockdep considers this an AA deadlock because the different resize_lock
    mutexes reside in the same lockdep class, but this is a false positive.
    Place them in distinct classes to avoid these warnings.
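
    One standard way to give each hstate's mutex its own lockdep class (a
    generic illustration, not necessarily the exact upstream diff):

        static struct lock_class_key resize_keys[HUGE_MAX_HSTATE];

        /* at hstate init time: same mutex, but a per-hstate lockdep class */
        mutex_init(&h->resize_lock);
        lockdep_set_class(&h->resize_lock, &resize_keys[hstate_index(h)]);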

    Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
    Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Muchun Song <muchun.song@linux.dev>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:29 -05:00
Rafael Aquini 356484cfaa mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 5596d9e8b553dacb0ac34bcf873cbbfb16c3ba3e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon Jul 8 10:51:27 2024 +0800

    mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()

    There is a potential race between __update_and_free_hugetlb_folio() and
    try_memory_failure_hugetlb():

     CPU1                                   CPU2
     __update_and_free_hugetlb_folio        try_memory_failure_hugetlb
                                             folio_test_hugetlb
                                              -- It's still hugetlb folio.
      folio_clear_hugetlb_hwpoison
                                              spin_lock_irq(&hugetlb_lock);
                                               __get_huge_page_for_hwpoison
                                                folio_set_hugetlb_hwpoison
                                              spin_unlock_irq(&hugetlb_lock);
      spin_lock_irq(&hugetlb_lock);
      __folio_clear_hugetlb(folio);
       -- Hugetlb flag is cleared but too late.
      spin_unlock_irq(&hugetlb_lock);

    When the above race occurs, raw error page info will be leaked.  Even
    worse, raw error pages won't have the hwpoisoned flag set and will hit
    the pcplists/buddy.  Fix this issue by deferring
    folio_clear_hugetlb_hwpoison() until __folio_clear_hugetlb() is done, so
    that all raw error pages will have the hwpoisoned flag set.
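
    In effect, the teardown ordering becomes (sketch):

        spin_lock_irq(&hugetlb_lock);
        __folio_clear_hugetlb(folio);
        /*
         * Clear hwpoison info only after the hugetlb flag is gone, so a
         * racing try_memory_failure_hugetlb() can no longer see a hugetlb
         * folio and attach raw error page info that would then be leaked.
         */
        folio_clear_hugetlb_hwpoison(folio);
        spin_unlock_irq(&hugetlb_lock);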

    Link: https://lkml.kernel.org/r/20240708025127.107713-1-linmiaohe@huawei.com
    Fixes: 32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Muchun Song <muchun.song@linux.dev>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:24 -05:00
Rafael Aquini 8f0ea841dc mm: gup: stop abusing try_grab_folio
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * this is a direct port from v6.6 LTS branch backport commit 26273f5f4cf6
    ("mm: gup: stop abusing try_grab_folio"), due to RHEL9 missing upstream
    commit 53e45c4f6d4f ("mm: convert put_devmap_managed_page_refs() to
    put_devmap_managed_folio_refs()") and its big accompanying series.

This patch is a backport of the following upstream commit:
commit f442fa6141379a20b48ae3efabee827a3d260787
Author: Yang Shi <yang@os.amperecomputing.com>
Date:   Fri Jun 28 12:14:58 2024 -0700

    mm: gup: stop abusing try_grab_folio

    A kernel warning was reported when pinning folio in CMA memory when
    launching SEV virtual machine.  The splat looks like:

    [  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
    [  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
    [  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
    [  464.325515] Call Trace:
    [  464.325520]  <TASK>
    [  464.325523]  ? __get_user_pages+0x423/0x520
    [  464.325528]  ? __warn+0x81/0x130
    [  464.325536]  ? __get_user_pages+0x423/0x520
    [  464.325541]  ? report_bug+0x171/0x1a0
    [  464.325549]  ? handle_bug+0x3c/0x70
    [  464.325554]  ? exc_invalid_op+0x17/0x70
    [  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
    [  464.325567]  ? __get_user_pages+0x423/0x520
    [  464.325575]  __gup_longterm_locked+0x212/0x7a0
    [  464.325583]  internal_get_user_pages_fast+0xfb/0x190
    [  464.325590]  pin_user_pages_fast+0x47/0x60
    [  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
    [  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]

    Per the analysis done by yangge, when starting the SEV virtual machine, it
    will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory.
    But the page is in a CMA area, so fast GUP will fail and then fall back to
    the slow path due to the longterm pinnable check in try_grab_folio().

    The slow path will try to pin the pages and then migrate them out of the
    CMA area.  But the slow path also uses try_grab_folio() to pin the page,
    so it also fails due to the same check and the above warning is triggered.

    In addition, try_grab_folio() is supposed to be used in the fast path, and
    it elevates the folio refcount with an "add ref unless zero" operation.
    We are guaranteed to have at least one stable reference in the slow path,
    so a simple atomic add could be used.  The performance difference should
    be trivial, but the misuse may be confusing and misleading.

    Redefine try_grab_folio() as try_grab_folio_fast(), and try_grab_page()
    as try_grab_folio(), and use them in the proper paths.  This solves both
    the abuse and the kernel warning.

    The proper naming makes their use cases clearer and should prevent such
    abuse in the future.

    peterx said:

    : The user will see the pin fails, for gup-slow it further triggers the WARN
    : right below that failure (as in the original report):
    :
    :         folio = try_grab_folio(page, page_increm - 1,
    :                                 foll_flags);
    :         if (WARN_ON_ONCE(!folio)) { <------------------------ here
    :                 /*
    :                         * Release the 1st page ref if the
    :                         * folio is problematic, fail hard.
    :                         */
    :                 gup_put_folio(page_folio(page), 1,
    :                                 foll_flags);
    :                 ret = -EFAULT;
    :                 goto out;
    :         }
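
    After the rename, the two helpers' roles are explicit in their
    signatures (sketch; argument lists abridged from the description):

        /* fast path: speculative, takes the ref with "add unless zero" */
        struct folio *try_grab_folio_fast(struct page *page, int refs,
                                          unsigned int flags);

        /* slow path: caller already holds a stable ref; plain atomic add */
        int try_grab_folio(struct folio *folio, int refs, unsigned int flags);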

    [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/

    [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast]
      Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
    Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com
    Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
    Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
    Reported-by: yangge <yangge1116@126.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>    [6.6+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:23 -05:00
Rafael Aquini 7b22a0ea90 mm/hugetlb: do not call vma_add_reservation upon ENOMEM
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 8daf9c702ee7f825f0de8600abff764acfedea13
Author: Oscar Salvador <osalvador@suse.de>
Date:   Tue May 28 22:53:23 2024 +0200

    mm/hugetlb: do not call vma_add_reservation upon ENOMEM

    syzbot reported a splat [1] on __unmap_hugepage_range().  This is because
    vma_needs_reservation() can return -ENOMEM if
    allocate_file_region_entries() fails to allocate the file_region struct
    for the reservation.

    Check for that and do not call vma_add_reservation() if that is the case,
    otherwise region_abort() and region_del() will see that we do not have any
    file_regions.

    If we detect that vma_needs_reservation() returned -ENOMEM, we clear the
    hugetlb_restore_reserve flag as if this reservation was still consumed, so
    free_huge_folio() will not increment the resv count.
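
    The handling boils down to (a hedged sketch of the logic described
    above):

        rc = vma_needs_reservation(h, vma, address);
        if (rc < 0)
                /*
                 * -ENOMEM: no file_region could be allocated.  Pretend the
                 * reservation was consumed so free_huge_folio() will not
                 * bump the resv count for a region that never existed.
                 */
                folio_clear_hugetlb_restore_reserve(folio);
        else if (rc)
                vma_add_reservation(h, vma, address);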

    [1] https://lore.kernel.org/linux-mm/0000000000004096100617c58d54@google.com/T/#ma5983bc1ab18a54910da83416b3f89f3c7ee43aa

    Link: https://lkml.kernel.org/r/20240528205323.20439-1-osalvador@suse.de
    Fixes: df7a6d1f6405 ("mm/hugetlb: restore the reservation if needed")
    Signed-off-by: Oscar Salvador <osalvador@suse.de>
    Reported-and-tested-by: syzbot+d3fe2dc5ffe9380b714b@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/linux-mm/0000000000004096100617c58d54@google.com/
    Cc: Breno Leitao <leitao@debian.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:13 -05:00
Rafael Aquini 407adaf8ff mm/hugetlb: align cma on allocation order, not demotion order
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit cc48be374b654e1533c40cd64ebc4e4b0a637317
Author: Frank van der Linden <fvdl@google.com>
Date:   Tue Apr 30 16:14:37 2024 +0000

    mm/hugetlb: align cma on allocation order, not demotion order

    Align the CMA area for hugetlb gigantic pages to their size, not the size
    that they can be demoted to.  Otherwise there might be misaligned sections
    at the start and end of the CMA area that will never be used for hugetlb
    page allocations.
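
    Inside hugetlb_cma_reserve(order), where `order` is the gigantic page
    allocation order, the alignment argument changes roughly like this (a
    sketch, not the exact diff):

        res = cma_declare_contiguous_nid(0, size, 0,
                        PAGE_SIZE << order,  /* was PAGE_SIZE << HUGETLB_PAGE_ORDER,
                                              * i.e. the demotion target size */
                        HUGETLB_PAGE_ORDER, false, name,
                        &hugetlb_cma[nid], nid);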

    Link: https://lkml.kernel.org/r/20240430161437.2100295-1-fvdl@google.com
    Fixes: a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA")
    Signed-off-by: Frank van der Linden <fvdl@google.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:09 -05:00
Rafael Aquini ad81ab1925 mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 55d134a7b499c77e7cfd0ee41046f3c376e791e5
Author: Frank van der Linden <fvdl@google.com>
Date:   Thu Apr 4 16:25:15 2024 +0000

    mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid

    The hugetlb_cma code passes 0 in the order_per_bit argument to
    cma_declare_contiguous_nid (the alignment, computed using the page order,
    is correctly passed in).

    This causes a bit in the cma allocation bitmap to always represent a 4k
    page, making the bitmaps potentially very large and slower.

    E.g.  for a 4k page size on x86, hugetlb_cma=64G would mean a bitmap size
    of (64G / 4k) / 8 == 2M.  With HUGETLB_PAGE_ORDER as order_per_bit, as
    intended, this would be (64G / 2M) / 8 == 4k.  So, that's quite a
    difference.

    Also, this restricted the hugetlb_cma area to ((PAGE_SIZE <<
    MAX_PAGE_ORDER) * 8) * PAGE_SIZE (e.g.  128G on x86), since bitmap_alloc
    uses normal page allocation, and is thus restricted by MAX_PAGE_ORDER.
    Specifying anything above that would fail the CMA initialization.

    So, correctly pass in the order instead.
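
    In the declaration, this is the order_per_bit argument (sketch):

        res = cma_declare_contiguous_nid(0, size, 0,
                        PAGE_SIZE << HUGETLB_PAGE_ORDER,  /* alignment, as before */
                        HUGETLB_PAGE_ORDER,               /* order_per_bit: was 0 */
                        false, name, &hugetlb_cma[nid], nid);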

    Link: https://lkml.kernel.org/r/20240404162515.527802-2-fvdl@google.com
    Fixes: cf11e85fc0 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
    Signed-off-by: Frank van der Linden <fvdl@google.com>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:05 -05:00
Rafael Aquini 9483343dbf hugetlb: convert alloc_buddy_hugetlb_folio to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit f6a8dd98a2ce7ace0be59d77868751131af6d2f0
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Apr 2 21:06:54 2024 +0100

    hugetlb: convert alloc_buddy_hugetlb_folio to use a folio

    While this function returned a folio, it was still using __alloc_pages()
    and __free_pages().  Use __folio_alloc() and folio_put() instead.  This
    actually removes a call to compound_head(), but more importantly, it
    prepares us for the move to memdescs.
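
    The conversion is mechanical; a sketch of the before/after shape:

        /* before: page based; callers had to convert with page_folio() */
        struct page *page = __alloc_pages(gfp_mask, order, nid, nmask);
        __free_pages(page, order);                    /* error/retry path */

        /* after: folio based end to end */
        struct folio *folio = __folio_alloc(gfp_mask, order, nid, nmask);
        folio_put(folio);                             /* error/retry path */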

    Link: https://lkml.kernel.org/r/20240402200656.913841-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:03 -05:00
Rafael Aquini 6be9f58cab mm: turn folio_test_hugetlb into a PageType
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * kernel/vmcore_info.c: hunk applied to kernel/crash_core.c instead, as
    RHEL9 misses upstream commit 443cbaf9e2fd ("crash: split vmcoreinfo
    exporting code out from crash_core.c") and related series;
  * mm/hugetlb.c: minor context difference due to RHEL9 missing upstream
    commit d67e32f26713 ("hugetlb: restructure pool allocations") and its
    related series;

This patch is a backport of the following upstream commit:
commit d99e3140a4d33e26066183ff727d8f02f56bec64
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Mar 21 14:24:43 2024 +0000

    mm: turn folio_test_hugetlb into a PageType

    The current folio_test_hugetlb() can be fooled by a concurrent folio split
    into returning true for a folio which has never belonged to hugetlbfs.
    This can't happen if the caller holds a refcount on it, but we have a few
    places (memory-failure, compaction, procfs) which do not and should not
    take a speculative reference.

    Since hugetlb pages do not use individual page mapcounts (they are always
    fully mapped and use the entire_mapcount field to record the number of
    mappings), the PageType field is available now that page_mapcount()
    ignores the value in this field.

    In compaction and with CONFIG_DEBUG_VM enabled, the current implementation
    can result in an oops, as reported by Luis. This happens since 9c5ccf2db04b
    ("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks
    in the PageHuge() testing path.

    [willy@infradead.org: update vmcoreinfo]
      Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org
    Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org
    Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reported-by: Luis Chamberlain <mcgrof@kernel.org>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:52 -05:00
Rafael Aquini 3686a95b5a mm: constify more page/folio tests
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 29cfe7556bfd6be043b6eb602a29c89d43565d71
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 27 19:23:34 2024 +0000

    mm: constify more page/folio tests

    Constify the flag tests that aren't automatically generated and the tests
    that look like flag tests but are more complicated.

    Link: https://lkml.kernel.org/r/20240227192337.757313-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:34 -05:00
Rafael Aquini 126d9b24d2 mm/hugetlb: restore the reservation if needed
JIRA: https://issues.redhat.com/browse/RHEL-27745
JIRA: https://issues.redhat.com/browse/RHEL-61137
Conflicts:
  * minor difference on the 2nd hunk due to RHEL-9 missing upstream v6.8's
    commit e135826b2da0 ("mm/rmap: introduce and use hugetlb_remove_rmap()")
    and the long series that goes along with it

This patch is a backport of the following upstream commit:
commit df7a6d1f64056aec572162c5d35ed9ff86ece6f3
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Feb 5 11:18:41 2024 -0800

    mm/hugetlb: restore the reservation if needed

    Patch series "mm/hugetlb: Restore the reservation", v2.

    This is a fix for a case where a backing huge page could stolen after
    madvise(MADV_DONTNEED).

    A full reproducer is in selftest. See
    https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/

    In order to test this patch, I instrumented the kernel with LOCKDEP and
    KASAN, and run the following tests, without any regression:
      * The self test that reproduces the problem
      * All mm hugetlb selftests
            SUMMARY: PASS=9 SKIP=0 FAIL=0
      * All libhugetlbfs tests
            PASS:     0     86
            FAIL:     0      0

    This patch (of 2):

    Currently there is a bug that a huge page could be stolen, and when the
    original owner tries to fault in it, it causes a page fault.

    You can achieve that by:
      1) Creating a single page
            echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

      2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
            * This will mark the page as reserved
      3) touch the page, which causes a page fault and allocates the page
            * This will move the page out of the free list.
            * It will also unreserve the page, since there are no more free
              pages.
      4) madvise(MADV_DONTNEED) the page
            * This will free the page, but not mark it as reserved.
      5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2).
            * It should fail, since there are no more available pages.
            * But, since the page above is not reserved, this mmap() succeeds.
      6) Faulting at ptr1 will cause a SIGBUS
            * it will try to allocate a huge page, but there is none
              available
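
    Compressed into a userspace program, the sequence above looks like this
    (a hedged sketch: assumes a 2MB default hugepage size and nr_hugepages=1
    as set in step 1; error handling trimmed):

        #include <string.h>
        #include <sys/mman.h>

        #define HPAGE_SZ (2UL << 20)
        #define HFLAGS   (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)

        int main(void)
        {
                char *ptr1 = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
                                  HFLAGS, -1, 0);        /* step 2: reserves */
                memset(ptr1, 1, HPAGE_SZ);               /* step 3: faults in */
                madvise(ptr1, HPAGE_SZ, MADV_DONTNEED);  /* step 4: frees page */

                char *ptr2 = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
                                  HFLAGS, -1, 0);        /* step 5: should fail */
                memset(ptr2, 2, HPAGE_SZ);               /* steals the only page */

                memset(ptr1, 3, HPAGE_SZ);               /* step 6: SIGBUS pre-fix */
                return 0;
        }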

    A full reproducer is in selftest. See
    https://lore.kernel.org/all/20240105155419.1939484-1-leitao@debian.org/

    Fix this by restoring the reserved page if necessary.

    These are the conditions for the page restore:

     * The system is not using surplus pages. The goal is to reduce the
       surplus usage for this case.
     * If the VMA has the HPAGE_RESV_OWNER flag set, and is PRIVATE. This is
       safely checked using __vma_private_lock()
     * The page is anonymous

    Once this scenario is found, set the `hugetlb_restore_reserve` bit in the
    folio.  Then check whether the resv reservations need to be adjusted;
    this is done later, after the spinlock, since the vma_xxxx_reservation()
    helpers might touch the file system lock.

    Link: https://lkml.kernel.org/r/20240205191843.4009640-1-leitao@debian.org
    Link: https://lkml.kernel.org/r/20240205191843.4009640-2-leitao@debian.org
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Suggested-by: Rik van Riel <riel@surriel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:24 -05:00
Rafael Aquini c8c9c0b259 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * arch/*/Kconfig: all hunks dropped as there were only text blurbs and comments
     being changed with no functional changes whatsoever, and RHEL9 is missing
     several (unrelated) commits to these arches that transform the text blurbs in
     the way these non-functional hunks were expecting;
  * drivers/accel/qaic/qaic_data.c: hunk dropped due to RHEL-only commit
     083c0cdce2 ("Merge DRM changes from upstream v6.8..v6.9");
  * drivers/gpu/drm/i915/gem/selftests/huge_pages.c: hunk dropped due to RHEL-only
     commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * drivers/gpu/drm/ttm/tests/ttm_pool_test.c: all hunks dropped due to RHEL-only
     commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * drivers/video/fbdev/vermilion/vermilion.c: hunk dropped as RHEL9 misses
     commit dbe7e429fe ("vmlfb: framebuffer driver for Intel Vermilion Range");
  * include/linux/pageblock-flags.h: differences due to out-of-order backport
    of upstream commits 72801513b2bf ("mm: set pageblock_order to HPAGE_PMD_ORDER
    in case with !CONFIG_HUGETLB_PAGE but THP enabled"), and 3a7e02c040b1
    ("minmax: avoid overly complicated constant expressions in VM code");
  * mm/mm_init.c: differences on the 3rd, and 4th hunks are due to RHEL
     backport commit 1845b92dcf ("mm: move most of core MM initialization to
     mm/mm_init.c") ignoring the out-of-order backport of commit 3f6dac0fd1b8
     ("mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks")
     thus partially reverting the changes introduced by the latter;

This patch is a backport of the following upstream commit:
commit 5e0a760b44417f7cadd79de2204d6247109558a0
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Thu Dec 28 17:47:04 2023 +0300

    mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

    commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
    changed the definition of MAX_ORDER to be inclusive.  This has caused
    issues with code that was not yet upstream and depended on the previous
    definition.

    To draw attention to the altered meaning of the define, rename MAX_ORDER
    to MAX_PAGE_ORDER.

    Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:17 -05:00
Rafael Aquini 32b576f804 mm/filemap: remove hugetlb special casing in filemap.c
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Sep 26 12:20:17 2023 -0700

    mm/filemap: remove hugetlb special casing in filemap.c

    Remove special cased hugetlb handling code within the page cache by
    changing the granularity of ->index to the base page size rather than the
    huge page size.  The motivation of this patch is to reduce complexity
    within the filemap code while also increasing performance by removing
    branches that are evaluated on every page cache lookup.

    To support the change in index, new wrappers for hugetlb page cache
    interactions are added.  These wrappers perform the conversion to a linear
    index which is now expected by the page cache for huge pages.
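
    The central wrapper pattern looks roughly like this (a sketch: hugetlb
    code keeps thinking in huge-page units while the wrapper rescales to the
    base-page index the page cache now expects):

        static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
                        struct address_space *mapping, pgoff_t idx)
        {
                /* one huge page spans 1 << huge_page_order(h) base pages */
                return filemap_lock_folio(mapping, idx << huge_page_order(h));
        }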

    ========================= PERFORMANCE ======================================

    Perf was used to check the performance differences after the patch.
    Overall the performance is similar to mainline with a very small larger
    overhead that occurs in __filemap_add_folio() and
    hugetlb_add_to_page_cache().  This is because of the larger overhead that
    occurs in xa_load() and xa_store() as the xarray is now using more entries
    to store hugetlb folios in the page cache.

    Timing

    aarch64
        2MB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                real    1m49.568s
                user    0m0.000s
                sys     1m49.461s

            6.5-rc3:
                [root]# time fallocate -l 700GB test.txt
                real    1m47.495s
                user    0m0.000s
                sys     1m47.370s
        1GB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                real    1m47.024s
                user    0m0.000s
                sys     1m46.921s

            6.5-rc3:
                [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                real    1m44.551s
                user    0m0.000s
                sys     1m44.438s

    x86
        2MB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                real    0m22.383s
                user    0m0.000s
                sys     0m22.255s

            6.5-rc3:
                [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                real    0m22.735s
                user    0m0.038s
                sys     0m22.567s

        1GB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                real    0m25.786s
                user    0m0.001s
                sys     0m25.589s

            6.5-rc3:
                [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                real    0m33.454s
                user    0m0.001s
                sys     0m33.193s

    aarch64:
        workload - fallocate a 700GB file backed by huge pages

        6.5-rc3 + this patch:
            2MB Page Size:
                --100.00%--__arm64_sys_fallocate
                              ksys_fallocate
                              vfs_fallocate
                              hugetlbfs_fallocate
                              |
                              |--95.04%--__pi_clear_page
                              |
                              |--3.57%--clear_huge_page
                              |          |
                              |          |--2.63%--rcu_all_qs
                              |          |
                              |           --0.91%--__cond_resched
                              |
                               --0.67%--__cond_resched
                0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio

        6.5-rc3
            2MB Page Size:
                    --100.00%--__arm64_sys_fallocate
                              ksys_fallocate
                              vfs_fallocate
                              hugetlbfs_fallocate
                              |
                              |--94.91%--__pi_clear_page
                              |
                              |--4.11%--clear_huge_page
                              |          |
                              |          |--3.00%--rcu_all_qs
                              |          |
                              |           --1.10%--__cond_resched
                              |
                               --0.59%--__cond_resched
                0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio

    x86
        workload - fallocate a 100GB file backed by huge pages

        6.5-rc3 + this patch:
            2MB Page Size:
                hugetlbfs_fallocate
                |
                --99.57%--clear_huge_page
                    |
                    --98.47%--clear_page_erms
                        |
                        --0.53%--asm_sysvec_apic_timer_interrupt

                0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store

        6.5-rc3
            2MB Page Size:
                    --99.93%--__x64_sys_fallocate
                              vfs_fallocate
                              hugetlbfs_fallocate
                              |
                               --99.38%--clear_huge_page
                                         |
                                         |--98.40%--clear_page_erms
                                         |
                                          --0.59%--__cond_resched
                0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio

    ========================= TESTING ======================================

    This patch passes libhugetlbfs tests and LTP hugetlb tests

    ********** TEST SUMMARY
    *                      2M
    *                      32-bit 64-bit
    *     Total testcases:   110    113
    *             Skipped:     0      0
    *                PASS:   107    113
    *                FAIL:     0      0
    *    Killed by signal:     3      0
    *   Bad configuration:     0      0
    *       Expected FAIL:     0      0
    *     Unexpected PASS:     0      0
    *    Test not present:     0      0
    * Strange test result:     0      0
    **********

        Done executing testcases.
        LTP Version:  20220527-178-g2761a81c4

    page migration was also tested using Mike Kravetz's test program.[8]

    [dan.carpenter@linaro.org: fix a NULL vs IS_ERR() bug]
      Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
    Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
    Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
    Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:37 -05:00
Rafael Aquini 8d2924bfae mm/hugetlb: use nth_page() in place of direct struct page manipulation
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 426056efe835cf4864ccf4c328fe3af9146fc539
Author: Zi Yan <ziy@nvidia.com>
Date:   Wed Sep 13 16:12:45 2023 -0400

    mm/hugetlb: use nth_page() in place of direct struct page manipulation

    When dealing with hugetlb pages, manipulating struct page pointers
    directly can lead to the wrong struct page, since struct page is not
    guaranteed to be contiguous on SPARSEMEM without VMEMMAP.  Use nth_page()
    to handle it properly.

    A wrong or non-existent page might be grabbed, leading either to a
    non-freeable page or to kernel memory access errors.  No bug has been
    reported; this was found by code inspection.
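
    The pattern being fixed looks like this (sketch):

        /*
         * Wrong with SPARSEMEM && !VMEMMAP: struct pages are contiguous
         * only within a memory section, so the pointer arithmetic below
         * can walk off the memory map for large hugetlb folios.
         */
        subpage = page + n;

        /* Right: nth_page() resolves through the PFN instead. */
        subpage = nth_page(page, n);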

    Link: https://lkml.kernel.org/r/20230913201248.452081-3-zi.yan@sent.com
    Fixes: 57a196a58421 ("hugetlb: simplify hugetlb handling in follow_page_mask")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:29 -05:00
Rado Vrbovsky 570a71d7db Merge: mm: update core code to v6.6 upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252

JIRA: https://issues.redhat.com/browse/RHEL-27743  
JIRA: https://issues.redhat.com/browse/RHEL-59459    
CVE: CVE-2024-46787    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961  
  
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.    
This work follows up on the previous v6.5 update (RHEL-27742); as such,
the bulk of this changeset comprises refactoring and clean-ups of
the internal implementation of several APIs as it further advances the
conversion to FOLIOS, and follows up on the per-VMA locking changes.

Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow    
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,    
and we add a potential extra level of protection (assessment pending) to help
mitigate the kernel heap exploit technique dubbed "SLUBStick".
    
Follow-up fixes are omitted from this series either because they are irrelevant to     
the bits we support on RHEL or because they depend on bigger changesets introduced     
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.    

Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")    
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")   
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")    
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")    
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")    
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")    
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")    
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")    
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")    
    
Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-30 07:22:28 +00:00
Rafael Aquini 5f28258391 mm/userfaultfd: allow hugetlb change protection upon poison entry
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit c5977c95dff182d6ee06f4d6f60bcb0284912969
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Apr 5 19:19:20 2024 -0400

    mm/userfaultfd: allow hugetlb change protection upon poison entry

    After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
    the POISON one or UFFD_WP one.

    Allow change protection to run on a poisoned marker just like !hugetlb
    cases, ignoring the marker irrelevant of the permission.

    Here the two bits are mutually exclusive.  For example, when installing a
    poisoned entry it must not be UFFD_WP already (this is checked via
    pte_none() before such an install).  It also means that if UFFD_WP is
    set, there must be no POISON bit set.  It makes sense because UFFD_WP is
    a bit that reflects permission, and permissions do not apply if the pte
    is poisoned and destined to sigbus.

    So here we simply check whether the uffd_wp bit is set first, and do
    nothing otherwise.
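
    In code, the check is roughly (a hedged sketch of the logic described
    above):

        if (is_pte_marker(entry)) {
                pte_marker marker = pte_marker_get(pte_to_swp_entry(entry));

                if (marker & PTE_MARKER_UFFD_WP) {
                        /* apply or remove write protection as requested */
                }
                /*
                 * Otherwise (e.g. PTE_MARKER_POISONED): leave the marker
                 * alone; it keeps delivering sigbus regardless of
                 * permissions.
                 */
        }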

    Attach the Fixes to UFFDIO_POISON work, as before that it should not be
    possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap,
    so no chance of swapin errors).

    Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
    Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
    Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: syzbot+b07c8ac8eee3d4d8440f@syzkaller.appspotmail.com
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
    Cc: <stable@vger.kernel.org>    [6.6+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:40 -04:00
Rafael Aquini e66e65400a mm: hugetlb: add huge page size param to set_huge_pte_at()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * arch/parisc/include/asm/hugetlb.h: hunks dropped (unsupported arch)
  * arch/parisc/mm/hugetlbpage.c:  hunks dropped (unsupported arch)
  * arch/riscv/include/asm/hugetlb.h: hunks dropped (unsupported arch)
  * arch/riscv/mm/hugetlbpage.c: hunks dropped (unsupported arch)
  * arch/sparc/mm/hugetlbpage.c: hunks dropped (unsupported arch)
  * mm/rmap.c: minor context conflict on the 7th hunk due to backport of
      upstream commit 322842ea3c72 ("mm/rmap: fix missing swap_free() in
      try_to_unmap() after arch_unmap_one() failed")

This patch is a backport of the following upstream commit:
commit 935d4f0c6dc8b3533e6e39346de7389a84490178
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Fri Sep 22 12:58:03 2023 +0100

    mm: hugetlb: add huge page size param to set_huge_pte_at()

    Patch series "Fix set_huge_pte_at() panic on arm64", v2.

    This series fixes a bug in arm64's implementation of set_huge_pte_at(),
    which can result in an unprivileged user causing a kernel panic.  The
    problem was triggered when running the new uffd poison mm selftest for
    HUGETLB memory.  This test (and the uffd poison feature) was merged for
    v6.5-rc7.

    Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
    (correctly this time) to get it backported to v6.5, where the issue first
    showed up.

    Description of Bug
    ==================

    arm64's huge pte implementation supports multiple huge page sizes, some of
    which are implemented in the page table with multiple contiguous entries.
    So set_huge_pte_at() needs to work out how big the logical pte is, so that
    it can also work out how many physical ptes (or pmds) need to be written.
    It previously did this by grabbing the folio out of the pte and querying
    its size.

    However, there are cases when the pte being set is actually a swap entry.
    But this also used to work fine, because for huge ptes, we only ever saw
    migration entries and hwpoison entries.  And both of these types of swap
    entries have a PFN embedded, so the code would grab that and everything
    still worked out.

    But over time, more calls to set_huge_pte_at() have been added that set
    swap entry types that do not embed a PFN.  And this causes the code to go
    bang.  The triggering case is for the uffd poison test, commit
    99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
    causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
    8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
    added in v6.5-rc7.  Although review shows that there are other call sites
    that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
    on arm64 because arm64 doesn't support UFFD WP.

    If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
    it will dereference a bad pointer in page_folio():

        static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
        {
            VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));

            return page_folio(pfn_to_page(swp_offset_pfn(entry)));
        }

    Fix
    ===

    The simplest fix would have been to revert the dodgy cleanup commit
    18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
    things have moved on, this would have required an audit of all the new
    set_huge_pte_at() call sites to see if they should be converted to
    set_huge_swap_pte_at().  As per the original intent of the change, it
    would also leave us open to future bugs when people invariably get it
    wrong and call the wrong helper.

    So instead, I've added a huge page size parameter to set_huge_pte_at().
    This means that the arm64 code has the size in all cases.  It's a bigger
    change, due to needing to touch the arches that implement the function,
    but it is entirely mechanical, so in my view, low risk.

    I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
    s390, sparc (and additionally x86_64).  I've additionally booted and run
    mm selftests against arm64, where I observe the uffd poison test is fixed,
    and there are no other regressions.

    This patch (of 2):

    In order to fix a bug, arm64 needs to be told the size of the huge page
    for which the pte is being set in set_huge_pte_at().  Provide for this by
    adding an `unsigned long sz` parameter to the function.  This follows the
    same pattern as huge_pte_clear().
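
    The resulting interface, as described in this series:

        /*
         * Callers now pass the huge page size explicitly, mirroring
         * huge_pte_clear().
         */
        void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                             pte_t *ptep, pte_t pte, unsigned long sz);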

    This commit makes the required interface modifications to the core mm as
    well as all arches that implement this function (arm64, parisc, powerpc,
    riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
    commit.

    No behavioral changes intended.

    Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
    Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
    Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>     [powerpc 8xx]
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>       [vmalloc change]
    Cc: Alexandre Ghiti <alex@ghiti.fr>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <stable@vger.kernel.org>    [6.5+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:19 -04:00
Rafael Aquini e45bf92e58 hugetlb: clear flags in tail pages that will be freed individually
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6c1419730822fe991fc15bfd7059f6872a71a7af
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Aug 22 15:30:43 2023 -0700

    hugetlb: clear flags in tail pages that will be freed individually

    hugetlb manually creates and destroys compound pages.  As such it makes
    assumptions about struct page layout.  Commit ebc1baf5c9b4 ("mm: free up a
    word in the first tail page") breaks hugetlb.  The following will fix the
    breakage.

    Link: https://lkml.kernel.org/r/20230822231741.GC4509@monkey
    Fixes: ebc1baf5c9b4 ("mm: free up a word in the first tail page")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:10 -04:00
Rafael Aquini 8500ad8619 hugetlb: add documentation for vma_kernel_pagesize()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 8cfd014efd93e9450fcd4892bbfe8b10f41e53c3
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Aug 22 18:24:59 2023 +0100

    hugetlb: add documentation for vma_kernel_pagesize()

    This is an exported symbol, so it should have kernel-doc.  Update it to
    mention folios, and point out that they might be larger than the supported
    page size for this VMA.

    Link: https://lkml.kernel.org/r/20230822172459.4190699-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:09 -04:00
Rafael Aquini 980ab30d90 mm: remove HUGETLB_PAGE_DTOR
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * mm/hugetlb.c: conflict on the 4th hunk due to out-of-order backport of
      commit d8f5f7e445f0 ("hugetlb: set hugetlb page flag before optimizing vmemmap")

This patch is a backport of the following upstream commit:
commit 9c5ccf2db04b8d7c3df363fdd4856c2b79ab2c6a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:55 2023 +0100

    mm: remove HUGETLB_PAGE_DTOR

    We can use a bit in page[1].flags to indicate that this folio belongs to
    hugetlb instead of using a value in page[1].dtors.  That lets
    folio_test_hugetlb() become an inline function like it should be.  We can
    also get rid of NULL_COMPOUND_DTOR.
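
    A minimal sketch of the resulting helper (assuming, as the series
    arranges, that the hugetlb bit lives in the first tail page's flags):

        static inline bool folio_test_hugetlb(struct folio *folio)
        {
                /* the bit sits in page[1].flags, so only large folios */
                return folio_test_large(folio) &&
                       test_bit(PG_hugetlb, folio_flags(folio, 1));
        }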

    Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:50 -04:00
Rafael Aquini a70f0dc41c mm: convert free_huge_page() to free_huge_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 454a00c40a21c59e99c526fe8cc57bd029cf8f0e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:51 2023 +0100

    mm: convert free_huge_page() to free_huge_folio()

    Pass a folio instead of the head page to save a few instructions.  Update
    the documentation, at least in English.
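
    A hedged sketch of the conversion pattern at a call site that still
    holds a page (not an exact hunk from the patch):

        /* before */
        free_huge_page(page);

        /* after: resolve the head page once, then pass the folio */
        free_huge_folio(page_folio(page));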

    Link: https://lkml.kernel.org/r/20230816151201.3655946-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:47 -04:00
Rafael Aquini 107b563fb0 mm/hugetlb.c: use helper macro K()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6c1aa2d37f7677609c74a4ff120f99a07b90ba08
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Fri Aug 4 09:25:59 2023 +0800

    mm/hugetlb.c: use helper macro K()

    Use helper macro K() to improve code readability.  No functional
    modification involved.
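
    For reference, K() converts a page count to kilobytes; a sketch of the
    macro and its use (the log message is illustrative, not from the patch):

        /* pages -> KiB: PAGE_SHIFT is log2(page size), 10 is log2(1024) */
        #define K(x) ((x) << (PAGE_SHIFT - 10))

        /* instead of open-coding "nr_pages << (PAGE_SHIFT - 10)": */
        pr_info("HugeTLB memory: %lu kB\n", K(nr_pages));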

    Link: https://lkml.kernel.org/r/20230804012559.2617515-8-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:57 -04:00
Rafael Aquini 4f9ef33c45 mm: hugetlb: use flush_hugetlb_tlb_range() in move_hugetlb_page_tables()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit f720b471fdb35619402293dcd421761fb1942e27
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue Aug 1 10:31:44 2023 +0800

    mm: hugetlb: use flush_hugetlb_tlb_range() in move_hugetlb_page_tables()

    Archs may need to do special things when flushing hugepage tlb, so use the
    more applicable flush_hugetlb_tlb_range() instead of flush_tlb_range().
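
    The substitution itself, sketched (shape of the hunk, not a verbatim
    quote):

        /* before: generic flush; may miss arch-specific hugepage needs */
        flush_tlb_range(vma, old_end - len, old_end);

        /* after: let the arch apply its hugepage TLB flushing */
        flush_hugetlb_tlb_range(vma, old_end - len, old_end);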

    Link: https://lkml.kernel.org/r/20230801023145.17026-2-wangkefeng.wang@huawei.com
    Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Barry Song <21cnbao@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Kalesh Singh <kaleshsingh@google.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:11 -04:00
Rafael Aquini 37d87ab0c0 mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 4ec31152a80d83d74d231d964703a721236244ef
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:03 2023 +0100

    mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()

    Handle a little more of the page fault path outside the mmap sem.  The
    hugetlb path doesn't need to check whether the VMA is anonymous; the
    VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
    performance change from the previous commit; this is simply a step to ease
    bisection of any problems.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:36 -04:00
Rafael Aquini 5629abccbc mm/hugetlb: get rid of page_hstate()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit affd26b1fbd67fceea70d9ceac40ff4815afbeb5
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Jul 19 11:41:45 2023 -0700

    mm/hugetlb: get rid of page_hstate()

    Convert the last page_hstate() user to use folio_hstate() so page_hstate()
    can be safely removed.
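
    A sketch of the conversion pattern (not the exact call site):

        /* before: page_hstate() hides a compound_head() lookup */
        struct hstate *h = page_hstate(page);

        /* after: resolve the folio explicitly, then query its hstate */
        struct hstate *h = folio_hstate(page_folio(page));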

    Link: https://lkml.kernel.org/r/20230719184145.301911-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:13 -04:00
Rafael Aquini 2097286f41 mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 8a13897fb0daa8f56821f263f0c63661e1c6acae
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jul 7 14:55:37 2023 -0700

    mm: userfaultfd: support UFFDIO_POISON for hugetlbfs

    The behavior here is the same as it is for anon/shmem.  This is done
    separately because hugetlb pte marker handling is a bit different.
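
    A hedged userspace sketch of the resulting ioctl (the helper name and
    error handling are illustrative; the fields follow the uapi added by
    this series):

        #include <sys/ioctl.h>
        #include <linux/userfaultfd.h>

        /* poison a registered range so later faults report a memory error */
        static long poison_range(int uffd, unsigned long start,
                                 unsigned long len)
        {
                struct uffdio_poison args = {
                        .range = { .start = start, .len = len },
                        .mode  = 0,
                };

                if (ioctl(uffd, UFFDIO_POISON, &args))
                        return -1;
                return args.updated;  /* filled in by the kernel */
        }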

    Link: https://lkml.kernel.org/r/20230707215540.2324998-6-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:06 -04:00
Rafael Aquini 4b5fb83182 mm: make PTE_MARKER_SWAPIN_ERROR more general
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit af19487f00f34ff8643921d7909dbb3fedc7e329
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jul 7 14:55:33 2023 -0700

    mm: make PTE_MARKER_SWAPIN_ERROR more general

    Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
    v4.

    This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
    for a detailed description of the feature.

    This patch (of 8):

    Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
    UFFDIO_POISON, so make some various preparations for that:

    First, rename it to just PTE_MARKER_POISONED.  The "SWAPIN" can be
    confusing since we're going to re-use it for something not really related
    to swap.  This can be particularly confusing for things like hugetlbfs,
    which doesn't support swap whatsoever.  Also rename various helper
    functions.

    Next, fix pte marker copying for hugetlbfs.  Previously, it would WARN on
    seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
    But, since we're going to re-use it, we want it to go ahead and copy it
    just like non-hugetlbfs memory does today.  Since the code to do this is
    more complicated now, pull it out into a helper which can be re-used in
    both places.  While we're at it, also make it slightly more explicit in
    its handling of e.g.  uffd wp markers.

    For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
    error entry, return VM_FAULT_HWPOISON.  For most cases this change doesn't
    matter, e.g.  a userspace program would receive a SIGBUS either way.  But
    for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
    box, instead of giving a SIGBUS to the hypervisor and requiring it to
    somehow inject an MCE.

    Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
    VM_FAULT_HWPOISON_LARGE in such cases.  Note that this can't happen today
    because the lack of swap support means we'll never end up with such a PTE
    anyway, but this behavior will be needed once such entries *can* show up
    via UFFDIO_POISON.
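
    A minimal sketch of the fault-side behavior described above (names and
    placement simplified; not the exact upstream hunks):

        /* non-hugetlb fault on a poisoned marker: report a memory error,
         * so e.g. a KVM guest can get an MCE instead of a plain SIGBUS */
        if (marker & PTE_MARKER_POISONED)
                return VM_FAULT_HWPOISON;

        /* hugetlb fault: same idea, with the huge-page-sized variant */
        if (marker & PTE_MARKER_POISONED)
                return VM_FAULT_HWPOISON_LARGE;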

    Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
    Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:03 -04:00
Rafael Aquini 83c1915e9e mm/gup: retire follow_hugetlb_page()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * include/linux/hugetlb.h: minor context conflict on the 2nd hunk
      due to out-of-order backport of upstream commit 2820b0f09be9
      ("hugetlbfs: close race between MADV_DONTNEED and page fault")

This patch is a backport of the following upstream commit:
commit 4849807114b83e1897381ed3f851632f376a0b7e
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jun 28 17:53:08 2023 -0400

    mm/gup: retire follow_hugetlb_page()

    Now __get_user_pages() should be well prepared to handle thp completely,
    as well as hugetlb gup requests, even without hugetlb's special path.

    Time to retire follow_hugetlb_page().

    Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal.

    Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A . Shutemov <kirill@shutemov.name>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:29 -04:00
Rafael Aquini 8f0c01beae mm/hugetlb: add page_mask for hugetlb_follow_page_mask()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 5502ea44f5ade35d32a397353956bc026b870400
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jun 28 17:53:05 2023 -0400

    mm/hugetlb: add page_mask for hugetlb_follow_page_mask()

    follow_page() doesn't need it, but we'll start to need it when unifying
    gup for hugetlb.

    Link: https://lkml.kernel.org/r/20230628215310.73782-4-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A . Shutemov <kirill@shutemov.name>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:27 -04:00
Rafael Aquini fd1e4bd434 mm/hugetlb: prepare hugetlb_follow_page_mask() for FOLL_PIN
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * minor difference in the 3rd hunk because we missed a merge
    commit fixup to the comment block (commit e2ca6ba6ba01),
    which gets folded in this backport to avoid future hiccups.

This patch is a backport of the following upstream commit:
commit 458568c92953dee3716234711f1a2830a35261f3
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jun 28 17:53:04 2023 -0400

    mm/hugetlb: prepare hugetlb_follow_page_mask() for FOLL_PIN

    follow_page() doesn't use FOLL_PIN, and hugetlb doesn't seem to be a
    target of FOLL_WRITE either.  However, add the checks anyway.

    Namely, check either for the need to CoW due to a missing write bit, or
    for proper unsharing on !AnonExclusive pages over R/O pins, rejecting
    the follow page in those cases.  That brings this function closer to
    follow_hugetlb_page().

    So we didn't care before, and still don't for now.  But we'll care once
    slow-gup is switched over to use hugetlb_follow_page_mask().  We'll
    also then care about returning -EMLINK properly, as that's the
    gup-internal API meaning "we should unshare".  It's not really needed
    for the follow page path, though.

    While at it, switch try_grab_page() to use WARN_ON_ONCE(), to make
    clear that it should just never fail.  When an error happens, capture
    the errno instead of setting page==NULL.
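
    The error-capture part, sketched (shape of the hunk, not verbatim):

        ret = try_grab_page(page, flags);
        if (WARN_ON_ONCE(ret)) {
                /* should never fail; propagate the errno, not NULL */
                page = ERR_PTR(ret);
                goto out;
        }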

    Link: https://lkml.kernel.org/r/20230628215310.73782-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A . Shutemov <kirill@shutemov.name>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:26 -04:00
Rafael Aquini 348c15e7d9 mm/hugetlb: handle FOLL_DUMP well in follow_page_mask()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit dd767aaa2fc8f1a000df0504f6231afcafe8a8e9
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jun 28 17:53:03 2023 -0400

    mm/hugetlb: handle FOLL_DUMP well in follow_page_mask()

    Patch series "mm/gup: Unify hugetlb, speed up thp", v4.

    Hugetlb has a special path for slow gup in which follow_page_mask() is
    actually skipped completely, along with faultin_page().  That is not
    only confusing, but also duplicates a lot of logic that generic gup
    already has, making hugetlb slightly special.

    This patchset tries to dedup the logic, by first touching up the slow
    gup code so it can handle hugetlb pages correctly with the current
    follow page and faultin routines (we're mostly there already; we did
    try to optimize thp 10 years ago, but only got halfway; more below),
    and then dropping the special path in the last patch, so hugetlb gup
    always goes through the generic routine too, via faultin_page().

    Note that hugetlb is still special for gup, mostly due to the pgtable
    walking (hugetlb_walk()) that we rely on, which is currently per-arch.
    But this is still one small step forward, and the diffstat is perhaps
    proof enough that the change is worthwhile.

    Then for the "speed up thp" side: as a side effect, when I'm looking at
    the chunk of code, I found that thp support is actually partially done.
    It doesn't mean that thp won't work for gup, but as long as **pages
    pointer passed over, the optimization will be skipped too.  Patch 6 should
    address that, so for thp we now get full speed gup.

    For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10"
    gives me 13992.50us -> 378.50us.  Gup_test is an extreme case, but just to
    show how it affects thp gups.

    This patch (of 8):

    Firstly, no_page_table() is meaningless for hugetlb, where it is a
    no-op, because a hugetlb page always satisfies:

      - vma_is_anonymous() == false
      - vma->vm_ops->fault != NULL

    So we can already safely remove it in hugetlb_follow_page_mask(), along
    with the page* variable.
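
    For reference, a condensed sketch of no_page_table() showing why it is
    a no-op here (comments are ours):

        static struct page *no_page_table(struct vm_area_struct *vma,
                                          unsigned int flags)
        {
                /* only anonymous or fault-less VMAs take the -EFAULT
                 * branch; a hugetlbfs VMA is neither, so for hugetlb
                 * this always returns NULL */
                if ((flags & FOLL_DUMP) &&
                    (vma_is_anonymous(vma) || !vma->vm_ops->fault))
                        return ERR_PTR(-EFAULT);
                return NULL;
        }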

    Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
    dump: we try to fault in the page only if the page cache is already
    allocated.  Let's do the same here for follow_page_mask() on hugetlb.

    It should so far have zero effect on real dumps, because those still go
    into follow_hugetlb_page().  But it may start to influence follow_page()
    users who mimic a "dump page" scenario, hopefully in a good way.  This
    also paves the way for unifying the hugetlb gup-slow path.

    Link: https://lkml.kernel.org/r/20230628215310.73782-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20230628215310.73782-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A . Shutemov <kirill@shutemov.name>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:25 -04:00
Waiman Long 18cf897167 hugetlb: memcg: account hugetlb-backed memory in memory controller
JIRA: https://issues.redhat.com/browse/RHEL-56023
Conflicts: A context diff in alloc_hugetlb_folio() hunk of mm/hugetlb.c
	   due to the presence of a later upstream commit b76b46902c2d
	   ("mm/hugetlb: fix missing hugetlb_lock for resv uncharge").

commit 8cba9576df601c384abd334a503c3f6e1e29eefb
Author: Nhat Pham <nphamcs@gmail.com>
Date:   Fri, 6 Oct 2023 11:46:28 -0700

    hugetlb: memcg: account hugetlb-backed memory in memory controller

    Currently, hugetlb memory usage is not accounted for in the memory
    controller, which could lead to memory overprotection for cgroups with
    hugetlb-backed memory.  This has been observed in our production system.

    For instance, here is one of our usecases: suppose there are two 32G
    containers.  The machine is booted with hugetlb_cma=6G, and each container
    may or may not use up to 3 gigantic page, depending on the workload within
    it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
    limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
    difficult to configure memory.max to keep overall consumption, including
    anon, cache, slab etc.  fair.

    What we have had to resort to is constantly polling hugetlb usage and
    readjusting memory.max.  A similar procedure is done for other memory
    limits (e.g. memory.low).  However, this is rather cumbersome and buggy.
    Furthermore, when there is a delay in correcting the memory limits (e.g.
    when hugetlb usage changes between consecutive runs of the userspace
    agent), the system could be in an over/underprotected state.

    This patch rectifies this issue by charging the memcg when the hugetlb
    folio is utilized, and uncharging when the folio is freed (analogous to
    the hugetlb controller).  Note that we do not charge when the folio is
    allocated to the hugetlb pool, because at this point it is not owned by
    any memcg.

    Some caveats to consider:
      * This feature is only available on cgroup v2.
      * There is no hugetlb pool management involved in the memory
        controller. As stated above, hugetlb folios are only charged towards
        the memory controller when it is used. Host overcommit management
        has to consider it when configuring hard limits.
      * Failure to charge towards the memcg results in SIGBUS. This could
        happen even if the hugetlb pool still has pages (but the cgroup
        limit is hit and reclaim attempt fails).
      * When this feature is enabled, hugetlb pages contribute to memory
        reclaim protection. low, min limits tuning must take into account
        hugetlb memory.
      * Hugetlb pages utilized while this option is not selected will not
        be tracked by the memory controller (even if cgroup v2 is remounted
        later on).

    Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
    Signed-off-by: Nhat Pham <nphamcs@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Frank van der Linden <fvdl@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-09-30 09:46:59 -04:00
Rafael Aquini 01e4afce24 mm/hugetlb: fix pgtable lock on pmd sharing
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 349d1670008d3dab99a11b015bef51ad3f26fb4f
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Jun 12 12:04:20 2023 -0400

    mm/hugetlb: fix pgtable lock on pmd sharing

    Huge pmd sharing operates on the PUD, not the PMD; huge_pte_lock() is
    not suitable in this case because it should only work for last-level
    pte changes, while pmd sharing is always one level higher.

    Meanwhile, here we're locking over the spte pgtable lock, which is not
    even a lock for the current mm but someone else's.

    It even seems racy to operate on the lock: after put_page() of the spte
    pgtable page, the page can logically be released, so at the very least
    the spin_unlock() needs to be done before the put_page().

    There is no report I am aware of.  I'm not even sure whether it would
    just work when taking the spte pmd lock: while we're holding the i_mmap
    read lock, the vma interval tree is probably frozen, so all pte
    allocators over this pud entry could always find the specific svma and
    spte page, and maybe they would serialize on this spte page lock.  Even
    so, that doesn't seem to be the expected behavior.  It just seems to be
    an accident of cb900f4121.

    Fix it with the proper pud lock (which is the mm's page_table_lock).
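
    A sketch of the locking change in huge_pmd_share() (shape only, not the
    verbatim hunk):

        /* before: last-level pte lock of someone else's spte pgtable page */
        ptl = huge_pte_lock(hstate, mm, spte);

        /* after: the pud is protected by the mm's page_table_lock */
        spin_lock(&mm->page_table_lock);
        if (pud_none(*pud)) {
                pud_populate(mm, pud,
                             (pmd_t *)((unsigned long)spte & PAGE_MASK));
                mm_inc_nr_pmds(mm);
        } else {
                put_page(virt_to_page(spte));
        }
        spin_unlock(&mm->page_table_lock);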

    Link: https://lkml.kernel.org/r/20230612160420.809818-1-peterx@redhat.com
    Fixes: cb900f4121 ("mm, hugetlb: convert hugetlbfs to use split pmd lock")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:53 -04:00
Rafael Aquini 2358f80408 mm/folio: avoid special handling for order value 0 in folio_set_order
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit e3b7bf972d632288ccad95b116628e3141be676e
Author: Tarun Sahu <tsahu@linux.ibm.com>
Date:   Fri Jun 9 21:59:07 2023 +0530

    mm/folio: avoid special handling for order value 0 in folio_set_order

    folio_set_order(folio, 0) is used in the kernel in two places:
    __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio.
    Currently, it is called to clear out the folio->_folio_nr_pages and
    folio->_folio_order fields.

    For __destroy_compound_gigantic_folio:
    In the past, folio_set_order(folio, 0) was needed because page->mapping
    used to overlap with _folio_nr_pages and _folio_order.  So if these
    fields were left uncleared when freeing gigantic hugepages, they caused
    "BUG: bad page state" due to a non-zero page->mapping.  Now, after
    commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to
    CMA"), page->mapping is explicitly cleared out for tail pages.  Also,
    _folio_order and _folio_nr_pages no longer overlap with page->mapping.

    So, folio_set_order(folio, 0) can be removed from freeing gigantic
    folio path (__destroy_compound_gigantic_folio).

    In the other place, folio_set_order(folio, 0) is called inside
    __prep_compound_gigantic_folio on the error path.  Here it can also be
    removed if we move folio_set_order(folio, order) to after the for loop.

    The patch also moves _folio_set_head call in __prep_compound_gigantic_folio()
    such that we avoid clearing them in the error path.

    Also, as Mike pointed out:
    "It would actually be better to move the calls _folio_set_head and
    folio_set_order in __prep_compound_gigantic_folio() as suggested here. Why?
    In the current code, the ref count on the 'head page' is still 1 (or more)
    while those calls are made. So, someone could take a speculative ref on the
    page BEFORE the tail pages are set up."

    This way, folio_set_order(folio, 0) is no longer needed.  It also helps
    remove the confusion of the folio order being set to 0 (as the
    _folio_order field is part of the first tail page).
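
    The resulting ordering in __prep_compound_gigantic_folio(), sketched
    (loop body elided):

        for (i = 0; i < nr_pages; i++) {
                p = folio_page(folio, i);
                /* ... freeze refcounts and prepare each tail page ... */
        }
        /* only now advertise the compound head: a speculative ref taken
         * earlier can no longer observe half-initialized tail pages */
        __folio_set_head(folio);
        folio_set_order(folio, order);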

    Testing: I have run the LTP tests, which all pass.  I have also written
    an LTP test which exercises the bug caused by compound_nr and
    page->mapping overlapping.

    https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c

    Running on an older kernel (< 5.10-rc7) with the above bug, this fails;
    on newer kernels, and also with this patch applied, it passes.

    Link: https://lkml.kernel.org/r/20230609162907.111756-1-tsahu@linux.ibm.com
    Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:41 -04:00
Rafael Aquini 961f35344e mm/hugetlb: use a folio in hugetlb_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 061e62e8180d3fab378a52d868e29ceebe2fe1d2
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Tue Jun 6 14:20:13 2023 +0800

    mm/hugetlb: use a folio in hugetlb_fault()

    We can replace seven implicit calls to compound_head() with one by using
    folio.
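
    The pattern, sketched: one explicit head lookup replaces the per-call
    compound_head() hidden inside the page APIs:

        /* before: each call re-derives the head page internally */
        get_page(page);
        unlock_page(page);

        /* after: resolve the head once, then use folio-native calls */
        struct folio *folio = page_folio(page);
        folio_get(folio);
        folio_unlock(folio);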

    [akpm@linux-foundation.org: update comment, per Sidhartha]
    Link: https://lkml.kernel.org/r/20230606062013.2947002-4-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:32 -04:00
Rafael Aquini c01b7a63dd mm/hugetlb: use a folio in hugetlb_wp()
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * mm/hugetlb.c: minor context diff on the 2nd hunk due to out-of-order backport
    of upstream commit b8a2528835b3 ("mm/hugetlb: document why hugetlb uses
    folio_mapcount() for COW reuse decisions"), and on the 8th hunk due to
    out-of-order backport of upstream commit ec8832d007cb ("mmu_notifiers:
    don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()")

This patch is a backport of the following upstream commit:
commit 959a78b6dd4526fb11d3cacf2de909479b06a4f4
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Tue Jun 6 14:20:12 2023 +0800

    mm/hugetlb: use a folio in hugetlb_wp()

    We can replace nine implicit calls to compound_head() with one by using
    old_folio.  The page we get back is always a head page, so we just
    convert old_page to old_folio.

    Link: https://lkml.kernel.org/r/20230606062013.2947002-3-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:31 -04:00
Rafael Aquini bdea6fda7a mm/hugetlb: use a folio in copy_hugetlb_page_range()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit ad27ce206af731f6854b3d8a1760c573b217e363
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Tue Jun 6 14:20:11 2023 +0800

    mm/hugetlb: use a folio in copy_hugetlb_page_range()

    Patch series "Convert several functions in hugetlb.c to use a folio", v2.

    This patch series converts three functions in hugetlb.c to use a folio,
    which can remove several implicit calls to compound_head().

    This patch (of 3):

    We can replace five implicit calls to compound_head() with one by using
    pte_folio.  The page we get back is always a head page, so we just convert
    ptepage to pte_folio.

    Link: https://lkml.kernel.org/r/20230606062013.2947002-1-zhangpeng362@huawei.com
    Link: https://lkml.kernel.org/r/20230606062013.2947002-2-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:30 -04:00
Rafael Aquini 831afe0191 mm/gup: remove vmas array from internal GUP functions
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit b2cac248191b7466c5819e0da617b0705a26e197
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Wed May 17 20:25:48 2023 +0100

    mm/gup: remove vmas array from internal GUP functions

    Now we have eliminated all callers to GUP APIs which use the vmas
    parameter, eliminate it altogether.

    This eliminates a class of bugs where vmas might have been kept around
    longer than the mmap_lock and thus we need not be concerned about locks
    being dropped during this operation leaving behind dangling pointers.

    This simplifies the GUP API and makes it considerably clearer as to its
    purpose - follow flags are applied and if pinning, an array of pages is
    returned.
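
    A sketch of the external API once the series lands (the vmas argument
    is gone):

        /* before */
        long get_user_pages(unsigned long start, unsigned long nr_pages,
                            unsigned int gup_flags, struct page **pages,
                            struct vm_area_struct **vmas);

        /* after: flags are applied and, if pinning, pages are returned */
        long get_user_pages(unsigned long start, unsigned long nr_pages,
                            unsigned int gup_flags, struct page **pages);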

    Link: https://lkml.kernel.org/r/6811b4b2b4b3baf3dd07f422bb18853bb2cd09fb.1684350871.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Janosch Frank <frankja@linux.ibm.com>
    Cc: Jarkko Sakkinen <jarkko@kernel.org>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:38 -04:00
Lucas Zampieri 921df80dee Merge: Update MM Selftests for 9.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4786

Update the mm kselftests to include new tests and fixes. This update is larger than usual due to having to backport a lot of build-related changes in `tools/testing/selftests/kselftest*`.

Omitted-fix: f1227dc7d0411ee9a9faaa1e80cfd9d6e5d6d63e

Omitted-fix: a52540522c9541bfa3e499d2edba7bc0ca73a4ca

Omitted-fix: 2bfed7d2ffa5d86c462d3e2067f2832eaf8c04c7

JIRA: https://issues.redhat.com/browse/RHEL-39306

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-21 12:51:18 +00:00
Lucas Zampieri 2424e8e040 Merge: mm: follow up work for the MM v6.4 update and disable CONFIG_PER_VMA_LOCK until it is fixed
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4749

JIRA: https://issues.redhat.com/browse/RHEL-48221  
  
It was identified that our process for bringing in code-base updates
has been unwittingly missing some of the peripheral commits that do not
directly touch the core code under the mm/ directory.
While most of these identified peripheral commits are simple
and basic clean-ups, some are relevant changesets that might end
up causing real (and subtle) issues for RHEL deployments if they
remain missing.

The intent of this patchset is to close the aforementioned gap
by bringing in the missing peripheral commits from v5.14 up to
v6.4, which is the level at which we're parking our codebase for
RHEL-9.5.

A secondary intent of this patchset is to bring in upstream's
v6.5 commit that disables the PER_VMA_LOCK feature, which was
recently introduced (to RHEL-9.5) but was marked BROKEN upstream
circa release v6.5, in order to avoid the reported issues with
memory corruption in upstream builds.
  
Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-06 14:21:52 +00:00
Nico Pache 5100562ec1 mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions
commit b8a2528835b31718286e7436529917e1f521bf6f
Author: David Hildenbrand <david@redhat.com>
Date:   Thu May 2 10:52:59 2024 +0200

    mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions

    Let's document why hugetlb still uses folio_mapcount() and is prone to
    leaking memory between processes, for example using vmsplice() that still
    uses FOLL_GET.

    More details can be found in [1], especially around how hugetlb pages
    cannot really be overcommitted, and why we don't particularly care about
    these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.

    [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/

    Link: https://lkml.kernel.org/r/20240502085259.103784-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-39306
Signed-off-by: Nico Pache <npache@redhat.com>
2024-08-02 10:13:28 -06:00
Scott Weaver d637f3c51d Merge: hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4827

JIRA: https://issues.redhat.com/browse/RHEL-38605

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Scott Weaver <scweaver@redhat.com>
2024-08-02 10:38:24 -04:00
Aristeu Rozanski 286db241d9 hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
JIRA: https://issues.redhat.com/browse/RHEL-38605
Tested: by reporter

commit 003af997c8a945493859dd1a2d015cc9387ff27a
Author: Aristeu Rozanski <aris@redhat.com>
Date:   Fri Jun 21 15:00:50 2024 -0400

    hugetlb: force allocating surplus hugepages on mempolicy allowed nodes

    When trying to allocate a hugepage with no reserved ones free, it may be
    allowed in case a number of overcommit hugepages was configured (using
    /proc/sys/vm/nr_overcommit_hugepages) and that number wasn't reached.
    This allows for a behavior of having extra hugepages allocated
    dynamically, if there are resources for it.  Some sysadmins even prefer not
    reserving any hugepages and setting a big number of overcommit hugepages.

    But while attempting to allocate overcommit hugepages on a multi-node
    system (either NUMA or mempolicy/cpuset), said allocations might
    randomly fail even when there are resources available for the
    allocation.

    This happens because allowed_mems_nr() only accounts for the number of
    free hugepages in the nodes the current process belongs to, while the
    surplus hugepage allocation can be satisfied from any node.  In case
    one or more of the requested surplus hugepages are allocated on a
    different node, the whole allocation will fail due to allowed_mems_nr()
    returning a lower value.

    So allocate surplus hugepages in one of the nodes the current process
    belongs to.

    Easy way to reproduce this issue is to use a 2+ NUMA nodes system:

            # echo 0 >/proc/sys/vm/nr_hugepages
            # echo 1 >/proc/sys/vm/nr_overcommit_hugepages
            # numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2

    Repeating the execution of map_hugetlb test application will eventually
    fail when the hugepage ends up allocated in a different node.

    [aris@ruivo.org: v2]
      Link: https://lkml.kernel.org/r/20240701212343.GG844599@cathedrallabs.org
    Link: https://lkml.kernel.org/r/20240621190050.mhxwb65zn37doegp@redhat.com
    Signed-off-by: Aristeu Rozanski <aris@redhat.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Aristeu Rozanski <aris@ruivo.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Vishal Moola <vishal.moola@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-07-26 15:21:04 -04:00
Rafael Aquini 17829a0682 mm: hugetlb: move hugeltb sysctls to its own file
JIRA: https://issues.redhat.com/browse/RHEL-48221
Conflicts:
    * kernel/sysctl.c: minor context diff due to out-of-order backport for
      commit e95d372c4cd4 ("mm: page_alloc: move sysctls into it own fils")
    * mm/hugetlb.c: minor context diff due to out-of-order backport for
      commit 263b899802fc ("hugetlb: make hugetlb_cma_check() static")

This patch is a backport of the following upstream commit:
commit 962de54828c5bb100eb8de881b79ed9c8b7468c0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 9 20:20:11 2023 +0800

    mm: hugetlb: move hugeltb sysctls to its own file

    This moves all hugetlb sysctls to their own file, and also kills a
    useless hugetlb_treat_movable_handler() definition.
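
    A sketch of the resulting registration (table contents elided):

        static struct ctl_table hugetlb_table[] = {
                /* nr_hugepages, nr_overcommit_hugepages, ... */
                { }
        };

        static void __init hugetlb_sysctl_init(void)
        {
                register_sysctl_init("vm", hugetlb_table);
        }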

    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:29:59 -04:00
Rafael Aquini 80c2409b20 mm: update mmap_sem comments to refer to mmap_lock
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit 8651a137e62ebfde3df95cbb1ca055d013ec5b9e
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Sat Jan 7 00:00:05 2023 +0000

    mm: update mmap_sem comments to refer to mmap_lock

    The rename from mm->mmap_sem to mm->mmap_lock was performed in commit
    da1c55f1b2 ("mmap locking API: rename mmap_sem to mmap_lock") and commit
    c1e8d7c6a7 ("map locking API: convert mmap_sem comments"), however some
    incorrect comments remain.

    This patch simply corrects those comments which are obviously incorrect
    within mm itself.

    Link: https://lkml.kernel.org/r/33fba04389ab63fc4980e7ba5442f521df6dc657.1673048927.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:29:53 -04:00
Lucas Zampieri a8eba3341d Merge: CVE-2024-36028: mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4402

```
mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()
    
    commit 52ccdde16b6540abe43b6f8d8e1e1ec90b0983af
    Author: Miaohe Lin <linmiaohe@huawei.com>
    Date:   Fri Apr 19 16:58:19 2024 +0800
```

CVE: CVE-2024-36028

JIRA: https://issues.redhat.com/browse/RHEL-39710

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-07-03 13:11:05 +00:00
Nico Pache 70bfb6749c mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()
commit 52ccdde16b6540abe43b6f8d8e1e1ec90b0983af
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Apr 19 16:58:19 2024 +0800

    mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

    When I did memory failure tests recently, the below warning occurred:

    DEBUG_LOCKS_WARN_ON(1)
    WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
    Modules linked in: mce_inject hwpoison_inject
    CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    RIP: 0010:__lock_acquire+0xccb/0x1ca0
    RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
    RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
    RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
    RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
    R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
    R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
    FS:  00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
    Call Trace:
     <TASK>
     lock_acquire+0xbe/0x2d0
     _raw_spin_lock_irqsave+0x3a/0x60
     hugepage_subpool_put_pages.part.0+0xe/0xc0
     free_huge_folio+0x253/0x3f0
     dissolve_free_huge_page+0x147/0x210
     __page_handle_poison+0x9/0x70
     memory_failure+0x4e6/0x8c0
     hard_offline_page_store+0x55/0xa0
     kernfs_fop_write_iter+0x12c/0x1d0
     vfs_write+0x380/0x540
     ksys_write+0x64/0xe0
     do_syscall_64+0xbc/0x1d0
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7ff9f3114887
    RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
    RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
    RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
    R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
     </TASK>
    Kernel panic - not syncing: kernel: panic_on_warn set ...
    CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:
     <TASK>
     panic+0x326/0x350
     check_panic_on_warn+0x4f/0x50
     __warn+0x98/0x190
     report_bug+0x18e/0x1a0
     handle_bug+0x3d/0x70
     exc_invalid_op+0x18/0x70
     asm_exc_invalid_op+0x1a/0x20
    RIP: 0010:__lock_acquire+0xccb/0x1ca0
    RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
    RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
    RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
    RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
    R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
    R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
     lock_acquire+0xbe/0x2d0
     _raw_spin_lock_irqsave+0x3a/0x60
     hugepage_subpool_put_pages.part.0+0xe/0xc0
     free_huge_folio+0x253/0x3f0
     dissolve_free_huge_page+0x147/0x210
     __page_handle_poison+0x9/0x70
     memory_failure+0x4e6/0x8c0
     hard_offline_page_store+0x55/0xa0
     kernfs_fop_write_iter+0x12c/0x1d0
     vfs_write+0x380/0x540
     ksys_write+0x64/0xe0
     do_syscall_64+0xbc/0x1d0
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7ff9f3114887
    RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
    RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
    RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
    R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
     </TASK>

    After git bisecting and digging into the code, I believe the root cause
    is that the _deferred_list field of the folio is unioned with the
    _hugetlb_subpool field.  In __update_and_free_hugetlb_folio(),
    folio->_deferred_list is initialized, corrupting
    folio->_hugetlb_subpool when the folio is hugetlb.  Later
    free_huge_folio() uses _hugetlb_subpool and the above warning happens.

    But it is assumed that the hugetlb flag must have been cleared when
    calling folio_put() in update_and_free_hugetlb_folio().  This
    assumption is broken by the race below:

    CPU1                                    CPU2
    dissolve_free_huge_page                 update_and_free_pages_bulk
     update_and_free_hugetlb_folio           hugetlb_vmemmap_restore_folios
                                              folio_clear_hugetlb_vmemmap_optimized
      clear_flag = folio_test_hugetlb_vmemmap_optimized
      if (clear_flag) <-- False, it's already cleared.
       __folio_clear_hugetlb(folio) <-- Hugetlb is not cleared.
      folio_put
       free_huge_folio <-- free_the_page is expected.
                                             list_for_each_entry()
                                              __folio_clear_hugetlb <-- Too late.

    Fix this issue by checking whether folio is hugetlb directly instead of
    checking clear_flag to close the race window.
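
    A sketch of the fix in __update_and_free_hugetlb_folio() (shape only):

        /* before: trust a flag snapshot taken before the race window */
        if (clear_flag)
                __folio_clear_hugetlb(folio);

        /* after: re-check the folio itself, under hugetlb_lock */
        if (folio_test_hugetlb(folio)) {
                spin_lock_irq(&hugetlb_lock);
                __folio_clear_hugetlb(folio);
                spin_unlock_irq(&hugetlb_lock);
        }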

    Link: https://lkml.kernel.org/r/20240419085819.1901645-1-linmiaohe@huawei.com
    Fixes: 32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

CVE: CVE-2024-36028
JIRA: https://issues.redhat.com/browse/RHEL-39710
Signed-off-by: Nico Pache <npache@redhat.com>
2024-06-17 20:55:20 -06:00