Commit Graph

1010 Commits

Author SHA1 Message Date
Audra Mitchell 52c4085ea0 mm: allow multiple error returns in try_grab_page()
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Major context differences, as both of the following commits were merged upstream
    at the same time (see the merge commit for details):
    57a196a58421 ("hugetlb: simplify hugetlb handling in follow_page_mask")
    0f0892356fa1 ("mm: allow multiple error returns in try_grab_page()")
    e2ca6ba6ba01 ("Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")

    Essentially, commit 0f0892356fa1 references functions follow_huge_pmd_pte and
    follow_huge_pud, which were merged into one function (hugetlb_follow_page_mask)
    in commit 57a196a58421. Because both of these commits were accepted around the
    same time, please refer to how upstream resolved the conflicts in commit
    e2ca6ba6ba01.

This patch is a backport of the following upstream commit:
commit 0f0892356fa174bdd8bd655c820ee3658c4c9f01
Author: Logan Gunthorpe <logang@deltatee.com>
Date:   Fri Oct 21 11:41:08 2022 -0600

    mm: allow multiple error returns in try_grab_page()

    In order to add checks for P2PDMA memory into try_grab_page(), expand
    the error return from a bool to an int/error code. Update all the
    callsites to handle the change in usage.

    Also remove the WARN_ON_ONCE() call at the callsites, since there
    already is a WARN_ON_ONCE() inside the function if it fails.

    Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221021174116.7200-2-logang@deltatee.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:52 -04:00
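
To illustrate the calling-convention change above, here is a minimal sketch of a GUP-style call site before and after the conversion; the surrounding caller is hypothetical, only the try_grab_page() prototype change is taken from the commit description:

    /* Before: boolean success/failure, callers picked their own errno */
    bool try_grab_page(struct page *page, unsigned int flags);

    if (unlikely(!try_grab_page(page, flags))) {
            page = ERR_PTR(-ENOMEM);
            goto out;
    }

    /* After: an int error code, so the upcoming P2PDMA checks can report a
     * specific error instead of a bare failure */
    int try_grab_page(struct page *page, unsigned int flags);

    ret = try_grab_page(page, flags);
    if (unlikely(ret)) {
            page = ERR_PTR(ret);
            goto out;
    }
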
Chris von Recklinghausen 80a5bb5a00 hugetlb: remove duplicate mmu notifications
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 369258ce41c6d7663a7b6d509356fecad577378d
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:07 2022 -0800

    hugetlb: remove duplicate mmu notifications

    The common hugetlb unmap routine __unmap_hugepage_range performs mmu
    notification calls.  However, in the case where __unmap_hugepage_range is
    called via __unmap_hugepage_range_final, mmu notification calls are
    performed earlier in other calling routines.

    Remove mmu notification calls from __unmap_hugepage_range.  Add
    notification calls to the only other caller: unmap_hugepage_range.
    unmap_hugepage_range is called for truncation and hole punch, so change
    notification type from UNMAP to CLEAR as this is more appropriate.

    Link: https://lkml.kernel.org/r/20221114235507.294320-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:05 -04:00
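
A rough sketch of where the notifications live after this change; the mmu_notifier_range_init() argument list varies between kernel versions, so the calls below are schematic rather than copied from the patch:

    /* unmap_hugepage_range(): now the only caller of __unmap_hugepage_range()
     * that still has to notify itself, using CLEAR for truncation/hole punch */
    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
                            start, end);
    mmu_notifier_invalidate_range_start(&range);

    __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
    /* ^ no longer performs its own mmu notifier calls */

    mmu_notifier_invalidate_range_end(&range);
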
Jerry Snitselaar feb173f234 mmu_notifiers: rename invalidate_range notifier
JIRA: https://issues.redhat.com/browse/RHEL-26541
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: flush_tlb_page_nosync has vma struct arg

commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jul 25 23:42:07 2023 +1000

    mmu_notifiers: rename invalidate_range notifier

    There are two main use cases for mmu notifiers.  One is by KVM which uses
    mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

    The other is to manage hardware TLBs which need to use the
    invalidate_range() callback because HW can establish new TLB entries at
    any time.  Hence using start/end() can lead to memory corruption as these
    callbacks happen too soon/late during page unmap.

    mmu notifier users should therefore either use the start()/end() callbacks
    or the invalidate_range() callbacks.  To make this usage clearer rename
    the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
    update documentation.

    Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrew Donnellan <ajd@linux.ibm.com>
    Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
    Cc: Frederic Barrat <fbarrat@linux.ibm.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kevin Tian <kevin.tian@intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Nicolin Chen <nicolinc@nvidia.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zhi Wang <zhi.wang.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 1af5a8109904b7f00828e7f9f63f5695b42f8215)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-02-26 15:51:24 -07:00
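
For orientation, the rename amounts to the following mmu_notifier_ops hook; the signature shown is assumed to match the old invalidate_range() callback that it replaces:

    /* Implemented only by drivers whose hardware walks the CPU page tables
     * and caches translations in a secondary TLB; called from the
     * architecture's TLB invalidation code.  KVM-style software-TLB users
     * keep pairing invalidate_range_start()/invalidate_range_end() instead. */
    void (*arch_invalidate_secondary_tlbs)(struct mmu_notifier *subscription,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end);
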
Jerry Snitselaar efb6748971 mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
JIRA: https://issues.redhat.com/browse/RHEL-26541
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff due to some commits not being backported yet such as c33c794828f2 ("mm: ptep_get() conversion"),
           and 959a78b6dd45 ("mm/hugetlb: use a folio in hugetlb_wp()").

commit ec8832d007cb7b50229ad5745eec35b847cc9120
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jul 25 23:42:06 2023 +1000

    mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

    Secondary TLBs are now invalidated from the architecture specific TLB
    invalidation functions.  Therefore there is no need to explicitly notify
    or invalidate as part of the range end functions.  This means we can
    remove mmu_notifier_invalidate_range_end_only() and some of the
    ptep_*_notify() functions.

    Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrew Donnellan <ajd@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
    Cc: Frederic Barrat <fbarrat@linux.ibm.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kevin Tian <kevin.tian@intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Nicolin Chen <nicolinc@nvidia.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zhi Wang <zhi.wang.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit ec8832d007cb7b50229ad5745eec35b847cc9120)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-02-26 15:49:51 -07:00
Scott Weaver 8c2c8bf31a Merge: DRM Backport 9.4 dependencies
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3094

JIRA: https://issues.redhat.com/browse/RHEL-1349

Depends: !2843

Depends: !3129

These are the dependencies needed for the 9.4 DRM backport.

Omitted-fix: cf683e8870bd4be0fd6b98639286700a35088660 (fix is included)

Omitted-fix: c042030aa15e9265504a034243a8cae062e900a1 (fix is included)

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Michel Dänzer <mdaenzer@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-02 12:33:47 -04:00
Paolo Bonzini 538bf6f332 mm, treewide: redefine MAX_ORDER sanely
JIRA: https://issues.redhat.com/browse/RHEL-10059

MAX_ORDER is currently defined as the number of orders the page allocator
supports: a user can ask the buddy allocator for page orders between 0 and
MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all over
the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders a user
can ask from the buddy allocator is now 0..MAX_ORDER.

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 23baf831a32c04f9a968812511540b1b3e648bf5)

[RHEL: Fix conflicts by changing MAX_ORDER - 1 to MAX_ORDER,
       ">= MAX_ORDER" to "> MAX_ORDER", etc.]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-30 09:12:37 +01:00
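
The RHEL conflict note above reduces to a mechanical pattern; an illustrative example (do_something() and requested_order are placeholders, not code from any particular file):

    /* Before: MAX_ORDER was exclusive, valid orders were 0 .. MAX_ORDER-1 */
    for (order = 0; order < MAX_ORDER; order++)
            do_something(order);
    if (requested_order >= MAX_ORDER)
            return -EINVAL;

    /* After: MAX_ORDER is inclusive, valid orders are 0 .. MAX_ORDER */
    for (order = 0; order <= MAX_ORDER; order++)
            do_something(order);
    if (requested_order > MAX_ORDER)
            return -EINVAL;
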
Mika Penttilä 7ef8f6ec98 mm: fix a few rare cases of using swapin error pte marker
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc7

commit 7e3ce3f8d2d235f916baad1582f6cf12e0319013
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Wed Dec 14 15:04:53 2022 -0500
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Jan 18 17:02:19 2023 -0800

    mm: fix a few rare cases of using swapin error pte marker

    This patch should harden commit 15520a3f0469 ("mm: use pte markers for
    swap errors") on using pte markers for swapin errors in a few corner
    cases.

    1. Propagate swapin errors across fork()s: if there're swapin errors in
       the parent mm, after fork()s the child should sigbus too when an error
       page is accessed.

    2. Fix a rare race condition in pte_marker_clear() where a uffd-wp pte
       marker can be quickly switched to a swapin error.

    3. Explicitly ignore swapin error pte markers in change_protection().

    I am mostly not worried about (2) or (3) at all, but we should still have
    them.  Case (1) is special because it can potentially cause silent data
    corruption in the child when the parent has a swapin error triggered with
    swapoff, but since swapin errors are already rare themselves it's probably
    not easy to trigger either.

    Currently there is a priority difference between the uffd-wp bit and the
    swapin error entry, in which the swapin error always has higher priority
    (e.g.  we don't need to wr-protect a swapin error pte marker).

    If there will be a 3rd bit introduced, we'll probably need to consider a
    more involved approach so we may need to start operate on the bits.  Let's
    leave that for later.

    This patch is tested with case (1) explicitly where we'll get corrupted
    data before in the child if there's existing swapin error pte markers, and
    after patch applied the child can be rightfully killed.

    We don't need to copy stable for this one since 15520a3f0469 just landed
    as part of v6.2-rc1, only "Fixes" applied.

    Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
    Fixes: 15520a3f0469 ("mm: use pte markers for swap errors")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Pengfei Xu <pengfei.xu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-30 07:03:06 +02:00
Chris von Recklinghausen d384489054 mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit eec20426d48bd7b63c69969a793943ed1a99b731
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:48 2023 +0000

    mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()

    Calling this 'mapcount' is confusing since mapcount is usually the number
    of times something is mapped; instead this is the number of mapped pages.
    It's also better to enforce that this is a folio rather than a head page.

    Move folio_nr_pages_mapped() into mm/internal.h since this is not
    something we want device drivers or filesystems poking at.  Get rid of
    folio_subpages_mapcount_ptr() and use folio->_nr_pages_mapped directly.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen fe5f50def7 mm: remove folio_pincount_ptr() and head_compound_pincount()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 94688e8eb453e616098cb930e5f6fed4a6ea2dfa
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:47 2023 +0000

    mm: remove folio_pincount_ptr() and head_compound_pincount()

    We can use folio->_pincount directly, since all users are guarded by tests
    of compound/large.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen 40638b50bc mm/hugetlb: introduce hugetlb_walk()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9c67a20704e763f9cb8cd262c3e45de7bd2816bc
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Dec 16 10:52:29 2022 -0500

    mm/hugetlb: introduce hugetlb_walk()

    huge_pte_offset() is the main walker function for hugetlb pgtables.  The
    name does not really represent what it does, though.

    Instead of renaming it, introduce a wrapper function called hugetlb_walk()
    which will use huge_pte_offset() inside.  Assert on the locks when walking
    the pgtable.

    Note, the vma lock assertion will be a no-op for private mappings.

    Document the last special case in the page_vma_mapped_walk() path where we
    don't need any more lock to call hugetlb_walk().

    Taking the vma lock there is not needed because either: (1) potential
    callers of hugetlb pvmw hold i_mmap_rwsem already (from one rmap_walk()),
    or (2) the caller will not walk a hugetlb vma at all, so the hugetlb code
    path is not reachable (e.g. in ksm or uprobe paths).

    That lock requirement is slightly implicit for future
    page_vma_mapped_walk() callers.  But if one day this rule breaks, lockdep
    will produce a straightforward warning in hugetlb_walk(), so there will be
    a way out.

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:46 -04:00
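
A condensed sketch of the wrapper being introduced; the real helper checks, via lockdep, that either the hugetlb vma lock or i_mmap_rwsem is held, which is abbreviated to a single assertion here:

    /* hugetlb_walk(): huge_pte_offset() plus a locking assertion, so every
     * hugetlb pgtable walk documents what protects it against pmd unsharing. */
    static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
                                      unsigned long addr, unsigned long sz)
    {
            hugetlb_vma_assert_locked(vma);  /* effectively a no-op for private vmas */
            return huge_pte_offset(vma->vm_mm, addr, sz);
    }
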
Chris von Recklinghausen 342b235b99 mm/hugetlb: make follow_hugetlb_page() safe to pmd unshare
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit eefc7fa53608920203a1402ecf7255ecfa8bb030
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Dec 16 10:52:23 2022 -0500

    mm/hugetlb: make follow_hugetlb_page() safe to pmd unshare

    Since follow_hugetlb_page() walks the pgtable, it needs the vma lock to
    make sure the pgtable page will not be freed concurrently.

    Link: https://lkml.kernel.org/r/20221216155223.2043727-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:45 -04:00
Chris von Recklinghausen 18455d905f mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7d049f3a03ea705522210d70b9d3e223ef86d663
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Dec 16 10:52:19 2022 -0500

    mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare

    Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
    to make sure the pgtable page will not be freed concurrently.

    Link: https://lkml.kernel.org/r/20221216155219.2043714-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:45 -04:00
Chris von Recklinghausen 20b7a6fe2d mm/hugetlb: move swap entry handling into vma lock when faulted
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit fcd48540d188876c917a377d81cd24c100332a62
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Dec 16 10:50:55 2022 -0500

    mm/hugetlb: move swap entry handling into vma lock when faulted

    In hugetlb_fault(), there used to be a special path to handle swap entries
    at the entrance using huge_pte_offset().  That's unsafe because
    huge_pte_offset() for a pmd-sharable range can access freed pgtables
    without any lock to protect the pgtable from being freed after pmd
    unshare.

    Here the simplest solution to make it safe is to move the swap handling to
    be after the vma lock being held.  We may need to take the fault mutex on
    either migration or hwpoison entries now (also the vma lock, but that's
    really needed), however neither of them is hot path.

    Note that the vma lock cannot be released in hugetlb_fault() when the
    migration entry is detected, because in migration_entry_wait_huge() the
    pgtable page will be used again (by taking the pgtable lock), so that also
    needs to be protected by the vma lock.  Modify migration_entry_wait_huge()
    so that it must be called with the vma read lock held, and properly release
    the lock in __migration_entry_wait_huge().

    Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:44 -04:00
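
The control-flow change, as a hand-written outline (function names from the hugetlb fault path of that era; details such as the fault mutex and hwpoison handling are omitted):

    /* hugetlb_fault(), after the change: take the vma lock first, then walk */
    hugetlb_vma_lock_read(vma);
    ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
    entry = huge_ptep_get(ptep);

    if (unlikely(is_hugetlb_entry_migration(entry))) {
            /* migration_entry_wait_huge() must now be entered with the vma
             * read lock held; it drops the lock itself before sleeping. */
            migration_entry_wait_huge(vma, ptep);
            return 0;
    }
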
Chris von Recklinghausen ff598ff493 mm/hugetlb: don't wait for migration entry during follow page
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bb373dce2c7b473023f9e69f041a22d81171b71a
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Dec 16 10:50:53 2022 -0500

    mm/hugetlb: don't wait for migration entry during follow page

    That's what the code does with !hugetlb pages, so we should logically do
    the same for hugetlb, so migration entry will also be treated as no page.

    This is probably also the last piece in follow_page code that may sleep,
    the last one should be removed in cf994dd8af27 ("mm/gup: remove
    FOLL_MIGRATION", 2022-11-16).

    Link: https://lkml.kernel.org/r/20221216155100.2043537-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:44 -04:00
Chris von Recklinghausen 809793f9b9 hugetlb: update vma flag check for hugetlb vma lock
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 379c2e60e82ff71510a949033bf8431f39f66c75
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Dec 12 15:50:42 2022 -0800

    hugetlb: update vma flag check for hugetlb vma lock

    The check for whether a hugetlb vma lock exists partially depends on the
    vma's flags.  Currently, it checks for either VM_MAYSHARE or VM_SHARED.
    The reason both flags are used is that VM_MAYSHARE was previously
    cleared in hugetlb vmas as they were torn down.  This is no longer the
    case, and only the VM_MAYSHARE check is required.

    Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
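
In effect, the test that decides whether a hugetlb vma carries a vma lock drops one flag; roughly (the real helper also requires vm_private_data to be set, as shown):

    /* Before */
    return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) && vma->vm_private_data;

    /* After: VM_MAYSHARE alone suffices, now that it is no longer cleared
     * while a hugetlb vma is being torn down */
    return vma->vm_flags & VM_MAYSHARE && vma->vm_private_data;
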
Chris von Recklinghausen 653ae76632 mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
Conflicts: mm/userfaultfd.c - RHEL-only patch
	8e95bedaa1a ("mm: Fix CVE-2022-2590 by reverting "mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"")
	causes a merge conflict with this patch. Since upstream commit
	5535be309971 ("mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW")
	actually fixes the CVE, we can safely remove the conflicted lines
	and replace them with the lines the upstream version of this
	patch adds.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1eb1bacfba9019823b2fce42383f010cd561fa6
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Dec 14 15:15:33 2022 -0500

    mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()

    This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.

    The reasons I still think this patch is worthwhile, are:

      (1) It is a cleanup already; diffstat tells.

      (2) It just feels natural after I thought about this, if the pte is uffd
          protected, let's remove the write bit no matter what it was.

      (3) Since x86 is the only arch that supports uffd-wp, it also redefines
          pte|pmd_mkuffd_wp() in that it should always contain removals of
          write bits.  It means any future arch that wants to implement uffd-wp
          should naturally follow this rule too.  It's good to make it a
          default, even if with vm_page_prot changes on VM_UFFD_WP.

      (4) It covers more than vm_page_prot.  So no chance of any potential
          future "accident" (like pte_mkdirty() on sparc64 or loongarch, even
          though it just got its pte_mkdirty fixed <1 month ago).  It'll be
          fairly clear when reading the code too that we don't worry about
          the write bit's prior state before a pte_mkuffd_wp().

    We may call pte_wrprotect() one more time in some paths (e.g. thp split),
    but that should be a fully local bitop instruction, so the overhead should
    be negligible.

    Although this patch should logically also fix all the recently known
    uffd-wp issues on page migration (not for numa hint recovery - that may
    need another explicit pte_wrprotect), this is not the plan for that fix.
    So no fixes, and stable doesn't need this.

    Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ives van Hoorne <ives@codesandbox.io>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
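
On x86, the only architecture with uffd-wp support at the time, the rule described above amounts to folding the write-protect into the marker helper itself; roughly:

    /* Before: only set the software uffd-wp bit */
    static inline pte_t pte_mkuffd_wp(pte_t pte)
    {
            return pte_set_flags(pte, _PAGE_UFFD_WP);
    }

    /* After: a uffd-wp pte is, by definition, also write-protected */
    static inline pte_t pte_mkuffd_wp(pte_t pte)
    {
            return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
    }
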
Chris von Recklinghausen 8e0969ab45 mm: move folio_set_compound_order() to mm/internal.h
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04a42e72d77a93a166b79c34b7bc862f55a53967
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Dec 14 22:17:57 2022 -0800

    mm: move folio_set_compound_order() to mm/internal.h

    folio_set_compound_order() is moved to an mm-internal location so external
    folio users cannot misuse this function.  Change the name of the function
    to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON.  Also,
    handle the case if a non-large folio is passed and add clarifying comments
    to the function.

    Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/
    Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com
    Fixes: 9fd330582b2f ("mm: add folio dtor and order setter functions")
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Muchun Song <songmuchun@bytedance.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Suggested-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
Chris von Recklinghausen dc21656712 hugetlb: really allocate vma lock for all sharable vmas
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e700898fa075c69b3ae02b702ab57fb75e1a82ec
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Dec 12 15:50:41 2022 -0800

    hugetlb: really allocate vma lock for all sharable vmas

    Commit bbff39cc6cbc ("hugetlb: allocate vma lock for all sharable vmas")
    removed the pmd sharable checks in the vma lock helper routines.  However,
    it left the functional version of helper routines behind #ifdef
    CONFIG_ARCH_WANT_HUGE_PMD_SHARE.  Therefore, the vma lock is not being
    used for sharable vmas on architectures that do not support pmd sharing.
    On these architectures, a potential fault/truncation race is exposed that
    could leave pages in a hugetlb file past i_size until the file is removed.

    Move the functional vma lock helpers outside the ifdef, and remove the
    non-functional stubs.  Since the vma lock is not just for pmd sharing,
    rename the routine __vma_shareable_flags_pmd.

    Link: https://lkml.kernel.org/r/20221212235042.178355-1-mike.kravetz@oracle.com
    Fixes: bbff39cc6cbc ("hugetlb: allocate vma lock for all sharable vmas")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:38 -04:00
Chris von Recklinghausen e546340977 mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c45bc55a99957b20e4e0333bcd42e12d1833a7f5
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Mon Dec 12 14:55:29 2022 -0800

    mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio

    folio_set_compound_order() checks if the passed in folio is a large folio.
    A large folio is indicated by the PG_head flag.  Call __folio_set_head()
    before setting the order.

    Link: https://lkml.kernel.org/r/20221212225529.22493-1-sidhartha.kumar@oracle.com
    Fixes: d1c6095572d0 ("mm/hugetlb: convert hugetlb prep functions to folios")
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reported-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:36 -04:00
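
The fix is a pure ordering change inside __prep_compound_gigantic_folio(); schematically:

    /* folio_set_compound_order() sanity-checks that the folio is large, and
     * "large" is signalled by PG_head, so the head flag has to go in first. */
    __folio_set_head(folio);
    folio_set_compound_order(folio, order);   /* previously ran before the head flag */
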
Chris von Recklinghausen c6d772b121 mm/hugetlb: change hugetlb allocation functions to return a folio
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 19fc1a7e8b2b3b0e18fbea84ee26517e1b0f1a6e
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:39 2022 -0800

    mm/hugetlb: change hugetlb allocation functions to return a folio

    Many hugetlb allocation helper functions have now been converted to
    folios; update their higher level callers to be compatible with folios.
    alloc_pool_huge_page is reorganized to avoid a smatch warning reporting
    that the folio variable is uninitialized.

    [sidhartha.kumar@oracle.com: update alloc_and_dissolve_hugetlb_folio comments]
      Link: https://lkml.kernel.org/r/20221206233512.146535-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20221129225039.82257-11-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Suggested-by: John Hubbard <jhubbard@nvidia.com>
    Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:34 -04:00
Chris von Recklinghausen 7e650ba2b1 mm/hugetlb: convert hugetlb prep functions to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d1c6095572d0cf00c0cd30378639ff9387b34edd
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:38 2022 -0800

    mm/hugetlb: convert hugetlb prep functions to folios

    Convert prep_new_huge_page() and __prep_compound_gigantic_page() to
    folios.

    Link: https://lkml.kernel.org/r/20221129225039.82257-10-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:33 -04:00
Chris von Recklinghausen f381670865 mm/hugetlb: convert free_gigantic_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7f325a8d25631e68cd75afaeaf330187e45e0eb5
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:37 2022 -0800

    mm/hugetlb: convert free_gigantic_page() to folios

    Convert callers of free_gigantic_page() to use folios, function is then
    renamed to free_gigantic_folio().

    Link: https://lkml.kernel.org/r/20221129225039.82257-9-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:33 -04:00
Chris von Recklinghausen 3d85e464e2 mm/hugetlb: convert enqueue_huge_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 240d67a86ecb0fa18863821a0cb55783ad50ef30
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:36 2022 -0800

    mm/hugetlb: convert enqueue_huge_page() to folios

    Convert callers of enqueue_huge_page() to pass in a folio, function is
    renamed to enqueue_hugetlb_folio().

    Link: https://lkml.kernel.org/r/20221129225039.82257-8-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:33 -04:00
Chris von Recklinghausen 9a4125ce96 mm/hugetlb: convert add_hugetlb_page() to folios and add hugetlb_cma_folio()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 2f6c57d696abcd2d27d07b8506d5e6bcc060e77a
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:35 2022 -0800

    mm/hugetlb: convert add_hugetlb_page() to folios and add hugetlb_cma_folio()

    Convert add_hugetlb_page() to take in a folio, also convert
    hugetlb_cma_page() to take in a folio.

    Link: https://lkml.kernel.org/r/20221129225039.82257-7-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:32 -04:00
Chris von Recklinghausen 1814b3b531 mm/hugetlb: convert update_and_free_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d6ef19e25df2aa50f932a78c368d7bb710eaaa1b
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:34 2022 -0800

    mm/hugetlb: convert update_and_free_page() to folios

    Make more progress on converting the free_huge_page() destructor to
    operate on folios by converting update_and_free_page() to folios.

    Link: https://lkml.kernel.org/r/20221129225039.82257-6-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:32 -04:00
Chris von Recklinghausen 4940f2d374 mm/hugetlb: convert remove_hugetlb_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit cfd5082b514765f873504cc60a50cce30738bfd3
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:33 2022 -0800

    mm/hugetlb: convert remove_hugetlb_page() to folios

    Removes page_folio() call by converting callers to directly pass a folio
    into __remove_hugetlb_page().

    Link: https://lkml.kernel.org/r/20221129225039.82257-5-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:32 -04:00
Chris von Recklinghausen b611c893df mm/hugetlb: convert dissolve_free_huge_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 1a7cdab59b22465b850501e3897a3f3aa01670d8
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:32 2022 -0800

    mm/hugetlb: convert dissolve_free_huge_page() to folios

    Removes compound_head() call by using a folio rather than a head page.

    Link: https://lkml.kernel.org/r/20221129225039.82257-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:31 -04:00
Chris von Recklinghausen 22f017224c mm/hugetlb: convert destroy_compound_gigantic_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 911565b8285381e62d3bfd0cae2889a022737c37
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:31 2022 -0800

    mm/hugetlb: convert destroy_compound_gigantic_page() to folios

    Convert page operations within __destroy_compound_gigantic_page() to the
    corresponding folio operations.

    Link: https://lkml.kernel.org/r/20221129225039.82257-3-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:31 -04:00
Chris von Recklinghausen 99c827d6e4 mm: add folio dtor and order setter functions
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9fd330582b2fe43c49ebcd02b2480f051f85aad4
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 29 14:50:30 2022 -0800

    mm: add folio dtor and order setter functions

    Patch series "convert core hugetlb functions to folios", v5.

    ============== OVERVIEW ===========================
    Now that many hugetlb helper functions that deal with hugetlb specific
    flags[1] and hugetlb cgroups[2] are converted to folios, higher level
    allocation, prep, and freeing functions within hugetlb can also be
    converted to operate in folios.

    Patch 1 of this series implements the wrapper functions around setting the
    compound destructor and compound order for a folio.  Besides the user
    added in patch 1, patch 2 and patch 9 also use these helper functions.

    Patches 2-10 convert the higher level hugetlb functions to folios.

    ============== TESTING ===========================
    LTP:
            Ran 10 back to back rounds of the LTP hugetlb test suite.

    Gigantic Huge Pages:
            Test allocation and freeing via hugeadm commands:
                    hugeadm --pool-pages-min 1GB:10
                    hugeadm --pool-pages-min 1GB:0

    Demote:
            Demote 1 1GB hugepages to 512 2MB hugepages
                    echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
                    echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
                    cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
                            # 512
                    cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
                            # 0

    [1] https://lore.kernel.org/lkml/20220922154207.1575343-1-sidhartha.kumar@oracle.com/
    [2] https://lore.kernel.org/linux-mm/20221101223059.460937-1-sidhartha.kumar@oracle.com/

    This patch (of 10):

    Add folio equivalents for set_compound_order() and
    set_compound_page_dtor().

    Also remove extra new-lines introduced by mm/hugetlb: convert
    move_hugetlb_state() to folios and mm/hugetlb_cgroup: convert
    hugetlb_cgroup_uncharge_page() to folios.

    [sidhartha.kumar@oracle.com: clarify folio_set_compound_order() zero support]
      Link: https://lkml.kernel.org/r/20221207223731.32784-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20221129225039.82257-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20221129225039.82257-2-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:31 -04:00
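
For reference, the two wrappers this patch introduces are thin folio-typed equivalents of set_compound_page_dtor() and set_compound_order(); the sketch below assumes the _folio_dtor/_folio_order/_folio_nr_pages field names of that era's struct folio:

    static inline void folio_set_compound_dtor(struct folio *folio,
                                               enum compound_dtor_id compound_dtor)
    {
            VM_BUG_ON_FOLIO(compound_dtor >= NR_COMPOUND_DTORS, folio);
            folio->_folio_dtor = compound_dtor;
    }

    static inline void folio_set_compound_order(struct folio *folio,
                                                unsigned int order)
    {
            /* order 0 is explicitly supported, see the clarification link above */
            folio->_folio_order = order;
    #ifdef CONFIG_64BIT
            folio->_folio_nr_pages = order ? 1U << order : 0;
    #endif
    }
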
Chris von Recklinghausen e1c02a97f1 mm,thp,rmap: simplify compound page mapcount handling
Conflicts:
	include/linux/mm.h - We already have
		a1554c002699 ("include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h")
		so keep declaration of nr_free_buffer_pages
	mm/huge_memory.c - We already have RHEL-only commit
		0837bdd68b ("Revert "mm: thp: stabilize the THP mapcount in page_remove_anon_compound_rmap"")
		so there is a difference in deleted code.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit cb67f4282bf9693658dbda934a441ddbbb1446df
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Nov 2 18:51:38 2022 -0700

    mm,thp,rmap: simplify compound page mapcount handling

    Compound page (folio) mapcount calculations have been different for anon
    and file (or shmem) THPs, and involved the obscure PageDoubleMap flag.
    And each huge mapping and unmapping of a file (or shmem) THP involved
    atomically incrementing and decrementing the mapcount of every subpage of
    that huge page, dirtying many struct page cachelines.

    Add subpages_mapcount field to the struct folio and first tail page, so
    that the total of subpage mapcounts is available in one place near the
    head: then page_mapcount() and total_mapcount() and page_mapped(), and
    their folio equivalents, are so quick that anon and file and hugetlb don't
    need to be optimized differently.  Delete the unloved PageDoubleMap.

    page_add and page_remove rmap functions must now maintain the
    subpages_mapcount as well as the subpage _mapcount, when dealing with pte
    mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
    NR_FILE_MAPPED statistics still needs reading through the subpages, using
    nr_subpages_unmapped() - but only when first or last pmd mapping finds
    subpages_mapcount raised (double-map case, not the common case).

    But are those counts (used to decide when to split an anon THP, and in
    vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
    quite: since page_remove_rmap() (and also split_huge_pmd()) is often
    called without page lock, there can be races when a subpage pte mapcount
    0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
    previous implementation had prevented.  The statistics might become
    inaccurate, and even drift down until they underflow through 0.  That is
    not good enough, but is better dealt with in a followup patch.

    Update a few comments on first and second tail page overlaid fields.
    hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
    subpages_mapcount and compound_pincount are already correctly at 0, so
    delete its reinitialization of compound_pincount.

    A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
    18 seconds on small pages, and used to take 1 second on huge pages, but
    now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
    used to take 860ms and now takes 92ms; mapping by pmds after mapping by
    ptes (when the scan is needed) used to take 870ms and now takes 495ms.
    But there might be some benchmarks which would show a slowdown, because
    tail struct pages now fall out of cache until final freeing checks them.

    Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:26 -04:00
Chris von Recklinghausen 311d13ef90 mm/hugetlb: convert move_hugetlb_state() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 345c62d163496ae4b5c1ce530b1588067d8f5a8b
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:59 2022 -0700

    mm/hugetlb: convert move_hugetlb_state() to folios

    Clean up unmap_and_move_huge_page() by converting move_hugetlb_state() to
    take in folios.

    [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build]
    Link: https://lkml.kernel.org/r/20221101223059.460937-10-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:25 -04:00
Chris von Recklinghausen b96436486a mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d4ab0316cc33aeedf6dcb1c2c25e097a25766132
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:57 2022 -0700

    mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios

    Continue to use a folio inside free_huge_page() by converting
    hugetlb_cgroup_uncharge_page*() to folios.

    Link: https://lkml.kernel.org/r/20221101223059.460937-8-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:25 -04:00
Chris von Recklinghausen b9544876bc mm/hugetlb: convert free_huge_page to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 0356c4b96f6890dd61af4c902f681764f4bdba09
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:56 2022 -0700

    mm/hugetlb: convert free_huge_page to folios

    Use folios inside free_huge_page(), this is in preparation for converting
    hugetlb_cgroup_uncharge_page() to take in a folio.

    Link: https://lkml.kernel.org/r/20221101223059.460937-7-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:24 -04:00
Chris von Recklinghausen 12ff8e1504 mm/hugetlb: convert isolate_or_dissolve_huge_page to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d5e33bd8c16b6f5f47665d378f078bee72b85225
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:55 2022 -0700

    mm/hugetlb: convert isolate_or_dissolve_huge_page to folios

    Removes a call to compound_head() by using a folio when operating on the
    head page of a hugetlb compound page.

    Link: https://lkml.kernel.org/r/20221101223059.460937-6-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:24 -04:00
Chris von Recklinghausen 6658973279 mm/hugetlb_cgroup: convert hugetlb_cgroup_migrate to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 29f394304f624b06fafb3cc9c3da8779f71f4bee
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:54 2022 -0700

    mm/hugetlb_cgroup: convert hugetlb_cgroup_migrate to folios

    Cleans up intermediate page to folio conversion code in
    hugetlb_cgroup_migrate() by changing its arguments from pages to folios.

    Link: https://lkml.kernel.org/r/20221101223059.460937-5-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:24 -04:00
Chris von Recklinghausen 6200fa5886 mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit de656ed376c4cb47c5713fba52f8bbfbea44f387
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:53 2022 -0700

    mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() to folios

    Allows __prep_new_huge_page() to operate on a folio by converting
    set_hugetlb_cgroup*() to take in a folio.

    Link: https://lkml.kernel.org/r/20221101223059.460937-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:23 -04:00
Chris von Recklinghausen d153aac91e mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f074732d599e19a2a5b12e54743ad5eaccbe6550
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Nov 1 15:30:52 2022 -0700

    mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() to folios

    Introduce folios in __remove_hugetlb_page() by converting
    hugetlb_cgroup_from_page() to use folios.

    Also gets rid of the unused hugetlb_cgroup_from_page_resv() function.

    Link: https://lkml.kernel.org/r/20221101223059.460937-3-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Bui Quang Minh <minhquangbui99@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:23 -04:00
Chris von Recklinghausen 0e8d7c85ff mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e591ef7d96d6ea249916f351dc26a636e565c635
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:09 2022 +0900

    mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage

    Patch series "mm, hwpoison: improve handling workload related to hugetlb
    and memory_hotplug", v7.

    This patchset tries to solve the issue among memory_hotplug, hugetlb and hwpoison.
    In this patchset, memory hotplug handles hwpoison pages like below:

      - hwpoison pages should not prevent memory hotremove,
      - memory block with hwpoison pages should not be onlined.

    This patch (of 4):

    HWPoisoned page is not supposed to be accessed once marked, but currently
    such accesses can happen during memory hotremove because
    do_migrate_range() can be called before dissolve_free_huge_pages() is
    called.

    Clear HPageMigratable for hwpoisoned hugepages to prevent them from being
    migrated.  This should be done in hugetlb_lock to avoid race against
    isolate_hugetlb().

    get_hwpoison_huge_page() needs to have a flag to show it's called from
    unpoison to take refcount of hwpoisoned hugepages, so add it.

    [naoya.horiguchi@linux.dev: remove TestClearHPageMigratable and reduce to test and clear separately]
      Link: https://lkml.kernel.org/r/20221025053559.GA2104800@ik1-406-35019.vs.sakura.ne.jp
    Link: https://lkml.kernel.org/r/20221024062012.1520887-1-naoya.horiguchi@linux.dev
    Link: https://lkml.kernel.org/r/20221024062012.1520887-2-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:18 -04:00
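The ordering point in the paragraphs above is that the migratable flag must be cleared while hugetlb_lock is held, so the change cannot race with isolate_hugetlb(). A minimal sketch of that step (kernel context assumed; not the literal patch):

        static void demo_mark_unmigratable(struct page *head)
        {
                spin_lock_irq(&hugetlb_lock);
                if (HPageMigratable(head))
                        ClearHPageMigratable(head);  /* hwpoisoned: never migrate */
                spin_unlock_irq(&hugetlb_lock);
        }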
Chris von Recklinghausen ac4694cf43 Revert "mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in"
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b12fdbf15f92b6cf5fecdd8a1855afe8809e5c58
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Oct 24 15:33:36 2022 -0400

    Revert "mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in"

    With "mm/uffd: Fix vma check on userfault for wp" fixing the
    registration, we are now safe to remove the macro hacks.

    Link: https://lkml.kernel.org/r/20221024193336.1233616-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:18 -04:00
Chris von Recklinghausen bdc3c88db4 mm/hugetlb: unify clearing of RestoreReserve for private pages
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4781593d5dbae50500d1c7975be03b590ae2b92a
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Oct 20 15:38:32 2022 -0400

    mm/hugetlb: unify clearing of RestoreReserve for private pages

    A trivial cleanup to move clearing of RestoreReserve into adding anon rmap
    of private hugetlb mappings.  It matches with the shared mappings where we
    only clear the bit when adding into page cache, rather than spreading it
    around the code paths.

    Link: https://lkml.kernel.org/r/20221020193832.776173-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:17 -04:00
Chris von Recklinghausen 3fe0d67558 hugetlb: simplify hugetlb handling in follow_page_mask
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 57a196a58421a4b0c45949ae7309f21829aaa77f
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Sun Sep 18 19:13:48 2022 -0700

    hugetlb: simplify hugetlb handling in follow_page_mask

    During discussions of this series [1], it was suggested that hugetlb
    handling code in follow_page_mask could be simplified.  At the beginning
    of follow_page_mask, there currently is a call to follow_huge_addr which
    'may' handle hugetlb pages.  ia64 is the only architecture which provides
    a follow_huge_addr routine that does not return error.  Instead, at each
    level of the page table a check is made for a hugetlb entry.  If a hugetlb
    entry is found, a call to a routine associated with that entry is made.

    Currently, there are two checks for hugetlb entries at each page table
    level.  The first check is of the form:

            if (p?d_huge())
                    page = follow_huge_p?d();

    the second check is of the form:

            if (is_hugepd())
                    page = follow_huge_pd().

    We can replace these checks, as well as the special handling routines such
    as follow_huge_p?d() and follow_huge_pd() with a single routine to handle
    hugetlb vmas.

    A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the
    beginning of follow_page_mask.  hugetlb_follow_page_mask will use the
    existing routine huge_pte_offset to walk page tables looking for hugetlb
    entries.  huge_pte_offset can be overridden by architectures, and already
    handles special cases such as hugepd entries.

    [1] https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.wang@linux.alibaba.com/

    [mike.kravetz@oracle.com: remove vma (pmd sharing) per Peter]
      Link: https://lkml.kernel.org/r/20221028181108.119432-1-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: remove left over hugetlb_vma_unlock_read()]
      Link: https://lkml.kernel.org/r/20221030225825.40872-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20220919021348.22151-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:15 -04:00
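In outline, the single new routine keys everything off huge_pte_offset(), which architectures already override for cases like hugepd. A heavily trimmed sketch of the shape of such a walk (locking, ref counting and flag handling omitted; not the upstream implementation):

        static struct page *demo_hugetlb_follow_page(struct vm_area_struct *vma,
                                                     unsigned long address)
        {
                struct hstate *h = hstate_vma(vma);
                pte_t *ptep, pte;

                /* huge_pte_offset() copes with arch-specific layouts */
                ptep = huge_pte_offset(vma->vm_mm, address, huge_page_size(h));
                if (!ptep)
                        return NULL;

                pte = huge_ptep_get(ptep);
                if (!pte_present(pte))
                        return NULL;

                /* head page plus the offset of the requested subpage */
                return pte_page(pte) +
                       ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
        }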
Chris von Recklinghausen 2e4f279847 hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04ada095dcfc4ae359418053c0be94453bdf1e84
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:06 2022 -0800

    hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

    madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
    tables associated with the address range.  For hugetlb vmas,
    zap_page_range will call __unmap_hugepage_range_final.  However,
    __unmap_hugepage_range_final assumes the passed vma is about to be removed
    and deletes the vma_lock to prevent pmd sharing as the vma is on the way
    out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
    missing vma_lock prevents pmd sharing and could potentially lead to issues
    with truncation/fault races.

    This issue was originally reported here [1] as a BUG triggered in
    page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
    vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
    prevent pmd sharing.  Subsequent faults on this vma were confused:
    VM_MAYSHARE indicates a sharable vma, but since it was no longer set,
    page_mapping was not set in new pages added to the page table.  This
    resulted in pages that
    appeared anonymous in a VM_SHARED vma and triggered the BUG.

    Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
    call from unmap_vmas().  This is used to indicate the 'final' unmapping of
    a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
    the vm_lock is not deleted.

    [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
    Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:13 -04:00
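The fix reduces to making the vma_lock teardown conditional on a flag that only unmap_vmas() passes, so the MADV_DONTNEED path keeps the lock. A rough sketch of that condition; the flag name comes from the commit text and the helper name from the other vma_lock commits in this log, so treat both as illustrative:

        /* in __unmap_hugepage_range_final(), roughly: */
        if (zap_flags & ZAP_FLAG_UNMAP) {
                /* vma is really being torn down by unmap_vmas() */
                hugetlb_vma_lock_free(vma);
        } else {
                /* MADV_DONTNEED: page tables zapped, vma_lock kept */
        }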
Chris von Recklinghausen 89b8017a38 hugetlb: fix __prep_compound_gigantic_page page flag setting
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7fb0728a9b005b8fc55e835529047cca15191031
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Fri Nov 18 11:52:49 2022 -0800

    hugetlb: fix __prep_compound_gigantic_page page flag setting

    Commit 2b21624fc232 ("hugetlb: freeze allocated pages before creating
    hugetlb pages") changed the order page flags were cleared and set in the
    head page.  It moved the __ClearPageReserved after __SetPageHead.
    However, there is a check to make sure __ClearPageReserved is never done
    on a head page.  If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
    will be hit when creating a hugetlb gigantic page:

        page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
        ------------[ cut here ]------------
        kernel BUG at include/linux/page-flags.h:500!
        Call Trace will differ depending on whether hugetlb page is created
        at boot time or run time.

    Make sure to __ClearPageReserved BEFORE __SetPageHead.

    Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
    Fixes: 2b21624fc232 ("hugetlb: freeze allocated pages before creating hugetlb pages")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Tested-by: Tarun Sahu <tsahu@linux.ibm.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Joao Martins <joao.m.martins@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:12 -04:00
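The fix is purely an ordering change inside the gigantic-page prep code: Reserved must be cleared while the page is still an ordinary page, because the debug build refuses to clear it on a compound head. A minimal sketch of the corrected order:

        /* in __prep_compound_gigantic_page(), for the head page: */
        __ClearPageReserved(page);      /* clear while it is still a plain page */
        __SetPageHead(page);            /* only then promote it to a head */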
Chris von Recklinghausen 24a1691241 hugetlb: fix memory leak associated with vma_lock structure
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 612b8a317023e1396965aacac43d80053c6e77db
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Oct 19 13:19:57 2022 -0700

    hugetlb: fix memory leak associated with vma_lock structure

    The hugetlb vma_lock structure hangs off the vm_private_data pointer of
    sharable hugetlb vmas.  The structure is vma specific and can not be
    shared between vmas.  At fork and various other times, vmas are duplicated
    via vm_area_dup().  When this happens, the pointer in the newly created
    vma must be cleared and the structure reallocated.  Two hugetlb-specific
    routines deal with this: hugetlb_dup_vma_private and hugetlb_vm_op_open.
    Both routines are called for newly created vmas.  hugetlb_dup_vma_private
    would always clear the pointer and hugetlb_vm_op_open would allocate the
    new vma_lock structure.  This did not work in the case of the calling
    sequence pointed out in [1].

      move_vma
        copy_vma
          new_vma = vm_area_dup(vma);
          new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock.
        is_vm_hugetlb_page(vma)
          clear_vma_resv_huge_pages
            hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL

    When hugetlb_dup_vma_private clears the pointer, we leak the associated
    vma_lock structure.

    The vma_lock structure contains a pointer to the associated vma.  This
    information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open
    to ensure we only clear the vm_private_data of newly created (copied)
    vmas.  In such cases, the vma->vma_lock->vma field will not point to the
    vma.

    Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear
    vm_private_data if vma->vma_lock->vma == vma.  Also, log a warning if
    hugetlb_vm_op_open ever encounters the case where vma_lock has already
    been correctly allocated for the vma.

    [1] https://lore.kernel.org/linux-mm/5154292a-4c55-28cd-0935-82441e512fc3@huawei.com/

    Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com
    Fixes: 131a79b474e9 ("hugetlb: fix vma lock handling during split vma and range unmapping")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:04 -04:00
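The back-pointer test described above is the whole fix: only treat vm_private_data as a stale copy when the lock it points at was allocated for a different vma. A minimal sketch (struct and field names as used elsewhere in this series; not the literal diff):

        static void demo_dup_vma_private(struct vm_area_struct *vma)
        {
                struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

                /* a lock copied from the old vma still points back at it */
                if (vma_lock && vma_lock->vma != vma)
                        vma->vm_private_data = NULL;    /* drop the stale copy */
        }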
Chris von Recklinghausen 62938ffdf0 mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit acfac37851e01b40c30a7afd0d93ad8db8914f25
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Fri Oct 7 12:59:20 2022 -0700

    mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static

    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:00 -04:00
Chris von Recklinghausen 58c07ff87d hugetlb: allocate vma lock for all sharable vmas
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bbff39cc6cbcb86ccfacb2dcafc79912a9f9df69
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Oct 4 18:17:07 2022 -0700

    hugetlb: allocate vma lock for all sharable vmas

    The hugetlb vma lock was originally designed to synchronize pmd sharing.
    As such, it was only necessary to allocate the lock for vmas that were
    capable of pmd sharing.  Later in the development cycle, it was discovered
    that it could also be used to simplify fault/truncation races as described
    in [1].  However, a subsequent change to allocate the lock for all vmas
    that use the page cache was never made.  A fault/truncation race could
    leave pages in a file past i_size until the file is removed.

    Remove the previous restriction and allocate lock for all VM_MAYSHARE
    vmas.  Warn in the unlikely event of allocation failure.

    [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t

    Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
    Fixes: "hugetlb: clean up code checking for fault/truncation races"
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:55 -04:00
Chris von Recklinghausen 51803c7ce0 hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit ecfbd733878da48ed03a5b8a9c301366a03e3cca
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Oct 4 18:17:06 2022 -0700

    hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer

    hugetlb file truncation/hole punch code may need to back out and take
    locks in order in the routine hugetlb_unmap_file_folio().  This code could
    race with vma freeing as pointed out in [1] and result in accessing a
    stale vma pointer.  To address this, take the vma_lock when clearing the
    vma_lock->vma pointer.

    [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/

    [mike.kravetz@oracle.com: address build issues]
      Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
    Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
    Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:55 -04:00
Chris von Recklinghausen 02174dae48 hugetlb: fix vma lock handling during split vma and range unmapping
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 131a79b474e973f023c5c75e2323a940332103be
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Oct 4 18:17:05 2022 -0700

    hugetlb: fix vma lock handling during split vma and range unmapping

    Patch series "hugetlb: fixes for new vma lock series".

    In review of the series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", Miaohe Lin pointed out two key issues:

    1) There is a race in the routine hugetlb_unmap_file_folio when locks
       are dropped and reacquired in the correct order [1].

    2) With the switch to using vma lock for fault/truncate synchronization,
       we need to make sure lock exists for all VM_MAYSHARE vmas, not just
       vmas capable of pmd sharing.

    These two issues are addressed here.  In addition, having a vma lock
    present in all VM_MAYSHARE vmas, uncovered some issues around vma
    splitting.  Those are also addressed.

    [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/

    This patch (of 3):

    The hugetlb vma lock hangs off the vm_private_data field and is specific
    to the vma.  When vm_area_dup() is called as part of vma splitting, the
    vma lock pointer is copied to the new vma.  This will result in issues
    such as double freeing of the structure.  Update the hugetlb open vm_ops
    to allocate a new vma lock for the new vma.

    The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE
    to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
    anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
    only VM_MAYSHARE was set we would miss the free.  With the introduction of
    the vma lock, a vma can not participate in pmd sharing if vm_private_data
    is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
    free the vma lock to prevent sharing.  Also, update the sharing code to
    make sure vma lock is indeed a condition for pmd sharing.
    hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.

    Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
    Fixes: "hugetlb: add vma based lock for pmd sharing"
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:55 -04:00
Chris von Recklinghausen a43bab41ba mm/hugetlb: add available_huge_pages() func
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 8346d69d8bcb6c526a0d8bd126241dff41a60723
Author: Xin Hao <xhao@linux.alibaba.com>
Date:   Thu Sep 22 10:19:29 2022 +0800

    mm/hugetlb: add available_huge_pages() func

    In hugetlb.c there are several places which compare the values of
    'h->free_huge_pages' and 'h->resv_huge_pages'.  It looks a bit messy, so
    add a new available_huge_pages() function to do this.

    Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
    Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:53 -04:00
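The helper is essentially a named comparison of the two per-hstate counters mentioned above. A minimal sketch assuming the existing field names; the exact upstream body may differ:

        static bool available_huge_pages(struct hstate *h)
        {
                /* free pages not already promised to reservations */
                return h->free_huge_pages - h->resv_huge_pages > 0;
        }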
Chris von Recklinghausen d69f8317cf hugetlb: freeze allocated pages before creating hugetlb pages
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 2b21624fc23277553ef254b3ad02c37afa1c484d
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Fri Sep 16 14:46:38 2022 -0700

    hugetlb: freeze allocated pages before creating hugetlb pages

    When creating hugetlb pages, the hugetlb code must first allocate
    contiguous pages from a low level allocator such as buddy, cma or
    memblock.  The pages returned from these low level allocators are ref
    counted.  This creates potential issues with other code taking speculative
    references on these pages before they can be transformed to a hugetlb
    page.  This issue has been addressed with methods and code such as that
    provided in [1].

    Recent discussions about vmemmap freeing [2] have indicated that it would
    be beneficial to freeze all sub pages, including the head page of pages
    returned from low level allocators before converting to a hugetlb page.
    This helps avoid races if we want to replace the page containing vmemmap
    for the head page.

    There have been proposals to change at least the buddy allocator to return
    frozen pages as described at [3].  If such a change is made, it can be
    employed by the hugetlb code.  However, as mentioned above hugetlb uses
    several low level allocators so each would need to be modified to return
    frozen pages.  For now, we can manually freeze the returned pages.  This
    is done in two places:

    1) alloc_buddy_huge_page, only the returned head page is ref counted.
       We freeze the head page, retrying once in the VERY rare case where
       there may be an inflated ref count.
    2) prep_compound_gigantic_page, for gigantic pages the current code
       freezes all pages except the head page.  New code will simply freeze
       the head page as well.

    In a few other places, code checks for inflated ref counts on newly
    allocated hugetlb pages.  With the modifications to freeze after
    allocating, this code can be removed.

    After hugetlb pages are freshly allocated, they are often added to the
    hugetlb free lists.  Since these pages were previously ref counted, this
    was done via put_page() which would end up calling the hugetlb destructor:
    free_huge_page.  With changes to freeze pages, we simply call
    free_huge_page directly to add the pages to the free list.

    In a few other places, freshly allocated hugetlb pages were immediately
    put into use, and the expectation was they were already ref counted.  In
    these cases, we must manually ref count the page.

    [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
    [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
    [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/

    [mike.kravetz@oracle.com: fix NULL pointer dereference]
      Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Joao Martins <joao.m.martins@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:48 -04:00
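The manual freeze in case (1) amounts to dropping the single reference the buddy allocator returned, backing out if a speculative reference sneaked in. A rough sketch of that step (simplified; the upstream code adds a one-shot retry):

        static struct page *demo_freeze_new_page(struct page *page,
                                                 unsigned int order)
        {
                /* the allocator handed us refcount == 1; freeze it to zero */
                if (!page_ref_freeze(page, 1)) {
                        /* lost a race with a speculative reference holder */
                        __free_pages(page, order);      /* drop our reference */
                        return NULL;                    /* caller may retry */
                }
                return page;                            /* now frozen */
        }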
Chris von Recklinghausen 87bc41cc32 mm/hugetlb: remove unnecessary 'NULL' values from pointer
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3259914f8cab1bab3fe691a90ac3c47411cb0aba
Author: XU pengfei <xupengfei@nfschina.com>
Date:   Wed Sep 14 09:21:14 2022 +0800

    mm/hugetlb: remove unnecessary 'NULL' values from pointer

    The pointer variables are assigned from the allocator before they are
    checked, so there is no need to initialize them to NULL first.

    Link: https://lkml.kernel.org/r/20220914012113.6271-1-xupengfei@nfschina.com
    Signed-off-by: XU pengfei <xupengfei@nfschina.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:31 -04:00
Chris von Recklinghausen 1a73a05b6e mm/hugetlb.c: remove unnecessary initialization of local `err'
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 8eeda55fe08944421cf57f6185fe37b069829e7b
Author: Li zeming <zeming@nfschina.com>
Date:   Mon Sep 5 10:09:18 2022 +0800

    mm/hugetlb.c: remove unnecessary initialization of local `err'

    Link: https://lkml.kernel.org/r/20220905020918.3552-1-zeming@nfschina.com
    Signed-off-by: Li zeming <zeming@nfschina.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:06 -04:00
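This and the preceding cleanup are the same pattern: the first use of the variable is an assignment, so the initializer is dead. A tiny illustration with a hypothetical callee, not the hugetlb code itself:

        static int demo_lookup(void) { return 0; }      /* stand-in callee */

        static int demo_handler(void)
        {
                int err;        /* was "int err = 0;" - the initial value was
                                   never read, the next line assigns it */

                err = demo_lookup();
                return err;
        }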
Chris von Recklinghausen b7b229594c hugetlb: add comment for subtle SetHPageVmemmapOptimized()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a9e1eab241bdaadd56b6cfdc481cff6a24c4799b
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:28 2022 +0800

    hugetlb: add comment for subtle SetHPageVmemmapOptimized()

    The SetHPageVmemmapOptimized() called here seems unnecessary as it's
    assumed to be set when calling this function.  But it is in fact cleared
    by the preceding set_page_private(page, 0) call.  Add a comment to avoid possible
    future confusion.

    Link: https://lkml.kernel.org/r/20220901120030.63318-9-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:41 -04:00
Chris von Recklinghausen 3f0895872f hugetlb: kill hugetlbfs_pagecache_page()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 29be84265fe0023cff8b5aa4fa670b55453c3afc
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:27 2022 +0800

    hugetlb: kill hugetlbfs_pagecache_page()

    Fold hugetlbfs_pagecache_page() into its sole caller to remove some
    duplicated code. No functional change intended.

    Link: https://lkml.kernel.org/r/20220901120030.63318-8-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:41 -04:00
Chris von Recklinghausen 5b2beb15b4 hugetlb: pass NULL to kobj_to_hstate() if nid is unused
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 12658abfc59ddf2ed176fce461e83f392ff18e5b
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:26 2022 +0800

    hugetlb: pass NULL to kobj_to_hstate() if nid is unused

    We can pass NULL to kobj_to_hstate() directly when nid is unused to
    simplify the code. No functional change intended.

    Link: https://lkml.kernel.org/r/20220901120030.63318-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:41 -04:00
Chris von Recklinghausen c06fadbb50 hugetlb: use helper {huge_pte|pmd}_lock()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bcc665436fe4648550ed4fd3345c7106b5da35b0
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:25 2022 +0800

    hugetlb: use helper {huge_pte|pmd}_lock()

    Use helper huge_pte_lock and pmd_lock to simplify the code. No functional
    change intended.

    Link: https://lkml.kernel.org/r/20220901120030.63318-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:40 -04:00
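The cleanup replaces the open-coded lookup-then-lock pairs with the existing wrapper. A minimal sketch (kernel context assumed):

        static void demo_lock_pte(struct hstate *h, struct mm_struct *mm,
                                  pte_t *ptep)
        {
                /* before: ptl = huge_pte_lockptr(h, mm, ptep); spin_lock(ptl); */
                spinlock_t *ptl = huge_pte_lock(h, mm, ptep);

                /* ... operate on the pte under ptl ... */
                spin_unlock(ptl);
        }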
Chris von Recklinghausen 10e265bdc4 hugetlb: use sizeof() to get the array size
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 103956805c250ccef3d2a54a0a2e3354291cdd85
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:24 2022 +0800

    hugetlb: use sizeof() to get the array size

    It's better to use sizeof() to get the array size instead of manual
    calculation. Minor readability improvement.

    Link: https://lkml.kernel.org/r/20220901120030.63318-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:40 -04:00
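The change is the standard idiom of letting the compiler derive the element count from the array definition. A tiny illustration with a hypothetical array, not the actual hugetlb table:

        static unsigned long demo_sizes[4];     /* hypothetical array */

        static void demo_reset(void)
        {
                size_t i;

                /* count derived from the definition itself;
                 * ARRAY_SIZE(demo_sizes) is the usual kernel spelling */
                for (i = 0; i < sizeof(demo_sizes) / sizeof(demo_sizes[0]); i++)
                        demo_sizes[i] = 0;
        }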
Chris von Recklinghausen b71eed3d96 hugetlb: use LIST_HEAD() to define a list head
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3466534131b28e1ee1e7cfd5c77d981ea41b20aa
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:23 2022 +0800

    hugetlb: use LIST_HEAD() to define a list head

    Use LIST_HEAD() directly to define a list head to simplify the code.
    No functional change intended.

    Link: https://lkml.kernel.org/r/20220901120030.63318-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:40 -04:00
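LIST_HEAD() folds the declaration and the INIT_LIST_HEAD() call into one statement. A minimal illustration (variable name is illustrative):

        static void demo_collect(void)
        {
                /* before:
                 *      struct list_head page_list;
                 *      INIT_LIST_HEAD(&page_list);
                 */
                LIST_HEAD(page_list);           /* declares and initialises */

                /* ... gather pages onto page_list ... */
        }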
Chris von Recklinghausen 5423daba31 hugetlb: Use helper macro SZ_1K
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c2c3a60a857bfeb154a97ee7e430fc8609c57979
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:22 2022 +0800

    hugetlb: Use helper macro SZ_1K

    Use helper macro SZ_1K to do the size conversion to make code more
    consistent in this file. Minor readability improvement.

    Link: https://lkml.kernel.org/r/20220901120030.63318-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:39 -04:00
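SZ_1K simply replaces a literal 1024 in the size conversions, keeping the unit visible. A minimal illustration (the exact call sites in hugetlb.c differ):

        static unsigned long demo_hugepage_kib(struct hstate *h)
        {
                /* before: huge_page_size(h) / 1024 */
                return huge_page_size(h) / SZ_1K;
        }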
Chris von Recklinghausen e09bdb77bc hugetlb: make hugetlb_cma_check() static
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 263b899802fc43cd0e6979f819271dfbb93c94af
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:21 2022 +0800

    hugetlb: make hugetlb_cma_check() static

    Patch series "A few cleanup patches for hugetlb", v2.

    This series contains a few cleanup patches to use helper functions to
    simplify the codes, remove unneeded nid parameter and so on. More
    details can be found in the respective changelogs.

    This patch (of 10):

    Make hugetlb_cma_check() static as it's only used inside mm/hugetlb.c.

    Link: https://lkml.kernel.org/r/20220901120030.63318-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220901120030.63318-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:39 -04:00
Jan Stancek 6577715afe Merge: KVM: Rebase KVM common and x86 to upstream 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2467

This merge request rebases common and x86 KVM code to upstream kernel 6.3
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2177720

Tested:

- kvm selftests and kvm unit tests on an AMD and on an Intel machine.
- running few real world VMs on AMD machine and on an Intel machine.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Paolo Bonzini <bonzini@gnu.org>
Approved-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Approved-by: Gavin Shan <gshan@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Eric Auger <eric.auger@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-30 14:03:00 +02:00
Maxim Levitsky 9d98595e2c mm/gup: Add FOLL_INTERRUPTIBLE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2177720

commit 93c5c61d9e58a9ea423439d358c198be5b674a58
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Oct 11 15:58:06 2022 -0400

    mm/gup: Add FOLL_INTERRUPTIBLE

    We have had FAULT_FLAG_INTERRUPTIBLE but it was never applied to GUPs.  One
    issue with it is that not all GUP paths are able to handle signal delivery
    other than SIGKILL.

    That's not ideal for the GUP users who are actually able to handle these
    cases, like KVM.

    KVM uses GUP extensively on faulting guest pages, during which we've got
    existing infrastructures to retry a page fault at a later time.  Allowing
    the GUP to be interrupted by generic signals can make KVM-related threads
    more responsive.  For example:

      (1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI,
          e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be
          generated to kick the vcpus out of kernel context immediately,

      (2) SIGINT: which can be used with interactive hypervisor users to stop a
          virtual machine with Ctrl-C without any delays/hangs,

      (3) SIGTRAP: which grants GDB capability even during page faults that are
          stuck for a long time.

    Normally the hypervisor will be able to receive these signals properly, but
    not if we're stuck in a GUP for a long time for whatever reason.  It happens
    easily with a stuck postcopy migration when e.g. a temporary network failure
    occurs; some vcpu threads can then hang forever waiting for the pages.  With
    the new FOLL_INTERRUPTIBLE, we can allow GUP users like KVM to selectively
    enable the ability to trap these signals.

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20221011195809.557016-2-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2023-06-27 08:45:04 +03:00
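For a GUP user that can cope with an interrupted fault (as KVM can), the flag is OR-ed into gup_flags and an interrupted attempt is treated as a retryable failure. A hedged sketch of a hypothetical caller, assuming the interruption is reported as -EINTR the way the KVM users check for it; this is not KVM's actual code:

        static int demo_pin_one(unsigned long addr, struct page **pagep)
        {
                long got = get_user_pages_unlocked(addr, 1, pagep,
                                        FOLL_WRITE | FOLL_INTERRUPTIBLE);

                if (got == -EINTR)
                        return -EAGAIN; /* non-fatal signal pending: retry later */
                return got == 1 ? 0 : -EFAULT;
        }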
Nico Pache 541066a441 mm: hugetlb: fix UAF in hugetlb_handle_userfault
Conflicts:
       mm/hugetlb.c: only change the return when surrounding
        conflicting areas.

commit 958f32ce832ba781ac20e11bb2d12a9352ea28fc
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 12:21:13 2022 +0800

    mm: hugetlb: fix UAF in hugetlb_handle_userfault

    The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
    and reacquire them again after handle_userfault(), but reacquire the
    vma_lock could lead to UAF[1,2] due to the following race,

    hugetlb_fault
      hugetlb_no_page
        /*unlock vma_lock */
        hugetlb_handle_userfault
          handle_userfault
            /* unlock mm->mmap_lock*/
                                               vm_mmap_pgoff
                                                 do_mmap
                                                   mmap_region
                                                     munmap_vma_range
                                                       /* clean old vma */
            /* lock vma_lock again  <--- UAF */
        /* unlock vma_lock */

    Since the vma_lock will unlock immediately after
    hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
    hugetlb_handle_userfault() to fix the issue.

    [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
    [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
    Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
    Fixes: 1a1aad8a9b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
    Reported-by: Liu Zixian <liuzixian4@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: <stable@vger.kernel.org>    [4.14+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:04 -06:00
Nico Pache c007e2df2e mm/hugetlb: fix uffd wr-protection for CoW optimization path
commit 60d5b473d61be61ac315e544fcd6a8234a79500e
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Mar 21 15:18:40 2023 -0400

    mm/hugetlb: fix uffd wr-protection for CoW optimization path

    This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
    writable even with uffd-wp bit set.  It only happens with hugetlb private
    mappings, when someone first wr-protects a missing pte (which will
    install a pte marker) and then writes to the same page without any prior
    access to it.

    Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
    reaching hugetlb_wp() to avoid taking more locks that userfault won't
    need.  However there's one CoW optimization path that can trigger
    hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap.

    This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit
    is detected.  The new path will only trigger in the CoW optimization path
    because generic hugetlb_fault() (e.g.  when a present pte was
    wr-protected) will resolve the uffd-wp bit already.  Also make sure
    anonymous UNSHARE won't be affected and can still be resolved, IOW only
    skip CoW not CoR.

    This patch will be needed for v5.19+ hence copy stable.

    [peterx@redhat.com: v2]
      Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n
    [peterx@redhat.com: v3]
      Link: https://lkml.kernel.org/r/20230324142620.2344140-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com
    Fixes: 166f3ecc0daf ("mm/hugetlb: hook page faults for uffd write protection")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:03 -06:00
Nico Pache fcf68412df hugetlb: clean up code checking for fault/truncation races
Conflict: out of order with rhel commit ec94d66baa ("hugetlbfs: don't
	  delete error page from pagecache")

commit fa27759af4a6d7494c986c44695b13bcd6eaf46b
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:10 2022 -0700

    hugetlb: clean up code checking for fault/truncation races

    With the new hugetlb vma lock in place, it can also be used to handle page
    fault races with file truncation.  The lock is taken at the beginning of
    the code fault path in read mode.  During truncation, it is taken in write
    mode for each vma which has the file mapped.  The file's size (i_size) is
    modified before taking the vma lock to unmap.

    How are races handled?

    The page fault code checks i_size early in processing after taking the vma
    lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
    is not beyond i_size the fault will continue and a new page will be added
    to the file.  It could be that truncation code modifies i_size after the
    check in fault code.  That is OK, as truncation code will soon remove the
    page.  The truncation code will wait until the fault is finished, as it
    must obtain the vma lock in write mode.

    This patch cleans up/removes late checks in the fault paths that try to
    back out pages racing with truncation.  As noted above, we just let the
    truncation code remove the pages.

    [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
      Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
    Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache 897a73840c hugetlb: use new vma_lock for pmd sharing synchronization
commit 40549ba8f8e0ed1f8b235979563f619e9aa34fdf
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:09 2022 -0700

    hugetlb: use new vma_lock for pmd sharing synchronization

    The new hugetlb vma lock is used to address this race:

    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...

    The vma_lock is used as follows:
    - During fault processing. The lock is acquired in read mode before
      doing a page table lock and allocation (huge_pte_alloc).  The lock is
      held until code is finished with the page table entry (ptep).
    - The lock must be held in write mode whenever huge_pmd_unshare is
      called.

    Lock ordering issues come into play when unmapping a page from all
    vmas mapping the page.  The i_mmap_rwsem must be held to search for the
    vmas, and the vma lock must be held before calling unmap which will
    call huge_pmd_unshare.  This is done today in:
    - try_to_migrate_one and try_to_unmap_one for page migration and memory
      error handling.  In these routines we 'try' to obtain the vma lock and
      fail to unmap if unsuccessful.  Calling routines already deal with the
      failure of unmapping.
    - hugetlb_vmdelete_list for truncation and hole punch.  This routine
      also tries to acquire the vma lock.  If it fails, it skips the
      unmapping.  However, we can not have file truncation or hole punch
      fail because of contention.  After hugetlb_vmdelete_list, truncation
      and hole punch call remove_inode_hugepages.  remove_inode_hugepages
      checks for mapped pages and call hugetlb_unmap_file_page to unmap them.
      hugetlb_unmap_file_page is designed to drop locks and reacquire in the
      correct order to guarantee unmap success.

    Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
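In practice the rules above show up as a read-lock bracket around the page-table work in the fault path, with the write lock reserved for huge_pmd_unshare() callers. A simplified sketch using the helpers this series introduces (no error handling; not the real fault code):

        static void demo_fault_prepare(struct vm_area_struct *vma,
                                       unsigned long addr)
        {
                struct mm_struct *mm = vma->vm_mm;
                struct hstate *h = hstate_vma(vma);
                pte_t *ptep;

                hugetlb_vma_lock_read(vma);     /* ptep stays valid while held */
                ptep = huge_pte_alloc(mm, vma, addr, huge_page_size(h));
                /* ... fault handling that dereferences ptep ... */
                hugetlb_vma_unlock_read(vma);
        }

The unshare side takes the same lock exclusively, via hugetlb_vma_lock_write()/hugetlb_vma_unlock_write(), before calling huge_pmd_unshare().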
Nico Pache 9735c28de1 hugetlb: add vma based lock for pmd sharing
commit 8d9bfb2608145cf3e408428c224099e1585471af
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:07 2022 -0700

    hugetlb: add vma based lock for pmd sharing

    Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
    synchronization use by vmas that could be involved in pmd sharing.  This
    data structure contains a rw semaphore that is the primary tool used for
    synchronization.

    This new structure is ref counted, so that it can exist when NOT attached
    to a vma.  This is only helpful in resolving lock ordering issues where
    code may need to obtain the vma_lock while there are no guarantees the vma
    may go away.  By obtaining a ref on the structure, it can be guaranteed
    that at least the rw semaphore will not go away.

    Only add infrastructure for the new lock here.  Actual use will be added
    in subsequent patches.

    [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
      Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
    Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
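The structure itself is small: a refcount so it can outlive the vma during lock-ordering dances, the semaphore that does the real work, and a back-pointer to the owning vma. A sketch of that layout as described above (names are illustrative; see mm/hugetlb.c for the real definition):

        struct demo_hugetlb_vma_lock {
                struct kref refs;               /* structure may outlive the vma */
                struct rw_semaphore rw_sema;    /* read: faults, write: unshare */
                struct vm_area_struct *vma;     /* owner; hangs off
                                                   vma->vm_private_data */
        };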
Nico Pache f5b0c47ca2 hugetlb: rename vma_shareable() and refactor code
commit 12710fd696343a0d6c318bdad22fa7809af7859b
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:06 2022 -0700

    hugetlb: rename vma_shareable() and refactor code

    Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
    checking a specific address within the vma.  Refactor code to check if an
    aligned range is shareable as this will be needed in a subsequent patch.

    Link: https://lkml.kernel.org/r/20220914221810.95771-6-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache d6b2e808b0 hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
Conflict: out of order with RHEL commit ec94d66baa ("hugetlbfs: don't
	delete error page from pagecache")

commit 7e1813d48dd30e6c6f235f6661d1bc108fcab528
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:04 2022 -0700

    hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache

    remove_huge_page removes a hugetlb page from the page cache.  Change to
    hugetlb_delete_from_page_cache as it is a more descriptive name.
    huge_add_to_page_cache is global in scope, but only deals with hugetlb
    pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

    Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache 56a9d706ab hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
commit 3a47c54f09c4c89128d8f67d49296b1c25b317d0
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:03 2022 -0700

    hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization

    Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") added code to take i_mmap_rwsem in read mode for the
    duration of fault processing.  However, this has been shown to cause
    performance/scaling issues.  Revert the code and go back to only taking
    the semaphore in huge_pmd_share during the fault path.

    Keep the code that takes i_mmap_rwsem in write mode before calling
    try_to_unmap as this is required if huge_pmd_unshare is called.

    NOTE: Reverting this code does expose the following race condition.

    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...
    ptl = huge_pte_lock(ptep)
    get/update pte
    set_pte_at(pte, ptep)

    It is unknown if the above race was ever experienced by a user.  It was
    discovered via code inspection when initially addressed.

    In subsequent patches, a new synchronization mechanism will be added to
    coordinate pmd sharing and eliminate this race.

    Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache b319f1136b mm: hugetlb: eliminate memory-less nodes handling
Conflicts:
       mm/hugetlb.c: add only the hugetlb_sysfs_init in conflict

commit a4a00b451ef5e1deb959088e25e248f4ee399792
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Wed Sep 14 15:26:03 2022 +0800

    mm: hugetlb: eliminate memory-less nodes handling

    The memory-notify-based approach aims to handle memory-less nodes; however,
    it just adds complexity to the code, as pointed out by David in thread [1].
    The handling of memory-less nodes was introduced by commit 4faf8d950e
    ("hugetlb: handle memory hot-plug events").  From its commit message, we
    cannot find any necessity for handling this case.  So, we can simply
    register/unregister sysfs entries in register_node/unregister_node to
    simplify the code.

    BTW, the hotplug callback was added because hugetlb_register_all_nodes()
    registers sysfs nodes only for N_MEMORY nodes; see commit 9b5e5d0fdc,
    which said it was a preparation for handling memory-less nodes via memory
    hotplug.  Since we want to remove the memory hotplug handling, make sure we
    only register per-node sysfs for online (N_ONLINE) nodes in
    hugetlb_register_all_nodes().

    https://lore.kernel.org/linux-mm/60933ffc-b850-976c-78a0-0ee6e0ea9ef0@redhat.com/ [1]
    Link: https://lkml.kernel.org/r/20220914072603.60293-3-songmuchun@bytedance.com
    Suggested-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache e73467a7bd mm: hugetlb: simplify per-node sysfs creation and removal
commit b958d4d08fbfe938af24ea06ebbf839b48fa18a9
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Wed Sep 14 15:26:02 2022 +0800

    mm: hugetlb: simplify per-node sysfs creation and removal

    Patch series "simplify handling of per-node sysfs creation and removal",
    v4.

    This patch (of 2):

    The following commit offloaded per-node sysfs creation and removal to a
    kworker and did not say why that is needed.  It also said "I don't know
    that this is absolutely required", so it seems the author was not sure
    either.  Since it only complicates the code, this patch reverts the
    changes to simplify the code.

      39da08cb07 ("hugetlb: offload per node attribute registrations")

    We could use a memory hotplug notifier to do per-node sysfs creation and
    removal instead of inserting those operations into node registration and
    unregistration.  That would reduce the code coupling between node.c and
    hugetlb.c, and it would also simplify the code.

    Link: https://lkml.kernel.org/r/20220914072603.60293-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20220914072603.60293-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache 003b1ff24f hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
commit 188a39725ad7ded2d13e752a1a620152b0750175
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:02 2022 -0700

    hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race

    Patch series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", v2.

    hugetlb fault scalability regressions have recently been reported [1].
    This is not the first such report, as regressions were also noted when
    commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") was added [2] in v5.7.  At that time, a proposal to
    address the regression was suggested [3] but went nowhere.

    The regression and benefit of this patch series is not evident when
    using the vm_scalability benchmark reported in [2] on a recent kernel.
    Results from running,
    "./usemem -n 48 --prealloc --prefault -O -U 3448054972"

                            48 sample Avg
    next-20220913           next-20220913                   next-20220913
    unmodified      revert i_mmap_sema locking      vma sema locking, this series
    -----------------------------------------------------------------------------
    498150 KB/s             501934 KB/s                     504793 KB/s

    The recent regression report [1] notes page fault and fork latency of
    shared hugetlb mappings.  To measure this, I created two simple programs:
    1) map a shared hugetlb area, write fault all pages, unmap area
       Do this in a continuous loop to measure faults per second
    2) map a shared hugetlb area, write fault a few pages, fork and exit
       Do this in a continuous loop to measure forks per second
    These programs were run on a 48 CPU VM with 320GB memory.  The shared
    mapping size was 250GB.  For comparison, a single instance of the program
    was run.  Then, multiple instances were run in parallel to introduce
    lock contention.  Changing the locking scheme results in a significant
    performance benefit.

    test            instances       unmodified      revert          vma
    --------------------------------------------------------------------------
    faults per sec  1               393043          395680          389932
    faults per sec  24               71405           81191           79048
    forks per sec   1                 2802            2747            2725
    forks per sec   24                 439             536             500
    Combined faults 24                1621           68070           53662
    Combined forks  24                 358              67             142

    Combined test is when running both faulting program and forking program
    simultaneously.

    Patches 1 and 2 of this series revert c0d0381ade and 87bf91d39b which
    depends on c0d0381ade.  Acquisition of i_mmap_rwsem is still required in
    the fault path to establish pmd sharing, so this is moved back to
    huge_pmd_share.  With c0d0381ade reverted, this race is exposed:

    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...
    ptl = huge_pte_lock(ptep)
    get/update pte
    set_pte_at(pte, ptep)

    Reverting 87bf91d39b exposes races in page fault/file truncation.  When
    the new vma lock is put to use in patch 8, this will handle the fault/file
    truncation races.  This is explained in patch 9 where code associated with
    these races is cleaned up.

    Patches 3 - 5 restructure existing code in preparation for using the new
    vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
    this semaphore will be held in read mode for the duration of fault
    processing, and held in write mode for unmap operations which may call
    huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
    synchronize huge pmd sharing.  However it is only required in the fault
    path when setting up sharing, and will be acquired in huge_pmd_share().

    Patch 6 adds the new vma lock and all supporting routines, but does not
    actually change code to use the new lock.

    Patch 7 refactors code in preparation for using the new lock.  And, patch
    8 finally adds code to make use of this new vma lock.  Unfortunately, the
    fault code and truncate/hole punch code would naturally take locks in the
    opposite order which could lead to deadlock.  Since the performance of
    page faults is more important, the truncation/hole punch code is modified
    to back out and take locks in the correct order if necessary.

    [1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
    [2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
    [3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

    This patch (of 9):

    Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") added code to take i_mmap_rwsem in read mode for the
    duration of fault processing.  The use of i_mmap_rwsem to prevent
    fault/truncate races depends on this.  However, this has been shown to
    cause performance/scaling issues.  As a result, that code will be
    reverted.  Since the use of i_mmap_rwsem to address page fault/truncate races
    depends on this, it must also be reverted.

    In a subsequent patch, code will be added to detect the fault/truncate
    race and back out operations as required.

    Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:00 -06:00
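
For reference, a minimal user-space sketch of test program (1) described in the
entry above (write-fault all pages of a shared hugetlb mapping in a loop); the
mapping size, iteration count, and the 2 MB hugepage size are illustrative
assumptions, not part of the original test:

```
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define HPAGE_SIZE (2UL << 20)          /* assumed 2 MB hugepages */
#define MAP_SIZE   (64 * HPAGE_SIZE)    /* 128 MB, illustrative   */

int main(void)
{
	unsigned long faults = 0;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int iter = 0; iter < 100; iter++) {
		/* Shared hugetlb mapping, as in the described test. */
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap(MAP_HUGETLB)");
			return 1;
		}
		/* Write-fault every hugepage once, then tear it all down. */
		for (unsigned long off = 0; off < MAP_SIZE; off += HPAGE_SIZE) {
			p[off] = 1;
			faults++;
		}
		munmap(p, MAP_SIZE);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.0f write faults/sec\n", faults / secs);
	return 0;
}
```
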
Nico Pache 5c428506be hugetlb: remove meaningless BUG_ON(huge_pte_none())
commit 5e6b1bf1b5c3c8b72fa2eea9d8731d5c59773945
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 1 20:00:29 2022 +0800

    hugetlb: remove meaningless BUG_ON(huge_pte_none())

    When the code reaches here, an invalid page would already have been
    accessed if the huge pte were none, so this BUG_ON(huge_pte_none()) is
    meaningless.  Remove it.

    Link: https://lkml.kernel.org/r/20220901120030.63318-10-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:00 -06:00
Nico Pache d9678e29b9 mm: use nth_page instead of mem_map_offset mem_map_next
commit 14455eabd8404a503dc8e80cd8ce185e96a94b22
Author: Cheng Li <lic121@chinatelecom.cn>
Date:   Fri Sep 9 07:31:09 2022 +0000

    mm: use nth_page instead of mem_map_offset mem_map_next

    To handle the discontiguous case, mem_map_next() has a parameter named
    `offset`.  As a caller, one would be confused why "get next entry" needs a
    parameter named "offset".  The other drawback of mem_map_next() is that
    callers must take care of the mapping between the "iter" and "offset"
    parameters, otherwise we may get a hole or duplication during iteration.
    So use nth_page instead of mem_map_next.

    And replace mem_map_offset with nth_page() per Matthew's comments.

    Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
    Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
    Fixes: 69d177c2fc ("hugetlbfs: handle pages higher order than MAX_ORDER")
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:00 -06:00
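
A loose user-space analogue (not kernel code) of why direct nth-entry indexing,
in the spirit of nth_page(), is less error-prone than an iterator plus a
separately tracked offset; the struct and function names are made up:

```
#include <stdio.h>

struct page { unsigned long pfn; };   /* stand-in, not the kernel struct page */

/* Direct indexing from a base entry, in the spirit of nth_page(). */
static struct page *nth_item(struct page *base, unsigned long n)
{
	return base + n;
}

int main(void)
{
	struct page pages[8];

	for (unsigned long i = 0; i < 8; i++)
		pages[i].pfn = 0x1000 + i;

	/*
	 * The caller tracks a single index; there is no separate iter/offset
	 * pair that can drift apart and produce a hole or a duplicated entry
	 * during iteration.
	 */
	for (unsigned long i = 0; i < 8; i++)
		printf("entry %lu -> pfn 0x%lx\n", i, nth_item(pages, i)->pfn);

	return 0;
}
```
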
Nico Pache 643137feaf mm/hugetlb: make detecting shared pte more reliable
commit 3aa4ed8040e1535d95c03cef8b52cf11bf0d8546
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 16 21:05:53 2022 +0800

    mm/hugetlb: make detecting shared pte more reliable

    If the pagetables are shared, we shouldn't copy or take references.  Since
    src could have unshared while dst shares with another vma, huge_pte_none()
    is used to determine whether dst_pte is shared.  But this check isn't
    reliable: a shared pte could in fact be pte_none in the pagetable.  The
    page count of the ptep page should be checked instead in order to reliably
    determine whether the pte is shared.

    [lukas.bulwahn@gmail.com: remove unused local variable dst_entry in copy_hugetlb_page_range()]
      Link: https://lkml.kernel.org/r/20220822082525.26071-1-lukas.bulwahn@gmail.com
    Link: https://lkml.kernel.org/r/20220816130553.31406-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:10:59 -06:00
Nico Pache e4a806b0b9 mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
commit 01088a603660dfa240ba3331f0a49a3e9797cad1
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 16 21:05:52 2022 +0800

    mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()

    The sysfs groups per_node_hstate_attr_group and, when h->demote_order != 0,
    hstate_demote_attr_group are created in hugetlb_register_node().  But
    these sysfs groups are not removed when the node is unregistered, so they
    are leaked.  Use sysfs_remove_group() to fix this issue.

    Link: https://lkml.kernel.org/r/20220816130553.31406-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Fengwei Yin <fengwei.yin@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:10:59 -06:00
Nico Pache bda2951559 mm/hugetlb: fix missing call to restore_reserve_on_error()
commit 3a5497a2dae381cb1b201fb20847fb32a059da25
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 16 21:05:50 2022 +0800

    mm/hugetlb: fix missing call to restore_reserve_on_error()

    When huge_add_to_page_cache() fails, the page is freed directly without
    calling restore_reserve_on_error() to restore reserve for newly allocated
    pages not in page cache.  Fix this by calling restore_reserve_on_error()
    when huge_add_to_page_cache fails.

    [linmiaohe@huawei.com: remove err == -EEXIST check and retry logic]
      Link: https://lkml.kernel.org/r/20220823030209.57434-4-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220816130553.31406-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:10:59 -06:00
Nico Pache 72b8429df7 mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
commit 3a6bdda0b58bab883da8dc3a0503e2a13b9d7456
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 16 21:05:49 2022 +0800

    mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()

    If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
    will be set to NULL.  It will then be passed to sysfs_create_group() if
    h->demote_order != 0, triggering the WARN_ON(!kobj) check.  Fix this by
    making sure hstate_kobjs[hi] != NULL when calling sysfs_create_group.

    Link: https://lkml.kernel.org/r/20220816130553.31406-3-linmiaohe@huawei.com
    Fixes: 79dfc695525f ("hugetlb: add demote hugetlb page sysfs interfaces")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:10:59 -06:00
Nico Pache 1cd305e74c mm/hugetlb: fix incorrect update of max_huge_pages
commit a43a83c79b4fd36e297f97b7468be28fdf771d78
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 16 21:05:48 2022 +0800

    mm/hugetlb: fix incorrect update of max_huge_pages

    Patch series "A few fixup patches for hugetlb".

    This series contains a few fixup patches to fix incorrect update of
    max_huge_pages, fix WARN_ON(!kobj) in sysfs_create_group() and so on.
    More details can be found in the respective changelogs.

    This patch (of 6):

    target_hstate->max_huge_pages should be incremented by
    pages_per_huge_page(h) / pages_per_huge_page(target_hstate) pages when a
    page is demoted.  Update max_huge_pages accordingly for consistency.

    Link: https://lkml.kernel.org/r/20220816130553.31406-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220816130553.31406-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:10:59 -06:00
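
To make the accounting concrete, a small stand-alone sketch of that ratio,
using assumed x86_64 sizes (a 1 GB source hstate demoted into 2 MB pages);
the constants are illustrative, not taken from the patch:

```
#include <stdio.h>

int main(void)
{
	unsigned long src_hpage = 1UL << 30;   /* 1 GB hugepage (assumed) */
	unsigned long dst_hpage = 2UL << 20;   /* 2 MB hugepage (assumed) */
	unsigned long base_page = 4UL << 10;   /* 4 kB base page          */

	/* pages_per_huge_page(h) / pages_per_huge_page(target_hstate) */
	unsigned long src_pages = src_hpage / base_page;   /* 262144 */
	unsigned long dst_pages = dst_hpage / base_page;   /*    512 */
	unsigned long increment = src_pages / dst_pages;   /*    512 */

	printf("demoting one 1 GB page adds %lu pages to the 2 MB pool\n",
	       increment);
	return 0;
}
```
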
Aristeu Rozanski ec94d66baa hugetlbfs: don't delete error page from pagecache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests
Conflicts: different context due to missing fa27759af4a6d7494c986c44695b13bcd6eaf46b. We still use remove_huge_page() since we didn't backport 7e1813d48dd30e6c6f235f6661d1bc108fcab528, which renames it to hugetlb_delete_from_page_cache()

commit 8625147cafaa9ba74713d682f5185eb62cb2aedb
Author: James Houghton <jthoughton@google.com>
Date:   Tue Oct 18 20:01:25 2022 +0000

    hugetlbfs: don't delete error page from pagecache

    This change is very similar to the change that was made for shmem [1], and
    it solves the same problem but for HugeTLBFS instead.

    Currently, when poison is found in a HugeTLB page, the page is removed
    from the page cache.  That means that attempting to map or read that
    hugepage in the future will result in a new hugepage being allocated
    instead of notifying the user that the page was poisoned.  As [1] states,
    this is effectively memory corruption.

    The fix is to leave the page in the page cache.  If the user attempts to
    use a poisoned HugeTLB page with a syscall, the syscall will fail with
    EIO, the same error code that shmem uses.  For attempts to map the page,
    the thread will get a BUS_MCEERR_AR SIGBUS.

    [1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

    Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
    Signed-off-by: James Houghton <jthoughton@google.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
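
A hedged user-space sketch of how a process could observe the behavior
described above: a read() of the poisoned range is expected to fail with EIO,
and a faulting access to raise SIGBUS with si_code BUS_MCEERR_AR; the
hugetlbfs file path is a placeholder:

```
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* On a fault against the poisoned page we expect BUS_MCEERR_AR. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig; (void)ctx;
	if (info->si_code == BUS_MCEERR_AR)
		fprintf(stderr, "SIGBUS (BUS_MCEERR_AR) at %p\n", info->si_addr);
	_exit(1);
}

int main(int argc, char **argv)
{
	struct sigaction sa = { .sa_flags = SA_SIGINFO };
	char buf[4096];
	ssize_t n;
	int fd;

	sa.sa_sigaction = sigbus_handler;
	sigaction(SIGBUS, &sa, NULL);

	/* Placeholder: a hugetlbfs file known to contain a poisoned page. */
	fd = open(argc > 1 ? argv[1] : "/dev/hugepages/poisoned", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * A syscall touching the poisoned page should now fail with EIO
	 * instead of silently handing back a freshly allocated hugepage.
	 */
	n = read(fd, buf, sizeof(buf));
	if (n < 0 && errno == EIO)
		printf("read: EIO as expected for a poisoned page\n");
	else
		printf("read returned %zd (errno %d)\n", n, errno);

	close(fd);
	return 0;
}
```
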
Chris von Recklinghausen d392132e1e mm/uffd: detect pgtable allocation failures
Bugzilla: https://bugzilla.redhat.com/2160210

commit d1751118c88673fe5a948ad82277898e9e284c55
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jan 4 17:52:07 2023 -0500

    mm/uffd: detect pgtable allocation failures

    Before this patch, when a pgtable allocation failure happened during
    change_protection(), the error was ignored by the syscall.  For shmem, an
    error was dumped into the host dmesg.  Two issues with that:

      (1) Doing a trace dump when allocation fails is not anything close to
          graceful.

      (2) The user should be notified of any such error, so the user can trap
          it and decide what to do next, either by retrying, stopping the
          process properly, or anything else.

    For userfault users, this changes the API of UFFDIO_WRITEPROTECT when a
    pgtable allocation failure happens.  It should not normally break anyone,
    though.  If it does, it breaks in a good way.

    One man-page update will be on the way to introduce the new -ENOMEM for
    UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on
    the 5.19-till-now kernels.

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/20230104225207.1066932-4-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: James Houghton <jthoughton@google.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
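
A minimal sketch of how a userfaultfd-wp user might handle the new -ENOMEM
from UFFDIO_WRITEPROTECT; it uses anonymous memory for brevity, assumes a
kernel with uffd-wp support, and the retry policy is purely illustrative:

```
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LEN (1UL << 20)

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API,
				  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
	struct uffdio_register reg = { 0 };
	struct uffdio_writeprotect wp = { 0 };
	long uffd;
	void *area;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api)) {
		perror("userfaultfd");
		return 1;
	}

	area = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	reg.range.start = (unsigned long)area;
	reg.range.len = LEN;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		perror("UFFDIO_REGISTER");
		return 1;
	}

	wp.range.start = (unsigned long)area;
	wp.range.len = LEN;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;

	/*
	 * With the change above, a pgtable allocation failure is reported as
	 * -ENOMEM instead of being silently ignored, so the caller can back
	 * off and retry rather than continue with a partially protected range.
	 */
	while (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
		if (errno == ENOMEM) {
			fprintf(stderr, "UFFDIO_WRITEPROTECT: ENOMEM, retrying\n");
			usleep(100 * 1000);
			continue;
		}
		perror("UFFDIO_WRITEPROTECT");
		return 1;
	}

	printf("range write-protected\n");
	return 0;
}
```
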
Chris von Recklinghausen 403d9558c1 mm/mprotect: use long for page accountings and retval
Bugzilla: https://bugzilla.redhat.com/2160210

commit a79390f5d6a78647fd70856bd42b22d994de0ba2
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jan 4 17:52:06 2023 -0500

    mm/mprotect: use long for page accountings and retval

    Switch to use type "long" for page accountings and retval across the whole
    procedure of change_protection().

    The change should have shrinked the possible maximum page number to be
    half comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on
    any system either because the maximum possible pages touched by change
    protection should be ULONG_MAX / PAGE_SIZE.

    Two reasons to switch from "unsigned long" to "long":

      1. It suites better on count_vm_numa_events(), whose 2nd parameter takes
         a long type.

      2. It paves way for returning negative (error) values in the future.

    Currently the only caller that consumes this retval is change_prot_numa(),
    where the unsigned long was converted to an int.  Since at it, touching up
    the numa code to also take a long, so it'll avoid any possible overflow
    too during the int-size convertion.

    Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen 4c1006592b mm/hugetlb: pre-allocate pgtable pages for uffd wr-protects
Bugzilla: https://bugzilla.redhat.com/2160210

commit fed15f1345dc8a7fc8baa81e8b55c3ba010d7f4b
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jan 4 17:52:05 2023 -0500

    mm/hugetlb: pre-allocate pgtable pages for uffd wr-protects

    Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
    and hugetlb.  Shmem has pre-allocation ready for markers, but hugetlb path
    was overlooked.

    Do so by calling huge_pte_alloc() if the initial pgtable walk fails to
    find the huge ptep.  It's possible that huge_pte_alloc() can fail under
    high memory pressure; in that case, stop the loop immediately and fail
    silently.  This is not the most ideal solution, but it matches what we do
    for shmem while avoiding the splat in dmesg.

    Link: https://lkml.kernel.org/r/20230104225207.1066932-2-peterx@redhat.com
    Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: James Houghton <jthoughton@google.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: <stable@vger.kernel.org>    [5.19+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:34 -04:00
Chris von Recklinghausen 4bbdf459cf mm/hugetlb: fix uffd-wp handling for migration entries in hugetlb_change_protection()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 44f86392bdd165da7e43d3c772aeb1e128ffd6c8
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Dec 22 21:55:11 2022 +0100

    mm/hugetlb: fix uffd-wp handling for migration entries in hugetlb_change_protection()

    We have to update the uffd-wp SWP PTE bit independent of the type of
    migration entry.  Currently, if we're unlucky and we want to install/clear
    the uffd-wp bit just while we're migrating a read-only mapped hugetlb
    page, we would fail to set/clear the uffd-wp bit.

    Further, if we're processing a readable-exclusive migration entry and
    neither want to set or clear the uffd-wp bit, we could currently end up
    losing the uffd-wp bit.  Note that the same would hold for writable
    migration entries; however, having a writable migration entry with the
    uffd-wp bit set would already mean that something went wrong.

    Note that the change from !is_readable_migration_entry ->
    writable_migration_entry is harmless and actually cleaner, as raised by
    Miaohe Lin and discussed in [1].

    [1] https://lkml.kernel.org/r/90dd6a93-4500-e0de-2bf0-bf522c311b0c@huawei.com

    Link: https://lkml.kernel.org/r/20221222205511.675832-3-david@redhat.com
    Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:34 -04:00
Chris von Recklinghausen 55d064c402 mm/hugetlb: fix PTE marker handling in hugetlb_change_protection()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 0e678153f5be7e6c8d28835f5a678618da4b7a9c
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Dec 22 21:55:10 2022 +0100

    mm/hugetlb: fix PTE marker handling in hugetlb_change_protection()

    Patch series "mm/hugetlb: uffd-wp fixes for hugetlb_change_protection()".

    Playing with virtio-mem and background snapshots (using uffd-wp) on
    hugetlb in QEMU, I managed to trigger a VM_BUG_ON().  Looking into the
    details, hugetlb_change_protection() seems to not handle uffd-wp correctly
    in all cases.

    Patch #1 fixes my test case.  I don't have reproducers for patch #2, as it
    requires running into migration entries.

    I did not yet check in detail yet if !hugetlb code requires similar care.

    This patch (of 2):

    There are two problematic cases when stumbling over a PTE marker in
    hugetlb_change_protection():

    (1) We protect an uffd-wp PTE marker a second time using uffd-wp: we will
        end up in the "!huge_pte_none(pte)" case and mess up the PTE marker.

    (2) We unprotect a uffd-wp PTE marker: we will similarly end up in the
        "!huge_pte_none(pte)" case even though we cleared the PTE, because
        the "pte" variable is stale. We'll mess up the PTE marker.

    For example, if we later stumble over such a "wrongly modified" PTE marker,
    we'll treat it like a present PTE that maps some garbage page.

    This can, for example, be triggered by mapping a memfd backed by huge
    pages, registering uffd-wp, uffd-wp'ing an unmapped page and (a)
    uffd-wp'ing it a second time; or (b) uffd-unprotecting it; or (c)
    unregistering uffd-wp. Then, if we trigger fallocate(FALLOC_FL_PUNCH_HOLE)
    on that file range, we will run into a VM_BUG_ON:

    [  195.039560] page:00000000ba1f2987 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x0
    [  195.039565] flags: 0x7ffffc0001000(reserved|node=0|zone=0|lastcpupid=0x1fffff)
    [  195.039568] raw: 0007ffffc0001000 ffffe742c0000008 ffffe742c0000008 0000000000000000
    [  195.039569] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
    [  195.039569] page dumped because: VM_BUG_ON_PAGE(compound && !PageHead(page))
    [  195.039573] ------------[ cut here ]------------
    [  195.039574] kernel BUG at mm/rmap.c:1346!
    [  195.039579] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    [  195.039581] CPU: 7 PID: 4777 Comm: qemu-system-x86 Not tainted 6.0.12-200.fc36.x86_64 #1
    [  195.039583] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET50W (1.50 ) 09/15/2022
    [  195.039584] RIP: 0010:page_remove_rmap+0x45b/0x550
    [  195.039588] Code: [...]
    [  195.039589] RSP: 0018:ffffbc03c3633ba8 EFLAGS: 00010292
    [  195.039591] RAX: 0000000000000040 RBX: ffffe742c0000000 RCX: 0000000000000000
    [  195.039592] RDX: 0000000000000002 RSI: ffffffff8e7aac1a RDI: 00000000ffffffff
    [  195.039592] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffbc03c3633a08
    [  195.039593] R10: 0000000000000003 R11: ffffffff8f146328 R12: ffff9b04c42754b0
    [  195.039594] R13: ffffffff8fcc6328 R14: ffffbc03c3633c80 R15: ffff9b0484ab9100
    [  195.039595] FS:  00007fc7aaf68640(0000) GS:ffff9b0bbf7c0000(0000) knlGS:0000000000000000
    [  195.039596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  195.039597] CR2: 000055d402c49110 CR3: 0000000159392003 CR4: 0000000000772ee0
    [  195.039598] PKRU: 55555554
    [  195.039599] Call Trace:
    [  195.039600]  <TASK>
    [  195.039602]  __unmap_hugepage_range+0x33b/0x7d0
    [  195.039605]  unmap_hugepage_range+0x55/0x70
    [  195.039608]  hugetlb_vmdelete_list+0x77/0xa0
    [  195.039611]  hugetlbfs_fallocate+0x410/0x550
    [  195.039612]  ? _raw_spin_unlock_irqrestore+0x23/0x40
    [  195.039616]  vfs_fallocate+0x12e/0x360
    [  195.039618]  __x64_sys_fallocate+0x40/0x70
    [  195.039620]  do_syscall_64+0x58/0x80
    [  195.039623]  ? syscall_exit_to_user_mode+0x17/0x40
    [  195.039624]  ? do_syscall_64+0x67/0x80
    [  195.039626]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
    [  195.039628] RIP: 0033:0x7fc7b590651f
    [  195.039653] Code: [...]
    [  195.039654] RSP: 002b:00007fc7aaf66e70 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
    [  195.039655] RAX: ffffffffffffffda RBX: 0000558ef4b7f370 RCX: 00007fc7b590651f
    [  195.039656] RDX: 0000000018000000 RSI: 0000000000000003 RDI: 000000000000000c
    [  195.039657] RBP: 0000000008000000 R08: 0000000000000000 R09: 0000000000000073
    [  195.039658] R10: 0000000008000000 R11: 0000000000000293 R12: 0000000018000000
    [  195.039658] R13: 00007fb8bbe00000 R14: 000000000000000c R15: 0000000000001000
    [  195.039661]  </TASK>

    Fix it by not going into the "!huge_pte_none(pte)" case if we stumble over
    an exclusive marker.  spin_unlock() + continue would get the job done.

    However, instead, make it clearer that there are no fall-through
    statements: we process each case (hwpoison, migration, marker, !none,
    none) and then unlock the page table to continue with the next PTE.  Let's
    avoid "continue" statements and use a single spin_unlock() at the end.

    Link: https://lkml.kernel.org/r/20221222205511.675832-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20221222205511.675832-2-david@redhat.com
    Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:34 -04:00
Chris von Recklinghausen 2f2ceb6140 mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in
Bugzilla: https://bugzilla.redhat.com/2160210

commit 515778e2d790652a38a24554fdb7f21420d91efc
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Sep 30 20:25:55 2022 -0400

    mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in

    When PTE_MARKER_UFFD_WP is not configured, it's still possible to reach pte
    marker code and trigger a warning.  Add a few CONFIG_PTE_MARKER_UFFD_WP
    ifdefs to make sure the code won't be reached when not compiled in.

    Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
    Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: <syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Edward Liaw <edliaw@google.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:33 -04:00
Chris von Recklinghausen 16a4b1211c mm, hwpoison, hugetlb: support saving mechanism of raw error pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 161df60e9e89651c9aa3ae0edc9aae3a8a2d21e7
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:15 2022 +0900

    mm, hwpoison, hugetlb: support saving mechanism of raw error pages

    When handling a memory error on a hugetlb page, the error handler tries to
    dissolve it and turn it into 4kB pages.  If it's successfully dissolved,
    the PageHWPoison flag is moved to the raw error page, so that's all right.
    However, dissolving sometimes fails, and then the error page is left as a
    hwpoisoned hugepage.  It would be useful if we could retry dissolving it to
    save the healthy pages, but that's not possible now because the information
    about where the raw error pages are is lost.

    Use the private field of a few tail pages to keep that information.  The
    code path of shrinking hugepage pool uses this info to try delayed
    dissolve.  In order to remember multiple errors in a hugepage, a
    singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is
    constructed.  Only simple operations (adding an entry or clearing all) are
    required and the list is assumed not to be very long, so this simple data
    structure should be enough.

    If we fail to save the raw error info, the hwpoisoned hugepage has errors
    on an unknown subpage and this new saving mechanism no longer works, so
    disable both saving new raw error info and freeing hwpoisoned hugepages.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
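
As a loose user-space analogue of the bookkeeping described above (only "add an
entry" and "clear all" are required), a singly-linked list sketch; struct
raw_err and its fields are stand-ins, not the kernel layout:

```
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for one recorded raw error page (not the kernel structure). */
struct raw_err {
	struct raw_err *next;
	unsigned long pfn;
};

/* Add one entry at the head of the singly-linked list. */
static int record_error(struct raw_err **head, unsigned long pfn)
{
	struct raw_err *e = malloc(sizeof(*e));

	if (!e)
		return -1;   /* saving failed: caller disables the mechanism */
	e->pfn = pfn;
	e->next = *head;
	*head = e;
	return 0;
}

/* Clear the whole list, e.g. once the hugepage is finally dissolved. */
static void clear_errors(struct raw_err **head)
{
	while (*head) {
		struct raw_err *e = *head;

		*head = e->next;
		free(e);
	}
}

int main(void)
{
	struct raw_err *head = NULL;

	record_error(&head, 0x1234);
	record_error(&head, 0x5678);
	for (struct raw_err *e = head; e; e = e->next)
		printf("raw error pfn: 0x%lx\n", e->pfn);
	clear_errors(&head);
	return 0;
}
```
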
Chris von Recklinghausen 64da63e199 mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3a194f3f8ad01bce00bd7174aaba1563bcc827eb
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:14 2022 +0900

    mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry

    follow_pud_mask() does not support non-present pud entries now.  As far as
    I tested on an x86_64 server, follow_pud_mask() still simply returns
    no_page_table() for a non-present pud entry due to pud_bad(), so no severe
    user-visible effect should happen.  But generally we should call
    follow_huge_pud() for a non-present pud entry of a 1GB hugetlb page.

    Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
    The changes are similar to previous works for pud entries commit
    e66f17ff71 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and
    commit cbef8478be ("mm/hugetlb: pmd_huge() returns true for non-present
    hugepage").

    Link: https://lkml.kernel.org/r/20220714042420.1847125-3-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 55ef056b32 mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
Bugzilla: https://bugzilla.redhat.com/2160210

commit c0531714d6e3fd720b7dacc2de2d0503a995bcdc
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:13 2022 +0900

    mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()

    Patch series "mm, hwpoison: enable 1GB hugepage support", v7.

    This patch (of 8):

    I found a weird state of 1GB hugepage pool, caused by the following
    procedure:

      - run a process reserving all free 1GB hugepages,
      - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
        /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
      - kill the reserving process.

    , then all the hugepages are free *and* surplus at the same time.

      $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
      3
      $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
      3
      $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
      0
      $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
      3

    This state is resolved by reserving and allocating the pages, then freeing
    them again, so it does not seem to result in a serious problem.  But it's a
    little surprising (shrinking the pool suddenly fails).

    This behavior is caused by the hstate_is_gigantic() check in
    return_unused_surplus_pages().  It was introduced long ago, in 2008, by
    commit aa888a7497 ("hugetlb: support larger than MAX_ORDER"), and at
    that time gigantic pages were not supposed to be allocated/freed at
    run-time.  Now the kernel supports runtime allocation/freeing, so let's
    also check gigantic_page_runtime_supported().

    Link: https://lkml.kernel.org/r/20220714042420.1847125-1-naoya.horiguchi@linux.dev
    Link: https://lkml.kernel.org/r/20220714042420.1847125-2-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: kernel test robot <lkp@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:30 -04:00
Chris von Recklinghausen 037f2eb933 mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6213834c10de954470b7195cf0cdbda858edf0ee
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Jun 28 17:22:33 2022 +0800

    mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability

    There is a discussion about the name of hugetlb_vmemmap_alloc/free in
    thread [1].  The suggestion from David is to rename "alloc/free" to
    "optimize/restore" to make the functionality clearer to users: "optimize"
    means the function will optimize vmemmap pages, while "restore" means
    restoring the vmemmap pages discarded before.  This commit does that.

    Another discussion is about the confusion that RESERVE_VMEMMAP_NR isn't
    used explicitly for vmemmap_addr but implicitly for vmemmap_end in
    hugetlb_vmemmap_alloc/free.  David suggested we can compute at runtime what
    hugetlb_vmemmap_init() does now.  We do not need to worry about the
    overhead of computing at runtime since the calculation is simple enough
    and those functions are not in a hot path.  This commit has the
    following improvements:

      1) The function suffixed name ("optimize/restore") is more expressive.
      2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
      3) The hugetlb_vmemmap_init() does not need to be exported anymore.
      4) A ->optimize_vmemmap_pages field in struct hstate is killed.
      5) There is only one place that checks is_power_of_2(sizeof(struct
         page)) instead of two places.
      6) Add more comments for hugetlb_vmemmap_optimize/restore().
      7) For external users, hugetlb_optimize_vmemmap_pages() is used for
         detecting if the HugeTLB's vmemmap pages is optimizable originally.
         In this commit, it is killed and we introduce a new helper
         hugetlb_vmemmap_optimizable() to replace it.  The name is more
         expressive.

    Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/ [1]
    Link: https://lkml.kernel.org/r/20220628092235.91270-7-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Will Deacon <will@kernel.org>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:30 -04:00
Chris von Recklinghausen 830fb0c1df hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
Bugzilla: https://bugzilla.redhat.com/2160210

commit da9a298f5fad0dc615079a340da42928bc5b138e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jul 9 17:26:29 2022 +0800

    hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte

    When alloc_huge_page fails, *pagep is set to NULL without calling put_page
    first, so the hugepage indicated by *pagep is leaked.

    Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
    Fixes: 8cc5fcbb5b ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:27 -04:00
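
A generic user-space illustration of the error-path pattern being fixed here
(free() standing in for put_page(), all names made up): the object must be
released before the caller's pointer is cleared, otherwise it leaks:

```
#include <stdio.h>
#include <stdlib.h>

/*
 * Error-path helper: the failed operation must release the buffer *before*
 * clearing the caller's pointer, otherwise the allocation is leaked
 * (analogous to calling put_page() before setting *pagep = NULL).
 */
static int fail_and_release(char **out)
{
	free(*out);
	*out = NULL;
	return -1;
}

int main(void)
{
	char *buf = malloc(4096);

	if (!buf)
		return 1;
	if (fail_and_release(&buf) < 0)
		printf("operation failed; buffer released, nothing leaked\n");
	return 0;
}
```
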
Chris von Recklinghausen e2c13e3883 mm, hugetlb: skip irrelevant nodes in show_free_areas()
Bugzilla: https://bugzilla.redhat.com/2160210

commit dcadcf1c30619ead2f3280bfb7f74de8304be2bb
Author: Gang Li <ligang.bdlg@bytedance.com>
Date:   Wed Jul 6 11:46:54 2022 +0800

    mm, hugetlb: skip irrelevant nodes in show_free_areas()

    show_free_areas() allows to filter out node specific data which is
    irrelevant to the allocation request.  But hugetlb_show_meminfo() still
    shows hugetlb on all nodes, which is redundant and unnecessary.

    Use show_mem_node_skip() to skip irrelevant nodes.  And replace
    hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid).

    before-and-after sample output of OOM:

    before:
    ```
    [  214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB
    [  214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
    [  214.388334] lowmem_reserve[]: 0 0 0 0 0
    [  214.390251] Node 1 Normal: 423*4kB (UE) 320*8kB (UME) 187*16kB (UE) 117*32kB (UE) 57*64kB (UME) 20
    [  214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    [  214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    ```

    after:
    ```
    [  145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u
    [  145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
    [  145.152315] lowmem_reserve[]: 0 0 0 0 0
    [  145.155244] Node 1 Normal: 470*4kB (UME) 373*8kB (UME) 247*16kB (UME) 168*32kB (UE) 86*64kB (UME)
    [  145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    ```

    Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com
    Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:26 -04:00
Chris von Recklinghausen d23db1ae34 hugetlb: do not update address in huge_pmd_unshare
Bugzilla: https://bugzilla.redhat.com/2160210

commit 4ddb4d91b82f4b64458fe35bc8e395c7c082ea2b
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Jun 21 16:56:19 2022 -0700

    hugetlb: do not update address in huge_pmd_unshare

    As an optimization for loops sequentially processing hugetlb address
    ranges, huge_pmd_unshare would update a passed address if it unshared a
    pmd.  Updating a loop control variable outside the loop like this is
    generally a bad idea.  These loops are now using hugetlb_mask_last_page to
    optimize scanning when non-present ptes are discovered.  The same can be
    done when huge_pmd_unshare returns 1 indicating a pmd was unshared.

    Remove address update from huge_pmd_unshare.  Change the passed argument
    type and update all callers.  In loops sequentially processing addresses
    use hugetlb_mask_last_page to update address if pmd is unshared.

    [sfr@canb.auug.org.au: fix an unused variable warning/error]
      Link: https://lkml.kernel.org/r/20220622171117.70850960@canb.auug.org.au
    Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
    Cc: Will Deacon <will@kernel.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen 4639934348 hugetlb: skip to end of PT page mapping when pte not present
Bugzilla: https://bugzilla.redhat.com/2160210

commit e95a9851787bbb3cd4deb40fe8bab03f731852d1
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Jun 21 16:56:17 2022 -0700

    hugetlb: skip to end of PT page mapping when pte not present

    Patch series "hugetlb: speed up linear address scanning", v2.

    At unmap, fork and remap time hugetlb address ranges are linearly scanned.
    We can optimize these scans if the ranges are sparsely populated.

    Also, enable page table "Lazy copy" for hugetlb at fork.

    NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to
    add an arch specific version hugetlb_mask_last_page() to take advantage of
    sparse address scanning improvements.  Baolin Wang added the routine for
    arm64.  Other architectures which could be optimized are: ia64, mips,
    parisc, powerpc, s390, sh and sparc.

    This patch (of 4):

    HugeTLB address ranges are linearly scanned during fork, unmap and remap
    operations.  If a non-present entry is encountered, the code currently
    continues to the next huge page aligned address.  However, a non-present
    entry implies that the page table page for that entry is not present.
    Therefore, the linear scan can skip to the end of range mapped by the page
    table page.  This can speed operations on large sparsely populated hugetlb
    mappings.

    Create a new routine hugetlb_mask_last_page() that will return an address
    mask.  When the mask is ORed with an address, the result will be the
    address of the last huge page mapped by the associated page table page.
    Use this mask to update addresses in routines which linearly scan hugetlb
    address ranges when a non-present pte is encountered.

    hugetlb_mask_last_page is related to the implementation of huge_pte_offset
    as hugetlb_mask_last_page is called when huge_pte_offset returns NULL.
    This patch only provides a complete hugetlb_mask_last_page implementation
    when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined.  Architectures which
    provide their own versions of huge_pte_offset can also provide their own
    version of hugetlb_mask_last_page.

    Link: https://lkml.kernel.org/r/20220621235620.291305-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20220621235620.291305-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
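
A minimal, compilable userspace sketch of the address-mask idea from the entry above; the 2 MiB / 1 GiB sizes stand in for PMD_SIZE/PUD_SIZE and are illustrative assumptions, not kernel code.

    #include <stdio.h>

    #define PMD_SIZE (2UL << 20)   /* one 2 MiB huge page (illustrative)       */
    #define PUD_SIZE (1UL << 30)   /* range covered by one PMD page table page */

    int main(void)
    {
            /* What the generic hugetlb_mask_last_page() would return for a
             * 2 MiB hstate: PUD_SIZE - PMD_SIZE. */
            unsigned long last_addr_mask = PUD_SIZE - PMD_SIZE;

            /* Arbitrary scan position inside a sparsely populated mapping. */
            unsigned long addr = 5 * PUD_SIZE + 7 * PMD_SIZE;

            /* pte not present => the whole PMD page table page is absent:
             * OR in the mask to jump to the last huge page that page could
             * have mapped, so the loop's "addr += PMD_SIZE" lands on the
             * first address of the next page table page. */
            unsigned long skip = addr | last_addr_mask;

            printf("scan addr  = %#lx\n", addr);
            printf("skip to    = %#lx (last 2 MiB page under this PMD table)\n", skip);
            printf("next round = %#lx (start of the next PMD table's range)\n",
                   skip + PMD_SIZE);
            return 0;
    }

With this, a completely unpopulated 1 GiB stretch costs a single loop iteration instead of 512.
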
Chris von Recklinghausen 94b8e0ebc1 mm: rename is_pinnable_page() to is_longterm_pinnable_page()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6077c943beee407168f72ece745b0aeaef6b896f
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:08 2022 -0500

    mm: rename is_pinnable_page() to is_longterm_pinnable_page()

    Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory
    mapping", v9.

    This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
    owned by a device that can be mapped into CPU page tables like
    MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.

    This patch series is mostly self-contained except for a few places where
    it needs to update other subsystems to handle the new memory type.

    System stability and performance are not affected according to our ongoing
    testing, including xfstests.

    How it works: The system BIOS advertises the GPU device memory (aka VRAM)
    as SPM (special purpose memory) in the UEFI system address map.

    The amdgpu driver registers the memory with devmap as
    MEMORY_DEVICE_COHERENT using devm_memremap_pages.  The initial user for
    this hardware page migration capability is the Frontier supercomputer
    project.  This functionality is not AMD-specific.  We expect other GPU
    vendors to find this functionality useful, and possibly other hardware
    types in the future.

    Our test nodes in the lab are similar to the Frontier configuration, with
    .5 TB of system memory plus 256 GB of device memory split across 4 GPUs,
    all in a single coherent address space.  Page migration is expected to
    improve application efficiency significantly.  We will report empirical
    results as they become available.

    Coherent device type pages at gup are now migrated back to system memory
    if they are being pinned long-term (FOLL_LONGTERM).  The reason is that
    long-term pinning would interfere with the device memory manager owning
    the device-coherent pages (e.g.  evictions in TTM).  This series
    incorporates Alistair Popple's patches to do this migration from
    pin_user_pages() calls.  hmm_gup_test has been added to hmm-test to test
    different get user pages calls.

    This series includes handling of device-managed anonymous pages returned
    by vm_normal_pages.  Although they behave like normal pages for purposes
    of mapping in CPU page tables and for COW, they do not support LRU lists,
    NUMA migration or THP.

    We also introduced a FOLL_LRU flag that adds the same behaviour to
    follow_page and related APIs, to allow callers to specify that they expect
    to put pages on an LRU list.

    This patch (of 14):

    is_pinnable_page() and folio_is_pinnable() are renamed to
    is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively.
    These functions are used in the FOLL_LONGTERM flag context.

    Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com
    Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:22 -04:00
Chris von Recklinghausen d1693d16b0 mm: hugetlb: kill set_huge_swap_pte_at()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 18f3962953e40401b7ed98e8524167282c3e626e
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Sun Jun 26 22:57:17 2022 +0800

    mm: hugetlb: kill set_huge_swap_pte_at()

    Commit e5251fd430 ("mm/hugetlb: introduce set_huge_swap_pte_at()
    helper") added set_huge_swap_pte_at() to handle swap entries on
    architectures that support hugepages consisting of contiguous ptes.
    Currently set_huge_swap_pte_at() is only overridden by arm64.

    set_huge_swap_pte_at() provides a sz parameter to help determine the
    number of entries to be updated.  But in fact, all hugetlb swap entries
    contain pfn information, so we can find the corresponding folio through
    the pfn recorded in the swap entry, and then folio_size() gives the
    number of entries that need to be updated.

    Also, users can easily cause bugs by ignoring the difference between
    set_huge_swap_pte_at() and set_huge_pte_at().  Let's handle swap entries
    in set_huge_pte_at() and remove set_huge_swap_pte_at(); then we can call
    set_huge_pte_at() anywhere, which simplifies the code.

    Link: https://lkml.kernel.org/r/20220626145717.53572-1-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:21 -04:00
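
A rough sketch of the arm64 side of the change above (the num_contig_ptes() helper name comes from arch/arm64/mm/hugetlbpage.c, but this is a paraphrase of the idea, not the exact upstream diff): derive the entry count from the page behind the swap entry instead of requiring a sz argument.

    void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                         pte_t *ptep, pte_t pte)
    {
            size_t pgsize;
            int i, ncontig;

            if (!pte_present(pte)) {
                    /* Swap entry (e.g. a migration entry): the pfn in the
                     * entry identifies the page, and its size tells us how
                     * many contiguous ptes must be written. */
                    struct page *page =
                            pfn_swap_entry_to_page(pte_to_swp_entry(pte));

                    ncontig = num_contig_ptes(page_size(page), &pgsize);
                    for (i = 0; i < ncontig; i++, ptep++)
                            set_pte_at(mm, addr, ptep, pte);
                    return;
            }

            /* ... present-pte handling unchanged ... */
    }
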
Chris von Recklinghausen 7691941a7c mm: hugetlb: remove minimum_order variable
Bugzilla: https://bugzilla.redhat.com/2160210

commit dc2628f39582e79bce41842fc91235b70054838c
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Thu Jun 16 11:38:46 2022 +0800

    mm: hugetlb: remove minimum_order variable

    commit 641844f561 ("mm/hugetlb: introduce minimum hugepage order") fixed
    a static checker warning and introduced a global variable minimum_order to
    fix the warning.  However, the local variable in
    dissolve_free_huge_pages() can be initialized to
    huge_page_order(&default_hstate) to fix the warning.

    So remove minimum_order to simplify the code.

    Link: https://lkml.kernel.org/r/20220616033846.96937-1-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen 5cd60c6a67 mm/hugetlb: remove unnecessary huge_ptep_set_access_flags() in hugetlb_mcopy_atomic_pte()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8edaec0756005a3f286c9272e909dff07d12cf75
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri May 27 10:01:35 2022 +0800

    mm/hugetlb: remove unnecessary huge_ptep_set_access_flags() in hugetlb_mcopy_atomic_pte()

    There is no need to update the hugetlb access flags after just setting the
    hugetlb page table entry by set_huge_pte_at(), since the page table entry
    value has no changes.

    Thus remove the unnecessary huge_ptep_set_access_flags() in
    hugetlb_mcopy_atomic_pte().

    Link: https://lkml.kernel.org/r/f3e28b897b53a69967a8b98a6fdcda3be80c9229.1653616175.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen bf73ab833d hugetlb: Convert huge_add_to_page_cache() to use a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit d9ef44de5d731e1a1fa94ddb5e39ea1b308b1456
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jun 1 15:11:01 2022 -0400

    hugetlb: Convert huge_add_to_page_cache() to use a folio

    Remove the last caller of add_to_page_cache()

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen 267a7a9b62 docs: rename Documentation/vm to Documentation/mm
Conflicts: drop changes to arch/loongarch/Kconfig - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit ee65728e103bb7dd99d8604bf6c7aa89c7d7e446
Author: Mike Rapoport <rppt@kernel.org>
Date:   Mon Jun 27 09:00:26 2022 +0300

    docs: rename Documentation/vm to Documentation/mm

    so it will be consistent with the mm code directory and with
    Documentation/admin-guide/mm, and won't be confused with virtual machines.

    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Tested-by: Ira Weiny <ira.weiny@intel.com>
    Acked-by: Jonathan Corbet <corbet@lwn.net>
    Acked-by: Wu XiangCheng <bobwxc@email.cn>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen aadb0028d0 delayacct: track delays from write-protect copy
Bugzilla: https://bugzilla.redhat.com/2160210

commit 662ce1dc9caf493c309200edbe38d186f1ea20d0
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Wed Jun 1 15:55:25 2022 -0700

    delayacct: track delays from write-protect copy

    Delay accounting does not track the delay of write-protect copy.  When
    tasks trigger many write-protect copies (including COW and unsharing of
    anonymous pages [1]), they may spend a significant amount of time waiting
    for them.  Tracking the delay of tasks in write-protect copy could help
    users evaluate the impact of using KSM, fork() or GUP.

    Also update tools/accounting/getdelays.c:

        / # ./getdelays -dl -p 231
        print delayacct stats ON
        listen forever
        PID     231

        CPU             count     real total  virtual total    delay total  delay average
                         6247     1859000000     2154070021     1674255063          0.268ms
        IO              count    delay total  delay average
                            0              0              0ms
        SWAP            count    delay total  delay average
                            0              0              0ms
        RECLAIM         count    delay total  delay average
                            0              0              0ms
        THRASHING       count    delay total  delay average
                            0              0              0ms
        COMPACT         count    delay total  delay average
                            3          72758              0ms
        WPCOPY          count    delay total  delay average
                         3635      271567604              0ms

    [1] commit 31cc5bc4af70("mm: support GUP-triggered unsharing of anonymous pages")

    Link: https://lkml.kernel.org/r/20220409014342.2505532-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Reviewed-by: wangyong <wang.yong12@zte.com.cn>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen 522d352177 mm/hugetlb: only drop uffd-wp special pte if required
Bugzilla: https://bugzilla.redhat.com/2160210

commit 05e90bd05eea33fc77d6b11e121e2da01fee379f
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:55 2022 -0700

    mm/hugetlb: only drop uffd-wp special pte if required

    As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
    if unmapping an entire vma or synchronized such that faults can not race
    with the unmap operation.  This requires passing zap_flags all the way to
    the lowest level hugetlb unmap routine: __unmap_hugepage_range.

    In general, unmap calls originated in hugetlbfs code will pass the
    ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
    faults.  The exception is hole punch which will first unmap without any
    synchronization.  Later when hole punch actually removes the page from the
    file, it will check to see if there was a subsequent fault and if so take
    the hugetlb fault mutex while unmapping again.  This second unmap will
    pass in ZAP_FLAG_DROP_MARKER.

    The justification of "whether to apply the ZAP_FLAG_DROP_MARKER flag when
    unmapping a hugetlb range" is (IMHO): we should never reach a state where
    a page fault could erroneously fault in a page-cache page that was
    wr-protected to be writable, even for an extremely short period.  That
    could happen if e.g.  we pass ZAP_FLAG_DROP_MARKER when
    hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
    faults after that call and before remove_inode_hugepages() is executed,
    the page cache can be mapped writable again in that small racy window,
    which can cause unexpected data to be overwritten.

    [peterx@redhat.com: fix sparse warning]
      Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
    [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
    Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen df80077278 mm/hugetlb: allow uffd wr-protect none ptes
Bugzilla: https://bugzilla.redhat.com/2160210

commit 60dfaad65aa97fb6755b9798a6b3c9e79bcd5930
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:55 2022 -0700

    mm/hugetlb: allow uffd wr-protect none ptes

    Teach hugetlbfs code to wr-protect none ptes just in case the page cache
    existed for that pte.  Meanwhile we also need to be able to recognize a
    uffd-wp marker pte and remove it for uffd_wp_resolve.

    While at it, introduce a variable "psize" to replace all references to the
    huge page size fetcher.

    Link: https://lkml.kernel.org/r/20220405014912.14815-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen cbd1fe7b79 mm/hugetlb: handle UFFDIO_WRITEPROTECT
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5a90d5a103c2badfcf12d48e2fec350969e3f486
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:54 2022 -0700

    mm/hugetlb: handle UFFDIO_WRITEPROTECT

    This starts from passing cp_flags into hugetlb_change_protection() so
    hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

    huge_pte_clear_uffd_wp() is introduced to handle the case where the
    UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.

    Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 5989780018 mm/hugetlb: take care of UFFDIO_COPY_MODE_WP
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6041c69179034278ac6d57f90a55b09e588f4b90
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:54 2022 -0700

    mm/hugetlb: take care of UFFDIO_COPY_MODE_WP

    Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
    stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is used with UFFDIO_COPY.

    Hugetlb pages are only managed by hugetlbfs, so we're safe even without
    setting dirty bit in the huge pte if the page is installed as read-only.
    However we'd better still keep the dirty bit set for a read-only
    UFFDIO_COPY pte (when UFFDIO_COPY_MODE_WP bit is set), not only to match
    what we do with shmem, but also because the page does contain dirty data
    that the kernel just copied from the userspace.

    Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 66dc87e35b mm/hugetlb: hook page faults for uffd write protection
Bugzilla: https://bugzilla.redhat.com/2160210

commit 166f3ecc0daf0c164bd7e2f780dbcd1e213ac95f
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:54 2022 -0700

    mm/hugetlb: hook page faults for uffd write protection

    Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp
    faults.

    We do this slightly earlier than hugetlb_cow() so that we can avoid taking
    some extra locks that we definitely don't need.

    Link: https://lkml.kernel.org/r/20220405014901.14590-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 3f74a6283d hugetlb: remove use of list iterator variable after loop
Bugzilla: https://bugzilla.redhat.com/2160210

commit 84448c8ecd9a130e8cddef5c585446c5520e774b
Author: Jakob Koschel <jakobkoschel@gmail.com>
Date:   Thu Apr 28 23:16:03 2022 -0700

    hugetlb: remove use of list iterator variable after loop

    In preparation to limit the scope of the list iterator to the list
    traversal loop, use a dedicated pointer to iterate through the list [1].

    Before, hugetlb_resv_map_add() was expecting a file_region struct, but if
    the list iterator in add_reservation_in_range() did not exit early, the
    variable passed in is not actually a valid structure.

    In such a case 'rg' is computed from the head element of the list and
    represents an out-of-bounds pointer.  This still remains safe *iff* only
    the link member is used (as is done in hugetlb_resv_map_add()).

    To avoid the type-confusion altogether and limit the list iterator to the
    loop, only a list_head pointer is kept to pass to hugetlb_resv_map_add().

    Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
    Link: https://lkml.kernel.org/r/20220331224323.903842-1-jakobkoschel@gmail.com
    Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: "Brian Johannesmeyer" <bjohannesmeyer@gmail.com>
    Cc: Cristiano Giuffrida <c.giuffrida@vu.nl>
    Cc: "Bos, H.J." <h.j.bos@vu.nl>
    Cc: Jakob Koschel <jakobkoschel@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen f4ca3e9bff mm, hugetlb, hwpoison: separate branch for free and in-use hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit b283d983a7a6ffe3939ff26f06d151331a7c1071
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm, hugetlb, hwpoison: separate branch for free and in-use hugepage

    We know that HPageFreed pages should have page refcount 0, so
    get_page_unless_zero() always fails and returns 0.  So explicitly separate
    the branch based on page state for minor optimization and better
    readability.

    Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Rafael Aquini 0b8ddad58b mm/hugetlb: use hugetlb_pte_stable in migration race check
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2158123
CVE: CVE-2022-3522

This patch is a backport of the following upstream commit:
commit f9bf6c03eca1077cae8de0e6d86427656fa42a9b
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Oct 4 15:33:59 2022 -0400

    mm/hugetlb: use hugetlb_pte_stable in migration race check

    Now that hugetlb_pte_stable() has been introduced, we can also rewrite the
    migration race check against page allocation to use the new helper too.

    Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-01-13 12:09:52 -05:00
Rafael Aquini 102588629f mm/hugetlb: fix race condition of uffd missing/minor handling
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2158123
CVE: CVE-2022-3522
Conflicts: as documented on the backport notes section.

This patch is a backport of the following upstream commit:
commit 2ea7ff1e39cbe3753d3c649beb70f2cf861dca75
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Oct 4 15:33:58 2022 -0400

    mm/hugetlb: fix race condition of uffd missing/minor handling

    Patch series "mm/hugetlb: Fix selftest failures with write check", v3.

    Currently akpm mm-unstable fails with uffd hugetlb private mapping test
    randomly on a write check.

    The initial bisection points to the recent pmd unshare series, but it
    turns out there is no direct relationship with that series; it only
    changed the timing enough for the race to start triggering.

    The race should be fixed in patch 1.  Patch 2 is a trivial cleanup of the
    similar race with hugetlb migrations, and patch 3 adds a comment to the
    write check so that when anyone reads it again it will be clear why it is
    there.

    This patch (of 3):

    After the recent rework patchset of hugetlb locking on pmd sharing,
    kselftest for userfaultfd sometimes fails on hugetlb private tests with
    unexpected write fault checks.

    It turns out there's nothing wrong within the locking series regarding
    this matter, but it could have changed the timing of threads so it can
    trigger an old bug.

    The real bug is that when we call hugetlb_no_page() we do not hold the
    pgtable lock, which means we are reading the pte values locklessly.  That
    is perfectly fine in most cases, because before we do normal page
    allocations we take the lock and check pte_same() again.  However, before
    that there are actually two paths in userfaultfd missing/minor handling
    that may directly move on with the fault process without rechecking the
    pte values.

    It means for these two paths we may be generating an uffd message based on
    an unstable pte, while an unstable pte can legally be anything as long as
    the modifier holds the pgtable lock.

    One example, which is also what happened in the failing kselftest and
    caused the test failure, is that for private mappings wr-protection
    changes can happen on one page.  While hugetlb_change_protection()
    generally requires the pte to be cleared before being changed, there can
    be a race condition like:

            thread 1                              thread 2
            --------                              --------

          UFFDIO_WRITEPROTECT                     hugetlb_fault
            hugetlb_change_protection
              pgtable_lock()
              huge_ptep_modify_prot_start
                                                  pte==NULL
                                                  hugetlb_no_page
                                                    generate uffd missing event
                                                    even if page existed!!
              huge_ptep_modify_prot_commit
              pgtable_unlock()

    Fix this by rechecking the pte after pgtable lock for both userfaultfd
    missing & minor fault paths.

    This bug should have been around since uffd hugetlb support was
    introduced, so attach a Fixes tag to that commit.  Also attach another
    Fixes tag to the minor support commit for easier tracking.

    Note that userfaultfd is actually fine with false positives (e.g.  caused
    by pte changed), but not wrong logical events (e.g.  caused by reading a
    pte during changing).  The latter can confuse the userspace, so the
    strictness is very much preferred.  E.g., MISSING event should never
    happen on the page after UFFDIO_COPY has correctly installed the page and
    returned.

    Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
    Fixes: 1a1aad8a9b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
    Fixes: 7677f7fd8b ("userfaultfd: add minor fault registration mode")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

RHEL BACKPORT NOTES:
* all noted context differences are due to RHEL-9 codebase missing upstream
  commit 958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault"),
  which itself depends on commit 40549ba8f8e0 ("hugetlb: use new vma_lock for
  pmd sharing synchronization") and its long series.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-01-13 12:09:51 -05:00
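
The recheck pattern the fix relies on can be summarized roughly as below; this is a sketch close to the hugetlb_pte_stable() helper the series adds (details may differ from the upstream code). The pte is re-read under the page-table lock, and the uffd MISSING/MINOR path only proceeds if it still matches the value read locklessly.

    static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
                                   pte_t *ptep, pte_t old_pte)
    {
            spinlock_t *ptl = huge_pte_lock(h, mm, ptep);
            bool same;

            /* Re-read under ptl: any modifier must hold this lock, so the
             * value seen here cannot be a transient intermediate state. */
            same = pte_same(huge_ptep_get(ptep), old_pte);
            spin_unlock(ptl);
            return same;
    }

hugetlb_no_page() and the minor-fault path then back out and let the fault be retried when this returns false, instead of reporting an event based on an unstable pte.
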
Rafael Aquini 7d0607ca83 mm/hugetlb: handle pte markers in page faults
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2158123
CVE: CVE-2022-3522

This patch is a backport of the following upstream commit:
commit c64e912c865a1a0df1c312bca946985eb095afa5
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:54 2022 -0700

    mm/hugetlb: handle pte markers in page faults

    Allow hugetlb code to handle pte markers just like none ptes.  It's mostly
    there, we just need to make sure we don't assume hugetlb_no_page() only
    handles none pte, so when detecting pte change we should use pte_same()
    rather than pte_none().  We need to pass in the old_pte to do the
    comparison.

    Check the original pte to see whether it's a pte marker, if it is, we
    should recover uffd-wp bit on the new pte to be installed, so that the
    next write will be trapped by uffd.

    Link: https://lkml.kernel.org/r/20220405014909.14761-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-01-13 12:09:50 -05:00
Nico Pache b1ff39aea3 mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
commit 12df140f0bdfae5dcfc81800970dd7f6f632e00c
Author: Rik van Riel <riel@surriel.com>
Date:   Mon Oct 17 20:25:05 2022 -0400

    mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages

    The h->*_huge_pages counters are protected by the hugetlb_lock, but
    alloc_huge_page has a corner case where it can decrement the counter
    outside of the lock.

    This could lead to a corrupted value of h->resv_huge_pages, which we have
    observed on our systems.

    Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a
    potential race.

    Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com
    Fixes: a88c769548 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count")
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Glen McCready <gkmccready@meta.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:44 -07:00
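
A sketch of the resulting shape of the code (variable names follow mm/hugetlb.c's alloc_huge_page(), but this is not the exact upstream hunk): after falling back to a freshly allocated page, the reservation counter is only touched while hugetlb_lock is held.

    spin_lock_irq(&hugetlb_lock);
    if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
            SetHPageRestoreReserve(page);
            h->resv_huge_pages--;           /* now done under hugetlb_lock */
    }
    list_add(&page->lru, &h->hugepage_activelist);
    /* ... remaining hugetlb_lock-protected bookkeeping ... */
    spin_unlock_irq(&hugetlb_lock);
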
Nico Pache 0cba48960b mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
commit fac35ba763ed07ba93154c95ffc0c4a55023707f
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Thu Sep 1 18:41:31 2022 +0800

    mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page

    Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb, which
    means they can support not only PMD/PUD size hugetlb (2M and 1G), but
    also CONT-PTE/PMD sizes (64K and 32M) if a 4K page size is specified.

    So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
    use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
    hugetlb in follow_page_pte().  However this pte entry lock is incorrect
    for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
    the correct lock, which is mm->page_table_lock.

    That means the pte entry of the CONT-PTE size hugetlb under current pte
    lock is unstable in follow_page_pte(), we can continue to migrate or
    poison the pte entry of the CONT-PTE size hugetlb, which can cause some
    potential race issues, even though they are under the 'pte lock'.

    For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
    page via the move_pages() syscall under the lock, while another thread B
    migrates the CONT-PTE hugetlb page at the same time.  This will cause
    thread A to get an incorrect page; if thread A also wants to do page
    migration, a data inconsistency error occurs.

    Moreover we have the same issue for CONT-PMD size hugetlb in
    follow_huge_pmd().

    To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
    to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
    get the correct pte entry lock to make the pte entry stable.

    Mike said:

    Support for CONT_PMD/_PTE was added with bb9dd3df8e ("arm64: hugetlb:
    refactor find_num_contig()").  Patch series "Support for contiguous pte
    hugepages", v4.  However, I do not believe these code paths were
    executed until migration support was added with 5480280d3f ("arm64/mm:
    enable HugeTLB migration for contiguous bit HugeTLB pages").  I would go
    with 5480280d3f for the Fixes: target.

    Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com
    Fixes: 5480280d3f ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:44 -07:00
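
The core of the fix is the lock choice: huge_pte_lock() selects the correct lock for whatever size the hstate uses (mm->page_table_lock for CONT-PTE), where pte_offset_map_lock() would pick the wrong one. A sketch of the pattern in the renamed follow_huge_pmd_pte(), with the rest of the lookup elided:

    struct hstate *h = hstate_vma(vma);
    spinlock_t *ptl;
    pte_t *ptep;

    ptep = huge_pte_offset(vma->vm_mm, address, huge_page_size(h));
    if (!ptep)
            return NULL;

    /* Correct lock for any hugetlb size (PMD, CONT-PTE, ...). */
    ptl = huge_pte_lock(h, vma->vm_mm, ptep);
    /* ... the entry is now stable: safe to test pte_present(), take a
     * reference on the page, etc. ... */
    spin_unlock(ptl);
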
Nico Pache 6dc345f13e mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process
commit d2226ebd5484afcf9f9b71b394ec1567a7730eb1
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Aug 5 08:59:03 2022 +0800

    mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process

    Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in
    commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple
    preferred nodes"), the policy_nodemask_current()'s semantics for this new
    policy has been changed, which returns 'preferred' nodes instead of
    'allowed' nodes.

    With the changed semantics of policy_nodemask_current, a task with
    MPOL_PREFERRED_MANY policy could fail to get its reservation even though
    it can fall back to other nodes (either defined by cpusets or all online
    nodes) for that reservation, failing mmap calls unnecessarily early.

    The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
    because it, unlike MPOL_BIND, does not pose any actual hard constraint.

    Michal suggested that policy_nodemask_current() is only used by hugetlb,
    and could be moved to hugetlb code with a more explicit name to enforce
    the 'allowed' semantics, for which only the MPOL_BIND policy matters.

    apply_policy_zone() is made extern to be called in hugetlb code and its
    return value is changed to bool.

    [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/

    Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com
    Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Reported-by: Muchun Song <songmuchun@bytedance.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Ben Widawsky <bwidawsk@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00
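
A sketch of what such a hugetlb-local helper can look like, based on the description above (only MPOL_BIND imposes a hard 'allowed' constraint); the helper name policy_mbind_nodemask() and the exact checks are assumptions rather than a quote of the upstream patch.

    /* Return the 'allowed' nodemask for the current task, or NULL if its
     * policy does not hard-restrict the allocation (sketch). */
    static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
    {
    #ifdef CONFIG_NUMA
            struct mempolicy *mpol = get_task_policy(current);

            /* Only MPOL_BIND is a hard constraint; apply_policy_zone() is
             * made extern by this patch and returns bool. */
            if (mpol->mode == MPOL_BIND &&
                apply_policy_zone(mpol, gfp_zone(gfp)))
                    return &mpol->nodes;
    #endif
            return NULL;
    }
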
Nico Pache bb5a6221d0 mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
commit ab74ef708dc51df7cf2b8a890b9c6990fac5c0c6
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 12 21:05:42 2022 +0800

    mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte

    In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
    cache are installed in the ptes.  But hugepage_add_new_anon_rmap is called
    for them mistakenly because they're not vm_shared.  This will corrupt the
    page->mapping used by page cache code.

    Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com
    Fixes: f619147104 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache 98f0be7d23 mm/hugetlb: support write-faults in shared mappings
commit 1d8d14641fd94a01b20a4abbf2749fd8eddcf57b
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Aug 11 12:34:35 2022 +0200

    mm/hugetlb: support write-faults in shared mappings

    If we ever get a write-fault on a write-protected page in a shared
    mapping, we'd be in trouble (again).  Instead, we can simply map the page
    writable.

    And in fact, there is even a way right now to trigger that code via
    uffd-wp ever since we stared to support it for shmem in 5.19:

    --------------------------------------------------------------------------
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include <fcntl.h>
     #include <unistd.h>
     #include <errno.h>
     #include <sys/mman.h>
     #include <sys/syscall.h>
     #include <sys/ioctl.h>
     #include <linux/userfaultfd.h>

     #define HUGETLB_SIZE (2 * 1024 * 1024u)

     static char *map;
     int uffd;

     static int temp_setup_uffd(void)
     {
            struct uffdio_api uffdio_api;
            struct uffdio_register uffdio_register;
            struct uffdio_writeprotect uffd_writeprotect;
            struct uffdio_range uffd_range;

            uffd = syscall(__NR_userfaultfd,
                           O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
            if (uffd < 0) {
                    fprintf(stderr, "syscall() failed: %d\n", errno);
                    return -errno;
            }

            uffdio_api.api = UFFD_API;
            uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
            if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
                    fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
                    return -errno;
            }

            if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
                    fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
                    return -ENOSYS;
            }

            /* Register UFFD-WP */
            uffdio_register.range.start = (unsigned long) map;
            uffdio_register.range.len = HUGETLB_SIZE;
            uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
            if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
                    fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
                    return -errno;
            }

            /* Writeprotect a single page. */
            uffd_writeprotect.range.start = (unsigned long) map;
            uffd_writeprotect.range.len = HUGETLB_SIZE;
            uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
            if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
                    fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
                    return -errno;
            }

            /* Unregister UFFD-WP without prior writeunprotection. */
            uffd_range.start = (unsigned long) map;
            uffd_range.len = HUGETLB_SIZE;
            if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
                    fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno);
                    return -errno;
            }

            return 0;
     }

     int main(int argc, char **argv)
     {
            int fd;

            fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
            if (fd < 0) {
                    fprintf(stderr, "open() failed\n");
                    return -errno;
            }
            if (ftruncate(fd, HUGETLB_SIZE)) {
                    fprintf(stderr, "ftruncate() failed\n");
                    return -errno;
            }

            map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) {
                    fprintf(stderr, "mmap() failed\n");
                    return -errno;
            }

            *map = 0;

            if (temp_setup_uffd())
                    return 1;

            *map = 0;

            return 0;
     }
    --------------------------------------------------------------------------

    Above test fails with SIGBUS when there is only a single free hugetlb page.
     # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
     # ./test
     Bus error (core dumped)

    And worse, with sufficient free hugetlb pages it will map an anonymous page
    into a shared mapping, for example, messing up accounting during unmap
    and breaking MAP_SHARED semantics:
     # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
     # ./test
     # cat /proc/meminfo | grep HugePages_
     HugePages_Total:       2
     HugePages_Free:        1
     HugePages_Rsvd:    18446744073709551615
     HugePages_Surp:        0

    Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when
    unregistering and consequently keeps the PTE writeprotected.  Reason for
    this is to avoid the additional overhead when unregistering.  Note that
    this is the case also for !hugetlb and that we will end up with writable
    PTEs that still have the uffd-wp PTE bit set once we return from
    hugetlb_wp().  I'm not touching the uffd-wp PTE bit for now, because it
    seems to be a generic thing -- wp_page_reuse() also doesn't clear it.

    VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates
    that MAP_SHARED handling was at least envisioned, but could never have
    worked as expected.

    While at it, make sure that we never end up in hugetlb_wp() on write
    faults without VM_WRITE, because we don't support maybe_mkwrite()
    semantics as commonly used in the !hugetlb case -- for example, in
    wp_page_reuse().

    Note that there is no need to do any kind of reservation in
    hugetlb_fault() in this case ...  because we already have a hugetlb page
    mapped R/O that we will simply map writable and we are not dealing with
    COW/unsharing.

    Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com
    Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Bjorn Helgaas <bhelgaas@google.com>
    Cc: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jamie Liu <jamieliu@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Pavel Emelyanov <xemul@parallels.com>
    Cc: Peter Feiner <pfeiner@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>    [5.19]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache e3a98076ca hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
commit da9a298f5fad0dc615079a340da42928bc5b138e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jul 9 17:26:29 2022 +0800

    hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte

    When alloc_huge_page fails, *pagep is set to NULL without put_page first.
    So the hugepage indicated by *pagep is leaked.

    Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
    Fixes: 8cc5fcbb5b ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:40 -07:00
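
The fix is a one-line put_page() in the allocation-failure path. Roughly, as a sketch of the relevant hunk in hugetlb_mcopy_atomic_pte() with the surrounding code elided:

    /* *pagep still holds the page handed in from a previous round here. */
    page = alloc_huge_page(dst_vma, dst_addr, 0);
    if (IS_ERR(page)) {
            put_page(*pagep);       /* the fix: drop the reference before losing it */
            ret = -ENOMEM;
            *pagep = NULL;
            goto out;
    }
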
Nico Pache 4cd418ee41 mm/migration: fix potential pte_unmap on an not mapped pte
commit ad1ac596e8a8c4b06715dfbd89853eb73c9886b2
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:16 2022 +0800

    mm/migration: fix potential pte_unmap on an not mapped pte

    __migration_entry_wait and migration_entry_wait_on_locked assume the pte
    is always mapped by the caller.  But this is not the case when they are
    called from migration_entry_wait_huge and follow_huge_pmd.  Add a
    hugetlbfs variant that calls hugetlb_migration_entry_wait(ptep == NULL)
    to fix this issue.
    Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com
    Fixes: 30dad30922 ("mm: migration: add migrate_entry_wait_huge()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Nico Pache aa28eb0c17 mm/migration: return errno when isolate_huge_page failed
commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:15 2022 +0800

    mm/migration: return errno when isolate_huge_page failed

    We might fail to isolate a huge page due to e.g.  the page being under
    migration, which cleared HPageMigratable.  We should return an errno in
    this case rather than always returning 1, which could confuse the user,
    i.e.  the caller might think all of the memory was migrated while the
    hugetlb page was left behind.  Make the prototype of isolate_huge_page
    consistent with isolate_lru_page as suggested by Huang Ying, and rename
    isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve
    readability.

    Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
    Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Huang Ying <ying.huang@intel.com>
    Reported-by: kernel test robot <lkp@intel.com> (build error)
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Nico Pache 1b208c83a3 hugetlb: fix huge_pmd_unshare address update
commit 48381273f8734d28ef56a5bdf1966dd8530111bc
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue May 24 13:50:03 2022 -0700

    hugetlb: fix huge_pmd_unshare address update

    The routine huge_pmd_unshare() is passed a pointer to an address
    associated with an area which may be unshared.  If unshare is successful
    this address is updated to 'optimize' callers iterating over huge page
    addresses.  For the optimization to work correctly, address should be
    updated to the last huge page in the unmapped/unshared area.  However, in
    the common case where the passed address is PUD_SIZE aligned, the address
    is incorrectly updated to the address of the preceding huge page.  That
    wastes CPU cycles as the unmapped/unshared range is scanned twice.

    Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
    Fixes: 39dde65c99 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:37 -07:00
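
A small, compilable userspace illustration of the arithmetic (the 2 MiB / 1 GiB sizes are illustrative assumptions): with a PUD_SIZE-aligned input the old round-up-and-subtract update steps backwards, while OR-ing in PUD_SIZE - PMD_SIZE yields the last huge page of the unshared area.

    #include <stdio.h>

    #define PMD_SIZE (2UL << 20)    /* one 2 MiB huge page                  */
    #define PUD_SIZE (1UL << 30)    /* area unmapped by huge_pmd_unshare()  */
    #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

    int main(void)
    {
            unsigned long addr  = 3 * PUD_SIZE;              /* already PUD aligned */
            unsigned long old   = ALIGN(addr, PUD_SIZE) - PMD_SIZE;
            unsigned long fixed = addr | (PUD_SIZE - PMD_SIZE);

            printf("addr         = %#lx\n", addr);
            printf("old update   = %#lx (preceding huge page: range scanned twice)\n", old);
            printf("fixed update = %#lx (last huge page in the unshared area)\n", fixed);
            return 0;
    }
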
Nico Pache 06cc5ae90a mm: hugetlb: add missing cache flushing in hugetlb_unshare_all_pmds()
commit 9c8bbfaca1bce84664403fd7dddbef6b3ff0a05a
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Apr 29 14:36:58 2022 -0700

    mm: hugetlb: add missing cache flushing in hugetlb_unshare_all_pmds()

    The code missed calling flush_cache_range() before removing the sharing
    PMD entries; otherwise data consistency issues may occur on some
    architectures whose caches are strict and require a virtual->physical
    translation to exist for a virtual address.  Thus add it.

    Currently no architectures enabling PMD sharing are affected, since they
    do not have a VIVT cache.  That means this issue cannot happen in
    practice so far.

    Link: https://lkml.kernel.org/r/47441086affcabb6ecbe403173e9283b0d904b38.1650956489.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/419b0e777c9e6d1454dcd906e0f5b752a736d335.1650781755.git.baolin.wang@linux.alibaba.com
    Fixes: 6dfeaff93b ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp")
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:36 -07:00
Chris von Recklinghausen de2161f9e2 mm/hugetlb: correct demote page offset logic
Bugzilla: https://bugzilla.redhat.com/2131716

commit 317314527d173e1f139ceaf8cb87cb1746abf240
Author: Doug Berger <opendmb@gmail.com>
Date:   Wed Sep 14 12:09:17 2022 -0700

    mm/hugetlb: correct demote page offset logic

    With gigantic pages it may not be true that struct page structures are
    contiguous across the entire gigantic page.  The nth_page macro is used
    here in place of direct pointer arithmetic to correct for this.

    Mike said:

    : This error could cause addressing exceptions.  However, this is only
    : possible in configurations where CONFIG_SPARSEMEM &&
    : !CONFIG_SPARSEMEM_VMEMMAP.  Such a configuration option is rare and
    : unknown to be the default anywhere.

    Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
    Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
    Signed-off-by: Doug Berger <opendmb@gmail.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-25 07:43:03 -04:00
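
A sketch of the change in the demote loop: index sub-pages with nth_page() instead of raw pointer arithmetic, since with CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP the struct pages of a gigantic page need not be virtually contiguous (loop body elided).

    for (i = 0; i < pages_per_huge_page(h);
                i += pages_per_huge_page(target_hstate)) {
            struct page *subpage = nth_page(page, i);       /* not: page + i */

            /* ... prepare/free 'subpage' as a target_hstate-sized page ... */
    }
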
Chris von Recklinghausen 2f91f422ba mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
Bugzilla: https://bugzilla.redhat.com/2120352

commit c2cb0dcce9dd8b748b6ca8bb8d4a389f2e232307
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Jul 4 10:33:05 2022 +0900

    mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()

    Originally, copy_hugetlb_page_range() handled migration entries and
    hwpoisoned entries in a similar manner.  But recently the related code
    path gained more code for migration entries, and when
    is_writable_migration_entry() was converted to
    !is_readable_migration_entry(), hwpoison entries on source processes
    started to be unexpectedly updated (which is legitimate for migration
    entries, but not for hwpoison entries).  This results in serious issues
    like kernel panics when forking processes with hwpoison entries in a pmd.

    Separate the if branch into one for hwpoison entries and one for migration
    entries.

    Link: https://lkml.kernel.org/r/20220704013312.2415700-3-naoya.horiguchi@linux.dev
    Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>    [5.18]
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:12 -04:00
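
The shape of the fix, sketched below (variable names follow copy_hugetlb_page_range(), but this is not the exact upstream hunk): hwpoison entries get their own branch that copies the marker verbatim, so only real migration entries go through the writable-to-readable conversion.

    if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
            /* hwpoison marker: copy it as-is, never rewrite it */
            set_huge_pte_at(dst, addr, dst_pte, entry);
    } else if (unlikely(is_hugetlb_entry_migration(entry))) {
            swp_entry_t swp_entry = pte_to_swp_entry(entry);

            if (!is_readable_migration_entry(swp_entry) && cow) {
                    /* COW mappings require the source entry to be read-only */
                    swp_entry = make_readable_migration_entry(swp_offset(swp_entry));
                    entry = swp_entry_to_pte(swp_entry);
                    set_huge_pte_at(src, addr, src_pte, entry);
            }
            set_huge_pte_at(dst, addr, dst_pte, entry);
    }
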
Chris von Recklinghausen d6859c7858 mm/hugetlb: handle uffd-wp during fork()
Bugzilla: https://bugzilla.redhat.com/2120352

commit bc70fbf269fdff410b0b6d75c3770b9f59117b90
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:55 2022 -0700

    mm/hugetlb: handle uffd-wp during fork()

    Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(),
    because for uffd-wp it's the dst vma that matters when deciding how we
    should treat uffd-wp protected ptes.

    We should recognize pte markers during fork and do the pte copy if needed.

    [lkp@intel.com: vma_needs_copy can be static]
      Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
    Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:12 -04:00
Chris von Recklinghausen eed1e135d7 mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning
Bugzilla: https://bugzilla.redhat.com/2120352

commit b6a2619c60b41a929bbb9c09f193d690d707b1af
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning

    Let's verify when (un)pinning anonymous pages that we always deal with
    exclusive anonymous pages, which guarantees that we'll have a reliable
    PIN, meaning that we cannot end up with the GUP pin being inconsistent
    with the pages mapped into the page tables due to a COW triggered by a
    write fault.

    When pinning pages, after conditionally triggering GUP unsharing of
    possibly shared anonymous pages, we should always only see exclusive
    anonymous pages.  Note that anonymous pages that are mapped writable must
    be marked exclusive, otherwise we'd have a BUG.

    When pinning during ordinary GUP, simply add a check after our conditional
    GUP-triggered unsharing checks.  As we know exactly how the page is
    mapped, we know exactly in which page we have to check for
    PageAnonExclusive().

    When pinning via GUP-fast we have to be careful, because we can race with
    fork(): only after making sure via the seqcount that we didn't race with a
    concurrent fork() do we verify that we didn't end up pinning a possibly
    shared anonymous page.

    Similarly, when unpinning, verify that the pages are still marked as
    exclusive: otherwise something turned the pages possibly shared, which can
    result in random memory corruptions, which we really want to catch.

    With only the pinned pages at hand and not the actual page table entries
    we have to be a bit careful: hugetlb pages are always mapped via a single
    logical page table entry referencing the head page and PG_anon_exclusive
    of the head page applies.  Anon THP are a bit more complicated, because we
    might have obtained the page reference either via a PMD or a PTE --
    depending on the mapping type, PageAnonExclusive of either the head page
    (PMD-mapped THP) or the tail page (PTE-mapped THP) applies: as we don't
    know which, and to make our life easier, check that either is set.

    Take care to not verify in case we're unpinning during GUP-fast because we
    detected concurrent fork(): we might stumble over an anonymous page that
    is now shared.

    Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 5160dd7755 mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page
Bugzilla: https://bugzilla.redhat.com/2120352

commit a7f226604170acd6b142b76472c1a49c12ebb83d
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page

    Whenever GUP currently ends up taking a R/O pin on an anonymous page that
    might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
    on the page table entry will end up replacing the mapped anonymous page
    due to COW, resulting in the GUP pin no longer being consistent with the
    page actually mapped into the page table.

    The possible ways to deal with this situation are:
     (1) Ignore and pin -- what we do right now.
     (2) Fail to pin -- which would be rather surprising to callers and
         could break user space.
     (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
         pins.

    Let's implement 3) because it provides the clearest semantics and allows
    for checking in unpin_user_pages() and friends for possible BUGs: when
    trying to unpin a page that's no longer exclusive, clearly something went
    very wrong and might result in memory corruptions that might be hard to
    debug.  So we better have a nice way to spot such issues.

    This change implies that whenever user space *wrote* to a private mapping
    (IOW, we have an anonymous page mapped), that GUP pins will always remain
    consistent: reliable R/O GUP pins of anonymous pages.

    As a side note, this commit fixes the COW security issue for hugetlb with
    FOLL_PIN as documented in:
      https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
    The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
    instead of FOLL_PIN.

    Note that follow_huge_pmd() doesn't apply because we cannot end up in
    there with FOLL_PIN.

    This commit is heavily based on prototype patches by Andrea.

    Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.com
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 5b02d5be5f mm: support GUP-triggered unsharing of anonymous pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit c89357e27f20dda3fff6791d27bb6c91eae99f4a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm: support GUP-triggered unsharing of anonymous pages

    Whenever GUP currently ends up taking a R/O pin on an anonymous page that
    might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
    on the page table entry will end up replacing the mapped anonymous page
    due to COW, resulting in the GUP pin no longer being consistent with the
    page actually mapped into the page table.

    The possible ways to deal with this situation are:
     (1) Ignore and pin -- what we do right now.
     (2) Fail to pin -- which would be rather surprising to callers and
         could break user space.
     (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
         pins.

    We want to implement 3) because it provides the clearest semantics and
    allows for checking in unpin_user_pages() and friends for possible BUGs:
    when trying to unpin a page that's no longer exclusive, clearly something
    went very wrong and might result in memory corruptions that might be hard
    to debug.  So we better have a nice way to spot such issues.

    To implement 3), we need a way for GUP to trigger unsharing:
    FAULT_FLAG_UNSHARE.  FAULT_FLAG_UNSHARE is only applicable to R/O mapped
    anonymous pages and resembles COW logic during a write fault.  However, in
    contrast to a write fault, GUP-triggered unsharing will, for example,
    still maintain the write protection.

    Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write
    fault handlers for all applicable anonymous page types: ordinary pages,
    THP and hugetlb.

    * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
      marked exclusive in the meantime by someone else, there is nothing to do.
    * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
      marked exclusive, it will try detecting if the process is the exclusive
      owner. If exclusive, it can be set exclusive similar to reuse logic
      during write faults via page_move_anon_rmap() and there is nothing
      else to do; otherwise, we either have to copy and map a fresh,
      anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
      THP.

    This commit is heavily based on patches by Andrea.

    Link: https://lkml.kernel.org/r/20220428083441.37290-16-david@redhat.com
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
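
For illustration only (not from the commit): a hedged sketch of how a combined write/unshare fault handler can act on FAULT_FLAG_UNSHARE as described above.  FAULT_FLAG_UNSHARE, PageAnonExclusive() and page_move_anon_rmap() are names used in this series; page_is_exclusively_owned() and copy_and_replace() are purely illustrative placeholders.

    /* Hedged sketch (kernel-style): wp/unshare fault entry point. */
    static vm_fault_t handle_wp_or_unshare(struct vm_fault *vmf, struct page *page)
    {
            const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;

            if (PageAnonExclusive(page))
                    return 0;   /* already exclusive: nothing to do for unshare */

            if (page_is_exclusively_owned(page)) {          /* illustrative check */
                    page_move_anon_rmap(page, vmf->vma);    /* mark exclusive, keep R/O */
                    return 0;
            }

            /* Otherwise copy to a fresh exclusive page: an unshare fault maps it
             * R/O, a real write fault maps it writable. */
            return copy_and_replace(vmf, page, /* writable = */ !unshare);
    }
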
Chris von Recklinghausen 35ed883ed5 mm/gup: disallow follow_page(FOLL_PIN)
Bugzilla: https://bugzilla.redhat.com/2120352

commit 8909691b6c5a84b67573b23ee8bb917b005628f0
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm/gup: disallow follow_page(FOLL_PIN)

    We want to change the way we handle R/O pins on anonymous pages that might
    be shared: if we detect a possibly shared anonymous page -- mapped R/O and
    !PageAnonExclusive() -- we want to trigger unsharing via a page fault,
    resulting in an exclusive anonymous page that can be pinned reliably
    without getting replaced via COW on the next write fault.

    However, the required page fault will be problematic for follow_page(): in
    contrast to ordinary GUP, follow_page() doesn't trigger faults internally.
    So we would have to end up failing a R/O pin via follow_page(), although
    there is something mapped R/O into the page table, which might be rather
    surprising.

    We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
    internal MM function.  Let's just make our life easier and the semantics
    of follow_page() clearer by just disallowing FOLL_PIN for follow_page()
    completely.

    Link: https://lkml.kernel.org/r/20220428083441.37290-15-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 30e9a2455a mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c287605fd56466e645693eff3ae7c08fba56e0a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

    Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
    exclusive, and use that information to make GUP pins reliable and stay
    consistent with the page mapped into the page table even if the page table
    entry gets write-protected.

    With that information at hand, we can extend our COW logic to always reuse
    anonymous pages that are exclusive.  For anonymous pages that might be
    shared, the existing logic applies.

    As already documented, PG_anon_exclusive is usually only expressive in
    combination with a page table entry.  Especially PTE vs.  PMD-mapped
    anonymous pages require more thought, some examples: due to mremap() we
    can easily have a single compound page PTE-mapped into multiple page
    tables exclusively in a single process -- multiple page table locks apply.
    Further, due to MADV_WIPEONFORK we might not necessarily write-protect
    all PTEs, and only some subpages might be pinned.  Long story short: once
    PTE-mapped, we have to track information about exclusivity per sub-page,
    but until then, we can just track it for the compound page in the head
    page without having to update a whole bunch of subpages all of the time
    for a simple PMD mapping of a THP.

    For simplicity, this commit mostly talks about "anonymous pages", while
    it's for THP actually "the part of an anonymous folio referenced via a
    page table entry".

    To not spill PG_anon_exclusive code all over the mm code-base, we let the
    anon rmap code handle all PG_anon_exclusive logic it can easily handle.

    If a writable, present page table entry points at an anonymous (sub)page,
    that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliably
    pin (FOLL_PIN) on an anonymous page references via a present page table
    entry, it must only pin if PG_anon_exclusive is set for the mapped
    (sub)page.

    This commit doesn't adjust GUP, so this is only implicitly handled for
    FOLL_WRITE, follow-up commits will teach GUP to also respect it for
    FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
    reliable.

    Whenever an anonymous page is to be shared (fork(), KSM), or when
    temporarily unmapping an anonymous page (swap, migration), the relevant
    PG_anon_exclusive bit has to be cleared to mark the anonymous page
    possibly shared.  Clearing will fail if there are GUP pins on the page:

    * For fork(), this means having to copy the page and not being able to
      share it.  fork() protects against concurrent GUP using the PT lock and
      the src_mm->write_protect_seq.

    * For KSM, this means sharing will fail.  For swap, this means unmapping
      will fail; for migration, this means migration will fail early.  All
      three cases protect against concurrent GUP using the PT lock and a
      proper clear/invalidate+flush of the relevant page table entry.

    This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
    pinned page gets mapped R/O and the successive write fault ends up
    replacing the page instead of reusing it.  It improves the situation for
    O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
    fork() is *not* involved, however swapout and fork() are still
    problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
    users will fix the issue for them.

    I. Details about basic handling

    I.1. Fresh anonymous pages

    page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
    given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
    the mechanism fresh anonymous pages come into life (besides migration code
    where we copy the page->mapping), all fresh anonymous pages will start out
    as exclusive.

    I.2. COW reuse handling of anonymous pages

    When a COW handler stumbles over a (sub)page that's marked exclusive, it
    simply reuses it.  Otherwise, the handler tries harder under page lock to
    detect if the (sub)page is exclusive and can be reused.  If exclusive,
    page_move_anon_rmap() will mark the given (sub)page exclusive.

    Note that hugetlb code does not yet check for PageAnonExclusive(), as it
    still uses the old COW logic that is prone to the COW security issue
    because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
    pages are a scarce resource.

    I.3. Migration handling

    try_to_migrate() has to try marking an exclusive anonymous page shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  migrate_vma_collect_pmd() and
    __split_huge_pmd_locked() are handled similarly.

    Writable migration entries implicitly point at shared anonymous pages.
    For readable migration entries that information is stored via a new
    "readable-exclusive" migration entry, specific to anonymous pages.

    When restoring a migration entry in remove_migration_pte(), information
    about exclusivity is detected via the migration entry type, and
    RMAP_EXCLUSIVE is set accordingly for
    page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.

    I.4. Swapout handling

    try_to_unmap() has to try marking the mapped page possibly shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  For now, information about exclusivity is lost.  In
    the future, we might want to remember that information in the swap entry
    in some cases, however, it requires more thought, care, and a way to store
    that information in swap entries.

    I.5. Swapin handling

    do_swap_page() will never stumble over exclusive anonymous pages in the
    swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
    to detect manually if an anonymous page is exclusive and has to set
    RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

    I.6. THP handling

    __split_huge_pmd_locked() has to move the information about exclusivity
    from the PMD to the PTEs.

    a) In case we have a readable-exclusive PMD migration entry, simply
       insert readable-exclusive PTE migration entries.

    b) In case we have a present PMD entry and we don't want to freeze
       ("convert to migration entries"), simply forward PG_anon_exclusive to
       all sub-pages, no need to temporarily clear the bit.

    c) In case we have a present PMD entry and want to freeze, handle it
       similar to try_to_migrate(): try marking the page shared first.  In
       case we fail, we ignore the "freeze" instruction and simply split
       ordinarily.  try_to_migrate() will properly fail because the THP is
       still mapped via PTEs.

    When splitting a compound anonymous folio (THP), the information about
    exclusivity is implicitly handled via the migration entries: no need to
    replicate PG_anon_exclusive manually.

    I.7. fork() handling

    fork() handling is relatively easy, because PG_anon_exclusive is only
    expressive for some page table entry types.

    a) Present anonymous pages

    page_try_dup_anon_rmap() will mark the given subpage shared -- which will
    fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
    PMD to handle it on the PTE level).

    Note that device exclusive entries are just a pointer at a PageAnon()
    page.  fork() will first convert a device exclusive entry to a present
    page table and handle it just like present anonymous pages.

    b) Device private entry

    Device private entries point at PageAnon() pages that cannot be mapped
    directly and, therefore, cannot get pinned.

    page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
    fail because they cannot get pinned.

    c) HW poison entries

    PG_anon_exclusive will remain untouched and is stale -- the page table
    entry is just a placeholder after all.

    d) Migration entries

    Writable and readable-exclusive entries are converted to readable entries:
    possibly shared.

    I.8. mprotect() handling

    mprotect() only has to properly handle the new readable-exclusive
    migration entry:

    When write-protecting a migration entry that points at an anonymous page,
    remember the information about exclusivity via the "readable-exclusive"
    migration entry type.

    II. Migration and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a migration entry, we have to mark the page possibly
    shared and synchronize against GUP-fast by a proper clear/invalidate+flush
    to make the following scenario impossible:

    1. try_to_migrate() places a migration entry after checking for GUP pins
       and marks the page possibly shared.

    2. GUP-fast pins the page due to lack of synchronization

    3. fork() converts the "writable/readable-exclusive" migration entry into a
       readable migration entry

    4. Migration fails due to the GUP pin (failing to freeze the refcount)

    5. Migration entries are restored. PG_anon_exclusive is lost

    -> We have a pinned page that is not marked exclusive anymore.

    Note that we move information about exclusivity from the page to the
    migration entry as it otherwise highly overcomplicates fork() and
    PTE-mapping a THP.

    III. Swapout and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a swap entry, we have to mark the page possibly shared
    and synchronize against GUP-fast by a proper clear/invalidate+flush to
    make the following scenario impossible:

    1. try_to_unmap() places a swap entry after checking for GUP pins and
       clears exclusivity information on the page.

    2. GUP-fast pins the page due to lack of synchronization.

    -> We have a pinned page that is not marked exclusive anymore.

    If we'd ever store information about exclusivity in the swap entry,
    similar to migration handling, the same considerations as in II would
    apply.  This is future work.

    Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
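
For illustration only (not from the commit): a hedged sketch of the pinning rule spelled out above ("only pin if PG_anon_exclusive is set"); the helper name is illustrative, and THP tail-page handling and sanity checks are elided.

    /* Hedged sketch (kernel-style): may GUP take a FOLL_PIN pin on this page? */
    static bool may_pin_anon_page(struct page *page, unsigned int gup_flags)
    {
            if (!(gup_flags & FOLL_PIN) || !PageAnon(page))
                    return true;    /* the rule only applies to FOLL_PIN on anon pages */

            /* Only a page marked exclusive yields a reliable pin; otherwise the
             * caller must first trigger unsharing (FAULT_FLAG_UNSHARE). */
            return PageAnonExclusive(page);
    }
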
Chris von Recklinghausen d8f21270d3 mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit fb3d824d1a46c5bb0584ea88f32dc2495544aebf
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()

    ...  and move the special check for pinned pages into
    page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
    via a new pageflag, clearing it only after making sure that there are no
    GUP pins on the anonymous page.

    We really only care about pins on anonymous pages, because they are prone
    to getting replaced in the COW handler once mapped R/O.  For !anon pages
    in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
    that; at least I could not come up with an example where it matters.

    Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
    know we're dealing with anonymous pages.  Also, drop the handling of
    pinned pages from copy_huge_pud() and add a comment if ever supporting
    anonymous pages on the PUD level.

    This is a preparation for tracking exclusivity of anonymous pages in the
    rmap code, and disallowing marking a page shared (-> failing to duplicate)
    if there are GUP pins on a page.

    Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 1eb12f2035 mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 623a1ddfeb232526275ddd0c8378771e6712aad4
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:42 2022 -0700

    mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range()

    Let's do it just like copy_page_range(), taking the seqlock and making
    sure the mmap_lock is held in write mode.

    This allows adding a VM_BUG_ON to page_needs_cow_for_dma() and properly
    synchronizes concurrent fork() with GUP-fast of hugetlb pages, which will
    be relevant for further changes.

    Link: https://lkml.kernel.org/r/20220428083441.37290-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 73357efc3a mm: hugetlb: considering PMD sharing when flushing cache/TLBs
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3d0b95cd87b26b0b10e0cda8ee6105c2194a5800
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Mon May 9 18:20:52 2022 -0700

    mm: hugetlb: considering PMD sharing when flushing cache/TLBs

    This patchset fixes some cache flushing issues if PMD sharing is possible
    for hugetlb pages, which were found by code inspection.  Meanwhile Mike
    found that flush_cache_page() cannot cover the whole size of a hugetlb
    page on some architectures [1], so I added a new patch 3 to fix this
    issue, since after some investigation I found that only try_to_unmap_one()
    and try_to_migrate_one() need fixing.

    [1] https://lore.kernel.org/linux-mm/064da3bb-5b4b-7332-a722-c5a541128705@oracle.com/

    This patch (of 3):

    When moving hugetlb page tables, the cache flushing is called in
    move_page_tables() without considering the shared PMDs, which may cause
    cache issues on some architectures.

    Thus we should move the hugetlb cache flushing into
    move_hugetlb_page_tables() with considering the shared PMDs ranges,
    calculated by adjust_range_if_pmd_sharing_possible().  Meanwhile also
    expanding the TLBs flushing range in case of shared PMDs.

    Note this was discovered via code inspection, and has not been observed as
    a real problem in practice so far.

    Link: https://lkml.kernel.org/r/cover.1651056365.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/0443c8cf20db554d3ff4b439b30e0ff26c0181dd.1651056365.git.baolin.wang@linux.alibaba.com
    Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:08 -04:00
Chris von Recklinghausen b4381e605e mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 78fbe906cc900b33ce078102e13e0e99b5b8c406
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages

    The basic question we would like to have a reliable and efficient answer
    to is: is this anonymous page exclusive to a single process or might it be
    shared?  We need that information for ordinary/single pages, hugetlb
    pages, and possibly each subpage of a THP.

    Introduce a way to mark an anonymous page as exclusive, with the ultimate
    goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
    lose consistency with the pages mapped into the page table, resulting in
    reported memory corruptions.

    Most pageflags already have semantics for anonymous pages, however,
    PG_mappedtodisk should never apply to pages in the swapcache, so let's
    reuse that flag.

    As PG_has_hwpoisoned also uses that flag on the second tail page of a
    compound page, convert it to PG_error instead, which is marked as
    PF_NO_TAIL, so never used for tail pages.

    Use custom page flag modification functions such that we can do additional
    sanity checks.  The semantics we'll put into some kernel doc in the future
    are:

    "
      PG_anon_exclusive is *usually* only expressive in combination with a
      page table entry. Depending on the page table entry type it might
      store the following information:

           Is what's mapped via this page table entry exclusive to the
           single process and can be mapped writable without further
           checks? If not, it might be shared and we might have to COW.

      For now, we only expect PTE-mapped THPs to make use of
      PG_anon_exclusive in subpages. For other anonymous compound
      folios (i.e., hugetlb), only the head page is logically mapped and
      holds this information.

      For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
      set on the head page. When replacing the PMD by a page table full
      of PTEs, PG_anon_exclusive, if set on the head page, will be set on
      all tail pages accordingly. Note that converting from a PTE-mapping
      to a PMD mapping using the same compound page is currently not
      possible and consequently doesn't require care.

      If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
      it should only pin if the relevant PG_anon_exclusive is set. In that
      case, the pin will be fully reliable and stay consistent with the pages
      mapped into the page table, as the bit cannot get cleared (e.g., by
      fork(), KSM) while the page is pinned. For anonymous pages that
      are mapped R/W, PG_anon_exclusive can be assumed to always be set
      because such pages cannot possibly be shared.

      The page table lock protecting the page table entry is the primary
      synchronization mechanism for PG_anon_exclusive; GUP-fast that does
      not take the PT lock needs special care when trying to clear the
      flag.

      Page table entry types and PG_anon_exclusive:
      * Present: PG_anon_exclusive applies.
      * Swap: the information is lost. PG_anon_exclusive was cleared.
      * Migration: the entry holds this information instead.
                   PG_anon_exclusive was cleared.
      * Device private: PG_anon_exclusive applies.
      * Device exclusive: PG_anon_exclusive applies.
      * HW Poison: PG_anon_exclusive is stale and not changed.

      If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
      not allowed and the flag will stick around until the page is freed
      and folio->mapping is cleared.
    "

    We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
    zapping) of page table entries; page freeing code will handle that when it
    also invalidates page->mapping to no longer indicate PageAnon().  Letting
    information about exclusivity stick around will be an important property
    when adding sanity checks to unpinning code.

    Note that we properly clear the flag in free_pages_prepare() via
    PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
    so there is no need to manually clear the flag.

    Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:08 -04:00
Chris von Recklinghausen a805faea7e mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 405ce051236cc65b30bbfe490b28ce60ae6aed85
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 21 16:35:33 2022 -0700

    mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

    There is a race condition between memory_failure_hugetlb() and hugetlb
    free/demotion, which causes the PageHWPoison flag to be set on the wrong page.
    The one simple result is that wrong processes can be killed, but another
    (more serious) one is that the actual error is left unhandled, so no one
    prevents later access to it, and that might lead to more serious results
    like consuming corrupted data.

    Think about the below race window:

      CPU 1                                   CPU 2
      memory_failure_hugetlb
      struct page *head = compound_head(p);
                                              hugetlb page might be freed to
                                              buddy, or even changed to another
                                              compound page.

      get_hwpoison_page -- page is not what we want now...

    The current code first does prechecks roughly and then reconfirms after
    taking the refcount, but it turns out this makes the code overly complicated,
    so move the prechecks into a single hugetlb_lock range.

    A newly introduced function, try_memory_failure_hugetlb(), always takes
    hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
    memory_failure() is rare in principle, so should not be a big problem.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
    Fixes: 761ad8d7c7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen b6778dceeb userfaultfd: provide unmasked address on page-fault
Bugzilla: https://bugzilla.redhat.com/2120352

commit 824ddc601adc2cc48efb7f58b57997986c1c1276
Author: Nadav Amit <namit@vmware.com>
Date:   Tue Mar 22 14:45:32 2022 -0700

    userfaultfd: provide unmasked address on page-fault

    Userfaultfd is supposed to provide the full address (i.e., unmasked) of
    the faulting access back to userspace.  However, that has not been the case
    for quite some time.

    Even running "userfaultfd_demo" from the userfaultfd man page provides the
    wrong output (and contradicts the man page).  Notice that
    "UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
    not the first read address (0x7fc5e30b300f).

            Address returned by mmap() = 0x7fc5e30b3000

            fault_handler_thread():
                poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
                UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
                    (uffdio_copy.copy returned 4096)
            Read address 0x7fc5e30b300f in main(): A
            Read address 0x7fc5e30b340f in main(): A
            Read address 0x7fc5e30b380f in main(): A
            Read address 0x7fc5e30b3c0f in main(): A

    The exact address is useful for various reasons and specifically for
    prefetching decisions.  If it is known that the memory is populated by
    certain objects whose size is not page-aligned, then based on the faulting
    address, the uffd-monitor can decide whether to prefetch and prefault the
    adjacent page.

    This bug has been in the kernel for quite some time: since commit
    1a29d85eb0 ("mm: use vmf->address instead of of vmf->virtual_address"),
    which dates back to 2016.  A concern has been raised that existing
    userspace applications might rely on the old/wrong
    behavior in which the address is masked.  Therefore, it was suggested to
    provide the masked address unless the user explicitly asks for the exact
    address.

    Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
    userfaultfd to provide the exact address.  Add a new "real_address" field
    to vmf to hold the unmasked address.  Provide the address to userspace
    accordingly.

    Initialize real_address in various code-paths to be consistent with
    address, even when it is not used, to be on the safe side.

    [namit@vmware.com: initialize real_address on all code paths, per Jan]
      Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
    [akpm@linux-foundation.org: fix typo in comment, per Jan]

    Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
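
For illustration only (not part of the commit): a minimal userspace sketch of opting in to the new behaviour.  It assumes a kernel whose uffd API advertises UFFD_FEATURE_EXACT_ADDRESS; error handling is trimmed.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    int main(void)
    {
            long uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = {
                    .api = UFFD_API,
                    /* ask for unmasked (exact) fault addresses in uffd_msg */
                    .features = UFFD_FEATURE_EXACT_ADDRESS,
            };

            if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0) {
                    perror("userfaultfd/UFFDIO_API");
                    return 1;
            }
            /* From now on, msg.arg.pagefault.address read from uffd is the exact
             * faulting address rather than the page-masked one. */
            printf("negotiated features: 0x%llx\n", (unsigned long long)api.features);
            close(uffd);
            return 0;
    }
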
Chris von Recklinghausen 04b5285d2b mm/hugetlb: use helper macro __ATTR_RW
Bugzilla: https://bugzilla.redhat.com/2120352

commit 98bc26ac770fe507b4c23f5ee748f641146fb076
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:45:23 2022 -0700

    mm/hugetlb: use helper macro __ATTR_RW

    Use helper macro __ATTR_RW to define HSTATE_ATTR to make code more clear.
    Minor readability improvement.

    Link: https://lkml.kernel.org/r/20220222112731.33479-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
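
For illustration only (not a quote of the diff): the kind of simplification __ATTR_RW() enables for a paired _show/_store attribute.  __ATTR_RW() is defined in <linux/sysfs.h> as __ATTR(_name, 0644, _name##_show, _name##_store), so the two macros below expand identically; the "_OLD" name is illustrative.

    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    /* Before: mode and both handlers spelled out explicitly. */
    #define HSTATE_ATTR_OLD(_name) \
            static struct kobj_attribute _name##_attr = \
                    __ATTR(_name, 0644, _name##_show, _name##_store)

    /* After: __ATTR_RW() implies 0644 and the _name##_show/_name##_store pair. */
    #define HSTATE_ATTR(_name) \
            static struct kobj_attribute _name##_attr = __ATTR_RW(_name)
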
Chris von Recklinghausen 098b1944c2 mm/hugetlb: fix kernel crash with hugetlb mremap
Bugzilla: https://bugzilla.redhat.com/2120352

commit db110a99d3367936058727ff4798e3a39c707969
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Feb 25 19:10:56 2022 -0800

    mm/hugetlb: fix kernel crash with hugetlb mremap

    This fixes the below crash:

      kernel BUG at include/linux/mm.h:2373!
      cpu 0x5d: Vector: 700 (Program Check) at [c00000003c6e76e0]
          pc: c000000000581a54: pmd_to_page+0x54/0x80
          lr: c00000000058d184: move_hugetlb_page_tables+0x4e4/0x5b0
          sp: c00000003c6e7980
         msr: 9000000000029033
        current = 0xc00000003bd8d980
        paca    = 0xc000200fff610100   irqmask: 0x03   irq_happened: 0x01
          pid   = 9349, comm = hugepage-mremap
      kernel BUG at include/linux/mm.h:2373!
        move_hugetlb_page_tables+0x4e4/0x5b0 (link register)
        move_hugetlb_page_tables+0x22c/0x5b0 (unreliable)
        move_page_tables+0xdbc/0x1010
        move_vma+0x254/0x5f0
        sys_mremap+0x7c0/0x900
        system_call_exception+0x160/0x2c0

    The kernel can't use huge_pte_offset() before it sets the pte entry, because
    a page table lookup checks for the huge PTE bit in the page table to
    differentiate between a huge pte entry and a pointer to a pte page.
    huge_pte_alloc() won't mark the page table entry huge, and hence the kernel
    should not use huge_pte_offset() after huge_pte_alloc().

    Link: https://lkml.kernel.org/r/20220211063221.99293-1-aneesh.kumar@linux.ibm.com
    Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Mina Almasry <almasrymina@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:45 -04:00
Chris von Recklinghausen a7309fba72 hugetlbfs: flush before unlock on move_hugetlb_page_tables()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 13e4ad2ce8df6e058ef482a31fdd81c725b0f7ea
Author: Nadav Amit <namit@vmware.com>
Date:   Sun Nov 21 12:40:08 2021 -0800

    hugetlbfs: flush before unlock on move_hugetlb_page_tables()

    We must flush the TLB before releasing i_mmap_rwsem to avoid the
    potential reuse of an unshared PMDs page.  This is not true in the case
    of move_hugetlb_page_tables().  The last reference on the page table can
    therefore be dropped before the TLB flush takes place.

    Prevent it by reordering the operations and flushing the TLB before
    releasing i_mmap_rwsem.

    Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:32 -04:00
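
For illustration only (not the actual diff): a hedged two-line sketch of the ordering constraint; vma, old_addr, old_end and the i_mmap_rwsem held on vma->vm_file's mapping are assumed to come from the surrounding move_hugetlb_page_tables()-style function.

    /* Flush while the rwsem is still held ... */
    flush_tlb_range(vma, old_addr, old_end);
    /* ... and only then release it, so an unshared PMD page table page cannot
     * be reused while stale TLB entries still reference it. */
    i_mmap_unlock_write(vma->vm_file->f_mapping);
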
Chris von Recklinghausen a807a3ea77 hugetlb: fix hugetlb cgroup refcounting during mremap
Bugzilla: https://bugzilla.redhat.com/2120352

commit afe041c2d0febd83698b8b0164e6b3b1dfae0b66
Author: Bui Quang Minh <minhquangbui99@gmail.com>
Date:   Fri Nov 19 16:43:40 2021 -0800

    hugetlb: fix hugetlb cgroup refcounting during mremap

    When hugetlb_vm_op_open() is called during copy_vma(), we may take the
    reference to resv_map->css.  Later, when clearing the reservation
    pointer of old_vma after transferring it to new_vma, we forget to drop
    the reference to resv_map->css.  This leads to a reference leak of css.

    Fix this by adding a check to drop the reservation css reference in
    clear_vma_resv_huge_pages().

    Link: https://lkml.kernel.org/r/20211113154412.91134-1-minhquangbui99@gmail.com
    Fixes: 550a7d60bd5e35 ("mm, hugepages: add mremap() support for hugepage backed vma")
    Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Mina Almasry <almasrymina@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:32 -04:00
Chris von Recklinghausen 6bc54c3a8d hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 76efc67a5e7a3dc1226c4ad1b266a15741347031
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Nov 5 13:42:01 2021 -0700

    hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range()

    When calling hugetlb_resv_map_add(), we've guaranteed that the parameter
    'to' is always larger than 'from', so hugetlb_resv_map_add() never returns
    a negative value.  Thus remove the redundant VM_BUG_ON().

    Link: https://lkml.kernel.org/r/2b565552f3d06753da1e8dda439c0d96d6d9a5a3.1634797639.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 14640fd949 hugetlb: remove redundant validation in has_same_uncharge_info()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 0739eb437f3d397eb39b5dc653aa250ee7b453f0
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Nov 5 13:41:58 2021 -0700

    hugetlb: remove redundant validation in has_same_uncharge_info()

    The callers of has_same_uncharge_info() have already accessed the original
    file_region and the new file_region, and these cannot be NULL now.

    So we can remove the file_region validation in has_same_uncharge_info()
    to simplify the code.

    Link: https://lkml.kernel.org/r/97fc68d3f8d34f63c204645e10d7a718997e50b7.1634797639.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 691a2d775f hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments
Bugzilla: https://bugzilla.redhat.com/2120352

commit aa6d2e8cba2dc6f1f3dd39f7a7cc8ac788ad6c1a
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Nov 5 13:41:55 2021 -0700

    hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments

    After commit 8382d914eb ("mm, hugetlb: improve page-fault
    scalability"), the hugetlb_instantiation_mutex lock had been replaced by
    hugetlb_fault_mutex_table to serialize faults on the same logical page.

    Thus update the obsolete hugetlb_instantiation_mutex related comments.

    Link: https://lkml.kernel.org/r/4b3febeae37455ff7b74aa0aad16cc6909cf0926.1634797639.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 1e2a6f4040 mm, hugepages: add mremap() support for hugepage backed vma
Bugzilla: https://bugzilla.redhat.com/2120352

commit 550a7d60bd5e35a56942dba6d8a26752beb26c9f
Author: Mina Almasry <almasrymina@google.com>
Date:   Fri Nov 5 13:41:40 2021 -0700

    mm, hugepages: add mremap() support for hugepage backed vma

    Support mremap() for hugepage backed vma segment by simply repositioning
    page table entries.  The page table entries are repositioned to the new
    virtual address on mremap().

    Hugetlb mremap() support is of course generic; my motivating use case is
    a library (hugepage_text), which reloads the ELF text of executables in
    hugepages.  This significantly increases the execution performance of
    said executables.

    Restrict the mremap operation on hugepages to up to the size of the
    original mapping as the underlying hugetlb reservation is not yet
    capable of handling remapping to a larger size.

    During the mremap() operation we detect pmd_share'd mappings and we
    unshare those during the mremap().  On access and fault the sharing is
    established again.

    Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com
    Signed-off-by: Mina Almasry <almasrymina@google.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ken Chen <kenchen@google.com>
    Cc: Chris Kennelly <ckennelly@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Kirill Shutemov <kirill@shutemov.name>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
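
For illustration only (not from the commit): a minimal userspace sketch of remapping a hugepage-backed anonymous mapping.  It assumes 2 MB hugepages are configured and available; error handling is trimmed.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL << 20)   /* assumes 2 MB hugepages */

    int main(void)
    {
            void *old = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (old == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");
                    return 1;
            }

            /* Repositions the hugetlb page table entries; the new size must stay
             * hugepage-aligned and must not exceed the original mapping. */
            void *new = mremap(old, HPAGE_SIZE, HPAGE_SIZE, MREMAP_MAYMOVE);
            if (new == MAP_FAILED) {
                    perror("mremap");
                    return 1;
            }
            printf("moved from %p to %p\n", old, new);
            return 0;
    }
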
Chris von Recklinghausen 01fdf9c347 mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h
Bugzilla: https://bugzilla.redhat.com/2120352

commit 73c54763482b841d9c14bad87eec98f80f700e0b
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Nov 5 13:41:17 2021 -0700

    mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h

    Remove __unmap_hugepage_range() from the header file, because it is only
    used in hugetlb.c.

    Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen a069b82226 mm: use for_each_online_node and node_online instead of open coding
Bugzilla: https://bugzilla.redhat.com/2120352

commit 30a514002db23fd630e3e52a2bdfb05c0de03378
Author: Peng Liu <liupeng256@huawei.com>
Date:   Fri Apr 29 14:36:58 2022 -0700

    mm: use for_each_online_node and node_online instead of open coding

    Use more generic functions to deal with issues related to online nodes.
    These changes simplify the code.

    Link: https://lkml.kernel.org/r/20220429030218.644635-1-liupeng256@huawei.com
    Signed-off-by: Peng Liu <liupeng256@huawei.com>
    Suggested-by: Davidlohr Bueso <dave@stgolabs.net>
    Suggested-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:19 -04:00
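
To illustrate the idiom being adopted, the sketch below contrasts an open-coded walk over node IDs with a walk over only the online nodes. It is a userspace analogy: for_each_online_node() and node_online() are kernel macros, so the FOR_EACH_ONLINE_NODE macro and node map here are local stand-ins defined only to make the example runnable.

    /*
     * Userspace analogy of the refactor's idiom; local stand-ins for the
     * kernel's node_online()/for_each_online_node() helpers.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NODES 4
    static const bool node_online_map[MAX_NODES] = { true, false, true, false };

    /* Local analogue of node_online(). */
    static bool node_online(int nid)
    {
        return nid >= 0 && nid < MAX_NODES && node_online_map[nid];
    }

    /* Local analogue of for_each_online_node(). */
    #define FOR_EACH_ONLINE_NODE(nid) \
        for ((nid) = 0; (nid) < MAX_NODES; (nid)++) \
            if (node_online(nid))

    int main(void)
    {
        int nid;

        /*
         * Open-coded walk the patch replaces: it assumes node IDs 0..N-1 are
         * the online ones, so it visits offline node 1 and never reaches
         * online node 2.
         */
        for (nid = 0; nid < 2 /* nr_online_nodes */; nid++)
            printf("open-coded walk visits node %d\n", nid);

        /* Generic-style walk: visits exactly the online nodes, sparse or not. */
        FOR_EACH_ONLINE_NODE(nid)
            printf("for_each_online_node-style walk visits node %d\n", nid);

        return 0;
    }
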
Chris von Recklinghausen 4663ad75d5 hugetlb: fix return value of __setup handlers
Bugzilla: https://bugzilla.redhat.com/2120352

commit f81f6e4b5eedb41045fd0ccc8ea31f4f07ce0993
Author: Peng Liu <liupeng256@huawei.com>
Date:   Fri Apr 29 14:36:57 2022 -0700

    hugetlb: fix return value of __setup handlers

    When a __setup() handler returns '0', an invalid option value causes the
    entire kernel boot option string to be reported as Unknown.  Hugetlb's
    __setup() handlers return '0' when given an invalid parameter string.

    The following phenomenon is observed:
     cmdline:
      hugepagesz=1Y hugepages=1
     dmesg:
      HugeTLB: unsupported hugepagesz=1Y
      HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring
      Unknown kernel command line parameters "hugepagesz=1Y hugepages=1"

    Since hugetlb already prints warning/error information before returning
    for an invalid parameter string, just return '1' to avoid printing it again.

    Link: https://lkml.kernel.org/r/20220413032915.251254-4-liupeng256@huawei.com
    Signed-off-by: Peng Liu <liupeng256@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Yuntao <liuyuntao10@huawei.com>
    Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:18 -04:00
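
The sketch below is a userspace illustration of the return-value convention this fix relies on: a __setup()-style handler that returns 0 is treated as not having handled the option, so the option is additionally reported as an unknown parameter even though the handler already printed its own error; returning 1 claims the option and avoids the duplicate report. The parsing loop and handler names are stand-ins, not kernel code.

    /*
     * Userspace stand-in for the __setup() return convention (not kernel
     * code): returning 0 means "option not handled", so it is reported again
     * as an unknown parameter; returning 1 claims the option.
     */
    #include <stdio.h>
    #include <string.h>

    static int hugepagesz_handler(const char *val, int rc_on_invalid)
    {
        if (strcmp(val, "2M") != 0 && strcmp(val, "1G") != 0) {
            printf("HugeTLB: unsupported hugepagesz=%s\n", val);
            return rc_on_invalid;    /* old code: 0, fixed code: 1 */
        }
        printf("HugeTLB: registered %s page size\n", val);
        return 1;
    }

    static void boot_with(const char *val, int rc_on_invalid)
    {
        if (!hugepagesz_handler(val, rc_on_invalid))
            printf("Unknown kernel command line parameters \"hugepagesz=%s\"\n",
                   val);
    }

    int main(void)
    {
        boot_with("1Y", 0);    /* old behavior: the error is reported twice */
        boot_with("1Y", 1);    /* fixed behavior: one accurate error message */
        boot_with("2M", 1);
        return 0;
    }
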
Chris von Recklinghausen 363a0a617f hugetlb: fix hugepages_setup when deal with pernode
Bugzilla: https://bugzilla.redhat.com/2120352

commit f87442f407af80dac4dc81c8a7772b71b36b2e09
Author: Peng Liu <liupeng256@huawei.com>
Date:   Fri Apr 29 14:36:57 2022 -0700

    hugetlb: fix hugepages_setup when deal with pernode

    Hugepages can be specified per node since "hugetlbfs: extend the
    definition of hugepages parameter to support node allocation", but the
    following problem is observed.

    Confusing behavior is observed when both 1G and 2M hugepages are set
    after "numa=off".
     cmdline hugepage settings:
      hugepagesz=1G hugepages=0:3,1:3
      hugepagesz=2M hugepages=0:1024,1:1024
     results:
      HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
      HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages

    Furthermore, confusing behavior can also be observed when an invalid node
    appears after a valid node.  To fix this, never allocate any hugepages of
    that size when an invalid parameter is received.

    Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com
    Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
    Signed-off-by: Peng Liu <liupeng256@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Yuntao <liuyuntao10@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:18 -04:00
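
For illustration, the userspace sketch below applies the rule this fix enforces when parsing a per-node hugepages= string: if any "node:count" pair names an offline or out-of-range node, reject the entire string and allocate nothing, rather than applying only the pairs seen so far. The function name, node set, and limits are assumptions made for the example.

    /*
     * Userspace sketch of the per-node hugepages= parsing rule: any invalid
     * "node:count" pair rejects the whole string.  Names and limits are
     * invented for the example.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 4
    static const int node_is_online[MAX_NODES] = { 1, 0, 1, 0 };    /* nodes 0, 2 */

    static int parse_hugepages_pernode(const char *arg, long counts[MAX_NODES])
    {
        long tmp[MAX_NODES] = { 0 };
        char buf[128], *save = NULL;

        snprintf(buf, sizeof(buf), "%s", arg);
        for (char *tok = strtok_r(buf, ",", &save); tok;
             tok = strtok_r(NULL, ",", &save)) {
            int node;
            long count;

            if (sscanf(tok, "%d:%ld", &node, &count) != 2 ||
                node < 0 || node >= MAX_NODES || !node_is_online[node]) {
                fprintf(stderr, "invalid pair \"%s\", allocating nothing\n", tok);
                return -1;    /* reject the whole parameter string */
            }
            tmp[node] = count;
        }
        memcpy(counts, tmp, sizeof(tmp));
        return 0;
    }

    int main(void)
    {
        long counts[MAX_NODES];

        if (parse_hugepages_pernode("0:3,1:3", counts) == 0)    /* node 1 offline */
            printf("applied\n");
        if (parse_hugepages_pernode("0:1024,2:1024", counts) == 0)
            printf("applied: node0=%ld node2=%ld\n", counts[0], counts[2]);
        return 0;
    }
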
Chris von Recklinghausen 8062f6f82d hugetlb: fix wrong use of nr_online_nodes
Bugzilla: https://bugzilla.redhat.com/2120352

commit 0a7a0f6f7f3679c906fc55e3805c1d5e2c566f55
Author: Peng Liu <liupeng256@huawei.com>
Date:   Fri Apr 29 14:36:57 2022 -0700

    hugetlb: fix wrong use of nr_online_nodes

    Patch series "hugetlb: Fix some incorrect behavior", v3.

    This series fixes three hugetlb bugs:
    1) Invalid use of nr_online_nodes;
    2) Inconsistency between 1G hugepage and 2M hugepage;
    3) Useless information in dmesg.

    This patch (of 4):

    Certain systems are designed to have sparse/discontiguous nodes.  In this
    case, nr_online_nodes cannot be used to walk through the NUMA nodes.  Also,
    a valid node ID may be greater than nr_online_nodes.

    However, hugetlb assumes that nodes are contiguous.

    For sparse/discontiguous nodes, the current code may treat a valid node as
    invalid, and will fail to allocate hugepages on any valid node for which
    "nid >= nr_online_nodes".

    As David suggested:

            if (tmp >= nr_online_nodes)
                    goto invalid;

    Just imagine node 0 and node 2 are online, and node 1 is offline.
    Assuming that "node < 2" is valid is wrong.

    Recheck all the places that use nr_online_nodes, and repair them one by
    one.

    [liupeng256@huawei.com: v4]
      Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com
    Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com
    Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com
    Fixes: 4178158ef8ca ("hugetlbfs: fix issue of preallocation of gigantic pages can't work")
    Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
    Fixes: e79ce9832316 ("hugetlbfs: fix a truncation issue in hugepages parameter")
    Fixes: f9317f77a6e0 ("hugetlb: clean up potential spectre issue warnings")
    Signed-off-by: Peng Liu <liupeng256@huawei.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Liu Yuntao <liuyuntao10@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:18 -04:00
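
A minimal userspace sketch of the pitfall described above: with sparse online nodes {0, 2}, the online-node count is 2, so a "nid < nr_online_nodes" validity check wrongly accepts offline node 1 and wrongly rejects online node 2, while a membership test (the node_online() analogue) gets both right. The node map below is an assumption for the example.

    /*
     * Userspace sketch: count-based node validity checks break with sparse
     * online nodes; a membership (node_online()-style) check does not.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NODES 4
    static const bool online[MAX_NODES] = { true, false, true, false };

    int main(void)
    {
        int nr_online = 0;    /* analogue of nr_online_nodes */

        for (int nid = 0; nid < MAX_NODES; nid++)
            nr_online += online[nid];    /* == 2 for online nodes {0, 2} */

        for (int nid = 0; nid < MAX_NODES; nid++) {
            bool by_count = nid < nr_online;    /* buggy validity check */
            bool by_mask  = online[nid];        /* node_online()-style check */

            printf("node %d: count-based says %s, mask-based says %s\n",
                   nid, by_count ? "valid" : "invalid",
                   by_mask ? "valid" : "invalid");
        }
        return 0;
    }
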
Chris von Recklinghausen 63a09ef932 mm: hugetlb_vmemmap: cleanup hugetlb_vmemmap related functions
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5981611d0a006472d367d7a8e6ead8afaecf17c7
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Thu Apr 28 23:16:14 2022 -0700

    mm: hugetlb_vmemmap: cleanup hugetlb_vmemmap related functions

    Patch series "cleanup hugetlb_vmemmap".

    The word of "free" is not expressive enough to express the feature of
    optimizing vmemmap pages associated with each HugeTLB, rename this keywork
    to "optimize" is more clear.  In this series, cheanup related codes to
    make it more clear and expressive.  This is suggested by David.

    This patch (of 3):

    The word of "free" is not expressive enough to express the feature of
    optimizing vmemmap pages associated with each HugeTLB, rename this keywork
    to "optimize".  And some function names are prefixed with "huge_page"
    instead of "hugetlb", it is easily to be confused with THP.  In this
    patch, cheanup related functions to make code more clear and expressive.

    Link: https://lkml.kernel.org/r/20220404074652.68024-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20220404074652.68024-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:18 -04:00