Commit Graph

555 Commits

Author SHA1 Message Date
Luiz Capitulino 7535c3fec0 Revert "mm: add vma_has_recency()"
This reverts commit d908e3177a.

JIRA: https://issues.redhat.com/browse/RHEL-80655
Upstream Status: RHEL-only

It was found that the introduction of POSIX_FADV_NOREUSE in 9.6 is
causing a 10-20x slowdown in OCP's etcd compaction, which we believe
is due to increased OCP API latency.

In particular, we believe that this commit regresses MADV_RANDOM in a
way that degrades performance for applications using this hint: after
this commit, the pages backing VMAs marked for random access will not
receive a second chance to be re-activated once they are on the LRU
inactive list.

The conflict is due to downstream commit a85223eeb8 ("mm: ptep_get()
conversion") in mm/rmap.c::folio_referenced_one().

Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
2025-03-06 16:13:44 -05:00
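
For context, the helper being reverted is small; a rough sketch of its upstream form follows (the RHEL-9 version may differ in detail):

    static inline bool vma_has_recency(struct vm_area_struct *vma)
    {
            /* VM_SEQ_READ / VM_RAND_READ (madvise MADV_SEQUENTIAL /
             * MADV_RANDOM) opt the VMA out of the LRU's recency heuristics.
             */
            if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
                    return false;

            /* FMODE_NOREUSE is set via POSIX_FADV_NOREUSE in a related patch. */
            if (vma->vm_file && (vma->vm_file->f_mode & FMODE_NOREUSE))
                    return false;

            return true;
    }
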
Rafael Aquini 2a9317ff87 mm/rmap: pass folio to hugepage_add_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 09c550508a4b8f7844b197cc16877dd0f7c42d8f
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:13 2023 +0200

    mm/rmap: pass folio to hugepage_add_anon_rmap()

    Let's pass a folio; we are always mapping the entire thing.

    Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:22 -05:00
Rafael Aquini 5949d07873 mm/rmap: simplify PageAnonExclusive sanity checks when adding anon rmap
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 132b180f06a74ddfc526709928036db3b7a1cf6d
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:12 2023 +0200

    mm/rmap: simplify PageAnonExclusive sanity checks when adding anon rmap

    Let's sanity-check PageAnonExclusive vs.  mapcount in page_add_anon_rmap()
    and hugepage_add_anon_rmap() after setting PageAnonExclusive simply by
    re-reading the mapcounts.

    We can stop initializing the "first" variable in page_add_anon_rmap() and
    no longer need an atomic_inc_and_test() in hugepage_add_anon_rmap().

    While at it, switch to VM_WARN_ON_FOLIO().

    [david@redhat.com: update check for doubly-mapped page]
      Link: https://lkml.kernel.org/r/d8e5a093-2e22-c14b-7e64-6da280398d9f@redhat.com
    Link: https://lkml.kernel.org/r/20230913125113.313322-6-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:21 -05:00
Rafael Aquini 0f5daa7014 mm/rmap: warn on new PTE-mapped folios in page_add_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit a1f34ee1de2c3a55bc2a6b9a38e1ecd2830dcc03
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:11 2023 +0200

    mm/rmap: warn on new PTE-mapped folios in page_add_anon_rmap()

    If swapin code would ever decide to not use order-0 pages and supply a
    PTE-mapped large folio, we will have to change how we call
    __folio_set_anon() -- eventually with exclusive=false and an adjusted
    address.  For now, let's add a VM_WARN_ON_FOLIO() with a comment about the
    situation.

    Link: https://lkml.kernel.org/r/20230913125113.313322-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:20 -05:00
Rafael Aquini 8261f40a65 mm/rmap: move folio_test_anon() check out of __folio_set_anon()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c5c540034747dfe450f64d1151081a6080daa8f9
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:10 2023 +0200

    mm/rmap: move folio_test_anon() check out of __folio_set_anon()

    Let's handle it in the caller; no need for the "first" check based on the
    mapcount.

    We really only end up with !anon pages in page_add_anon_rmap() via
    do_swap_page(), where we hold the folio lock.  So races are not possible.
    Add a VM_WARN_ON_FOLIO() to make sure that we really hold the folio lock.

    In the future, we might want to let do_swap_page() use
    folio_add_new_anon_rmap() on new pages instead: however, we might have to
    pass then whether the folio is exclusive or not.  So keep it in there for
    now.

    For hugetlb we never expect to have a non-anon page in
    hugepage_add_anon_rmap().  Remove that code, along with some other checks
    that are either not required or were checked in
    hugepage_add_new_anon_rmap() already.

    Link: https://lkml.kernel.org/r/20230913125113.313322-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:19 -05:00
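
The caller-side shape in page_add_anon_rmap() after this change, roughly (sketch based on the upstream patch; the downstream tree may differ slightly):

    if (unlikely(!folio_test_anon(folio))) {
            /* Only do_swap_page() gets here, and it holds the folio lock. */
            VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
            __folio_set_anon(folio, vma, address,
                             !!(flags & RMAP_EXCLUSIVE));
    } else if (likely(!folio_test_ksm(folio))) {
            __page_check_anon_rmap(folio, page, vma, address);
    }
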
Rafael Aquini 52e9dfc111 mm/rmap: move SetPageAnonExclusive out of __page_set_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c66db8c0702c0ab741ecfd5e12b323ff49fe9089
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:09 2023 +0200

    mm/rmap: move SetPageAnonExclusive out of __page_set_anon_rmap()

    Let's handle it in the caller.  No need to pass the page.  While at it,
    rename the function to __folio_set_anon() and pass "bool exclusive"
    instead of "int exclusive".

    Link: https://lkml.kernel.org/r/20230913125113.313322-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:18 -05:00
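
After this patch (and the follow-ups above), the renamed helper looks roughly like this (sketch; callers now do SetPageAnonExclusive() themselves):

    static void __folio_set_anon(struct folio *folio, struct vm_area_struct *vma,
                                 unsigned long address, bool exclusive)
    {
            struct anon_vma *anon_vma = vma->anon_vma;

            BUG_ON(!anon_vma);

            /*
             * If the folio isn't exclusive to this vma, we must use the
             * root anon_vma, so all related vmas can find it via rmap.
             */
            if (!exclusive)
                    anon_vma = anon_vma->root;

            anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
            WRITE_ONCE(folio->mapping, (struct address_space *) anon_vma);
            folio->index = linear_page_index(vma, address);
    }
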
Rafael Aquini 749f15ce91 mm/rmap: drop stale comment in page_add_anon_rmap and hugepage_add_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit fd63908706f79c963946a77b7f352db5431deed5
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:08 2023 +0200

    mm/rmap: drop stale comment in page_add_anon_rmap and hugepage_add_anon_rmap()

    Patch series "Anon rmap cleanups".

    Some cleanups around rmap for anon pages.  I'm working on more cleanups
    also around file rmap -- also to handle the "compound" parameter
    internally only and to let hugetlb use page_add_file_rmap(), but these
    changes make sense separately.

    This patch (of 6):

    That comment was added in commit 5dbe0af47f ("mm: fix kernel BUG at
    mm/rmap.c:1017!") to document why we can see vma->vm_end getting adjusted
    concurrently due to a VMA split.

    However, the optimized locking code was changed again in bf181b9f9d ("mm
    anon rmap: replace same_anon_vma linked list with an interval tree.").

    ...  and later, the comment was changed in commit 0503ea8f5ba7 ("mm/mmap:
    remove __vma_adjust()") to talk about "vma_merge" although the original
    issue was with VMA splitting.

    Let's just remove that comment.  Nowadays, it's outdated, imprecise and
    confusing.

    Link: https://lkml.kernel.org/r/20230913125113.313322-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20230913125113.313322-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:17 -05:00
Rafael Aquini e66e65400a mm: hugetlb: add huge page size param to set_huge_pte_at()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * arch/parisc/include/asm/hugetlb.h: hunks dropped (unsupported arch)
  * arch/parisc/mm/hugetlbpage.c:  hunks dropped (unsupported arch)
  * arch/riscv/include/asm/hugetlb.h: hunks dropped (unsupported arch)
  * arch/riscv/mm/hugetlbpage.c: hunks dropped (unsupported arch)
  * arch/sparc/mm/hugetlbpage.c: hunks dropped (unsupported arch)
  * mm/rmap.c: minor context conflict on the 7th hunk due to backport of
      upstream commit 322842ea3c72 ("mm/rmap: fix missing swap_free() in
      try_to_unmap() after arch_unmap_one() failed")

This patch is a backport of the following upstream commit:
commit 935d4f0c6dc8b3533e6e39346de7389a84490178
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Fri Sep 22 12:58:03 2023 +0100

    mm: hugetlb: add huge page size param to set_huge_pte_at()

    Patch series "Fix set_huge_pte_at() panic on arm64", v2.

    This series fixes a bug in arm64's implementation of set_huge_pte_at(),
    which can result in an unprivileged user causing a kernel panic.  The
    problem was triggered when running the new uffd poison mm selftest for
    HUGETLB memory.  This test (and the uffd poison feature) was merged for
    v6.5-rc7.

    Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
    (correctly this time) to get it backported to v6.5, where the issue first
    showed up.

    Description of Bug
    ==================

    arm64's huge pte implementation supports multiple huge page sizes, some of
    which are implemented in the page table with multiple contiguous entries.
    So set_huge_pte_at() needs to work out how big the logical pte is, so that
    it can also work out how many physical ptes (or pmds) need to be written.
    It previously did this by grabbing the folio out of the pte and querying
    its size.

    However, there are cases when the pte being set is actually a swap entry.
    But this also used to work fine, because for huge ptes, we only ever saw
    migration entries and hwpoison entries.  And both of these types of swap
    entries have a PFN embedded, so the code would grab that and everything
    still worked out.

    But over time, more calls to set_huge_pte_at() have been added that set
    swap entry types that do not embed a PFN.  And this causes the code to go
    bang.  The triggering case is for the uffd poison test, commit
    99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
    causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
    8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
    added in v6.5-rc7.  Although review shows that there are other call sites
    that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
    on arm64 because arm64 doesn't support UFFD WP.

    If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
    it will dereference a bad pointer in page_folio():

        static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
        {
            VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));

            return page_folio(pfn_to_page(swp_offset_pfn(entry)));
        }

    Fix
    ===

    The simplest fix would have been to revert the dodgy cleanup commit
    18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
    things have moved on, this would have required an audit of all the new
    set_huge_pte_at() call sites to see if they should be converted to
    set_huge_swap_pte_at().  As per the original intent of the change, it
    would also leave us open to future bugs when people invariably get it
    wrong and call the wrong helper.

    So instead, I've added a huge page size parameter to set_huge_pte_at().
    This means that the arm64 code has the size in all cases.  It's a bigger
    change, due to needing to touch the arches that implement the function,
    but it is entirely mechanical, so in my view, low risk.

    I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
    s390, sparc (and additionally x86_64).  I've additionally booted and run
    mm selftests against arm64, where I observe the uffd poison test is fixed,
    and there are no other regressions.

    This patch (of 2):

    In order to fix a bug, arm64 needs to be told the size of the huge page
    for which the pte is being set in set_huge_pte_at().  Provide for this by
    adding an `unsigned long sz` parameter to the function.  This follows the
    same pattern as huge_pte_clear().

    This commit makes the required interface modifications to the core mm as
    well as all arches that implement this function (arm64, parisc, powerpc,
    riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
    commit.

    No behavioral changes intended.

    Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
    Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
    Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>     [powerpc 8xx]
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>       [vmalloc change]
    Cc: Alexandre Ghiti <alex@ghiti.fr>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <stable@vger.kernel.org>    [6.5+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:19 -04:00
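
The interface change itself is mechanical; a rough sketch of the before/after prototype:

    /* before */
    void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                         pte_t *ptep, pte_t pte);

    /* after: callers pass the huge page size explicitly, mirroring
     * huge_pte_clear(), so arm64 can work out how many contiguous
     * entries to write even when 'pte' is a swap entry with no PFN.
     */
    void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                         pte_t *ptep, pte_t pte, unsigned long sz);
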
Rafael Aquini 33f1751df5 mm/swap: stop using page->private on tail pages for THP_SWAP
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit cfeed8ffe55b37fa10286aaaa1369da00cb88440
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Aug 21 18:08:46 2023 +0200

    mm/swap: stop using page->private on tail pages for THP_SWAP

    Patch series "mm/swap: stop using page->private on tail pages for THP_SWAP
    + cleanups".

    This series stops using page->private on tail pages for THP_SWAP, replaces
    folio->private by folio->swap for swapcache folios, and starts using
    "new_folio" for tail pages that we are splitting to remove the usage of
    page->private for swapcache handling completely.

    This patch (of 4):

    Let's stop using page->private on tail pages, making it possible to just
    unconditionally reuse that field in the tail pages of large folios.

    The remaining usage of the private field for THP_SWAP is in the THP
    splitting code (mm/huge_memory.c), that we'll handle separately later.

    Update the THP_SWAP documentation and sanity checks in mm_types.h and
    __split_huge_page_tail().

    [david@redhat.com: stop using page->private on tail pages for THP_SWAP]
      Link: https://lkml.kernel.org/r/6f0a82a3-6948-20d9-580b-be1dbf415701@redhat.com
    Link: https://lkml.kernel.org/r/20230821160849.531668-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20230821160849.531668-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>     [arm64]
    Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Seth Jennings <sjenning@redhat.com>
    Cc: Vitaly Wool <vitaly.wool@konsulko.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:05 -04:00
Rafael Aquini e3f6c7213f rmap: add folio_add_file_rmap_range()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 86f35f69db8e7d169c36472a349507ab0a461f49
Author: Yin Fengwei <fengwei.yin@intel.com>
Date:   Wed Aug 2 16:14:03 2023 +0100

    rmap: add folio_add_file_rmap_range()

    folio_add_file_rmap_range() allows adding pte mappings to a specific range
    of a file folio.  Compared to page_add_file_rmap(), it batches the
    __lruvec_stat updates for large folios.

    Link: https://lkml.kernel.org/r/20230802151406.3735276-36-willy@infradead.org
    Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:32 -04:00
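
A rough sketch of the new interface and how page_add_file_rmap() can be expressed on top of it (based on the upstream patch; details may differ downstream):

    /* Map 'nr_pages' pages of 'folio', starting at 'page', into 'vma',
     * updating the mapped-page statistics once for the whole batch.
     */
    void folio_add_file_rmap_range(struct folio *folio, struct page *page,
                                   unsigned int nr_pages,
                                   struct vm_area_struct *vma, bool compound);

    void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
                            bool compound)
    {
            struct folio *folio = page_folio(page);
            unsigned int nr_pages;

            VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);

            if (likely(!compound))
                    nr_pages = 1;
            else
                    nr_pages = folio_nr_pages(folio);

            folio_add_file_rmap_range(folio, page, nr_pages, vma, compound);
    }
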
Rafael Aquini 9eec727e47 mm/rmap: correct stale comment of rmap_walk_anon and rmap_walk_file
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 89be82b4fed258b63a201d92fca95e7c55913c23
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Tue Jul 18 17:21:36 2023 +0800

    mm/rmap: correct stale comment of rmap_walk_anon and rmap_walk_file

    1. update page to folio in the comments
    2. add a comment for the newly added @locked parameter

    Link: https://lkml.kernel.org/r/20230718092136.1935789-1-shikemeng@huaweicloud.com
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:10 -04:00
Rafael Aquini ccdeed627d mm/tlbbatch: introduce arch_flush_tlb_batched_pending()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * arch/x86/include/asm/tlbflush.h: minor context difference due to the backport
      of upstream commit 1af5a8109904 ("mmu_notifiers: rename invalidate_range notifier")

This patch is a backport of the following upstream commit:
commit db6c1f6f236dbcd271d51d37675bbccfcea7c7be
Author: Yicong Yang <yangyicong@hisilicon.com>
Date:   Mon Jul 17 21:10:03 2023 +0800

    mm/tlbbatch: introduce arch_flush_tlb_batched_pending()

    Currently we'll flush the mm in flush_tlb_batched_pending() to avoid race
    between reclaim unmaps pages by batched TLB flush and mprotect/munmap/etc.
    Other architectures like arm64 may only need a synchronization
    barrier(dsb) here rather than a full mm flush.  So add
    arch_flush_tlb_batched_pending() to allow an arch-specific implementation
    here.  This intends no functional changes on x86, which still performs a
    full mm flush.

    Link: https://lkml.kernel.org/r/20230717131004.12662-4-yangyicong@huawei.com
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Barry Song <v-songbaohua@oppo.com>
    Cc: Darren Hart <darren@os.amperecomputing.com>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: lipeifeng <lipeifeng@oppo.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Steven Miao <realmz6@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Zeng Tao <prime.zeng@hisilicon.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:08 -04:00
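
A rough sketch of the hook's two implementations (x86 as changed here, arm64 as added later in the series):

    /* x86 (arch/x86/include/asm/tlbflush.h): unchanged behavior, a full
     * mm flush as the generic code did before.
     */
    static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
    {
            flush_tlb_mm(mm);
    }

    /* arm64 (arch/arm64/include/asm/tlbflush.h), from the follow-up patch:
     * the pending tlbi instructions were already issued, so only wait for
     * them to complete.
     */
    static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
    {
            dsb(ish);
    }
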
Rafael Aquini 0f790635b1 mm/tlbbatch: rename and extend some functions
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit f73419bb89d606de9be2043febf0957d56627a5b
Author: Barry Song <v-songbaohua@oppo.com>
Date:   Mon Jul 17 21:10:02 2023 +0800

    mm/tlbbatch: rename and extend some functions

    This patch does some preparation work to extend batched TLB flush to
    arm64. Including:
    - Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
      to accept an additional argument for address, architectures
      like arm64 may need this for tlbi.
    - Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
      to better match its current function: we don't need to handle
      the mm on architectures like arm64, so add_mm is no longer an
      accurate name. add_pending makes sense for both, since on x86
      we're pending the TLB flush operations while on arm64 we're
      pending the synchronization operations.

    This intends no functional changes on x86.

    Link: https://lkml.kernel.org/r/20230717131004.12662-3-yangyicong@huawei.com
    Tested-by: Yicong Yang <yangyicong@hisilicon.com>
    Tested-by: Xin Hao <xhao@linux.alibaba.com>
    Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Barry Song <v-songbaohua@oppo.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Darren Hart <darren@os.amperecomputing.com>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: lipeifeng <lipeifeng@oppo.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Steven Miao <realmz6@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zeng Tao <prime.zeng@hisilicon.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:07 -04:00
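
On x86 the renamed helper keeps its old behavior and simply ignores the new address argument; a rough sketch:

    static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                 struct mm_struct *mm,
                                                 unsigned long uaddr)
    {
            /* x86 flushes per-mm, so the address is unused here; arm64's
             * implementation uses it to issue a targeted tlbi instead.
             */
            inc_mm_tlb_gen(mm);
            cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
    }
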
Rafael Aquini 9c128d243d mm/tlbbatch: introduce arch_tlbbatch_should_defer()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 65c8d30e679bdffeeaa0b84b7094a3c719aa6585
Author: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Date:   Mon Jul 17 21:10:01 2023 +0800

    mm/tlbbatch: introduce arch_tlbbatch_should_defer()

    Patch series "arm64: support batched/deferred tlb shootdown during page
    reclamation/migration", v11.

    Though ARM64 has the hardware to do tlb shootdown, the hardware
    broadcasting is not free.  A simple micro benchmark shows that even on
    snapdragon 888 with only 8 cores, the overhead of ptep_clear_flush is
    huge, even for paging out one page mapped by only one process:

        5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush

    When pages are mapped by multiple processes or the HW has more CPUs, the
    cost becomes even higher due to the bad scalability of tlb shootdown.
    The same benchmark can result in 16.99% CPU consumption on ARM64 server
    with around 100 cores according to the test on patch 4/4.

    This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
    1. only send tlbi instructions in the first stage -
            arch_tlbbatch_add_mm()
    2. wait for the completion of tlbi by dsb while doing tlbbatch
            sync in arch_tlbbatch_flush()

    Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
    the patchset.  The micro benchmark becomes 5% faster even for one page
    mapped by single process on snapdragon 888.

    Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset
    does some renaming/extension for the current implementation first (Patch
    1-3), then add the support on arm64 (Patch 4).

    This patch (of 4):

    The entire scheme of deferred TLB flush in reclaim path rests on the fact
    that the cost to refill TLB entries is less than flushing out individual
    entries by sending IPI to remote CPUs.  But architecture can have
    different ways to evaluate that.  Hence apart from checking
    TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
    architecture specific.

    [yangyicong@hisilicon.com: rebase and fix incorrect return value type]
    Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com
    Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com
    Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Darren Hart <darren@os.amperecomputing.com>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: lipeifeng <lipeifeng@oppo.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Steven Miao <realmz6@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zeng Tao <prime.zeng@hisilicon.com>
    Cc: Barry Song <v-songbaohua@oppo.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Nadav Amit <namit@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:06 -04:00
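
On x86 the new hook is essentially the old should_defer_flush() logic moved behind an arch interface; a rough sketch:

    static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
    {
            bool should_defer = false;

            /* If remote CPUs need to be flushed then defer batching the flush */
            if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
                    should_defer = true;
            put_cpu();

            return should_defer;
    }
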
Rafael Aquini f1db8c1431 rmap: pass the folio to __page_check_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit dba438bd7663fefab870a6dd4b01ed0923c32d79
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Jul 6 20:52:51 2023 +0100

    rmap: pass the folio to __page_check_anon_rmap()

    The lone caller already has the folio, so pass it in instead of deriving
    it from the page again.

    Link: https://lkml.kernel.org/r/20230706195251.2707542-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:58 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
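
The default helper everything is converted to is tiny; a rough sketch of the generic definition and of the resulting conversion:

    /* include/linux/pgtable.h (generic fallback; arches may override) */
    #ifndef ptep_get
    static inline pte_t ptep_get(pte_t *ptep)
    {
            return READ_ONCE(*ptep);
    }
    #endif

    /* The conversion is then, in essence: */
    pte_t old_style = *ptep;           /* before: plain C dereference        */
    pte_t new_style = ptep_get(ptep);  /* after: READ_ONCE() by default,     */
                                       /* overridable by the arch (arm64)    */
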
Rafael Aquini e24b3ade32 mm/gup: remove vmas parameter from get_user_pages_remote()
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  - virt/kvm/async_pf.c: minor context diff due to out-of-order backport of
    upstream commit 08284765f03b7 ("KVM: Get reference to VM's address space
    in the async #PF worker")

This patch is a backport of the following upstream commit:
commit ca5e863233e8f6acd1792fd85d6bc2729a1b2c10
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Wed May 17 20:25:39 2023 +0100

    mm/gup: remove vmas parameter from get_user_pages_remote()

    The only instances of get_user_pages_remote() invocations which used the
    vmas parameter were for a single page which can instead simply look up the
    VMA directly. In particular:-

    - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
      remove it.

    - __access_remote_vm() was already using vma_lookup() when the original
      lookup failed so by doing the lookup directly this also de-duplicates the
      code.

    We are able to perform these VMA operations as we already hold the
    mmap_lock in order to be able to call get_user_pages_remote().

    As part of this work we add get_user_page_vma_remote() which abstracts the
    VMA lookup, error handling and decrementing the page reference count should
    the VMA lookup fail.

    This forms part of a broader set of patches intended to eliminate the vmas
    parameter altogether.

    [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
    Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jarkko Sakkinen <jarkko@kernel.org>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:35 -04:00
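
A rough sketch of the new helper added here (error handling as described in the upstream commit; the exact form may differ):

    static inline struct page *get_user_page_vma_remote(struct mm_struct *mm,
                                                        unsigned long addr,
                                                        int gup_flags,
                                                        struct vm_area_struct **vmap)
    {
            struct page *page;
            struct vm_area_struct *vma;
            int got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);

            if (got < 0)
                    return ERR_PTR(got);
            if (got == 0)
                    return NULL;

            /* mmap_lock is held by the caller, so the lookup is stable. */
            vma = vma_lookup(mm, addr);
            if (WARN_ON_ONCE(!vma)) {
                    put_page(page);
                    return ERR_PTR(-EINVAL);
            }

            *vmap = vma;
            return page;
    }
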
Chris von Recklinghausen b205ed0c44 mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4d4b6d66db63ceed399f1fb1a4b24081d2590eb1
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Apr 24 14:54:08 2023 +0800

    mm,unmap: avoid flushing TLB in batch if PTE is inaccessible

    0Day/LKP reported a performance regression for commit 7e12beb8ca2a
    ("migrate_pages: batch flushing TLB").  In the commit, the TLB flushing
    during page migration is batched.  So, in try_to_migrate_one(),
    ptep_clear_flush() is replaced with set_tlb_ubc_flush_pending().  In
    further investigation, it is found that the TLB flushing can be avoided in
    ptep_clear_flush() if the PTE is inaccessible.  In fact, we can optimize
    in a similar way for the batched TLB flushing too to improve the
    performance.

    So in this patch, we check pte_accessible() before
    set_tlb_ubc_flush_pending() in try_to_unmap/migrate_one().  Tests show
    that the benchmark score of the anon-cow-rand-mt test case of the
    vm-scalability test suite can improve up to 2.1% with the patch on an Intel
    server machine.  The TLB flushing IPIs can be reduced by up to 44.3%.

    Link: https://lore.kernel.org/oe-lkp/202303192325.ecbaf968-yujie.liu@intel.com
    Link: https://lore.kernel.org/oe-lkp/ab92aaddf1b52ede15e2c608696c36765a2602c1.camel@intel.com/
    Link: https://lkml.kernel.org/r/20230424065408.188498-1-ying.huang@intel.com
    Fixes: 7e12beb8ca2a ("migrate_pages: batch flushing TLB")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reported-by: kernel test robot <yujie.liu@intel.com>
    Reviewed-by: Nadav Amit <namit@vmware.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:07 -04:00
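
The core of the change in try_to_unmap_one()/try_to_migrate_one(), roughly (sketch; set_tlb_ubc_flush_pending() is shown with its signature from this point in time, before the later address argument was added):

    pteval = ptep_get_and_clear(mm, address, pvmw.pte);

    /* Only queue a deferred TLB flush if the old PTE could actually be
     * cached by the TLB; an inaccessible PTE needs no flush at all.
     */
    if (pte_accessible(mm, pteval))
            set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
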
Chris von Recklinghausen 9a41b45f56 mm/khugepaged: write-lock VMA while collapsing a huge page
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 55fd6fccad3172c0feaaa817f0a1283629ff183e
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:14 2023 -0800

    mm/khugepaged: write-lock VMA while collapsing a huge page

    Protect VMA from concurrent page fault handler while collapsing a huge
    page.  Page fault handler needs a stable PMD to use PTL and relies on
    per-VMA lock to prevent concurrent PMD changes.  pmdp_collapse_flush(),
    set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
    not be detected by a page fault handler without proper locking.

    Before this patch, page tables can be walked under any one of the
    mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
    unlinks and frees page tables, it must ensure that all of those either are
    locked or don't exist.  This patch adds a fourth lock under which page
    tables can be traversed, and so khugepaged must also lock out that one.

    [surenb@google.com: vm_lock/i_mmap_rwsem inversion in retract_page_tables]
      Link: https://lkml.kernel.org/r/20230303213250.3555716-1-surenb@google.com
    [surenb@google.com: build fix]
      Link: https://lkml.kernel.org/r/CAJuCfpFjWhtzRE1X=J+_JjgJzNKhq-=JT8yTBSTHthwp0pqWZw@mail.gmail.com
    Link: https://lkml.kernel.org/r/20230227173632.3292573-16-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:40 -04:00
Chris von Recklinghausen de16cbf628 mm/rmap: use atomic_try_cmpxchg in set_tlb_ubc_flush_pending
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit bdeb91881088810ab1d8ae620862c3b4d78f4041
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Mon Feb 27 22:42:28 2023 +0100

    mm/rmap: use atomic_try_cmpxchg in set_tlb_ubc_flush_pending

    Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old
    in set_tlb_ubc_flush_pending.  The x86 CMPXCHG instruction returns success
    in the ZF flag, so this change saves a compare after cmpxchg (and the
    related move instruction in front of cmpxchg).

    Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
    fails.

    No functional change intended.

    Link: https://lkml.kernel.org/r/20230227214228.3533299-1-ubizjak@gmail.com
    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:06 -04:00
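
The generic pattern of the conversion, as a standalone sketch (do_something() is just a placeholder, not the actual rmap code):

    atomic_t counter = ATOMIC_INIT(0);
    int old = atomic_read(&counter), new = 1;

    /* before: compare the returned value against 'old' ourselves */
    if (atomic_cmpxchg(&counter, old, new) == old)
            do_something();

    /* after: success is reported directly (via ZF on x86); on failure,
     * 'old' is refreshed with the current value, ready for a retry loop.
     */
    if (atomic_try_cmpxchg(&counter, &old, new))
            do_something();
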
Aristeu Rozanski b495e79b31 mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 6da6b1d4a7df8c35770186b53ef65d388398e139
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Tue Feb 21 17:59:05 2023 +0900

    mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON

    After a memory error happens on a clean folio, a process unexpectedly
    receives SIGBUS when it accesses the error page.  This SIGBUS killing is
    pointless and simply degrades the level of RAS of the system, because the
    clean folio can be dropped without any data lost on memory error handling
    as we do for a clean pagecache.

    When memory_failure() is called on a clean folio, try_to_unmap() is called
    twice (one from split_huge_page() and one from hwpoison_user_mappings()).
    The root cause of the issue is that pte conversion to hwpoisoned entry is
    now done in the first call of try_to_unmap() because PageHWPoison is
    already set at this point, while it's actually expected to be done in the
    second call.  This behavior disturbs the error handling operation like
    removing pagecache, which results in the malfunction described above.

    So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only
    when we really intend to convert pte to hwpoison entry.  This can prevent
    other callers of try_to_unmap() from accidentally converting to hwpoison
    entries.

    Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev
    Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
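
The effect of the rename, roughly (sketch; folio_can_simply_be_dropped stands in for the real dirty/pagecache checks in memory_failure()):

    /* hwpoison_user_mappings(): request PTE-to-hwpoison conversion by
     * default, but drop the request when the clean folio can just be
     * discarded instead.
     */
    enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;

    if (folio_can_simply_be_dropped)
            ttu &= ~TTU_HWPOISON;

    /* try_to_unmap_one() installs a hwpoison entry only when asked to,
     * so other try_to_unmap() callers can no longer convert PTEs by
     * accident.
     */
    if (PageHWPoison(subpage) && (flags & TTU_HWPOISON))
            pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
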
Aristeu Rozanski 5784c8749c migrate_pages: batch flushing TLB
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7e12beb8ca2ac98b2ec42e0ea4b76cdc93b58654
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:43 2023 +0800

    migrate_pages: batch flushing TLB

    The TLB flushing will cost quite some CPU cycles during the folio
    migration in some situations.  For example, when migrate a folio of a
    process with multiple active threads that run on multiple CPUs.  After
    batching the _unmap and _move in migrate_pages(), the TLB flushing can be
    batched easily with the existing TLB flush batching mechanism.  This patch
    implements that.

    We use the following test case to test the patch.

    On a 2-socket Intel server,

    - Run pmbench memory accessing benchmark

    - Run `migratepages` to migrate pages of pmbench between node 0 and
      node 1 back and forth.

    With the patch, the TLB flushing IPIs are reduced by 99.1% during the test
    and the number of pages migrated successfully per second increases by 291.7%.

    Haoxin helped to test the patchset on an ARM64 server with 128 cores, 2
    NUMA nodes.  Test results show that the page migration performance
    increases up to 78%.

    NOTE: TLB flushing is batched only for normal folios, not for THP folios.
    Because the overhead of TLB flushing for THP folios is much lower than
    that for normal folios (about 1/512 on x86 platform).

    Link: https://lkml.kernel.org/r/20230213123444.155149-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 456efc9e7d mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: reverting 830fb0c1df, a backport of da9a298f5fa that had been applied twice by mistake

commit d0ce0e47b323a8d7fb5dc3314ce56afa650ade2d
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Jan 25 09:05:33 2023 -0800

    mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()

    Change alloc_huge_page() to alloc_hugetlb_folio() by changing all callers
    to handle the now folio return type of the function.  In this conversion,
    alloc_huge_page_vma() is also changed to alloc_hugetlb_folio_vma() and
    hugepage_add_new_anon_rmap() is changed to take in a folio directly.  Many
    additions of '&folio->page' are cleaned up in subsequent patches.

    hugetlbfs_fallocate() is also refactored to use the RCU +
    page_cache_next_miss() API.

    Link: https://lkml.kernel.org/r/20230125170537.96973-5-sidhartha.kumar@oracle.com
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:21 -04:00
Aristeu Rozanski 8737e69a92 mm/mmap: remove __vma_adjust()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 0503ea8f5ba73eb3ab13a81c1eefbaf51405385a
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:49 2023 -0500

    mm/mmap: remove __vma_adjust()

    Inline the work of __vma_adjust() into vma_merge().  This reduces code
    size and has the added benefits of the comments for the cases being
    located with the code.

    Change the comments referencing vma_adjust() accordingly.

    [Liam.Howlett@oracle.com: fix vma_merge() offset when expanding the next vma]
      Link: https://lkml.kernel.org/r/20230130195713.2881766-1-Liam.Howlett@oracle.com
    Link: https://lkml.kernel.org/r/20230120162650.984577-49-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Aristeu Rozanski 1adad9caeb rmap: add folio parameter to __page_set_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 5b4bd90f9ac76136c7148684b12276d4ae2d64a2
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:29:59 2023 +0000

    rmap: add folio parameter to __page_set_anon_rmap()

    Avoid the compound_head() call in PageAnon() by passing in the folio that
    all callers have.  Also save me from wondering whether page->mapping can
    ever be overwritten on a tail page (I don't think it can, but I'm not 100%
    sure).

    Link: https://lkml.kernel.org/r/20230116192959.2147032-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 175b35ee46 mm: remove munlock_vma_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 672aa27d0bd241759376e62b78abb8aae1792479
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:26 2023 +0000

    mm: remove munlock_vma_page()

    All callers now have a folio and can call munlock_vma_folio().  Update the
    documentation to refer to munlock_vma_folio().

    Link: https://lkml.kernel.org/r/20230116192827.2146732-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 96cb17f8b1 mm: remove mlock_vma_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7efecffb8e7968c4a6c53177b0053ca4765fe233
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:25 2023 +0000

    mm: remove mlock_vma_page()

    All callers now have a folio and can call mlock_vma_folio().  Update the
    documentation to refer to mlock_vma_folio().

    Link: https://lkml.kernel.org/r/20230116192827.2146732-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski e7030d52b7 mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped arches we don't support

commit 950fe885a89770619e315f9b46301eebf0aab7b3
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Jan 13 18:10:26 2023 +0100

    mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE

    __HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that
    support swp PTEs, so let's drop it.

    Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
Aristeu Rozanski 067fb10657 mm: mlock: update the interface to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 96f97c438f61ddba94117dcd1a1eb0aaafa22309
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Thu Jan 12 12:39:31 2023 +0000

    mm: mlock: update the interface to use folios

    Update the mlock interface to accept folios rather than pages, bringing
    the interface in line with the internal implementation.

    munlock_vma_page() still requires a page_folio() conversion, however this
    is consistent with the existent mlock_vma_page() implementation and a
    product of rmap still dealing in pages rather than folios.

    Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
Aristeu Rozanski d51d49bf93 mm: convert deferred_split_huge_page() to deferred_split_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f158ed6195ef949060811fd85086928470651944
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:29:13 2023 +0000

    mm: convert deferred_split_huge_page() to deferred_split_folio()

    Now that both callers use a folio, pass the folio in and save a call to
    compound_head().

    Link: https://lkml.kernel.org/r/20230111142915.1001531-28-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:06 -04:00
Aristeu Rozanski 67f5c501c7 mm: use a folio in hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit db4e5dbdcdd55482ab23bf4a0ae6746f93efb0d9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:56 2023 +0000

    mm: use a folio in hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap()

    Remove uses of compound_mapcount_ptr()

    Link: https://lkml.kernel.org/r/20230111142915.1001531-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Aristeu Rozanski c753f1a741 mm: convert page_add_file_rmap() to use a folio internally
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit eb01a2ad7e9cba1b9dd131edc5a26ffbda90a5ed
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:53 2023 +0000

    mm: convert page_add_file_rmap() to use a folio internally

    The API for page_add_file_rmap() needs to be page-based, because we can
    add mappings of individual pages.  But inside the function, we want to
    only call compound_head() once and then use the folio APIs instead of the
    page APIs that each call compound_head().

    Link: https://lkml.kernel.org/r/20230111142915.1001531-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
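
The conversion pattern shared by this and the neighboring page_add_anon_rmap()/page_remove_rmap() patches, roughly (sketch, with the body reduced to the idea):

    void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
                            bool compound)
    {
            /* Resolve the head page exactly once ... */
            struct folio *folio = page_folio(page);

            /* ... then use folio_* helpers, which take the folio directly,
             * instead of page_* helpers that would each call
             * compound_head() again behind the scenes.  The mapcount and
             * mapped-page statistics updates all operate on 'folio'.
             */
    }
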
Aristeu Rozanski 10a8961161 mm: convert page_add_anon_rmap() to use a folio internally
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit ee0800c2f6a9e605947ce499d79fb7e2be16d6dd
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:52 2023 +0000

    mm: convert page_add_anon_rmap() to use a folio internally

    The API for page_add_anon_rmap() needs to be page-based, because we can
    add mappings of individual pages.  But inside the function, we want to
    only call compound_head() once and then use the folio APIs instead of the
    page APIs that each call compound_head().

    Link: https://lkml.kernel.org/r/20230111142915.1001531-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Aristeu Rozanski 1d71326a2b mm: convert page_remove_rmap() to use a folio internally
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 62beb906ef644b0f0555b2b9f9626c27e2038d84
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:51 2023 +0000

    mm: convert page_remove_rmap() to use a folio internally

    The API for page_remove_rmap() needs to be page-based, because we can
    remove mappings of pages individually.  But inside the function, we want
    to only call compound_head() once and then use the folio APIs instead of
    the page APIs that each call compound_head().

    Link: https://lkml.kernel.org/r/20230111142915.1001531-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Aristeu Rozanski 5455c3da6d mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7d4a8be0c4b2b7ffb367929d2b352651f083806b
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jan 10 13:57:22 2023 +1100

    mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

    mmu_notifier_range_update_to_read_only() was originally introduced in
    commit c6d23413f8 ("mm/mmu_notifier:
    mmu_notifier_range_update_to_read_only() helper") as an optimisation for
    device drivers that know a range has only been mapped read-only.  However,
    there are no users of this feature, so remove it.  As it is the only user
    of the struct mmu_notifier_range.vma field, remove that also.

    Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Aristeu Rozanski d908e3177a mm: add vma_has_recency()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 8788f6781486769d9598dcaedc3fe0eb12fc3e59
Author: Yu Zhao <yuzhao@google.com>
Date:   Fri Dec 30 14:52:51 2022 -0700

    mm: add vma_has_recency()

    Add vma_has_recency() to indicate whether a VMA may exhibit temporal
    locality that the LRU algorithm relies on.

    This function returns false for VMAs marked by VM_SEQ_READ or
    VM_RAND_READ.  While the former flag indicates linear access, i.e., a
    special case of spatial locality, both flags indicate a lack of temporal
    locality, i.e., the reuse of an area within a relatively small duration.

    "Recency" is chosen over "locality" to avoid confusion between temporal
    and spatial localities.

    Before this patch, the active/inactive LRU only ignored the accessed bit
    from VMAs marked by VM_SEQ_READ.  After this patch, the active/inactive
    LRU and MGLRU share the same logic: they both ignore the accessed bit if
    vma_has_recency() returns false.
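
    A minimal standalone model of that check (stand-in types; the VM_SEQ_READ
    and VM_RAND_READ values below are only needed to make the demo compile):

      /* simplified userspace model, not kernel code */
      #include <stdbool.h>
      #include <stdio.h>

      #define VM_SEQ_READ  0x00008000UL   /* madvise(MADV_SEQUENTIAL) hint */
      #define VM_RAND_READ 0x00010000UL   /* madvise(MADV_RANDOM) hint */

      struct vma_model { unsigned long vm_flags; };

      /* false when the VMA is hinted as sequential or random access */
      static bool vma_has_recency_model(const struct vma_model *vma)
      {
          return !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
      }

      int main(void)
      {
          struct vma_model plain  = { .vm_flags = 0 };
          struct vma_model random = { .vm_flags = VM_RAND_READ };

          printf("plain VMA: %d, MADV_RANDOM VMA: %d\n",
                 vma_has_recency_model(&plain), vma_has_recency_model(&random));
          return 0;
      }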

    For the active/inactive LRU, the following fio test showed a [6, 8]%
    increase in IOPS when randomly accessing mapped files under memory
    pressure.

      kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
      kb=$((kb - 8*1024*1024))

      modprobe brd rd_nr=1 rd_size=$kb
      dd if=/dev/zero of=/dev/ram0 bs=1M

      mkfs.ext4 /dev/ram0
      mount /dev/ram0 /mnt/
      swapoff -a

      fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \
          --size=8G --rw=randrw --time_based --runtime=10m \
          --group_reporting

    The discussion that led to this patch is here [1].  Additional test
    results are available in that thread.

    [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/

    Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andrea Righi <andrea.righi@canonical.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Aristeu Rozanski d425293677 mm: rmap: remove lock_page_memcg()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit c7c3dec1c9db9746912af2bbb5d6a2dd9f152d20
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue Dec 6 18:13:40 2022 +0100

    mm: rmap: remove lock_page_memcg()

    The previous patch made sure charge moving only touches pages for which
    page_mapped() is stable.  lock_page_memcg() is no longer needed.

    Link: https://lkml.kernel.org/r/20221206171340.139790-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:00 -04:00
Audra Mitchell cb3a49da8a mm/rmap: fix comment in anon_vma_clone()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit d8e454eb44473b2270e2675fb44a9d79dee36097
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Fri Oct 14 09:39:31 2022 +0800

    mm/rmap: fix comment in anon_vma_clone()

    Commit 2555283eb40d ("mm/rmap: Fix anon_vma->degree ambiguity leading to
    double-reuse") uses num_children and num_active_vmas to replace the original
    degree field in order to fix an anon_vma UAF problem.  Update the comment in
    anon_vma_clone() to match this change.

    Link: https://lkml.kernel.org/r/20221014013931.1565969-1-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Jerry Snitselaar efb6748971 mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
JIRA: https://issues.redhat.com/browse/RHEL-26541
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff due to some commits not being backported yet such as c33c794828f2 ("mm: ptep_get() conversion"),
           and 959a78b6dd45 ("mm/hugetlb: use a folio in hugetlb_wp()").

commit ec8832d007cb7b50229ad5745eec35b847cc9120
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jul 25 23:42:06 2023 +1000

    mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

    Secondary TLBs are now invalidated from the architecture-specific TLB
    invalidation functions.  Therefore, there is no need to explicitly notify
    or invalidate as part of the range end functions.  This means we can
    remove mmu_notifier_invalidate_range_end_only() and some of the
    ptep_*_notify() functions.

    Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrew Donnellan <ajd@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
    Cc: Frederic Barrat <fbarrat@linux.ibm.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kevin Tian <kevin.tian@intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Nicolin Chen <nicolinc@nvidia.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zhi Wang <zhi.wang.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit ec8832d007cb7b50229ad5745eec35b847cc9120)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-02-26 15:49:51 -07:00
Chris von Recklinghausen 422fe11cd3 mm: add folio_add_new_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4d510f3da4c216d4c2695395f67aec38e2aa6cc7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:54 2023 +0000

    mm: add folio_add_new_anon_rmap()

    In contrast to other rmap functions, page_add_new_anon_rmap() is always
    called with a freshly allocated page.  That means it can't be called with
    a tail page.  Turn page_add_new_anon_rmap() into folio_add_new_anon_rmap()
    and add a page_add_new_anon_rmap() wrapper.  Callers can be converted
    individually.
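
    Unlike a purely internal conversion, here the primitive itself becomes
    folio-based and a thin page wrapper is kept for unconverted callers; a
    standalone sketch of that split (simplified stand-in types, not the
    kernel code):

      /* simplified userspace model, not kernel code */
      #include <stdio.h>

      struct folio { int mapcount; };
      struct page  { struct folio *head; };

      static struct folio *page_folio_model(struct page *page) { return page->head; }

      /* the primitive is now folio-based */
      static void folio_add_new_anon_rmap_model(struct folio *folio)
      {
          folio->mapcount = 1;   /* freshly allocated, so this is the first mapping */
      }

      /* thin page-based wrapper kept while callers are converted one at a time */
      static void page_add_new_anon_rmap_model(struct page *page)
      {
          folio_add_new_anon_rmap_model(page_folio_model(page));
      }

      int main(void)
      {
          struct folio f = { 0 };
          struct page p = { &f };

          page_add_new_anon_rmap_model(&p);   /* unconverted caller */
          printf("mapcount=%d\n", f.mapcount);
          return 0;
      }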

    [akpm@linux-foundation.org: fix NOMMU build.  page_add_new_anon_rmap() requires CONFIG_MMU]
    [willy@infradead.org: folio-compat.c needs rmap.h]
    Link: https://lkml.kernel.org/r/20230111142915.1001531-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:53 -04:00
Chris von Recklinghausen 0991577c1f mm: convert total_compound_mapcount() to folio_total_mapcount()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b14224fbea62e5bffd680613376fe1268f4103ba
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:50 2023 +0000

    mm: convert total_compound_mapcount() to folio_total_mapcount()

    Instead of enforcing that the argument must be a head page by naming,
    enforce it with the compiler by making it a folio.  Also rename the
    counter in struct folio from _compound_mapcount to _entire_mapcount.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen d384489054 mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit eec20426d48bd7b63c69969a793943ed1a99b731
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:48 2023 +0000

    mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()

    Calling this 'mapcount' is confusing since mapcount is usually the number
    of times something is mapped; instead this is the number of mapped pages.
    It's also better to enforce that this is a folio rather than a head page.

    Move folio_nr_pages_mapped() into mm/internal.h since this is not
    something we want device drivers or filesystems poking at.  Get rid of
    folio_subpages_mapcount_ptr() and use folio->_nr_pages_mapped directly.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen 1ac84a6f20 mm,thp,rmap: fix races between updates of subpages_mapcount
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6287b7dae80944bfa37784a8f9d6861a4facaa6e
Author: Hugh Dickins <hughd@google.com>
Date:   Sun Dec 4 17:57:07 2022 -0800

    mm,thp,rmap: fix races between updates of subpages_mapcount

    Commit 4b51634cd16a, introducing the COMPOUND_MAPPED bit, paid attention
    to the impossibility of subpages_mapcount ever appearing negative; but did
    not attend to those races in which it can momentarily appear larger than
    thought possible.

    These arise from how page_remove_rmap() first decrements page->_mapcount
    or compound_mapcount, then, if that transition goes negative (logical 0),
    decrements subpages_mapcount.  The initial decrement lets a racing
    page_add_*_rmap() reincrement _mapcount or compound_mapcount immediately,
    and then in rare cases its corresponding increment of subpages_mapcount
    may be completed before page_remove_rmap()'s decrement.  There could even
    (with increasing unlikelihood) be a series of increments intermixed with
    the decrements.

    In practice, checking subpages_mapcount with a temporary WARN on range
    has caught values of 0x1000000 (2*COMPOUND_MAPPED, when move_pages() was
    using remove_migration_pmd()) and 0x800201 (do_huge_pmd_wp_page() using
    __split_huge_pmd()): page_add_anon_rmap() racing page_remove_rmap(), as
    predicted.

    I certainly found it harder to reason about than when bit_spin_locked, but
    the easy case gives a clue to how to handle the harder case.  The easy
    case being the three !(nr & COMPOUND_MAPPED) checks, which should
    obviously be replaced by (nr < COMPOUND_MAPPED) checks - to count a page
    as compound mapped, even while the bit in that position is 0.
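
    A tiny arithmetic demonstration of why the range check is the robust form
    (the racy value is the one reported above; the COMPOUND_MAPPED value
    mirrors the upstream constant but is only needed for the demo):

      /* simplified userspace demo, not kernel code */
      #include <stdio.h>

      #define COMPOUND_MAPPED 0x800000u

      int main(void)
      {
          unsigned int nr = 0x1000000;   /* 2*COMPOUND_MAPPED, seen in the race */

          /* old bit test: the COMPOUND_MAPPED bit is 0 here, so the page would
           * wrongly look "not compound mapped" */
          printf("bit test says compound mapped:   %d\n", !!(nr & COMPOUND_MAPPED));

          /* new range test: any value >= COMPOUND_MAPPED still counts */
          printf("range test says compound mapped: %d\n", !(nr < COMPOUND_MAPPED));
          return 0;
      }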

    The harder case is when trying to decide how many subpages are newly
    covered or uncovered, when compound map is first added or last removed:
    not knowing all that racily happened between first and second atomic ops.

    But the easy way to handle that, is again to count the page as compound
    mapped all the while that its subpages_mapcount indicates so - ignoring
    the _mapcount or compound_mapcount transition while it is on the way to
    being reversed.

    Link: https://lkml.kernel.org/r/4388158-3092-a960-ff2d-55f2b0fe4ef8@google.com
    Fixes: 4b51634cd16a ("mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:35 -04:00
Chris von Recklinghausen 6d31a562d5 mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4b51634cd16a01b2be0f6b69cc0dae63de4751f2
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Nov 22 01:49:36 2022 -0800

    mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped

    Can the lock_compound_mapcounts() bit_spin_lock apparatus be removed now?
    Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
    but if we slightly abuse subpages_mapcount by additionally demanding that
    one bit be set there when the compound page is PMD-mapped, then a cascade
    of two atomic ops is able to maintain the stats without bit_spin_lock.
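
    A standalone model of that encoding (names and the marker value are
    stand-ins mirroring the changelog, not the kernel code): one atomic word
    carries the PTE-mapped subpage count in its low bits plus a high marker
    bit for "PMD-mapped", so both are maintained with plain atomic adds:

      /* simplified userspace model, not kernel code */
      #include <stdatomic.h>
      #include <stdio.h>

      #define COMPOUND_MAPPED 0x800000u               /* "PMD-mapped" marker bit */
      #define MAPPED_MASK     (COMPOUND_MAPPED - 1u)  /* low bits: subpage count */

      int main(void)
      {
          atomic_uint subpages_mapcount = 0;

          /* PTE-map two subpages: ordinary atomic increments */
          atomic_fetch_add(&subpages_mapcount, 1);
          atomic_fetch_add(&subpages_mapcount, 1);

          /* PMD-map the whole compound page: one atomic add sets the marker */
          atomic_fetch_add(&subpages_mapcount, COMPOUND_MAPPED);

          unsigned int nr = atomic_load(&subpages_mapcount);
          printf("pte-mapped subpages: %u, pmd-mapped: %d\n",
                 nr & MAPPED_MASK, nr >= COMPOUND_MAPPED);
          return 0;
      }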

    This is harder to reason about than when bit_spin_locked, but I believe
    safe; and no drift in stats detected when testing.  When there are racing
    removes and adds, of course the sequence of operations is less well-
    defined; but each operation on subpages_mapcount is atomically good.  What
    might be disastrous, is if subpages_mapcount could ever fleetingly appear
    negative: but the pte lock (or pmd lock) these rmap functions are called
    under, ensures that a last remove cannot race ahead of a first add.

    Continue to make an exception for hugetlb (PageHuge) pages, though that
    exception can be easily removed by a further commit if necessary: leave
    subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
    carry on checking compound_mapcount too in folio_mapped(), page_mapped().

    Evidence is that this way goes slightly faster than the previous
    implementation in all cases (pmds after ptes now taking around 103ms); and
    relieves us of worrying about contention on the bit_spin_lock.

    Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Dan Carpenter <error27@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:27 -04:00
Chris von Recklinghausen 5b97e0a4f4 mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit be5ef2d9b006bbd93b1a03e1da2dbd19fb0b9f14
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Nov 22 01:42:04 2022 -0800

    mm,thp,rmap: subpages_mapcount of PTE-mapped subpages

    Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.

    This patch (of 3):

    Following suggestion from Linus, instead of counting every PTE map of a
    compound page in subpages_mapcount, just count how many of its subpages
    are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and
    NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and
    requires updating the count less often.
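
    A small standalone illustration of the difference (toy arrays, not the
    kernel code): only the 0 -> 1 transition of a subpage's own mapcount bumps
    the aggregate, so repeated PTE maps of the same subpage do not:

      /* simplified userspace demo, not kernel code */
      #include <stdio.h>

      int main(void)
      {
          int subpage_mapcount[4] = { 0 };
          int subpages_mapped = 0;    /* subpages with at least one PTE map */

          int maps[] = { 0, 0, 1 };   /* map subpage 0 twice, subpage 1 once */
          for (unsigned int i = 0; i < sizeof(maps) / sizeof(maps[0]); i++) {
              if (subpage_mapcount[maps[i]]++ == 0)
                  subpages_mapped++;  /* count only the first map of a subpage */
          }

          printf("PTE map events: 3, PTE-mapped subpages: %d\n", subpages_mapped);
          return 0;
      }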

    This does then revert total_mapcount() and folio_mapcount() to needing a
    scan of subpages; but they are inherently racy, and need no locking, so
    Linus is right that the scans are much better done there.  Plus (unlike in
    6.1 and previous) subpages_mapcount lets us avoid the scan in the common
    case of no PTE maps.  And page_mapped() and folio_mapped() remain scanless
    and just as efficient with the new meaning of subpages_mapcount: those are
    the functions which I most wanted to remove the scan from.

    The updated page_dup_compound_rmap() is no longer suitable for use by anon
    THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for
    that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.

    Evidence is that this way goes slightly faster than the previous
    implementation for most cases; but significantly faster in the (now
    scanless) pmds after ptes case, which started out at 870ms and was brought
    down to 495ms by the previous series, now takes around 105ms.

    Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com
    Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Dan Carpenter <error27@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:27 -04:00
Chris von Recklinghausen 0b7a394eff mm,thp,rmap: handle the normal !PageCompound case first
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d8dd5e979d09c7463618853fb4aedd88e3efc8ae
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Nov 9 18:18:49 2022 -0800

    mm,thp,rmap: handle the normal !PageCompound case first

    Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts")
    propagated the "if (compound) {lock} else if (PageCompound) {lock} else
    {atomic}" pattern throughout; but Linus hated the way that gives primacy
    to the uncommon case: switch to "if (!PageCompound) {atomic} else if
    (compound) {lock} else {lock}" throughout.  Linus has a bigger idea for
    how to improve it all, but here just make that rearrangement.
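
    A sketch of the rearranged control flow (the printed strings stand in for
    the real rmap updates; "compound" means the caller maps the whole THP):

      /* simplified userspace sketch, not the kernel function bodies */
      #include <stdbool.h>
      #include <stdio.h>

      static void add_rmap_model(bool page_compound, bool compound)
      {
          if (!page_compound)
              puts("common case first: small page, plain atomic update");
          else if (compound)
              puts("PMD-mapping a THP: take the compound-mapcount lock");
          else
              puts("PTE-mapping a THP subpage: take the compound-mapcount lock");
      }

      int main(void)
      {
          add_rmap_model(false, false);
          add_rmap_model(true, true);
          add_rmap_model(true, false);
          return 0;
      }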

    Link: https://lkml.kernel.org/r/fca2f694-2098-b0ef-d4e-f1d8b94d318c@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:26 -04:00
Chris von Recklinghausen 7acc64ba7d mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9bd3155ed83b723be719e522760f107229e2a61b
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Nov 2 18:53:45 2022 -0700

    mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts

    Fix the races in maintaining compound_mapcount, subpages_mapcount and
    subpage _mapcount by using PG_locked in the first tail of any compound
    page for a bit_spin_lock() on such modifications; skipping the usual
    atomic operations on those fields in this case.

    Bring page_remove_file_rmap() and page_remove_anon_compound_rmap() back
    into page_remove_rmap() itself.  Rearrange page_add_anon_rmap() and
    page_add_file_rmap() and page_remove_rmap() to follow the same "if
    (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern
    (with a PageTransHuge in the compound test, like before, to avoid BUG_ONs
    and optimize away that block when THP is not configured).  Move all the
    stats updates outside, after the bit_spin_locked section, so that it is
    sure to be a leaf lock.
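
    A standalone model of the bit-spin-lock idea (simplified: a plain word
    stands in for the first tail page's flags, and there is no preemption or
    IRQ handling as in the real bit_spin_lock()):

      /* simplified userspace model, not kernel code */
      #include <stdatomic.h>
      #include <stdio.h>

      #define PG_LOCKED_BIT 0x1u   /* stand-in for PG_locked in the first tail */

      static void bit_spin_lock_model(atomic_uint *flags)
      {
          /* spin until this caller is the one that set the bit */
          while (atomic_fetch_or(flags, PG_LOCKED_BIT) & PG_LOCKED_BIT)
              ;
      }

      static void bit_spin_unlock_model(atomic_uint *flags)
      {
          atomic_fetch_and(flags, ~PG_LOCKED_BIT);
      }

      int main(void)
      {
          atomic_uint tail_flags = 0;

          bit_spin_lock_model(&tail_flags);
          /* ... update compound_mapcount / subpages_mapcount here ... */
          bit_spin_unlock_model(&tail_flags);

          printf("flags after unlock: %u\n", atomic_load(&tail_flags));
          return 0;
      }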

    Add page_dup_compound_rmap() to manage compound locking versus atomics in
    sync with the rest.  In particular, hugetlb pages are still using the
    atomics: to avoid unnecessary interference there, and because they never
    have subpage mappings; but this exception can easily be changed.
    Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's
    __split_huge_pmd_locked() too.

    bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT
    sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure is to
    the non-hugetlb non-THP pte-mapped compound pages (with large folios being
    currently dependent on TRANSPARENT_HUGEPAGE).  There is never any scan of
    subpages in this case; but we have chosen to use PageCompound tests rather
    than PageTransCompound tests to gate the use of lock_compound_mapcounts(),
    so that page_mapped() is correct on all compound pages, whether or not
    TRANSPARENT_HUGEPAGE is enabled: could that be a problem for PREEMPT_RT,
    when there is contention on the lock - under heavy concurrent forking for
    example?  If so, then it can be turned into a sleeping lock (like
    folio_lock()) when PREEMPT_RT.

    A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
    18 seconds on small pages, and used to take 1 second on huge pages, but
    now takes 115 milliseconds on huge pages.  Mapping by pmds a second time
    used to take 860ms and now takes 86ms; mapping by pmds after mapping by
    ptes (when the scan is needed) used to take 870ms and now takes 495ms.
    Mapping huge pages by ptes is largely unaffected but variable: between 5%
    faster and 5% slower in what I've recorded.  Contention on the lock is
    likely to behave worse than contention on the atomics behaved.

    Link: https://lkml.kernel.org/r/1b42bd1a-8223-e827-602f-d466c2db7d3c@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:26 -04:00
Chris von Recklinghausen e1c02a97f1 mm,thp,rmap: simplify compound page mapcount handling
Conflicts:
	include/linux/mm.h - We already have
		a1554c002699 ("include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h")
		so keep declaration of nr_free_buffer_pages
	mm/huge_memory.c - We already have RHEL-only commit
		0837bdd68b ("Revert "mm: thp: stabilize the THP mapcount in page_remove_anon_compound_rmap"")
		so there is a difference in deleted code.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit cb67f4282bf9693658dbda934a441ddbbb1446df
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Nov 2 18:51:38 2022 -0700

    mm,thp,rmap: simplify compound page mapcount handling

    Compound page (folio) mapcount calculations have been different for anon
    and file (or shmem) THPs, and involved the obscure PageDoubleMap flag.
    And each huge mapping and unmapping of a file (or shmem) THP involved
    atomically incrementing and decrementing the mapcount of every subpage of
    that huge page, dirtying many struct page cachelines.

    Add subpages_mapcount field to the struct folio and first tail page, so
    that the total of subpage mapcounts is available in one place near the
    head: then page_mapcount() and total_mapcount() and page_mapped(), and
    their folio equivalents, are so quick that anon and file and hugetlb don't
    need to be optimized differently.  Delete the unloved PageDoubleMap.

    page_add and page_remove rmap functions must now maintain the
    subpages_mapcount as well as the subpage _mapcount, when dealing with pte
    mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
    NR_FILE_MAPPED statistics still needs reading through the subpages, using
    nr_subpages_unmapped() - but only when first or last pmd mapping finds
    subpages_mapcount raised (double-map case, not the common case).

    But are those counts (used to decide when to split an anon THP, and in
    vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
    quite: since page_remove_rmap() (and also split_huge_pmd()) is often
    called without page lock, there can be races when a subpage pte mapcount
    0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
    previous implementation had prevented.  The statistics might become
    inaccurate, and even drift down until they underflow through 0.  That is
    not good enough, but is better dealt with in a followup patch.

    Update a few comments on first and second tail page overlaid fields.
    hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
    subpages_mapcount and compound_pincount are already correctly at 0, so
    delete its reinitialization of compound_pincount.

    A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
    18 seconds on small pages, and used to take 1 second on huge pages, but
    now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
    used to take 860ms and now takes 92ms; mapping by pmds after mapping by
    ptes (when the scan is needed) used to take 870ms and now takes 495ms.
    But there might be some benchmarks which would show a slowdown, because
    tail struct pages now fall out of cache until final freeing checks them.

    Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:26 -04:00
Chris von Recklinghausen bdc3c88db4 mm/hugetlb: unify clearing of RestoreReserve for private pages
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4781593d5dbae50500d1c7975be03b590ae2b92a
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Oct 20 15:38:32 2022 -0400

    mm/hugetlb: unify clearing of RestoreReserve for private pages

    A trivial cleanup to move clearing of RestoreReserve into adding anon rmap
    of private hugetlb mappings.  It matches with the shared mappings where we
    only clear the bit when adding into page cache, rather than spreading it
    around the code paths.

    Link: https://lkml.kernel.org/r/20221020193832.776173-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:17 -04:00
Chris von Recklinghausen 3b309fdf96 rmap: remove page_unlock_anon_vma_read()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 0c826c0b6a176b9ed5ace7106fd1770bb48f1898
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:51 2022 +0100

    rmap: remove page_unlock_anon_vma_read()

    This has simply been an alias for anon_vma_unlock_read() since 2011.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-56-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:05 -04:00