Commit Graph

1243 Commits

Author SHA1 Message Date
Chris von Recklinghausen 98ae253390 mm: refactor do_fault_around()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 9042599e81c295f0b12d940248d6608e87e7b6b6
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Fri Mar 17 21:58:25 2023 +0000

    mm: refactor do_fault_around()

    Patch series "Refactor do_fault_around()"

    Refactor do_fault_around() to avoid bitwise tricks and rather difficult to
    follow logic.  Additionally, prefer fault_around_pages to
    fault_around_bytes as the operations are performed at a base page
    granularity.

    This patch (of 2):

    The existing logic is confusing and fails to abstract a number of bitwise
    tricks.

    Use ALIGN_DOWN() to perform alignment, pte_index() to obtain a PTE index
    and represent the address range using PTE offsets, which naturally make it
    clear that the operation is intended to occur within only a single PTE and
    prevent spanning of more than one page table.

    We rely on the fact that fault_around_bytes will always be page-aligned,
    at least one page in size, a power of two and that it will not exceed
    PAGE_SIZE * PTRS_PER_PTE in size (i.e.  the address space mapped by a
    PTE).  These are all guaranteed by fault_around_bytes_set().
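
    As a rough illustration of the new shape (paraphrased, not the verbatim
    upstream diff; local variable names are approximations):

      pgoff_t nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
      pgoff_t pte_off  = pte_index(vmf->address);
      /* page offset of vmf->address within the VMA */
      pgoff_t vma_off  = vmf->pgoff - vmf->vma->vm_pgoff;

      /* start of the window, aligned down and clamped to the VMA start */
      pgoff_t from_pte = max(ALIGN_DOWN(pte_off, nr_pages),
                             pte_off - min(pte_off, vma_off));
      /* end of the window, clamped to the VMA end and to a single page table */
      pgoff_t to_pte   = min3(from_pte + nr_pages, (pgoff_t)PTRS_PER_PTE,
                              pte_off + vma_pages(vmf->vma) - vma_off) - 1;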

    Link: https://lkml.kernel.org/r/cover.1679089214.git.lstoakes@gmail.com
    Link: https://lkml.kernel.org/r/d125db1c3665a63b80cea29d56407825482e2262.1679089214.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:29 -04:00
Chris von Recklinghausen a500967bd6 mm: memory: use folio_throttle_swaprate() in do_cow_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 68fa572b503ce8bfd0d0c2e5bb185134086d7d7d
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:34 2023 +0800

    mm: memory: use folio_throttle_swaprate() in do_cow_fault()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().
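
    The change at each call site is essentially a one-liner; roughly (the gfp
    mask shown is a placeholder for whatever the call site already passes):

      /* before */
      cgroup_throttle_swaprate(&folio->page, GFP_KERNEL);
      /* after */
      folio_throttle_swaprate(folio, GFP_KERNEL);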

    Link: https://lkml.kernel.org/r/20230302115835.105364-7-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:07 -04:00
Chris von Recklinghausen 6fce747219 mm: memory: use folio_throttle_swaprate() in do_anonymous_page()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit e2bf3e2caa62f72d6a67048df440d83a12ae1a2a
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:33 2023 +0800

    mm: memory: use folio_throttle_swaprate() in do_anonymous_page()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-6-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:07 -04:00
Chris von Recklinghausen d37b08b2a2 mm: memory: use folio_throttle_swaprate() in wp_page_copy()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4d4f75bf3293f35ae1eb1ecf8b70bffdde58ffbe
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:32 2023 +0800

    mm: memory: use folio_throttle_swaprate() in wp_page_copy()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:07 -04:00
Chris von Recklinghausen b9e719cc5f mm: memory: use folio_throttle_swaprate() in page_copy_prealloc()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit e601ded4247f959702adb5170ca8abac17a0313f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:31 2023 +0800

    mm: memory: use folio_throttle_swaprate() in page_copy_prealloc()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:06 -04:00
Chris von Recklinghausen 0358f65269 mm: memory: use folio_throttle_swaprate() in do_swap_page()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4231f8425833b144f165f01f33887b67f494acf0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:30 2023 +0800

    mm: memory: use folio_throttle_swaprate() in do_swap_page()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:06 -04:00
Chris von Recklinghausen eca45431b9 x86/mm/pat: clear VM_PAT if copy_p4d_range failed
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d155df53f31068c3340733d586eb9b3ddfd70fc5
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Fri Feb 17 10:56:15 2023 +0800

    x86/mm/pat: clear VM_PAT if copy_p4d_range failed

    Syzbot reports a warning in untrack_pfn().  Digging into the root we found
    that this is due to memory allocation failure in pmd_alloc_one.  And this
    failure is produced due to failslab.

    In copy_page_range(), memory allocation for the pmd failed.  During the
    error handling process in copy_page_range(), mmput() is called to remove
    all vmas.  When untrack_pfn() is called on this empty pfn, the warning
    happens.

    Here's a simplified flow:

    dup_mm
      dup_mmap
        copy_page_range
          copy_p4d_range
            copy_pud_range
              copy_pmd_range
                pmd_alloc
                  __pmd_alloc
                    pmd_alloc_one
                      page = alloc_pages(gfp, 0);
                        if (!page)
                          return NULL;
        mmput
            exit_mmap
              unmap_vmas
                unmap_single_vma
                  untrack_pfn
                    follow_phys
                      WARN_ON_ONCE(1);

    Since this vma was not generated successfully, we can clear the VM_PAT
    flag.  In this case, untrack_pfn() will not be called while cleaning up
    this vma.

    Function untrack_pfn_moved() has also been renamed to fit the new logic.
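
    A minimal sketch of the idea, assuming the renamed helper simply drops the
    flag so the later unmap path skips untrack_pfn():

      /* sketch: called on the error path for a vma whose PFN range was never tracked */
      static inline void untrack_pfn_clear(struct vm_area_struct *vma)
      {
              vma->vm_flags &= ~VM_PAT;
      }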

    Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Suresh Siddha <suresh.b.siddha@intel.com>
    Cc: Toshi Kani <toshi.kani@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:04 -04:00
Aristeu Rozanski b36ffff80e mm/uffd: fix comment in handling pte markers
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7a079ba20090ab50d2f4203ceccd1e0f4becd1a6
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Feb 15 15:58:00 2023 -0500

    mm/uffd: fix comment in handling pte markers

    The comment is obsolete after f369b07c8614 ("mm/uffd: reset write
    protection when unregister with wp-mode", 2022-08-20).  Remove it.

    Link: https://lkml.kernel.org/r/20230215205800.223549-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski b6aad98b56 mm: introduce __vm_flags_mod and use it in untrack_pfn
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 68f48381d7fdd1cbb9d88c37a4dfbb98ac78226d
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:51 2023 -0800

    mm: introduce __vm_flags_mod and use it in untrack_pfn

    There are scenarios when vm_flags can be modified without exclusive
    mmap_lock, such as:
    - after VMA was isolated and mmap_lock was downgraded or dropped
    - in exit_mmap when there are no other mm users and locking is unnecessary
    Introduce __vm_flags_mod to avoid assertions when the caller takes
    responsibility for the required locking.
    Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for
    flags modification to avoid assertion.
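
    Roughly, the pair looks like this (simplified sketch, not the full set of
    helpers):

      /* no locking assertion: the caller guarantees exclusivity */
      static inline void __vm_flags_mod(struct vm_area_struct *vma,
                                        vm_flags_t set, vm_flags_t clear)
      {
              vm_flags_init(vma, (vma->vm_flags | set) & ~clear);
      }

      /* locked variant: asserts mmap_lock is held for write */
      static inline void vm_flags_mod(struct vm_area_struct *vma,
                                      vm_flags_t set, vm_flags_t clear)
      {
              mmap_assert_write_locked(vma->vm_mm);
              __vm_flags_mod(vma, set, clear);
      }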

    Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.
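
    The conversion itself is mechanical; with illustrative flags it looks like:

      /* before: direct modification of vma->vm_flags */
      vma->vm_flags |= VM_DONTEXPAND;
      vma->vm_flags &= ~VM_MAYWRITE;

      /* after: modifier helpers that can assert locking and track changes */
      vm_flags_set(vma, VM_DONTEXPAND);
      vm_flags_clear(vma, VM_MAYWRITE);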

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Aristeu Rozanski 274713ebbe mm: use a folio in copy_present_pte()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 14ddee4126fecff5c5c0a84940ba34f0bfe3e708
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:13 2023 +0000

    mm: use a folio in copy_present_pte()

    We still have to keep the page around because we need to know which page
    in the folio we're copying, but we can replace five implicit calls to
    compound_head() with one.

    Link: https://lkml.kernel.org/r/20230116191813.2145215-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski 532bb0bf59 mm: use a folio in copy_pte_range()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit edf5047058395c89a912783ea29ec8f9e53be414
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:12 2023 +0000

    mm: use a folio in copy_pte_range()

    Allocate an order-0 folio instead of a page and pass it all the way down
    the call chain.  Removes dozens of calls to compound_head().
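
    A sketch of the allocation side of the change (simplified; variable names
    are assumed from the surrounding prealloc helper and error handling in the
    real code is more involved):

      /* allocate an order-0 folio up front and pass it down the copy path */
      struct folio *new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                                                vma, addr, false);

      if (!new_folio)
              return NULL;
      if (mem_cgroup_charge(new_folio, src_mm, GFP_KERNEL)) {
              folio_put(new_folio);
              return NULL;
      }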

    Link: https://lkml.kernel.org/r/20230116191813.2145215-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski a9ebe2a98c mm: convert do_anonymous_page() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: d4f9565ae598 is already backported

commit cb3184deef10fdc7658fb366189864c89ad118c9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:10 2023 +0000

    mm: convert do_anonymous_page() to use a folio

    Removes six calls to compound_head(); some inline and some external.

    Link: https://lkml.kernel.org/r/20230116191813.2145215-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski e7030d52b7 mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped arches we don't support

commit 950fe885a89770619e315f9b46301eebf0aab7b3
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Jan 13 18:10:26 2023 +0100

    mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE

    __HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that
    support swp PTEs, so let's drop it.

    Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
Aristeu Rozanski 5455c3da6d mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7d4a8be0c4b2b7ffb367929d2b352651f083806b
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jan 10 13:57:22 2023 +1100

    mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

    mmu_notifier_range_update_to_read_only() was originally introduced in
    commit c6d23413f8 ("mm/mmu_notifier:
    mmu_notifier_range_update_to_read_only() helper") as an optimisation for
    device drivers that know a range has only been mapped read-only.  However
    there are no users of this feature so remove it.  As it is the only user
    of the struct mmu_notifier_range.vma field remove that also.

    Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Aristeu Rozanski d908e3177a mm: add vma_has_recency()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 8788f6781486769d9598dcaedc3fe0eb12fc3e59
Author: Yu Zhao <yuzhao@google.com>
Date:   Fri Dec 30 14:52:51 2022 -0700

    mm: add vma_has_recency()

    Add vma_has_recency() to indicate whether a VMA may exhibit temporal
    locality that the LRU algorithm relies on.

    This function returns false for VMAs marked by VM_SEQ_READ or
    VM_RAND_READ.  While the former flag indicates linear access, i.e., a
    special case of spatial locality, both flags indicate a lack of temporal
    locality, i.e., the reuse of an area within a relatively small duration.

    "Recency" is chosen over "locality" to avoid confusion between temporal
    and spatial localities.

    Before this patch, the active/inactive LRU only ignored the accessed bit
    from VMAs marked by VM_SEQ_READ.  After this patch, the active/inactive
    LRU and MGLRU share the same logic: they both ignore the accessed bit if
    vma_has_recency() returns false.
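
    Based on the description above, the helper is essentially:

      static inline bool vma_has_recency(struct vm_area_struct *vma)
      {
              /* both flags signal a lack of temporal locality */
              if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
                      return false;
              return true;
      }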

    For the active/inactive LRU, the following fio test showed a [6, 8]%
    increase in IOPS when randomly accessing mapped files under memory
    pressure.

      kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
      kb=$((kb - 8*1024*1024))

      modprobe brd rd_nr=1 rd_size=$kb
      dd if=/dev/zero of=/dev/ram0 bs=1M

      mkfs.ext4 /dev/ram0
      mount /dev/ram0 /mnt/
      swapoff -a

      fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \
          --size=8G --rw=randrw --time_based --runtime=10m \
          --group_reporting

    The discussion that led to this patch is here [1].  Additional test
    results are available in that thread.

    [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/

    Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andrea Righi <andrea.righi@canonical.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Aristeu Rozanski 20dd56698e mm: remove zap_page_range and create zap_vma_pages
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped RISCV changes, and due missing b59c9dc4d9d47b

commit e9adcfecf572fcfaa9f8525904cf49c709974f73
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Jan 3 16:27:32 2023 -0800

    mm: remove zap_page_range and create zap_vma_pages

    zap_page_range was originally designed to unmap pages within an address
    range that could span multiple vmas.  While working on [1], it was
    discovered that all callers of zap_page_range pass a range entirely within
    a single vma.  In addition, the mmu notification call within zap_page_range
    does not correctly handle ranges that span multiple vmas.  When
    crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
    the new vma should be made.

    Instead of fixing zap_page_range, do the following:
    - Create a new routine zap_vma_pages() that will remove all pages within
      the passed vma.  Most users of zap_page_range pass the entire vma and
      can use this new routine.
    - For callers of zap_page_range not passing the entire vma, instead call
      zap_page_range_single().
    - Remove zap_page_range.
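
    Per the first point above, the new helper is essentially a thin wrapper:

      static inline void zap_vma_pages(struct vm_area_struct *vma)
      {
              zap_page_range_single(vma, vma->vm_start,
                                    vma->vm_end - vma->vm_start, NULL);
      }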

    [1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
    Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Heiko Carstens <hca@linux.ibm.com>    [s390]
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Rafael Aquini da2c6c408a mm/swap: fix race when skipping swapcache
JIRA: https://issues.redhat.com/browse/RHEL-31646
CVE: CVE-2024-26759

This patch is a backport of the following upstream commit:
commit 13ddaf26be324a7f951891ecd9ccd04466d27458
Author: Kairui Song <kasong@tencent.com>
Date:   Wed Feb 7 02:25:59 2024 +0800

    mm/swap: fix race when skipping swapcache

    When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
    swapin the same entry at the same time, they get different pages (A, B).
    Before one thread (T0) finishes the swapin and installs page (A) to the
    PTE, another thread (T1) could finish swapin of page (B), swap_free the
    entry, then swap out the possibly modified page reusing the same entry.
    It breaks the pte_same check in (T0) because PTE value is unchanged,
    causing an ABA problem.  Thread (T0) will install a stale page (A) into the
    PTE and cause data corruption.

    One possible callstack is like this:

    CPU0                                 CPU1
    ----                                 ----
    do_swap_page()                       do_swap_page() with same entry
    <direct swapin path>                 <direct swapin path>
    <alloc page A>                       <alloc page B>
    swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
    <slow on later locks or interrupt>   <finished swapin first>
    ...                                  set_pte_at()
                                         swap_free() <- entry is free
                                         <write to page B, now page A stalled>
                                         <swap out page B to same swap entry>
    pte_same() <- Check pass, PTE seems
                  unchanged, but page A
                  is stale!
    swap_free() <- page B content lost!
    set_pte_at() <- stale page A installed!

    And besides, for ZRAM, swap_free() allows the swap device to discard the
    entry content, so even if page (B) is not modified, if swap_read_folio()
    on CPU0 happens later than swap_free() on CPU1, it may also cause data
    loss.

    To fix this, reuse swapcache_prepare which will pin the swap entry using
    the cache flag, and allow only one thread to swap it in, also prevent any
    parallel code from putting the entry in the cache.  Release the pin after
    PT unlocked.

    Racers just loop and wait since it's a rare and very short event.  A
    schedule_timeout_uninterruptible(1) call is added to avoid repeated page
    faults wasting too much CPU, causing livelock or adding too much noise to
    perf statistics.  A similar livelock issue was described in commit
    029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead")
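
    A simplified sketch of the synchronization added to do_swap_page() (not
    the exact diff):

      if (swapcache_prepare(entry)) {
              /* someone else is swapping this entry in: back off and retry the fault */
              schedule_timeout_uninterruptible(1);
              goto out;
      }
      /* ... swapin proceeds with the entry pinned by the cache flag ... */

      /* after the page table lock has been dropped */
      swapcache_clear(si, entry);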

    Reproducer:

    This race issue can be triggered easily using a well constructed
    reproducer and patched brd (with a delay in read path) [1]:

    With latest 6.8 mainline, race caused data loss can be observed easily:
    $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
      Polulating 32MB of memory region...
      Keep swapping out...
      Starting round 0...
      Spawning 65536 workers...
      32746 workers spawned, wait for done...
      Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
      Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
      Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
      Round 0 Failed, 15 data loss!

    This reproducer spawns multiple threads sharing the same memory region
    using a small swap device.  Every two threads update mapped pages one by
    one in opposite directions, trying to create a race, with one dedicated
    thread that keeps swapping the data out using madvise.

    The reproducer produced a reproduction rate of about once every 5 minutes,
    so the race is entirely possible in production.

    After this patch, I ran the reproducer for over a few hundred rounds and
    no data loss observed.

    Performance overhead is minimal, microbenchmark swapin 10G from 32G
    zram:

    Before:     10934698 us
    After:      11157121 us
    Cached:     13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

    [kasong@tencent.com: v4]
      Link: https://lkml.kernel.org/r/20240219082040.7495-1-ryncsn@gmail.com
    Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com
    Fixes: 0bcac06f27 ("mm, swap: skip swapcache for swapin of synchronous device")
    Reported-by: "Huang, Ying" <ying.huang@intel.com>
    Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
    Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
    Signed-off-by: Kairui Song <kasong@tencent.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Chris Li <chrisl@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Barry Song <21cnbao@gmail.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-04-11 16:08:24 -04:00
Audra Mitchell 4efa595c94 mm: hwpoison: support recovery from ksm_might_need_to_copy()
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    minor context conflict due to out of order backport related to the v6.1 update
    e6131c89a5 ("mm/swapoff: allow pte_offset_map[_lock]() to fail")

This patch is a backport of the following upstream commit:
commit 6b970599e807ea95c653926d41b095a92fd381e2
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Dec 9 15:28:01 2022 +0800

    mm: hwpoison: support recovery from ksm_might_need_to_copy()

    When the kernel copies a page from ksm_might_need_to_copy() but runs into
    an uncorrectable error, it will crash since the poisoned page is consumed
    by the kernel; this is similar to the issue recently fixed by copy-on-write
    poison recovery.

    When an error is detected during the page copy, return VM_FAULT_HWPOISON
    in do_swap_page(), and install a hwpoison entry in unuse_pte() during
    swapoff, which helps us avoid a system crash.  Note, memory failure on a
    KSM page will be skipped, but we still call memory_failure_queue() to be
    consistent with the general memory failure process, and we could support
    KSM page recovery in the future.
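
    A simplified sketch of the do_swap_page() side of this:

      page = ksm_might_need_to_copy(page, vma, vmf->address);
      if (unlikely(!page)) {
              ret = VM_FAULT_OOM;
              goto out_page;
      } else if (unlikely(PTR_ERR(page) == -EHWPOISON)) {
              /* the copy hit an uncorrectable memory error */
              ret = VM_FAULT_HWPOISON;
              goto out_page;
      }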

    [wangkefeng.wang@huawei.com: enhance unuse_pte(), fix issue found by lkp]
      Link: https://lkml.kernel.org/r/20221213120523.141588-1-wangkefeng.wang@huawei.com
    [wangkefeng.wang@huawei.com: update changelog, alter ksm_might_need_to_copy(), restore unlikely() in unuse_pte()]
      Link: https://lkml.kernel.org/r/20230201074433.96641-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20221209072801.193221-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:03 -04:00
Audra Mitchell e2dcaac9a6 mm: remove VM_FAULT_WRITE
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context differences due to out of order backport
    9e9103fead ("mm: convert wp_page_copy() to use folios")

This patch is a backport of the following upstream commit:
commit cb8d863313436339fb60f7dd5131af2e5854621e
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 21 12:11:35 2022 +0200

    mm: remove VM_FAULT_WRITE

    All users -- GUP and KSM -- are gone, let's just remove it.

    Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:01 -04:00
Audra Mitchell c234788260 mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit aea06577a9005ca81c35196d6171cac346d3b251
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:46 2022 +0100

    mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings

    If we already have a PMD/PUD mapped write-protected in a private mapping
    and we want to break COW either due to FAULT_FLAG_WRITE or
    FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on
    the PTE path.

    Let's just split (->zap) + fallback in that case.
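
    Roughly, for the PMD case (simplified sketch):

      /* only shared mappings still go to ->huge_fault() for write/unshare faults */
      if (vma->vm_ops->huge_fault && !unshare &&
          (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)))
              return vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);

      /* COW or unsharing in a private mapping: split, then fall back to PTEs */
      __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
      return VM_FAULT_FALLBACK;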

    This is a preparation for more generic FAULT_FLAG_UNSHARE support in
    COW mappings.

    Link: https://lkml.kernel.org/r/20221116102659.70287-8-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
Audra Mitchell f9d1d9a9f0 mm: add early FAULT_FLAG_WRITE consistency checks
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 79881fed6052a9ce00cfb63297832b9faacf8cf3
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:44 2022 +0100

    mm: add early FAULT_FLAG_WRITE consistency checks

    Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
    care in all other handlers and might get "surprises" if we forget to do
    so.

    Write faults without VM_MAYWRITE don't make any sense, and our
    maybe_mkwrite() logic could have hidden such abuse for now.

    Write faults without VM_WRITE on something that is not a COW mapping is
    similarly broken, and e.g., do_wp_page() could end up placing an
    anonymous page into a shared mapping, which would be bad.

    This is a preparation for reliable R/O long-term pinning of pages in
    private mappings, whereby we want to make sure that we will never break
    COW in a read-only private mapping.
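
    A sketch of the early check (simplified):

      if (*flags & FAULT_FLAG_WRITE) {
              /* write faults on mappings without VM_MAYWRITE make no sense */
              if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
                      return VM_FAULT_SIGSEGV;
              /* !VM_WRITE is only acceptable for COW (private writable) mappings */
              if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
                               !is_cow_mapping(vma->vm_flags)))
                      return VM_FAULT_SIGSEGV;
      }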

    Link: https://lkml.kernel.org/r/20221116102659.70287-6-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
Audra Mitchell d115504bd5 mm: add early FAULT_FLAG_UNSHARE consistency checks
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context difference due to out of order backports:
    c007e2df2e ("mm/hugetlb: fix uffd wr-protection for CoW optimization path")
    92a1aa89946b ("mm: rework handling in do_wp_page() based on private vs. shared mappings")
    887f390a3d60 ("mm: ptep_get() conversion")

This patch is a backport of the following upstream commit:
commit cdc5021cda194112bc0962d6a0e90b379968c504
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:43 2022 +0100

    mm: add early FAULT_FLAG_UNSHARE consistency checks

    For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
    implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
    dealing with a COW mapping, such that we treat it like a read fault as
    documented and don't have to worry about the flag throughout all fault
    handlers.

    While at it, centralize the check for mutual exclusion of
    FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
    either flag is set in the WP handler.
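
    A sketch of the centralized handling (simplified):

      if (*flags & FAULT_FLAG_UNSHARE) {
              /* FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE are mutually exclusive */
              if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE))
                      return VM_FAULT_SIGSEGV;
              /* outside COW mappings, treat unshare like a plain read fault */
              if (!is_cow_mapping(vma->vm_flags))
                      *flags &= ~FAULT_FLAG_UNSHARE;
      }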

    Link: https://lkml.kernel.org/r/20221116102659.70287-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
Audra Mitchell 48f4b25890 mm: mmu_gather: do not expose delayed_rmap flag
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit f036c8184f8b6750fa642485fb01eb6ff036a86b
Author: Alexander Gordeev <agordeev@linux.ibm.com>
Date:   Wed Nov 16 08:49:30 2022 +0100

    mm: mmu_gather: do not expose delayed_rmap flag

    Flag delayed_rmap of 'struct mmu_gather' is rather a private member, but
    it is still accessed directly.  Instead, let the TLB gather code access
    the flag.

    Link: https://lkml.kernel.org/r/Y3SWCu6NRaMQ5dbD@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com
    Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:56 -04:00
Audra Mitchell d2722a500f mm: delay page_remove_rmap() until after the TLB has been flushed
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 5df397dec7c4c08c23bd14f162f1228836faa4ce
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Nov 9 12:30:51 2022 -0800

    mm: delay page_remove_rmap() until after the TLB has been flushed

    When we remove a page table entry, we are very careful to only free the
    page after we have flushed the TLB, because other CPUs could still be
    using the page through stale TLB entries until after the flush.

    However, we have removed the rmap entry for that page early, which means
    that functions like folio_mkclean() would end up not serializing with the
    page table lock because the page had already been made invisible to rmap.

    And that is a problem, because while the TLB entry exists, we could end up
    with the following situation:

     (a) one CPU could come in and clean it, never seeing our mapping of the
         page

     (b) another CPU could continue to use the stale and dirty TLB entry and
         continue to write to said page

    resulting in a page that has been dirtied, but then marked clean again,
    all while another CPU might have dirtied it some more.

    End result: possibly lost dirty data.

    This extends our current TLB gather infrastructure to optionally track a
    "should I do a delayed page_remove_rmap() for this page after flushing the
    TLB".  It uses the newly introduced 'encoded page pointer' to do that
    without having to keep separate data around.
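
    A sketch of the post-flush pass (simplified; batch layout and helper names
    follow the encoded-page-pointer series and are approximate):

      for (i = 0; i < batch->nr; i++) {
              struct encoded_page *enc = batch->encoded_pages[i];

              /* the low pointer bit marks pages whose rmap removal was deferred */
              if (encoded_page_flags(enc))
                      page_remove_rmap(encoded_page_ptr(enc), vma, false);
      }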

    Note, this is complicated by a couple of issues:

     - we want to delay the rmap removal, but not past the page table lock,
       because that simplifies the memcg accounting

     - only SMP configurations want to delay TLB flushing, since on UP
       there are obviously no remote TLBs to worry about, and the page
       table lock means there are no preemption issues either

     - s390 has its own mmu_gather model that doesn't delay TLB flushing,
       and as a result also does not want the delayed rmap. As such, we can
       treat S390 like the UP case and use a common fallback for the "no
       delays" case.

     - we can track an enormous number of pages in our mmu_gather structure,
       with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
       all set up to be approximately 10k pending pages.

       We do not want to have a huge number of batched pages that we then
       need to check for delayed rmap handling inside the page table lock.

    Particularly that last point results in a noteworthy detail, where the
    normal page batch gathering is limited once we have delayed rmaps pending,
    in such a way that only the last batch (the so-called "active batch") in
    the mmu_gather structure can have any delayed entries.

    NOTE!  While the "possibly lost dirty data" sounds catastrophic, for this
    all to happen you need to have a user thread doing either madvise() with
    MADV_DONTNEED or a full re-mmap() of the area concurrently with another
    thread continuing to use said mapping.

    So arguably this is about user space doing crazy things, but from a VM
    consistency standpoint it's better if we track the dirty bit properly even
    when user space goes off the rails.

    [akpm@linux-foundation.org: fix UP build, per Linus]
    Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
    Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Reported-by: Nadav Amit <nadav.amit@gmail.com>
    Tested-by: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:56 -04:00
Audra Mitchell c2df118635 mm: always compile in pte markers
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context conflicts due to out of order backport:
    804153f7df ("mm: use pte markers for swap errors")

This patch is a backport of the following upstream commit:
commit ca92ea3dc5a2b01f98e9f02b7a6bc03be06fe124
Author: Peter Xu <peterx@redhat.com>
Date:   Sun Oct 30 17:41:50 2022 -0400

    mm: always compile in pte markers

    Patch series "mm: Use pte marker for swapin errors".

    This series uses the pte marker to replace the swapin error swap entry,
    then we save one more swap entry slot for swap devices.  A new pte marker
    bit is defined.

    This patch (of 2):

    The PTE markers code is tiny and is now enabled for most distributions.
    It's fine to keep it as-is, but to make broader use of it (e.g., replacing
    the read error swap entry) it needs to always be there, otherwise we need a
    special code path to take care of the !PTE_MARKER case.

    It'll be easier to just make pte markers always exist.  Use this chance to
    extend their usage to anonymous pages too by simply touching up some of the
    old comments, because they'll be used for anonymous pages in the follow-up
    patches.

    Link: https://lkml.kernel.org/r/20221030214151.402274-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221030214151.402274-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Huang Ying <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:56 -04:00
Chris von Recklinghausen 7096ad3b1e mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 8d6a0ac09a16c026e1e2a03a61e12e95c48a25a6
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:47 2022 +0100

    mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping

    Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a
    COW (i.e., private writable) mapping and adjust the documentation
    accordingly.

    FAULT_FLAG_UNSHARE will now also break COW when encountering the shared
    zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by
    properly replacing the mapped page/pfn by a private copy (an exclusive
    anonymous page).

    Note that only do_wp_page() needs care: hugetlb_wp() already handles
    FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it
    correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE
    such that we can handle FAULT_FLAG_UNSHARE on the PTE level.

    This change is a requirement for reliable long-term R/O pinning in
    COW mappings.

    Link: https://lkml.kernel.org/r/20221116102659.70287-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen d43c5e6f48 mm: rework handling in do_wp_page() based on private vs. shared mappings
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit b9086fde6d44e8a95dc95b822bd87386129b832d
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:45 2022 +0100

    mm: rework handling in do_wp_page() based on private vs. shared mappings

    We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a
    COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages.
    Let's prepare for that by handling shared mappings first such that we can
    handle private mappings last.

    While at it, use folio-based functions instead of page-based functions
    where we touch the code either way.

    Link: https://lkml.kernel.org/r/20221116102659.70287-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen 26437a89ef mm: remove the vma linked list
Conflicts:
	include/linux/mm.h - We already have
		21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
		so keep declaration for zap_page_range_single
	kernel/fork.c - We already have
		f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
		so keep declaration of i
	mm/mmap.c - We already have
		a1e8cb93bf ("mm: drop oom code from exit_mmap")
		and
		db3644c677 ("mm: delete unused MMF_OOM_VICTIM flag")
		so keep setting MMF_OOM_SKIP in mm->flags

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 763ecb035029f500d7e6dc99acd1ad299b7726a1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:06 2022 +0000

    mm: remove the vma linked list

    Replace any vm_next use with vma_find().
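
    Illustratively, the iteration pattern changes roughly like this:

      /* before: linked-list walk via vm_next */
      for (vma = mm->mmap; vma; vma = vma->vm_next)
              do_something(vma);

      /* after: maple tree iteration */
      VMA_ITERATOR(vmi, mm, 0);
      for_each_vma(vmi, vma)
              do_something(vma);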

    Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
    maple tree.

    Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
    the same time, alter the loop to be more compact.

    Now that free_pgtables() and unmap_vmas() take a maple tree as an
    argument, rearrange do_mas_align_munmap() to use the new tree to hold the
    vmas to remove.

    Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
    used to update the linked list.

    Drop linked list update from __insert_vm_struct().

    Rework validation of tree as it was depending on the linked list.

    [yang.lee@linux.alibaba.com: fix one kernel-doc comment]
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
      Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:57 -04:00
Prarit Bhargava 25cf7e4e50 mm: Make pte_mkwrite() take a VMA
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: This is a rip and replace of pte_mkwrite() with one arg for
pte_mkwrite() with two args.  There are uses upstream that are not yet
in RHEL9.

commit 161e393c0f63592a3b95bdd8b55752653763fc6d
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date:   Mon Jun 12 17:10:29 2023 -0700

    mm: Make pte_mkwrite() take a VMA

    The x86 Shadow stack feature includes a new type of memory called shadow
    stack. This shadow stack memory has some unusual properties, which requires
    some core mm changes to function properly.

    One of these unusual properties is that shadow stack memory is writable,
    but only in limited ways. These limits are applied via a specific PTE
    bit combination. Nevertheless, the memory is writable, and core mm code
    will need to apply the writable permissions in the typical paths that
    call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
    that the x86 implementation of it can know whether to create regular
    writable or shadow stack mappings.

    But there are a couple of challenges to this. Modifying the signatures of
    each arch pte_mkwrite() implementation would be error prone because some
    are generated with macros and would need to be re-implemented. Also, some
    pte_mkwrite() callers operate on kernel memory without a VMA.

    So this can be done in a three step process. First pte_mkwrite() can be
    renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
    added that just calls pte_mkwrite_novma(). Next callers without a VMA can
    be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
    can be changed to take/pass a VMA.

    Previous work renamed pte_mkwrite() to pte_mkwrite_novma() and converted
    callers that don't have a VMA to use pte_mkwrite_novma().  So now change
    pte_mkwrite() to take a VMA and change the remaining callers to pass a
    VMA.  Apply the same changes for pmd_mkwrite().

    No functional change.
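
    After the conversion, the generic fallback amounts to (sketch):

      static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
      {
              return pte_mkwrite_novma(pte);
      }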

    Suggested-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com

Omitted-fix: f441ff73f1ec powerpc: Fix pud_mkwrite() definition after pte_mkwrite() API changes
	pud_mkwrite() not in RHEL9 code for powerpc (removed previously)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:13 -04:00
Jerry Snitselaar efb6748971 mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
JIRA: https://issues.redhat.com/browse/RHEL-26541
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff due to some commits not being backported yet such as c33c794828f2 ("mm: ptep_get() conversion"),
           and 959a78b6dd45 ("mm/hugetlb: use a folio in hugetlb_wp()").

commit ec8832d007cb7b50229ad5745eec35b847cc9120
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jul 25 23:42:06 2023 +1000

    mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

    Secondary TLBs are now invalidated from the architecture specific TLB
    invalidation functions.  Therefore there is no need to explicitly notify
    or invalidate as part of the range end functions.  This means we can
    remove mmu_notifier_invalidate_range_end_only() and some of the
    ptep_*_notify() functions.

    Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrew Donnellan <ajd@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
    Cc: Frederic Barrat <fbarrat@linux.ibm.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kevin Tian <kevin.tian@intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Nicolin Chen <nicolinc@nvidia.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zhi Wang <zhi.wang.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit ec8832d007cb7b50229ad5745eec35b847cc9120)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-02-26 15:49:51 -07:00
Mika Penttilä 7ef8f6ec98 mm: fix a few rare cases of using swapin error pte marker
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc7

commit 7e3ce3f8d2d235f916baad1582f6cf12e0319013
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Wed Dec 14 15:04:53 2022 -0500
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Jan 18 17:02:19 2023 -0800

    This patch should harden commit 15520a3f0469 ("mm: use pte markers for
    swap errors") on using pte markers for swapin errors on a few corner
    cases.

    1. Propagate swapin errors across fork()s: if there're swapin errors in
       the parent mm, after fork()s the child should sigbus too when an error
       page is accessed.

    2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte
       marker can be quickly switched to a swapin error.

    3. Explicitly ignore swapin error pte markers in change_protection().

    I mostly don't worry about (2) or (3) at all, but we should still have them.
    Case (1) is special because it can potentially cause silent data corruption
    in the child when the parent has a swapin error triggered with swapoff, but
    since a swapin error is already rare itself it's probably not easy to
    trigger either.

    Currently there is a priority difference between the uffd-wp bit and the
    swapin error entry, in which the swapin error always has higher priority
    (e.g.  we don't need to wr-protect a swapin error pte marker).

    If there will be a 3rd bit introduced, we'll probably need to consider a
    more involved approach so we may need to start operate on the bits.  Let's
    leave that for later.

    This patch is tested with case (1) explicitly where we'll get corrupted
    data before in the child if there's existing swapin error pte markers, and
    after patch applied the child can be rightfully killed.

    We don't need to copy stable for this one since 15520a3f0469 just landed
    as part of v6.2-rc1, only "Fixes" applied.

    Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
    Fixes: 15520a3f0469 ("mm: use pte markers for swap errors")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Pengfei Xu <pengfei.xu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-30 07:03:06 +02:00
Mika Penttilä 6b269e16a3 mm/uffd: fix pte marker when fork() without fork event
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc7

commit 49d6d7fb631345b0f2957a7c4be24ad63903150f
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Wed Dec 14 15:04:52 2022 -0500
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Jan 18 17:02:19 2023 -0800

    Patch series "mm: Fixes on pte markers".

    Patch 1 resolves the syzkaller report from Pengfei.

    Patch 2 further hardens pte markers when used with the recent swapin error
    markers.  The major case is that we should persist a swapin error marker
    after fork(), so the child shouldn't read a corrupted page.


    This patch (of 2):

    During fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if the src
    may have it and has a pte marker installed.  The warning is improper along
    with the comment.  The right thing is to inherit the pte marker when needed,
    or keep the dst pte empty.

    A vague guess is that this happened by accident when the prior patch that
    introduced src/dst vma into this helper was developed for the uffd-wp
    feature and I probably messed up in the rebase, since if we replace dst_vma
    with src_vma the warning & comment all make sense too.

    Hugetlb did exactly the right here (copy_hugetlb_page_range()).  Fix the
    general path.

    Reproducer:

    https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c

    Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808

    Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
    Fixes: c56d1b62cce8 ("mm/shmem: handle uffd-wp during fork()")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: Pengfei Xu <pengfei.xu@intel.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: <stable@vger.kernel.org> # 5.19+
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-30 07:03:06 +02:00
Mika Penttilä 804153f7df mm: use pte markers for swap errors
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc1

commit 15520a3f046998e3f57e695743e99b0875e2dae7
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Sun Oct 30 17:41:51 2022 -0400
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Nov 30 15:58:46 2022 -0800

    PTE markers are an ideal mechanism for things like SWP_SWAPIN_ERROR.
    Using a whole swap entry type for this purpose is overkill, especially
    since we already have PTE markers.  Define a new marker bit for swapin
    errors and replace the swap entry type with it.  Then we can safely drop
    SWP_SWAPIN_ERROR and give one device slot back to swap.

    We used to have SWP_SWAPIN_ERROR take the page pfn as part of the swap
    entry, but it was never used.  Nor do I see how it could be useful,
    because normally a swapin failure should be caused by a bad swap device
    rather than a bad page.  Drop that alongside.
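
    A condensed sketch of the replacement (illustrative; the exact helper
    names in the upstream patch may differ slightly):

      #define PTE_MARKER_SWAPIN_ERROR  BIT(1)   /* marker bit instead of a swap type */

      static inline swp_entry_t make_swapin_error_entry(void)
      {
              return make_pte_marker(PTE_MARKER_SWAPIN_ERROR);
      }

      /* fault path: an error marker just kills the faulting task */
      static vm_fault_t handle_marker_fault(pte_marker marker)
      {
              if (marker & PTE_MARKER_SWAPIN_ERROR)
                      return VM_FAULT_SIGBUS;
              return 0;
      }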

    Link: https://lkml.kernel.org/r/20221030214151.402274-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Huang Ying <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-26 06:54:59 +03:00
Chris von Recklinghausen 41172cafb6 mm/memory: handle_pte_fault() use pte_offset_map_nolock()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c7ad08804fae5baa7f71c0790038e8259e1066a5
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:45:05 2023 -0700

    mm/memory: handle_pte_fault() use pte_offset_map_nolock()

    handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
    corresponds to vmf.pte, instead of pte_lockptr() being used later, when
    there's a chance that the pmd entry might have changed, perhaps to none,
    or to a huge pmd, with no split ptlock in its struct page.

    Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock() will
    handle that case by failing.  Update the "morph" comment above, looking
    forward to when shmem or file collapse to THP may not take mmap_lock for
    write (or not at all).

    do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
    refresh it when refreshing vmf->pte.

    do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
    two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
    do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
    so adjust that to set vmf->ptl by pte_offset_map_nolock().
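
    The resulting pattern, as a simplified sketch (illustrative names, not
    the upstream diff):

      static vm_fault_t handle_pte_sketch(struct mm_struct *mm, pmd_t *pmd,
                                          unsigned long address, pte_t orig_pte)
      {
              spinlock_t *ptl;
              pte_t *pte;

              /* map the pte and get the ptl that really covers this page table */
              pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
              if (!pte)
                      return 0;       /* pmd changed (none/huge): let the fault retry */

              spin_lock(ptl);
              if (unlikely(!pte_same(ptep_get(pte), orig_pte))) {
                      pte_unmap_unlock(pte, ptl);
                      return 0;       /* raced with another thread */
              }
              /* ... handle the fault with the pte mapped and ptl held ... */
              pte_unmap_unlock(pte, ptl);
              return 0;
      }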

    Link: https://lkml.kernel.org/r/c1107654-3929-60ac-223e-6877cbb86065@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:20 -04:00
Chris von Recklinghausen efe1a9d970 mm/memory: allow pte_offset_map[_lock]() to fail
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3db82b9374ca921b8b820a75e83809d5c4133d8f
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:43:38 2023 -0700

    mm/memory: allow pte_offset_map[_lock]() to fail

    copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
    but with a comment on some further assumptions that are being made there.

    zap_pte_range() and zap_pmd_range(): adjust their interaction so that a
    pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
    zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().

    Allow pte_offset_map_lock() to fail in many functions.  Update comment on
    calling pte_alloc() in do_anonymous_page().  Remove redundant calls to
    pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
    pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range() and
    copy_pmd_range(), those do simplify the next level down.
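
    The zap-side pattern, as a hedged sketch (names are illustrative and the
    real code also handles huge pmds before reaching this point):

      /* inner level: may fail if the pmd stopped being a page table */
      static bool zap_pte_range_sketch(struct mm_struct *mm, pmd_t *pmd,
                                       unsigned long addr, unsigned long end)
      {
              spinlock_t *ptl;
              pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

              if (!pte)
                      return false;   /* caller retries this pmd entry */
              /* ... zap the ptes in [addr, end) ... */
              pte_unmap_unlock(pte, ptl);
              return true;
      }

      /* outer level (zap_pmd_range()-style): a failure means "look again" */
      static void zap_pmd_range_sketch(struct mm_struct *mm, pmd_t *pmd,
                                       unsigned long addr, unsigned long end)
      {
              unsigned long next;

              do {
                      next = pmd_addr_end(addr, end);
      again:
                      if (pmd_none(pmdp_get_lockless(pmd)))
                              continue;
                      if (!zap_pte_range_sketch(mm, pmd, addr, next))
                              goto again;     /* pmd mutated under us: retry it */
              } while (pmd++, addr = next, addr != end);
      }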

    Link: https://lkml.kernel.org/r/bb548d50-e99a-f29e-eab1-a43bef2a1287@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:20 -04:00
Chris von Recklinghausen 176bb35f89 mm: use pmdp_get_lockless() without surplus barrier()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 26e1a0c3277d7f43856ec424902423be212cc178
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:06:53 2023 -0700

    mm: use pmdp_get_lockless() without surplus barrier()

    Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.

    What is it all about?  Some mmap_lock avoidance i.e.  latency reduction.
    Initially just for the case of collapsing shmem or file pages to THPs; but
    likely to be relied upon later in other contexts e.g.  freeing of empty
    page tables (but that's not work I'm doing).  mmap_write_lock avoidance
    when collapsing to anon THPs?  Perhaps, but again that's not work I've
    done: a quick attempt was not as easy as the shmem/file case.

    I would much prefer not to have to make these small but wide-ranging
    changes for such a niche case; but failed to find another way, and have
    heard that shmem MADV_COLLAPSE's usefulness is being limited by that
    mmap_write_lock it currently requires.

    These changes (though of course not these exact patches) have been in
    Google's data centre kernel for three years now: we do rely upon them.

    What is this preparatory series about?

    The current mmap locking will not be enough to guard against that tricky
    transition between pmd entry pointing to page table, and empty pmd entry,
    and pmd entry pointing to huge page: pte_offset_map() will have to
    validate the pmd entry for itself, returning NULL if no page table is
    there.  What to do about that varies: sometimes nearby error handling
    indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
    again" is appropriate (and if that risks an infinite loop, then there must
    have been an oops, or pfn 0 mistaken for page table, before).

    Given the likely extension to freeing empty page tables, I have not
    limited this set of changes to a THP config; and it has been easier, and
    sets a better example, if each site is given appropriate handling: even
    where deeper study might prove that failure could only happen if the pmd
    table were corrupted.

    Several of the patches are, or include, cleanup on the way; and by the
    end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
    pte_offset_map_lock() then handle those original races and more.  Most
    uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
    its place.

    This patch (of 32):

    Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
    reliable result with PAE (or READ_ONCE as before without PAE); and remove
    the unnecessary extra barrier()s which got left behind in its callers.

    HOWEVER: Note the small print in linux/pgtable.h, where it was designed
    specifically for fast GUP, and depends on interrupts being disabled for
    its full guarantee: most callers which have been added (here and before)
    do NOT have interrupts disabled, so there is still some need for caution.
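
    The change at a typical call site, as a small before/after sketch:

      /* before: open-coded lockless read plus a leftover compiler barrier */
      pmdval = READ_ONCE(*vmf->pmd);
      barrier();

      /* after: one helper; also gives a coherent read of 2x32-bit pmds on PAE */
      pmdval = pmdp_get_lockless(vmf->pmd);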

    Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:11 -04:00
Chris von Recklinghausen a68e16dd11 mm: fix failure to unmap pte on highmem systems
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3b65f437d9e8dd696a2b88e7afcd51385532ab35
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Fri Jun 2 10:29:49 2023 +0100

    mm: fix failure to unmap pte on highmem systems

    The loser of a race to service a pte for a device private entry in the
    swap path previously unlocked the ptl, but failed to unmap the pte.  This
    only affects highmem systems since unmapping a pte is a noop on
    non-highmem systems.
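
    The shape of the fix, as an illustrative sketch:

      if (!pte_same(*vmf->pte, vmf->orig_pte)) {
              /*
               * Lost the race: drop the kmap of the pte page as well as the
               * lock.  pte_unmap_unlock() does both; a bare
               * spin_unlock(vmf->ptl) leaked the mapping on highmem.
               */
              pte_unmap_unlock(vmf->pte, vmf->ptl);
              return 0;
      }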

    Link: https://lkml.kernel.org/r/20230602092949.545577-5-ryan.roberts@arm.com
    Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:10 -04:00
Chris von Recklinghausen c2aa4ee6d2 mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 2bad466cc9d9b4c3b4b16eb9c03c919b59561316
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Mar 9 17:37:10 2023 -0500

    mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

    Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.

    The new feature bit makes anonymous memory act the same as file memory on
    userfaultfd-wp, in that it'll also wr-protect none ptes.

    It can be useful in two cases:

    (1) A uffd-wp app that needs to wr-protect none ptes, like QEMU snapshots,
        can enable this flag instead of pre-faulting, which speeds up
        protection

    (2) It helps to implement the async uffd-wp mode that Muhammad is working
        on [1]

    It's debatable whether this is the most ideal solution, because with the
    new feature bit set, wr-protecting none ptes needs to pre-populate the
    pgtables down to the last level (PAGE_SIZE).  But it seems fine so far to
    serve either purpose above, so we can leave optimizations for later.

    The series brings pte markers to anonymous memory too.  There are some
    changes to the common mm code paths in the 1st patch; it would be great
    to have some eyes on them, but hopefully they're still relatively
    straightforward.

    This patch (of 2):

    This is a new feature that controls how uffd-wp handles none ptes.  When
    it's set, the kernel will handle anonymous memory the same way as file
    memory, by allowing the user to wr-protect unpopulated ptes.

    File memory handles none ptes consistently by allowing them to be
    wr-protected, because it cannot know whether a page cache page exists for
    them or not.  For anonymous memory it was not as consistent, because we
    used to assume that we don't need protection on none ptes or known zero
    pages.

    One use case of such a feature bit is VM live snapshotting: without
    wr-protecting empty ptes the snapshot can contain random rubbish in the
    holes of the anonymous memory, which can make the guest misbehave when
    the guest OS assumes the pages should be all zeros.

    QEMU worked around it by pre-populating the section with reads to fill in
    zero page entries before starting the whole snapshot process [1].

    Recently another need was raised: using userfaultfd wr-protect for
    detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
    case, without being able to wr-protect none ptes by default, the dirty
    info can get lost, since we cannot treat every none pte as dirty (the
    current design identifies a dirty page based on the uffd-wp bit being
    cleared).

    In general, we want to be able to wr-protect empty ptes too even for
    anonymous.

    This patch implements UFFD_FEATURE_WP_UNPOPULATED so that uffd-wp handling
    of none ptes is consistent no matter what memory type is underneath.  It
    doesn't have any impact on file memory so far because we already have pte
    markers taking care of that, so it only affects anonymous memory.

    The feature bit is off by default, so the old behavior is maintained.
    Sometimes that may be wanted, because wr-protecting none ptes adds
    overhead, not only during UFFDIO_WRITEPROTECT (applying pte markers to
    anonymous memory) but also when creating the pgtables to store the pte
    markers.  So there's potentially less chance of using THP on the first
    fault for a none pmd or a range larger than a pmd.

    The major implementation part is teaching the whole kernel to understand
    pte markers even for anonymously mapped ranges, while allowing the
    UFFDIO_WRITEPROTECT ioctl to apply pte markers to anonymous memory too
    when the new feature bit is set.

    Note that even though the patch subject starts with mm/uffd, there are a
    few small refactors to the major mm paths for handling anonymous page
    faults.  But they should be straightforward.

    With WP_UNPOPULATED, applications like QEMU can avoid pre-read-faulting
    all the memory before wr-protecting it when taking a live snapshot.
    Quoting from Muhammad's test result here [3] based on a simple program [4]:

      (1) With huge page disabled
      echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
      ./uffd_wp_perf
      Test DEFAULT: 4
      Test PRE-READ: 1111453 (pre-fault 1101011)
      Test MADVISE: 278276 (pre-fault 266378)
      Test WP-UNPOPULATE: 11712

      (2) With Huge page enabled
      echo always > /sys/kernel/mm/transparent_hugepage/enabled
      ./uffd_wp_perf
      Test DEFAULT: 4
      Test PRE-READ: 22521 (pre-fault 22348)
      Test MADVISE: 4909 (pre-fault 4743)
      Test WP-UNPOPULATE: 14448

    There'll be a great perf boost for the no-THP case.  With THP enabled, in
    the extreme all-THP-zero case, WP_UNPOPULATED can be slower than MADVISE,
    but that's of low probability in reality; also, the overhead is not
    reduced but postponed until a follow-up write on any huge zero THP, so it
    is potentially faster overall at the cost of making the follow-up writes
    slower.

    [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
    [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
    [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
    [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c
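
    For reference, a minimal userspace sketch of how the feature bit is
    intended to be used (assumes uapi headers new enough to define
    UFFD_FEATURE_WP_UNPOPULATED and permission to create a userfaultfd;
    error handling trimmed):

      #include <fcntl.h>
      #include <linux/userfaultfd.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main(void)
      {
              long page = sysconf(_SC_PAGESIZE);
              size_t len = 1024 * (size_t)page;
              int uffd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

              struct uffdio_api api = { .api = UFFD_API,
                                        .features = UFFD_FEATURE_WP_UNPOPULATED };
              ioctl(uffd, UFFDIO_API, &api);

              char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

              struct uffdio_register reg = {
                      .range = { .start = (unsigned long)mem, .len = len },
                      .mode = UFFDIO_REGISTER_MODE_WP,
              };
              ioctl(uffd, UFFDIO_REGISTER, &reg);

              /* with WP_UNPOPULATED this also wr-protects the holes: no pre-read pass */
              struct uffdio_writeprotect wp = {
                      .range = { .start = (unsigned long)mem, .len = len },
                      .mode = UFFDIO_WRITEPROTECT_MODE_WP,
              };
              ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

              printf("wr-protected %zu bytes without touching them first\n", len);
              return 0;
      }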

    [peterx@redhat.com: comment changes, oneliner fix to khugepaged]
      Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
    Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Paul Gofman <pgofman@codeweavers.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:02 -04:00
Chris von Recklinghausen 9e9103fead mm: convert wp_page_copy() to use folios
Conflicts: mm/memory.c - We don't have
	7d4a8be0c4b2 ("mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export")
	so call mmu_notifier_range_init with both mm and vma (context)

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 28d41a4863316321bb5aa616bd82d65c84fc0f8b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:11 2023 +0000

    mm: convert wp_page_copy() to use folios

    Use new_folio instead of new_page throughout, because we allocated it
    and know it's an order-0 folio.  Most old_page uses become old_folio,
    but use vmf->page where we need the precise page.

    Link: https://lkml.kernel.org/r/20230116191813.2145215-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:54 -04:00
Chris von Recklinghausen ad7b024ba7 mm: add vma_alloc_zeroed_movable_folio()
Conflicts: drop changes to arch/alpha/include/asm/page.h
	arch/ia64/include/asm/page.h arch/m68k/include/asm/page_no.h -
		unsupported arches

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6bc56a4d855303705802c5ede4625973637484c7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:09 2023 +0000

    mm: add vma_alloc_zeroed_movable_folio()

    Replace alloc_zeroed_user_highpage_movable().  The main difference is
    returning a folio containing a single page instead of returning the page,
    but take the opportunity to rename the function to match other allocation
    functions a little better and rewrite the documentation to place more
    emphasis on the zeroing rather than the highmem aspect.
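
    The generic version is roughly the following (a sketch; architectures may
    provide their own variant):

      static inline struct folio *
      vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, unsigned long vaddr)
      {
              struct folio *folio;

              folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
              if (folio)
                      clear_user_highpage(&folio->page, vaddr);

              return folio;
      }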

    Link: https://lkml.kernel.org/r/20230116191813.2145215-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:53 -04:00
Chris von Recklinghausen c28dba63db mm/memory: add vm_normal_folio()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 318e9342fbbb6888d903d86e83865609901a1c65
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Wed Dec 21 10:08:45 2022 -0800

    mm/memory: add vm_normal_folio()

    Patch series "Convert deactivate_page() to folio_deactivate()", v4.

    Deactivate_page() has already been converted to use folios.  This patch
    series modifies the callers of deactivate_page() to use folios.  It also
    introduces vm_normal_folio() to assist with folio conversions, and
    converts deactivate_page() to folio_deactivate() which takes in a folio.

    This patch (of 4):

    Introduce a wrapper function called vm_normal_folio().  This function
    calls vm_normal_page() and returns the folio of the page found, or null if
    no page is found.

    This function allows callers to get a folio from a pte, which will
    eventually allow them to completely replace their struct page variables
    with struct folio instead.
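
    The wrapper itself is tiny; roughly (sketch):

      struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
                                    pte_t pte)
      {
              struct page *page = vm_normal_page(vma, addr, pte);

              return page ? page_folio(page) : NULL;
      }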

    Link: https://lkml.kernel.org/r/20221221180848.20774-1-vishal.moola@gmail.com
    Link: https://lkml.kernel.org/r/20221221180848.20774-2-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:47 -04:00
Chris von Recklinghausen 653ae76632 mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
Conflicts: mm/userfaultfd.c - RHEL-only patch
	8e95bedaa1a ("mm: Fix CVE-2022-2590 by reverting "mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"")
	causes a merge conflict with this patch. Since upstream commit
	5535be309971 ("mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW")
	actually fixes the CVE we can safely remove the conflicted lines
	and replace them with the lines the upstream version of this
	patch adds

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1eb1bacfba9019823b2fce42383f010cd561fa6
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Dec 14 15:15:33 2022 -0500

    mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()

    This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.

    The reasons I still think this patch is worthwhile are:

      (1) It is a cleanup already; the diffstat tells.

      (2) It just feels natural after I thought about this: if the pte is uffd
          protected, let's remove the write bit no matter what it was.

      (3) Since x86 is the only arch that supports uffd-wp, it also redefines
          pte|pmd_mkuffd_wp() so that they always remove the write bit.  It
          means any future arch that wants to implement uffd-wp should
          naturally follow this rule too.  It's good to make it the default,
          even if with vm_page_prot changes on VM_UFFD_WP.

      (4) It covers more than vm_page_prot.  So there is no chance of any
          potential future "accident" (like pte_mkdirty() on sparc64 or
          loongarch, even though it just got its pte_mkdirty fixed <1 month
          ago).  It'll also be fairly clear when reading the code that we
          don't worry about uncertainty of the write bit before a
          pte_mkuffd_wp().

    We may call pte_wrprotect() one more time in some paths (e.g.  thp split),
    but that should be a fully local bitop instruction, so the overhead should
    be negligible.
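
    On x86 the rule then boils down to something like this (sketch):

      static inline pte_t pte_mkuffd_wp(pte_t pte)
      {
              /* a uffd-wp pte is, by definition, also write-protected */
              return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
      }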

    Although this patch should logically also fix the recently known uffd-wp
    issues on page migration (not the numa hint recovery case - that may need
    another explicit pte_wrprotect), this is not the plan for that fix.  So no
    Fixes tag, and stable doesn't need this.

    Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ives van Hoorne <ives@codesandbox.io>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
Chris von Recklinghausen 4808276894 mm, hwpoison: when copy-on-write hits poison, take page offline
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d302c2398ba269e788a4f37ae57c07a7fcabaa42
Author: Tony Luck <tony.luck@intel.com>
Date:   Fri Oct 21 13:01:20 2022 -0700

    mm, hwpoison: when copy-on-write hits poison, take page offline

    Cannot call memory_failure() directly from the fault handler because
    mmap_lock (and others) are held.

    It is important, but not urgent, to mark the source page as h/w poisoned
    and unmap it from other tasks.

    Use memory_failure_queue() to request a call to memory_failure() for the
    page with the error.

    Also provide a stub version for CONFIG_MEMORY_FAILURE=n
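
    The shape of the change, as a simplified sketch:

      #ifdef CONFIG_MEMORY_FAILURE
      void memory_failure_queue(unsigned long pfn, int flags);
      #else
      static inline void memory_failure_queue(unsigned long pfn, int flags) { }
      #endif

      /* COW path: the copy saw poison, so queue the source page for offlining */
      static void cow_report_poison(struct page *src)
      {
              /* memory_failure() itself can't run here: mmap_lock etc. are held */
              memory_failure_queue(page_to_pfn(src), 0);
      }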

    Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Shuai Xue <xueshuai@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:22 -04:00
Chris von Recklinghausen 360555fbb4 mm, hwpoison: try to recover from copy-on write faults
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a873dfe1032a132bf89f9e19a6ac44f5a0b78754
Author: Tony Luck <tony.luck@intel.com>
Date:   Fri Oct 21 13:01:19 2022 -0700

    mm, hwpoison: try to recover from copy-on write faults

    Patch series "Copy-on-write poison recovery", v3.

    Part 1 deals with the process that triggered the copy-on-write fault with
    a store to a shared read-only page.  That process is sent a SIGBUS with
    the usual machine check decoration to specify the virtual address of the
    lost page, together with the scope.

    Part 2 sets up to asynchronously take the page with the uncorrected error
    offline to prevent additional machine check faults.  H/t to Miaohe Lin
    <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com> for
    pointing me to the existing function to queue a call to memory_failure().

    On x86 there is some duplicate reporting (because the error is also
    signalled by the memory controller as well as by the core that triggered
    the machine check).  Console logs look like this:

    This patch (of 2):

    If the kernel is copying a page as the result of a copy-on-write
    fault and runs into an uncorrectable error, Linux will crash because
    it does not have recovery code for this case where poison is consumed
    by the kernel.

    It is easy to set up a test case. Just inject an error into a private
    page, fork(2), and have the child process write to the page.

    I wrapped that neatly into a test at:

      git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

    just enable ACPI error injection and run:

      # ./einj_mem-uc -f copy-on-write

    Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
    on architectures where that is available (currently x86 and powerpc).
    When an error is detected during the page copy, return VM_FAULT_HWPOISON
    to caller of wp_page_copy(). This propagates up the call stack. Both x86
    and powerpc have code in their fault handler to deal with this code by
    sending a SIGBUS to the application.
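
    A sketch of the helper described above (illustrative; the exact return
    convention is an assumption), using copy_mc_to_kernel(), which returns
    the number of bytes left uncopied:

      static inline int copy_user_highpage_mc(struct page *to, struct page *from,
                                              unsigned long vaddr,
                                              struct vm_area_struct *vma)
      {
              /* vaddr/vma mirror copy_user_highpage(); unused in this sketch */
              unsigned long left;
              char *vfrom = kmap_local_page(from);
              char *vto = kmap_local_page(to);

              left = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

              kunmap_local(vto);
              kunmap_local(vfrom);

              /* non-zero means poison was consumed; caller maps this to VM_FAULT_HWPOISON */
              return left ? -EHWPOISON : 0;
      }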

    Note that this patch avoids a system crash and signals the process that
    triggered the copy-on-write action. It does not take any action for the
    memory error that is still in the shared page. To handle that a call to
    memory_failure() is needed. But this cannot be done from wp_page_copy()
    because it holds mmap_lock(). Perhaps the architecture fault handlers
    can deal with this loose end in a subsequent patch?

    On Intel/x86 this loose end will often be handled automatically because
    the memory controller provides an additional notification of the h/w
    poison in memory, the handler for this will call memory_failure(). This
    isn't a 100% solution. If there are multiple errors, not all may be
    logged in this way.

    [tony.luck@intel.com: add call to kmsan_unpoison_memory(), per Miaohe Lin]
      Link: https://lkml.kernel.org/r/20221031201029.102123-2-tony.luck@intel.com
    Link: https://lkml.kernel.org/r/20221021200120.175753-1-tony.luck@intel.com
    Link: https://lkml.kernel.org/r/20221021200120.175753-2-tony.luck@intel.com
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:22 -04:00
Chris von Recklinghausen 9cec47342a mm: convert mm's rss stats into percpu_counter
Conflicts:
	include/linux/sched.h - We don't have
		7964cf8caa4d ("mm: remove vmacache")
		so don't remove the declaration for vmacache
	kernel/fork.c - We don't have
		d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
		so don't add calls to mt_init_flags or mt_set_external_lock

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25
Author: Shakeel Butt <shakeelb@google.com>
Date:   Mon Oct 24 05:28:41 2022 +0000

    mm: convert mm's rss stats into percpu_counter

    Currently mm_struct maintains rss_stats which are updated on page fault
    and the unmapping codepaths.  For page fault codepath the updates are
    cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
    The reason for caching is performance for multithreaded applications
    otherwise the rss_stats updates may become hotspot for such applications.

    However this optimization comes at the cost of an error margin in the rss
    stats.  The rss_stats for applications with a large number of threads can
    be very skewed.  At worst the error margin is (nr_threads * 64) and we
    have a lot of applications with 100s of threads, so the error margin can
    be very high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.

    Recently we started seeing unbounded errors in rss_stats for specific
    applications which use TCP rx0cp.  It seems like the vm_insert_pages()
    codepath does not sync rss_stats at all.

    This patch converts the rss_stats into percpu_counter, which changes the
    error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
    However, this conversion enables us to get accurate stats in situations
    where accuracy is more important than the cpu cost.

    This patch does not make such tradeoffs - we can just use
    percpu_counter_add_local() for the updates and percpu_counter_sum() (or
    percpu_counter_sync() + percpu_counter_read()) for the readers.  At the
    moment the readers are the procfs interface, the oom killer and memory
    reclaim, which I think are not performance critical and should be ok with
    a slow read.  However I think we can make that change in a separate patch.
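
    A condensed sketch of the new scheme (illustrative, not the exact diff):

      /* in struct mm_struct: one counter per rss member (file, anon, swap, shmem) */
      struct percpu_counter rss_stat[NR_MM_COUNTERS];

      /* hot path (fault/unmap): cheap, cpu-local update */
      static void add_mm_counter_sketch(struct mm_struct *mm, int member, long value)
      {
              percpu_counter_add_local(&mm->rss_stat[member], value);
      }

      /* slow path (procfs, oom killer, reclaim): accurate but more expensive read */
      static unsigned long get_mm_counter_sketch(struct mm_struct *mm, int member)
      {
              return percpu_counter_sum_positive(&mm->rss_stat[member]);
      }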

    Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:21 -04:00
Chris von Recklinghausen ac4694cf43 Revert "mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in"
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b12fdbf15f92b6cf5fecdd8a1855afe8809e5c58
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Oct 24 15:33:36 2022 -0400

    Revert "mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in"

    With "mm/uffd: Fix vma check on userfault for wp" fixing the
    registration, it is now safe to remove the macro hacks.

    Link: https://lkml.kernel.org/r/20221024193336.1233616-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:18 -04:00
Chris von Recklinghausen 2e4f279847 hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04ada095dcfc4ae359418053c0be94453bdf1e84
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:06 2022 -0800

    hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

    madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
    tables associated with the address range.  For hugetlb vmas,
    zap_page_range will call __unmap_hugepage_range_final.  However,
    __unmap_hugepage_range_final assumes the passed vma is about to be removed
    and deletes the vma_lock to prevent pmd sharing as the vma is on the way
    out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
    missing vma_lock prevents pmd sharing and could potentially lead to issues
    with truncation/fault races.

    This issue was originally reported here [1] as a BUG triggered in
    page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
    vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
    prevent pmd sharing.  Subsequent faults on this vma were confused:
    VM_MAYSHARE indicates a sharable vma, but it was no longer set, so
    page_mapping was not set in new pages added to the page table.  This
    resulted in pages that appeared anonymous in a VM_SHARED vma and triggered
    the BUG.

    Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
    call from unmap_vmas().  This is used to indicate the 'final' unmapping of
    a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
    the vm_lock is not deleted.

    [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
    Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:13 -04:00
Chris von Recklinghausen 68b5e6cc07 madvise: use zap_page_range_single for madvise dontneed
Conflicts: include/linux/mm.h - We don't have
	763ecb035029 ("mm: remove the vma linked list")
	so keep the old definition of unmap_vmas

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 21b85b09527c28e242db55c1b751f7f7549b830c
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:05 2022 -0800

    madvise: use zap_page_range_single for madvise dontneed

    This series addresses the issue first reported in [1], and fully described
    in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
    for stable backports.

    While exploring solutions to this issue, related problems with mmu
    notification calls were discovered.  This is addressed in the patch
    "hugetlb: remove duplicate mmu notifications:".  Since there are no user
    visible effects, this third is not tagged for stable backports.

    Previous discussions suggested further cleanup by removing the
    routine zap_page_range.  This is possible because zap_page_range_single
    is now exported, and all callers of zap_page_range pass ranges entirely
    within a single vma.  This work will be done in a later patch so as not
    to distract from this bug fix.

    [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

    This patch (of 2):

    Expose the routine zap_page_range_single to zap a range within a single
    vma.  The madvise routine madvise_dontneed_single_vma can use this routine
    as it explicitly operates on a single vma.  Also, update the mmu
    notification range in zap_page_range_single to take hugetlb pmd sharing
    into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
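
    A sketch of the madvise side after the change (simplified):

      static long madvise_dontneed_sketch(struct vm_area_struct *vma,
                                          unsigned long start, unsigned long end)
      {
              /* operates on exactly one vma, so the single-vma zap is enough */
              zap_page_range_single(vma, start, end - start, NULL);
              return 0;
      }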

    Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
    Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:12 -04:00
Chris von Recklinghausen 02174dae48 hugetlb: fix vma lock handling during split vma and range unmapping
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 131a79b474e973f023c5c75e2323a940332103be
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Oct 4 18:17:05 2022 -0700

    hugetlb: fix vma lock handling during split vma and range unmapping

    Patch series "hugetlb: fixes for new vma lock series".

    In review of the series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", Miaohe Lin pointed out two key issues:

    1) There is a race in the routine hugetlb_unmap_file_folio when locks
       are dropped and reacquired in the correct order [1].

    2) With the switch to using vma lock for fault/truncate synchronization,
       we need to make sure lock exists for all VM_MAYSHARE vmas, not just
       vmas capable of pmd sharing.

    These two issues are addressed here.  In addition, having a vma lock
    present in all VM_MAYSHARE vmas, uncovered some issues around vma
    splitting.  Those are also addressed.

    [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/

    This patch (of 3):

    The hugetlb vma lock hangs off the vm_private_data field and is specific
    to the vma.  When vm_area_dup() is called as part of vma splitting, the
    vma lock pointer is copied to the new vma.  This will result in issues
    such as double freeing of the structure.  Update the hugetlb open vm_ops
    to allocate a new vma lock for the new vma.

    The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE
    to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
    anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
    only VM_MAYSHARE was set we would miss the free.  With the introduction of
    the vma lock, a vma can not participate in pmd sharing if vm_private_data
    is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
    free the vma lock to prevent sharing.  Also, update the sharing code to
    make sure vma lock is indeed a condition for pmd sharing.
    hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.

    Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
    Fixes: "hugetlb: add vma based lock for pmd sharing"
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:55 -04:00
Chris von Recklinghausen 271a98f55e mm: kmsan: maintain KMSAN metadata for page operations
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b073d7f8aee4ebf05d10e3380df377b73120cf16
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:48 2022 +0200

    mm: kmsan: maintain KMSAN metadata for page operations

    Insert KMSAN hooks that make the necessary bookkeeping changes:
     - poison page shadow and origins in alloc_pages()/free_page();
     - clear page shadow and origins in clear_page(), copy_user_highpage();
     - copy page metadata in copy_highpage(), wp_page_copy();
     - handle vmap()/vunmap()/iounmap();

    Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:35 -04:00
Chris von Recklinghausen c9afac7c40 mm: remove try_to_free_swap()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3b344157c0c15b8f9588e3021dfb22ee25f4508a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:44 2022 +0100

    mm: remove try_to_free_swap()

    All callers have now been converted to folio_free_swap() and we can remove
    this wrapper.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-49-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:03 -04:00
Chris von Recklinghausen a38c145ddd memcg: convert mem_cgroup_swap_full() to take a folio
Conflicts: mm/memcontrol.c - We already have
	b25806dcd3d5 ("mm: memcontrol: deprecate swapaccounting=0 mode")
	and
	b6c1a8af5b1ee ("mm: memcontrol: add new kernel parameter cgroup.memory=nobpf")
	so keep existing check in mem_cgroup_get_nr_swap_pages
	(surrounding context)

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9202d527b715f67bcdccbb9b712b65fe053f8109
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:43 2022 +0100

    memcg: convert mem_cgroup_swap_full() to take a folio

    All callers now have a folio, so convert the function to take a folio.
    Saves a couple of calls to compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-48-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:03 -04:00
Chris von Recklinghausen e6207e54af mm: convert do_swap_page() to use folio_free_swap()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a160e5377b55bc5c1925a7456b656aabfc07261f
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:42 2022 +0100

    mm: convert do_swap_page() to use folio_free_swap()

    Also convert should_try_to_free_swap() to use a folio.  This removes a few
    calls to compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-47-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:03 -04:00
Chris von Recklinghausen b4e641842b mm: convert do_wp_page() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e4a2ed94908cc0104b8826ed8d831661ed1c3ea1
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:37 2022 +0100

    mm: convert do_wp_page() to use a folio

    Saves many calls to compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-42-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:01 -04:00
Chris von Recklinghausen 8533996e3a mm: convert do_swap_page() to use swap_cache_get_folio()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5a423081b2465d38baf2fcbbc19f77d211507061
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:33 2022 +0100

    mm: convert do_swap_page() to use swap_cache_get_folio()

    Saves a folio->page->folio conversion.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-38-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:00 -04:00
Chris von Recklinghausen ed065cc3f1 memcg: convert mem_cgroup_swapin_charge_page() to mem_cgroup_swapin_charge_folio()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6599591816f522c1cc8ec4eb5cea75738963756a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:12 2022 +0100

    memcg: convert mem_cgroup_swapin_charge_page() to mem_cgroup_swapin_charge_folio()

    All callers now have a folio, so pass it in here and remove an unnecessary
    call to page_folio().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-17-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:53 -04:00
Chris von Recklinghausen 56265cc436 mm: convert do_swap_page()'s swapcache variable to a folio
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d4f9565ae598bd6b6ffbd8b4dfbf97a9e339da2d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:11 2022 +0100

    mm: convert do_swap_page()'s swapcache variable to a folio

    The 'swapcache' variable is used to track whether the page is from the
    swapcache or not.  It can do this equally well by being the folio of the
    page rather than the page itself, and this saves a number of calls to
    compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-16-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:53 -04:00
Chris von Recklinghausen 2af7596eac mm: multi-gen LRU: groundwork
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit ec1c86b25f4bdd9dce6436c0539d2a6ae676e1c4
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:02 2022 -0600

    mm: multi-gen LRU: groundwork

    Evictable pages are divided into multiple generations for each lruvec.
    The youngest generation number is stored in lrugen->max_seq for both
    anon and file types as they are aged on an equal footing. The oldest
    generation numbers are stored in lrugen->min_seq[] separately for anon
    and file types as clean file pages can be evicted regardless of swap
    constraints. These three variables are monotonically increasing.

    Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
    in order to fit into the gen counter in folio->flags. Each truncated
    generation number is an index to lrugen->lists[]. The sliding window
    technique is used to track at least MIN_NR_GENS and at most
    MAX_NR_GENS generations. The gen counter stores a value within [1,
    MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
    stores 0.
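
    For orientation, the seq-to-index mapping described above is just a
    truncation (sketch using the constants named in this message):

      #define MIN_NR_GENS 2U
      #define MAX_NR_GENS 4U

      /* fold a monotonically increasing seq into an index into lrugen->lists[] */
      static inline int lru_gen_from_seq(unsigned long seq)
      {
              return seq % MAX_NR_GENS;
      }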

    There are two conceptually independent procedures: "the aging", which
    produces young generations, and "the eviction", which consumes old
    generations.  They form a closed-loop system, i.e., "the page reclaim".
    Both procedures can be invoked from userspace for the purposes of working
    set estimation and proactive reclaim.  These techniques are commonly used
    to optimize job scheduling (bin packing) in data centers [1][2].

    To avoid confusion, the terms "hot" and "cold" will be applied to the
    multi-gen LRU, as a new convention; the terms "active" and "inactive" will
    be applied to the active/inactive LRU, as usual.

    The protection of hot pages and the selection of cold pages are based
    on page access channels and patterns. There are two access channels:
    one through page tables and the other through file descriptors. The
    protection of the former channel is by design stronger because:
    1. The uncertainty in determining the access patterns of the former
       channel is higher due to the approximation of the accessed bit.
    2. The cost of evicting the former channel is higher due to the TLB
       flushes required and the likelihood of encountering the dirty bit.
    3. The penalty of underprotecting the former channel is higher because
       applications usually do not prepare themselves for major page
       faults like they do for blocked I/O. E.g., GUI applications
       commonly use dedicated I/O threads to avoid blocking rendering
       threads.

    There are also two access patterns: one with temporal locality and the
    other without.  For the reasons listed above, the former channel is
    assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
    present; the latter channel is assumed to follow the latter pattern unless
    outlying refaults have been observed [3][4].

    The next patch will address the "outlying refaults".  Three macros, i.e.,
    LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
    this patch to make the entire patchset less diffy.

    A page is added to the youngest generation on faulting.  The aging needs
    to check the accessed bit at least twice before handing this page over to
    the eviction.  The first check takes care of the accessed bit set on the
    initial fault; the second check makes sure this page has not been used
    since then.  This protocol, AKA second chance, requires a minimum of two
    generations, hence MIN_NR_GENS.

    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731
    [3] https://lwn.net/Articles/495543/
    [4] https://lwn.net/Articles/815342/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:45 -04:00
Chris von Recklinghausen b38a5e75b7 mm: x86, arm64: add arch_has_hw_pte_young()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e1fd09e3d1dd4a1a8b3b33bc1fd647eee9f4e475
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 01:59:58 2022 -0600

    mm: x86, arm64: add arch_has_hw_pte_young()

    Patch series "Multi-Gen LRU Framework", v14.

    What's new
    ==========
    1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
       Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
    2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
       machines. The old direct reclaim backoff, which tries to enforce a
       minimum fairness among all eligible memcgs, over-swapped by about
       (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
       pulls the plug on swapping once the target is met, trades some
       fairness for curtailed latency:
       https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
    3. Fixed minor build warnings and conflicts. More comments and nits.

    TLDR
    ====
    The current page reclaim is too expensive in terms of CPU usage and it
    often makes poor choices about what to evict. This patchset offers an
    alternative solution that is performant, versatile and
    straightforward.

    Patchset overview
    =================
    The design and implementation overview is in patch 14:
    https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/

    01. mm: x86, arm64: add arch_has_hw_pte_young()
    02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
    Take advantage of hardware features when trying to clear the accessed
    bit in many PTEs.

    03. mm/vmscan.c: refactor shrink_node()
    04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
        its sole caller"
    Minor refactors to improve readability for the following patches.

    05. mm: multi-gen LRU: groundwork
    Adds the basic data structure and the functions that insert pages to
    and remove pages from the multi-gen LRU (MGLRU) lists.

    06. mm: multi-gen LRU: minimal implementation
    A minimal implementation without optimizations.

    07. mm: multi-gen LRU: exploit locality in rmap
    Exploits spatial locality to improve efficiency when using the rmap.

    08. mm: multi-gen LRU: support page table walks
    Further exploits spatial locality by optionally scanning page tables.

    09. mm: multi-gen LRU: optimize multiple memcgs
    Optimizes the overall performance for multiple memcgs running mixed
    types of workloads.

    10. mm: multi-gen LRU: kill switch
    Adds a kill switch to enable or disable MGLRU at runtime.

    11. mm: multi-gen LRU: thrashing prevention
    12. mm: multi-gen LRU: debugfs interface
    Provide userspace with features like thrashing prevention, working set
    estimation and proactive reclaim.

    13. mm: multi-gen LRU: admin guide
    14. mm: multi-gen LRU: design doc
    Add an admin guide and a design doc.

    Benchmark results
    =================
    Independent lab results
    -----------------------
    Based on the popularity of searches [01] and the memory usage in
    Google's public cloud, the most popular open-source memory-hungry
    applications, in alphabetical order, are:
          Apache Cassandra      Memcached
          Apache Hadoop         MongoDB
          Apache Spark          PostgreSQL
          MariaDB (MySQL)       Redis

    An independent lab evaluated MGLRU with the most widely used benchmark
    suites for the above applications. They posted 960 data points along
    with kernel metrics and perf profiles collected over more than 500
    hours of total benchmark time. Their final reports show that, with 95%
    confidence intervals (CIs), the above applications all performed
    significantly better for at least part of their benchmark matrices.

    On 5.14:
    1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
       less wall time to sort three billion random integers, respectively,
       under the medium- and the high-concurrency conditions, when
       overcommitting memory. There were no statistically significant
       changes in wall time for the rest of the benchmark matrix.
    2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
       more transactions per minute (TPM), respectively, under the medium-
       and the high-concurrency conditions, when overcommitting memory.
       There were no statistically significant changes in TPM for the rest
       of the benchmark matrix.
    3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
       and [21.59, 30.02]% more operations per second (OPS), respectively,
       for sequential access, random access and Gaussian (distribution)
       access, when THP=always; 95% CIs [13.85, 15.97]% and
       [23.94, 29.92]% more OPS, respectively, for random access and
       Gaussian access, when THP=never. There were no statistically
       significant changes in OPS for the rest of the benchmark matrix.
    4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
       [2.16, 3.55]% more operations per second (OPS), respectively, for
       exponential (distribution) access, random access and Zipfian
       (distribution) access, when underutilizing memory; 95% CIs
       [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
       respectively, for exponential access, random access and Zipfian
       access, when overcommitting memory.

    On 5.15:
    5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
       and [4.11, 7.50]% more operations per second (OPS), respectively,
       for exponential (distribution) access, random access and Zipfian
       (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
       [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
       exponential access, random access and Zipfian access, when swap was
       on.
    6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
       less average wall time to finish twelve parallel TeraSort jobs,
       respectively, under the medium- and the high-concurrency
       conditions, when swap was on. There were no statistically
       significant changes in average wall time for the rest of the
       benchmark matrix.
    7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
       minute (TPM) under the high-concurrency condition, when swap was
       off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
       respectively, under the medium- and the high-concurrency
       conditions, when swap was on. There were no statistically
       significant changes in TPM for the rest of the benchmark matrix.
    8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
       [11.47, 19.36]% more total operations per second (OPS),
       respectively, for sequential access, random access and Gaussian
       (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
       [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
       for sequential access, random access and Gaussian access, when
       THP=never.

    Our lab results
    ---------------
    To supplement the above results, we ran the following benchmark suites
    on 5.16-rc7 and found no regressions [10].
          fs_fio_bench_hdd_mq      pft
          fs_lmbench               pgsql-hammerdb
          fs_parallelio            redis
          fs_postmark              stream
          hackbench                sysbenchthread
          kernbench                tpcc_spark
          memcached                unixbench
          multichase               vm-scalability
          mutilate                 will-it-scale
          nginx

    [01] https://trends.google.com
    [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
    [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
    [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
    [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
    [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
    [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
    [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
    [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
    [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

    Real-world applications
    =======================
    Third-party testimonials
    ------------------------
    Konstantin reported [11]:
       I have Archlinux with 8G RAM + zswap + swap. While developing, I
       have lots of apps opened such as multiple LSP-servers for different
       langs, chats, two browsers, etc... Usually, my system gets quickly
       to a point of SWAP-storms, where I have to kill LSP-servers,
       restart browsers to free memory, etc, otherwise the system lags
       heavily and is barely usable.

       1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
       patchset, and I started up by opening lots of apps to create memory
       pressure, and worked for a day like this. Till now I had not a
       single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
       getting to the point of 3G in SWAP before without a single
       SWAP-storm.

    Vaibhav from IBM reported [12]:
       In a synthetic MongoDB Benchmark, seeing an average of ~19%
       throughput improvement on POWER10(Radix MMU + 64K Page Size) with
       MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
       three different request distributions, namely, Exponential, Uniform
       and Zipfian.

    Shuang from U of Rochester reported [13]:
       With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
       and [9.26, 10.36]% higher throughput, respectively, for random
       access, Zipfian (distribution) access and Gaussian (distribution)
       access, when the average number of jobs per CPU is 1; 95% CIs
       [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
       throughput, respectively, for random access, Zipfian access and
       Gaussian access, when the average number of jobs per CPU is 2.

    Daniel from Michigan Tech reported [14]:
       With Memcached allocating ~100GB of byte-addressable Optane,
       performance improvement in terms of throughput (measured as queries
       per second) was about 10% for a series of workloads.

    Large-scale deployments
    -----------------------
    We've rolled out MGLRU to tens of millions of ChromeOS users and
    about a million Android users. Google's fleetwide profiling [15] shows
    an overall 40% decrease in kswapd CPU usage, in addition to
    improvements in other UX metrics, e.g., an 85% decrease in the number
    of low-memory kills at the 75th percentile and an 18% decrease in
    app launch time at the 50th percentile.

    The downstream kernels that have been using MGLRU include:
    1. Android [16]
    2. Arch Linux Zen [17]
    3. Armbian [18]
    4. ChromeOS [19]
    5. Liquorix [20]
    6. OpenWrt [21]
    7. post-factum [22]
    8. XanMod [23]

    [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
    [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
    [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
    [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
    [15] https://dl.acm.org/doi/10.1145/2749469.2750392
    [16] https://android.com
    [17] https://archlinux.org
    [18] https://armbian.com
    [19] https://chromium.org
    [20] https://liquorix.net
    [21] https://openwrt.org
    [22] https://codeberg.org/pf-kernel
    [23] https://xanmod.org

    Summary
    =======
    The facts are:
    1. The independent lab results and the real-world applications
       indicate substantial improvements; there are no known regressions.
    2. Thrashing prevention, working set estimation and proactive reclaim
       work out of the box; there are no equivalent solutions.
    3. There is a lot of new code; no smaller changes have been
       shown to achieve similar effects.

    Our options, accordingly, are:
    1. Given the amount of evidence, the reported improvements will likely
       materialize for a wide range of workloads.
    2. Gauging the interest from the past discussions, the new features
       will likely be put to use for both personal computers and data
       centers.
    3. Based on Google's track record, the new code will likely be well
       maintained in the long term. It'd be more difficult if not
       impossible to achieve similar effects with other approaches.

    This patch (of 14):

    Some architectures automatically set the accessed bit in PTEs, e.g., x86
    and arm64 v8.2.  On architectures that do not have this capability,
    clearing the accessed bit in a PTE usually triggers a page fault following
    the TLB miss of this PTE (to emulate the accessed bit).

    Being aware of this capability can help make better decisions, e.g.,
    whether to spread the work out over a period of time to reduce bursty page
    faults when trying to clear the accessed bit in many PTEs.

    Note that theoretically this capability can be unreliable, e.g.,
    hotplugged CPUs might be different from builtin ones.  Therefore it should
    not be used in architecture-independent code that involves correctness,
    e.g., to determine whether TLB flushes are required (in combination with
    the accessed bit).
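
    The shape of the addition is a conservative generic fallback that
    architectures may override (a simplified sketch of the pattern, not
    the exact upstream hunk):

    #include <stdbool.h>
    #include <stdio.h>

    /* Architectures that set the accessed bit in hardware override this. */
    #ifndef arch_has_hw_pte_young
    static inline bool arch_has_hw_pte_young(void)
    {
            return false;   /* conservative default: assume faults emulate the bit */
    }
    #endif

    int main(void)
    {
            /*
             * The answer is only a hint for heuristics (e.g. batching
             * accessed-bit clearing); it must not be used for TLB-flush
             * correctness.
             */
            printf("hw accessed bit: %s\n",
                   arch_has_hw_pte_young() ? "yes" : "no");
            return 0;
    }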

    Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
    Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Acked-by: Will Deacon <will@kernel.org>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:43 -04:00
Chris von Recklinghausen 94099237ea mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a7f4e6e4c47c41869fe5bea17e013b5557c57ed3
Author: Zach O'Keefe <zokeefe@google.com>
Date:   Wed Jul 6 16:59:25 2022 -0700

    mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()

    MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].

    hugepage_vma_check() is the authority on determining if a VMA is eligible
    for THP allocation/collapse, and currently enforces the sysfs THP
    settings.  Add a flag to disable these checks.  For now, only apply this
    arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled.
    We can expand this to shmem, which uses
    /sys/kernel/transparent_hugepage/shmem_enabled, later.

    Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
    passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
    VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
    THP flag in hugepage_vma_check()", this check also didn't check "never"
    THP mode.  As such, this restores the previous behavior of
    collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
    comment in code for justification why this is OK.

    [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
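
    Schematically, the new flag makes the sysfs policy check an opt-in
    (a simplified standalone model; the real hugepage_vma_check() also
    takes the VMA, smaps and fault-path arguments not shown here):

    #include <stdbool.h>
    #include <stdio.h>

    enum thp_sysfs_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

    /* Only the sysfs-enforcement part of the decision is modelled. */
    static bool hugepage_allowed(enum thp_sysfs_mode mode, bool vm_hugepage,
                                 bool enforce_sysfs)
    {
            if (!enforce_sysfs)
                    return true;            /* MADV_COLLAPSE-style callers skip sysfs policy */
            switch (mode) {
            case THP_ALWAYS:
                    return true;
            case THP_MADVISE:
                    return vm_hugepage;     /* only VMAs marked MADV_HUGEPAGE */
            default:
                    return false;
            }
    }

    int main(void)
    {
            printf("MADV_COLLAPSE, THP=madvise, no MADV_HUGEPAGE: %d\n",
                   hugepage_allowed(THP_MADVISE, false, false));
            printf("page fault,    THP=madvise, no MADV_HUGEPAGE: %d\n",
                   hugepage_allowed(THP_MADVISE, false, true));
            return 0;
    }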

    Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com
    Signed-off-by: Zach O'Keefe <zokeefe@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Alex Shi <alex.shi@linux.alibaba.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Chris Kennelly <ckennelly@google.com>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Pavel Begunkov <asml.silence@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:22 -04:00
Jan Stancek ff54bd1d81 Merge: Proactively Backport MM fixes for el9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2533

Proactively backport available fixes for the MM SST and RHEL9.3

This is part of an ongoing effort to keep the RHEL MM sst as stable as possible by backporting upstream fixes for each release.

Most of these are cherry-picks, but some full series have been included for completeness.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-28 07:52:43 +02:00
Tobias Huschle 56fe56abc7 mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044921
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=53419368
Conflicts: None
commit 99c29133639a29fa803ea27ec79bf9e732efd062
Author: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Date:   Mon Mar 6 17:15:48 2023 +0100

    mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()

    s390 can do more fine-grained handling of spurious TLB protection faults,
    when there also is the PTE pointer available.

    Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an
    additional parameter.

    This will add no functional change to other architectures, but those with
    private flush_tlb_fix_spurious_fault() implementations need to be made
    aware of the new parameter.
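
    In outline, the hook simply gains the PTE pointer (a schematic stub;
    the kernel hook operates on struct vm_area_struct and pte_t, which
    are stubbed out here):

    #include <stdio.h>

    struct vma;                     /* stand-in for struct vm_area_struct */
    typedef unsigned long pte_t;    /* stand-in for the kernel pte_t */

    /* Generic no-op, as on most architectures; s390 can now inspect *ptep. */
    static void flush_tlb_fix_spurious_fault(struct vma *vma,
                                             unsigned long address, pte_t *ptep)
    {
            (void)vma; (void)address; (void)ptep;
    }

    int main(void)
    {
            pte_t pte = 0;
            flush_tlb_fix_spurious_fault(NULL, 0x1000, &pte);
            puts("spurious-fault hook now receives the PTE pointer");
            return 0;
    }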

    Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com
    Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>     [arm64]
    Acked-by: Michael Ellerman <mpe@ellerman.id.au>         [powerpc]
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Borislav Petkov (AMD) <bp@alien8.de>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Tobias Huschle <thuschle@redhat.com>
2023-06-20 07:16:15 +00:00
Nico Pache c5f02bd692 mm: use update_mmu_tlb() on the second thread
commit bce8cb3c04dc01d21b6b17baf1cb6c277e7e6848
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Thu Sep 29 19:23:17 2022 +0800

    mm: use update_mmu_tlb() on the second thread

    As the message in commit 7df6769743 ("mm/memory.c: Update local TLB if PTE
    entry exists") said, we should update local TLB only on the second thread.
    So in the do_anonymous_page() here, we should use update_mmu_tlb()
    instead of update_mmu_cache() on the second thread.

    As David pointed out, this is a performance improvement, not a
    correctness fix.
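
    Condensed to its decision point, the change looks like this (an
    illustrative model of the do_anonymous_page() race, not the kernel
    code itself):

    #include <stdbool.h>
    #include <stdio.h>

    /* Models which MMU helper the loser of the PTE-install race should call. */
    static void install_anon_page(bool pte_already_present)
    {
            if (pte_already_present) {
                    /* Second thread: only refresh the stale local TLB entry. */
                    puts("update_mmu_tlb()");
                    return;
            }
            /* First thread: install the PTE and prime the MMU cache. */
            puts("set_pte_at(); update_mmu_cache()");
    }

    int main(void)
    {
            install_anon_page(false);       /* thread that wins the race */
            install_anon_page(true);        /* thread that loses the race */
            return 0;
    }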

    Link: https://lkml.kernel.org/r/20220929112318.32393-2-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Bibo Mao <maobibo@loongson.cn>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: Huacai Chen <chenhuacai@loongson.cn>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache 897a73840c hugetlb: use new vma_lock for pmd sharing synchronization
commit 40549ba8f8e0ed1f8b235979563f619e9aa34fdf
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:09 2022 -0700

    hugetlb: use new vma_lock for pmd sharing synchronization

    The new hugetlb vma lock is used to address this race:

    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...

    The vma_lock is used as follows:
    - During fault processing. The lock is acquired in read mode before
      doing a page table lock and allocation (huge_pte_alloc).  The lock is
      held until code is finished with the page table entry (ptep).
    - The lock must be held in write mode whenever huge_pmd_unshare is
      called.

    Lock ordering issues come into play when unmapping a page from all
    vmas mapping the page.  The i_mmap_rwsem must be held to search for the
    vmas, and the vma lock must be held before calling unmap which will
    call huge_pmd_unshare.  This is done today in:
    - try_to_migrate_one and try_to_unmap_ for page migration and memory
      error handling.  In these routines we 'try' to obtain the vma lock and
      fail to unmap if unsuccessful.  Calling routines already deal with the
      failure of unmapping.
    - hugetlb_vmdelete_list for truncation and hole punch.  This routine
      also tries to acquire the vma lock.  If it fails, it skips the
      unmapping.  However, we can not have file truncation or hole punch
      fail because of contention.  After hugetlb_vmdelete_list, truncation
      and hole punch call remove_inode_hugepages.  remove_inode_hugepages
      checks for mapped pages and call hugetlb_unmap_file_page to unmap them.
      hugetlb_unmap_file_page is designed to drop locks and reacquire in the
      correct order to guarantee unmap success.
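
    Reduced to code shape, the protocol above is a reader/writer lock
    (a userspace sketch using pthreads; hugetlb uses its own vma_lock
    primitives, so the calls below are only stand-ins):

    #include <pthread.h>
    #include <stdio.h>

    /* Faults take the per-VMA lock shared; unsharing takes it exclusive. */
    static pthread_rwlock_t vma_lock = PTHREAD_RWLOCK_INITIALIZER;

    static void fault_path(void)
    {
            pthread_rwlock_rdlock(&vma_lock);       /* read mode around huge_pte_alloc()/ptep use */
            puts("fault: ptep cannot be unshared while held");
            pthread_rwlock_unlock(&vma_lock);
    }

    static void unshare_path(void)
    {
            pthread_rwlock_wrlock(&vma_lock);       /* write mode required for huge_pmd_unshare() */
            puts("unshare: no fault holds a soon-to-be-stale ptep");
            pthread_rwlock_unlock(&vma_lock);
    }

    int main(void)
    {
            fault_path();
            unshare_path();
            return 0;
    }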

    Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache d9678e29b9 mm: use nth_page instead of mem_map_offset mem_map_next
commit 14455eabd8404a503dc8e80cd8ce185e96a94b22
Author: Cheng Li <lic121@chinatelecom.cn>
Date:   Fri Sep 9 07:31:09 2022 +0000

    mm: use nth_page instead of mem_map_offset mem_map_next

    To handle the discontiguous case, mem_map_next() has a parameter named
    `offset`.  As a function caller, one would be confused why "get next
    entry" needs a parameter named "offset".  The other drawback of
    mem_map_next() is that callers must take care of the mapping between the
    "iter" and "offset" parameters, otherwise we may get a hole or duplication
    during iteration.  So we use nth_page instead of mem_map_next.

    And replace mem_map_offset with nth_page() per Matthew's comments.
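
    Conceptually, nth_page() turns "iterate with a side-band offset"
    into plain indexing (a toy model assuming a contiguous memory map;
    the real macro goes through pfn_to_page() when the map may be
    discontiguous):

    #include <stdio.h>

    struct page { unsigned long pfn; };     /* toy stand-in for struct page */

    /* Toy model: subpage n of a contiguous range is just base + n. */
    #define nth_page(page, n)  ((page) + (n))

    int main(void)
    {
            struct page map[8];
            struct page *base = &map[0];

            for (unsigned long i = 0; i < 8; i++)
                    map[i].pfn = 100 + i;
            for (unsigned long i = 0; i < 8; i++)
                    printf("subpage %lu -> pfn %lu\n", i, nth_page(base, i)->pfn);
            return 0;
    }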

    Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
    Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
    Fixes: 69d177c2fc ("hugetlbfs: handle pages higher order than MAX_ORDER")
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:00 -06:00
Jan Stancek 761df83677 Merge: mm/demotion: Memory tiers and demotion
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2399

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186559

The automatic generation of node migration order lists to introduce
the capability of having the least referenced memory demoted to
"lower memory tiers" was introduced upstream circa v5.15 and backported
into RHEL-9.0 [1]. However, this tier-ization was introduced in a suboptimal
fashion, limiting the number of slots in the node demotion list and only
implicitly conveying the idea of having proper memory tiers as it's becoming
more common nowadays with memory systems being composed of multiple kinds
of memory (HBM, PMEM, DRAM, ...).

This merge brings into RHEL-9 the upstream v6.1 "mm/demotion: Memory tiers
and demotion" patch series in order to provide us with proper explicit
memory tiers as well as to address some of the arbitrary hard limits that
were part of the original implementation introduced with [1].

Testing will be performed as a joint venture between MM and VIRT QE,
and preliminary verifications are already documented at [2]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2023396
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2186559

Rafael Aquini (18):
  mm/demotion: add support for explicit memory tiers
  mm/demotion: move memory demotion related code
  mm/demotion: add hotplug callbacks to handle new numa node onlined
  mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE
  mm/demotion: build demotion targets based on explicit memory tiers
  mm/demotion: add pg_data_t member to track node memory tier details
  mm/demotion: drop memtier from memtype
  mm/demotion: demote pages according to allocation fallback order
  mm/demotion: update node_is_toptier to work with memory tiers
  mm/demotion: expose memory tier details via sysfs
  mm/demotion: fix NULL vs IS_ERR checking in memory_tier_init
  memory tier, sysfs: rename attribute "nodes" to "nodelist"
  memory tier: release the new_memtier in find_create_memory_tier()
  lib/nodemask: optimize node_random for nodemask with single NUMA node
  lib/kstrtox.c: add "false"/"true" support to kstrtobool()
  arm64/mm: fold check for KFENCE into can_set_direct_map()
  arm64: fix rodata=full
  arm64: fix rodata=full again

Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:50 +02:00
Donald Dutile 035064f58b mm: take a page reference when removing device exclusive entries
Bugzilla: http://bugzilla.redhat.com/2184200

commit 7c7b962938ddda6a9cd095de557ee5250706ea88
Author: Alistair Popple <apopple@nvidia.com>
Date:   Thu Mar 30 12:25:19 2023 +1100

    mm: take a page reference when removing device exclusive entries

    Device exclusive page table entries are used to prevent CPU access to a
    page whilst it is being accessed from a device.  Typically this is used to
    implement atomic operations when the underlying bus does not support
    atomic access.  When a CPU thread encounters a device exclusive entry it
    locks the page and restores the original entry after calling mmu notifiers
    to signal drivers that exclusive access is no longer available.

    The device exclusive entry holds a reference to the page making it safe to
    access the struct page whilst the entry is present.  However the fault
    handling code does not hold the PTL when taking the page lock.  This means
    if there are multiple threads faulting concurrently on the device
    exclusive entry one will remove the entry whilst others will wait on the
    page lock without holding a reference.

    This can lead to threads locking or waiting on a folio with a zero
    refcount.  Whilst mmap_lock prevents the pages getting freed via munmap()
    they may still be freed by a migration.  This leads to warnings such as
    PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount
    drops to zero.

    Fix this by trying to take a reference on the folio before locking it.
    The code already checks the PTE under the PTL and aborts if the entry is
    no longer there.  It is also possible the folio has been unmapped, freed
    and re-allocated allowing a reference to be taken on an unrelated folio.
    This case is also detected by the PTE check and the folio is unlocked
    without further changes.
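
    The fix boils down to "get a reference, then lock, then re-check"
    (a schematic sketch; the booleans stand in for the real checks and
    the commented-out calls only name the kind of helpers involved):

    #include <stdbool.h>
    #include <stdio.h>

    /* Never lock or wait on a folio whose refcount may already be zero. */
    static bool remove_exclusive_entry(bool got_reference, bool pte_still_matches)
    {
            if (!got_reference)
                    return false;   /* folio already being freed: let the fault retry */
            /* folio_lock(folio); */
            if (!pte_still_matches) {
                    /* folio_unlock(folio); folio_put(folio); */
                    return false;   /* entry changed or folio reused: nothing to do */
            }
            /* restore the original PTE, then unlock and drop the reference */
            return true;
    }

    int main(void)
    {
            printf("ref taken, PTE unchanged: handled=%d\n",
                   remove_exclusive_entry(true, true));
            printf("ref not taken:            handled=%d\n",
                   remove_exclusive_entry(false, true));
            return 0;
    }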

    Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com
    Fixes: b756a3b5e7 ("mm: device exclusive memory access")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:27 -04:00
Donald Dutile 1dfc72776e mm: convert lock_page_or_retry() to folio_lock_or_retry()
Bugzilla: http://bugzilla.redhat.com/2184200

commit 19672a9e4a75252871cba319f4e3b859b8fdf671
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:53 2022 +0100

    mm: convert lock_page_or_retry() to folio_lock_or_retry()

    Remove a call to compound_head() in each of the two callers.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-58-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:27 -04:00
Donald Dutile 8332f94d36 mm: convert do_swap_page() to use a folio
Bugzilla: http://bugzilla.redhat.com/2184200

commit 63ad4add3823051aeb1fcd1ba981f6efd07086bf
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:10 2022 +0100

    mm: convert do_swap_page() to use a folio

    Removes quite a lot of calls to compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-15-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:27 -04:00
Donald Dutile ee486de9b1 mm/memory: return vm_fault_t result from migrate_to_ram() callback
Bugzilla: http://bugzilla.redhat.com/2159905

commit 4a955bed882e734807024afd8f53213d4c61ff97
Author: Alistair Popple <apopple@nvidia.com>
Date:   Mon Nov 14 22:55:37 2022 +1100

    mm/memory: return vm_fault_t result from migrate_to_ram() callback

    The migrate_to_ram() callback should always succeed, but in rare cases can
    fail, usually returning VM_FAULT_SIGBUS.  Commit 16ce101db85d
    ("mm/memory.c: fix race when faulting a device private page") incorrectly
    stopped passing the return code up the stack.  Fix this by setting the ret
    variable, restoring the previous behaviour on migrate_to_ram() failure.

    Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
    Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:27 -04:00
Donald Dutile 6217af187c mm/memory.c: fix race when faulting a device private page
Bugzilla: http://bugzilla.redhat.com/2159905

commit 16ce101db85db694a91380aa4c89b25530871d33
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Sep 28 22:01:15 2022 +1000

    mm/memory.c: fix race when faulting a device private page

    Patch series "Fix several device private page reference counting issues",
    v2

    This series aims to fix a number of page reference counting issues in
    drivers dealing with device private ZONE_DEVICE pages.  These result in
    use-after-free type bugs, either from accessing a struct page which no
    longer exists because it has been removed or accessing fields within the
    struct page which are no longer valid because the page has been freed.

    During normal usage it is unlikely these will cause any problems.  However
    without these fixes it is possible to crash the kernel from userspace.
    These crashes can be triggered either by unloading the kernel module or
    unbinding the device from the driver prior to a userspace task exiting.
    In modules such as Nouveau it is also possible to trigger some of these
    issues by explicitly closing the device file-descriptor prior to the task
    exiting and then accessing device private memory.

    This involves some minor changes to both PowerPC and AMD GPU code.
    Unfortunately I lack hardware to test either of those so any help there
    would be appreciated.  The changes mimic what is done in for both Nouveau
    and hmm-tests though so I doubt they will cause problems.

    This patch (of 8):

    When the CPU tries to access a device private page the migrate_to_ram()
    callback associated with the pgmap for the page is called.  However no
    reference is taken on the faulting page.  Therefore a concurrent migration
    of the device private page can free the page and possibly the underlying
    pgmap.  This results in a race which can crash the kernel due to the
    migrate_to_ram() function pointer becoming invalid.  It also means drivers
    can't reliably read the zone_device_data field because the page may have
    been freed with memunmap_pages().

    Close the race by getting a reference on the page while holding the ptl to
    ensure it has not been freed.  Unfortunately the elevated reference count
    will cause the migration required to handle the fault to fail.  To avoid
    this failure pass the faulting page into the migrate_vma functions so that
    if an elevated reference count is found it can be checked to see if it's
    expected or not.

    [mpe@ellerman.id.au: fix build]
      Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
    Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
    Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:26 -04:00
Rafael Aquini c33f95c55c mm/demotion: update node_is_toptier to work with memory tiers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186559

This patch is a backport of the following upstream commit:
commit 467b171af881282fc627328e6c164f044a6df888
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Thu Aug 18 18:40:41 2022 +0530

    mm/demotion: update node_is_toptier to work with memory tiers

    With memory tier support we can have memory only NUMA nodes in the top
    tier from which we want to avoid promotion tracking NUMA faults.  Update
    node_is_toptier to work with memory tiers.  All NUMA nodes are by default
    top tier nodes.  With lower (slower) memory tiers added, we consider all
    memory tiers above a memory tier having CPU NUMA nodes as top memory
    tiers.

    [sj@kernel.org: include missed header file, memory-tiers.h]
      Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
    [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
    [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
      Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
    Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Wei Xu <weixugc@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hesham Almatary <hesham.almatary@huawei.com>
    Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Tim Chen <tim.c.chen@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-04-26 08:55:47 -04:00
Chris von Recklinghausen ad6d7b5ea6 mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6a56ccbcf6c69538b152644107a1d7383c876ca7
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Nov 8 18:46:50 2022 +0100

    mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite

    commit b191f9b106 ("mm: numa: preserve PTE write permissions across a
    NUMA hinting fault") added remembering write permissions using ordinary
    pte_write() for PROT_NONE mapped pages to avoid write faults when
    remapping the page !PROT_NONE on NUMA hinting faults.

    That commit noted:

        The patch looks hacky but the alternatives looked worse. The tidiest was
        to rewalk the page tables after a hinting fault but it was more complex
        than this approach and the performance was worse. It's not generally
        safe to just mark the page writable during the fault if it's a write
        fault as it may have been read-only for COW so that approach was
        discarded.

    Later, commit 288bc54949 ("mm/autonuma: let architecture override how
    the write bit should be stashed in a protnone pte.") introduced a family
    of savedwrite PTE functions that didn't necessarily improve the whole
    situation.

    One confusing thing is that nowadays, if a page is pte_protnone()
    and pte_savedwrite() then also pte_write() is true. Another source of
    confusion is that there is only a single pte_mk_savedwrite() call in the
    kernel. All other write-protection code seems to silently rely on
    pte_wrprotect().

    Ever since PageAnonExclusive was introduced and we started using it in
    mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write
    faults for exclusive anonymous pages when changing protection"), we do
    have machinery in place to avoid write faults when changing protection,
    which is exactly what we want to do here.

    Let's similarly do what ordinary mprotect() does nowadays when upgrading
    write permissions and reuse can_change_pte_writable() and
    can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
    writable.
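
    A toy model of the writability decision being reused here (heavily
    simplified and standalone; the real can_change_pte_writable() works
    on the VMA and PTE and covers more cases):

    #include <stdbool.h>
    #include <stdio.h>

    struct fake_pte {
            bool anon;              /* anonymous mapping? */
            bool anon_exclusive;    /* PageAnonExclusive set? */
            bool dirty;             /* pte_dirty()? */
            bool needs_writenotify; /* shared mapping that wants write notifications */
    };

    /* Grant the write bit only where mprotect() would also grant it. */
    static bool can_upgrade_to_writable(struct fake_pte pte)
    {
            if (pte.anon)
                    return pte.anon_exclusive;      /* otherwise COW must be preserved */
            if (pte.needs_writenotify)
                    return pte.dirty;               /* clean pages must still fault for writenotify */
            return true;
    }

    int main(void)
    {
            struct fake_pte pte = { .anon = true, .anon_exclusive = true };
            printf("exclusive anon page writable after NUMA fault: %d\n",
                   can_upgrade_to_writable(pte));
            return 0;
    }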

    For anonymous pages there should be absolutely no change: if an
    anonymous page is not exclusive, it could not have been mapped writable --
    because only exclusive anonymous pages can be mapped writable.

    However, there *might* be a change for writable shared mappings that
    require writenotify: if they are not dirty, we cannot map them writable.
    While it might not matter in practice, we'd need a different way to
    identify whether writenotify is actually required -- and ordinary mprotect
    would benefit from that as well.

    Note that we don't optimize for the actual migration case:
    (1) When migration succeeds the new PTE will not be writable because the
        source PTE was not writable (protnone); in the future we
        might just optimize that case similarly by reusing
        can_change_pte_writable()/can_change_pmd_writable() when removing
        migration PTEs.
    (2) When migration fails, we'd have to recalculate the "writable" flag
        because we temporarily dropped the PT lock; for now keep it simple and
        set "writable=false".

    We'll remove all savedwrite leftovers next.

    Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:34 -04:00
Chris von Recklinghausen 2f2ceb6140 mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in
Bugzilla: https://bugzilla.redhat.com/2160210

commit 515778e2d790652a38a24554fdb7f21420d91efc
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Sep 30 20:25:55 2022 -0400

    mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in

    When PTE_MARKER_UFFD_WP is not configured, it's still possible to reach pte
    marker code and trigger a warning. Add a few CONFIG_PTE_MARKER_UFFD_WP
    ifdefs to make sure the code won't be reached when not compiled in.
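
    The fix follows the usual config-guard shape (a generic illustration
    of the pattern, not the exact hunks):

    #include <stdio.h>

    /* In the kernel the switch is CONFIG_PTE_MARKER_UFFD_WP from Kconfig. */
    /* #define CONFIG_PTE_MARKER_UFFD_WP */

    static void handle_pte_marker(void)
    {
    #ifdef CONFIG_PTE_MARKER_UFFD_WP
            puts("uffd-wp marker handling compiled in");
    #else
            /* Support not built in: never reach the marker (and its WARN) path. */
            puts("uffd-wp marker path compiled out");
    #endif
    }

    int main(void)
    {
            handle_pte_marker();
            return 0;
    }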

    Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
    Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: <syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Edward Liaw <edliaw@google.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:33 -04:00
Chris von Recklinghausen 636e84cec8 mm: bring back update_mmu_cache() to finish_fault()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 70427f6e9ecfc8c5f977b21dd9f846b3bda02500
Author: Sergei Antonov <saproj@gmail.com>
Date:   Thu Sep 8 23:48:09 2022 +0300

    mm: bring back update_mmu_cache() to finish_fault()

    Running this test program on ARMv4 a few times (sometimes just once)
    reproduces the bug.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* SIZE is not specified in the original report; any reasonable buffer
       size (e.g. one page) exercises the same code path. */
    #ifndef SIZE
    #define SIZE 4096
    #endif

    int main(void)
    {
            unsigned i;
            char paragon[SIZE];
            void *ptr;

            memset(paragon, 0xAA, SIZE);
            ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_ANON | MAP_SHARED, -1, 0);
            if (ptr == MAP_FAILED)
                    return 1;
            printf("ptr = %p\n", ptr);
            for (i = 0; i < 10000; i++) {
                    memset(ptr, 0xAA, SIZE);
                    if (memcmp(ptr, paragon, SIZE)) {
                            printf("Unexpected bytes on iteration %u!!!\n", i);
                            break;
                    }
            }
            munmap(ptr, SIZE);
            return 0;
    }

    In the "ptr" buffer there appear runs of zero bytes which are aligned
    by 16 and their lengths are multiple of 16.

    Linux v5.11 does not have the bug, "git bisect" finds the first bad commit:
    f9ce0be71d ("mm: Cleanup faultaround and finish_fault() codepaths")

    Before the commit update_mmu_cache() was called during a call to
    filemap_map_pages() as well as finish_fault(). After the commit
    finish_fault() lacks it.

    Bring back update_mmu_cache() to finish_fault() to fix the bug.
    Also call update_mmu_tlb() only when returning VM_FAULT_NOPAGE to more
    closely reproduce the code of the alloc_set_pte() function that existed before
    the commit.

    On many platforms update_mmu_cache() is nop:
     x86, see arch/x86/include/asm/pgtable
     ARMv6+, see arch/arm/include/asm/tlbflush.h
    So, it seems, few users ran into this bug.

    Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com
    Fixes: f9ce0be71d ("mm: Cleanup faultaround and finish_fault() codepaths")
    Signed-off-by: Sergei Antonov <saproj@gmail.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen b9ef7d0aa3 memory tiering: hot page selection with hint page fault latency
Bugzilla: https://bugzilla.redhat.com/2160210

commit 33024536bafd9129f1d16ade0974671c648700ac
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Jul 13 16:39:51 2022 +0800

    memory tiering: hot page selection with hint page fault latency

    Patch series "memory tiering: hot page selection", v4.

    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory nodes need to be identified.
    Essentially, the original NUMA balancing implementation selects the most
    recently accessed (MRU) pages to promote.  But this isn't a perfect
    algorithm to identify the hot pages.  Because the pages with quite low
    access frequency may be accessed eventually given the NUMA balancing page
    table scanning period could be quite long (e.g.  60 seconds).  So in this
    patchset, we implement a new hot page identification algorithm based on
    the latency between NUMA balancing page table scanning and hint page
    fault.  Which is a kind of mostly frequently accessed (MFU) algorithm.

    In NUMA balancing memory tiering mode, if there are hot pages in slow
    memory node and cold pages in fast memory node, we need to promote/demote
    hot/cold pages between the fast and cold memory nodes.

    A choice is to promote/demote as fast as possible.  But the CPU cycles and
    memory bandwidth consumed by the high promoting/demoting throughput will
    hurt the latency of some workloads because of inflated access latency and
    contention on the slow memory's bandwidth.

    A way to resolve this issue is to restrict the max promoting/demoting
    throughput.  It will take longer to finish the promoting/demoting.  But
    the workload latency will be better.  This is implemented in this patchset
    as the page promotion rate limit mechanism.

    The promotion hot threshold is workload and system configuration
    dependent.  So in this patchset, a method to adjust the hot threshold
    automatically is implemented.  The basic idea is to control the number of
    the candidate promotion pages to match the promotion rate limit.

    We used the pmbench memory accessing benchmark to test the patchset on a
    2-socket server system with DRAM and PMEM installed.  The test results are
    as follows,

                    pmbench score           promote rate
                     (accesses/s)                   MB/s
                    -------------           ------------
    base              146887704.1                  725.6
    hot selection     165695601.2                  544.0
    rate limit        162814569.8                  165.2
    auto adjustment   170495294.0                  136.9

    From the results above,

    With hot page selection patch [1/3], the pmbench score increases about
    12.8%, and promote rate (overhead) decreases about 25.0%, compared with
    base kernel.

    With rate limit patch [2/3], pmbench score decreases about 1.7%, and
    promote rate decreases about 69.6%, compared with hot page selection
    patch.

    With threshold auto adjustment patch [3/3], pmbench score increases about
    4.7%, and promote rate decrease about 17.1%, compared with rate limit
    patch.

    Baolin helped to test the patchset with MySQL on a machine which contains
    1 DRAM node (30G) and 1 PMEM node (126G).

    sysbench /usr/share/sysbench/oltp_read_write.lua \
    ......
    --tables=200 \
    --table-size=1000000 \
    --report-interval=10 \
    --threads=16 \
    --time=120

    The tps can be improved about 5%.

    This patch (of 3):

    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory node need to be identified.  Essentially,
    the original NUMA balancing implementation selects the mostly recently
    accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
    identify the hot pages.  Because the pages with quite low access frequency
    may be accessed eventually given the NUMA balancing page table scanning
    period could be quite long (e.g.  60 seconds).  The most frequently
    accessed (MFU) algorithm is better.

    So, in this patch we implemented a better hot page selection algorithm,
    which is based on NUMA balancing page table scanning and hint page faults,
    as follows:

    - When the page tables of the processes are scanned to change PTE/PMD
      to be PROT_NONE, the current time is recorded in struct page as scan
      time.

    - When the page is accessed, hint page fault will occur.  The scan
      time is read from the struct page, and the hint page fault
      latency is defined as

        hint page fault time - scan time

    The shorter the hint page fault latency of a page is, the more likely
    the page is being accessed frequently.  So the hint page fault latency is
    a better estimate of whether a page is hot or cold.
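
    In code form the hotness test is just a latency comparison (a
    standalone sketch; the kernel derives the scan time from reused
    cpupid bits in struct page rather than a separate field):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model with millisecond timestamps. */
    static bool page_is_hot(unsigned long scan_time_ms,
                            unsigned long fault_time_ms,
                            unsigned long hot_threshold_ms)
    {
            /* A short scan-to-fault latency implies a high access frequency. */
            return fault_time_ms - scan_time_ms < hot_threshold_ms;
    }

    int main(void)
    {
            unsigned long threshold_ms = 1000;      /* default of 1 second, per the text */
            printf("latency 200ms  -> hot=%d\n", page_is_hot(5000, 5200, threshold_ms));
            printf("latency 3000ms -> hot=%d\n", page_is_hot(5000, 8000, threshold_ms));
            return 0;
    }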

    It's hard to find extra space in struct page to hold the scan time.
    Fortunately, we can reuse some bits used by the original NUMA balancing.

    NUMA balancing uses some bits in struct page to store the accessing CPU
    and PID of the page (see page_cpupid_xchg_last()), which are used by the
    multi-stage node selection algorithm to avoid migrating pages that are
    shared among NUMA nodes back and forth.  But for pages in the slow memory
    node, even if they are shared by multiple NUMA nodes, as long as the
    pages are hot, they need to be promoted to the fast memory node.  So the
    accessing CPU and PID information is unnecessary for the slow memory
    pages, and we can reuse these bits in struct page to record the scan
    time.  For the fast memory pages, these bits are used as before.
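
    Purely as an illustration of that bit reuse (the field names and widths
    below are invented for this sketch and are not the kernel's actual
    layout):

        /* the same word identifies either the last accessor (fast tier)
         * or holds a coarse scan timestamp (slow tier) */
        union last_access {
                struct {
                        unsigned int cpu : 12;
                        unsigned int pid : 20;
                } cpupid;                 /* fast memory pages: as before */
                unsigned int scan_time;   /* slow memory pages: scan time */
        };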

    For the hot threshold, the default value is 1 second, which works well in
    our performance test.  All pages with hint page fault latency < hot
    threshold will be considered hot.

    It's hard for users to determine the hot threshold.  So we don't provide a
    kernel ABI to set it, just provide a debugfs interface for advanced users
    to experiment.  We will continue to work on a hot threshold automatic
    adjustment mechanism.

    The downside of the above method is that the response time to a change in
    the workload's hot spot may be much longer.  For example,

    - A previous cold memory area becomes hot

    - The hint page fault will be triggered.  But the hint page fault
      latency isn't shorter than the hot threshold.  So the pages will
      not be promoted.

    - When the memory area is scanned again, maybe after a scan period,
      the hint page fault latency measured will be shorter than the hot
      threshold and the pages will be promoted.

    To mitigate this, if there is enough free space in the fast memory node,
    the hot threshold will not be used and all pages will be promoted upon
    the hint page fault, for fast response.

    Thanks to Zhong Jiang, who reported and tested the fix for a bug hit when
    disabling memory tiering mode dynamically.

    Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: osalvador <osalvador@suse.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen ef58a8e839 mm: remove unneeded PageAnon check in restore_exclusive_pte()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 4d8ff64097092701a5e5506d0d7f643d421e0432
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jul 16 16:18:16 2022 +0800

    mm: remove unneeded PageAnon check in restore_exclusive_pte()

    When code reaches here, the page must be !PageAnon.  There's no need to
    check PageAnon again.  Remove it.

    Link: https://lkml.kernel.org/r/20220716081816.10752-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
Chris von Recklinghausen 01b53e24e2 mm: remove obsolete comment in do_fault_around()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 0f0b6931ff0d8de344392f5d470f88af64130709
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jul 16 16:03:59 2022 +0800

    mm: remove obsolete comment in do_fault_around()

    Since commit 7267ec008b ("mm: postpone page table allocation until we
    have page to map"), do_fault_around() is not called with the page table
    lock held.  Clean up the corresponding comments.

    Link: https://lkml.kernel.org/r/20220716080359.38791-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
Chris von Recklinghausen 83001d8990 hugetlb: lazy page table copies in fork()
Bugzilla: https://bugzilla.redhat.com/2160210

commit bcd51a3c679d179cf526414f859c57d081fd37e7
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Jun 21 16:56:20 2022 -0700

    hugetlb: lazy page table copies in fork()

    Lazy page table copying at fork time was introduced with commit
    d992895ba2 ("[PATCH] Lazy page table copies in fork()").  At the time,
    hugetlb was very new and did not support page faulting.  As a result, it
    was excluded.  When full page fault support was added for hugetlb, the
    exclusion was not removed.

    Simply remove the check that prevents lazy copying of hugetlb page tables
    at fork.  Of course, like other mappings this only applies to shared
    mappings.

    Lazy page table copying at fork will be less advantageous for hugetlb
    mappings because:
    - There are fewer page table entries with hugetlb
    - hugetlb pmds can be shared instead of copied

    In any case, completely eliminating the copy at fork time should speed
    things up.

    Link: https://lkml.kernel.org/r/20220621235620.291305-5-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen d8958baf24 mm: thp: kill __transhuge_page_enabled()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 7da4e2cb8b1ff8221759bfc7512d651ee69516dc
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu Jun 16 10:48:38 2022 -0700

    mm: thp: kill __transhuge_page_enabled()

    The page fault path checks THP eligibility with
    __transhuge_page_enabled(), which does a similar thing to
    hugepage_vma_check(), so use hugepage_vma_check() instead.

    However, the page fault path allows DAX and !anon_vma cases, so add a new
    flag, in_pf, to hugepage_vma_check() to make page faults work correctly.

    The in_pf flag is also used to skip shmem and file THP for page fault
    since shmem handles THP in its own shmem_fault() and file THP allocation
    on fault is not supported yet.

    Also remove hugepage_vma_enabled(): since hugepage_vma_check() is its only
    caller now, it is not necessary to have a helper function.

    Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zach O'Keefe <zokeefe@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen aadb0028d0 delayacct: track delays from write-protect copy
Bugzilla: https://bugzilla.redhat.com/2160210

commit 662ce1dc9caf493c309200edbe38d186f1ea20d0
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Wed Jun 1 15:55:25 2022 -0700

    delayacct: track delays from write-protect copy

    Delay accounting does not track the delay of write-protect copy.  When
    tasks trigger many write-protect copies (including COW and unsharing of
    anonymous pages [1]), they may spend a significant amount of time waiting
    for them.  Tracking the delay of tasks in write-protect copy helps users
    evaluate the impact of using KSM, fork() or GUP.
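
    Conceptually the accounting just brackets the copy itself; a sketch of
    the usage pattern (assuming the start/end hook pair added by this patch
    is named delayacct_wpcopy_start()/delayacct_wpcopy_end()):

        delayacct_wpcopy_start();
        /* allocate the new page and copy the old contents for the
         * write-protect (COW / unshare) fault */
        copy_user_highpage(new_page, old_page, addr, vma);
        delayacct_wpcopy_end();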

    Also update tools/accounting/getdelays.c:

        / # ./getdelays -dl -p 231
        print delayacct stats ON
        listen forever
        PID     231

        CPU             count     real total  virtual total    delay total  delay average
                         6247     1859000000     2154070021     1674255063          0.268ms
        IO              count    delay total  delay average
                            0              0              0ms
        SWAP            count    delay total  delay average
                            0              0              0ms
        RECLAIM         count    delay total  delay average
                            0              0              0ms
        THRASHING       count    delay total  delay average
                            0              0              0ms
        COMPACT         count    delay total  delay average
                            3          72758              0ms
        WPCOPY          count    delay total  delay average
                         3635      271567604              0ms

    [1] commit 31cc5bc4af70 ("mm: support GUP-triggered unsharing of anonymous pages")

    Link: https://lkml.kernel.org/r/20220409014342.2505532-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Reviewed-by: wangyong <wang.yong12@zte.com.cn>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen d7197a5b04 mm/swapfile: unuse_pte can map random data if swap read fails
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9f186f9e5fa9ebdaef909fd45f825a6ce281f13c
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu May 19 20:50:26 2022 +0800

    mm/swapfile: unuse_pte can map random data if swap read fails

    Patch series "A few fixup patches for mm", v4.

    This series contains a few patches to avoid mapping random data if swap
    read fails and fix lost swap bits in unuse_pte.  Also we free hwpoison and
    swapin error entry in madvise_free_pte_range and so on.  More details can
    be found in the respective changelogs.

    This patch (of 5):

    There is a bug in unuse_pte(): when the swap page happens to be
    unreadable, a page filled with random data is mapped into the user
    address space.  In case of error, a special swap entry indicating that
    the swap read failed is set in the page table.  That way the swapcache
    page can be freed and the user won't end up with a permanently mounted
    swap because a sector is bad.  If the page is accessed later, the user
    process will be killed so that corrupted data is never consumed.  On the
    other hand, if the page is never accessed, the user won't even notice it.

    Link: https://lkml.kernel.org/r/20220519125030.21486-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220519125030.21486-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Howells <dhowells@redhat.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen bdcc532d41 mm/swap: avoid calling swp_swap_info when try to check SWP_STABLE_WRITES
Bugzilla: https://bugzilla.redhat.com/2160210

commit eacde32757c7566d3aa760609585c78909532e40
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu May 19 14:08:51 2022 -0700

    mm/swap: avoid calling swp_swap_info when try to check SWP_STABLE_WRITES

    Use si->flags directly to check SWP_STABLE_WRITES, avoiding a call to
    swp_swap_info() and a possible READ_ONCE(), and thus saving some CPU
    cycles.

    [akpm@linux-foundation.org: use data_race() on si->flags, per Neil]
    Link: https://lkml.kernel.org/r/20220509131416.17553-10-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:10 -04:00
Chris von Recklinghausen b30d54323b mm/shmem: remove duplicate include in memory.c
Bugzilla: https://bugzilla.redhat.com/2160210

commit 54943a1a4d2a3fed23d31e96a91aa9d18ea74500
Author: Wan Jiabing <wanjiabing@vivo.com>
Date:   Thu May 12 20:23:00 2022 -0700

    mm/shmem: remove duplicate include in memory.c

    Fix the following checkincludes.pl warning:
    mm/memory.c: linux/mm_inline.h is included more than once.

    The include already appears on line 44; remove the duplicate here.

    Link: https://lkml.kernel.org/r/20220427064717.803019-1-wanjiabing@vivo.com
    Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:06 -04:00
Chris von Recklinghausen 522d352177 mm/hugetlb: only drop uffd-wp special pte if required
Bugzilla: https://bugzilla.redhat.com/2160210

commit 05e90bd05eea33fc77d6b11e121e2da01fee379f
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:55 2022 -0700

    mm/hugetlb: only drop uffd-wp special pte if required

    As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
    if unmapping an entire vma or synchronized such that faults can not race
    with the unmap operation.  This requires passing zap_flags all the way to
    the lowest level hugetlb unmap routine: __unmap_hugepage_range.

    In general, unmap calls originated in hugetlbfs code will pass the
    ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
    faults.  The exception is hole punch which will first unmap without any
    synchronization.  Later when hole punch actually removes the page from the
    file, it will check to see if there was a subsequent fault and if so take
    the hugetlb fault mutex while unmapping again.  This second unmap will
    pass in ZAP_FLAG_DROP_MARKER.

    The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag
    when unmapping a hugetlb range" is (IMHO): we should never reach a state
    where a page fault could erroneously fault in a page-cache page that was
    wr-protected as writable, even for an extremely short period.  That
    could happen if e.g.  we pass ZAP_FLAG_DROP_MARKER when
    hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
    faults after that call and before remove_inode_hugepages() is executed,
    the page cache can be mapped writable again in the small racy window,
    which can cause unexpected data to be overwritten.

    [peterx@redhat.com: fix sparse warning]
      Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
    [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
    Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 42509e6e65 mm/shmem: persist uffd-wp bit across zapping for file-backed
Bugzilla: https://bugzilla.redhat.com/2160210

commit 999dad824c39ed14dee7c4412aae531ba9e74a90
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:53 2022 -0700

    mm/shmem: persist uffd-wp bit across zapping for file-backed

    File-backed memory is prone to being unmapped at any time.  It means all
    information in the pte will be dropped, including the uffd-wp flag.

    To persist the uffd-wp flag, we'll use the pte markers.  This patch
    teaches the zap code to understand uffd-wp and know when to keep or drop
    the uffd-wp bit.

    Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
    don't want to persist such information, for example, when destroying the
    whole vma, or punching a hole in a shmem file.  For the remaining cases
    we should never drop the uffd-wp bit, or the wr-protect information will
    get lost.

    The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
    memory.c because it'll be further referenced in hugetlb files later.
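
    A minimal sketch of the resulting decision in the zap path (condensed
    from the description above; the helper and field names here are
    illustrative, not necessarily the ones used in the patch):

        /* keep the uffd-wp marker unless the caller explicitly asked us
         * to drop it via ZAP_FLAG_DROP_MARKER in zap_details */
        static bool keep_uffd_wp_marker(const struct zap_details *details)
        {
                return !(details &&
                         (details->zap_flags & ZAP_FLAG_DROP_MARKER));
        }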

    Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen dec1b9ad8e mm/shmem: handle uffd-wp special pte in page fault handler
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9c28a205c06123b9f0a0c4d819ece9f5f552d004
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:53 2022 -0700

    mm/shmem: handle uffd-wp special pte in page fault handler

    File-backed memory is prone to unmap/swap so the ptes are always
    unstable, because they can easily be faulted back in later using the page
    cache.  This could lead to uffd-wp getting lost when unmapping or
    swapping out such memory.  One example is shmem.  PTE markers are needed
    to store that information.

    This patch prepares for that by handling uffd-wp pte markers before the
    feature is applied elsewhere, so that the page fault handler can
    recognize uffd-wp pte markers.

    The handling of uffd-wp pte markers is similar to a missing fault; it's
    just that we'll handle this "missing fault" when we see the pte markers,
    and meanwhile we need to make sure the marker information is kept while
    processing the fault.

    This is a slow path of uffd-wp handling, because zapping of wr-protected
    shmem ptes should be rare.  So far it should only trigger in two
    conditions:

      (1) When trying to punch holes in shmem_fallocate(), there is an
          optimization to zap the pgtables before evicting the page.

      (2) When swapping out shmem pages.

    Because of this, the page fault handling is simplified too: instead of
    sending the wr-protect message on the 1st page fault, the page will be
    installed read-only, so the uffd-wp message will be generated on the next
    fault, which will trigger the do_wp_page() path of general uffd-wp
    handling.

    Disable fault-around for all uffd-wp registered ranges for extra safety
    just like uffd-minor fault, and clean the code up.

    Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 76171a01f8 mm: check against orig_pte for finish_fault()
Bugzilla: https://bugzilla.redhat.com/2160210

commit f46f2adecdcc1ba0799383e67fe98f65f41fea5c
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:52 2022 -0700

    mm: check against orig_pte for finish_fault()

    This patch allows do_fault() to trigger on !pte_none() cases too.  This
    prepares for the pte markers to be handled by do_fault() just like none
    pte.

    To achieve this, instead of unconditionally checking against pte_none()
    in finish_fault(), we may hit the case that the orig_pte was some pte
    marker, so what we want to do is to replace the pte marker with some
    valid pte entry.  Then, if orig_pte was set, we want to check the current
    *pte (under the pgtable lock) against orig_pte rather than against a none
    pte.

    Right now there's no solid way to safely reference orig_pte because when
    pmd is not allocated handle_pte_fault() will not initialize orig_pte, so
    it's not safe to reference it.

    There's another solution proposed before this patch: do pte_clear() on
    vmf->orig_pte for the pmd==NULL case.  However, it turns out it'll break
    arm32, because arm32 may assume that a pte_t * pointer always resides in
    a real RAM page table, not in any kernel stack variable.

    To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be
    set along with orig_pte when there is valid orig_pte, or it'll be cleared
    when orig_pte was not initialized.

    It'll be updated every time we call handle_pte_fault(), so e.g.  if a page
    fault retry happened it'll be properly updated along with orig_pte.

    [1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/
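
    A sketch of the resulting re-check under the page-table lock (condensed
    from the description above; the helper name is illustrative and may not
    match the literal one added by the patch):

        /* has the PTE changed since the fault started? */
        static bool fault_pte_changed(struct vm_fault *vmf)
        {
                if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)
                        return !pte_same(*vmf->pte, vmf->orig_pte);
                /* orig_pte was never read, so "unchanged" means still none */
                return !pte_none(*vmf->pte);
        }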

    [akpm@linux-foundation.org: coding-style cleanups]
    [peterx@redhat.com: fix crash reported by Marek]
      Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local
    Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 220173faac mm: teach core mm about pte markers
Conflicts: fs/userfaultfd.c - We already have
	2d5de004e009 ("userfaultfd: add /dev/userfaultfd for fine grained access control")
	so keep #include <linux/miscdevice.h>

Bugzilla: https://bugzilla.redhat.com/2160210

commit 5c041f5d1f23d3a172dd0db3215634c484b4acd6
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:52 2022 -0700

    mm: teach core mm about pte markers

    This patch still does not use pte marker in any way, however it teaches
    the core mm about the pte marker idea.

    For example, handle_pte_marker() is introduced; it will parse and handle
    all the pte marker faults.

    Many of the changes are just comments - so that we know there's the
    possibility of a pte marker showing up, and why we don't need special
    code for those cases.
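
    For illustration, the shape of such a dispatcher (a sketch based on the
    description; the marker-extraction helper name is assumed here, and at
    this point in the series no marker type is handled yet, so every marker
    simply fails the fault):

        static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
        {
                swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
                unsigned long marker = pte_marker_get(entry);

                /* a marker entry with no marker bits set is unexpected */
                if (!marker)
                        return VM_FAULT_SIGBUS;

                /* later patches teach this path about uffd-wp markers */
                return VM_FAULT_SIGBUS;
        }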

    [peterx@redhat.com: userfaultfd.c needs swapops.h]
      Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local
    Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 9ecb054eeb mm: submit multipage reads for SWP_FS_OPS swap-space
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5169b844b7dd5934cd4f22ab66de0cc669abf0b0
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:49 2022 -0700

    mm: submit multipage reads for SWP_FS_OPS swap-space

    swap_readpage() is given one page at a time, but may be called repeatedly
    in succession.

    For block-device swap-space, the blk_plug functionality allows the
    multiple pages to be combined together at lower layers.  That cannot be
    used for SWP_FS_OPS as blk_plug may not exist - it is only active when
    CONFIG_BLOCK=y.  Consequently all swap reads over NFS are single page
    reads.

    With this patch we pass in a pointer-to-pointer in which swap_readpage()
    can store state between calls - much like the effect of blk_plug.  After
    calling swap_readpage() some number of times, the state will be passed to
    swap_read_unplug() which can submit the combined request.
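
    The calling convention is the usual "caller-held plug" idiom; a small
    userspace model of the pattern (names and types here are invented for
    the sketch, not the kernel's):

        #include <stdio.h>
        #include <stdlib.h>

        struct plug { int nr; int pages[64]; };

        /* queue one read; allocate the plug lazily on first use */
        static void read_page(int pageno, struct plug **plugp)
        {
                if (!*plugp)
                        *plugp = calloc(1, sizeof(**plugp));
                if ((*plugp)->nr < 64)
                        (*plugp)->pages[(*plugp)->nr++] = pageno;
        }

        /* submit everything queued so far as one combined request */
        static void read_unplug(struct plug *plug)
        {
                if (!plug)
                        return;
                printf("submitting %d pages in one request\n", plug->nr);
                free(plug);
        }

        int main(void)
        {
                struct plug *plug = NULL;        /* caller-held state */

                for (int i = 0; i < 8; i++)
                        read_page(i, &plug);     /* state stored via *plugp */
                read_unplug(plug);               /* one combined submission */
                return 0;
        }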

    Link: https://lkml.kernel.org/r/164859778127.29473.14059420492644907783.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen f942ace7a2 mm: create new mm/swap.h header file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 014bb1de4fc17d54907d54418126a9a9736f4aff
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:47 2022 -0700

    mm: create new mm/swap.h header file

    Patch series "MM changes to improve swap-over-NFS support".

    Assorted improvements for swap-via-filesystem.

    This is a resend of these patches, rebased on current HEAD.  The only
    substantial changes is that swap_dirty_folio has replaced
    swap_set_page_dirty.

    Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
    has previously worked for NFS but that broke a few releases back.  This
    series changes to use a new ->swap_rw rather than ->readpage and
    ->direct_IO.  It also makes other improvements.

    There is a companion series already in linux-next which fixes various
    issues with NFS.  Once both series land, a final patch is needed which
    changes NFS over to use ->swap_rw.

    This patch (of 10):

    Many functions declared in include/linux/swap.h are only used within mm/

    Create a new "mm/swap.h" and move some of these declarations there.
    Remove the redundant 'extern' from the function declarations.

    [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
    Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
    Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen 764d797fb9 mm,fs: Remove aops->readpage
Conflicts: mm/filemap.c - We already have
	176042404ee6 ("mm: add PSI accounting around ->read_folio and ->readahead calls")
	so just replace the logic to call either read_page or read_folio
	with an unconditional call to read_folio

Bugzilla: https://bugzilla.redhat.com/2160210

commit 7e0a126519b82648b254afcd95a168c15f65ea40
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Apr 29 11:53:28 2022 -0400

    mm,fs: Remove aops->readpage

    With all implementations of aops->readpage converted to aops->read_folio,
    we can stop checking whether it's set and remove the member from aops.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:59 -04:00
Chris von Recklinghausen c48611f141 mm: avoid unnecessary page fault retires on shared memory types
Bugzilla: https://bugzilla.redhat.com/2160210

commit d92725256b4f22d084b813b37ddc394da79aacab
Author: Peter Xu <peterx@redhat.com>
Date:   Mon May 30 14:34:50 2022 -0400

    mm: avoid unnecessary page fault retires on shared memory types

    I observed that for each of the shared file-backed page faults, we're very
    likely to retry one more time for the 1st write fault upon no page.  It's
    because we'll need to release the mmap lock for dirty rate limit purpose
    with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

    Then after that throttling we return VM_FAULT_RETRY.

    We did that probably because VM_FAULT_RETRY is the only way we can return
    to the fault handler at that time telling it we've released the mmap lock.

    However that's not ideal because it's very likely the fault does not need
    to be retried at all since the pgtable was well installed before the
    throttling, so the next continuous fault (including taking mmap read lock,
    walk the pgtable, etc.) could be in most cases unnecessary.

    This not only slows down page faults for shared file-backed memory, but
    also adds more mmap lock contention, which is in most cases not needed at
    all.

    To observe this, one could try to write to some shmem page and look at
    "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
    shmem write simply because we retried, and vm event "pgfault" will capture
    that.

    To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
    show that we've completed the whole fault and released the lock.  It's also
    a hint that we should very possibly not need another fault immediately on
    this page because we've just completed it.
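
    On the arch side a fault handler only needs to recognize the new return
    code; a sketch of the pattern (simplified and generic, not a specific
    architecture's diff):

        fault = handle_mm_fault(vma, address, flags, regs);

        /* the fault is fully completed and the mmap lock has already been
         * released for us: nothing left to do, and no retry needed */
        if (fault & VM_FAULT_COMPLETED)
                return;

        if (fault & VM_FAULT_RETRY) {
                flags |= FAULT_FLAG_TRIED;
                goto retry;
        }

        mmap_read_unlock(mm);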

    This patch provides a ~12% perf boost on my aarch64 test VM with a simple
    program sequentially dirtying a 400MB mmap()ed shmem file; these are the
    times it needs:

      Before: 650.980 ms (+-1.94%)
      After:  569.396 ms (+-1.38%)

    I believe it could help more than that.

    We need some special care on GUP and the s390 pgfault handler (for gmap
    code before returning from pgfault), the rest changes in the page fault
    handlers should be relatively straightforward.

    Another thing to mention is that mm_account_fault() does take this new
    fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

    I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
    not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
    them as-is.

    Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vineet Gupta <vgupta@kernel.org>
    Acked-by: Guo Ren <guoren@kernel.org>
    Acked-by: Max Filippov <jcmvbkbc@gmail.com>
    Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>    [arm part]
    Acked-by: Heiko Carstens <hca@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Stafford Horne <shorne@gmail.com>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Johannes Berg <johannes@sipsolutions.net>
    Cc: Brian Cain <bcain@quicinc.com>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Janosch Frank <frankja@linux.ibm.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Jonas Bonn <jonas@southpole.se>
    Cc: Will Deacon <will@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Simek <monstr@monstr.eu>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Rich Felker <dalias@libc.org>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Yoshinori Sato <ysato@users.osdn.me>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:32 -04:00
Frantisek Hrbata 9ebb8c46cd Merge: mm: Proactive Fixes for 9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1566

The following patchset contains a number of fixes from upstream, ordered by upstream commit order.

The patches have been found using `Fixes` or contain some reference to fixing a commit we currently have backported.

These changes have been tested through our KT1+ test suite, and showed improvements in stability and fewer errors.

Omitted-fix: 28148a17c988 ("openrisc: Fix pagewalk usage in arch_dma_{clear, set}_uncached")

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 02:40:23 -05:00
Nico Pache 90270e6e77 mm: fix page leak with multiple threads mapping the same page
commit 3fe2895cfecd03ac74977f32102b966b6589f481
Author: Josef Bacik <josef@toxicpanda.com>
Date:   Tue Jul 5 16:00:36 2022 -0400

    mm: fix page leak with multiple threads mapping the same page

    We have an application with a lot of threads that use a shared mmap backed
    by tmpfs mounted with -o huge=within_size.  This application started
    leaking loads of huge pages when we upgraded to a recent kernel.

    Using the page ref tracepoints and a BPF program written by Tejun Heo we
    were able to determine that these pages would have multiple refcounts from
    the page fault path, but when it came to unmap time we wouldn't drop the
    number of refs we had added from the faults.

    I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
    huge=always, and then spawned 20 threads all looping faulting random
    offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge
    page aligned ranges.  This very quickly reproduced the problem.

    The problem here is that we check for the case that we have multiple
    threads faulting in a range that was previously unmapped.  One thread
    maps the PMD, the other thread loses the race and then returns 0.
    However at this point we already have the page, and we are no longer
    putting this page into the process's address space, and so we leak the
    page.  We actually did the correct thing prior to f9ce0be71d, however it
    looks like Kirill copied what we do in the anonymous page case.  In the
    anonymous page case we don't yet have a page, so we don't have to drop a
    reference on anything.  Previously we did the correct thing for
    file-based faults by returning VM_FAULT_NOPAGE so we correctly dropped
    the reference on the page we faulted in.

    Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
    case; this makes us drop the ref on the page properly, and now my
    reproducer no longer leaks the huge pages.
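
    Conceptually the change in finish_fault() is tiny (a sketch of its
    shape, not the literal diff):

        /* another thread raced us and already installed a huge/devmap PMD */
        if (pmd_devmap_trans_unstable(vmf->pmd))
                /*
                 * Return VM_FAULT_NOPAGE instead of 0 so the caller drops
                 * the reference it took on the faulted-in page, closing
                 * the leak described above.
                 */
                return VM_FAULT_NOPAGE;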

    [josef@toxicpanda.com: v2]
      Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com
    Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com
    Fixes: f9ce0be71d ("mm: Cleanup faultaround and finish_fault() codepaths")
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:40 -07:00
Nico Pache 6cb8042742 mm: split huge PUD on wp_huge_pud fallback
commit 14c99d65941538aa33edd8dc7b1bbbb593c324a2
Author: Gowans, James <jgowans@amazon.com>
Date:   Thu Jun 23 05:24:03 2022 +0000

    mm: split huge PUD on wp_huge_pud fallback

    Currently the implementation splits the PUD when a fallback is taken
    inside the create_huge_pud function.  This isn't where it should be done:
    the splitting should be done in wp_huge_pud, just like it's done for
    PMDs.  The reason is that if a fallback is taken during create, there is
    no PUD yet so there is nothing to split, whereas if a fallback is taken
    when encountering a write protection fault there is something to split.

    It looks like this was the original intention with the commit where the
    splitting was introduced, but somehow it got moved to the wrong place
    between v1 and v2 of the patch series.  Rebase mistake perhaps.

    Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
    Fixes: 327e9fd489 ("mm: Split huge pages on write-notify or COW")
    Signed-off-by: James Gowans <jgowans@amazon.com>
    Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Jan H. Schönherr <jschoenh@amazon.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Nico Pache c59a06cb0a mm: simplify follow_invalidate_pte()
commit 0e5e64c0b0d7bb33a5e971ad17e771cf6e0a1127
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Thu Apr 28 23:16:10 2022 -0700

    mm: simplify follow_invalidate_pte()

    The only user (DAX) of the range and pmdpp parameters of
    follow_invalidate_pte() is gone, so it is safe to remove them and make it
    static to simplify the code.  This effectively reverts the following
    commits:

      0979639595 ("mm: add follow_pte_pmd()")
      a4d1a88525 ("dax: update to new mmu_notifier semantic")

    There is only one caller of follow_invalidate_pte().  So just fold it
    into follow_pte() and remove it.

    Link: https://lkml.kernel.org/r/20220403053957.10770-7-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ross Zwisler <zwisler@kernel.org>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:36 -07:00
Waiman Long 0b0a448353 mm/vmstat: add events for ksm cow
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138950

commit 94bfe85bde18a99612bb5d0ce348643c2d352836
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Thu, 28 Apr 2022 23:16:16 -0700

    mm/vmstat: add events for ksm cow

    Users may use ksm by calling madvise(, , MADV_MERGEABLE) when they want
    to save memory; it's a tradeoff that incurs delay on ksm COW.  Users can
    find out how much memory ksm has saved by reading
    /sys/kernel/mm/ksm/pages_sharing, but they don't know the cost of ksm
    COW, which is important for some delay-sensitive tasks.

    So add ksm COW events to help users evaluate whether or how to use ksm.
    Also update Documentation/admin-guide/mm/ksm.rst with the newly added
    events.

    Link: https://lkml.kernel.org/r/20220331035616.2390805-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: xu xin <xu.xin16@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Saravanan D <saravanand@fb.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-10-31 19:33:16 -04:00
Chris von Recklinghausen 5915bca234 mm: fix NULL pointer dereference in wp_page_reuse()
Bugzilla: https://bugzilla.redhat.com/2120352

commit cdb281e63874086a650552d36c504ea717a0e0cb
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Tue Jul 26 14:24:36 2022 +0800

    mm: fix NULL pointer dereference in wp_page_reuse()

    The vmf->page can be NULL when wp_page_reuse() is invoked by
    wp_pfn_shared(), which will cause the following panic:

      BUG: kernel NULL pointer dereference, address: 000000000000008
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 18 PID: 923 Comm: Xorg Not tainted 5.19.0-rc8.bm.1-amd64 #263
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g14
      RIP: 0010:_compound_head+0x0/0x40
      [...]
      Call Trace:
        wp_page_reuse+0x1c/0xa0
        do_wp_page+0x1a5/0x3f0
        __handle_mm_fault+0x8cf/0xd20
        handle_mm_fault+0xd5/0x2a0
        do_user_addr_fault+0x1d0/0x680
        exc_page_fault+0x78/0x170
        asm_exc_page_fault+0x22/0x30

    To fix it, this patch performs a NULL pointer check before dereferencing
    the vmf->page.

    Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:12 -04:00
Chris von Recklinghausen d6859c7858 mm/hugetlb: handle uffd-wp during fork()
Bugzilla: https://bugzilla.redhat.com/2120352

commit bc70fbf269fdff410b0b6d75c3770b9f59117b90
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:55 2022 -0700

    mm/hugetlb: handle uffd-wp during fork()

    Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range()
    because for uffd-wp it's the dst vma that matters when deciding how we
    should treat uffd-wp protected ptes.

    We should recognize pte markers during fork and do the pte copy if needed.

    [lkp@intel.com: vma_needs_copy can be static]
      Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
    Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:12 -04:00
Chris von Recklinghausen 6348a03958 mm/shmem: handle uffd-wp during fork()
Bugzilla: https://bugzilla.redhat.com/2120352

commit c56d1b62cce83695823c13e52f73e92eb568c0c1
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:53 2022 -0700

    mm/shmem: handle uffd-wp during fork()

    Normally we skip copying pages at fork() for VM_SHARED shmem, but we
    can't skip it anymore if uffd-wp is enabled on the dst vma.  This should
    only happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a
    uffd-wp shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma
    too; then we should copy the pgtables with the uffd-wp bit and pte
    markers, because this information will be lost otherwise.

    Since the condition checks will become even more complicated for deciding
    "whether a vma needs to copy the pgtable during fork()", introduce a
    helper vma_needs_copy() for it, so everything will be clearer.
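
    A sketch of the kind of checks such a helper concentrates (condensed
    from the descriptions in this series, not the exact upstream body):

        static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                                   struct vm_area_struct *src_vma)
        {
                /* uffd-wp on the destination needs the markers copied */
                if (userfaultfd_wp(dst_vma))
                        return true;

                /* mappings with no backing page cache to refault from */
                if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                        return true;

                /* anonymous memory always needs its page tables copied */
                if (src_vma->anon_vma)
                        return true;

                return false;
        }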

    Link: https://lkml.kernel.org/r/20220405014855.14468-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:12 -04:00
Chris von Recklinghausen 5f2529c672 mm/swap: remember PG_anon_exclusive via a swp pte bit
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1493a1913e34b0ac366e33f9ebad721e69fd06ac
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm/swap: remember PG_anon_exclusive via a swp pte bit

    Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.

    This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
    | FOLL_GET) was taken on an anonymous page and the COW logic failed to
    detect exclusivity of the page, replacing the anonymous page with a copy
    in the page table: the GUP reference lost synchronicity with the pages
    mapped into the page tables.  This series focuses on x86, arm64, s390x
    and ppc64/book3s -- other architectures are fairly easy to support by
    implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.

    This primarily fixes the O_DIRECT memory corruptions that can happen on
    concurrent swapout, whereby we lose DMA reads to a page (modifying the
    user page by writing to it).

    O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
    from/to a user page.  In the long run, we want to convert it to properly
    use FOLL_PIN, and John is working on it, but that might take a while and
    might not be easy to backport.  In the meantime, let's restore what used
    to work before we started modifying our COW logic: make R/W FOLL_GET
    references reliable as long as there is no fork() after GUP involved.

    This is just the natural follow-up of part 2, that will also further
    reduce "wrong COW" on the swapin path, for example, when we cannot remove
    a page from the swapcache due to concurrent writeback, or if we have two
    threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
    nice side-product.

    This issue, including other related COW issues, has been summarized in [3]
    under 2):
    "
      2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)

      It was discovered that we can create a memory corruption by reading a
      file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
      concurrently writing to an unrelated part (e.g., last byte) of the same
      page, and concurrently write-protecting the page via clear_refs
      SOFTDIRTY tracking [6].

      For the reproducer, the issue is that O_DIRECT grabs a reference of the
      target page (via FOLL_GET) and clear_refs write-protects the relevant
      page table entry. On successive write access to the page from the
      process itself, we wrongly COW the page when resolving the write fault,
      resulting in a loss of synchronicity and consequently a memory corruption.

      While some people might think that using clear_refs in this combination
      is a corner case, it turns out to be a more generic problem, unfortunately.

      For example, it was just recently discovered that we can similarly
      create a memory corruption without clear_refs, simply by concurrently
      swapping out the buffer pages [7]. Note that we nowadays even use the
      swap infrastructure in Linux without an actual swap disk/partition: the
      prime example is zram which is enabled as default under Fedora [10].

      The root issue is that a write-fault on a page that has additional
      references results in a COW and thereby a loss of synchronicity
      and consequently a memory corruption if two parties believe they are
      referencing the same page.
    "

    We don't particularly care about R/O FOLL_GET references: they were never
    reliable and O_DIRECT doesn't expect to observe modifications from a page
    after DMA was started.

    Note that:
    * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
      ("enterprise architectures"). Other architectures have to implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
    * this does *not* consider any kind of fork() after taking the reference:
      fork() after GUP never worked reliably with FOLL_GET.
    * Not losing PG_anon_exclusive during swapout was the last remaining
      piece. KSM already makes sure that there are no other references on
      a page before considering it for sharing. Page migration maintains
      PG_anon_exclusive and simply fails when there are additional references
      (freezing the refcount fails). Only swapout code dropped the
      PG_anon_exclusive flag because it requires more work to remember +
      restore it.

    With this series in place, most COW issues of [3] are fixed on said
    architectures. Other architectures can implement
    __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.

    [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
    [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
    [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

    This patch (of 8):

    Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
    it.  We do this, to keep fork() logic on swap entries easy and efficient:
    for example, if we wouldn't clear it when unmapping, we'd have to lookup
    the page in the swapcache for each and every swap entry during fork() and
    clear PG_anon_exclusive if set.

    Instead, we want to store that information directly in the swap pte,
    protected by the page table lock, similarly to how we handle
    SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
    swap entries, we don't want to mess with the swap type (e.g., still one
    bit) because it overcomplicates swap code.

    In try_to_unmap(), we already refuse to unmap in case the page might be
    pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
    Checking if there are other unexpected references reliably *before*
    completely unmapping a page is unfortunately not really possible: THP
    heavily overcomplicates the situation.  Once fully unmapped it's easier --
    we, for example, make sure that there are no unexpected references *after*
    unmapping a page before starting writeback on that page.

    So, we currently might end up unmapping a page and clearing
    PG_anon_exclusive if that page has additional references, for example, due
    to a FOLL_GET.

    do_swap_page() has to re-determine if a page is exclusive, which will
    easily fail if there are other references on a page, most prominently GUP
    references via FOLL_GET.  This can currently result in memory corruptions
    when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
    is never involved: try_to_unmap() will succeed, and when refaulting the
    page, it cannot be marked exclusive and will get replaced by a copy in the
    page tables on the next write access, resulting in writes via the GUP
    reference to the page being lost.

    In an ideal world, everybody that uses GUP and wants to modify page
    content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
    conversion will take a while.  It's easier to fix what used to work in the
    past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
    by remembering PG_anon_exclusive we can further reduce unnecessary COW in
    some cases, so it's the natural thing to do.

    So let's transfer the PG_anon_exclusive information to the swap pte and
    store it via an architecture-dependent pte bit; use that information when
    restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
    we simply have to clear the pte bit and are done.

    Of course, there is one corner case to handle: swap backends that don't
    support concurrent page modifications while the page is under writeback.
    Special case these, and drop the exclusive marker.  Add a comment why that
    is just fine (also, reuse_swap_page() would have done the same in the
    past).

    In the future, we'll hopefully have all architectures support
    __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
    and the define completely.  Then, we can also convert
    SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
    support: either simply use a yet unused pte bit that can be used for swap
    entries, steal one from the arch type bits if they exceed 5, or steal one
    from the offset bits.
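
    As a rough, userspace-only illustration -- the bit position, value
    layout and helper names below are invented and are not the kernel
    implementation -- a single spare bit in a swap pte value can carry the
    exclusive marker and get cleared again during fork():

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical layout: bit 1 of a 64-bit swap pte is unused. */
        #define SWP_PTE_EXCLUSIVE_BIT   ((uint64_t)1 << 1)

        static uint64_t swp_pte_mark_exclusive(uint64_t swp_pte)
        {
                return swp_pte | SWP_PTE_EXCLUSIVE_BIT;
        }

        static uint64_t swp_pte_clear_exclusive_bit(uint64_t swp_pte)
        {
                return swp_pte & ~SWP_PTE_EXCLUSIVE_BIT;
        }

        static bool swp_pte_is_exclusive(uint64_t swp_pte)
        {
                return swp_pte & SWP_PTE_EXCLUSIVE_BIT;
        }

        int main(void)
        {
                uint64_t swp_pte = 0xabc0;      /* fake swap type + offset */

                /* try_to_unmap(): remember PG_anon_exclusive in the pte. */
                swp_pte = swp_pte_mark_exclusive(swp_pte);
                printf("exclusive after unmap: %d\n", swp_pte_is_exclusive(swp_pte));

                /* fork(): the child's copy is possibly shared. */
                swp_pte = swp_pte_clear_exclusive_bit(swp_pte);
                printf("exclusive after fork:  %d\n", swp_pte_is_exclusive(swp_pte));
                return 0;
        }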

    Note: R/O FOLL_GET references were never really reliable, especially when
    taking one on a shared page and then writing to the page (e.g., GUP after
    fork()).  FOLL_GET, including R/W references, were never really reliable
    once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
    steps back in case it stumbles over unexpected references and is,
    therefore, fine.

    [david@redhat.com: fix SWP_STABLE_WRITES test]
      Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.com
    Link: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jann Horn <jannh@google.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 5b02d5be5f mm: support GUP-triggered unsharing of anonymous pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit c89357e27f20dda3fff6791d27bb6c91eae99f4a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm: support GUP-triggered unsharing of anonymous pages

    Whenever GUP currently ends up taking a R/O pin on an anonymous page that
    might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
    on the page table entry will end up replacing the mapped anonymous page
    due to COW, resulting in the GUP pin no longer being consistent with the
    page actually mapped into the page table.

    The possible ways to deal with this situation are:
     (1) Ignore and pin -- what we do right now.
     (2) Fail to pin -- which would be rather surprising to callers and
         could break user space.
     (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
         pins.

    We want to implement 3) because it provides the clearest semantics and
    allows for checking in unpin_user_pages() and friends for possible BUGs:
    when trying to unpin a page that's no longer exclusive, clearly something
    went very wrong and might result in memory corruptions that might be hard
    to debug.  So we better have a nice way to spot such issues.

    To implement 3), we need a way for GUP to trigger unsharing:
    FAULT_FLAG_UNSHARE.  FAULT_FLAG_UNSHARE is only applicable to R/O mapped
    anonymous pages and resembles COW logic during a write fault.  However, in
    contrast to a write fault, GUP-triggered unsharing will, for example,
    still maintain the write protection.

    Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write
    fault handlers for all applicable anonymous page types: ordinary pages,
    THP and hugetlb.

    * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
      marked exclusive in the meantime by someone else, there is nothing to do.
    * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
      marked exclusive, it will try detecting if the process is the exclusive
      owner. If exclusive, it can be set exclusive similar to reuse logic
      during write faults via page_move_anon_rmap() and there is nothing
      else to do; otherwise, we either have to copy and map a fresh,
      anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
      THP.
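
    As a compact sketch of the two cases above -- made-up types and helper
    names, not the kernel code, just the decision logic:

        #include <stdbool.h>

        enum unshare_action {
                UNSHARE_NOTHING_TO_DO,   /* already exclusive                  */
                UNSHARE_MARK_EXCLUSIVE,  /* sole owner: set PG_anon_exclusive  */
                UNSHARE_COPY_OR_SPLIT,   /* possibly shared: copy R/O or split */
        };

        struct fake_page {
                bool anon;
                bool anon_exclusive;
                int  unaccounted_refs;   /* stand-in for the real ownership check */
        };

        /* What FAULT_FLAG_UNSHARE has to do for a R/O-mapped anon page. */
        static enum unshare_action handle_unshare(struct fake_page *page)
        {
                if (!page->anon || page->anon_exclusive)
                        return UNSHARE_NOTHING_TO_DO;
                if (page->unaccounted_refs == 0) {
                        /* Reuse path: mark exclusive but keep write protection. */
                        page->anon_exclusive = true;
                        return UNSHARE_MARK_EXCLUSIVE;
                }
                return UNSHARE_COPY_OR_SPLIT;
        }

        int main(void)
        {
                struct fake_page page = { .anon = true };

                return handle_unshare(&page) == UNSHARE_MARK_EXCLUSIVE ? 0 : 1;
        }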

    This commit is heavily based on patches by Andrea.

    Link: https://lkml.kernel.org/r/20220428083441.37290-16-david@redhat.com
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 30e9a2455a mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c287605fd56466e645693eff3ae7c08fba56e0a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

    Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
    exclusive, and use that information to make GUP pins reliable and stay
    consistent with the page mapped into the page table even if the page table
    entry gets write-protected.

    With that information at hand, we can extend our COW logic to always reuse
    anonymous pages that are exclusive.  For anonymous pages that might be
    shared, the existing logic applies.

    As already documented, PG_anon_exclusive is usually only expressive in
    combination with a page table entry.  Especially PTE vs.  PMD-mapped
    anonymous pages require more thought, some examples: due to mremap() we
    can easily have a single compound page PTE-mapped into multiple page
    tables exclusively in a single process -- multiple page table locks apply.
    Further, due to MADV_WIPEONFORK we might not necessarily write-protect
    all PTEs, and only some subpages might be pinned.  Long story short: once
    PTE-mapped, we have to track information about exclusivity per sub-page,
    but until then, we can just track it for the compound page in the head
    page and avoid having to update a whole bunch of subpages all of the time
    for a simple PMD mapping of a THP.

    For simplicity, this commit mostly talks about "anonymous pages", while
    it's for THP actually "the part of an anonymous folio referenced via a
    page table entry".

    To not spill PG_anon_exclusive code all over the mm code-base, we let the
    anon rmap code handle all the PG_anon_exclusive logic it can easily handle.

    If a writable, present page table entry points at an anonymous (sub)page,
    that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
    pin (FOLL_PIN) on an anonymous page referenced via a present page table
    entry, it must only pin if PG_anon_exclusive is set for the mapped
    (sub)page.

    This commit doesn't adjust GUP, so this is only implicitly handled for
    FOLL_WRITE, follow-up commits will teach GUP to also respect it for
    FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
    reliable.

    Whenever an anonymous page is to be shared (fork(), KSM), or when
    temporarily unmapping an anonymous page (swap, migration), the relevant
    PG_anon_exclusive bit has to be cleared to mark the anonymous page
    possibly shared.  Clearing will fail if there are GUP pins on the page:

    * For fork(), this means having to copy the page and not being able to
      share it.  fork() protects against concurrent GUP using the PT lock and
      the src_mm->write_protect_seq.

    * For KSM, this means sharing will fail.  For swap, this means unmapping
      will fail; for migration, this means migration will fail early.  All
      three cases protect against concurrent GUP using the PT lock and a
      proper clear/invalidate+flush of the relevant page table entry; a
      sketch of the try-to-share rule follows below.
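
    A toy model of the try-to-share rule shared by all three cases --
    hypothetical structure and helper names; the real kernel check is more
    involved, this only captures the idea of backing off when references
    that might be GUP pins are detected:

        #include <stdbool.h>

        struct fake_anon_page {
                int  expected_refs;     /* mappings, swapcache, ...            */
                int  actual_refs;       /* total refs, including possible pins */
                bool anon_exclusive;
        };

        /*
         * Try to mark the page possibly shared (clear the exclusive marker).
         * Fail if there are references we cannot account for: they might be
         * GUP pins, so fork() copies and KSM/swapout/migration back off.
         */
        static bool fake_try_share_anon_rmap(struct fake_anon_page *page)
        {
                if (page->actual_refs > page->expected_refs)
                        return false;
                page->anon_exclusive = false;
                return true;
        }

        int main(void)
        {
                struct fake_anon_page pinned = {
                        .expected_refs = 1, .actual_refs = 2, .anon_exclusive = true,
                };

                return fake_try_share_anon_rmap(&pinned) ? 1 : 0; /* expect failure */
        }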

    This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
    pinned page gets mapped R/O and the successive write fault ends up
    replacing the page instead of reusing it.  It improves the situation for
    O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
    fork() is *not* involved, however swapout and fork() are still
    problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
    users will fix the issue for them.

    I. Details about basic handling

    I.1. Fresh anonymous pages

    page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
    given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
    the mechanism fresh anonymous pages come into life (besides migration code
    where we copy the page->mapping), all fresh anonymous pages will start out
    as exclusive.

    I.2. COW reuse handling of anonymous pages

    When a COW handler stumbles over a (sub)page that's marked exclusive, it
    simply reuses it.  Otherwise, the handler tries harder under page lock to
    detect if the (sub)page is exclusive and can be reused.  If exclusive,
    page_move_anon_rmap() will mark the given (sub)page exclusive.

    Note that hugetlb code does not yet check for PageAnonExclusive(), as it
    still uses the old COW logic that is prone to the COW security issue
    because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
    pages are a scarce resource.

    I.3. Migration handling

    try_to_migrate() has to try marking an exclusive anonymous page shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  migrate_vma_collect_pmd() and
    __split_huge_pmd_locked() are handled similarly.

    Writable migration entries implicitly point at shared anonymous pages.
    For readable migration entries that information is stored via a new
    "readable-exclusive" migration entry, specific to anonymous pages.

    When restoring a migration entry in remove_migration_pte(), information
    about exclusivity is detected via the migration entry type, and
    RMAP_EXCLUSIVE is set accordingly for
    page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.

    I.4. Swapout handling

    try_to_unmap() has to try marking the mapped page possibly shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  For now, information about exclusivity is lost.  In
    the future, we might want to remember that information in the swap entry
    in some cases, however, it requires more thought, care, and a way to store
    that information in swap entries.

    I.5. Swapin handling

    do_swap_page() will never stumble over exclusive anonymous pages in the
    swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
    to detect manually if an anonymous page is exclusive and has to set
    RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

    I.6. THP handling

    __split_huge_pmd_locked() has to move the information about exclusivity
    from the PMD to the PTEs.

    a) In case we have a readable-exclusive PMD migration entry, simply
       insert readable-exclusive PTE migration entries.

    b) In case we have a present PMD entry and we don't want to freeze
       ("convert to migration entries"), simply forward PG_anon_exclusive to
       all sub-pages, no need to temporarily clear the bit.

    c) In case we have a present PMD entry and want to freeze, handle it
       similar to try_to_migrate(): try marking the page shared first.  In
       case we fail, we ignore the "freeze" instruction and simply split
       ordinarily.  try_to_migrate() will properly fail because the THP is
       still mapped via PTEs.

    When splitting a compound anonymous folio (THP), the information about
    exclusivity is implicitly handled via the migration entries: no need to
    replicate PG_anon_exclusive manually.

    I.7. fork() handling

    fork() handling is relatively easy, because PG_anon_exclusive is only
    expressive for some page table entry types.

    a) Present anonymous pages

    page_try_dup_anon_rmap() will mark the given subpage shared -- which will
    fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
    PMD to handle it on the PTE level).

    Note that device exclusive entries are just a pointer at a PageAnon()
    page.  fork() will first convert a device exclusive entry to a present
    page table and handle it just like present anonymous pages.

    b) Device private entry

    Device private entries point at PageAnon() pages that cannot be mapped
    directly and, therefore, cannot get pinned.

    page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
    fail because they cannot get pinned.

    c) HW poison entries

    PG_anon_exclusive will remain untouched and is stale -- the page table
    entry is just a placeholder after all.

    d) Migration entries

    Writable and readable-exclusive entries are converted to readable entries:
    possibly shared.

    I.8. mprotect() handling

    mprotect() only has to properly handle the new readable-exclusive
    migration entry:

    When write-protecting a migration entry that points at an anonymous page,
    remember the information about exclusivity via the "readable-exclusive"
    migration entry type.

    II. Migration and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a migration entry, we have to mark the page possibly
    shared and synchronize against GUP-fast by a proper clear/invalidate+flush
    to make the following scenario impossible:

    1. try_to_migrate() places a migration entry after checking for GUP pins
       and marks the page possibly shared.

    2. GUP-fast pins the page due to lack of synchronization

    3. fork() converts the "writable/readable-exclusive" migration entry into a
       readable migration entry

    4. Migration fails due to the GUP pin (failing to freeze the refcount)

    5. Migration entries are restored. PG_anon_exclusive is lost

    -> We have a pinned page that is not marked exclusive anymore.

    Note that we move information about exclusivity from the page to the
    migration entry as it otherwise highly overcomplicates fork() and
    PTE-mapping a THP.
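
    The required ordering, written out with stub helpers -- names invented,
    single-threaded, so it only documents the order of the steps, not the
    actual synchronization:

        #include <stdbool.h>

        /* Stubs standing in for the real operations. */
        static void clear_and_flush_pte(void)        { }
        static bool page_might_be_pinned(void)       { return false; }
        static void mark_page_possibly_shared(void)  { }
        static void install_migration_entry(void)    { }
        static void restore_present_pte(void)        { }

        /*
         * Replace a present PTE mapping an exclusive anon page by a migration
         * entry.  The PTE is cleared and flushed *before* the pin check;
         * checking first would let GUP-fast grab a reference through the
         * still-present PTE afterwards, reproducing the race above.
         */
        static bool migrate_one_pte(void)
        {
                clear_and_flush_pte();             /* stop GUP-fast first        */
                if (page_might_be_pinned()) {
                        restore_present_pte();     /* back off, keep exclusivity */
                        return false;
                }
                mark_page_possibly_shared();       /* clear PG_anon_exclusive    */
                install_migration_entry();         /* entry type keeps the info  */
                return true;
        }

        int main(void)
        {
                return migrate_one_pte() ? 0 : 1;
        }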

    III. Swapout and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a swap entry, we have to mark the page possibly shared
    and synchronize against GUP-fast by a proper clear/invalidate+flush to
    make the following scenario impossible:

    1. try_to_unmap() places a swap entry after checking for GUP pins and
       clears exclusivity information on the page.

    2. GUP-fast pins the page due to lack of synchronization.

    -> We have a pinned page that is not marked exclusive anymore.

    If we'd ever store information about exclusivity in the swap entry,
    similar to migration handling, the same considerations as in II would
    apply.  This is future work.

    Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 39e0d6a152 mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c54dc6c74371eebf7eddc16b4f64b8c841c1585
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively

    We want to mark anonymous pages exclusive, and when using
    page_move_anon_rmap() we know that we are the exclusive user, as properly
    documented.  This is a preparation for marking anonymous pages exclusive
    in page_move_anon_rmap().

    In both instances, we're holding page lock and are sure that we're the
    exclusive owner (page_count() == 1).  hugetlb already properly uses
    page_move_anon_rmap() in the write fault handler.

    Note that in case of a PTE-mapped THP, we'll only end up calling this
    function if the whole THP is only referenced by the single PTE mapping a
    single subpage (page_count() == 1); consequently, it's fine to modify the
    compound page mapping inside page_move_anon_rmap().

    Link: https://lkml.kernel.org/r/20220428083441.37290-10-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen f0a431e143 mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40f2bbf71161fa9195c7869004290003af152375
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()

    New anonymous pages are always mapped natively: only THP/khugepaged code
    maps a new compound anonymous page and passes "true".  Otherwise, we're
    just dealing with simple, non-compound pages.

    Let's give the interface clearer semantics and document these.  Remove the
    PageTransCompound() sanity check from page_add_new_anon_rmap().

    Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen ab8c3870a8 mm/rmap: remove do_page_add_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit f1e2db12e45baaa2d366f87c885968096c2ff5aa
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: remove do_page_add_anon_rmap()

    ... and instead convert page_add_anon_rmap() to accept flags.

    Passing flags instead of bools is usually nicer either way, and we want to
    more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
    that an anonymous page is exclusive: for example, when restoring an
    anonymous page from a writable migration entry.

    This is a preparation for marking an anonymous page inside
    page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.

    Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen b52eeebf78 mm/rmap: convert RMAP flags to a proper distinct rmap_t type
Bugzilla: https://bugzilla.redhat.com/2120352

commit 14f9135d547060d1d0c182501201f8da19895fe3
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: convert RMAP flags to a proper distinct rmap_t type

    We want to pass the flags to more than one anon rmap function, getting rid
    of special "do_page_add_anon_rmap()".  So let's pass around a distinct
    __bitwise type and refine documentation.

    Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen d8f21270d3 mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit fb3d824d1a46c5bb0584ea88f32dc2495544aebf
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()

    ...  and move the special check for pinned pages into
    page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
    via a new pageflag, clearing it only after making sure that there are no
    GUP pins on the anonymous page.

    We really only care about pins on anonymous pages, because they are prone
    to getting replaced in the COW handler once mapped R/O.  For !anon pages
    in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
    that; at least I could not come up with an example where it matters.

    Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
    know we're dealing with anonymous pages.  Also, drop the handling of
    pinned pages from copy_huge_pud() and add a comment if ever supporting
    anonymous pages on the PUD level.

    This is a preparation for tracking exclusivity of anonymous pages in the
    rmap code, and disallowing marking a page shared (-> failing to duplicate)
    if there are GUP pins on a page.

    Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 85f85f728e mm/memory: slightly simplify copy_present_pte()
Bugzilla: https://bugzilla.redhat.com/2120352

commit b51ad4f8679e50284ce35ff671767f8f0309b64a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:42 2022 -0700

    mm/memory: slightly simplify copy_present_pte()

    Let's move the pinning check into the caller, to simplify return code
    logic and prepare for further changes: relocating the
    page_needs_cow_for_dma() into rmap handling code.

    While at it, remove the unused pte parameter and simplify the comments a
    bit.

    No functional change intended.

    Link: https://lkml.kernel.org/r/20220428083441.37290-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 6697b528b0 mm: handling Non-LRU pages returned by vm_normal_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3218f8712d6bba1812efd5e0d66c1e15134f2a91
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:11 2022 -0500

    mm: handling Non-LRU pages returned by vm_normal_pages

    With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
    device-managed anonymous pages that are not LRU pages.  Although they
    behave like normal pages for purposes of mapping in CPU page, and for COW.
    They do not support LRU lists, NUMA migration or THP.

    Callers to follow_page() currently don't expect ZONE_DEVICE pages,
    however, with DEVICE_COHERENT we might now return ZONE_DEVICE.  Check for
    ZONE_DEVICE pages in applicable users of follow_page() as well.

    Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>       [v2]
    Reviewed-by: Alistair Popple <apopple@nvidia.com>       [v6]
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen b4381e605e mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 78fbe906cc900b33ce078102e13e0e99b5b8c406
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages

    The basic question we would like to have a reliable and efficient answer
    to is: is this anonymous page exclusive to a single process or might it be
    shared?  We need that information for ordinary/single pages, hugetlb
    pages, and possibly each subpage of a THP.

    Introduce a way to mark an anonymous page as exclusive, with the ultimate
    goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
    lose consistency with the pages mapped into the page table, resulting in
    reported memory corruptions.

    Most pageflags already have semantics for anonymous pages, however,
    PG_mappedtodisk should never apply to pages in the swapcache, so let's
    reuse that flag.

    As PG_has_hwpoisoned also uses that flag on the second tail page of a
    compound page, convert it to PG_error instead, which is marked as
    PF_NO_TAIL, so never used for tail pages.

    Use custom page flag modification functions such that we can do additional
    sanity checks.  The semantics we'll put into some kernel doc in the future
    are:

    "
      PG_anon_exclusive is *usually* only expressive in combination with a
      page table entry. Depending on the page table entry type it might
      store the following information:

           Is what's mapped via this page table entry exclusive to the
           single process and can be mapped writable without further
           checks? If not, it might be shared and we might have to COW.

      For now, we only expect PTE-mapped THPs to make use of
      PG_anon_exclusive in subpages. For other anonymous compound
      folios (i.e., hugetlb), only the head page is logically mapped and
      holds this information.

      For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
      set on the head page. When replacing the PMD by a page table full
      of PTEs, PG_anon_exclusive, if set on the head page, will be set on
      all tail pages accordingly. Note that converting from a PTE-mapping
      to a PMD mapping using the same compound page is currently not
      possible and consequently doesn't require care.

      If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
      it should only pin if the relevant PG_anon_exclusive is set. In that
      case, the pin will be fully reliable and stay consistent with the pages
      mapped into the page table, as the bit cannot get cleared (e.g., by
      fork(), KSM) while the page is pinned. For anonymous pages that
      are mapped R/W, PG_anon_exclusive can be assumed to always be set
      because such pages cannot possibly be shared.

      The page table lock protecting the page table entry is the primary
      synchronization mechanism for PG_anon_exclusive; GUP-fast that does
      not take the PT lock needs special care when trying to clear the
      flag.

      Page table entry types and PG_anon_exclusive:
      * Present: PG_anon_exclusive applies.
      * Swap: the information is lost. PG_anon_exclusive was cleared.
      * Migration: the entry holds this information instead.
                   PG_anon_exclusive was cleared.
      * Device private: PG_anon_exclusive applies.
      * Device exclusive: PG_anon_exclusive applies.
      * HW Poison: PG_anon_exclusive is stale and not changed.

      If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
      not allowed and the flag will stick around until the page is freed
      and folio->mapping is cleared.
    "

    We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
    zapping) of page table entries; page freeing code will handle that when it
    also invalidates page->mapping to not indicate PageAnon() anymore.  Letting
    information about exclusivity stick around will be an important property
    when adding sanity checks to unpinning code.

    Note that we properly clear the flag in free_pages_prepare() via
    PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
    so there is no need to manually clear the flag.
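
    A toy model of such custom flag modification functions -- made-up names
    and plain bit operations on a fake structure, not the real page-flags
    machinery -- including the kind of sanity check mentioned above:

        #include <assert.h>
        #include <stdbool.h>

        #define FAKE_PG_ANON            (1ul << 0)
        #define FAKE_PG_ANON_EXCLUSIVE  (1ul << 1)  /* would alias PG_mappedtodisk */

        struct fake_page { unsigned long flags; };

        static void fake_set_anon_exclusive(struct fake_page *page)
        {
                /* Sanity check: only anonymous pages may carry the marker. */
                assert(page->flags & FAKE_PG_ANON);
                page->flags |= FAKE_PG_ANON_EXCLUSIVE;
        }

        static void fake_clear_anon_exclusive(struct fake_page *page)
        {
                page->flags &= ~FAKE_PG_ANON_EXCLUSIVE;
        }

        static bool fake_is_anon_exclusive(const struct fake_page *page)
        {
                return page->flags & FAKE_PG_ANON_EXCLUSIVE;
        }

        int main(void)
        {
                struct fake_page page = { .flags = FAKE_PG_ANON };

                fake_set_anon_exclusive(&page);
                fake_clear_anon_exclusive(&page);
                return fake_is_anon_exclusive(&page) ? 1 : 0;
        }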

    Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:08 -04:00
Chris von Recklinghausen c24618f5e1 mm,hwpoison: unmap poisoned page before invalidation
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3149c79f3cb0e2e3bafb7cfadacec090cbd250d3
Author: Rik van Riel <riel@surriel.com>
Date:   Fri Apr 1 11:28:42 2022 -0700

    mm,hwpoison: unmap poisoned page before invalidation

    In some cases it appears the invalidation of a hwpoisoned page fails
    because the page is still mapped in another process.  This can cause a
    program to be continuously restarted and die when it page faults on the
    page that was not invalidated.  Avoid that problem by unmapping the
    hwpoisoned page when we find it.

    Another issue is that sometimes we end up oopsing in finish_fault, if
    the code tries to do something with the now-NULL vmf->page.  I did not
    hit this error when submitting the previous patch because there are
    several opportunities for alloc_set_pte to bail out before accessing
    vmf->page, and that apparently happened on those systems, and most of
    the time on other systems, too.

    However, across several million systems that error does occur a handful
    of times a day.  It can be avoided by returning VM_FAULT_NOPAGE which
    will cause do_read_fault to return before calling finish_fault.

    Link: https://lkml.kernel.org/r/20220325161428.5068d97e@imladris.surriel.com
    Fixes: e53ac7374e64 ("mm: invalidate hwpoison page cache page in fault path")
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:04 -04:00
Chris von Recklinghausen 70d93dcf3b mm: streamline COW logic in do_swap_page()
Bugzilla: https://bugzilla.redhat.com/2120352

commit c145e0b47c77ebeefdfd73dbb344577b2fc9b065
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Mar 24 18:13:40 2022 -0700

    mm: streamline COW logic in do_swap_page()

    Currently we have a different COW logic when:
    * triggering a read-fault to swapin first and then trigger a write-fault
      -> do_swap_page() + do_wp_page()
    * triggering a write-fault to swapin
      -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()

    The COW logic in do_swap_page() is different than our reuse logic in
    do_wp_page().  The COW logic in do_wp_page() -- page_count() == 1 -- makes
    currently sure that we certainly don't have a remaining reference, e.g.,
    via GUP, on the target page we want to reuse: if there is any unexpected
    reference, we have to copy to avoid information leaks.

    As do_swap_page() behaves differently, in environments with swap enabled
    we can currently have an unintended information leak from the parent to
    the child, similar as known from CVE-2020-29374:

            1. Parent writes to anonymous page
            -> Page is mapped writable and modified
            2. Page is swapped out
            -> Page is unmapped and replaced by swap entry
            3. fork()
            -> Swap entries are copied to child
            4. Child pins page R/O
            -> Page is mapped R/O into child
            5. Child unmaps page
            -> Child still holds GUP reference
            6. Parent writes to page
            -> Page is reused in do_swap_page()
            -> Child can observe changes

    Exchanging 2. and 3. should have the same effect.

    Let's apply the same COW logic as in do_wp_page(), conditionally trying to
    remove the page from the swapcache after freeing the swap entry, however,
    before actually mapping our page.  We can change the order now that we use
    try_to_free_swap(), which doesn't care about the mapcount, instead of
    reuse_swap_page().

    To handle references from the LRU pagevecs, conditionally drain the local
    LRU pagevecs when required; however, don't consider the page_count() when
    deciding whether to drain, to keep it simple for now.

    Link: https://lkml.kernel.org/r/20220131162940.210846-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:03 -04:00
Chris von Recklinghausen a0cd9664b0 mm: slightly clarify KSM logic in do_swap_page()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 84d60fdd3733fb86c126f2adfd0361fdc44087c3
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Mar 24 18:13:37 2022 -0700

    mm: slightly clarify KSM logic in do_swap_page()

    Let's make it clearer that KSM might only have to copy a page in case we
    have a page in the swapcache, not if we allocated a fresh page and
    bypassed the swapcache.  While at it, add a comment why this is usually
    necessary and merge the two swapcache conditions.

    [akpm@linux-foundation.org: fix comment, per David]

    Link: https://lkml.kernel.org/r/20220131162940.210846-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:03 -04:00
Chris von Recklinghausen 36be0c7d65 mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
Bugzilla: https://bugzilla.redhat.com/2120352

commit d4c470970d45c863fafc757521a82be2f80b1232
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Mar 24 18:13:34 2022 -0700

    mm: optimize do_wp_page() for fresh pages in local LRU pagevecs

    For example, if a page just got swapped in via a read fault, the LRU
    pagevecs might still hold a reference to the page.  If we trigger a write
    fault on such a page, the additional reference from the LRU pagevecs will
    prohibit reusing the page.

    Let's conditionally drain the local LRU pagevecs when we stumble over a
    !PageLRU() page.  We cannot easily drain remote LRU pagevecs and it might
    not be desirable performance-wise.  Consequently, this will only avoid
    copying in some cases.

    Add a simple "page_count(page) > 3" check first but keep the
    "page_count(page) > 1 + PageSwapCache(page)" check in place, as we want to
    minimize cases where we remove a page from the swapcache but won't be able
    to reuse it, for example, because another process has it mapped R/O, to
    not affect reclaim.
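
    A simplified, self-contained version of these checks -- the real code
    operates on struct page and re-reads page_count() after draining; here
    everything is passed in explicitly and the names are made up:

        #include <stdbool.h>

        struct wp_reuse_state {
                int  page_count;
                bool on_lru;
                bool in_swapcache;
        };

        static void drain_local_lru_pagevecs(void)
        {
                /* stand-in for lru_add_drain() */
        }

        /* Return true if reuse may be attempted instead of copying. */
        static bool wp_can_try_reuse(struct wp_reuse_state *s)
        {
                /* Cheap first check: clearly too many references. */
                if (s->page_count > 3)
                        return false;

                /* Maybe the extra reference sits in the local LRU pagevecs. */
                if (!s->on_lru) {
                        drain_local_lru_pagevecs();
                        s->page_count--;   /* pretend the pagevec ref is gone */
                        s->on_lru = true;
                }

                /* One mapping plus, at most, the swapcache reference. */
                return s->page_count <= 1 + (s->in_swapcache ? 1 : 0);
        }

        int main(void)
        {
                struct wp_reuse_state s = {
                        .page_count = 3, .on_lru = false, .in_swapcache = true,
                };

                return wp_can_try_reuse(&s) ? 0 : 1;
        }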

    We cannot easily handle the following cases and we will always have to
    copy:

    (1) The page is referenced in the LRU pagevecs of other CPUs. We really
        would have to drain the LRU pagevecs of all CPUs -- most probably
        copying is much cheaper.

    (2) The page is already PageLRU() but is getting moved between LRU
        lists, for example, for activation (e.g., mark_page_accessed()),
        deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
        drain mostly unconditionally, which might be bad performance-wise.
        Most probably this won't happen too often in practice.

    Note that there are other reasons why an anon page might temporarily not
    be PageLRU(): for example, compaction and migration have to isolate LRU
    pages from the LRU lists first (isolate_lru_page()), moving them to
    temporary local lists and clearing PageLRU() and holding an additional
    reference on the page.  In that case, we'll always copy.

    This change seems to be fairly effective with the reproducer [1] shared by
    Nadav, as long as writeback is done synchronously, for example, using
    zram.  However, with asynchronous writeback, we'll usually fail to free
    the swapcache because the page is still under writeback: something we
    cannot easily optimize for, and maybe it's not really relevant in
    practice.

    [1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

    Link: https://lkml.kernel.org/r/20220131162940.210846-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:03 -04:00
Chris von Recklinghausen 6798eaa677 mm: optimize do_wp_page() for exclusive pages in the swapcache
Bugzilla: https://bugzilla.redhat.com/2120352

commit 53a05ad9f21d858d24f76d12b3e990405f2036d1
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Mar 24 18:13:30 2022 -0700

    mm: optimize do_wp_page() for exclusive pages in the swapcache

    Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.

    This series attempts to optimize and streamline the COW logic for ordinary
    anon pages and THP anon pages, fixing two remaining instances of
    CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information
    can leak from a parent process to a child process via anonymous pages
    shared during fork().

    This issue, including other related COW issues, has been summarized in [2]:

     "1. Observing Memory Modifications of Private Pages From A Child Process

      Long story short: process-private memory might not be as private as you
      think once you fork(): successive modifications of private memory
      regions in the parent process can still be observed by the child
      process, for example, by smart use of vmsplice()+munmap().

      The core problem is that pinning pages readable in a child process, such
      as done via the vmsplice system call, can result in a child process
      observing memory modifications done in the parent process the child is
      not supposed to observe. [1] contains an excellent summary and [2]
      contains further details. This issue was assigned CVE-2020-29374 [9].

      For this to trigger, it's required to use a fork() without subsequent
      exec(), for example, as used under Android zygote. Without further
      details about an application that forks less-privileged child processes,
      one cannot really say what's actually affected and what's not -- see the
      details section at the end of this mail for a short sshd/openssh analysis.

      While commit 17839856fd ("gup: document and work around "COW can break
      either way" issue") fixed this issue and resulted in other problems
      (e.g., ptrace on pmem), commit 09854ba94c ("mm: do_wp_page()
      simplification") re-introduced part of the problem unfortunately.

      The original reproducer can be modified quite easily to use THP [3] and
      make the issue appear again on upstream kernels. I modified it to use
      hugetlb [4] and it triggers as well. The problem is certainly less
      severe with hugetlb than with THP; it merely highlights that we still
      have plenty of open holes we should be closing/fixing.

      Regarding vmsplice(), the only known workaround is to disallow the
      vmsplice() system call ... or disable THP and hugetlb. But who knows
      what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
      the end, it's a more generic issue"

    This security issue was first reported by Jann Horn on 27 May 2020 and it
    currently affects anonymous pages during swapin, anonymous THP and hugetlb.
    This series tackles anonymous pages during swapin and anonymous THP:

     - do_swap_page() for handling COW on PTEs during swapin directly

     - do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
       faults

    With this series, we'll apply the same COW logic we have in do_wp_page()
    to all swappable anon pages: don't reuse (map writable) the page in
    case there are additional references (page_count() != 1). All users of
    reuse_swap_page() are removed, and consequently reuse_swap_page() is
    removed.

    In general, we're struggling with the following COW-related issues:

    (1) "missed COW": we miss to copy on write and reuse the page (map it
        writable) although we must copy because there are pending references
        from another process to this page. The result is a security issue.

    (2) "wrong COW": we copy on write although we wouldn't have to and
        shouldn't: if there are valid GUP references, they will become out
        of sync with the pages mapped into the page table. We fail to detect
        that such a page can be reused safely, especially if never more than
        a single process mapped the page. The result is an intra process
        memory corruption.

    (3) "unnecessary COW": we copy on write although we wouldn't have to:
        performance degradation and temporarily increased swap+memory
        consumption can be the result.

    While this series fixes (1) for swappable anon pages, it first tries to
    reduce reported cases of (3) as far as reasonably possible, to limit the
    impact when streamlining.  The individual patches try to describe in
    which cases we will run into (3).

    This series certainly makes (2) worse for THP, because a THP will now
    get PTE-mapped on write faults if there are additional references, even
    if there was only ever a single process involved: once PTE-mapped, we'll
    copy each and every subpage and won't reuse any subpage as long as the
    underlying compound page wasn't split.

    I'm working on an approach to fix (2) and improve (3): PageAnonExclusive
    to mark anon pages that are exclusive to a single process, allow GUP
    pins only on such exclusive pages, and allow turning exclusive pages
    shared (clearing PageAnonExclusive) only if there are no GUP pins.  Anon
    pages with PageAnonExclusive set never have to be copied during write
    faults, but eventually during fork() if they cannot be turned shared.
    The improved reuse logic in this series will essentially also be the
    logic to reset PageAnonExclusive.  This work will certainly take a
    while, but I'm planning on sharing details before having code fully
    ready.

    #1-#5 can be applied independently of the rest. #6-#9 are mostly only
    cleanups related to reuse_swap_page().

    Notes:
    * For now, I'll leave hugetlb code untouched: "unnecessary COW" might
      easily break existing setups because hugetlb pages are a scarce resource
      and we could just end up having to crash the application when we run out
      of hugetlb pages. We have to be very careful and the security aspect with
      hugetlb is most certainly less relevant than for unprivileged anon pages.
    * Instead of lru_add_drain() we might actually just drain the lru_add list
      or even just remove the single page of interest from the lru_add list.
      This would require a new helper function, and could be added if the
      conditional lru_add_drain() turns out to be a problem.
    * I extended the test case already included in [1] to also test for the
      newly found do_swap_page() case. I'll send that out separately once/if
      this part is merged.

    [1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
    [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

    This patch (of 9):

    Liang Zhang reported [1] that the current COW logic in do_wp_page() is
    sub-optimal when it comes to swap+read fault+write fault of anonymous
    pages that have a single user, visible via a performance degradation in
    the redis benchmark.  Something similar was previously reported [2] by
    Nadav with a simple reproducer.

    After we put an anon page into the swapcache and unmapped it from a single
    process, that process might read that page again and refault it read-only.
    If that process then writes to that page, the process is actually the
    exclusive user of the page; however, the COW logic in do_wp_page() won't
    be able to reuse it due to the additional reference from the swapcache.

    Let's optimize for pages that have been added to the swapcache but only
    have an exclusive user.  Try removing the swapcache reference if there is
    hope that we're the exclusive user.

    We will fail removing the swapcache reference in two scenarios:
    (1) There are additional swap entries referencing the page: copying
        instead of reusing is the right thing to do.
    (2) The page is under writeback: theoretically we might be able to reuse
        in some cases, however, we cannot remove the additional reference
        and will have to copy.

    Note that we'll only try removing the page from the swapcache when it's
    highly likely that we'll be the exclusive owner after removing the page
    from the swapcache.  As we're about to map that page writable and redirty
    it, that should not affect reclaim but is rather the right thing to do.

    Further, we might have additional references from the LRU pagevecs, which
    will force us to copy instead of being able to reuse.  We'll try handling
    such references for some scenarios next.  Concurrent writeback cannot be
    handled easily and we'll always have to copy.

    While at it, remove the superfluous page_mapcount() check: it's
    implicitly covered by the page_count() for ordinary anon pages.
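
    A hedged sketch of that check (condensed; the real logic sits inline in
    do_wp_page() and the caller holds the page lock):

            /* Try to drop the swapcache reference if we look exclusive. */
            static bool try_reuse_swapcache_page(struct page *page)
            {
                    /* One ref from our page table plus one from the swapcache. */
                    if (PageSwapCache(page) && !PageWriteback(page) &&
                        page_count(page) == 2)
                            try_to_free_swap(page);   /* may fail, e.g. other swap entries */
                    return page_count(page) == 1;     /* exclusive -> safe to reuse */
            }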

    [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
    [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

    Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: Liang Zhang <zhangliang5@huawei.com>
    Reported-by: Nadav Amit <nadav.amit@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:02 -04:00
Chris von Recklinghausen b6778dceeb userfaultfd: provide unmasked address on page-fault
Bugzilla: https://bugzilla.redhat.com/2120352

commit 824ddc601adc2cc48efb7f58b57997986c1c1276
Author: Nadav Amit <namit@vmware.com>
Date:   Tue Mar 22 14:45:32 2022 -0700

    userfaultfd: provide unmasked address on page-fault

    Userfaultfd is supposed to provide the full address (i.e., unmasked) of
    the faulting access back to userspace.  However, that has not been the
    case for quite some time.

    Even running "userfaultfd_demo" from the userfaultfd man page provides the
    wrong output (and contradicts the man page).  Notice that
    "UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
    not the first read address (0x7fc5e30b300f).

            Address returned by mmap() = 0x7fc5e30b3000

            fault_handler_thread():
                poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
                UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
                    (uffdio_copy.copy returned 4096)
            Read address 0x7fc5e30b300f in main(): A
            Read address 0x7fc5e30b340f in main(): A
            Read address 0x7fc5e30b380f in main(): A
            Read address 0x7fc5e30b3c0f in main(): A

    The exact address is useful for various reasons and specifically for
    prefetching decisions.  If it is known that the memory is populated by
    certain objects whose size is not page-aligned, then based on the faulting
    address, the uffd-monitor can decide whether to prefetch and prefault the
    adjacent page.

    This bug has been in the kernel for quite some time: since commit
    1a29d85eb0 ("mm: use vmf->address instead of vmf->virtual_address"),
    which dates back to 2016.  A concern has been raised that existing
    userspace applications might rely on the old/wrong
    behavior in which the address is masked.  Therefore, it was suggested to
    provide the masked address unless the user explicitly asks for the exact
    address.

    Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
    userfaultfd to provide the exact address.  Add a new "real_address" field
    to vmf to hold the unmasked address.  Provide the address to userspace
    accordingly.
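
    A userspace usage sketch (hedged; error handling trimmed) of the new
    feature bit:

            #include <fcntl.h>
            #include <linux/userfaultfd.h>
            #include <sys/ioctl.h>
            #include <sys/syscall.h>
            #include <unistd.h>

            static int open_uffd_exact(void)
            {
                    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
                    struct uffdio_api api = {
                            .api = UFFD_API,
                            /* ask for the unmasked faulting address */
                            .features = UFFD_FEATURE_EXACT_ADDRESS,
                    };

                    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
                            return -1;
                    /* msg.arg.pagefault.address now carries the exact address */
                    return uffd;
            }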

    Initialize real_address in various code-paths to be consistent with
    address, even when it is not used, to be on the safe side.

    [namit@vmware.com: initialize real_address on all code paths, per Jan]
      Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
    [akpm@linux-foundation.org: fix typo in comment, per Jan]

    Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen c6556701d5 mm: invalidate hwpoison page cache page in fault path
Bugzilla: https://bugzilla.redhat.com/2120352

commit e53ac7374e64dede04d745ff0e70ff5048378d1f
Author: Rik van Riel <riel@surriel.com>
Date:   Tue Mar 22 14:44:09 2022 -0700

    mm: invalidate hwpoison page cache page in fault path

    Sometimes the page offlining code can leave behind a hwpoisoned clean
    page cache page.  This can lead to programs being killed over and over
    and over again as they fault in the hwpoisoned page, get killed, and
    then get re-spawned by whatever wanted to run them.

    This is particularly embarrassing when the page was offlined due to
    having too many corrected memory errors.  Now we are killing tasks due
    to them trying to access memory that probably isn't even corrupted.

    This problem can be avoided by invalidating the page from the page fault
    handler, which already has a branch for dealing with these kinds of
    pages.  With this patch we simply pretend the page fault was successful
    if the page was invalidated, return to userspace, incur another page
    fault, read in the file from disk (to a new memory page), and then
    everything works again.
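
    A hedged sketch of the fault-path handling described above (approximating
    the __do_fault()-side check; names are the usual mm helpers):

            /* Clean hwpoisoned page cache page: drop it and retry the fault. */
            if (unlikely(PageHWPoison(page))) {
                    vm_fault_t poisonret = VM_FAULT_HWPOISON;

                    if (ret & VM_FAULT_LOCKED) {
                            /* Retry if a clean page was removed from the cache. */
                            if (invalidate_inode_page(page))
                                    poisonret = VM_FAULT_NOPAGE;
                            unlock_page(page);
                    }
                    put_page(page);
                    return poisonret;
            }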

    Link: https://lkml.kernel.org/r/20220212213740.423efcea@imladris.surriel.com
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 4bd01448e8 mm/memory.c: use helper macro min and max in unmap_mapping_range_tree()
Bugzilla: https://bugzilla.redhat.com/2120352

commit f9871da927437dc85bc3fec206fc9bfddea4a34b
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:42:33 2022 -0700

    mm/memory.c: use helper macro min and max in unmap_mapping_range_tree()

    Use the helper macros min and max to simplify the code logic.  Minor
    readability improvement.

    Link: https://lkml.kernel.org/r/20220224121134.35068-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:50 -04:00
Chris von Recklinghausen ee5a47ce9b mm/memory.c: use helper function range_in_vma()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 88a359125a2b8f2437f09ab3b1af4815c89690d4
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:42:30 2022 -0700

    mm/memory.c: use helper function range_in_vma()

    Use the helper function range_in_vma() to check whether [address,
    address + size) lies within the vma range.  Minor readability improvement.
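
    For reference, the helper reads roughly like this (sketch of the
    include/linux/mm.h definition):

            static inline bool range_in_vma(struct vm_area_struct *vma,
                                            unsigned long start, unsigned long end)
            {
                    return (vma && vma->vm_start <= start && end <= vma->vm_end);
            }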

    Link: https://lkml.kernel.org/r/20220219021441.29173-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:50 -04:00
Chris von Recklinghausen ea2fa2fb80 uaccess: remove CONFIG_SET_FS
Conflicts: in arch/, only keep changes to arch/Kconfig and
	arch/arm64/kernel/traps.c. All other arch files in the upstream version
	of this patch are dropped.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 967747bbc084b93b54e66f9047d342232314cd25
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Feb 11 21:42:45 2022 +0100

    uaccess: remove CONFIG_SET_FS

    There are no remaining callers of set_fs(), so CONFIG_SET_FS
    can be removed globally, along with the thread_info field and
    any references to it.

    This turns access_ok() into a cheaper check against TASK_SIZE_MAX.

    As CONFIG_SET_FS is now gone, drop all remaining references to
    set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().
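
    A hedged sketch of the cheaper check once set_fs() is gone (approximating
    the asm-generic form):

            static inline int __access_ok(const void __user *ptr, unsigned long size)
            {
                    unsigned long limit = TASK_SIZE_MAX;
                    unsigned long addr  = (unsigned long)ptr;

                    /* A plain range check; no per-thread address limit any more. */
                    return (size <= limit) && (addr <= (limit - size));
            }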

    Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
    Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
    Acked-by: Dinh Nguyen <dinguyen@kernel.org>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:45 -04:00
Chris von Recklinghausen 1c0357bc09 delayacct: support swapin delay accounting for swapping without blkio
Bugzilla: https://bugzilla.redhat.com/2120352

commit a3d5dc908a5f572ce3e31fe83fd2459a1c3c5422
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Wed Jan 19 18:10:02 2022 -0800

    delayacct: support swapin delay accounting for swapping without blkio

    Currently delayacct accounts swapin delay only for swapping that causes
    blkio.  If we use zram for swapping, tools/accounting/getdelays can't
    get any SWAP delay.

    It's useful to get zram swapin delay information, for example to adjust
    compress algorithm or /proc/sys/vm/swappiness.

    Taking PSI as a reference: it accounts any kind of swapping by doing its
    work in swap_readpage(), no matter whether the swapping causes blkio.  Let
    delayacct do similar work.
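
    A hedged sketch of where the accounting hooks end up; do_swap_readpage()
    is a placeholder for the real I/O path:

            int swap_readpage(struct page *page, bool synchronous)
            {
                    int ret;

                    /* Account the whole swapin, whether or not it hits blkio. */
                    delayacct_swapin_start();
                    ret = do_swap_readpage(page, synchronous);
                    delayacct_swapin_end();
                    return ret;
            }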

    Link: https://lkml.kernel.org/r/20211112083813.8559-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Reported-by: Zeal Robot <zealci@zte.com.cn>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:42 -04:00
Chris von Recklinghausen 6dd7d3d455 mm: remove last argument of reuse_swap_page()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 020e87650af9f43683546729f959fdc78422a4b7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 14 14:06:44 2022 -0800

    mm: remove last argument of reuse_swap_page()

    None of the callers care about the total_map_swapcount() any more.

    Link: https://lkml.kernel.org/r/20211220205943.456187-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen ac3988cadd mm: move tlb_flush_pending inline helpers to mm_inline.h
Bugzilla: https://bugzilla.redhat.com/2120352

commit 36090def7bad06a6346f86a7cfdbfda2d138cb64
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Jan 14 14:06:10 2022 -0800

    mm: move tlb_flush_pending inline helpers to mm_inline.h

    linux/mm_types.h should only define structure definitions, to make it
    cheap to include elsewhere.  The atomic_t helper function definitions
    are particularly large, so it's better to move the helpers using those
    into the existing linux/mm_inline.h and only include that where needed.

    As a follow-up, we may want to go through all the indirect includes in
    mm_types.h and reduce them as much as possible.

    Link: https://lkml.kernel.org/r/20211207125710.2503446-2-arnd@kernel.org
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Colin Cross <ccross@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen e4096b4cd5 mm/memory.c: avoid unnecessary kernel/user pointer conversion
Bugzilla: https://bugzilla.redhat.com/2120352

commit b063e374e7ae75589a36f4bcbfb47956ac197c57
Author: Amit Daniel Kachhap <amit.kachhap@arm.com>
Date:   Fri Nov 5 13:38:18 2021 -0700

    mm/memory.c: avoid unnecessary kernel/user pointer conversion

    Annotating a pointer from __user to kernel and then back again might
    confuse sparse.  In copy_huge_page_from_user() it can be avoided by
    removing the intermediate variable since it is never used.

    Link: https://lkml.kernel.org/r/20210914150820.19326-1-amit.kachhap@arm.com
    Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen d311f2f0fb mm: change page type prior to adding page table entry
Conflicts:
* mm/memory.c: minor fuzz due to missing upstream commit (and related series)
  b756a3b5e7 ("mm: device exclusive memory access")

* mm/migrate.c: minor fuzz due to backport commits
  86b6e00a7b ("mm/migrate: Convert remove_migration_ptes() to folios")
  e462accf60 ("mm/munlock: protect the per-CPU pagevec by a local_lock_t")

Bugzilla: https://bugzilla.redhat.com/2120352

commit 1eba86c096e35e3cc83de1ad2c26f2d70470211b
Author: Pasha Tatashin <pasha.tatashin@soleen.com>
Date:   Fri Jan 14 14:06:29 2022 -0800

    mm: change page type prior to adding page table entry

    Patch series "page table check", v3.

    Ensure that some memory corruptions are prevented by checking at the
    time of insertion of entries into user page tables that there is no
    illegal sharing.

    We have recently found a problem [1] that existed in kernel since 4.14.
    The problem was caused by a broken page ref count and led to memory
    leaking from one process into another.  The problem was accidentally
    detected by studying a dump of one process and noticing that one page
    contains memory that should not belong to this process.

    There are some other page->_refcount related problems that were recently
    fixed: [2], [3] which potentially could also lead to illegal sharing.

    In addition to hardening refcount [4] itself, this work is an attempt to
    prevent this class of memory corruption issues.

    It uses a simple state machine that is independent from regular MM logic
    to check for illegal sharing at the time pages are inserted into and
    removed from page tables.

    [1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com
    [2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com
    [3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com
    [4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com

    This patch (of 4):

    There are a few places where we first update the entry in the user page
    table, and later change the struct page to indicate that this is
    anonymous or file page.

    In most places, however, we first configure the page metadata and then
    insert entries into the page table.  Page table check will use the
    information from struct page to verify the type of entry being inserted.

    Change the order in all places to first update struct page, and later to
    update page table.

    This means that we first do calls that may change the type of page (anon
    or file):

            page_move_anon_rmap
            page_add_anon_rmap
            do_page_add_anon_rmap
            page_add_new_anon_rmap
            page_add_file_rmap
            hugepage_add_anon_rmap
            hugepage_add_new_anon_rmap

    And after that do calls that add entries to the page table:

            set_huge_pte_at
            set_pte_at
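
    For illustration, a trimmed do_anonymous_page()-style sequence after the
    reordering (a hedged sketch, not the full fault path):

            /* 1) First make struct page say "anonymous" ... */
            page_add_new_anon_rmap(page, vma, addr, false);
            lru_cache_add_inactive_or_unevictable(page, vma);
            /* 2) ... and only then expose the mapping via the page table,
             *    so a page table check can validate entry against page type. */
            set_pte_at(vma->vm_mm, addr, vmf->pte, entry);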

    Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com
    Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Will Deacon <will@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Masahiro Yamada <masahiroy@kernel.org>
    Cc: Sami Tolvanen <samitolvanen@google.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:13 -04:00
Chris von Recklinghausen baf876910d mm: remove redundant smp_wmb()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ed33b5a677da33d6e8f959879bb61e9791b80354
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Nov 5 13:38:41 2021 -0700

    mm: remove redundant smp_wmb()

    The smp_wmb() which is in the __pte_alloc() is used to ensure all ptes
    setup is visible before the pte is made visible to other CPUs by being
    put into page tables.  We only need this when the pte is actually
    populated, so move it to pmd_install().  __pte_alloc_kernel(),
    __p4d_alloc(), __pud_alloc() and __pmd_alloc() are similar to this case.

    We can also defer smp_wmb() to the place where the pmd entry is really
    populated by preallocated pte.  There are two kinds of user of
    preallocated pte, one is filemap & finish_fault(), another is THP.  The
    former does not need another smp_wmb() because the smp_wmb() has been
    done by pmd_install().  Fortunately, the latter also does not need
    another smp_wmb() because there is already a smp_wmb() before populating
    the new pte when the THP uses a preallocated pte to split a huge pmd.
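
    A hedged sketch of pmd_install() after this change; the barrier is issued
    only when the preallocated pte table is actually published:

            void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
            {
                    spinlock_t *ptl = pmd_lock(mm, pmd);

                    if (likely(pmd_none(*pmd))) {   /* Has another populated it? */
                            mm_inc_nr_ptes(mm);
                            /* All pte setup must be visible before the pmd is. */
                            smp_wmb();
                            pmd_populate(mm, pmd, *pte);
                            *pte = NULL;
                    }
                    spin_unlock(ptl);
            }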

    Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mika Penttila <mika.penttila@nextfour.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:08 -04:00
Chris von Recklinghausen 207adffc8b mm: introduce pmd_install() helper
Bugzilla: https://bugzilla.redhat.com/2120352

commit 03c4f20454e0231d2cdec4373841a3a25cf4efed
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Nov 5 13:38:38 2021 -0700

    mm: introduce pmd_install() helper

    Patch series "Do some code cleanups related to mm", v3.

    This patch (of 2):

    Currently we have three times the same few lines repeated in the code.
    Deduplicate them by newly introduced pmd_install() helper.

    Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Mika Penttila <mika.penttila@nextfour.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:08 -04:00
Chris von Recklinghausen 5366f692cf Revert "mm: gup: COR: copy-on-read fault"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 9d4e1cf460.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:01 -04:00
Chris von Recklinghausen fd81b53da4 Revert "mm: gup: gup_must_unshare() use can_read_pin_swap_page()"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 32fd7f268e.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:00 -04:00
Chris von Recklinghausen c274da9bf9 Revert "mm: COW: skip the page lock in the COW copy path"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 68a2dbbfe6.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:00 -04:00
Chris von Recklinghausen 38603e7e65 Revert "mm: COW: restore full accuracy in page reuse"
Conflicts:
	include/linux/ksm.h - The backport of
		84fbbe21894b ("mm/rmap: Constify the rmap_walk_control argument"
)
		and
		2f031c6f042c ("mm/rmap: Convert rmap_walk() to take a folio")
		were added on top of 0515249880. Keep their changes.
	mm/memory.c - The backport of
		cea86fe246b6 ("mm/munlock: rmap call mlock_vma_page() munlock_vma_page()")
		removed the block starting with the comment
		"Don't let another task, with possibly unlocked....". Leave
		that block out.

Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 0515249880.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:00 -04:00
Jeffrey Layton 07a704e493 afs: Fix mmap coherency vs 3rd-party changes
Bugzilla: http://bugzilla.redhat.com/1229736

commit 6e0e99d58a6530cf65f10e4bb16630c5be6c254d
Author: David Howells <dhowells@redhat.com>
Date:   Thu Sep 2 16:43:10 2021 +0100

    afs: Fix mmap coherency vs 3rd-party changes

    Fix the coherency management of mmap'd data such that 3rd-party changes
    become visible as soon as possible after the callback notification is
    delivered by the fileserver.  This is done by the following means:

     (1) When we break a callback on a vnode specified by the CB.CallBack call
         from the server, we queue a work item (vnode->cb_work) to go and
         clobber all the PTEs mapping to that inode.

         This causes the CPU to trip through the ->map_pages() and
         ->page_mkwrite() handlers if userspace attempts to access the page(s)
         again.

         (Ideally, this would be done in the service handler for CB.CallBack,
         but the server is waiting for our reply before considering, and we
         have a list of vnodes, all of which need breaking - and the process of
         getting the mmap_lock and stripping the PTEs on all CPUs could be
         quite slow.)

     (2) Call afs_validate() from the ->map_pages() handler to check to see if
         the file has changed and to get a new callback promise from the
         server.

    Also handle the fileserver telling us that it's dropping all callbacks,
    possibly after it's been restarted by sending us a CB.InitCallBackState*
    call by the following means:

     (3) Maintain a per-cell list of afs files that are currently mmap'd
         (cell->fs_open_mmaps).

     (4) Add a work item to each server that is invoked if there are any open
         mmaps when CB.InitCallBackState happens.  This work item goes through
         the aforementioned list and invokes the vnode->cb_work work item for
         each one that is currently using this server.

         This causes the PTEs to be cleared, causing ->map_pages() or
         ->page_mkwrite() to be called again, thereby calling afs_validate()
         again.

    I've chosen to simply strip the PTEs at the point of notification reception
    rather than invalidate all the pages as well because (a) it's faster, (b)
    we may get a notification for other reasons than the data being altered (in
    which case we don't want to clobber the pagecache) and (c) we need to ask
    the server to find out - and I don't want to wait for the reply before
    holding up userspace.

    This was tested using the attached test program:

            #include <stdbool.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <unistd.h>
            #include <fcntl.h>
            #include <sys/mman.h>
            int main(int argc, char *argv[])
            {
                    size_t size = getpagesize();
                    unsigned char *p;
                    bool mod = (argc == 3);
                    int fd;
                    if (argc != 2 && argc != 3) {
                            fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
                            exit(2);
                    }
                    fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
                    if (fd < 0) {
                            perror(argv[1]);
                            exit(1);
                    }

                    p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
                             MAP_SHARED, fd, 0);
                    if (p == MAP_FAILED) {
                            perror("mmap");
                            exit(1);
                    }
                    for (;;) {
                            if (mod) {
                                    p[0]++;
                                    msync(p, size, MS_ASYNC);
                                    fsync(fd);
                            }
                            printf("%02x", p[0]);
                            fflush(stdout);
                            sleep(1);
                    }
            }

    It runs in two modes: in one mode, it mmaps a file, then sits in a loop
    reading the first byte, printing it and sleeping for a second; in the
    second mode it mmaps a file, then sits in a loop incrementing the first
    byte and flushing, then printing and sleeping.

    Two instances of this program can be run on different machines, one doing
    the reading and one doing the writing.  The reader should see the changes
    made by the writer, but without this patch they aren't, because validity
    checking is being done lazily - only on entry to the filesystem.

    Testing the InitCallBackState change is more complicated.  The server has
    to be taken offline, the saved callback state file removed and then the
    server restarted whilst the reading-mode program continues to run.  The
    client machine then has to poke the server to trigger the InitCallBackState
    call.

    Signed-off-by: David Howells <dhowells@redhat.com>
    Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
    cc: linux-afs@lists.infradead.org
    Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/

Signed-off-by: Jeffrey Layton <jlayton@redhat.com>
2022-08-22 12:31:29 -04:00
Aristeu Rozanski 6cca8f7219 mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due missing 6e0e99d58a65

commit 2c8659951654fc14c0351e33ca7604cdf670341e
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Mar 24 18:14:02 2022 -0700

    mm: unmap_mapping_range_tree() with i_mmap_rwsem shared

    Revert 48ec833b78 ("Revert "mm/memory.c: share the i_mmap_rwsem"") to
    reinstate c8475d144a ("mm/memory.c: share the i_mmap_rwsem"): the
    unmap_mapping_range family of functions do the unmapping of user pages
    (ultimately via zap_page_range_single) without modifying the interval tree
    itself, and unmapping races are necessarily guarded by page table lock,
    thus the i_mmap_rwsem should be shared in unmap_mapping_pages() and
    unmap_mapping_folio().
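
    A hedged sketch of the resulting locking in unmap_mapping_pages():

            i_mmap_lock_read(mapping);      /* shared: the tree is not modified */
            if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                    unmap_mapping_range_tree(&mapping->i_mmap, first_index,
                                             last_index, &details);
            i_mmap_unlock_read(mapping);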

    Commit 48ec833b78 was intended as a short-term measure, allowing the
    other shared lock changes into 3.19 final, before investigating three
    trinity crashes, one of which had been bisected to commit c8475d144ab:

    [1] https://lkml.org/lkml/2014/11/14/342
    https://lore.kernel.org/lkml/5466142C.60100@oracle.com/
    [2] https://lkml.org/lkml/2014/12/22/213
    https://lore.kernel.org/lkml/549832E2.8060609@oracle.com/
    [3] https://lkml.org/lkml/2014/12/9/741
    https://lore.kernel.org/lkml/5487ACC5.1010002@oracle.com/

    Two of those were Bad page states: free_pages_prepare() found PG_mlocked
    still set - almost certain to have been fixed by 4.4 commit b87537d9e2
    ("mm: rmap use pte lock not mmap_sem to set PageMlocked").  The NULL deref
    on rwsem in [2]: unclear, only happened once, not bisected to c8475d144a.

    No change to the i_mmap_lock_write() around __unmap_hugepage_range_final()
    in unmap_single_vma(): IIRC that's a special usage, helping to serialize
    hugetlbfs page table sharing, not to be dabbled with lightly.  No change
    to other uses of i_mmap_lock_write() by hugetlbfs.

    I am not aware of any significant gains from the concurrency allowed by
    this commit: it is submitted more to resolve an ancient misunderstanding.

    Link: https://lkml.kernel.org/r/e4a5e356-6c87-47b2-3ce8-c2a95ae84e20@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski 8553a69df5 mm: rework swap handling of zap_pte_range
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: page_remove_rmap() still passes vma as parameter (we already have cea86fe246b6)

commit 8018db8525947c2eeb9990a27ca0a50eecbfcd41
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Mar 22 14:42:24 2022 -0700

    mm: rework swap handling of zap_pte_range

    Clean the code up by merging the device private/exclusive swap entry
    handling with the rest, then we merge the pte clear operation too.

    The struct page pointer is defined in multiple places in the function;
    move it upward.

    free_swap_and_cache() is only useful for the !non_swap_entry() case; put it
    into the condition.

    No functional change intended.

    Link: https://lkml.kernel.org/r/20220216094810.60572-5-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski 661ab6d28a mm: change zap_details.zap_mapping into even_cows
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2e148f1e3d9af3270c602fc7571a90b297204fde
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Mar 22 14:42:21 2022 -0700

    mm: change zap_details.zap_mapping into even_cows

    Currently we have a zap_mapping pointer maintained in zap_details, when
    it is specified we only want to zap the pages that has the same mapping
    with what the caller has specified.

    But what we want to do is actually simpler: we want to skip zapping
    private (COW-ed) pages in some cases.  We can refer to
    unmap_mapping_pages() callers where we could have passed in different
    even_cows values.  The other user is unmap_mapping_folio() where we
    always want to skip private pages.

    According to Hugh, we used a mapping pointer for historical reason, as
    explained here:

      https://lore.kernel.org/lkml/391aa58d-ce84-9d4-d68d-d98a9c533255@google.com/

    Quoting partly from Hugh:

      Which raises the question again of why I did not just use a boolean flag
      there originally: aah, I think I've found why.  In those days there was a
      horrible "optimization", for better performance on some benchmark I guess,
      which when you read from /dev/zero into a private mapping, would map the zero
      page there (look up read_zero_pagealigned() and zeromap_page_range() if you
      dare).  So there was another category of page to be skipped along with the
      anon COWs, and I didn't want multiple tests in the zap loop, so checking
      check_mapping against page->mapping did both.  I think nowadays you could do
      it by checking for PageAnon page (or genuine swap entry) instead.

    This patch replaces the zap_details.zap_mapping pointer with the even_cows
    boolean, then we check it against PageAnon.
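
    After the series, the two helpers read roughly as follows (hedged sketch
    of the mm/memory.c code; the zero-page special case is omitted):

            /* Whether we should zap all COWed (private) pages too */
            static inline bool should_zap_cows(struct zap_details *details)
            {
                    /* By default, zap all pages including COWed ones */
                    if (!details)
                            return true;
                    /* Or, zap COWed pages only if the caller wants to */
                    return details->even_cows;
            }

            static inline bool should_zap_page(struct zap_details *details,
                                               struct page *page)
            {
                    /* If we can make a decision without looking at the page... */
                    if (should_zap_cows(details))
                            return true;
                    /* ...otherwise only zap non-anon (file-backed) pages */
                    return !PageAnon(page);
            }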

    Link: https://lkml.kernel.org/r/20220216094810.60572-4-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Suggested-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski fe9d8a1208 mm: rename zap_skip_check_mapping() to should_zap_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 254ab940eb017e75574afc80951eb63bb74e0d34
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Mar 22 14:42:18 2022 -0700

    mm: rename zap_skip_check_mapping() to should_zap_page()

    The previous name is against the natural way people think.  Invert the
    meaning and also the return value.  No functional change intended.

    Link: https://lkml.kernel.org/r/20220216094810.60572-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Suggested-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski 7332b66f29 mm: don't skip swap entry even if zap_details specified
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 5abfd71d936a8aefd9f9ccd299dea7a164a5d455
Author: Peter Xu <peterx@redhat.com>
Date:   Tue Mar 22 14:42:15 2022 -0700

    mm: don't skip swap entry even if zap_details specified

    Patch series "mm: Rework zap ptes on swap entries", v5.

    Patch 1 should fix a long standing bug for zap_pte_range() on
    zap_details usage.  The risk is we could have some swap entries skipped
    while we should have zapped them.

    Migration entries are not the major concern because file backed memory
    always zap in the pattern that "first time without page lock, then
    re-zap with page lock" hence the 2nd zap will always make sure all
    migration entries are already recovered.

    However, there can be issues with real swap entries getting skipped
    erroneously.  There's a reproducer provided in the commit message of patch
    1 for that.

    Patch 2-4 are cleanups that are based on patch 1.  After the whole
    patchset applied, we should have a very clean view of zap_pte_range().

    Only patch 1 needs to be backported to stable if necessary.

    This patch (of 4):

    The "details" pointer shouldn't be the token to decide whether we should
    skip swap entries.

    For example, when the callers specified details->zap_mapping==NULL, it
    means the user wants to zap all the pages (including COWed pages), then
    we need to look into swap entries because there can be private COWed
    pages that were swapped out.

    Skipping some swap entries when details is non-NULL may lead to wrongly
    leaving some of the swap entries while we should have zapped them.

    A reproducer of the problem:

    ===8<===
            #define _GNU_SOURCE         /* See feature_test_macros(7) */
            #include <stdio.h>
            #include <assert.h>
            #include <unistd.h>
            #include <sys/mman.h>
            #include <sys/types.h>

            int page_size;
            int shmem_fd;
            char *buffer;

            int main(void)
            {
                    int ret;
                    char val;

                    page_size = getpagesize();
                    shmem_fd = memfd_create("test", 0);
                    assert(shmem_fd >= 0);

                    ret = ftruncate(shmem_fd, page_size * 2);
                    assert(ret == 0);

                    buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE, shmem_fd, 0);
                    assert(buffer != MAP_FAILED);

                    /* Write private page, swap it out */
                    buffer[page_size] = 1;
                    madvise(buffer, page_size * 2, MADV_PAGEOUT);

                    /* This should drop private buffer[page_size] already */
                    ret = ftruncate(shmem_fd, page_size);
                    assert(ret == 0);
                    /* Recover the size */
                    ret = ftruncate(shmem_fd, page_size * 2);
                    assert(ret == 0);

                    /* Re-read the data, it should be all zero */
                    val = buffer[page_size];
                    if (val == 0)
                            printf("Good\n");
                    else
                            printf("BUG\n");
            }
    ===8<===

    We don't need to touch up the pmd path, because pmd never had an issue with
    swap entries.  For example, shmem pmd migration will always be split into
    pte level, and same to swapping on anonymous.

    Add another helper should_zap_cows() so that we can also check whether we
    should zap private mappings when there's no page pointer specified.

    This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
    we should do the same check upon migration entry, hwpoison entry and
    genuine swap entries too.

    To be explicit, we should still remember to keep the private entries if
    even_cows==false, and always zap them when even_cows==true.

    The issue seems to exist starting from the initial commit of git.

    [peterx@redhat.com: comment tweaks]
      Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com

    Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski 42cad881d5 mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit e763243cc6cb1fcc720ec58cfd6e7c35ae90a479
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:41:59 2022 -0700

    mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()

    userfaultfd calls copy_huge_page_from_user() which does not do any cache
    flushing for the target page.  Then the target page will be mapped to
    the user space with a different address (user address), which might have
    an alias issue with the kernel address used to copy the data from the
    user to.

    Fix this issue by flushing dcache in copy_huge_page_from_user().

    Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com
    Fixes: fa4d75c1de ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lars Persson <lars.persson@axis.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:20 -04:00
Aristeu Rozanski 4b2aa38f6e mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due lack of f4c4a3f484 and differences due RHEL-only 44740bc20b

commit cea86fe246b694a191804b47378eb9d77aefabec
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:26:39 2022 -0800

    mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

    Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
    inline functions which check (vma->vm_flags & VM_LOCKED) before calling
    mlock_page() and munlock_page() in mm/mlock.c.

    Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
    because we have understandable difficulty in accounting pte maps of THPs,
    and if passed a PageHead page, mlock_page() and munlock_page() cannot
    tell whether it's a pmd map to be counted or a pte map to be ignored.

    Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
    others, and use that to call mlock_vma_page() at the end of the page
    adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
    beginning? unimportant, but end was easier for assertions in testing).

    No page lock is required (although almost all adds happen to hold it):
    delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
    Certainly page lock did serialize with page migration, but I'm having
    difficulty explaining why that was ever important.

    Mlock accounting on THPs has been hard to define, differed between anon
    and file, involved PageDoubleMap in some places and not others, required
    clear_page_mlock() at some points.  Keep it simple now: just count the
    pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

    page_add_new_anon_rmap() callers unchanged: they have long been calling
    lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
    handling (it also checks for not VM_SPECIAL: I think that's overcautious,
    and inconsistent with other checks, that mmap_region() already prevents
    VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
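
    A hedged sketch of the new inline wrappers (mm/internal.h); the heavy
    lifting stays in mm/mlock.c:

            static inline void mlock_vma_page(struct page *page,
                                              struct vm_area_struct *vma, bool compound)
            {
                    /* Only pmd maps of THPs are counted; pte maps are ignored. */
                    if (unlikely(vma->vm_flags & VM_LOCKED) &&
                        (compound || !PageTransCompound(page)))
                            mlock_page(page);
            }

            static inline void munlock_vma_page(struct page *page,
                                                struct vm_area_struct *vma, bool compound)
            {
                    if (unlikely(vma->vm_flags & VM_LOCKED) &&
                        (compound || !PageTransCompound(page)))
                            munlock_page(page);
            }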

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4f77d61363 mm: Add unmap_mapping_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 3506659e18a61ae525f3b9b4f5af23b4b149d4db
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Nov 28 14:53:35 2021 -0500

    mm: Add unmap_mapping_folio()

    Convert both callers of unmap_mapping_page() to call unmap_mapping_folio()
    instead.  Also move zap_details from linux/mm.h to mm/memory.c

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:10 -04:00
Aristeu Rozanski 8516e2b0be mm: add zap_skip_check_mapping() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 91b61ef333cf43f96b3522a086c9ac925763d6e5
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Nov 5 13:38:34 2021 -0700

    mm: add zap_skip_check_mapping() helper

    Use the helper for the checks.  Rename "check_mapping" into
    "zap_mapping" because "check_mapping" looks like a bool but in fact it
    stores the mapping itself.  When it's set, we check the mapping (it must
    be non-NULL).  When it's cleared we skip the check, which works like the
    old way.

    Move the duplicated comments to the helper too.
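
    A hedged sketch of the helper (mm/memory.c):

            /* Skip the page if a zap_mapping is set and it doesn't match. */
            static inline bool zap_skip_check_mapping(struct zap_details *details,
                                                      struct page *page)
            {
                    if (!details || !page)
                            return false;
                    return details->zap_mapping &&
                           details->zap_mapping != page_rmapping(page);
            }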

    Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:06 -04:00
Aristeu Rozanski 4132565b67 mm: drop first_index/last_index in zap_details
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: small context conflict due lack of 6e0e99d58a6530cf

commit 232a6a1c0619d1b4d9cd8d21949b2f13821be0af
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Nov 5 13:38:31 2021 -0700

    mm: drop first_index/last_index in zap_details

    The first_index/last_index parameters in zap_details are actually only
    used in unmap_mapping_range_tree().  Meanwhile, this function is
    only called by unmap_mapping_pages() once.

    Instead of passing these two variables through the whole stack of page
    zapping code, remove them from zap_details and let them simply be
    parameters of unmap_mapping_range_tree(), which is inlined.
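
    A hedged sketch of the unmap_mapping_pages() caller after the change:

            pgoff_t first_index = start, last_index = start + nr - 1;
            struct zap_details details = { };

            details.zap_mapping = even_cows ? NULL : mapping;
            if (last_index < first_index)           /* nr == 0 means "to the end" */
                    last_index = ULONG_MAX;

            i_mmap_lock_write(mapping);
            if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                    unmap_mapping_range_tree(&mapping->i_mmap, first_index,
                                             last_index, &details);
            i_mmap_unlock_write(mapping);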

    Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:06 -04:00
Aristeu Rozanski 5607aa310d mm: clear vmf->pte after pte_unmap_same() returns
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2ca99358671ad3a2065389476701f69f39ab5e5f
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Nov 5 13:38:28 2021 -0700

    mm: clear vmf->pte after pte_unmap_same() returns

    pte_unmap_same() will always unmap the pte pointer.  After the unmap,
    vmf->pte will not be valid any more, we should clear it.

    It was safe only because no one is accessing vmf->pte after
    pte_unmap_same() returns, since the only caller of pte_unmap_same() (so
    far) is do_swap_page(), where vmf->pte will in most cases be overwritten
    very soon.

    Directly pass in vmf into pte_unmap_same() and then we can also avoid
    the long parameter list too, which should be a nice cleanup.
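
    A hedged sketch of pte_unmap_same() after this cleanup:

            static inline int pte_unmap_same(struct vm_fault *vmf)
            {
                    int same = 1;
            #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
                    if (sizeof(pte_t) > sizeof(unsigned long)) {
                            spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
                            spin_lock(ptl);
                            same = pte_same(*vmf->pte, vmf->orig_pte);
                            spin_unlock(ptl);
                    }
            #endif
                    pte_unmap(vmf->pte);
                    vmf->pte = NULL;        /* the pointer is stale after unmap */
                    return same;
            }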

    Link: https://lkml.kernel.org/r/20210915181533.11188-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:06 -04:00
Patrick Talbert f9a5b7f4d0 Merge: Scheduler RT prerequisites
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/754

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076594
Tested:  Sanity tested with scheduler stress tests.

This is a handful of commits to help the RT merge. Keeping the differences
as small as possible reduces the maintenance.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Fernando Pacheco <fpacheco@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-12 09:28:27 +02:00
Phil Auld 5d55e0afeb sched: Remove preempt_offset argument from __might_sleep()
Bugzilla: https://bugzilla.redhat.com/2076594

commit 42a387566c567603bafa1ec0c5b71c35cba83e86
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Sep 23 18:54:38 2021 +0200

    sched: Remove preempt_offset argument from __might_sleep()

    All callers hand in 0 and never will hand in anything else.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210923165358.054321586@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:33 -04:00
Aristeu Rozanski 75621ac628 mm/workingset: Convert workingset_refault() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 0995d7e568141226f10f8216aa4965e06ab5db8a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Apr 29 10:27:16 2021 -0400

    mm/workingset: Convert workingset_refault() to take a folio

    This nets us 178 bytes of savings from removing calls to compound_head.
    The three callers all grow a little, but each of them will be converted
    to use folios soon, so that's fine.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:31 -04:00
Aristeu Rozanski bddf5b2fad mm/memcg: Convert mem_cgroup_charge() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 8f425e4ed0eb3ef0b2d85a9efccf947ca6aa9b1c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 25 09:27:04 2021 -0400

    mm/memcg: Convert mem_cgroup_charge() to take a folio

    Convert all callers of mem_cgroup_charge() to call page_folio() on the
    page they're currently passing in.  Many of them will be converted to
    use folios themselves soon.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00