Commit Graph

1243 Commits

Author SHA1 Message Date
Rafael Aquini 4b5fb83182 mm: make PTE_MARKER_SWAPIN_ERROR more general
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit af19487f00f34ff8643921d7909dbb3fedc7e329
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jul 7 14:55:33 2023 -0700

    mm: make PTE_MARKER_SWAPIN_ERROR more general

    Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
    v4.

    This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
    for a detailed description of the feature.

    This patch (of 8):

    Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
    UFFDIO_POISON, so make several preparations for that:

    First, rename it to just PTE_MARKER_POISONED.  The "SWAPIN" can be
    confusing since we're going to re-use it for something not really related
    to swap.  This can be particularly confusing for things like hugetlbfs,
    which doesn't support swap whatsoever.  Also rename various related
    helper functions.

    Next, fix pte marker copying for hugetlbfs.  Previously, it would WARN on
    seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
    But, since we're going to re-use it, we want it to go ahead and copy it
    just like non-hugetlbfs memory does today.  Since the code to do this is
    more complicated now, pull it out into a helper which can be re-used in
    both places.  While we're at it, also make it slightly more explicit in
    its handling of e.g.  uffd wp markers.

    For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
    error entry, return VM_FAULT_HWPOISON.  For most cases this change doesn't
    matter, e.g.  a userspace program would receive a SIGBUS either way.  But
    for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
    box, instead of giving a SIGBUS to the hypervisor and requiring it to
    somehow inject an MCE.

    Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
    VM_FAULT_HWPOISON_LARGE in such cases.  Note that this can't happen today
    because the lack of swap support means we'll never end up with such a PTE
    anyway, but this behavior will be needed once such entries *can* show up
    via UFFDIO_POISON.

    Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
    Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:03 -04:00
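
A minimal sketch of the fault-path behavior described above. The helper name is hypothetical and this is not the upstream diff; is_pte_marker_entry() and pte_marker_get() are the existing pte-marker helpers from include/linux/swapops.h, and PTE_MARKER_POISONED is the marker this commit introduces.

    #include <linux/mm.h>
    #include <linux/swapops.h>

    /* Sketch only: after the rename, a non-hugetlb fault that finds a
     * poison marker reports hardware poisoning rather than a plain SIGBUS. */
    static vm_fault_t poison_marker_fault_sketch(swp_entry_t entry)
    {
            if (is_pte_marker_entry(entry) &&
                (pte_marker_get(entry) & PTE_MARKER_POISONED))
                    return VM_FAULT_HWPOISON;

            return VM_FAULT_SIGBUS;
    }
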
Rafael Aquini 463a4bae82 mm: fix some kernel-doc comments
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 809ef83ccb61fedc951eccf876a327e940bc412a
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Fri Jul 7 17:00:34 2023 +0800

    mm: fix some kernel-doc comments

    Add descriptions of @mm_wr_locked and @mm to silence the warnings:

    mm/memory.c:1716: warning: Function parameter or member 'mm_wr_locked' not described in 'unmap_vmas'
    mm/memory.c:5110: warning: Function parameter or member 'mm' not described in 'mm_account_fault'

    Link: https://lkml.kernel.org/r/20230707090034.125511-1-yang.lee@linux.alibaba.com
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:01 -04:00
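
For reference, these kernel-doc warnings go away once every parameter has a matching "@name:" line. A generic illustration with a hypothetical function (not the real unmap_vmas()/mm_account_fault() signatures):

    #include <linux/mm_types.h>

    /**
     * example_unmap_helper - hypothetical function, shown only for the format
     * @mm: the mm_struct being operated on
     * @mm_wr_locked: true if the caller holds mmap_lock for writing
     *
     * scripts/kernel-doc emits "Function parameter or member ... not
     * described" warnings like the ones quoted above whenever a parameter
     * lacks a matching "@name:" line such as these.
     */
    static void example_unmap_helper(struct mm_struct *mm, bool mm_wr_locked)
    {
    }
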
Rafael Aquini 64459b6e7a mm/memory: convert do_read_fault() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 22d1e68f5a23f8b068da77af6d037bc73748c6e3
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jul 6 09:38:47 2023 -0700

    mm/memory: convert do_read_fault() to use folios

    Saves one implicit call to compound_head().

    Link: https://lkml.kernel.org/r/20230706163847.403202-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:57 -04:00
Rafael Aquini a347d68286 mm/memory: convert do_shared_fault() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6f609b7e37dff1e8b2261e93da8e2e9848d5513c
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jul 6 09:38:46 2023 -0700

    mm/memory: convert do_shared_fault() to folios

    Saves three implicit calls to compound_head().

    Link: https://lkml.kernel.org/r/20230706163847.403202-3-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:56 -04:00
Rafael Aquini f1d697a32d mm/memory: convert wp_page_shared() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 5a97858b51658ccb1a20a3273eb9fedf8fcef6a5
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jul 6 09:38:45 2023 -0700

    mm/memory: convert wp_page_shared() to use folios

    Saves six implicit calls to compound_head().

    Link: https://lkml.kernel.org/r/20230706163847.403202-2-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:55 -04:00
Rafael Aquini ed9545b9ec mm/memory: convert do_page_mkwrite() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 3d243659d94fd6d521c4573ec467bacef911ccb3
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jul 6 09:38:44 2023 -0700

    mm/memory: convert do_page_mkwrite() to use folios

    Saves one implicit call to compound_head().

    Link: https://lkml.kernel.org/r/20230706163847.403202-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:54 -04:00
Rafael Aquini 3d2e42f0ff mm: use a folio in fault_dirty_shared_page()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 15b4919a1e0703b77dd7cc0a4d9732f7f6181236
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Sat Jul 1 11:28:52 2023 +0800

    mm: use a folio in fault_dirty_shared_page()

    We can replace four implicit calls to compound_head() with one by using
    folio.

    Link: https://lkml.kernel.org/r/20230701032853.258697-2-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:43 -04:00
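
The folio conversions in the five commits above all follow the same shape; a rough sketch (hypothetical helper, not the actual diffs) of why the explicit page_folio() call removes the implicit compound_head() lookups:

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Sketch: resolve the head page once with page_folio() instead of
     * letting each page-based call do its own compound_head() lookup. */
    static void mark_dirty_folio_style(struct page *page)
    {
            struct folio *folio = page_folio(page);  /* one compound_head() */

            folio_lock(folio);            /* was: lock_page(page)      */
            folio_mark_dirty(folio);      /* was: set_page_dirty(page) */
            folio_unlock(folio);          /* was: unlock_page(page)    */
    }
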
Rafael Aquini bbc807ec19 mm: handle userfaults under VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 29a22b9e08d70d6c9b075c12c47b6e895cb65cf0
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Jun 30 14:19:57 2023 -0700

    mm: handle userfaults under VMA lock

    Enable handle_userfault to operate under VMA lock by releasing VMA lock
    instead of mmap_lock and retrying.  Note that FAULT_FLAG_RETRY_NOWAIT
    should never be used when handling faults under per-VMA lock protection
    because that would break the assumption that lock is dropped on retry.

    [surenb@google.com: fix a lockdep issue in vma_assert_write_locked]
      Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:42 -04:00
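
A sketch of the lock-release rule the commit describes. The wrapper name is hypothetical; FAULT_FLAG_VMA_LOCK, vma_end_read() and mmap_read_unlock() are the per-VMA-lock and mmap-lock primitives used elsewhere in this series, and the snippet is illustrative rather than the patch itself.

    #include <linux/mm.h>
    #include <linux/mmap_lock.h>

    /* Sketch only: before handle_userfault() sleeps, drop whichever lock
     * the fault path holds and ask the caller to retry.
     * FAULT_FLAG_RETRY_NOWAIT must not be set here, because the lock is
     * dropped on retry. */
    static vm_fault_t userfault_release_and_retry(struct vm_fault *vmf)
    {
            if (vmf->flags & FAULT_FLAG_VMA_LOCK)
                    vma_end_read(vmf->vma);
            else
                    mmap_read_unlock(vmf->vma->vm_mm);

            return VM_FAULT_RETRY;
    }
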
Rafael Aquini e4a9a6ac40 mm/memory.c: fix mismerge
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 08dff2810e8feb3096bf5c8242ab1649d1e8b1a4
Author: Matthew Wilcox <willy@infradead.org>
Date:   Sat Aug 12 16:56:25 2023 +0100

    mm/memory.c: fix mismerge

    Fix a build issue.

    Link: https://lkml.kernel.org/r/ZNerqcNS4EBJA/2v@casper.infradead.org
    Fixes: 4aaa60dad4d1 ("mm: allow per-VMA locks on file-backed VMAs")
    Signed-off-by: Matthew Wilcox <willy@infradead.org>
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202308121909.XNYBtqNI-lkp@intel.com/
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:41 -04:00
Rafael Aquini fcb80f6ad4 mm: handle swap page faults under per-VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 1235ccd05b6dd6970ff50baea99aa994023fbc4a
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Jun 30 14:19:56 2023 -0700

    mm: handle swap page faults under per-VMA lock

    When page fault is handled under per-VMA lock protection, all swap page
    faults are retried with mmap_lock because folio_lock_or_retry has to drop
    and reacquire mmap_lock if folio could not be immediately locked.  Follow
    the same pattern as mmap_lock to drop per-VMA lock when waiting for folio
    and retrying once folio is available.

    With this obstacle removed, enable do_swap_page to operate under per-VMA
    lock protection.  Drivers implementing ops->migrate_to_ram might still
    rely on mmap_lock, therefore we have to fall back to mmap_lock in that
    particular case.

    Note that the only time do_swap_page calls synchronous swap_readpage is
    when SWP_SYNCHRONOUS_IO is set, which is only set for
    QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem).
    Therefore we don't sleep in this path, and there's no need to drop the
    mmap or per-VMA lock.

    Link: https://lkml.kernel.org/r/20230630211957.1341547-6-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Tested-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:40 -04:00
Rafael Aquini eb30a03b12 mm: change folio_lock_or_retry to use vm_fault directly
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit fdc724d6aa44efd75cc9b6a3c3900baac44bc50a
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Jun 30 14:19:55 2023 -0700

    mm: change folio_lock_or_retry to use vm_fault directly

    Change folio_lock_or_retry to accept vm_fault struct and return the
    vm_fault_t directly.

    Link: https://lkml.kernel.org/r/20230630211957.1341547-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:38 -04:00
Rafael Aquini 09f08f6301 mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * arch/riscv/mm/fault.c: hunk dropped (unsupported arch)

This patch is a backport of the following upstream commit:
commit 4089eef0e6ac1a179c58304c657b3df3bb6fe509
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Jun 30 14:19:54 2023 -0700

    mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED

    handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
    mmap_lock has been released.  However with per-VMA locks behavior is
    different and the caller should still release it.  To make the rules
    consistent for the caller, drop the per-VMA lock when returning
    VM_FAULT_RETRY or VM_FAULT_COMPLETED.  Currently the only path returning
    VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
    VM_FAULT_COMPLETED for now.

    [willy@infradead.org: fix riscv]
      Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com
    Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Tested-by: Conor Dooley <conor.dooley@microchip.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:37 -04:00
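
The resulting caller-side rule, roughly, for an arch fault handler using per-VMA locks. The wrapper below is hypothetical and only indicative of the pattern, not the upstream hunk:

    #include <linux/mm.h>

    /* Sketch: on RETRY or COMPLETED the lock has already been released by
     * the fault handler, so only the other outcomes need vma_end_read(). */
    static vm_fault_t vma_locked_fault_sketch(struct vm_area_struct *vma,
                                              unsigned long address,
                                              unsigned int flags,
                                              struct pt_regs *regs)
    {
            vm_fault_t fault;

            fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
            if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                    vma_end_read(vma);

            return fault;
    }
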
Rafael Aquini 89b7c01962 mm: increase usage of folio_next_index() helper
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 87b11f862254396a93636f0998377ac3f6648f5f
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Jun 27 10:43:49 2023 -0700

    mm: increase usage of folio_next_index() helper

    Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
    the existing helper folio_next_index().

    Link: https://lkml.kernel.org/r/20230627174349.491803-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Suggested-by: Christoph Hellwig <hch@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:23 -04:00
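
The replaced pattern, side by side (hypothetical helpers, shown only to contrast the two equivalent forms):

    #include <linux/pagemap.h>

    /* Sketch: compute the index just past a folio. */
    static pgoff_t next_index_open_coded(struct folio *folio)
    {
            return folio->index + folio_nr_pages(folio);  /* old pattern */
    }

    static pgoff_t next_index_with_helper(struct folio *folio)
    {
            return folio_next_index(folio);               /* new helper  */
    }
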
Rafael Aquini ec84ab01c5 ksm: add ksm zero pages for each process
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6080d19f07043ade61094d0f58b14c05e1694a39
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:38 2023 +0800

    ksm: add ksm zero pages for each process

    As the number of ksm zero pages is not included in ksm_merging_pages per
    process when use_zero_pages is enabled, it's unclear how many actual
    pages are merged by KSM. To let users accurately estimate their memory
    demands when unsharing KSM zero-pages, it's necessary to show KSM
    zero-pages per process. In addition, it helps users to know the actual
    KSM profit, because KSM-placed zero pages also benefit from KSM.

    Since zero pages placed by KSM can now be unshared accurately, tracking
    how such pages are merged and unmerged is no longer difficult.

    Since we already have /proc/<pid>/ksm_stat, just add the information of
    'ksm_zero_pages' in it.

    Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:21 -04:00
Rafael Aquini 993ca53ef9 ksm: count all zero pages placed by KSM
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit e2942062e01df85b4692460fe5b48ab0c90fdb95
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:34 2023 +0800

    ksm: count all zero pages placed by KSM

    As pages_sharing and pages_shared don't include the number of zero pages
    merged by KSM, we cannot know how many pages are zero pages placed by KSM
    when use_zero_pages is enabled, which means KSM is not transparent about
    all of the pages it actually merges.  In the early days of use_zero_pages,
    zero pages could not be unshared through means like MADV_UNMERGEABLE, so
    it was hard to count how many times one of those zero pages was later
    unmerged.

    But now that unsharing KSM-placed zero pages accurately has been achieved,
    we can easily count both how many times a page full of zeroes was merged
    with the zero page and how many times one of those pages was then
    unmerged.  This helps to estimate memory demands in case each and every
    shared page could get unshared.

    So we add ksm_zero_pages under /sys/kernel/mm/ksm/ to show the number
    of all zero pages placed by KSM. Meanwhile, we update the Documentation.

    Link: https://lkml.kernel.org/r/20230613030934.185944-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:20 -04:00
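
A small userspace sketch (not part of either patch) reading the counter the two KSM commits above expose; the sysfs path comes from this commit's message and the per-process field from the previous one.

    #include <stdio.h>

    /* Sketch: read the global counter added under /sys/kernel/mm/ksm/;
     * the per-process value appears as "ksm_zero_pages" in
     * /proc/<pid>/ksm_stat. */
    int main(void)
    {
            unsigned long zero_pages;
            FILE *f = fopen("/sys/kernel/mm/ksm/ksm_zero_pages", "r");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            if (fscanf(f, "%lu", &zero_pages) == 1)
                    printf("KSM-placed zero pages: %lu\n", zero_pages);
            fclose(f);
            return 0;
    }
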
Rafael Aquini 4c092a791b mm: avoid 'might_sleep()' in get_mmap_lock_carefully()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 4542057e18caebe5ebaee28f0438878098674504
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Aug 21 06:11:33 2023 +0200

    mm: avoid 'might_sleep()' in get_mmap_lock_carefully()

    This might_sleep() goes back a long time: it was originally introduced
    way back when by commit 010060741a ("x86: add might_sleep() to
    do_page_fault()"), and made it into the generic VM code when the x86
    fault path got re-organized and generalized in commit c2508ec5a58d ("mm:
    introduce new 'lock_mm_and_find_vma()' page fault helper").

    However, it turns out that the placement of that might_sleep() has
    always been rather questionable simply because it's not only a debug
    statement to warn about sleeping in contexts that shouldn't sleep (which
    was the original reason for adding it), but it also implies a voluntary
    scheduling point.

    That, in turn, is less than desirable for two reasons:

     (a) it ends up being done after we successfully got the mmap_lock, so
         just as we got the lock we will now eagerly schedule away and
         increase lock contention

    and

     (b) this is all very possibly part of the "oops, things went horribly
         wrong" path and we just haven't figured that out yet

    After all, the whole _reason_ for having that get_mmap_lock_carefully()
    rather than just doing the obvious mmap_read_lock() is because this code
    wants to deal somewhat gracefully with potential kernel wild pointer
    bugs.

    So then a voluntary scheduling point here is simply not a good idea.

    We could certainly turn the 'might_sleep()' into a '__might_sleep()' and
    make it be just the debug check that it was originally intended to be.

    But even that seems questionable in the wild kernel pointer case - which
    again is part of the whole point of this code.  The problem wouldn't be
    about the _sleeping_ part of the page fault, but about a bad kernel
    access.  The fact that that bad kernel access might happen in a section
    that you shouldn't sleep in is secondary.

    So it really ends up being the case that this is simply entirely the
    wrong place to do this debug check and related scheduling point at all.

    So let's just remove the check entirely.  It's been around for over a
    decade, it has served its purpose.

    The re-schedule will happen at return to user space anyway for the
    normal case, and the warning - if we even need it - might be better off
    done as a special case for "page fault from kernel mode" once we've
    dealt with any potential kernel oopses where the oops is the relevant
    thing, not some artificial "scheduling while atomic" test.

    Reported-by: Mateusz Guzik <mjguzik@gmail.com>
    Link: https://lore.kernel.org/lkml/20230820104303.2083444-1-mjguzik@gmail.com/
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:47 -04:00
Rafael Aquini cc097d440c mm: Fix access_remote_vm() regression on tagged addresses
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 22883973244b1caaa26f9c6171a41ba843c8d4bd
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Wed Aug 9 17:46:00 2023 +0300

    mm: Fix access_remote_vm() regression on tagged addresses

    GDB uses /proc/PID/mem to access memory of the target process. GDB
    doesn't untag addresses manually, but relies on kernel to do the right
    thing.

    mem_rw() of procfs uses access_remote_vm() to get data from the target
    process. It worked fine until recent changes in __access_remote_vm()
    that now checks if there's VMA at target address using raw address.

    Untag the address before looking up the VMA.

    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Reported-by: Christina Schimpe <christina.schimpe@intel.com>
    Fixes: eee9c708cc89 ("gup: avoid stack expansion warning for known-good case")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:47 -04:00
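
A sketch of where the untagging lands. The helper name below is the kernel's remote-untagging primitive, but the exact placement inside __access_remote_vm() is approximated and the wrapper is hypothetical, not the verbatim hunk.

    #include <linux/mm.h>
    #include <linux/uaccess.h>

    /* Sketch only: strip the tag bits before the lookup so a tagged
     * address coming from /proc/PID/mem still resolves to its VMA.
     * untagged_addr_remote() requires mmap_lock to be held. */
    static struct vm_area_struct *remote_vma_lookup_sketch(struct mm_struct *mm,
                                                           unsigned long addr)
    {
            addr = untagged_addr_remote(mm, addr);
            return vma_lookup(mm, addr);
    }
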
Rafael Aquini 0f4ac0e1b7 gup: avoid stack expansion warning for known-good case
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit eee9c708cc89b4600c6e6cdda5bc2b8b4dad96cb
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jun 29 12:36:47 2023 -0700

    gup: avoid stack expansion warning for known-good case

    In commit a425ac5365f6 ("gup: add warning if some caller would seem to
    want stack expansion") I added a temporary warning to catch any strange
    GUP users that would be impacted by the fact that GUP no longer extends
    the stack.

    But it turns out that the warning is most easily triggered through
    __access_remote_vm(), that already knows to expand the stack - it just
    does it *after* calling GUP.  So the warning is easy to trigger by just
    running gdb (or similar) and accessing things remotely under the stack.

    This just adds a temporary extra "expand stack early" to avoid the
    warning for the already converted case - not because the warning is bad,
    but because getting the warning for this known good case would then hide
    any subsequent warnings for any actually interesting cases.

    Let's try to remember to revert this change when we remove the warnings.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:35 -04:00
Rafael Aquini 25e4aa840e mm: remove references to pagevec
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 1fec6890bf2247ecc93f5491c2d3f33c333d5c6e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jun 21 17:45:56 2023 +0100

    mm: remove references to pagevec

    Most of these should just refer to the LRU cache rather than the data
    structure used to implement the LRU cache.

    Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:32 -04:00
Rafael Aquini 410830503d mm: always expand the stack with the mmap write lock held
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * arch/parisc/mm/fault.c: hunks dropped as there were merge conflicts not
       worth of fixing for this unsupported hardware arch;
  * drivers/iommu/amd/iommu_v2.c: hunk dropped given out-of-order backport
       of upstream commit 5a0b11a180a9 ("iommu/amd: Remove iommu_v2 module")
  * mm/memory.c: differences on the 2nd hunk due to upstream conflict with
       commit ca5e863233e8 ("mm/gup: remove vmas parameter from
       get_user_pages_remote()") that ended up solved by merge commit
       9471f1f2f502 ("Merge branch 'expand-stack'").

This patch is a backport of the following upstream commit:
commit 8d7071af890768438c14db6172cc8f9f4d04e184
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat Jun 24 13:45:51 2023 -0700

    mm: always expand the stack with the mmap write lock held

    This finishes the job of always holding the mmap write lock when
    extending the user stack vma, and removes the 'write_locked' argument
    from the vm helper functions again.

    For some cases, we just avoid expanding the stack at all: drivers and
    page pinning really shouldn't be extending any stacks.  Let's see if any
    strange users really wanted that.

    It's worth noting that architectures that weren't converted to the new
    lock_mm_and_find_vma() helper function are left using the legacy
    "expand_stack()" function, but it has been changed to drop the mmap_lock
    and take it for writing while expanding the vma.  This makes it fairly
    straightforward to convert the remaining architectures.

    As a result of dropping and re-taking the lock, the calling conventions
    for this function have also changed, since the old vma may no longer be
    valid.  So it will now return the new vma if successful, and NULL - and
    the lock dropped - if the area could not be extended.

    Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
    Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
    Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:19 -04:00
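
The new calling convention in the legacy path, sketched with a hypothetical caller (the surrounding error handling is arch-specific and omitted):

    #include <linux/mm.h>

    /* Sketch: on success the returned vma is valid and mmap_lock is held
     * for reading; on failure NULL comes back with the lock already
     * dropped, so the old vma pointer must not be used either way. */
    static struct vm_area_struct *find_or_expand_sketch(struct mm_struct *mm,
                                                        unsigned long address)
    {
            struct vm_area_struct *vma;

            mmap_read_lock(mm);
            vma = find_vma(mm, address);
            if (vma && vma->vm_start <= address)
                    return vma;                 /* already covered, lock held */

            /* expand_stack() drops mmap_lock internally and retakes it,
             * returning the expanded vma or NULL with no lock held. */
            return expand_stack(mm, address);
    }
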
Rafael Aquini 0ce393dc54 mm: make find_extend_vma() fail if write lock not held
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit f440fa1ac955e2898893f9301568435eb5cdfc4b
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Fri Jun 16 15:58:54 2023 -0700

    mm: make find_extend_vma() fail if write lock not held

    Make calls to extend_vma() and find_extend_vma() fail if the write lock
    is required.

    To avoid making this a flag-day event, this still allows the old
    read-locking case for the trivial situations, and passes in a flag to
    say "is it write-locked".  That way write-lockers can say "yes, I'm
    being careful", and legacy users will continue to work in all the common
    cases until they have been fully converted to the new world order.

    Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:17 -04:00
Rafael Aquini 07c5a81ca0 mm: make the page fault mmap locking killable
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit eda0047296a16d65a7f2bc60a408f70d178b2014
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jun 15 16:17:48 2023 -0700

    mm: make the page fault mmap locking killable

    This is done as a separate patch from introducing the new
    lock_mm_and_find_vma() helper, because while it's an obvious change,
    it's not what x86 used to do in this area.

    We already abort the page fault on fatal signals anyway, so why should
    we wait for the mmap lock only to then abort later? With the new helper
    function that returns without the lock held on failure anyway, this is
    particularly easy and straightforward.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:04 -04:00
Rafael Aquini 4cf2488be8 mm: introduce new 'lock_mm_and_find_vma()' page fault helper
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * arch/x86/Kconfig: minor context diff due to out-of-order backport
    of upstream commit 0c7ffa32dbd6 ("x86/smpboot/64: Implement
     arch_cpuhp_init_parallel_bringup() and enable it")
  * mm/Kconfig: minor context diff due to out-of-order backport of
    upstream commit 8f23f5dba6b4 ("iommu: Change kconfig around IOMMU_SVA")

This patch is a backport of the following upstream commit:
commit c2508ec5a58db67093f4fb8bf89a9a7c53a109e9
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jun 15 15:17:36 2023 -0700

    mm: introduce new 'lock_mm_and_find_vma()' page fault helper

    .. and make x86 use it.

    This basically extracts the existing x86 "find and expand faulting vma"
    code, but extends it to also take the mmap lock for writing in case we
    actually do need to expand the vma.

    We've historically short-circuited that case, and have some rather ugly
    special logic to serialize the stack segment expansion (since we only
    hold the mmap lock for reading) that doesn't match the normal VM
    locking.

    That slight violation of locking worked well, right up until it didn't:
    the maple tree code really does want proper locking even for simple
    extension of an existing vma.

    So extract the code for "look up the vma of the fault" from x86, fix it
    up to do the necessary write locking, and make it available as a helper
    function for other architectures that can use the common helper.

    Note: I say "common helper", but it really only handles the normal
    stack-grows-down case.  Which is all architectures except for PA-RISC
    and IA64.  So some rare architectures can't use the helper, but if they
    care they'll just need to open-code this logic.

    It's also worth pointing out that this code really would like to have an
    optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
    read-lock (for the common case) to taking the write lock (for having to
    extend the vma) in the normal single-threaded situation where there is
    no other locking activity.

    But that _is_ all the very uncommon special case, so while it would be
    nice to have such an operation, it probably doesn't matter in reality.
    I did put in the skeleton code for such a possible future expansion,
    even if it only acts as pseudo-documentation for what we're doing.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:03 -04:00
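
Roughly how an arch fault handler uses the helper; the wrapper below is a hypothetical sketch modeled on the x86 usage, with the arch-specific error reporting reduced to a return code.

    #include <linux/mm.h>

    /* Sketch: on success mmap_lock is held for reading and the vma covers
     * the faulting address; on failure NULL is returned with no lock held. */
    static vm_fault_t helper_fault_sketch(struct mm_struct *mm, struct pt_regs *regs,
                                          unsigned long address, unsigned int flags)
    {
            struct vm_area_struct *vma;
            vm_fault_t fault;

            vma = lock_mm_and_find_vma(mm, address, regs);
            if (unlikely(!vma))
                    return VM_FAULT_SIGSEGV;  /* arch code would report the bad area */

            fault = handle_mm_fault(vma, address, flags, regs);
            if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                    mmap_read_unlock(mm);

            return fault;
    }
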
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
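
The shape of the conversion outside the Coccinelle script, as an illustrative sketch with a hypothetical helper:

    #include <linux/pgtable.h>

    /* Sketch: a single ptep_get() replaces the raw dereference, giving
     * READ_ONCE() semantics and a single override point for arch code. */
    static bool pte_is_present_sketch(pte_t *ptep)
    {
            pte_t pte = ptep_get(ptep);     /* was: pte_t pte = *ptep; */

            return pte_present(pte);
    }
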
Rafael Aquini 0c216f7cbd perf/core: allow pte_offset_map() to fail
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit a92cbb82c8d375d47fbaf0e1ad3fd4074a7cb156
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:53:23 2023 -0700

    perf/core: allow pte_offset_map() to fail

    In rare transient cases, not yet made possible, pte_offset_map() and
    pte_offet_map_lock() may not find a page table: handle appropriately.

    [hughd@google.com: __wp_page_copy_user(): don't call update_mmu_tlb() with NULL]
      Link: https://lkml.kernel.org/r/1a4db221-7872-3594-57ce-42369945ec8d@google.com
    Link: https://lkml.kernel.org/r/a194441b-63f3-adb6-5964-7ca3171ae7c2@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:39 -04:00
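
The handling pattern the commit asks for, sketched with a hypothetical caller:

    #include <linux/mm.h>

    /* Sketch: pte_offset_map_lock() can now return NULL if the page table
     * is (transiently) gone; treat that as "nothing mapped here". */
    static bool pte_present_at(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
    {
            spinlock_t *ptl;
            pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
            bool ret;

            if (!pte)
                    return false;

            ret = pte_present(ptep_get(pte));
            pte_unmap_unlock(pte, ptl);
            return ret;
    }
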
Rafael Aquini c1cce5ecaa mm: fix __access_remote_vm() GUP failure case
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 6581ccf03e717926be97dc3d27182ce351232f3c
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Jun 28 12:20:24 2023 -0700

    mm: fix __access_remote_vm() GUP failure case

    Commit ca5e863233e8 ("mm/gup: remove vmas parameter from
    get_user_pages_remote()") removed the vma argument from GUP handling,
    and instead added a helper function (get_user_page_vma_remote()) that
    looks it up separately using 'vma_lookup()'.  And then converted
    existing users that needed a vma to use the helper instead.

    However, the helper function intentionally acts exactly like the old
    get_user_pages_remote() did, and only fills in 'vma' on successful page
    lookup.  Fine so far.

    However, __access_remote_vm() wants the vma even for the unsuccessful
    case, and used to do a

            vma = vma_lookup(mm, addr);

    explicitly to look it up when the get_user_page() failed.

    However, that conversion commit incorrectly removed that vma lookup,
    thinking that get_user_page_vma_remote() would have done it.  Not so.

    So add the vma_lookup() back in.

    Fixes: ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()")
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:36 -04:00
Rafael Aquini e24b3ade32 mm/gup: remove vmas parameter from get_user_pages_remote()
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  - virt/kvm/async_pf.c: minor context diff due to out-of-order backport of
    upstream commit 08284765f03b7 ("KVM: Get reference to VM's address space
    in the async #PF worker")

This patch is a backport of the following upstream commit:
commit ca5e863233e8f6acd1792fd85d6bc2729a1b2c10
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Wed May 17 20:25:39 2023 +0100

    mm/gup: remove vmas parameter from get_user_pages_remote()

    The only instances of get_user_pages_remote() invocations which used the
    vmas parameter were for a single page which can instead simply look up the
    VMA directly. In particular:-

    - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
      remove it.

    - __access_remote_vm() was already using vma_lookup() when the original
      lookup failed so by doing the lookup directly this also de-duplicates the
      code.

    We are able to perform these VMA operations as we already hold the
    mmap_lock in order to be able to call get_user_pages_remote().

    As part of this work we add get_user_page_vma_remote() which abstracts the
    VMA lookup, error handling and decrementing the page reference count should
    the VMA lookup fail.

    This forms part of a broader set of patches intended to eliminate the vmas
    parameter altogether.

    [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
    Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jarkko Sakkinen <jarkko@kernel.org>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:35 -04:00
Rafael Aquini 8be2ec7f05 mm: remove unused vmf_insert_mixed_prot()
JIRA: https://issues.redhat.com/browse/RHEL-48221
Conflicts:
    * include/linux/mm_types.h: minor context difference due to out-of-order
      backport for upstream commit 20cce633f425 ("mm: rcu safe VMA freeing")

This patch is a backport of the following upstream commit:
commit 28d8b812e97b31231e95864f36a6b32f4b307daa
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Sun Mar 12 23:40:13 2023 +0000

    mm: remove unused vmf_insert_mixed_prot()

    Patch series "Remove drm/ttm-specific mm changes".

    Functionality was added specifically for the DRM TTM driver to support
    mapping memory for VM_MIXEDMAP VMAs with customised protection flags,
    however this has now been rolled back as issues were found with this
    approach.

    This series removes the mm changes too, retaining some of the useful
    comments.

    This patch (of 3):

    The sole user of vmf_insert_mixed_prot(), the drm ttm module, stopped
    using this in commit f91142c621 ("drm/ttm: nuke VM_MIXEDMAP on BO
    mappings v3") citing use of VM_MIXEDMAP in this case being terribly
    broken.

    Remove this now-dead code and references to it, but retain the useful
    description of the prot != vma->vm_page_prot case, moving it to
    vmf_insert_pfn_prot() instead.

    Link: https://lkml.kernel.org/r/cover.1678661628.git.lstoakes@gmail.com
    Link: https://lkml.kernel.org/r/a069644388e6f1593a7020d15840e6fc9f39bcaf.1678661628.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Aaron Tomlin <atomlin@atomlin.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Huacai Chen <chenhuacai@kernel.org>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:02 -04:00
Lucas Zampieri d0d8f9c2bd Merge: mm/swap: fix race when skipping swapcache
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4031

JIRA: https://issues.redhat.com/browse/RHEL-31646  
CVE: CVE-2024-26759  
  
The commit in Fixes introduces a way to skip the swapcache for
fast devices like zram, pmem and btt as a means to reduce swap-in
latency. By doing so, however, it introduces the race conditions
leading to the noted CVE.
  
Fixes: 0bcac06f27 ("mm,swap: skip swapcache for swapin of synchronous device")

Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Nico Pache <npache@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-22 21:15:31 +00:00
Nico Pache 51217d9111 hugetlbfs: close race between MADV_DONTNEED and page fault
commit 2820b0f09be99f6406784b03a22dfc83e858449d
Author: Rik van Riel <riel@surriel.com>
Date:   Thu Oct 5 23:59:08 2023 -0400

    hugetlbfs: close race between MADV_DONTNEED and page fault

    Malloc libraries, like jemalloc and tcalloc, take decisions on when to
    call madvise independently from the code in the main application.

    This sometimes results in the application page faulting on an address,
    right after the malloc library has shot down the backing memory with
    MADV_DONTNEED.

    Usually this is harmless, because we always have some 4kB pages sitting
    around to satisfy a page fault.  However, with hugetlbfs systems often
    allocate only the exact number of huge pages that the application wants.

    Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
    any lock taken on the page fault path, which can open up the following
    race condition:

           CPU 1                            CPU 2

           MADV_DONTNEED
           unmap page
           shoot down TLB entry
                                           page fault
                                           fail to allocate a huge page
                                           killed with SIGBUS
           free page

    Fix that race by pulling the locking from __unmap_hugepage_final_range
    into helper functions called from zap_page_range_single.  This ensures
    page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
    have actually been freed.

    Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
    Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:32 -06:00
Nico Pache bdb2b12d7b mm: replace mmap with vma write lock assertions when operating on a vma
commit e727bfd5e73a35ecbc4a01a15c659b9fafaa97c0
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:21 2023 -0700

    mm: replace mmap with vma write lock assertions when operating on a vma

    Vma write lock assertion always includes mmap write lock assertion and
    additional vma lock checks when per-VMA locks are enabled. Replace
    weaker mmap_assert_write_locked() assertions with stronger
    vma_assert_write_locked() ones when we are operating on a vma which
    is expected to be locked.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-4-surenb@google.com
    Suggested-by: Jann Horn <jannh@google.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Linus Torvalds <torvalds@linuxfoundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:30 -06:00
Nico Pache 9086b6a722 x86/mm/pat: fix VM_PAT handling in COW mappings
commit 04c35ab3bdae7fefbd7c7a7355f29fa03a035221
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Apr 3 23:21:30 2024 +0200

    x86/mm/pat: fix VM_PAT handling in COW mappings

    PAT handling won't do the right thing in COW mappings: the first PTE (or,
    in fact, all PTEs) can be replaced during write faults to point at anon
    folios.  Reliably recovering the correct PFN and cachemode using
    follow_phys() from PTEs will not work in COW mappings.

    Using follow_phys(), we might just get the address+protection of the anon
    folio (which is very wrong), or fail on swap/nonswap entries, failing
    follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
    track_pfn_copy(), not properly calling free_pfn_range().

    In free_pfn_range(), we either wouldn't call memtype_free() or would call
    it with the wrong range, possibly leaking memory.

    To fix that, let's update follow_phys() to refuse returning anon folios,
    and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
    if we run into that.

    We will now properly handle untrack_pfn() with COW mappings, where we
    don't need the cachemode.  We'll have to fail fork()->track_pfn_copy() if
    the first page was replaced by an anon folio, though: we'd have to store
    the cachemode in the VMA to make this work, likely growing the VMA size.

    For now, let's keep it simple and let track_pfn_copy() just fail in that
    case: it would have failed in the past with swap/nonswap entries already,
    and it would have done the wrong thing with anon folios.

    Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

    <--- C reproducer --->
     #include <stdio.h>
     #include <sys/mman.h>
     #include <unistd.h>
     #include <liburing.h>

     int main(void)
     {
             struct io_uring_params p = {};
             int ring_fd;
             size_t size;
             char *map;

             ring_fd = io_uring_setup(1, &p);
             if (ring_fd < 0) {
                     perror("io_uring_setup");
                     return 1;
             }
             size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

             /* Map the submission queue ring MAP_PRIVATE */
             map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                        ring_fd, IORING_OFF_SQ_RING);
             if (map == MAP_FAILED) {
                     perror("mmap");
                     return 1;
             }

             /* We have at least one page. Let's COW it. */
             *map = 0;
             pause();
             return 0;
     }
    <--- C reproducer --->

    On a system with 16 GiB RAM and swap configured:
     # ./iouring &
     # memhog 16G
     # killall iouring
    [  301.552930] ------------[ cut here ]------------
    [  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
    [  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
    [  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
    [  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
    [  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
    [  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
    [  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
    [  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
    [  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
    [  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
    [  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
    [  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
    [  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
    [  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
    [  301.565725] PKRU: 55555554
    [  301.565944] Call Trace:
    [  301.566148]  <TASK>
    [  301.566325]  ? untrack_pfn+0xf4/0x100
    [  301.566618]  ? __warn+0x81/0x130
    [  301.566876]  ? untrack_pfn+0xf4/0x100
    [  301.567163]  ? report_bug+0x171/0x1a0
    [  301.567466]  ? handle_bug+0x3c/0x80
    [  301.567743]  ? exc_invalid_op+0x17/0x70
    [  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
    [  301.568363]  ? untrack_pfn+0xf4/0x100
    [  301.568660]  ? untrack_pfn+0x65/0x100
    [  301.568947]  unmap_single_vma+0xa6/0xe0
    [  301.569247]  unmap_vmas+0xb5/0x190
    [  301.569532]  exit_mmap+0xec/0x340
    [  301.569801]  __mmput+0x3e/0x130
    [  301.570051]  do_exit+0x305/0xaf0
    ...

    Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: Wupeng Ma <mawupeng1@huawei.com>
    Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
    Fixes: b1a86e15dc ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
    Fixes: 5899329b19 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
    Acked-by: Ingo Molnar <mingo@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:30 -06:00
Nico Pache fb49283764 mm: fix unmap_mapping_range high bits shift bug
commit 9eab0421fa94a3dde0d1f7e36ab3294fc306c99d
Author: Jiajun Xie <jiajun.xie.sh@gmail.com>
Date:   Wed Dec 20 13:28:39 2023 +0800

    mm: fix unmap_mapping_range high bits shift bug

    The bug happens when the highest bit of holebegin is 1.  Suppose holebegin
    is 0x8000000111111000; after the shift, hba would be 0xfff8000000111111,
    so vma_interval_tree_foreach would either fail to find the mapping or
    return the wrong result.

    error call seq e.g.:
    - mmap(..., offset=0x8000000111111000)
      |- syscall(mmap, ... unsigned long, off):
         |- ksys_mmap_pgoff( ... , off >> PAGE_SHIFT);

      here pgoff is correctly shifted to 0x8000000111111,
      but passing 0x8000000111111000 as holebegin to unmap
      would then cause the wrong result, as shown below:

    - unmap_mapping_range(..., loff_t const holebegin)
      |- pgoff_t hba = holebegin >> PAGE_SHIFT;
              /* hba = 0xfff8000000111111 unexpectedly */

    The issue happens in heterogeneous computing, where the device (e.g. a
    gpu) and the host share the same virtual address space.

    A simple workflow pattern which hit the issue is:
            /* host */
        1. userspace first mmaps a file-backed VA range with a specified offset.
                            e.g. (offset=0x800..., mmap return: va_a)
        2. write some data to the corresponding sys page
                             e.g. (va_a = 0xAABB)
            /* device */
        3. gpu workload touches VA, triggers a gpu fault and notifies the host.
            /* host */
        4. host receives the gpu fault notification, then it will:
                4.1 unmap host pages and also takes care of cpu tlb
                      (use unmap_mapping_range with offset=0x800...)
                4.2 migrate sys page to device
                4.3 setup device page table and resolve device fault.
            /* device */
        5. gpu workload continued, it accessed va_a and got 0xAABB.
        6. gpu workload continued, it wrote 0xBBCC to va_a.
            /* host */
        7. userspace access va_a, as expected, it will:
                7.1 trigger cpu vm fault.
                7.2 driver handling fault to migrate gpu local page to host.
        8. userspace then could correctly get 0xBBCC from va_a
        9. done

    But in step 4.1, if we hit the bug this patch mentioned, then userspace
    would never trigger cpu fault, and still get the old value: 0xAABB.

    Making holebegin unsigned first fixes the bug.
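
    A minimal userspace sketch of the sign-extension (assuming 4kB pages,
    i.e. a shift by PAGE_SHIFT == 12, and a compiler that implements
    arithmetic right shift for signed types, as gcc and clang do):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long off = 0x8000000111111000ULL;
              long long holebegin = (long long)off;    /* loff_t is signed */

              /* arithmetic shift sign-extends: prints fff8000000111111 */
              printf("signed   holebegin >> 12: %llx\n",
                     (unsigned long long)(holebegin >> 12));
              /* logical shift keeps the intended pgoff: prints 8000000111111 */
              printf("unsigned holebegin >> 12: %llx\n", off >> 12);
              return 0;
      }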

    Link: https://lkml.kernel.org/r/20231220052839.26970-1-jiajun.xie.sh@gmail.com
    Signed-off-by: Jiajun Xie <jiajun.xie.sh@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:29 -06:00
Nico Pache 8eac7a3de1 mm: call arch_swap_restore() from do_swap_page()
commit 6dca4ac6fc91fd41ea4d6c4511838d37f4e0eab2
Author: Peter Collingbourne <pcc@google.com>
Date:   Mon May 22 17:43:08 2023 -0700

    mm: call arch_swap_restore() from do_swap_page()

    Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
    the call to swap_free() before the call to set_pte_at(), which meant that
    the MTE tags could end up being freed before set_pte_at() had a chance to
    restore them.  Fix it by adding a call to the arch_swap_restore() hook
    before the call to swap_free().

    Link: https://lkml.kernel.org/r/20230523004312.1807357-2-pcc@google.com
    Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
    Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()")
    Signed-off-by: Peter Collingbourne <pcc@google.com>
    Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com>
    Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Steven Price <steven.price@arm.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: <stable@vger.kernel.org>    [6.1+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:27 -06:00
Nico Pache 35c73a9785 mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
commit 657b5146955eba331e01b9a6ae89ce2e716ba306
Author: Jann Horn <jannh@google.com>
Date:   Wed Jul 26 23:41:03 2023 +0200

    mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock

    lock_vma_under_rcu() tries to guarantee that __anon_vma_prepare() can't
    be called in the VMA-locked page fault path by ensuring that
    vma->anon_vma is set.

    However, this check happens before the VMA is locked, which means a
    concurrent move_vma() can concurrently call unlink_anon_vmas(), which
    disassociates the VMA's anon_vma.

    This means we can get UAF in the following scenario:

      THREAD 1                   THREAD 2
      ========                   ========
      <page fault>
        lock_vma_under_rcu()
          rcu_read_lock()
          mas_walk()
          check vma->anon_vma

                                 mremap() syscall
                                   move_vma()
                                    vma_start_write()
                                     unlink_anon_vmas()
                                 <syscall end>

        handle_mm_fault()
          __handle_mm_fault()
            handle_pte_fault()
              do_pte_missing()
                do_anonymous_page()
                  anon_vma_prepare()
                    __anon_vma_prepare()
                      find_mergeable_anon_vma()
                        mas_walk() [looks up VMA X]

                                 munmap() syscall (deletes VMA X)

                        reusable_anon_vma() [called on freed VMA X]

    This is a security bug if you can hit it, although an attacker would
    have to win two races at once where the first race window is only a few
    instructions wide.

    This patch is based on some previous discussion with Linus Torvalds on
    the security list.

    Cc: stable@vger.kernel.org
    Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
    Signed-off-by: Jann Horn <jannh@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:25 -06:00
Nico Pache dc2f01811e tcp: Use per-vma locking for receive zerocopy
commit 7a7f094635349a7d0314364ad50bdeb770b6df4f
Author: Arjun Roy <arjunroy@google.com>
Date:   Fri Jun 16 12:34:27 2023 -0700

    tcp: Use per-vma locking for receive zerocopy

    Per-VMA locking allows us to lock a struct vm_area_struct without
    taking the process-wide mmap lock in read mode.

    Consider a process workload where the mmap lock is taken constantly in
    write mode. In this scenario, all zerocopy receives are periodically
    blocked during that period of time - though in principle, the memory
    ranges being used by TCP are not touched by the operations that need
    the mmap write lock. This results in performance degradation.

    Now consider another workload where the mmap lock is never taken in
    write mode, but there are many TCP connections using receive zerocopy
    that are concurrently receiving. These connections all take the mmap
    lock in read mode, but this does induce a lot of contention and atomic
    ops for this process-wide lock. This results in additional CPU
    overhead caused by contending on the cache line for this lock.

    However, with per-vma locking, both of these problems can be avoided.

    As a test, I ran an RPC-style request/response workload with 4KB
    payloads and receive zerocopy enabled, with 100 simultaneous TCP
    connections. I measured perf cycles within the
    find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
    without per-vma locking enabled.

    When using process-wide mmap semaphore read locking, about 1% of
    measured perf cycles were within this path. With per-VMA locking, this
    value dropped to about 0.45%.

    Signed-off-by: Arjun Roy <arjunroy@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:25 -06:00
Chris von Recklinghausen 58f917b302 mm: do not increment pgfault stats when page fault handler retries
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 53156443a30368c0759c22e54a8d5cacc1b543cc
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Wed Apr 19 10:58:36 2023 -0700

    mm: do not increment pgfault stats when page fault handler retries

    If the page fault handler requests a retry, we will count the fault
    multiple times.  This is a relatively harmless problem as the retry paths
    are not often requested, and the only user-visible problem is that the
    fault counter will be slightly higher than it should be.  Nevertheless,
    userspace only took one fault, and should not see the fact that the kernel
    had to retry the fault multiple times.

    Move page fault accounting into mm_account_fault() and skip incomplete
    faults which will be accounted upon completion.

    Link: https://lkml.kernel.org/r/20230419175836.3857458-1-surenb@google.com
    Fixes: d065bd810b ("mm: retry page fault when blocking on disk transfer")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:05 -04:00
Chris von Recklinghausen c4677d95e9 mm: hwpoison: support recovery from HugePage copy-on-write faults
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 1cb9dc4b475c7418f925ab0c97b6750007d9f52e
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Thu Apr 13 21:13:49 2023 +0800

    mm: hwpoison: support recovery from HugePage copy-on-write faults

    Copy-on-write of hugetlb user pages with uncorrectable errors will result
    in a kernel crash.  This is because the copy is performed in kernel mode
    and in general we cannot handle accessing memory with such errors while
    in kernel mode.  Commit a873dfe1032a ("mm, hwpoison: try to recover from
    copy-on write faults") introduced the routine copy_user_highpage_mc() to
    gracefully handle copying of user pages with uncorrectable errors.
    However, the separate hugetlb copy-on-write code paths were not modified
    as part of commit a873dfe1032a.

    Modify hugetlb copy-on-write code paths to use copy_mc_user_highpage() so
    that they can also gracefully handle uncorrectable errors in user pages.
    This involves changing the hugetlb specific routine
    copy_user_large_folio() from type void to int so that it can return an
    error.  Modify the hugetlb userfaultfd code in the same way so that it can
    return -EHWPOISON if it encounters an uncorrectable error.
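
    Schematically, the converted copy path changes shape as sketched below
    (illustrative only, not the exact hugetlb code):

      /* Before: the copy could not report failure; a hardware error here
       * meant a kernel crash.
       */
      copy_user_highpage(dst, src, addr, vma);

      /* After: the machine-check aware copy reports the error, which the
       * now int-returning copy_user_large_folio() propagates upwards.
       */
      if (copy_mc_user_highpage(dst, src, addr, vma))
              return -EHWPOISON;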

    Link: https://lkml.kernel.org/r/20230413131349.2524210-1-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:58 -04:00
Chris von Recklinghausen a3e721c8e7 mm: convert copy_user_huge_page() to copy_user_large_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit c0e8150e144b62ae467520d0b51c4707c09e897b
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:31 2023 +0800

    mm: convert copy_user_huge_page() to copy_user_large_folio()

    Replace copy_user_huge_page() with copy_user_large_folio().
    copy_user_large_folio() does the same as copy_user_huge_page(), but takes
    in folios instead of pages.  Remove pages_per_huge_page from
    copy_user_large_folio(), because we can get that from folio_nr_pages(dst).

    Convert copy_user_gigantic_page() to take in folios.

    Link: https://lkml.kernel.org/r/20230410133932.32288-6-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:57 -04:00
Chris von Recklinghausen 4b83c78b5d userfaultfd: convert copy_huge_page_from_user() to copy_folio_from_user()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit e87340ca5c9cecc8a11daf1a2dcabf23f06a4e10
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:29 2023 +0800

    userfaultfd: convert copy_huge_page_from_user() to copy_folio_from_user()

    Replace copy_huge_page_from_user() with copy_folio_from_user().
    copy_folio_from_user() does the same as copy_huge_page_from_user(), but
    takes in a folio instead of a page.

    Convert page_kaddr to kaddr in copy_folio_from_user() to do indenting
    cleanup.

    Link: https://lkml.kernel.org/r/20230410133932.32288-4-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:56 -04:00
Chris von Recklinghausen ae159bb496 userfaultfd: use kmap_local_page() in copy_huge_page_from_user()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 0d508c1f0e2c7cec76c141e9d2ebc3020d9e4be4
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:28 2023 +0800

    userfaultfd: use kmap_local_page() in copy_huge_page_from_user()

    kmap() and kmap_atomic() are being deprecated in favor of
    kmap_local_page() which is appropriate for any thread local context.[1]

    Let's replace the kmap() and kmap_atomic() with kmap_local_page() in
    copy_huge_page_from_user().  When allow_pagefault is false, disable page
    faults to prevent potential deadlock.[2]
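
    The resulting mapping pattern is roughly the following kernel-style
    sketch (schematic; page, usr_src, allow_pagefault and rc stand in for
    the real locals):

      kaddr = kmap_local_page(page);
      if (!allow_pagefault)
              pagefault_disable();    /* avoid the potential deadlock noted in [2] */
      rc = copy_from_user(kaddr, usr_src, PAGE_SIZE);
      if (!allow_pagefault)
              pagefault_enable();
      kunmap_local(kaddr);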

    [1] https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/
    [2] https://lkml.kernel.org/r/20221025220136.2366143-1-ira.weiny@intel.com

    Link: https://lkml.kernel.org/r/20230410133932.32288-3-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:56 -04:00
Chris von Recklinghausen 75a57ff9f2 sched/numa: enhance vma scanning logic
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit fc137c0ddab29b591db6a091dc6d7ce20ccb73f2
Author: Raghavendra K T <raghavendra.kt@amd.com>
Date:   Wed Mar 1 17:49:01 2023 +0530

    sched/numa: enhance vma scanning logic

    During NUMA scanning, make sure only the relevant vmas of the tasks are
    scanned.

    Before:
     All the tasks of a process participate in scanning a vma even if they
     do not access the vma in its lifespan.

    Now:
     Except for the first few unconditional scans, if a process does not
     touch a vma (excluding false-positive cases of PID collisions), its
     tasks no longer scan all vmas.

    Logic used (a short sketch in C follows the list):

    1) 6 bits of PID used to mark active bit in vma numab status during
       fault to remember PIDs accessing vma.  (Thanks Mel)

    2) Subsequently in scan path, vma scanning is skipped if current PID
       had not accessed vma.

    3) First two times we do allow unconditional scan to preserve earlier
       behaviour of scanning.
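
    A minimal sketch of the filtering idea (hypothetical names; the real
    state lives in the vma's numab structure and the helpers sit in the
    NUMA fault and scan paths):

      #define PID_BITS   6
      #define PID_SLOTS  (1UL << PID_BITS)            /* 64 slots */

      struct vma_numab_sketch {
              unsigned long access_pids;              /* one bit per PID slot */
              int           scan_count;
      };

      /* fault path: remember that this PID touched the vma */
      static void vma_mark_access(struct vma_numab_sketch *nb, int pid)
      {
              nb->access_pids |= 1UL << (pid % PID_SLOTS);
      }

      /* scan path: skip vmas this PID never touched */
      static int vma_should_scan(struct vma_numab_sketch *nb, int pid)
      {
              if (nb->scan_count++ < 2)               /* first scans are unconditional */
                      return 1;
              return !!(nb->access_pids & (1UL << (pid % PID_SLOTS)));
      }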

    Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch to
    store pid information and Peter Zijlstra <peterz@infradead.org> (Usage of
    test and set bit)

    Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com
    Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
    Suggested-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Disha Talreja <dishaa.talreja@amd.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:46 -04:00
Chris von Recklinghausen c6121c7de2 mm: introduce per-VMA lock statistics
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 52f238653e452e0fda61e880f263a173d219acd1
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:27 2023 -0800

    mm: introduce per-VMA lock statistics

    Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra statistics
    about handling page fault under VMA lock.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-29-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:44 -04:00
Chris von Recklinghausen a0bb8da374 mm: prevent userfaults to be handled under per-vma lock
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 444eeb17437a0ef526c606e9141a415d3b7dfddd
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:26 2023 -0800

    mm: prevent userfaults to be handled under per-vma lock

    Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
    handling under VMA lock and retry holding mmap_lock.  This can be handled
    more gracefully in the future.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-28-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:44 -04:00
Chris von Recklinghausen 9d76c43a1a mm: prevent do_swap_page from handling page faults under VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 17c05f18e54158a3eed0c22c85b7a756b63dcc01
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:25 2023 -0800

    mm: prevent do_swap_page from handling page faults under VMA lock

    Due to the possibility of do_swap_page dropping mmap_lock, abort fault
    handling under VMA lock and retry holding mmap_lock.  This can be handled
    more gracefully in the future.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-27-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:44 -04:00
Chris von Recklinghausen 64ebc59d46 mm: fall back to mmap_lock if vma->anon_vma is not yet set
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 2ac0af1b66e3b66307f53b1cc446514308ec466d
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:23 2023 -0800

    mm: fall back to mmap_lock if vma->anon_vma is not yet set

    When vma->anon_vma is not set, page fault handler will set it by either
    reusing anon_vma of an adjacent VMA if VMAs are compatible or by
    allocating a new one.  find_mergeable_anon_vma() walks VMA tree to find a
    compatible adjacent VMA and that requires not only the faulting VMA to be
    stable but also the tree structure and other VMAs inside that tree.
    Therefore locking just the faulting VMA is not enough for this search.
    Fall back to taking mmap_lock when vma->anon_vma is not set.  This
    situation happens only on the first page fault and should not affect
    overall performance.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-25-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:43 -04:00
Chris von Recklinghausen 469ada2b6f mm: introduce lock_vma_under_rcu to be used from arch-specific code
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 50ee32537206140e4cf6e47024be29a84d458d49
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:22 2023 -0800

    mm: introduce lock_vma_under_rcu to be used from arch-specific code

    Introduce lock_vma_under_rcu function to lookup and lock a VMA during page
    fault handling.  When the VMA is not found, can't be locked, or changes
    after being locked, the function returns NULL.  The lookup is performed under
    RCU protection to prevent the found VMA from being destroyed before the
    VMA lock is acquired.  VMA lock statistics are updated according to the
    results.  For now only anonymous VMAs can be searched this way.  In other
    cases the function returns NULL.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-24-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:43 -04:00
Chris von Recklinghausen 8828368eaf mm: conditionally write-lock VMA in free_pgtables
Conflicts: mm/mmap.c - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 98e51a2239d9d419d819cd61a2e720ebf19a8b0a
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:18 2023 -0800

    mm: conditionally write-lock VMA in free_pgtables

    Normally free_pgtables needs to lock affected VMAs except for the case
    when VMAs were isolated under VMA write-lock.  munmap() does just that,
    isolating while holding appropriate locks and then downgrading mmap_lock
    and dropping per-VMA locks before freeing page tables.  Add a parameter to
    free_pgtables for such scenario.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:41 -04:00
Chris von Recklinghausen 2adb48015b mm: hold the RCU read lock over calls to ->map_pages
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 58ef47ef7db9dfc2730dc039498cc76130ea3c3d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Mar 27 18:45:15 2023 +0100

    mm: hold the RCU read lock over calls to ->map_pages

    Prevent filesystems from doing things which sleep in their map_pages
    method.  This is in preparation for a pagefault path protected only by
    RCU.

    Link: https://lkml.kernel.org/r/20230327174515.1811532-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: David Howells <dhowells@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:39 -04:00
Chris von Recklinghausen abbf77811d mm: prefer fault_around_pages to fault_around_bytes
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 53d36a56d8c494554e816300ebc0f7c23274b3ae
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Fri Mar 17 21:58:26 2023 +0000

    mm: prefer fault_around_pages to fault_around_bytes

    All use of this value is now at page granularity, so specify the variable
    as such too.  This simplifies the logic.

    We maintain the debugfs entry to ensure that there are no user-visible
    changes.

    Link: https://lkml.kernel.org/r/4995bad07fe9baa51c786fa0d81819dddfb57654.1679089214.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:30 -04:00
Chris von Recklinghausen 98ae253390 mm: refactor do_fault_around()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 9042599e81c295f0b12d940248d6608e87e7b6b6
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Fri Mar 17 21:58:25 2023 +0000

    mm: refactor do_fault_around()

    Patch series "Refactor do_fault_around()"

    Refactor do_fault_around() to avoid bitwise tricks and rather difficult to
    follow logic.  Additionally, prefer fault_around_pages to
    fault_around_bytes as the operations are performed at a base page
    granularity.

    This patch (of 2):

    The existing logic is confusing and fails to abstract a number of bitwise
    tricks.

    Use ALIGN_DOWN() to perform alignment, pte_index() to obtain a PTE index
    and represent the address range using PTE offsets, which naturally make it
    clear that the operation is intended to occur within only a single PTE and
    prevent spanning of more than one page table.

    We rely on the fact that fault_around_bytes will always be page-aligned,
    at least one page in size, a power of two and that it will not exceed
    PAGE_SIZE * PTRS_PER_PTE in size (i.e.  the address space mapped by a
    PTE).  These are all guaranteed by fault_around_bytes_set().
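
    The address arithmetic can be sketched in plain C (illustrative only;
    pte_index() is open-coded here and PTRS_PER_PTE is assumed to be 512):

      #include <stdio.h>

      #define PAGE_SHIFT        12
      #define PTRS_PER_PTE      512UL
      #define ALIGN_DOWN(x, a)  ((x) & ~((a) - 1))

      /* index of the address within its page table, i.e. pte_index() */
      static unsigned long pte_index(unsigned long addr)
      {
              return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
      }

      int main(void)
      {
              unsigned long addr = 0x7f1234567000UL;
              unsigned long fault_around_pages = 16;  /* fault_around_bytes / PAGE_SIZE */

              unsigned long from = ALIGN_DOWN(pte_index(addr), fault_around_pages);
              unsigned long to   = from + fault_around_pages;

              /*
               * Because fault_around_pages is a power of two no larger than
               * PTRS_PER_PTE, [from, to) can never cross a page-table
               * boundary; the real code additionally clamps to the VMA.
               */
              printf("map PTE offsets [%lu, %lu) around offset %lu\n",
                     from, to, pte_index(addr));
              return 0;
      }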

    Link: https://lkml.kernel.org/r/cover.1679089214.git.lstoakes@gmail.com
    Link: https://lkml.kernel.org/r/d125db1c3665a63b80cea29d56407825482e2262.1679089214.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:29 -04:00
Chris von Recklinghausen a500967bd6 mm: memory: use folio_throttle_swaprate() in do_cow_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 68fa572b503ce8bfd0d0c2e5bb185134086d7d7d
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:34 2023 +0800

    mm: memory: use folio_throttle_swaprate() in do_cow_fault()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-7-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:07 -04:00
Chris von Recklinghausen 6fce747219 mm: memory: use folio_throttle_swaprate() in do_anonymous_page()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit e2bf3e2caa62f72d6a67048df440d83a12ae1a2a
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:33 2023 +0800

    mm: memory: use folio_throttle_swaprate() in do_anonymous_page()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-6-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:07 -04:00
Chris von Recklinghausen d37b08b2a2 mm: memory: use folio_throttle_swaprate() in wp_page_copy()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4d4f75bf3293f35ae1eb1ecf8b70bffdde58ffbe
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:32 2023 +0800

    mm: memory: use folio_throttle_swaprate() in wp_page_copy()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:07 -04:00
Chris von Recklinghausen b9e719cc5f mm: memory: use folio_throttle_swaprate() in page_copy_prealloc()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit e601ded4247f959702adb5170ca8abac17a0313f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:31 2023 +0800

    mm: memory: use folio_throttle_swaprate() in page_copy_prealloc()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:06 -04:00
Chris von Recklinghausen 0358f65269 mm: memory: use folio_throttle_swaprate() in do_swap_page()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4231f8425833b144f165f01f33887b67f494acf0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Mar 2 19:58:30 2023 +0800

    mm: memory: use folio_throttle_swaprate() in do_swap_page()

    Directly use folio_throttle_swaprate() instead of
    cgroup_throttle_swaprate().

    Link: https://lkml.kernel.org/r/20230302115835.105364-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:06 -04:00
Chris von Recklinghausen eca45431b9 x86/mm/pat: clear VM_PAT if copy_p4d_range failed
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d155df53f31068c3340733d586eb9b3ddfd70fc5
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Fri Feb 17 10:56:15 2023 +0800

    x86/mm/pat: clear VM_PAT if copy_p4d_range failed

    Syzbot reports a warning in untrack_pfn().  Digging into the root cause,
    we found that this is due to a memory allocation failure in
    pmd_alloc_one, and this failure is produced by failslab.

    In copy_page_range(), the memory allocation for the pmd failed.  During
    the error handling in copy_page_range(), mmput() is called to remove all
    vmas.  While untrack_pfn() processes this empty pfn, the warning fires.

    Here's a simplified flow:

    dup_mm
      dup_mmap
        copy_page_range
          copy_p4d_range
            copy_pud_range
              copy_pmd_range
                pmd_alloc
                  __pmd_alloc
                    pmd_alloc_one
                      page = alloc_pages(gfp, 0);
                        if (!page)
                          return NULL;
        mmput
            exit_mmap
              unmap_vmas
                unmap_single_vma
                  untrack_pfn
                    follow_phys
                      WARN_ON_ONCE(1);

    Since this vma was not generated successfully, we can clear the VM_PAT
    flag.  In this case, untrack_pfn() will not be called while cleaning up
    this vma.

    Function untrack_pfn_moved() has also been renamed to fit the new logic.

    Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Suresh Siddha <suresh.b.siddha@intel.com>
    Cc: Toshi Kani <toshi.kani@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:04 -04:00
Aristeu Rozanski b36ffff80e mm/uffd: fix comment in handling pte markers
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7a079ba20090ab50d2f4203ceccd1e0f4becd1a6
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Feb 15 15:58:00 2023 -0500

    mm/uffd: fix comment in handling pte markers

    The comment is obsolete after f369b07c8614 ("mm/uffd: reset write
    protection when unregister with wp-mode", 2022-08-20).  Remove it.

    Link: https://lkml.kernel.org/r/20230215205800.223549-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski b6aad98b56 mm: introduce __vm_flags_mod and use it in untrack_pfn
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 68f48381d7fdd1cbb9d88c37a4dfbb98ac78226d
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:51 2023 -0800

    mm: introduce __vm_flags_mod and use it in untrack_pfn

    There are scenarios when vm_flags can be modified without exclusive
    mmap_lock, such as:
    - after VMA was isolated and mmap_lock was downgraded or dropped
    - in exit_mmap when there are no other mm users and locking is unnecessary
    Introduce __vm_flags_mod to avoid assertions when the caller takes
    responsibility for the required locking.
    Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for
    flags modification to avoid assertion.

    Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Aristeu Rozanski 274713ebbe mm: use a folio in copy_present_pte()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 14ddee4126fecff5c5c0a84940ba34f0bfe3e708
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:13 2023 +0000

    mm: use a folio in copy_present_pte()

    We still have to keep the page around because we need to know which page
    in the folio we're copying, but we can replace five implicit calls to
    compound_head() with one.

    Link: https://lkml.kernel.org/r/20230116191813.2145215-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski 532bb0bf59 mm: use a folio in copy_pte_range()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit edf5047058395c89a912783ea29ec8f9e53be414
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:12 2023 +0000

    mm: use a folio in copy_pte_range()

    Allocate an order-0 folio instead of a page and pass it all the way down
    the call chain.  Removes dozens of calls to compound_head().

    Link: https://lkml.kernel.org/r/20230116191813.2145215-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski a9ebe2a98c mm: convert do_anonymous_page() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: d4f9565ae598 is already backported

commit cb3184deef10fdc7658fb366189864c89ad118c9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:10 2023 +0000

    mm: convert do_anonymous_page() to use a folio

    Removes six calls to compound_head(); some inline and some external.

    Link: https://lkml.kernel.org/r/20230116191813.2145215-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski e7030d52b7 mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped arches we don't support

commit 950fe885a89770619e315f9b46301eebf0aab7b3
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Jan 13 18:10:26 2023 +0100

    mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE

    __HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that
    support swp PTEs, so let's drop it.

    Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
Aristeu Rozanski 5455c3da6d mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7d4a8be0c4b2b7ffb367929d2b352651f083806b
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jan 10 13:57:22 2023 +1100

    mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

    mmu_notifier_range_update_to_read_only() was originally introduced in
    commit c6d23413f8 ("mm/mmu_notifier:
    mmu_notifier_range_update_to_read_only() helper") as an optimisation for
    device drivers that know a range has only been mapped read-only.  However
    there are no users of this feature so remove it.  As it is the only user
    of the struct mmu_notifier_range.vma field remove that also.

    Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Aristeu Rozanski d908e3177a mm: add vma_has_recency()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 8788f6781486769d9598dcaedc3fe0eb12fc3e59
Author: Yu Zhao <yuzhao@google.com>
Date:   Fri Dec 30 14:52:51 2022 -0700

    mm: add vma_has_recency()

    Add vma_has_recency() to indicate whether a VMA may exhibit temporal
    locality that the LRU algorithm relies on.

    This function returns false for VMAs marked by VM_SEQ_READ or
    VM_RAND_READ.  While the former flag indicates linear access, i.e., a
    special case of spatial locality, both flags indicate a lack of temporal
    locality, i.e., the reuse of an area within a relatively small duration.
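
    In its simplest form the check boils down to the flag test described
    above (a sketch; the in-tree helper may carry additional conditions):

      static inline bool vma_has_recency(struct vm_area_struct *vma)
      {
              /* both hints say the accessed bit is not a useful reuse signal */
              if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
                      return false;

              return true;
      }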

    "Recency" is chosen over "locality" to avoid confusion between temporal
    and spatial localities.

    Before this patch, the active/inactive LRU only ignored the accessed bit
    from VMAs marked by VM_SEQ_READ.  After this patch, the active/inactive
    LRU and MGLRU share the same logic: they both ignore the accessed bit if
    vma_has_recency() returns false.

    For the active/inactive LRU, the following fio test showed a [6, 8]%
    increase in IOPS when randomly accessing mapped files under memory
    pressure.

      kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
      kb=$((kb - 8*1024*1024))

      modprobe brd rd_nr=1 rd_size=$kb
      dd if=/dev/zero of=/dev/ram0 bs=1M

      mkfs.ext4 /dev/ram0
      mount /dev/ram0 /mnt/
      swapoff -a

      fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \
          --size=8G --rw=randrw --time_based --runtime=10m \
          --group_reporting

    The discussion that led to this patch is here [1].  Additional test
    results are available in that thread.

    [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/

    Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andrea Righi <andrea.righi@canonical.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Aristeu Rozanski 20dd56698e mm: remove zap_page_range and create zap_vma_pages
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped RISCV changes, and due missing b59c9dc4d9d47b

commit e9adcfecf572fcfaa9f8525904cf49c709974f73
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Jan 3 16:27:32 2023 -0800

    mm: remove zap_page_range and create zap_vma_pages

    zap_page_range was originally designed to unmap pages within an address
    range that could span multiple vmas.  While working on [1], it was
    discovered that all callers of zap_page_range pass a range entirely within
    a single vma.  In addition, the mmu notification call within zap_page
    range does not correctly handle ranges that span multiple vmas.  When
    crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
    the new vma should be made.

    Instead of fixing zap_page_range, do the following:
    - Create a new routine zap_vma_pages() that will remove all pages within
      the passed vma (see the sketch below).  Most users of zap_page_range
      pass the entire vma and can use this new routine.
    - For callers of zap_page_range not passing the entire vma, instead call
      zap_page_range_single().
    - Remove zap_page_range.
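
    The new routine is essentially a thin wrapper around
    zap_page_range_single() (a sketch of its shape):

      static inline void zap_vma_pages(struct vm_area_struct *vma)
      {
              zap_page_range_single(vma, vma->vm_start,
                                    vma->vm_end - vma->vm_start, NULL);
      }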

    [1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
    Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Heiko Carstens <hca@linux.ibm.com>    [s390]
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Rafael Aquini da2c6c408a mm/swap: fix race when skipping swapcache
JIRA: https://issues.redhat.com/browse/RHEL-31646
CVE: CVE-2024-26759

This patch is a backport of the following upstream commit:
commit 13ddaf26be324a7f951891ecd9ccd04466d27458
Author: Kairui Song <kasong@tencent.com>
Date:   Wed Feb 7 02:25:59 2024 +0800

    mm/swap: fix race when skipping swapcache

    When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
    swapin the same entry at the same time, they get different pages (A, B).
    Before one thread (T0) finishes the swapin and installs page (A) to the
    PTE, another thread (T1) could finish swapin of page (B), swap_free the
    entry, then swap out the possibly modified page reusing the same entry.
    It breaks the pte_same check in (T0) because the PTE value is unchanged,
    causing an ABA problem.  Thread (T0) will install a stale page (A) into the
    PTE and cause data corruption.

    One possible callstack is like this:

    CPU0                                 CPU1
    ----                                 ----
    do_swap_page()                       do_swap_page() with same entry
    <direct swapin path>                 <direct swapin path>
    <alloc page A>                       <alloc page B>
    swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
    <slow on later locks or interrupt>   <finished swapin first>
    ...                                  set_pte_at()
                                         swap_free() <- entry is free
                                         <write to page B, now page A stale>
                                         <swap out page B to same swap entry>
    pte_same() <- Check pass, PTE seems
                  unchanged, but page A
                  is stale!
    swap_free() <- page B content lost!
    set_pte_at() <- stale page A installed!

    And besides, for ZRAM, swap_free() allows the swap device to discard the
    entry content, so even if page (B) is not modified, if swap_read_folio()
    on CPU0 happens later than swap_free() on CPU1, it may also cause data
    loss.

    To fix this, reuse swapcache_prepare(), which pins the swap entry using
    the cache flag, allows only one thread to swap it in, and prevents any
    parallel code from putting the entry in the cache.  Release the pin after
    the page table lock is dropped.

    Racers just loop and wait since it's a rare and very short event.  A
    schedule_timeout_uninterruptible(1) call is added to avoid repeated page
    faults wasting too much CPU, causing a livelock, or adding too much noise
    to perf statistics.  A similar livelock issue was described in commit
    029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead").
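
    A hedged sketch of the pinning pattern described above, as a fragment of
    the direct-swapin path in do_swap_page() (heavily simplified; the real
    code has many more branches, and swapcache_clear() here stands for the
    cleanup helper this change introduces):

        if (swapcache_prepare(entry)) {
                /* Another racer owns the entry: back off and retry the fault. */
                schedule_timeout_uninterruptible(1);
                goto out;
        }
        need_clear_cache = true;

        /* ... direct swapin, pte_same() check, set_pte_at() ... */

        out:
        if (need_clear_cache)
                swapcache_clear(si, entry);  /* drop the pin after the PTL is released */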

    Reproducer:

    This race issue can be triggered easily using a well-constructed
    reproducer and a patched brd (with a delay in the read path) [1]:

    With the latest 6.8 mainline, race-caused data loss can be observed easily:
    $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
      Polulating 32MB of memory region...
      Keep swapping out...
      Starting round 0...
      Spawning 65536 workers...
      32746 workers spawned, wait for done...
      Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
      Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
      Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
      Round 0 Failed, 15 data loss!

    This reproducer spawns multiple threads sharing the same memory region
    using a small swap device.  Every two threads update mapped pages one by
    one in opposite directions, trying to create a race, with one dedicated
    thread that keeps swapping the data out using madvise.

    The reproducer hits the race about once every 5 minutes, so the race
    should be entirely possible in production.

    After this patch, I ran the reproducer for several hundred rounds and
    observed no data loss.

    Performance overhead is minimal, microbenchmark swapin 10G from 32G
    zram:

    Before:     10934698 us
    After:      11157121 us
    Cached:     13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

    [kasong@tencent.com: v4]
      Link: https://lkml.kernel.org/r/20240219082040.7495-1-ryncsn@gmail.com
    Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com
    Fixes: 0bcac06f27 ("mm, swap: skip swapcache for swapin of synchronous device")
    Reported-by: "Huang, Ying" <ying.huang@intel.com>
    Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
    Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
    Signed-off-by: Kairui Song <kasong@tencent.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Chris Li <chrisl@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Barry Song <21cnbao@gmail.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-04-11 16:08:24 -04:00
Audra Mitchell 4efa595c94 mm: hwpoison: support recovery from ksm_might_need_to_copy()
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    minor context conflict due to out of order backport related to the v6.1 update
    e6131c89a5 ("mm/swapoff: allow pte_offset_map[_lock]() to fail")

This patch is a backport of the following upstream commit:
commit 6b970599e807ea95c653926d41b095a92fd381e2
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Dec 9 15:28:01 2022 +0800

    mm: hwpoison: support recovery from ksm_might_need_to_copy()

    When the kernel copies a page from ksm_might_need_to_copy() but runs into
    an uncorrectable error, it will crash, since the poisoned page is consumed
    by the kernel.  This is similar to the issue recently fixed by
    copy-on-write poison recovery.

    When an error is detected during the page copy, return VM_FAULT_HWPOISON
    in do_swap_page(), and install a hwpoison entry in unuse_pte() during
    swapoff, which helps us avoid a system crash.  Note that memory failure on
    a KSM page will be skipped, but memory_failure_queue() is still called to
    be consistent with the general memory failure process; we could support
    KSM page recovery in the future.

    [wangkefeng.wang@huawei.com: enhance unuse_pte(), fix issue found by lkp]
      Link: https://lkml.kernel.org/r/20221213120523.141588-1-wangkefeng.wang@huawei.com
    [wangkefeng.wang@huawei.com: update changelog, alter ksm_might_need_to_copy(), restore unlikely() in unuse_pte()]
      Link: https://lkml.kernel.org/r/20230201074433.96641-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20221209072801.193221-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:03 -04:00
Audra Mitchell e2dcaac9a6 mm: remove VM_FAULT_WRITE
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context differences due to out of order backport
    9e9103fead ("mm: convert wp_page_copy() to use folios")

This patch is a backport of the following upstream commit:
commit cb8d863313436339fb60f7dd5131af2e5854621e
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 21 12:11:35 2022 +0200

    mm: remove VM_FAULT_WRITE

    All users -- GUP and KSM -- are gone, let's just remove it.

    Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:01 -04:00
Audra Mitchell c234788260 mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit aea06577a9005ca81c35196d6171cac346d3b251
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:46 2022 +0100

    mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings

    If we already have a PMD/PUD mapped write-protected in a private mapping
    and we want to break COW either due to FAULT_FLAG_WRITE or
    FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on
    the PTE path.

    Let's just split (->zap) + fallback in that case.

    This is a preparation for more generic FAULT_FLAG_UNSHARE support in
    COW mappings.

    Link: https://lkml.kernel.org/r/20221116102659.70287-8-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
Audra Mitchell f9d1d9a9f0 mm: add early FAULT_FLAG_WRITE consistency checks
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 79881fed6052a9ce00cfb63297832b9faacf8cf3
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:44 2022 +0100

    mm: add early FAULT_FLAG_WRITE consistency checks

    Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
    care in all other handlers and might get "surprises" if we forget to do
    so.

    Write faults without VM_MAYWRITE don't make any sense, and our
    maybe_mkwrite() logic could have hidden such abuse for now.

    Write faults without VM_WRITE on something that is not a COW mapping is
    similarly broken, and e.g., do_wp_page() could end up placing an
    anonymous page into a shared mapping, which would be bad.

    This is a preparation for reliable R/O long-term pinning of pages in
    private mappings, whereby we want to make sure that we will never break
    COW in a read-only private mapping.
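
    A rough sketch of the kind of early check described (placement and exact
    form are illustrative only; flags/vma are as seen where fault flags get
    validated):

        if (flags & FAULT_FLAG_WRITE) {
                /* Write faults without VM_MAYWRITE make no sense. */
                if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
                        return VM_FAULT_SIGSEGV;
                /* Writes to !VM_WRITE mappings are only valid for COW mappings. */
                if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
                                 !is_cow_mapping(vma->vm_flags)))
                        return VM_FAULT_SIGSEGV;
        }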

    Link: https://lkml.kernel.org/r/20221116102659.70287-6-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
Audra Mitchell d115504bd5 mm: add early FAULT_FLAG_UNSHARE consistency checks
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context difference due to out of order backports:
    c007e2df2e ("mm/hugetlb: fix uffd wr-protection for CoW optimization path")
    92a1aa89946b ("mm: rework handling in do_wp_page() based on private vs. shared mappings")
    887f390a3d60 ("mm: ptep_get() conversion")

This patch is a backport of the following upstream commit:
commit cdc5021cda194112bc0962d6a0e90b379968c504
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:43 2022 +0100

    mm: add early FAULT_FLAG_UNSHARE consistency checks

    For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
    implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
    dealing with a COW mapping, such that we treat it like a read fault as
    documented and don't have to worry about the flag throughout all fault
    handlers.

    While at it, centralize the check for mutual exclusion of
    FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
    either flag is set in the WP handler.

    Link: https://lkml.kernel.org/r/20221116102659.70287-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
Audra Mitchell 48f4b25890 mm: mmu_gather: do not expose delayed_rmap flag
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit f036c8184f8b6750fa642485fb01eb6ff036a86b
Author: Alexander Gordeev <agordeev@linux.ibm.com>
Date:   Wed Nov 16 08:49:30 2022 +0100

    mm: mmu_gather: do not expose delayed_rmap flag

    Flag delayed_rmap of 'struct mmu_gather' is rather a private member, but
    it is still accessed directly.  Instead, let the TLB gather code access
    the flag.

    Link: https://lkml.kernel.org/r/Y3SWCu6NRaMQ5dbD@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com
    Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:56 -04:00
Audra Mitchell d2722a500f mm: delay page_remove_rmap() until after the TLB has been flushed
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 5df397dec7c4c08c23bd14f162f1228836faa4ce
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Nov 9 12:30:51 2022 -0800

    mm: delay page_remove_rmap() until after the TLB has been flushed

    When we remove a page table entry, we are very careful to only free the
    page after we have flushed the TLB, because other CPUs could still be
    using the page through stale TLB entries until after the flush.

    However, we have removed the rmap entry for that page early, which means
    that functions like folio_mkclean() would end up not serializing with the
    page table lock because the page had already been made invisible to rmap.

    And that is a problem, because while the TLB entry exists, we could end up
    with the following situation:

     (a) one CPU could come in and clean it, never seeing our mapping of the
         page

     (b) another CPU could continue to use the stale and dirty TLB entry and
         continue to write to said page

    resulting in a page that has been dirtied, but then marked clean again,
    all while another CPU might have dirtied it some more.

    End result: possibly lost dirty data.

    This extends our current TLB gather infrastructure to optionally track a
    "should I do a delayed page_remove_rmap() for this page after flushing the
    TLB".  It uses the newly introduced 'encoded page pointer' to do that
    without having to keep separate data around.

    Note, this is complicated by a couple of issues:

     - we want to delay the rmap removal, but not past the page table lock,
       because that simplifies the memcg accounting

     - only SMP configurations want to delay TLB flushing, since on UP
       there are obviously no remote TLBs to worry about, and the page
       table lock means there are no preemption issues either

     - s390 has its own mmu_gather model that doesn't delay TLB flushing,
       and as a result also does not want the delayed rmap. As such, we can
       treat S390 like the UP case and use a common fallback for the "no
       delays" case.

     - we can track an enormous number of pages in our mmu_gather structure,
       with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
       all set up to be approximately 10k pending pages.

       We do not want to have a huge number of batched pages that we then
       need to check for delayed rmap handling inside the page table lock.

    Particularly that last point results in a noteworthy detail, where the
    normal page batch gathering is limited once we have delayed rmaps pending,
    in such a way that only the last batch (the so-called "active batch") in
    the mmu_gather structure can have any delayed entries.

    NOTE!  While the "possibly lost dirty data" sounds catastrophic, for this
    all to happen you need to have a user thread doing either madvise() with
    MADV_DONTNEED or a full re-mmap() of the area concurrently with another
    thread continuing to use said mapping.

    So arguably this is about user space doing crazy things, but from a VM
    consistency standpoint it's better if we track the dirty bit properly even
    when user space goes off the rails.

    [akpm@linux-foundation.org: fix UP build, per Linus]
    Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
    Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Reported-by: Nadav Amit <nadav.amit@gmail.com>
    Tested-by: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:56 -04:00
Audra Mitchell c2df118635 mm: always compile in pte markers
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context conflicts due to out of order backport:
    804153f7df ("mm: use pte markers for swap errors")

This patch is a backport of the following upstream commit:
commit ca92ea3dc5a2b01f98e9f02b7a6bc03be06fe124
Author: Peter Xu <peterx@redhat.com>
Date:   Sun Oct 30 17:41:50 2022 -0400

    mm: always compile in pte markers

    Patch series "mm: Use pte marker for swapin errors".

    This series uses the pte marker to replace the swapin error swap entry,
    then we save one more swap entry slot for swap devices.  A new pte marker
    bit is defined.

    This patch (of 2):

    The PTE markers code is tiny and it is now enabled for most
    distributions.  It's fine to keep it as-is, but to make broader use of
    it (e.g. replacing the read error swap entry) it needs to be there
    always, otherwise we need a special code path to take care of the
    !PTE_MARKER case.

    It'll be easier to just make pte markers always exist.  Use this chance
    to extend their usage to anonymous memory too by simply touching up some
    of the old comments, because they'll be used for anonymous pages in the
    follow-up patches.

    Link: https://lkml.kernel.org/r/20221030214151.402274-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221030214151.402274-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Huang Ying <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:56 -04:00
Chris von Recklinghausen 7096ad3b1e mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 8d6a0ac09a16c026e1e2a03a61e12e95c48a25a6
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:47 2022 +0100

    mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping

    Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a
    COW (i.e., private writable) mapping and adjust the documentation
    accordingly.

    FAULT_FLAG_UNSHARE will now also break COW when encountering the shared
    zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by
    properly replacing the mapped page/pfn by a private copy (an exclusive
    anonymous page).

    Note that only do_wp_page() needs care: hugetlb_wp() already handles
    FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it
    correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE
    such that we can handle FAULT_FLAG_UNSHARE on the PTE level.

    This change is a requirement for reliable long-term R/O pinning in
    COW mappings.

    Link: https://lkml.kernel.org/r/20221116102659.70287-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen d43c5e6f48 mm: rework handling in do_wp_page() based on private vs. shared mappings
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit b9086fde6d44e8a95dc95b822bd87386129b832d
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:45 2022 +0100

    mm: rework handling in do_wp_page() based on private vs. shared mappings

    We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a
    COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages.
    Let's prepare for that by handling shared mappings first such that we can
    handle private mappings last.

    While at it, use folio-based functions instead of page-based functions
    where we touch the code either way.

    Link: https://lkml.kernel.org/r/20221116102659.70287-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen 26437a89ef mm: remove the vma linked list
Conflicts:
	include/linux/mm.h - We already have
		21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
		so keep declaration for zap_page_range_single
	kernel/fork.c - We already have
		f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
		so keep declaration of i
	mm/mmap.c - We already have
		a1e8cb93bf ("mm: drop oom code from exit_mmap")
		and
		db3644c677 ("mm: delete unused MMF_OOM_VICTIM flag")
		so keep setting MMF_OOM_SKIP in mm->flags

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 763ecb035029f500d7e6dc99acd1ad299b7726a1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:06 2022 +0000

    mm: remove the vma linked list

    Replace any vm_next use with vma_find().

    Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
    maple tree.

    Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
    the same time, alter the loop to be more compact.

    Now that free_pgtables() and unmap_vmas() take a maple tree as an
    argument, rearrange do_mas_align_munmap() to use the new tree to hold the
    vmas to remove.

    Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
    used to update the linked list.

    Drop linked list update from __insert_vm_struct().

    Rework validation of tree as it was depending on the linked list.
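
    For reference, a hedged sketch of what replacing the linked-list walk
    looks like for a typical caller (illustrative, not a hunk from this
    commit; do_something() is a placeholder):

        /* Before: walk the vma linked list. */
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                do_something(vma);

        /* After: iterate the maple tree instead. */
        VMA_ITERATOR(vmi, mm, 0);
        for_each_vma(vmi, vma)
                do_something(vma);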

    [yang.lee@linux.alibaba.com: fix one kernel-doc comment]
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
      Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:57 -04:00
Prarit Bhargava 25cf7e4e50 mm: Make pte_mkwrite() take a VMA
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: This is a rip-and-replace of the one-argument pte_mkwrite() with
the two-argument pte_mkwrite().  There are uses upstream that are not yet
in RHEL9.

commit 161e393c0f63592a3b95bdd8b55752653763fc6d
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date:   Mon Jun 12 17:10:29 2023 -0700

    mm: Make pte_mkwrite() take a VMA

    The x86 Shadow stack feature includes a new type of memory called shadow
    stack. This shadow stack memory has some unusual properties, which requires
    some core mm changes to function properly.

    One of these unusual properties is that shadow stack memory is writable,
    but only in limited ways. These limits are applied via a specific PTE
    bit combination. Nevertheless, the memory is writable, and core mm code
    will need to apply the writable permissions in the typical paths that
    call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
    that the x86 implementation of it can know whether to create regular
    writable or shadow stack mappings.

    But there are a couple of challenges to this. Modifying the signatures of
    each arch pte_mkwrite() implementation would be error prone because some
    are generated with macros and would need to be re-implemented. Also, some
    pte_mkwrite() callers operate on kernel memory without a VMA.

    So this can be done in a three step process. First pte_mkwrite() can be
    renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
    added that just calls pte_mkwrite_novma(). Next callers without a VMA can
    be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
    can be changed to take/pass a VMA.

    Previous work renamed pte_mkwrite() to pte_mkwrite_novma() and converted
    callers that don't have a VMA to use pte_mkwrite_novma(). So now change
    pte_mkwrite() to take a VMA and change the remaining callers to pass a
    VMA. Apply the same changes for pmd_mkwrite().

    No functional change.
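
    A hedged sketch of the resulting generic helpers (simplified from the
    description above; architectures can override pte_mkwrite() to inspect
    the VMA, e.g. for shadow stack):

        /* Generic fallback: most architectures ignore the vma. */
        static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
        {
                return pte_mkwrite_novma(pte);
        }

        static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
        {
                return pmd_mkwrite_novma(pmd);
        }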

    Suggested-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com

Omitted-fix: f441ff73f1ec powerpc: Fix pud_mkwrite() definition after pte_mkwrite() API changes
	pud_mkwrite() not in RHEL9 code for powerpc (removed previously)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:13 -04:00
Jerry Snitselaar efb6748971 mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
JIRA: https://issues.redhat.com/browse/RHEL-26541
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff due to some commits not being backported yet such as c33c794828f2 ("mm: ptep_get() conversion"),
           and 959a78b6dd45 ("mm/hugetlb: use a folio in hugetlb_wp()").

commit ec8832d007cb7b50229ad5745eec35b847cc9120
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jul 25 23:42:06 2023 +1000

    mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

    Secondary TLBs are now invalidated from the architecture specific TLB
    invalidation functions.  Therefore there is no need to explicitly notify
    or invalidate as part of the range end functions.  This means we can
    remove mmu_notifier_invalidate_range_end_only() and some of the
    ptep_*_notify() functions.

    Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrew Donnellan <ajd@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
    Cc: Frederic Barrat <fbarrat@linux.ibm.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kevin Tian <kevin.tian@intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Nicolin Chen <nicolinc@nvidia.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zhi Wang <zhi.wang.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit ec8832d007cb7b50229ad5745eec35b847cc9120)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-02-26 15:49:51 -07:00
Mika Penttilä 7ef8f6ec98 mm: fix a few rare cases of using swapin error pte marker
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc7

commit 7e3ce3f8d2d235f916baad1582f6cf12e0319013
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Wed Dec 14 15:04:53 2022 -0500
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Jan 18 17:02:19 2023 -0800

    This patch hardens commit 15520a3f0469 ("mm: use pte markers for
    swap errors") against a few corner cases in its use of pte markers for
    swapin errors.

    1. Propagate swapin errors across fork(): if there are swapin errors in
       the parent mm, then after fork() the child should SIGBUS too when an
       error page is accessed.

    2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte
       marker can be quickly switched to a swapin error.

    3. Explicitly ignore swapin error pte markers in change_protection().

    I mostly don't worry about (2) or (3) at all, but we should still have
    them.  Case (1) is special because it can potentially cause silent data
    corruption in the child when the parent has a swapin error triggered with
    swapoff, but since a swapin error is rare in itself it's probably not
    easy to trigger either.

    Currently there is a priority difference between the uffd-wp bit and the
    swapin error entry, in which the swapin error always has higher priority
    (e.g.  we don't need to wr-protect a swapin error pte marker).

    If there will be a 3rd bit introduced, we'll probably need to consider a
    more involved approach so we may need to start operate on the bits.  Let's
    leave that for later.

    This patch is tested with case (1) explicitly where we'll get corrupted
    data before in the child if there's existing swapin error pte markers, and
    after patch applied the child can be rightfully killed.

    We don't need to copy stable for this one since 15520a3f0469 just landed
    as part of v6.2-rc1, only "Fixes" applied.

    Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
    Fixes: 15520a3f0469 ("mm: use pte markers for swap errors")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Pengfei Xu <pengfei.xu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-30 07:03:06 +02:00
Mika Penttilä 6b269e16a3 mm/uffd: fix pte marker when fork() without fork event
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc7

commit 49d6d7fb631345b0f2957a7c4be24ad63903150f
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Wed Dec 14 15:04:52 2022 -0500
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Jan 18 17:02:19 2023 -0800

    Patch series "mm: Fixes on pte markers".

    Patch 1 resolves the syzkaller report from Pengfei.

    Patch 2 further hardens pte markers when used with the recent swapin error
    markers.  The major case is that we should persist a swapin error marker
    after fork(), so the child doesn't read a corrupted page.


    This patch (of 2):

    On fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
    have it and has a pte marker installed.  The warning is improper, along
    with the comment.  The right thing is to inherit the pte marker when
    needed, or keep the dst pte empty.

    A vague guess is that this happened by accident when a prior patch
    introduced src/dst vma into this helper while the uffd-wp feature was
    being developed, and I probably messed up the rebase, since if we replace
    dst_vma with src_vma the warning and comment all make sense too.

    Hugetlb did exactly the right thing here (copy_hugetlb_page_range()).
    Fix the general path.

    Reproducer:

    https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c

    Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808

    Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
    Fixes: c56d1b62cce8 ("mm/shmem: handle uffd-wp during fork()")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: Pengfei Xu <pengfei.xu@intel.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: <stable@vger.kernel.org> # 5.19+
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-30 07:03:06 +02:00
Mika Penttilä 804153f7df mm: use pte markers for swap errors
JIRA: https://issues.redhat.com/browse/RHEL-1349
Upstream Status: v6.2-rc1

commit 15520a3f046998e3f57e695743e99b0875e2dae7
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Sun Oct 30 17:41:51 2022 -0400
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Wed Nov 30 15:58:46 2022 -0800

    PTE markers are an ideal mechanism for things like SWP_SWAPIN_ERROR.
    Using a whole swap entry type for this purpose can be overkill, especially
    if we already have PTE markers.  Define a new bit for swapin errors and
    replace the swap entry type with pte markers.  Then we can safely drop
    SWP_SWAPIN_ERROR and give one device slot back to swap.

    We used to have SWP_SWAPIN_ERROR taking the page pfn as part of the swap
    entry, but it's never used.  Neither do I see how it can be useful,
    because normally the swapin failure should not be caused by a bad page
    but by a bad swap device.  Drop it as well.

    Link: https://lkml.kernel.org/r/20221030214151.402274-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Huang Ying <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
2023-10-26 06:54:59 +03:00
Chris von Recklinghausen 41172cafb6 mm/memory: handle_pte_fault() use pte_offset_map_nolock()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c7ad08804fae5baa7f71c0790038e8259e1066a5
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:45:05 2023 -0700

    mm/memory: handle_pte_fault() use pte_offset_map_nolock()

    handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
    corresponds to vmf.pte, instead of pte_lockptr() being used later, when
    there's a chance that the pmd entry might have changed, perhaps to none,
    or to a huge pmd, with no split ptlock in its struct page.

    Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock() will
    handle that case by failing.  Update the "morph" comment above, looking
    forward to when shmem or file collapse to THP may not take mmap_lock for
    write (or not at all).

    do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
    refresh it when refreshing vmf->pte.

    do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
    two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
    do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
    so adjust that to set vmf->ptl by pte_offset_map_nolock().
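
    The calling pattern, as a hedged sketch (simplified from the description
    above; error handling in the real code differs per call site):

        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
                                         vmf->address, &vmf->ptl);
        if (unlikely(!vmf->pte))        /* pmd changed under us: no page table */
                return 0;
        vmf->orig_pte = ptep_get_lockless(vmf->pte);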

    Link: https://lkml.kernel.org/r/c1107654-3929-60ac-223e-6877cbb86065@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:20 -04:00
Chris von Recklinghausen efe1a9d970 mm/memory: allow pte_offset_map[_lock]() to fail
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3db82b9374ca921b8b820a75e83809d5c4133d8f
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:43:38 2023 -0700

    mm/memory: allow pte_offset_map[_lock]() to fail

    copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
    but with a comment on some further assumptions that are being made there.

    zap_pte_range() and zap_pmd_range(): adjust their interaction so that a
    pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
    zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().

    Allow pte_offset_map_lock() to fail in many functions.  Update comment on
    calling pte_alloc() in do_anonymous_page().  Remove redundant calls to
    pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
    pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range() and
    copy_pmd_range(), those do simplify the next level down.

    Link: https://lkml.kernel.org/r/bb548d50-e99a-f29e-eab1-a43bef2a1287@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:20 -04:00
Chris von Recklinghausen 176bb35f89 mm: use pmdp_get_lockless() without surplus barrier()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 26e1a0c3277d7f43856ec424902423be212cc178
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:06:53 2023 -0700

    mm: use pmdp_get_lockless() without surplus barrier()

    Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.

    What is it all about?  Some mmap_lock avoidance i.e.  latency reduction.
    Initially just for the case of collapsing shmem or file pages to THPs; but
    likely to be relied upon later in other contexts e.g.  freeing of empty
    page tables (but that's not work I'm doing).  mmap_write_lock avoidance
    when collapsing to anon THPs?  Perhaps, but again that's not work I've
    done: a quick attempt was not as easy as the shmem/file case.

    I would much prefer not to have to make these small but wide-ranging
    changes for such a niche case; but failed to find another way, and have
    heard that shmem MADV_COLLAPSE's usefulness is being limited by that
    mmap_write_lock it currently requires.

    These changes (though of course not these exact patches) have been in
    Google's data centre kernel for three years now: we do rely upon them.

    What is this preparatory series about?

    The current mmap locking will not be enough to guard against that tricky
    transition between pmd entry pointing to page table, and empty pmd entry,
    and pmd entry pointing to huge page: pte_offset_map() will have to
    validate the pmd entry for itself, returning NULL if no page table is
    there.  What to do about that varies: sometimes nearby error handling
    indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
    again" is appropriate (and if that risks an infinite loop, then there must
    have been an oops, or pfn 0 mistaken for page table, before).

    Given the likely extension to freeing empty page tables, I have not
    limited this set of changes to a THP config; and it has been easier, and
    sets a better example, if each site is given appropriate handling: even
    where deeper study might prove that failure could only happen if the pmd
    table were corrupted.

    Several of the patches are, or include, cleanup on the way; and by the
    end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
    pte_offset_map_lock() then handle those original races and more.  Most
    uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
    its place.

    This patch (of 32):

    Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
    reliable result with PAE (or READ_ONCE as before without PAE); and remove
    the unnecessary extra barrier()s which got left behind in its callers.

    HOWEVER: Note the small print in linux/pgtable.h, where it was designed
    specifically for fast GUP, and depends on interrupts being disabled for
    its full guarantee: most callers which have been added (here and before)
    do NOT have interrupts disabled, so there is still some need for caution.
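
    In code, the conversion is essentially the following (a sketch of the
    pattern, not a specific hunk from this series):

        /* Before: plain racy read plus a leftover compiler barrier. */
        pmd_t pmdval = READ_ONCE(*vmf->pmd);
        barrier();

        /* After: one helper, also safe for PAE's two-word pmd entries. */
        pmd_t pmdval = pmdp_get_lockless(vmf->pmd);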

    Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:11 -04:00
Chris von Recklinghausen a68e16dd11 mm: fix failure to unmap pte on highmem systems
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3b65f437d9e8dd696a2b88e7afcd51385532ab35
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Fri Jun 2 10:29:49 2023 +0100

    mm: fix failure to unmap pte on highmem systems

    The loser of a race to service a pte for a device private entry in the
    swap path previously unlocked the ptl, but failed to unmap the pte.  This
    only affects highmem systems since unmapping a pte is a noop on
    non-highmem systems.
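
    The fix boils down to the following (sketch of the losing racer's exit
    path, not the full hunk):

        /* Before: drops the lock but leaks the highmem kmap of the pte. */
        spin_unlock(vmf->ptl);

        /* After: unmap the pte and drop the lock together. */
        pte_unmap_unlock(vmf->pte, vmf->ptl);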

    Link: https://lkml.kernel.org/r/20230602092949.545577-5-ryan.roberts@arm.com
    Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:10 -04:00
Chris von Recklinghausen c2aa4ee6d2 mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 2bad466cc9d9b4c3b4b16eb9c03c919b59561316
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Mar 9 17:37:10 2023 -0500

    mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

    Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.

    The new feature bit makes anonymous memory act the same as file memory on
    userfaultfd-wp in that it'll also wr-protect none ptes.

    It can be useful in two cases:

    (1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot,
        so pre-fault can be replaced by enabling this flag and speed up
        protections

    (2) It helps to implement async uffd-wp mode that Muhammad is working on [1]

    It's debatable whether this is the most ideal solution, because with the
    new feature bit set, wr-protecting none ptes needs to pre-populate the
    pgtables to the last level (PAGE_SIZE).  But it seems fine so far for
    servicing either purpose above, so we can leave optimizations for later.

    The series brings pte markers to anonymous memory too.  There's some
    change in the common mm code path in the 1st patch, great to have some eye
    looking at it, but hopefully they're still relatively straightforward.

    This patch (of 2):

    This is a new feature that controls how uffd-wp handles none ptes.  When
    it's set, the kernel will handle anonymous memory the same way as file
    memory, by allowing the user to wr-protect unpopulated ptes.

    File memory handles none ptes consistently by allowing wr-protection of
    none ptes, because it cannot know whether a page cache page exists or not.
    For anonymous memory it was not as consistent, because we used to assume
    that we don't need protection on none ptes or known zero pages.

    One use case of such a feature bit was VM live snapshot, where if without
    wr-protecting empty ptes the snapshot can contain random rubbish in the
    holes of the anonymous memory, which can cause misbehave of the guest when
    the guest OS assumes the pages should be all zeros.

    QEMU worked around it by pre-populating the section with reads to fill in
    zero page entries before starting the whole snapshot process [1].

    Recently there's another need raised on using userfaultfd wr-protect for
    detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
    case if without being able to wr-protect none ptes by default, the dirty
    info can get lost, since we cannot treat every none pte to be dirty (the
    current design is identify a page dirty based on uffd-wp bit being
    cleared).

    In general, we want to be able to wr-protect empty ptes too even for
    anonymous.

    This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
    uffd-wp handling on none ptes being consistent no matter what the memory
    type is underneath.  It doesn't have any impact on file memories so far
    because we already have pte markers taking care of that.  So it only
    affects anonymous.

    The feature bit is off by default, so the old behavior will be maintained.
    That may sometimes be wanted, because wr-protecting none ptes adds
    overhead not only during UFFDIO_WRITEPROTECT (by applying pte markers to
    anonymous memory), but also when creating the pgtables to store the pte
    markers.  So there's potentially less chance of using thp on the first
    fault for a none pmd or larger than a pmd.

    The major implementation part is teaching the whole kernel to understand
    pte markers even for anonymously mapped ranges, meanwhile allowing the
    UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
    new feature bit is set.
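
    From userspace, opting in is just another feature bit at UFFDIO_API time;
    a hedged fragment (error handling omitted):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/userfaultfd.h>

        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_WP_UNPOPULATED,
        };

        ioctl(uffd, UFFDIO_API, &api);
        /* Then UFFDIO_REGISTER with the WP mode and UFFDIO_WRITEPROTECT can
         * wr-protect ranges whose ptes are still unpopulated. */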

    Note that even if the patch subject starts with mm/uffd, there're a few
    small refactors to major mm path of handling anonymous page faults.  But
    they should be straightforward.

    With WP_UNPOPULATED, applications like QEMU can avoid pre-read faulting
    all of the memory before wr-protecting it when taking a live snapshot.
    Quoting Muhammad's test results here [3], based on a simple program [4]:

      (1) With huge page disabled
      echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
      ./uffd_wp_perf
      Test DEFAULT: 4
      Test PRE-READ: 1111453 (pre-fault 1101011)
      Test MADVISE: 278276 (pre-fault 266378)
      Test WP-UNPOPULATE: 11712

      (2) With Huge page enabled
      echo always > /sys/kernel/mm/transparent_hugepage/enabled
      ./uffd_wp_perf
      Test DEFAULT: 4
      Test PRE-READ: 22521 (pre-fault 22348)
      Test MADVISE: 4909 (pre-fault 4743)
      Test WP-UNPOPULATE: 14448

    There'll be a great perf boost for the no-thp case.  With thp enabled, in
    the extreme all-thp-zero case WP_UNPOPULATED can be slower than MADVISE,
    but that is unlikely in reality; also, the overhead is not reduced but
    postponed until a follow-up write on any huge zero thp, so it is
    potentially faster overall by making the follow-up writes slower.

    [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
    [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
    [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
    [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

    [peterx@redhat.com: comment changes, oneliner fix to khugepaged]
      Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
    Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Paul Gofman <pgofman@codeweavers.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:02 -04:00
Chris von Recklinghausen 9e9103fead mm: convert wp_page_copy() to use folios
Conflicts: mm/memory.c - We don't have
	7d4a8be0c4b2 ("mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export")
	so call mmu_notifier_range_init with both mm and vma (context)

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 28d41a4863316321bb5aa616bd82d65c84fc0f8b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:11 2023 +0000

    mm: convert wp_page_copy() to use folios

    Use new_folio instead of new_page throughout, because we allocated it
    and know it's an order-0 folio.  Most old_page uses become old_folio,
    but use vmf->page where we need the precise page.

    Link: https://lkml.kernel.org/r/20230116191813.2145215-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:54 -04:00
Chris von Recklinghausen ad7b024ba7 mm: add vma_alloc_zeroed_movable_folio()
Conflicts: drop changes to arch/alpha/include/asm/page.h
	arch/ia64/include/asm/page.h arch/m68k/include/asm/page_no.h -
		unsupported arches

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6bc56a4d855303705802c5ede4625973637484c7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:18:09 2023 +0000

    mm: add vma_alloc_zeroed_movable_folio()

    Replace alloc_zeroed_user_highpage_movable().  The main difference is
    returning a folio containing a single page instead of returning the page,
    but take the opportunity to rename the function to match other allocation
    functions a little better and rewrite the documentation to place more
    emphasis on the zeroing rather than the highmem aspect.
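
    A hedged sketch of the generic form of the new helper (architectures may
    provide their own variant):

        static inline struct folio *
        vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
                                       unsigned long vaddr)
        {
                struct folio *folio;

                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
                if (folio)
                        clear_user_highpage(&folio->page, vaddr);

                return folio;
        }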

    Link: https://lkml.kernel.org/r/20230116191813.2145215-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:53 -04:00
Chris von Recklinghausen c28dba63db mm/memory: add vm_normal_folio()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 318e9342fbbb6888d903d86e83865609901a1c65
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Wed Dec 21 10:08:45 2022 -0800

    mm/memory: add vm_normal_folio()

    Patch series "Convert deactivate_page() to folio_deactivate()", v4.

    Deactivate_page() has already been converted to use folios.  This patch
    series modifies the callers of deactivate_page() to use folios.  It also
    introduces vm_normal_folio() to assist with folio conversions, and
    converts deactivate_page() to folio_deactivate() which takes in a folio.

    This patch (of 4):

    Introduce a wrapper function called vm_normal_folio().  This function
    calls vm_normal_page() and returns the folio of the page found, or null if
    no page is found.

    This function allows callers to get a folio from a pte, which will
    eventually allow them to completely replace their struct page variables
    with struct folio instead.
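
    The wrapper is tiny; a sketch of its shape:

        struct folio *vm_normal_folio(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t pte)
        {
                struct page *page = vm_normal_page(vma, addr, pte);

                if (page)
                        return page_folio(page);
                return NULL;
        }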

    Link: https://lkml.kernel.org/r/20221221180848.20774-1-vishal.moola@gmail.com
    Link: https://lkml.kernel.org/r/20221221180848.20774-2-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:47 -04:00
Chris von Recklinghausen 653ae76632 mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
Conflicts: mm/userfaultfd.c - RHEL-only patch
	8e95bedaa1a ("mm: Fix CVE-2022-2590 by reverting "mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"")
	causes a merge conflict with this patch. Since upstream commit
	5535be309971 ("mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW")
	actually fixes the CVE we can safely remove the conflicted lines
	and replace them with the lines the upstream version of this
	patch adds

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1eb1bacfba9019823b2fce42383f010cd561fa6
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Dec 14 15:15:33 2022 -0500

    mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()

    This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.

    The reasons I still think this patch is worthwhile, are:

      (1) It is a cleanup already; diffstat tells.

      (2) It just feels natural after I thought about this: if the pte is uffd
          protected, let's remove the write bit no matter what it was.

      (3) Since x86 is the only arch that supports uffd-wp, it also redefines
          pte|pmd_mkuffd_wp() in that it should always contain removals of
          write bits.  It means any future arch that wants to implement uffd-wp
          should naturally follow this rule too.  It's good to make it a
          default, even if with vm_page_prot changes on VM_UFFD_WP.

      (4) It covers more than vm_page_prot.  So no chance of any potential
          future "accident" (like pte_mkdirty() on sparc64 or loongarch, even
          though it just got its pte_mkdirty fixed <1 month ago).  It'll be
          fairly clear when reading the code too that we don't need to worry
          about the state of the write bit before a pte_mkuffd_wp().

    We may call pte_wrprotect() one more time in some paths (e.g.  thp split),
    but that should be fully local bitop instruction so the overhead should be
    negligible.
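
    On x86 the redefinition can be sketched as below (assuming the existing
    _PAGE_UFFD_WP software bit), so that marking an entry uffd-wp always
    clears the write bit as a side effect:

        /* x86 sketch: uffd-wp implies write-protected */
        static inline pte_t pte_mkuffd_wp(pte_t pte)
        {
                return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
        }

        static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
        {
                return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD_WP));
        }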

    Although this patch should logically also fix the recently reported
    uffd-wp issues on page migration (not NUMA hint recovery - that may need
    another explicit pte_wrprotect), that is not the plan for fixing them.
    So no Fixes tag, and stable doesn't need this.

    Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ives van Hoorne <ives@codesandbox.io>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
Chris von Recklinghausen 4808276894 mm, hwpoison: when copy-on-write hits poison, take page offline
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d302c2398ba269e788a4f37ae57c07a7fcabaa42
Author: Tony Luck <tony.luck@intel.com>
Date:   Fri Oct 21 13:01:20 2022 -0700

    mm, hwpoison: when copy-on-write hits poison, take page offline

    memory_failure() cannot be called directly from the fault handler
    because mmap_lock (and other locks) are held.

    It is important, but not urgent, to mark the source page as h/w poisoned
    and unmap it from other tasks.

    Use memory_failure_queue() to request a call to memory_failure() for the
    page with the error.

    Also provide a stub version for CONFIG_MEMORY_FAILURE=n
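
    Roughly, a no-op stub keeps CONFIG_MEMORY_FAILURE=n builds working, and
    the COW path queues the pfn of the bad source page (sketch; the
    cow_report_poison() wrapper below is hypothetical and only marks the
    call site):

        #ifdef CONFIG_MEMORY_FAILURE
        void memory_failure_queue(unsigned long pfn, int flags);
        #else
        static inline void memory_failure_queue(unsigned long pfn, int flags)
        {
        }
        #endif

        /* hypothetical call-site wrapper: the source page hit poison during COW */
        static void cow_report_poison(struct page *src)
        {
                /* schedule memory_failure() outside the fault path */
                memory_failure_queue(page_to_pfn(src), 0);
        }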

    Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Shuai Xue <xueshuai@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:22 -04:00
Chris von Recklinghausen 360555fbb4 mm, hwpoison: try to recover from copy-on write faults
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a873dfe1032a132bf89f9e19a6ac44f5a0b78754
Author: Tony Luck <tony.luck@intel.com>
Date:   Fri Oct 21 13:01:19 2022 -0700

    mm, hwpoison: try to recover from copy-on write faults

    Patch series "Copy-on-write poison recovery", v3.

    Part 1 deals with the process that triggered the copy on write fault with
    a store to a shared read-only page.  That process is sent a SIGBUS with
    the usual machine check decoration to specify the virtual address of the
    lost page, together with the scope.

    Part 2 sets up to asynchronously take the page with the uncorrected error
    offline to prevent additional machine check faults.  H/t to Miaohe Lin
    <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com> for
    pointing me to the existing function to queue a call to memory_failure().

    On x86 there is some duplicate reporting (because the error is signalled
    both by the memory controller and by the core that triggered the machine
    check).  Console logs look like this:

    This patch (of 2):

    If the kernel is copying a page as the result of a copy-on-write
    fault and runs into an uncorrectable error, Linux will crash because
    it does not have recovery code for this case where poison is consumed
    by the kernel.

    It is easy to set up a test case. Just inject an error into a private
    page, fork(2), and have the child process write to the page.

    I wrapped that neatly into a test at:

      git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

    just enable ACPI error injection and run:

      # ./einj_mem-uc -f copy-on-write

    Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
    on architectures where that is available (currently x86 and powerpc).
    When an error is detected during the page copy, return VM_FAULT_HWPOISON
    to the caller of wp_page_copy().  This propagates up the call stack.  Both
    x86 and powerpc have code in their fault handlers to deal with this fault
    code by sending a SIGBUS to the application.

    Note that this patch avoids a system crash and signals the process that
    triggered the copy-on-write action. It does not take any action for the
    memory error that is still in the shared page. To handle that a call to
    memory_failure() is needed. But this cannot be done from wp_page_copy()
    because it holds mmap_lock(). Perhaps the architecture fault handlers
    can deal with this loose end in a subsequent patch?

    On Intel/x86 this loose end will often be handled automatically because
    the memory controller provides an additional notification of the h/w
    poison in memory; the handler for this will call memory_failure().  This
    isn't a 100% solution. If there are multiple errors, not all may be
    logged in this way.

    [tony.luck@intel.com: add call to kmsan_unpoison_memory(), per Miaohe Lin]
      Link: https://lkml.kernel.org/r/20221031201029.102123-2-tony.luck@intel.com
    Link: https://lkml.kernel.org/r/20221021200120.175753-1-tony.luck@intel.com
    Link: https://lkml.kernel.org/r/20221021200120.175753-2-tony.luck@intel.com
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:22 -04:00
Chris von Recklinghausen 9cec47342a mm: convert mm's rss stats into percpu_counter
Conflicts:
	include/linux/sched.h - We don't have
		7964cf8caa4d ("mm: remove vmacache")
		so don't remove the declaration for vmacache
	kernel/fork.c - We don't have
		d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
		so don't add calls to mt_init_flags or mt_set_external_lock

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25
Author: Shakeel Butt <shakeelb@google.com>
Date:   Mon Oct 24 05:28:41 2022 +0000

    mm: convert mm's rss stats into percpu_counter

    Currently mm_struct maintains rss_stats which are updated on the page
    fault and unmapping codepaths.  For the page fault codepath the updates
    are cached per thread with a batch size of TASK_RSS_EVENTS_THRESH, which
    is 64.  The reason for the caching is performance for multithreaded
    applications; otherwise the rss_stats updates may become a hotspot for
    such applications.

    However this optimization comes at the cost of an error margin in the rss
    stats.  The rss_stats for applications with a large number of threads can
    be very skewed.  At worst the error margin is (nr_threads * 64) and we
    have a lot of applications with 100s of threads, so the error margin can
    be very high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.

    Recently we started seeing unbounded errors in rss_stats for specific
    applications which use TCP receive zerocopy.  It seems like the
    vm_insert_pages() codepath does not sync rss_stats at all.

    This patch converts the rss_stats into percpu_counter, which changes the
    error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
    However, this conversion enables us to get accurate stats for situations
    where accuracy is more important than the cpu cost.

    This patch does not make such tradeoffs - we can just use
    percpu_counter_add_local() for the updates and percpu_counter_sum() (or
    percpu_counter_sync() + percpu_counter_read()) for the readers.  At the
    moment the readers are either the procfs interface, the oom_killer or
    memory reclaim, which I think are not performance critical and should be
    ok with the slower read.  However I think we can make that change in a
    separate patch.
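
    As a sketch (field layout and helper shapes assumed from the description
    above; the actual patch may differ in detail), the per-mm counters become
    an array of percpu_counter and the accessors become thin wrappers:

        /* in struct mm_struct (sketch) */
        struct percpu_counter rss_stat[NR_MM_COUNTERS];

        static inline void add_mm_counter(struct mm_struct *mm, int member,
                                          long value)
        {
                /* cheap, cpu-local update on the fault/unmap fast paths */
                percpu_counter_add_local(&mm->rss_stat[member], value);
        }

        static inline unsigned long get_mm_counter(struct mm_struct *mm,
                                                   int member)
        {
                /* readers (procfs, oom killer, reclaim) tolerate a slower sum */
                return percpu_counter_sum_positive(&mm->rss_stat[member]);
        }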

    Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:21 -04:00
Chris von Recklinghausen ac4694cf43 Revert "mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in"
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b12fdbf15f92b6cf5fecdd8a1855afe8809e5c58
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Oct 24 15:33:36 2022 -0400

    Revert "mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in"

    With " mm/uffd: Fix vma check on userfault for wp" to fix the
    registration, we'll be safe to remove the macro hacks now.

    Link: https://lkml.kernel.org/r/20221024193336.1233616-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:18 -04:00
Chris von Recklinghausen 2e4f279847 hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04ada095dcfc4ae359418053c0be94453bdf1e84
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:06 2022 -0800

    hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

    madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
    tables associated with the address range.  For hugetlb vmas,
    zap_page_range will call __unmap_hugepage_range_final.  However,
    __unmap_hugepage_range_final assumes the passed vma is about to be removed
    and deletes the vma_lock to prevent pmd sharing as the vma is on the way
    out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
    missing vma_lock prevents pmd sharing and could potentially lead to issues
    with truncation/fault races.

    This issue was originally reported here [1] as a BUG triggered in
    page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
    vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
    prevent pmd sharing.  Subsequent faults on this vma were confused:
    VM_MAYSHARE indicates a sharable vma, but since it was no longer set,
    page_mapping was not set in new pages added to the page table.  This
    resulted in pages that appeared anonymous in a VM_SHARED vma and
    triggered the BUG.

    Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
    call from unmap_vmas().  This is used to indicate the 'final' unmapping of
    a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
    the vma_lock is not deleted.
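
    A sketch of the flag and of how the hugetlb final-unmap path can key off
    it (details assumed; the point is that only unmap_vmas() sets the flag,
    so MADV_DONTNEED leaves the vma_lock alone):

        /* new zap flag (sketch): set only for the 'final' unmap from unmap_vmas() */
        #define ZAP_FLAG_UNMAP  ((__force zap_flags_t) BIT(1))

        /* in __unmap_hugepage_range_final() (sketch) */
        if (zap_flags & ZAP_FLAG_UNMAP) {
                /* the vma is going away: drop the vma_lock to stop pmd sharing */
                hugetlb_vma_lock_free(vma);
        }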

    [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
    Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:13 -04:00
Chris von Recklinghausen 68b5e6cc07 madvise: use zap_page_range_single for madvise dontneed
Conflicts: include/linux/mm.h - We don't have
	763ecb035029 ("mm: remove the vma linked list")
	so keep the old definition of unmap_vmas

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 21b85b09527c28e242db55c1b751f7f7549b830c
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:05 2022 -0800

    madvise: use zap_page_range_single for madvise dontneed

    This series addresses the issue first reported in [1], and fully described
    in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
    for stable backports.

    While exploring solutions to this issue, related problems with mmu
    notification calls were discovered.  This is addressed in the patch
    "hugetlb: remove duplicate mmu notifications".  Since there are no user
    visible effects, this third patch is not tagged for stable backports.

    Previous discussions suggested further cleanup by removing the
    routine zap_page_range.  This is possible because zap_page_range_single
    is now exported, and all callers of zap_page_range pass ranges entirely
    within a single vma.  This work will be done in a later patch so as not
    to distract from this bug fix.

    [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

    This patch (of 2):

    Expose the routine zap_page_range_single to zap a range within a single
    vma.  The madvise routine madvise_dontneed_single_vma can use this routine
    as it explicitly operates on a single vma.  Also, update the mmu
    notification range in zap_page_range_single to take hugetlb pmd sharing
    into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
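
    The caller side is small; a sketch of madvise_dontneed_single_vma()
    using the newly exposed routine:

        /* mm/madvise.c (sketch): the range is already clamped to this vma */
        static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
                                                unsigned long start,
                                                unsigned long end)
        {
                zap_page_range_single(vma, start, end - start, NULL);
                return 0;
        }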

    Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
    Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:12 -04:00
Chris von Recklinghausen 02174dae48 hugetlb: fix vma lock handling during split vma and range unmapping
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 131a79b474e973f023c5c75e2323a940332103be
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Oct 4 18:17:05 2022 -0700

    hugetlb: fix vma lock handling during split vma and range unmapping

    Patch series "hugetlb: fixes for new vma lock series".

    In review of the series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", Miaohe Lin pointed out two key issues:

    1) There is a race in the routine hugetlb_unmap_file_folio when locks
       are dropped and reacquired in the correct order [1].

    2) With the switch to using vma lock for fault/truncate synchronization,
       we need to make sure lock exists for all VM_MAYSHARE vmas, not just
       vmas capable of pmd sharing.

    These two issues are addressed here.  In addition, having a vma lock
    present in all VM_MAYSHARE vmas, uncovered some issues around vma
    splitting.  Those are also addressed.

    [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/

    This patch (of 3):

    The hugetlb vma lock hangs off the vm_private_data field and is specific
    to the vma.  When vm_area_dup() is called as part of vma splitting, the
    vma lock pointer is copied to the new vma.  This will result in issues
    such as double freeing of the structure.  Update the hugetlb open vm_ops
    to allocate a new vma lock for the new vma.
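
    A sketch of the open vm_op fix, assuming the hugetlb_vma_lock_alloc()
    helper and struct hugetlb_vma_lock from the vma lock series (resv_map
    handling omitted): if the copied vm_private_data still points at the old
    vma's lock, drop the stale pointer and allocate a fresh lock:

        static void hugetlb_vm_op_open(struct vm_area_struct *vma)
        {
                struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

                if (vma_lock && vma_lock->vma != vma) {
                        /* pointer was copied from the old vma by vm_area_dup();
                         * don't share it or the lock would be double-freed
                         */
                        vma->vm_private_data = NULL;
                        hugetlb_vma_lock_alloc(vma);
                }
        }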

    The routine __unmap_hugepage_range_final unconditionally unsets VM_MAYSHARE
    to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
    anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
    only VM_MAYSHARE was set we would miss the free.  With the introduction of
    the vma lock, a vma cannot participate in pmd sharing if vm_private_data
    is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
    free the vma lock to prevent sharing.  Also, update the sharing code to
    make sure the vma lock is indeed a condition for pmd sharing.
    hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.

    Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
    Fixes: "hugetlb: add vma based lock for pmd sharing"
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:55 -04:00