Commit Graph

101 Commits

Thomas Huth c2693493a1 mm/userfaultfd: Do not place zeropages when zeropages are disallowed
JIRA: https://issues.redhat.com/browse/RHEL-65229

commit 90a7592da14951bd21f74a53246ba30955a648aa
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Apr 11 18:14:40 2024 +0200

    mm/userfaultfd: Do not place zeropages when zeropages are disallowed

    s390x must disable shared zeropages for processes running VMs, because
    the VMs could end up making use of "storage keys" or protected
    virtualization, which are incompatible with shared zeropages.

    Yet, with userfaultfd it is possible to insert shared zeropages into
    such processes. Let's fallback to simply allocating a fresh zeroed
    anonymous folio and insert that instead.

    mm_forbids_zeropage() was introduced in commit 593befa6ab ("mm: introduce
    mm_forbids_zeropage function"), briefly before userfaultfd went
    upstream.

    Note that we don't want to fail the UFFDIO_ZEROPAGE request like we do
    for hugetlb, it would be rather unexpected. Further, we also
    cannot really indicate "not supported" to user space ahead of time: it
    could be that the MM disallows zeropages after userfaultfd was already
    registered.

    [ agordeev: Fixed checkpatch complaints ]

    Fixes: c1a4de99fa ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Link: https://lore.kernel.org/r/20240411161441.910170-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

Signed-off-by: Thomas Huth <thuth@redhat.com>
2024-10-30 11:41:14 +01:00
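
For illustration, a minimal userspace sketch of the UFFDIO_ZEROPAGE request affected
by the fix above: on an mm where shared zeropages are forbidden (e.g. s390x hosts
running VMs), the kernel now transparently installs a freshly zeroed anonymous page
instead of a shared zeropage. The helper name is illustrative; error handling omitted.

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Resolve a missing fault at 'addr' by asking for a zero page. */
    static int resolve_with_zeropage(int uffd, unsigned long addr, size_t page_size)
    {
            struct uffdio_zeropage zp = {
                    .range = { .start = addr, .len = page_size },
                    .mode  = 0,
            };

            return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
    }
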
Rafael Aquini 5da3b36fc4 userfaultfd: don't BUG_ON() if khugepaged yanks our page table
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 4828d207dc5161dc7ddf9a4f6dcfd80c7dd7d20a
Author: Jann Horn <jannh@google.com>
Date:   Tue Aug 13 22:25:22 2024 +0200

    userfaultfd: don't BUG_ON() if khugepaged yanks our page table

    Since khugepaged was changed to allow retracting page tables in file
    mappings without holding the mmap lock, these BUG_ON()s are wrong - get
    rid of them.

    We could also remove the preceding "if (unlikely(...))" block, but then we
    could reach pte_offset_map_lock() with transhuge pages not just for file
    mappings but also for anonymous mappings - which would probably be fine
    but I think is not necessarily expected.

    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-2-5efa61078a41@google.com
    Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
    Signed-off-by: Jann Horn <jannh@google.com>
    Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Pavel Emelyanov <xemul@virtuozzo.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:48 -04:00
Rafael Aquini 31105cfe09 userfaultfd: fix checks for huge PMDs
JIRA: https://issues.redhat.com/browse/RHEL-27743
JIRA: https://issues.redhat.com/browse/RHEL-59459
CVE: CVE-2024-46787

This patch is a backport of the following upstream commit:
commit 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8
Author: Jann Horn <jannh@google.com>
Date:   Tue Aug 13 22:25:21 2024 +0200

    userfaultfd: fix checks for huge PMDs

    Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.

    The pmd_trans_huge() code in mfill_atomic() is wrong in three different
    ways depending on kernel version:

    1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
       the right two race windows) - I've tested this in a kernel build with
       some extra mdelay() calls. See the commit message for a description
       of the race scenario.
       On older kernels (before 6.5), I think the same bug can even
       theoretically lead to accessing transhuge page contents as a page table
       if you hit the right 5 narrow race windows (I haven't tested this case).
    2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
       detecting PMDs that don't point to page tables.
       On older kernels (before 6.5), you'd just have to win a single fairly
       wide race to hit this.
       I've tested this on 6.1 stable by racing migration (with a mdelay()
       patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
       VM, that causes a kernel oops in ptlock_ptr().
    3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
       to yank page tables out from under us (though I haven't tested that),
       so I think the BUG_ON() checks in mfill_atomic() are just wrong.

    I decided to write two separate fixes for these (one fix for bugs 1+2, one
    fix for bug 3), so that the first fix can be backported to kernels
    affected by bugs 1+2.

    This patch (of 2):

    This fixes two issues.

    I discovered that the following race can occur:

      mfill_atomic                other thread
      ============                ============
                                  <zap PMD>
      pmdp_get_lockless() [reads none pmd]
      <bail if trans_huge>
      <if none:>
                                  <pagefault creates transhuge zeropage>
        __pte_alloc [no-op]
                                  <zap PMD>
      <bail if pmd_trans_huge(*dst_pmd)>
      BUG_ON(pmd_none(*dst_pmd))

    I have experimentally verified this in a kernel with extra mdelay() calls;
    the BUG_ON(pmd_none(*dst_pmd)) triggers.

    On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
    pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
    a BUG_ON(), since the page table access helpers are actually designed to
    deal with page tables concurrently disappearing; but on older kernels
    (<=6.4), I think we could probably theoretically race past the two
    BUG_ON() checks and end up treating a hugepage as a page table.

    The second issue is that, as Qi Zheng pointed out, there are other types
    of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
    (in particular, migration PMDs).

    On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
    PMD that contains a migration entry (which just requires winning a single,
    fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
    assumes that the PMD points to a page table.

    Breakage follows: First, the kernel tries to take the PTE lock (which will
    crash or maybe worse if there is no "struct page" for the address bits in
    the migration entry PMD - I think at least on X86 there usually is no
    corresponding "struct page" thanks to the PTE inversion mitigation, amd64
    looks different).

    If that didn't crash, the kernel would next try to write a PTE into what
    it wrongly thinks is a page table.

    As part of fixing these issues, get rid of the check for pmd_trans_huge()
    before __pte_alloc() - that's redundant, we're going to have to check for
    that after the __pte_alloc() anyway.

    Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.

    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@google.com
    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com
    Fixes: c1a4de99fa ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
    Signed-off-by: Jann Horn <jannh@google.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Pavel Emelyanov <xemul@virtuozzo.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:47 -04:00
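
The post-fix check order in mfill_atomic(), paraphrased from the description above
(a condensed sketch, not the literal upstream diff): read the PMD locklessly,
allocate a page table only if it was none, then re-read and verify the PMD really
points to a page table before calling pte_offset_map_lock().

    dst_pmdval = pmdp_get_lockless(dst_pmd);
    if (unlikely(pmd_none(dst_pmdval)) &&
        unlikely(__pte_alloc(dst_mm, dst_pmd)))
            return -ENOMEM;

    /* Re-read: the PMD may have changed while it was not locked. */
    dst_pmdval = pmdp_get_lockless(dst_pmd);
    /*
     * Only a plain page table may reach pte_offset_map_lock(); huge,
     * devmap and swap/migration PMDs are rejected instead of BUG_ON().
     */
    if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
                 pmd_devmap(dst_pmdval)))
            return -EEXIST;
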
Rafael Aquini f5185129ab mm: userfaultfd: add new UFFDIO_POISON ioctl: fix
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 597425df4fecd272ca48f73feca7833433c16e12
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Jul 11 18:27:17 2023 -0700

    mm: userfaultfd: add new UFFDIO_POISON ioctl: fix

    Smatch has observed that pte_offset_map_lock() is now allowed to fail, and
    then ptl should not be unlocked.  Use -EAGAIN here like elsewhere.

    Link: https://lkml.kernel.org/r/bc7bba61-d34f-ad3a-ccf1-c191585ef851@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Dan Carpenter <dan.carpenter@linaro.org>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:29 -04:00
Rafael Aquini 2097286f41 mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 8a13897fb0daa8f56821f263f0c63661e1c6acae
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jul 7 14:55:37 2023 -0700

    mm: userfaultfd: support UFFDIO_POISON for hugetlbfs

    The behavior here is the same as it is for anon/shmem.  This is done
    separately because hugetlb pte marker handling is a bit different.

    Link: https://lkml.kernel.org/r/20230707215540.2324998-6-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:06 -04:00
Rafael Aquini 6ff3dec6e4 mm: userfaultfd: add new UFFDIO_POISON ioctl
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit fc71884a5f599a603fcc3c2b28b3872c09d19c18
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jul 7 14:55:36 2023 -0700

    mm: userfaultfd: add new UFFDIO_POISON ioctl

    The basic idea here is to "simulate" memory poisoning for VMs.  A VM
    running on some host might encounter a memory error, after which some
    page(s) are poisoned (i.e., future accesses SIGBUS).  They expect that
    once poisoned, pages can never become "un-poisoned".  So, when we live
    migrate the VM, we need to preserve the poisoned status of these pages.

    When live migrating, we try to get the guest running on its new host as
    quickly as possible.  So, we start it running before all memory has been
    copied, and before we're certain which pages should be poisoned or not.

    So the basic way to use this new feature is:

    - On the new host, the guest's memory is registered with userfaultfd, in
      either MISSING or MINOR mode (doesn't really matter for this purpose).
    - On any first access, we get a userfaultfd event. At this point we can
      communicate with the old host to find out if the page was poisoned.
    - If so, we can respond with a UFFDIO_POISON - this places a swap marker
      so any future accesses will SIGBUS. Because the pte is now "present",
      future accesses won't generate more userfaultfd events, they'll just
      SIGBUS directly.

    UFFDIO_POISON does not handle unmapping previously-present PTEs.  This
    isn't needed, because during live migration we want to intercept all
    accesses with userfaultfd (not just writes, so WP mode isn't useful for
    this).  So whether minor or missing mode is being used (or both), the PTE
    won't be present in any case, so handling that case isn't needed.

    Similarly, UFFDIO_POISON won't replace existing PTE markers.  This might
    be okay to do, but it seems to be safer to just refuse to overwrite any
    existing entry (like a UFFD_WP PTE marker).

    Link: https://lkml.kernel.org/r/20230707215540.2324998-5-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:05 -04:00
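
A minimal userspace sketch of the flow described above: the fault handler replies
with UFFDIO_POISON so later accesses SIGBUS directly instead of raising further
userfaultfd events. Assumes a uapi header that already carries this series; the
helper name is illustrative and error handling is omitted.

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Mark one faulting page as poisoned by installing a poison marker. */
    static int poison_faulting_page(int uffd, unsigned long fault_addr,
                                    size_t page_size)
    {
            struct uffdio_poison poison = {
                    .range = {
                            .start = fault_addr & ~(page_size - 1),
                            .len   = page_size,
                    },
                    .mode = 0,      /* or UFFDIO_POISON_MODE_DONTWAKE */
            };

            return ioctl(uffd, UFFDIO_POISON, &poison);
    }
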
Rafael Aquini 51c042396c mm: userfaultfd: extract file size check out into a helper
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * minor context difference on the 2nd hunk due to RHEL-only commit
    38e95bedaa ("mm: Fix CVE-2022-2590 by reverting "mm/shmem:
    unconditionally set pte dirty in mfill_atomic_install_pte"")

This patch is a backport of the following upstream commit:
commit 435cdb41a76fcfa5d6af7e0e39bb8ab5ef4b7a64
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jul 7 14:55:35 2023 -0700

    mm: userfaultfd: extract file size check out into a helper

    This code is already duplicated twice, and UFFDIO_POISON will do the same
    check a third time.  So, it's worth extracting into a helper to save
    repetitive lines of code.

    Link: https://lkml.kernel.org/r/20230707215540.2324998-4-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:04 -04:00
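
Roughly, the extracted helper checks whether the faulting offset lies at or past
EOF of the backing file (an approximation of the upstream function, not a verbatim
quote):

    static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
                                     unsigned long dst_addr)
    {
            pgoff_t offset, max_off;

            if (!dst_vma->vm_file)
                    return false;

            offset = linear_page_index(dst_vma, dst_addr);
            max_off = DIV_ROUND_UP(i_size_read(dst_vma->vm_file->f_inode),
                                   PAGE_SIZE);
            return offset >= max_off;
    }
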
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
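
The conversion in one line, for illustration: a plain dereference of a pte_t
pointer becomes a call to the helper, which defaults to READ_ONCE() semantics and
can be overridden per architecture.

    pte_t pte;

    pte = *ptep;            /* before: plain C load of a HW-modifiable entry */
    pte = ptep_get(ptep);   /* after: READ_ONCE() by default, arch-overridable */
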
Lucas Zampieri 921df80dee Merge: Update MM Selftests for 9.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4786

Update kselftests mm to include new tests and fixes. This update is larger than usual due to having to backport a lot of build-related changes in `tools/testing/selftests/kselftest*`.

Omitted-fix: f1227dc7d0411ee9a9faaa1e80cfd9d6e5d6d63e

Omitted-fix: a52540522c9541bfa3e499d2edba7bc0ca73a4ca

Omitted-fix: 2bfed7d2ffa5d86c462d3e2067f2832eaf8c04c7

JIRA: https://issues.redhat.com/browse/RHEL-39306

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-21 12:51:18 +00:00
Nico Pache ee31333513 mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs
commit 0289184476c845968ad6ac9083c96cc0f75ca505
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Tue Mar 14 15:12:50 2023 -0700

    mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs

    UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE
    to resolve a missing fault, one can install a write-protected one.  This
    is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination.

    This was motivated by testing HugeTLB HGM [1], and in particular its
    interaction with userfaultfd features.  Existing userfaultfd code supports
    using WP and MINOR modes together (i.e.  you can register an area with
    both enabled), but without this CONTINUE flag the combination is in
    practice unusable.

    So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as
    UFFDIO_COPY_MODE_WP, but for *minor* faults.

    Update the selftest to do some very basic exercising of the new flag.

    Update Documentation/ to describe how these flags are used (neither the
    COPY nor the new CONTINUE versions of this mode flag were described there
    before).

    [1]: https://patchwork.kernel.org/project/linux-mm/cover/20230218002819.1486479-1-jthoughton@google.com/

    Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-39306
Signed-off-by: Nico Pache <npache@redhat.com>
2024-08-02 10:13:23 -06:00
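
A minimal userspace sketch of resolving a minor fault write-protected with the new
flag (assumes the area was registered with UFFDIO_REGISTER_MODE_MINOR together with
UFFDIO_REGISTER_MODE_WP and that the uapi header carries this series; the helper
name is illustrative, error handling omitted):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Map the already-present page cache page at 'addr', but write-protected,
     * so the first write still raises a userfaultfd WP event. */
    static int continue_wp(int uffd, unsigned long addr, size_t page_size)
    {
            struct uffdio_continue cont = {
                    .range = { .start = addr, .len = page_size },
                    .mode  = UFFDIO_CONTINUE_MODE_WP,
            };

            return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }
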
Rafael Aquini 85effa4c2f userfaultfd: convert mfill_atomic() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit d7be6d7eee1bbf98671d7a2c95654322241e2ae4
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:32 2023 +0800

    userfaultfd: convert mfill_atomic() to use a folio

    Convert mfill_atomic_pte_copy(), shmem_mfill_atomic_pte() and
    mfill_atomic_pte() to take in a folio pointer.

    Convert mfill_atomic() to use a folio.  Convert page_kaddr to kaddr in
    mfill_atomic().

    Link: https://lkml.kernel.org/r/20230410133932.32288-7-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:09 -04:00
Nico Pache a9c57c86df userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb
commit 67695f18d55924b2013534ef3bdc363bc9e14605
Author: Lokesh Gidra <lokeshgidra@google.com>
Date:   Wed Jan 17 14:37:29 2024 -0800

    userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb

    In mfill_atomic_hugetlb(), mmap_changing isn't being checked
    again if we drop mmap_lock and reacquire it. When the lock is not held,
    mmap_changing could have been incremented. This is also inconsistent
    with the behavior in mfill_atomic().

    Link: https://lkml.kernel.org/r/20240117223729.1444522-1-lokeshgidra@google.com
    Fixes: df2cc96e77 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
    Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Kalesh Singh <kaleshsingh@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicolas Geoffray <ngeoffray@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:29 -06:00
Chris von Recklinghausen b165253363 userfaultfd: use helper function range_in_vma()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 686ea6e61da61e46ae7068f73167ca26e0c67efb
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 17 08:39:19 2023 +0800

    userfaultfd: use helper function range_in_vma()

    We can use range_in_vma() to check if dst_start, dst_start + len are
    within the dst_vma range.  Minor readability improvement.

    Link: https://lkml.kernel.org/r/20230417003919.930515-1-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:03 -04:00
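
For illustration, roughly what the call-site change looks like (approximate, not
the verbatim diff):

    /* before */
    if (dst_start < dst_vma->vm_start ||
        dst_start + len > dst_vma->vm_end)
            goto out_unlock;

    /* after */
    if (!range_in_vma(dst_vma, dst_start, dst_start + len))
            goto out_unlock;
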
Chris von Recklinghausen 7fe6fb66af userfaultfd: convert mfill_atomic_hugetlb() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 0169fd518a8934d8d723659752b07589ecc9f692
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:30 2023 +0800

    userfaultfd: convert mfill_atomic_hugetlb() to use a folio

    Convert hugetlb_mfill_atomic_pte() to take in a folio pointer instead of
    a page pointer.

    Convert mfill_atomic_hugetlb() to use a folio.

    Link: https://lkml.kernel.org/r/20230410133932.32288-5-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:56 -04:00
Chris von Recklinghausen 4b83c78b5d userfaultfd: convert copy_huge_page_from_user() to copy_folio_from_user()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit e87340ca5c9cecc8a11daf1a2dcabf23f06a4e10
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:29 2023 +0800

    userfaultfd: convert copy_huge_page_from_user() to copy_folio_from_user()

    Replace copy_huge_page_from_user() with copy_folio_from_user().
    copy_folio_from_user() does the same as copy_huge_page_from_user(), but
    takes in a folio instead of a page.

    Convert page_kaddr to kaddr in copy_folio_from_user() to do indenting
    cleanup.

    Link: https://lkml.kernel.org/r/20230410133932.32288-4-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:56 -04:00
Chris von Recklinghausen 443caf29b6 userfaultfd: convert mfill_atomic_pte_copy() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 07e6d4095c75bcf0bf511b36eecaceb3fbb91ad9
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Mon Apr 10 21:39:27 2023 +0800

    userfaultfd: convert mfill_atomic_pte_copy() to use a folio

    Patch series "userfaultfd: convert userfaultfd functions to use folios",
    v6.

    This patch series converts several userfaultfd functions to use folios.

    This patch (of 6):

    Call vma_alloc_folio() directly instead of alloc_page_vma() and convert
    page_kaddr to kaddr in mfill_atomic_pte_copy().  Removes several calls to
    compound_head().

    Link: https://lkml.kernel.org/r/20230410133932.32288-1-zhangpeng362@huawei.com
    Link: https://lkml.kernel.org/r/20230410133932.32288-2-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:55 -04:00
Chris von Recklinghausen ab982ab697 mm: userfaultfd: combine 'mode' and 'wp_copy' arguments
Conflicts: mm/userfaultfd.c - We already have
	161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")
	so pte_mkwrite() takes 2 arguments

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d9712937037e0ce887920f321429826e9dbfd960
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Tue Mar 14 15:12:49 2023 -0700

    mm: userfaultfd: combine 'mode' and 'wp_copy' arguments

    Many userfaultfd ioctl functions take both a 'mode' and a 'wp_copy'
    argument.  In future commits we plan to plumb the flags through to more
    places, so we'd be proliferating the very long argument list even further.

    Let's take the time to simplify the argument list.  Combine the two
    arguments into one - and generalize, so when we add more flags in the
    future, it doesn't imply more function arguments.

    Since the modes (copy, zeropage, continue) are mutually exclusive, store
    them as an integer value (0, 1, 2) in the low bits.  Place combine-able
    flag bits in the high bits.

    This is quite similar to an earlier patch proposed by Nadav Amit
    ("userfaultfd: introduce uffd_flags" [1]).  The main difference is that
    patch only handled flags, whereas this patch *also* combines the "mode"
    argument into the same type to shorten the argument list.

    [1]: https://lore.kernel.org/all/20220619233449.181323-2-namit@vmware.com/

    Link: https://lkml.kernel.org/r/20230314221250.682452-4-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:25 -04:00
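
A standalone illustration of the encoding idea (names are illustrative, not the
upstream macro names): the mutually exclusive mode lives in the low bits and
independent behavior flags in the bits above, so one argument carries both.

    #include <stdio.h>

    enum { MODE_COPY = 0, MODE_ZEROPAGE = 1, MODE_CONTINUE = 2, MODE_MASK = 0x3 };
    #define FLAG_WP (1u << 2)   /* combinable flag above the mode bits */

    int main(void)
    {
            unsigned int uffd_flags = MODE_COPY | FLAG_WP;

            printf("mode=%u wp=%d\n", uffd_flags & MODE_MASK,
                   !!(uffd_flags & FLAG_WP));
            return 0;
    }
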
Chris von Recklinghausen 75317fb06a mm: userfaultfd: don't pass around both mm and vma
Conflicts: mm/userfaultfd.c - We already have
	153132571f ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
	and
	73f37dbcfe17 ("mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages")
	so keep the setting of ret and possible jump to out.

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 61c5004022f56c443b86800e8985d8803f3a22aa
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Tue Mar 14 15:12:48 2023 -0700

    mm: userfaultfd: don't pass around both mm and vma

    Quite a few userfaultfd functions took both mm and vma pointers as
    arguments.  Since the mm is trivially accessible via vma->vm_mm, there's
    no reason to pass both; it just needlessly extends the already long
    argument list.

    Get rid of the mm pointer, where possible, to shorten the argument list.

    Link: https://lkml.kernel.org/r/20230314221250.682452-3-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:25 -04:00
Chris von Recklinghausen 81108c01c8 mm: userfaultfd: rename functions for clarity + consistency
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit a734991ccaec1985fff42fb26bb6d789d35defb4
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Tue Mar 14 15:12:47 2023 -0700

    mm: userfaultfd: rename functions for clarity + consistency

    Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP",
    v5.

    - Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the
      main goal of improving consistency and reducing the number of function args.

    - Commit 4 adds UFFDIO_CONTINUE_MODE_WP.

    This patch (of 4):

    The basic problem is, over time we've added new userfaultfd ioctls, and
    we've refactored the code so functions which used to handle only one case
    are now re-used to deal with several cases.  While this happened, we
    didn't bother to rename the functions.

    Similarly, as we added new functions, we cargo-culted pieces of the
    now-inconsistent naming scheme, so those functions too ended up with names
    that don't make a lot of sense.

    A key point here is, "copy" in most userfaultfd code refers specifically
    to UFFDIO_COPY, where we allocate a new page and copy its contents from
    userspace.  There are many functions with "copy" in the name that don't
    actually do this (at least in some cases).

    So, rename things into a consistent scheme.  The high level idea is that
    the call stack for userfaultfd ioctls becomes:

    userfaultfd_ioctl
      -> userfaultfd_(particular ioctl)
        -> mfill_atomic_(particular kind of fill operation)
          -> mfill_atomic    /* loops over pages in range */
            -> mfill_atomic_pte    /* deals with single pages */
              -> mfill_atomic_pte_(particular kind of fill operation)
                -> mfill_atomic_install_pte

    There are of course some special cases (shmem, hugetlb), but this is the
    general structure which all function names now adhere to.

    Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com
    Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:24 -04:00
Chris von Recklinghausen b276f97846 mm/userfaultfd: support WP on multiple VMAs
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit a1b92a3f14984c96ace381f204b5d72c0805296e
Author: Muhammad Usama Anjum <usama.anjum@collabora.com>
Date:   Fri Feb 17 15:55:58 2023 +0500

    mm/userfaultfd: support WP on multiple VMAs

    mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA.
    We are facing a use case where multiple VMAs are present in one range of
    interest.  For example, the following pseudocode reproduces the error
    which we are trying to fix:

    - Allocate memory of size 16 pages with PROT_NONE with mmap
    - Register userfaultfd
    - Change protection of the first half (1 to 8 pages) of memory to
      PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
    - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
      out.

    This is a simple use case where user may or may not know if the memory
    area has been divided into multiple VMAs.

    We need an implementation which doesn't disrupt the already present users.
    So keeping things simple, stop going over all the VMAs if any one of the
    VMAs hasn't been registered in WP mode.  While at it, remove the unneeded
    error check as well.

    [akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build]
    Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com
    Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reported-by: Paul Gofman <pgofman@codeweavers.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:04 -04:00
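
The reproducer described above as a compilable sketch (error handling omitted;
before this fix the final UFFDIO_WRITEPROTECT errored out because the range spans
two VMAs):

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            size_t psz = sysconf(_SC_PAGESIZE);
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = { .api = UFFD_API };
            char *mem = mmap(NULL, 16 * psz, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)mem, .len = 16 * psz },
                    .mode  = UFFDIO_REGISTER_MODE_WP,
            };
            struct uffdio_writeprotect wp = {
                    .range = { .start = (unsigned long)mem, .len = 16 * psz },
                    .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
            };

            ioctl(uffd, UFFDIO_API, &api);
            ioctl(uffd, UFFDIO_REGISTER, &reg);
            /* Changing protection of the first half splits the area into two VMAs. */
            mprotect(mem, 8 * psz, PROT_READ | PROT_WRITE);
            /* Write-protect the whole 16 pages across both VMAs. */
            return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }
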
Prarit Bhargava 25cf7e4e50 mm: Make pte_mkwrite() take a VMA
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: This is a rip and replace of pte_mkwrite() with one arg for
pte_mkwrite() with two args.  There are uses upstream that are not yet
in RHEL9.

commit 161e393c0f63592a3b95bdd8b55752653763fc6d
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date:   Mon Jun 12 17:10:29 2023 -0700

    mm: Make pte_mkwrite() take a VMA

    The x86 Shadow stack feature includes a new type of memory called shadow
    stack. This shadow stack memory has some unusual properties, which requires
    some core mm changes to function properly.

    One of these unusual properties is that shadow stack memory is writable,
    but only in limited ways. These limits are applied via a specific PTE
    bit combination. Nevertheless, the memory is writable, and core mm code
    will need to apply the writable permissions in the typical paths that
    call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
    that the x86 implementation of it can know whether to create regular
    writable or shadow stack mappings.

    But there are a couple of challenges to this. Modifying the signatures of
    each arch pte_mkwrite() implementation would be error prone because some
    are generated with macros and would need to be re-implemented. Also, some
    pte_mkwrite() callers operate on kernel memory without a VMA.

    So this can be done in a three step process. First pte_mkwrite() can be
    renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
    added that just calls pte_mkwrite_novma(). Next callers without a VMA can
    be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
    can be changed to take/pass a VMA.

    Previous work renamed pte_mkwrite() to pte_mkwrite_novma() and converted
    callers that don't have a VMA to use pte_mkwrite_novma(). So now
    change pte_mkwrite() to take a VMA and change the remaining callers to
    pass a VMA. Apply the same changes for pmd_mkwrite().

    No functional change.

    Suggested-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com

Omitted-fix: f441ff73f1ec powerpc: Fix pud_mkwrite() definition after pte_mkwrite() API changes
	pud_mkwrite() not in RHEL9 code for powerpc (removed previously)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:13 -04:00
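
The API change in one line, for illustration:

    pte = pte_mkwrite(pte);           /* before: no VMA context */
    pte = pte_mkwrite(pte, vma);      /* after: the VMA decides regular vs shadow-stack write */
    pte = pte_mkwrite_novma(pte);     /* kernel memory with no VMA */
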
Chris von Recklinghausen 45dfdb79ea mm/userfaultfd: allow pte_offset_map_lock() to fail
Conflicts: mm/userfaultfd.c - We don't have
	61c5004022f5 ("mm: userfaultfd: don't pass around both mm and vma")
	since it needs
	a1b92a3f1498 ("mm/userfaultfd: support WP on multiple VMAs")
	as a prerequisite, and a1b92a3f1498 uses the Maple Tree VMA
	Iterator, which is a specific non-goal of this patch set. Continue
	to call pte_offset_map_lock with dst_mm

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3622d3cde30898c1b6eafde281c122b994718c58
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:26:04 2023 -0700

    mm/userfaultfd: allow pte_offset_map_lock() to fail

    mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat failed
    pte_offset_map_lock() as -EAGAIN, which mfill_atomic() already returns to
    user for a similar race.

    Link: https://lkml.kernel.org/r/50cf3930-1bfa-4de9-a079-3da47b7ce17b@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:15 -04:00
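
The shape of the change, paraphrased (not the verbatim diff): a NULL return from
pte_offset_map_lock() is now surfaced as -EAGAIN, the same value mfill_atomic()
already returns for similar races, so user space simply retries.

    dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
    if (!dst_pte) {
            ret = -EAGAIN;  /* page table vanished under us; caller retries */
            goto out;
    }
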
Chris von Recklinghausen 653ae76632 mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
Conflicts: mm/userfaultfd.c - RHEL-only patch
	8e95bedaa1a ("mm: Fix CVE-2022-2590 by reverting "mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"")
	causes a merge conflict with this patch. Since upstream commit
	5535be309971 ("mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW")
	actually fixes the CVE we can safely remove the conflicted lines
	and replace them with the lines the upstream version of thes
	patch adds

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1eb1bacfba9019823b2fce42383f010cd561fa6
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Dec 14 15:15:33 2022 -0500

    mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()

    This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.

    The reasons I still think this patch is worthwhile, are:

      (1) It is a cleanup already; diffstat tells.

      (2) It just feels natural after I thought about this, if the pte is uffd
          protected, let's remove the write bit no matter what it was.

      (3) Since x86 is the only arch that supports uffd-wp, it also redefines
          pte|pmd_mkuffd_wp() in that it should always contain removals of
          write bits.  It means any future arch that want to implement uffd-wp
          should naturally follow this rule too.  It's good to make it a
          default, even if with vm_page_prot changes on VM_UFFD_WP.

      (4) It covers more than vm_page_prot.  So no chance of any potential
          future "accident" (like pte_mkdirty() on sparc64 or loongarch, even
          though it just got its pte_mkdirty fixed <1 month ago).  It'll be
          fairly clear when reading the code too that we don't worry anything
          before a pte_mkuffd_wp() on uncertainty of the write bit.

    We may call pte_wrprotect() one more time in some paths (e.g.  thp split),
    but that should be fully local bitop instruction so the overhead should be
    negligible.

    Although this patch should logically also fix all the known issues on
    uffd-wp too recently on page migration (not for numa hint recovery - that
    may need another explicit pte_wrprotect), but this is not the plan for that
    fix.  So no fixes, and stable doesn't need this.

    Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ives van Hoorne <ives@codesandbox.io>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
Chris von Recklinghausen 56f21bc526 mm: Rename pmd_read_atomic()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit dab6e717429e5ec795d558a0e9a5337a1ed33a3d
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Nov 26 17:20:28 2020 +0100

    mm: Rename pmd_read_atomic()

    There's no point in having the identical routines for PTE/PMD have
    different names.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221022114424.841277397%40infradead.org

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:37 -04:00
Chris von Recklinghausen 956fcd9df6 userfaultfd: replace lru_cache functions with folio_add functions
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 28965f0f8be62e1ed8296fe0240b5d5dc064b681
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Tue Nov 1 10:53:24 2022 -0700

    userfaultfd: replace lru_cache functions with folio_add functions

    Replaces lru_cache_add() and lru_cache_add_inactive_or_unevictable() with
    folio_add_lru() and folio_add_lru_vma().  This is in preparation for the
    removal of lru_cache_add().

    Link: https://lkml.kernel.org/r/20221101175326.13265-4-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:30 -04:00
Chris von Recklinghausen 85c20d8773 mm/userfaultfd: replace kmap/kmap_atomic() with kmap_local_page()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5521de7dddd211e3a9403d7bde0b614fd0936ac6
Author: Ira Weiny <ira.weiny@intel.com>
Date:   Sun Oct 23 21:34:52 2022 -0700

    mm/userfaultfd: replace kmap/kmap_atomic() with kmap_local_page()

    kmap() and kmap_atomic() are being deprecated in favor of
    kmap_local_page() which is appropriate for any thread local context.[1]

    A recent locking bug report with userfaultfd showed that the conversion of
    the kmap_atomic()'s in those code flows requires care with regard to the
    prevention of deadlock.[2]

    git archaeology implied that the recursion may not be an actual bug.[3]
    However, depending on the implementation of the mmap_lock and the
    condition of the call there may still be a deadlock.[4] So this is not
    purely a lockdep issue.  Considering a single threaded call stack there
    are 3 options.

            1) Different mm's are in play (no issue)
            2) Readlock implementation is recursive and same mm is in play
               (no issue)
            3) Readlock implementation is _not_ recursive (issue)

    The mmap_lock is recursive so with a single thread there is no issue.

    However, Matthew pointed out a deadlock scenario when you consider
    additional process' and threads thusly.

    "The readlock implementation is only recursive if nobody else has taken a
    write lock.  If you have a multithreaded process, one of the other threads
    can call mmap() and that will prevent recursion (due to fairness).  Even
    if it's a different process that you're trying to acquire the mmap read
    lock on, you can still get into a deadly embrace.  eg:

    process A thread 1 takes read lock on own mmap_lock
    process A thread 2 calls mmap, blocks taking write lock
    process B thread 1 takes page fault, read lock on own mmap lock
    process B thread 2 calls mmap, blocks taking write lock
    process A thread 1 blocks taking read lock on process B
    process B thread 1 blocks taking read lock on process A

    Now all four threads are blocked waiting for each other."

    Regardless using pagefault_disable() ensures that no matter what locking
    implementation is used a deadlock will not occur.

    Complete kmap conversion in userfaultfd by replacing the kmap() and
    kmap_atomic() calls with kmap_local_page().  When replacing the
    kmap_atomic() call ensure page faults continue to be disabled to support
    the correct fall back behavior and add a comment to inform future souls of
    the requirement.

    [1] https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/
    [2] https://lore.kernel.org/all/Y1Mh2S7fUGQ%2FiKFR@iweiny-desk3/
    [3] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/
    [4] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/

    [ira.weiny@intel.com: v2]
      Link: https://lkml.kernel.org/r/20221025220136.2366143-1-ira.weiny@intel.com
    Link: https://lkml.kernel.org/r/20221024043452.1491677-1-ira.weiny@intel.com
    Signed-off-by: Ira Weiny <ira.weiny@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:06 -04:00
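
The resulting copy path, roughly (a condensed sketch of the pattern, not the
verbatim upstream code): map the page with kmap_local_page() and keep page faults
disabled around the user copy, so the fault handler can never re-enter mmap_lock;
on a short copy the caller drops its locks and retries with faults enabled.

    kaddr = kmap_local_page(page);
    /*
     * mmap_lock is held here; disable page faults so copy_from_user()
     * cannot fault and try to take mmap_lock again (the deadly embrace
     * described above).
     */
    pagefault_disable();
    ret = copy_from_user(kaddr, (const void __user *)src_addr, PAGE_SIZE);
    pagefault_enable();
    kunmap_local(kaddr);
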
Chris von Recklinghausen 3c51ae41b8 userfaultfd: convert mcontinue_atomic_pte() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 12acf4fbc4f78b24822317888b9406d56dc9ad2a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:28 2022 +0100

    userfaultfd: convert mcontinue_atomic_pte() to use a folio

    shmem_getpage() is being replaced by shmem_get_folio() so use a folio
    throughout this function.  Saves several calls to compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-33-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:58 -04:00
Nico Pache 7d166461ff mm/shmem: use page_mapping() to detect page cache for uffd continue
commit 93b0d9178743a68723babe8448981f658aebc58e
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Nov 2 14:41:52 2022 -0400

    mm/shmem: use page_mapping() to detect page cache for uffd continue

    mfill_atomic_install_pte() checks page->mapping to detect whether a page
    is used in the page cache.  However, as pointed out by Matthew, the page
    can logically be a tail page rather than always the head in the case of
    uffd minor mode with UFFDIO_CONTINUE.  It means we could wrongly install
    a pte for a shmem THP tail page, assuming it's an anonymous page.

    It's not that clear even for anonymous pages, since normally anonymous
    pages also have page->mapping set up to point at the anon vma.  It's safe
    here only because the only such caller of mfill_atomic_install_pte() is
    always passing in a newly allocated page (mcopy_atomic_pte()), whose
    page->mapping is not yet set up.  However, that's not extremely obvious
    either.

    For either of the above, use page_mapping() instead.

    Link: https://lkml.kernel.org/r/Y2K+y7wnhC4vbnP2@x1n
    Fixes: 153132571f ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: Matthew Wilcox <willy@infradead.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:02 -06:00
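
Roughly the shape of the change in mfill_atomic_install_pte(), sketched as a
diff (not the verbatim upstream hunk):

    -       bool page_in_cache = page->mapping;      /* bogus for THP tail pages */
    +       bool page_in_cache = page_mapping(page); /* resolves the compound head */
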
Nico Pache 897a73840c hugetlb: use new vma_lock for pmd sharing synchronization
commit 40549ba8f8e0ed1f8b235979563f619e9aa34fdf
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:09 2022 -0700

    hugetlb: use new vma_lock for pmd sharing synchronization

    The new hugetlb vma lock is used to address this race:

    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...

    The vma_lock is used as follows:
    - During fault processing. The lock is acquired in read mode before
      doing a page table lock and allocation (huge_pte_alloc).  The lock is
      held until code is finished with the page table entry (ptep).
    - The lock must be held in write mode whenever huge_pmd_unshare is
      called.

    Lock ordering issues come into play when unmapping a page from all
    vmas mapping the page.  The i_mmap_rwsem must be held to search for the
    vmas, and the vma lock must be held before calling unmap which will
    call huge_pmd_unshare.  This is done today in:
    - try_to_migrate_one and try_to_unmap_one for page migration and memory
      error handling.  In these routines we 'try' to obtain the vma lock and
      fail to unmap if unsuccessful.  Calling routines already deal with the
      failure of unmapping.
    - hugetlb_vmdelete_list for truncation and hole punch.  This routine
      also tries to acquire the vma lock.  If it fails, it skips the
      unmapping.  However, we can not have file truncation or hole punch
      fail because of contention.  After hugetlb_vmdelete_list, truncation
      and hole punch call remove_inode_hugepages.  remove_inode_hugepages
      checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
      hugetlb_unmap_file_page is designed to drop locks and reacquire in the
      correct order to guarantee unmap success.

    Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
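
A rough sketch of the resulting lock usage, assuming the
hugetlb_vma_lock_read/write helpers introduced by this series and with mm,
vma, addr and h taken from the surrounding fault/unshare code (fragment, not
the exact upstream code):

    /* fault path: hold the vma lock in read mode across pte lookup and use */
    hugetlb_vma_lock_read(vma);
    ptep = huge_pte_alloc(mm, vma, addr, huge_page_size(h));
    /* ... fault handling that dereferences ptep ... */
    hugetlb_vma_unlock_read(vma);

    /* unshare path: write mode is required around huge_pmd_unshare() */
    hugetlb_vma_lock_write(vma);
    huge_pmd_unshare(mm, vma, addr, ptep);
    hugetlb_vma_unlock_write(vma);
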
Nico Pache 56a9d706ab hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
commit 3a47c54f09c4c89128d8f67d49296b1c25b317d0
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:03 2022 -0700

    hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization

    Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") added code to take i_mmap_rwsem in read mode for the
    duration of fault processing.  However, this has been shown to cause
    performance/scaling issues.  Revert the code and go back to only taking
    the semaphore in huge_pmd_share during the fault path.

    Keep the code that takes i_mmap_rwsem in write mode before calling
    try_to_unmap as this is required if huge_pmd_unshare is called.

    NOTE: Reverting this code does expose the following race condition.

    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...
    ptl = huge_pte_lock(ptep)
    get/update pte
    set_pte_at(pte, ptep)

    It is unknown if the above race was ever experienced by a user.  It was
    discovered via code inspection when initially addressed.

    In subsequent patches, a new synchronization mechanism will be added to
    coordinate pmd sharing and eliminate this race.

    Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Chris von Recklinghausen d392132e1e mm/uffd: detect pgtable allocation failures
Bugzilla: https://bugzilla.redhat.com/2160210

commit d1751118c88673fe5a948ad82277898e9e284c55
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jan 4 17:52:07 2023 -0500

    mm/uffd: detect pgtable allocation failures

    Before this patch, when a pgtable allocation failure happened during
    change_protection(), the error was silently ignored by the syscall.  For
    shmem, an error trace would be dumped into the host dmesg.  Two issues
    with that:

      (1) Doing a trace dump when an allocation fails is not anything close to
          graceful.

      (2) The user should be notified of any such error, so the user can trap
          it and decide what to do next, either by retrying, stopping the
          process properly, or anything else.

    For userfault users, this changes the API of UFFDIO_WRITEPROTECT when a
    pgtable allocation failure happens.  It should not normally break anyone,
    though; if it breaks anything, it does so in a good way.

    One man-page update will be on the way to introduce the new -ENOMEM for
    UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on
    the 5.19-till-now kernels.

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/20230104225207.1066932-4-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: James Houghton <jthoughton@google.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
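
For user space, the visible effect is a possible new errno from the
write-protect ioctl.  A hedged example of handling it (uffd, addr and len are
assumed to be set up elsewhere; handle_enomem() is a hypothetical helper):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };

    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1) {
        if (errno == ENOMEM)
            /* pgtable allocation failed: back off and retry, or
             * tear the process down cleanly */
            handle_enomem();
    }
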
Chris von Recklinghausen d0b4183cf0 mm/mprotect: drop pgprot_t parameter from change_protection()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 1ef488edd6c4d447784710974f049628c2890481
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Dec 23 16:56:16 2022 +0100

    mm/mprotect: drop pgprot_t parameter from change_protection()

    Being able to provide a custom protection opens the door for
    inconsistencies and BUGs: for example, accidentally allowing for more
    permissions than desired by other mechanisms (e.g., softdirty tracking).
    vma->vm_page_prot should be the single source of truth.

    Only PROT_NUMA is special: there is no way we can erroneously allow
    for more permissions when removing all permissions. Special-case using
    the MM_CP_PROT_NUMA flag.

    [david@redhat.com: PAGE_NONE might not be defined without CONFIG_NUMA_BALANCING]
      Link: https://lkml.kernel.org/r/5084ff1c-ebb3-f918-6a60-bacabf550a88@redhat.com
    Link: https://lkml.kernel.org/r/20221223155616.297723-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen bd3f789b69 mm/userfaultfd: rely on vma->vm_page_prot in uffd_wp_range()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 931298e103c228c4ce6d13e7b5781aeaaff37ac7
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Dec 23 16:56:15 2022 +0100

    mm/userfaultfd: rely on vma->vm_page_prot in uffd_wp_range()

    Patch series "mm: uffd-wp + change_protection() cleanups".

    Clean up page protection handling in uffd-wp when calling
    change_protection() and improve unprotecting uffd-wp in private mappings,
    trying to set PTEs writable again if possible just like we do during
    mprotect() when upgrading write permissions.  Make the change_protection()
    interface harder to get wrong :)

    I consider both patches primarily cleanups, although patch #1 fixes a corner
    case with uffd-wp and softdirty tracking for shmem.  @Peter, please let me
    know if we should flag patch #1 as pure cleanup -- I have no idea how
    important softdirty tracking on shmem is.

    This patch (of 2):

    uffd_wp_range() currently calculates page protection manually using
    vm_get_page_prot().  This will ignore any other reason for active
    writenotify: one mechanism applicable to shmem is softdirty tracking.

    For example, the following sequence

    1) Write to mapped shmem page
    2) Clear softdirty
    3) Register uffd-wp covering the mapped page
    4) Unregister uffd-wp covering the mapped page
    5) Write to page again

    will not set the modified page softdirty, because uffd_wp_range() will
    ignore that writenotify is required for softdirty tracking and simply map
    the page writable again using change_protection().  Similarly, instead of
    unregistering, protecting followed by un-protecting the page using uffd-wp
    would result in the same situation.

    Now that we enable writenotify whenever enabling uffd-wp on a VMA,
    vma->vm_page_prot will already properly reflect our requirements: the
    default is to write-protect all PTEs.  However, for shared mappings we
    would now not remap the PTEs writable if possible when unprotecting, just
    like for private mappings (COW).  To compensate, set
    MM_CP_TRY_CHANGE_WRITABLE just like mprotect() does to try mapping
    individual PTEs writable.

    For private mappings, this change implies that we will now always try
    setting PTEs writable when un-protecting, just like when upgrading write
    permissions using mprotect(), which is an improvement.

    For shared mappings, we will only set PTEs writable if
    can_change_pte_writable()/can_change_pmd_writable() indicates that it's
    ok.  For ordinary shmem, this will be the case when PTEs are dirty, which
    should usually be the case -- otherwise we could special-case shmem in
    can_change_pte_writable()/can_change_pmd_writable() easily, because shmem
    itself doesn't require writenotify.

    Note that hugetlb does not yet implement MM_CP_TRY_CHANGE_WRITABLE, so we
    won't try setting PTEs writable when unprotecting or when unregistering
    uffd-wp.  This can be added later on top by implementing
    MM_CP_TRY_CHANGE_WRITABLE.

    While commit ffd0579396 ("userfaultfd: wp: support write protection for
    userfault vma range") introduced that code, it should only be applicable
    to uffd-wp on shared mappings -- shmem (hugetlb does not support softdirty
    tracking).  I don't think this corner cases justifies to cc stable.  Let's
    just handle it correctly and prepare for change_protection() cleanups.

    [david@redhat.com: no need for additional harmless checks if we're wr-protecting either way]
      Link: https://lkml.kernel.org/r/71412742-a71f-9c74-865f-773ad83db7a5@redhat.com
    Link: https://lkml.kernel.org/r/20221223155616.297723-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20221223155616.297723-2-david@redhat.com
    Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen dbd4a843ae mm/uffd: reset write protection when unregister with wp-mode
Bugzilla: https://bugzilla.redhat.com/2160210

commit f369b07c861435bd812a9d14493f71b34132ed6f
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Aug 11 16:13:40 2022 -0400

    mm/uffd: reset write protection when unregister with wp-mode

    The motivation of this patch comes from a recent report and patchfix from
    David Hildenbrand on hugetlb shared handling of wr-protected page [1].

    With the reproducer provided in commit message of [1], one can leverage
    the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
    not only the attacker process, but also the whole system.

    The lazy-reset mechanism of uffd-wp was used to make unregister faster;
    meanwhile it relies on the assumption that any leftover pgtable entries
    should only affect the process itself, so not only should the user be
    aware of anything it does, but it should also not affect anything outside
    of the process.

    But it seems that this is not true, and it can also be utilized to make
    some exploits easier.

    So far there's no clue showing that the lazy-reset is important to any
    userfaultfd users, because normally the unregister will only happen once
    for a specific range of memory in the lifecycle of the process.

    Considering all of the above, what this patch proposes is to do explicit
    pte resets when unregistering an uffd region with wr-protect mode enabled.

    It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
    right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
    make the unregister slower.  From that point of view it's a very slight
    ABI change, but hopefully nothing should break with this change either.

    Regarding to the change itself - core of uffd write [un]protect operation
    is moved into a separate function (uffd_wp_range()) and it is reused in
    the unregister code path.

    Note that the new function will not check for anything, e.g. ranges or
    memory types, because they should have been checked during the previous
    UFFDIO_REGISTER, or it would have failed already.  It also doesn't check
    mmap_changing because we hold the mmap write lock anyway.

    I added a Fixes tag on the commit introducing uffd-wp for shmem+hugetlbfs
    because that's the only issue reported so far, and that's the commit with
    which David's reproducer starts working (v5.19+).  But the whole idea
    actually applies not only to file-backed memory but also to anonymous
    memory.  It's just that we don't need to fix anonymous memory prior to
    v5.19 because there's no known way to exploit it there.

    IOW, this patch can also fix the issue reported in [1] as patch 2 does.

    [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/

    Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
    Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
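
In user-space terms, a rough sketch of what the kernel now does implicitly
when a wp-mode range is unregistered (uffd, start and len assumed):

    /* equivalent to the user first dropping write protection ... */
    struct uffdio_writeprotect wp = {
        .range = { .start = start, .len = len },
        .mode  = 0,                 /* wp=false */
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    /* ... and only then unregistering the range */
    struct uffdio_range unreg = { .start = start, .len = len };
    ioctl(uffd, UFFDIO_UNREGISTER, &unreg);
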
Chris von Recklinghausen 73816739f6 mm/uffd: enable write protection for shmem & hugetlbfs
Bugzilla: https://bugzilla.redhat.com/2160210

commit b1f9e876862d8f7176299ec4fb2108bc1045cbc8
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:56 2022 -0700

    mm/uffd: enable write protection for shmem & hugetlbfs

    We've had all the necessary changes ready for both shmem and hugetlbfs.
    Turn on all the shmem/hugetlbfs switches for userfaultfd-wp.

    We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too
    because all existing types now support write protection mode.

    Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.

    Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen cbd1fe7b79 mm/hugetlb: handle UFFDIO_WRITEPROTECT
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5a90d5a103c2badfcf12d48e2fec350969e3f486
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:54 2022 -0700

    mm/hugetlb: handle UFFDIO_WRITEPROTECT

    This starts from passing cp_flags into hugetlb_change_protection() so
    hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

    huge_pte_clear_uffd_wp() is introduced to handle the case where the
    UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.

    Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 5989780018 mm/hugetlb: take care of UFFDIO_COPY_MODE_WP
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6041c69179034278ac6d57f90a55b09e588f4b90
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:54 2022 -0700

    mm/hugetlb: take care of UFFDIO_COPY_MODE_WP

    Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
    stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is used with UFFDIO_COPY.

    Hugetlb pages are only managed by hugetlbfs, so we're safe even without
    setting the dirty bit in the huge pte if the page is installed as read-only.
    However, we'd better still keep the dirty bit set for a read-only
    UFFDIO_COPY pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match
    what we do with shmem, but also because the page does contain dirty data
    that the kernel just copied from userspace.

    Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
Chris von Recklinghausen 48d17d6ca0 mm/shmem: take care of UFFDIO_COPY_MODE_WP
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8ee79edff6d3b43b2d0c1e41f92b32e128988b22
Author: Peter Xu <peterx@redhat.com>
Date:   Thu May 12 20:22:52 2022 -0700

    mm/shmem: take care of UFFDIO_COPY_MODE_WP

    Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply
    the UFFD_WP bit properly when the UFFDIO_COPY on shmem is done with
    UFFDIO_COPY_MODE_WP.  wp_copy finally lands in mfill_atomic_install_pte().

    Note: we must do pte_wrprotect() if !writable in
    mfill_atomic_install_pte(), as mk_pte() could return a writable pte (e.g.,
    when VM_SHARED on a shmem file).

    Link: https://lkml.kernel.org/r/20220405014841.14185-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:04 -04:00
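
A condensed sketch of the pte setup the note above refers to in
mfill_atomic_install_pte() (illustrative; variable and helper names assumed
from that era of the code, not the verbatim hunk):

    pte_t _dst_pte = mk_pte(page, dst_vma->vm_page_prot);

    if (!writable)
        /* mk_pte() can hand back a writable pte (e.g. VM_SHARED shmem),
         * so wr-protect explicitly */
        _dst_pte = pte_wrprotect(_dst_pte);

    if (wp_copy)
        /* UFFDIO_COPY_MODE_WP: arm the uffd-wp bit as well */
        _dst_pte = pte_mkuffd_wp(_dst_pte);
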
Nico Pache 08db962fd5 mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
commit 73f37dbcfe1763ee2294c7717a1f571e27d17fd8
Author: Axel Rasmussen <axelrasmussen@google.com>
Date:   Fri Jun 10 10:38:12 2022 -0700

    mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages

    When fallocate() is used on a shmem file, the pages we allocate can end up
    with !PageUptodate.

    Since UFFDIO_CONTINUE tries to find the existing page the user wants to
    map with SGP_READ, we would fail to find such a page, since
    shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
    !PageUptodate.  As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
    do if the page wasn't found in the page cache at all.

    This isn't the intended behavior.  UFFDIO_CONTINUE is just trying to find
    if a page exists, and doesn't care whether it still needs to be cleared or
    not.  So, instead of SGP_READ, pass in SGP_NOALLOC.  This is the same,
    except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
    will clear the page and then return it.  With this change, UFFDIO_CONTINUE
    works properly (succeeds) on a shmem file which has been fallocated, but
    otherwise not modified.

    Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com
    Fixes: 153132571f ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
    Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
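
The gist of the one-flag change, sketched against shmem_getpage() as it
existed at the time (not the verbatim hunk):

    /* before: SGP_READ returns a NULL page for fallocated (!PageUptodate)
     * pages, so UFFDIO_CONTINUE wrongly failed with -EFAULT */
    ret = shmem_getpage(inode, pgoff, &page, SGP_READ);

    /* after: SGP_NOALLOC clears a !PageUptodate page and returns it, while
     * still never allocating a new one */
    ret = shmem_getpage(inode, pgoff, &page, SGP_NOALLOC);
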
Nico Pache 93abe418b0 userfaultfd: mark uffd_wp regardless of VM_WRITE flag
Conflicts:
       mm/userfaultfd.c: was not expecting downstream RHEL-only commit
        38e95bedaa ("mm: Fix CVE-2022-2590 by reverting "mm/shmem:
        unconditionally set pte dirty in mfill_atomic_install_pte"")

commit 0e88904cb700a9654c9f0d9ca4967e761e7c9ee8
Author: Nadav Amit <namit@vmware.com>
Date:   Thu Apr 21 16:35:43 2022 -0700

    userfaultfd: mark uffd_wp regardless of VM_WRITE flag

    When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
    currently only marked as write-protected if the VMA has VM_WRITE flag
    set.  This seems incorrect or at least would be unexpected by the users.

    Consider the following sequence of operations that are being performed
    on a certain page:

            mprotect(PROT_READ)
            UFFDIO_COPY(UFFDIO_COPY_MODE_WP)
            mprotect(PROT_READ|PROT_WRITE)

    At this point the user would expect to still get UFFD notification when
    the page is accessed for write, but the user would not get one, since
    the PTE was not marked as UFFD_WP during UFFDIO_COPY.

    Fix it by always marking PTEs as UFFD_WP regardless on the
    write-permission in the VMA flags.

    Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com
    Fixes: 292924b260 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:35 -07:00
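
The sequence from the commit message, spelled out as a hedged user-space
fragment (uffd, addr, src_buf and len assumed to be set up, with the range
registered in wp mode; <sys/mman.h>, <sys/ioctl.h> and <linux/userfaultfd.h>
assumed included):

    mprotect(addr, len, PROT_READ);

    struct uffdio_copy copy = {
        .dst  = (unsigned long)addr,
        .src  = (unsigned long)src_buf,
        .len  = len,
        .mode = UFFDIO_COPY_MODE_WP,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);

    mprotect(addr, len, PROT_READ | PROT_WRITE);

    /* With the fix, a write to addr here still raises a uffd write-protect
     * notification; before the fix the pte was silently left unprotected
     * because the VMA lacked VM_WRITE at UFFDIO_COPY time. */
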
Chris von Recklinghausen f0a431e143 mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40f2bbf71161fa9195c7869004290003af152375
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()

    New anonymous pages are always mapped natively: only THP/khugepaged code
    maps a new compound anonymous page and passes "true".  Otherwise, we're
    just dealing with simple, non-compound pages.

    Let's give the interface clearer semantics and document these.  Remove the
    PageTransCompound() sanity check from page_add_new_anon_rmap().

    Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 579cdcc5b5 mm/mprotect: use mmu_gather
Bugzilla: https://bugzilla.redhat.com/2120352

commit 4a18419f71cdf9155d2d2a6c79546f720978b990
Author: Nadav Amit <namit@vmware.com>
Date:   Mon May 9 18:20:50 2022 -0700

    mm/mprotect: use mmu_gather

    Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.

    This patchset is intended to remove unnecessary TLB flushes during
    mprotect() syscalls.  Once this patch-set make it through, similar and
    further optimizations for MADV_COLD and userfaultfd would be possible.

    Basically, there are 3 optimizations in this patch-set:

    1. Use TLB batching infrastructure to batch flushes across VMAs and do
       better/fewer flushes.  This would also be handy for later userfaultfd
       enhancements.

    2. Avoid unnecessary TLB flushes.  This optimization is the one that
       provides most of the performance benefits.  Unlike previous versions,
       we now only avoid flushes that would not result in spurious
       page-faults.

    3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
       prevent the A/D bits from changing.

    Andrew asked for some benchmark numbers.  I do not have an easy
    determinate macrobenchmark in which it is easy to show benefit.  I
    therefore ran a microbenchmark: a loop that does the following on
    anonymous memory, just as a sanity check to see that time is saved by
    avoiding TLB flushes.  The loop goes:

            mprotect(p, PAGE_SIZE, PROT_READ)
            mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
            *p = 0; // make the page writable

    The test was run in KVM guest with 1 or 2 threads (the second thread was
    busy-looping).  I measured the time (cycles) of each operation:

                    1 thread                2 threads
                    mmots   +patch          mmots   +patch
    PROT_READ       3494    2725 (-22%)     8630    7788 (-10%)
    PROT_READ|WRITE 3952    2724 (-31%)     9075    2865 (-68%)

    [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

    The exact numbers are really meaningless, but the benefit is clear.  There
    are 2 interesting results though.

    (1) PROT_READ is cheaper, even though one would expect it not to be
    affected.  This is presumably due to a TLB miss that is saved.

    (2) Without memory access (*p = 0), the speedup of the patch is even
    greater.  In that scenario mprotect(PROT_READ) also avoids the TLB flush.
    As a result both operations on the patched kernel take roughly ~1500
    cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
    high as presented in the table.

    This patch (of 3):

    change_pXX_range() currently does not use mmu_gather, but instead
    implements its own deferred TLB flushes scheme.  This both complicates the
    code, as developers need to be aware of different invalidation schemes,
    and prevents opportunities to avoid TLB flushes or perform them in finer
    granularity.

    The use of mmu_gather for modified PTEs has benefits in various scenarios
    even if pages are not released.  For instance, if only a single page needs
    to be flushed out of a range of many pages, only that page would be
    flushed.  If a THP page is flushed, on x86 a single TLB invlpg instruction
    can be used instead of 512 instructions (or a full TLB flush, which
    Linux would actually use by default).  mprotect() over multiple VMAs
    requires a single flush.

    Use mmu_gather in change_pXX_range().  As the pages are not released, only
    record the flushed range using tlb_flush_pXX_range().

    Handle THP similarly and get rid of flush_cache_range() which becomes
    redundant since tlb_start_vma() calls it when needed.

    Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
    Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Andrew Cooper <andrew.cooper3@citrix.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Nick Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:08 -04:00
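
A rough sketch of the mmu_gather batching pattern change_pXX_range() is
switched to, with the kernel-internal helper names given as assumptions
rather than the exact hunk:

    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, mm);               /* start batching */
    /* ... walk the range; for each modified entry only record it ... */
    tlb_flush_pte_range(&tlb, addr, PAGE_SIZE);
    /* ... more PTEs/PMDs, possibly across several VMAs ... */
    tlb_finish_mmu(&tlb);                   /* one batched TLB flush */
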
Chris von Recklinghausen 5d4d6bace0 mm: shmem: don't truncate page if memory failure happens
Bugzilla: https://bugzilla.redhat.com/2120352

commit a7605426666196c5a460dd3de6f8dac1d3c21f00
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Jan 14 14:05:19 2022 -0800

    mm: shmem: don't truncate page if memory failure happens

    The current behavior of memory failure is to truncate the page cache
    regardless of whether the page is dirty or clean.  If the page is dirty,
    a later access will get the obsolete data from disk without any
    notification to the user.  This may cause silent data loss.  It is even
    worse for shmem: since shmem is an in-memory filesystem, truncating the
    page cache means discarding data blocks, and a later read would return
    all zeroes.

    The right approach is to keep the corrupted page in the page cache; any
    later access would return an error for syscalls or SIGBUS for page
    faults, until the file is truncated, hole punched or removed.  Regular
    storage-backed filesystems would be more complicated, so this patch is
    focused on shmem.  This also unblocks support for soft offlining
    shmem THP.

    [akpm@linux-foundation.org: coding style fixes]
    [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
      Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
    [Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
     slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com>
     and Muchun Song <songmuchun@bytedance.com> proposed and reworked the
     error handling of shmem_write_begin() suggested by Linus]
      Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/

    Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
    Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ajay Garg <ajaygargnsit@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Andy Lavr <andy.lavr@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
David Hildenbrand 38e95bedaa mm: Fix CVE-2022-2590 by reverting "mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2116301
CVE: CVE-2022-2590
Upstream Status: RHEL only
Tested: My reproducer no longer triggers with this patch.

CVE-2022-2590 allows for modifying shmem/tmpfs files without write
permissions on x86_64 and aarch64 with CONFIG_USERFAULTFD=y. For now,
it's sufficient to revert the problematic commit. If we ever need it
again (e.g., for extended uffd-wp support), we might want to re-apply it
along with an upstream fix that's still pending.

This reverts commit 61fedfa86f.

Signed-off-by: David Hildenbrand <david@redhat.com>
2022-08-08 10:59:03 +02:00
Aristeu Rozanski d4d75b89d8 mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 7c25a0b89a487878b0691e6524fb5a8827322194
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:42:08 2022 -0700

    mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()

    userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic(), which do not
    do any cache flushing for the target page.  The target page will then be
    mapped into user space at a different address (the user address), which
    might alias with the kernel address that was used to copy the data into
    the page.  Fix this by inserting flush_dcache_page() after
    copy_from_user() succeeds.

    Link: https://lkml.kernel.org/r/20220210123058.79206-7-songmuchun@bytedance.com
    Fixes: b6ebaedb4c ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic")
    Fixes: c1a4de99fa ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lars Persson <lars.persson@axis.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:20 -04:00
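
The shape of the fix, sketched (page, page_kaddr and src_addr as in the
surrounding copy code; not the verbatim hunk):

    ret = copy_from_user(page_kaddr, (const void __user *)src_addr, PAGE_SIZE);
    if (!ret)
        /* the page will be mapped at a different user-space virtual
         * address; flush the kernel alias so architectures with aliasing
         * D-caches see the freshly copied data */
        flush_dcache_page(page);
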
Aristeu Rozanski 4b2aa38f6e mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due lack of f4c4a3f484 and differences due RHEL-only 44740bc20b

commit cea86fe246b694a191804b47378eb9d77aefabec
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:26:39 2022 -0800

    mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

    Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
    inline functions which check (vma->vm_flags & VM_LOCKED) before calling
    mlock_page() and munlock_page() in mm/mlock.c.

    Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
    because we have understandable difficulty in accounting pte maps of THPs,
    and if passed a PageHead page, mlock_page() and munlock_page() cannot
    tell whether it's a pmd map to be counted or a pte map to be ignored.

    Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
    others, and use that to call mlock_vma_page() at the end of the page
    adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
    beginning? unimportant, but end was easier for assertions in testing).

    No page lock is required (although almost all adds happen to hold it):
    delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
    Certainly page lock did serialize with page migration, but I'm having
    difficulty explaining why that was ever important.

    Mlock accounting on THPs has been hard to define, differed between anon
    and file, involved PageDoubleMap in some places and not others, required
    clear_page_mlock() at some points.  Keep it simple now: just count the
    pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

    page_add_new_anon_rmap() callers unchanged: they have long been calling
    lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
    handling (it also checks for not VM_SPECIAL: I think that's overcautious,
    and inconsistent with other checks, that mmap_region() already prevents
    VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 61fedfa86f mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 9ae0f87d009ca6c4aab2882641ddfc319727e3db
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Nov 5 13:38:24 2021 -0700

    mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte

    Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4.

    IMHO all of them are very nice cleanups to existing code already,
    they're all small and self-contained.  They'll be needed by uffd-wp
    coming series.

    This patch (of 4):

    It was conditionally done previously, as there's one shmem special case
    where we use SetPageDirty() instead.  However, that's not necessary and it
    should be easier and cleaner to do it unconditionally in
    mfill_atomic_install_pte().

    The most recent discussion about this is here, where Hugh explained the
    history of SetPageDirty() and why it's possible that it's not required
    at all:

    https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/

    Currently mfill_atomic_install_pte() has three callers:

            1. shmem_mfill_atomic_pte
            2. mcopy_atomic_pte
            3. mcontinue_atomic_pte

    After the change: case (1) should have its SetPageDirty replaced by the
    dirty bit on pte (so we unify them together, finally), case (2) should
    have no functional change at all as it has page_in_cache==false, case
    (3) may add a dirty bit to the pte.  However, since case (3) is
    UFFDIO_CONTINUE for shmem, it's nearly 100% sure the page is dirty after
    all, because UFFDIO_CONTINUE normally requires another process to modify
    the page cache and kick the faulted thread, so it should not make a real
    difference either.

    This should make it much easier to follow on which case will set dirty
    for uffd, as we'll simply set it all now for all uffd related ioctls.
    Meanwhile, no special handling of SetPageDirty() if there's no need.

    Link: https://lkml.kernel.org/r/20210915181456.10739-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210915181456.10739-2-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:06 -04:00
Aristeu Rozanski bddf5b2fad mm/memcg: Convert mem_cgroup_charge() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 8f425e4ed0eb3ef0b2d85a9efccf947ca6aa9b1c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 25 09:27:04 2021 -0400

    mm/memcg: Convert mem_cgroup_charge() to take a folio

    Convert all callers of mem_cgroup_charge() to call page_folio() on the
    page they're currently passing in.  Many of them will be converted to
    use folios themselves soon.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Rafael Aquini 93219de029 userfaultfd: change mmap_changing to atomic
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit a759a909d42d727e918bd5248d6cff7562fa8109
Author: Nadav Amit <namit@vmware.com>
Date:   Thu Sep 2 14:58:56 2021 -0700

    userfaultfd: change mmap_changing to atomic

    Patch series "userfaultfd: minor bug fixes".

    Three unrelated bug fixes. The first two address possible issues (not
    too theoretical ones), but I did not encounter them in practice.

    The third patch addresses a test bug that causes the test to fail on my
    system. It has been sent before as part of a bigger RFC.

    This patch (of 3):

    mmap_changing is currently a boolean variable, which is set and cleared
    without any lock that protects against concurrent modifications.

    mmap_changing is supposed to mark whether userfaultfd page-fault handling
    should be retried since mappings are undergoing a change.  However,
    concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
    mmap_changing to be false, although the remove event was still not read
    (hence acknowledged) by the user.

    Change mmap_changing to atomic_t and increase/decrease appropriately.  Add
    a debug assertion to see whether mmap_changing is negative.

    Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
    Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
    Fixes: df2cc96e77 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:04 -05:00
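
A condensed sketch of the bool-to-atomic conversion described above (field
and macro names assumed from the userfaultfd context code, not the verbatim
hunk):

    /* writer side, e.g. around a non-cooperative REMOVE/UNMAP event */
    atomic_inc(&ctx->mmap_changing);
    /* ... until the event has been delivered to and read by the uffd reader ... */
    atomic_dec(&ctx->mmap_changing);
    VM_WARN_ON(atomic_read(&ctx->mmap_changing) < 0);

    /* ioctl/fault side: retry while a mapping change is still in flight */
    if (atomic_read(&ctx->mmap_changing))
        return -EAGAIN;
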
Axel Rasmussen 7d64ae3ab6 userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()
In a previous commit, we added the mfill_atomic_install_pte() helper.
This helper does the job of setting up PTEs for an existing page, to map
it into a given VMA.  It deals with both the anon and shmem cases, as well
as the shared and private cases.

In other words, shmem_mfill_atomic_pte() duplicates a case it already
handles.  So, expose it, and let shmem_mfill_atomic_pte() use it directly,
to reduce code duplication.

This requires that we refactor shmem_mfill_atomic_pte() a bit:

Instead of doing accounting (shmem_recalc_inode() et al) part-way through
the PTE setup, do it afterward.  This frees up mfill_atomic_install_pte()
from having to care about this accounting, and means we don't need to e.g.
shmem_uncharge() in the error path.

A side effect is this switches shmem_mfill_atomic_pte() to use
lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
This wrapper does some extra accounting in an exceptional case, if
appropriate, so it's actually the more correct thing to use.

Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00