Commit Graph

31 Commits

Rafael Aquini e373f575b7 mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249
CVE: CVE-2025-21861

This patch is a backport of the following upstream commit:
commit 41cddf83d8b00f29fd105e7a0777366edc69a5cf
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Feb 10 17:13:17 2025 +0100

    mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()

    If migration succeeded, we called
    folio_migrate_flags()->mem_cgroup_migrate() to migrate the memcg from the
    old to the new folio.  This will set memcg_data of the old folio to 0.

    Similarly, if migration failed, memcg_data of the dst folio is left unset.

    If we call folio_putback_lru() on such folios (memcg_data == 0), we will
    add the folio to be freed to the LRU, making memcg code unhappy.  Running
    the hmm selftests:

      # ./hmm-tests
      ...
      #  RUN           hmm.hmm_device_private.migrate ...
      [  102.078007][T14893] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff27d200 pfn:0x13cc00
      [  102.079974][T14893] anon flags: 0x17ff00000020018(uptodate|dirty|swapbacked|node=0|zone=2|lastcpupid=0x7ff)
      [  102.082037][T14893] raw: 017ff00000020018 dead000000000100 dead000000000122 ffff8881353896c9
      [  102.083687][T14893] raw: 00000007ff27d200 0000000000000000 00000001ffffffff 0000000000000000
      [  102.085331][T14893] page dumped because: VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled())
      [  102.087230][T14893] ------------[ cut here ]------------
      [  102.088279][T14893] WARNING: CPU: 0 PID: 14893 at ./include/linux/memcontrol.h:726 folio_lruvec_lock_irqsave+0x10e/0x170
      [  102.090478][T14893] Modules linked in:
      [  102.091244][T14893] CPU: 0 UID: 0 PID: 14893 Comm: hmm-tests Not tainted 6.13.0-09623-g6c216bc522fd #151
      [  102.093089][T14893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
      [  102.094848][T14893] RIP: 0010:folio_lruvec_lock_irqsave+0x10e/0x170
      [  102.096104][T14893] Code: ...
      [  102.099908][T14893] RSP: 0018:ffffc900236c37b0 EFLAGS: 00010293
      [  102.101152][T14893] RAX: 0000000000000000 RBX: ffffea0004f30000 RCX: ffffffff8183f426
      [  102.102684][T14893] RDX: ffff8881063cb880 RSI: ffffffff81b8117f RDI: ffff8881063cb880
      [  102.104227][T14893] RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
      [  102.105757][T14893] R10: 0000000000000001 R11: 0000000000000002 R12: ffffc900236c37d8
      [  102.107296][T14893] R13: ffff888277a2bcb0 R14: 000000000000001f R15: 0000000000000000
      [  102.108830][T14893] FS:  00007ff27dbdd740(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
      [  102.110643][T14893] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  102.111924][T14893] CR2: 00007ff27d400000 CR3: 000000010866e000 CR4: 0000000000750ef0
      [  102.113478][T14893] PKRU: 55555554
      [  102.114172][T14893] Call Trace:
      [  102.114805][T14893]  <TASK>
      [  102.115397][T14893]  ? folio_lruvec_lock_irqsave+0x10e/0x170
      [  102.116547][T14893]  ? __warn.cold+0x110/0x210
      [  102.117461][T14893]  ? folio_lruvec_lock_irqsave+0x10e/0x170
      [  102.118667][T14893]  ? report_bug+0x1b9/0x320
      [  102.119571][T14893]  ? handle_bug+0x54/0x90
      [  102.120494][T14893]  ? exc_invalid_op+0x17/0x50
      [  102.121433][T14893]  ? asm_exc_invalid_op+0x1a/0x20
      [  102.122435][T14893]  ? __wake_up_klogd.part.0+0x76/0xd0
      [  102.123506][T14893]  ? dump_page+0x4f/0x60
      [  102.124352][T14893]  ? folio_lruvec_lock_irqsave+0x10e/0x170
      [  102.125500][T14893]  folio_batch_move_lru+0xd4/0x200
      [  102.126577][T14893]  ? __pfx_lru_add+0x10/0x10
      [  102.127505][T14893]  __folio_batch_add_and_move+0x391/0x720
      [  102.128633][T14893]  ? __pfx_lru_add+0x10/0x10
      [  102.129550][T14893]  folio_putback_lru+0x16/0x80
      [  102.130564][T14893]  migrate_device_finalize+0x9b/0x530
      [  102.131640][T14893]  dmirror_migrate_to_device.constprop.0+0x7c5/0xad0
      [  102.133047][T14893]  dmirror_fops_unlocked_ioctl+0x89b/0xc80

    Likely, nothing else goes wrong: putting the last folio reference will
    remove the folio from the LRU again.  So besides memcg complaining, adding
    the folio to be freed to the LRU is just an unnecessary step.

    The new flow resembles what we have in migrate_folio_move(): add the dst
    to the lru, remove migration ptes, unlock and unref dst.

    Link: https://lkml.kernel.org/r/20250210161317.717936-1-david@redhat.com
    Fixes: 8763cb45ab ("mm/migrate: new memory migration helper for use with device memory")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:06 -04:00
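
For reference, a simplified sketch of the corrected per-entry flow described above (not the exact upstream hunk; the third argument of remove_migration_ptes() varies between kernel versions, and the surrounding array handling is elided):

    struct folio *src = page_folio(page);
    struct folio *dst = newpage ? page_folio(newpage) : src;

    /* Device folios are never on the LRU; normal dst folios go back on it. */
    if (!folio_is_zone_device(dst))
            folio_add_lru(dst);

    remove_migration_ptes(src, dst, 0);
    folio_unlock(src);
    folio_put(src);                 /* plain put: never folio_putback_lru() here */

    if (dst != src) {
            folio_unlock(dst);
            folio_put(dst);
    }
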
Rafael Aquini a10c4a570e mm: migrate_device: use more folio in migrate_device_finalize()
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249

This patch is a backport of the following upstream commit:
commit 58bf8c2bf47550bc94fea9cafd2bc7304d97102c
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Aug 26 14:58:12 2024 +0800

    mm: migrate_device: use more folio in migrate_device_finalize()

    Saves a couple of calls to compound_head() and remove last two callers of
    putback_lru_page().

    Link: https://lkml.kernel.org/r/20240826065814.1336616-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:05 -04:00
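
The folio conversions in this series all follow the same pattern: resolve the head page once with page_folio() and then stick to folio_*() helpers, so compound_head() is not recomputed on every call. A minimal illustration of the pattern (not taken from the patch itself):

    struct page *page = migrate_pfn_to_page(src_pfns[i]);
    struct folio *folio;

    if (!page)
            continue;

    folio = page_folio(page);       /* one compound_head() lookup */
    folio_lock(folio);              /* later helpers reuse 'folio' directly */
    /* ... migration work on the folio ... */
    folio_unlock(folio);
    folio_put(folio);               /* drop the reference taken at collect time */
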
Rafael Aquini b014eca865 mm: migrate_device: use more folio in migrate_device_unmap()
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249

This patch is a backport of the following upstream commit:
commit 39e618d986e46b02d0f0efcdeb7693f65a51a917
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Aug 26 14:58:11 2024 +0800

    mm: migrate_device: use more folio in migrate_device_unmap()

    The page for migrate_device_unmap() already has a reference, so it is safe
    to convert the page to folio to save a few calls to compound_head(), which
    removes the last isolate_lru_page() call.

    Link: https://lkml.kernel.org/r/20240826065814.1336616-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:05 -04:00
Rafael Aquini eb64fc69c6 mm: migrate_device: convert try_to_migrate() to folios
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249
Upstream Status: RHEL only

These are hunks from upstream's commit 4b8554c527f3 ("mm/rmap: Convert
try_to_migrate() to folios") that ended up lost in RHEL-9 due to
out-of-order backports of commits 4b8554c527f3 and 76cbbead253d
("mm: move the migrate_vma_* device migration code into its own file").

There is nothing functionally wrong with the current RHEL-9 backported
code, but it is better to bring these bits back to their original
upstream form, so it becomes easier and less error-prone to perform
further backports to this area in the future.

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:05 -04:00
Rafael Aquini 9345f8c811 mm: migrate_device: use a folio in migrate_device_range()
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249

This patch is a backport of the following upstream commit:
commit 53456b7b3f4c3427ff04ae5c92e6dba1b9bfbb23
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Aug 26 14:58:10 2024 +0800

    mm: migrate_device: use a folio in migrate_device_range()

    Save two calls to compound_head() and use folio throughout.

    Link: https://lkml.kernel.org/r/20240826065814.1336616-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:04 -04:00
Rafael Aquini 026885dd86 mm: migrate_device: convert to migrate_device_coherent_folio()
JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249

This patch is a backport of the following upstream commit:
commit 5c8525a37b78ddfee84ff6927b8838013ff2521e
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Aug 26 14:58:09 2024 +0800

    mm: migrate_device: convert to migrate_device_coherent_folio()

    Patch series "mm: finish isolate/putback_lru_page()".

    Convert to use more folios in migrate_device.c, then we could remove
    isolate_lru_page() and putback_lru_page().

    This patch (of 6):

    Save a few calls to compound_head() and use folio throughout.

    Link: https://lkml.kernel.org/r/20240826065814.1336616-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20240826065814.1336616-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:04 -04:00
Rafael Aquini 799c574090 mm/migrate_device: try to handle swapcache pages
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit df263d9a7dffee94ca5391120ee3b0587efa07f1
Author: Mika Penttilä <mpenttil@redhat.com>
Date:   Wed Jun 7 20:29:44 2023 +0300

    mm/migrate_device: try to handle swapcache pages

    Migrating file pages and swapcache pages into device memory is not
    supported.  Try to get rid of the swap cache, and if successful, go ahead
    as with other anonymous pages.

    Link: https://lkml.kernel.org/r/20230607172944.11713-1-mpenttil@redhat.com
    Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:16:59 -04:00
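
Conceptually, the change makes the unmap path try to drop the source page from the swap cache first and clears MIGRATE_PFN_MIGRATE if that fails. A hedged sketch using folio_test_swapcache()/folio_free_swap(); the exact helpers and their placement in migrate_device.c may differ from the real patch:

    struct folio *folio = page_folio(page);

    if (folio_test_swapcache(folio)) {
            /*
             * Swapcache pages cannot be migrated to device memory as-is;
             * try to detach the (locked) folio from the swap cache first.
             */
            if (!folio_free_swap(folio)) {
                    src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;    /* skip this page */
                    continue;
            }
    }
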
Rafael Aquini 25e4aa840e mm: remove references to pagevec
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 1fec6890bf2247ecc93f5491c2d3f33c333d5c6e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jun 21 17:45:56 2023 +0100

    mm: remove references to pagevec

    Most of these should just refer to the LRU cache rather than the data
    structure used to implement the LRU cache.

    Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:32 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
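
The conversion is mechanical, but the idiom matters for later backports: read the PTE once through ptep_get() and reuse the local copy instead of dereferencing the pointer repeatedly. An illustrative before/after (the surrounding context is hypothetical):

    /* Before: plain C dereferences, one per check */
    if (pte_present(*ptep) && pte_dirty(*ptep))
            set_page_dirty(page);

    /* After: a single READ_ONCE()-backed read, cached in a local */
    pte_t pte = ptep_get(ptep);

    if (pte_present(pte) && pte_dirty(pte))
            set_page_dirty(page);
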
Nico Pache 0b91dbac20 mm: enable page walking API to lock vmas during the walk
commit 49b0638502da097c15d46cd4e871dbaa022caf7c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:19 2023 -0700

    mm: enable page walking API to lock vmas during the walk

    walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas.  Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks.  With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them.  The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration.  Without this change a concurrent page
    can be faulted into the area and be left out of migration.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Suggested-by: Jann Horn <jannh@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:27 -06:00
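
For reference, the new member is a per-walk locking mode that walk_page_range() applies to each VMA it visits. A hedged sketch of how a walker declares it (the callback name is hypothetical):

    #include <linux/pagewalk.h>

    static const struct mm_walk_ops example_walk_ops = {
            .pmd_entry = example_pmd_entry,     /* hypothetical callback */
            .walk_lock = PGWALK_RDLOCK,         /* or PGWALK_WRLOCK / PGWALK_WRLOCK_VERIFY */
    };
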
Aristeu Rozanski 4c96f5154f mm: change to return bool for isolate_lru_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:35 2023 +0800

    mm: change to return bool for isolate_lru_page()

    The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
    care about the negative error of isolate_lru_page(), except one user in
    add_page_for_migration().  So we can convert the isolate_lru_page() to
    return a boolean value, which can help to make the code more clear when
    checking the return value of isolate_lru_page().

    Also convert all users' logic of checking the isolation state.

    No functional changes intended.

    Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
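
The caller-visible effect is that the isolation check flips polarity: success is now "true" rather than 0. A minimal before/after (the surrounding list handling is illustrative, not from the patch):

    /* Before: 0 on success, -EBUSY on failure */
    if (isolate_lru_page(page))
            goto out_putpage;

    /* After: boolean, true on success */
    if (!isolate_lru_page(page))
            goto out_putpage;
    list_add_tail(&page->lru, &pagelist);
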
Aristeu Rozanski 5455c3da6d mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7d4a8be0c4b2b7ffb367929d2b352651f083806b
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jan 10 13:57:22 2023 +1100

    mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

    mmu_notifier_range_update_to_read_only() was originally introduced in
    commit c6d23413f8 ("mm/mmu_notifier:
    mmu_notifier_range_update_to_read_only() helper") as an optimisation for
    device drivers that know a range has only been mapped read-only.  However
    there are no users of this feature so remove it.  As it is the only user
    of the struct mmu_notifier_range.vma field remove that also.

    Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Prarit Bhargava 25cf7e4e50 mm: Make pte_mkwrite() take a VMA
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: This is a rip-and-replace of the one-argument pte_mkwrite() with
the two-argument pte_mkwrite().  There are uses upstream that are not yet
in RHEL9.

commit 161e393c0f63592a3b95bdd8b55752653763fc6d
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date:   Mon Jun 12 17:10:29 2023 -0700

    mm: Make pte_mkwrite() take a VMA

    The x86 Shadow stack feature includes a new type of memory called shadow
    stack. This shadow stack memory has some unusual properties, which requires
    some core mm changes to function properly.

    One of these unusual properties is that shadow stack memory is writable,
    but only in limited ways. These limits are applied via a specific PTE
    bit combination. Nevertheless, the memory is writable, and core mm code
    will need to apply the writable permissions in the typical paths that
    call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
    that the x86 implementation of it can know whether to create regular
    writable or shadow stack mappings.

    But there are a couple of challenges to this. Modifying the signatures of
    each arch pte_mkwrite() implementation would be error prone because some
    are generated with macros and would need to be re-implemented. Also, some
    pte_mkwrite() callers operate on kernel memory without a VMA.

    So this can be done in a three step process. First pte_mkwrite() can be
    renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
    added that just calls pte_mkwrite_novma(). Next callers without a VMA can
    be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
    can be changed to take/pass a VMA.

    Previous work renamed pte_mkwrite() to pte_mkwrite_novma() and converted
    callers that don't have a VMA to use pte_mkwrite_novma(). So now
    change pte_mkwrite() to take a VMA and change the remaining callers to
    pass a VMA. Apply the same changes for pmd_mkwrite().

    No functional change.

    Suggested-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com

Omitted-fix: f441ff73f1ec powerpc: Fix pud_mkwrite() definition after pte_mkwrite() API changes
	pud_mkwrite() not in RHEL9 code for powerpc (removed previously)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:13 -04:00
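
After this change, callers operating on a user mapping pass the VMA, letting the architecture decide between a regular writable PTE and a shadow-stack PTE, while callers with no VMA use the _novma variant. A brief illustration:

    /* User mapping: the arch may consult vma->vm_flags (e.g. shadow stack on x86) */
    entry = pte_mkwrite(pte_mkdirty(entry), vma);

    /* No VMA (kernel memory): keep the old single-argument behaviour */
    entry = pte_mkwrite_novma(entry);
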
Jerry Snitselaar efb6748971 mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
JIRA: https://issues.redhat.com/browse/RHEL-26541
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff due to some commits not being backported yet such as c33c794828f2 ("mm: ptep_get() conversion"),
           and 959a78b6dd45 ("mm/hugetlb: use a folio in hugetlb_wp()").

commit ec8832d007cb7b50229ad5745eec35b847cc9120
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jul 25 23:42:06 2023 +1000

    mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

    Secondary TLBs are now invalidated from the architecture specific TLB
    invalidation functions.  Therefore there is no need to explicitly notify
    or invalidate as part of the range end functions.  This means we can
    remove mmu_notifier_invalidate_range_end_only() and some of the
    ptep_*_notify() functions.

    Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrew Donnellan <ajd@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
    Cc: Frederic Barrat <fbarrat@linux.ibm.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kevin Tian <kevin.tian@intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Nicolin Chen <nicolinc@nvidia.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zhi Wang <zhi.wang.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit ec8832d007cb7b50229ad5745eec35b847cc9120)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-02-26 15:49:51 -07:00
Chris von Recklinghausen d289211b36 mm/migrate_device: allow pte_offset_map_lock() to fail
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4b56069c95d69bfce0c0ffb2531d08216268a972
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:38:17 2023 -0700

    mm/migrate_device: allow pte_offset_map_lock() to fail

    migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
    splitting huge zero pmd, and the pmd_none() handling after successfully
    splitting huge page: those are now managed inside pte_offset_map_lock(),
    and by "goto again" when it fails.

    But the skip after unsuccessful split_huge_page() must stay: it avoids an
    endless loop.  The skip when pmd_bad()?  Remove that: it will be treated
    as a hole rather than a skip once cleared by pte_offset_map_lock(), but
    with different timing that would be so anyway; and it's arguably best to
    leave the pmd_bad() handling centralized there.

    migrate_vma_insert_page(): remove comment on the old pte_offset_map() and
    old locking limitations; remove the pmd_trans_unstable() check and just
    proceed to pte_offset_map_lock(), aborting when it fails (page has been
    charged to memcg, but as in other cases, it's uncharged when freed).

    Link: https://lkml.kernel.org/r/1131be62-2e84-da2f-8f45-807b2cbeeec5@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:18 -04:00
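
The pattern this series introduces is that pte_offset_map_lock() may now return NULL when the page table changed or disappeared under the caller, who must retry or bail out instead of assuming success. A hedged sketch of the "goto again" shape described above:

    pte_t *ptep;
    spinlock_t *ptl;

again:
    /* ... pmd_none()/THP handling ... */
    ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
    if (!ptep)
            goto again;     /* the PTE table changed under us: retry from the pmd */

    /* ... collect the PTEs ... */
    pte_unmap_unlock(ptep, ptl);
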
Donald Dutile 7f126a9a3f mm/migrate_device: return number of migrating pages in args->cpages
Bugzilla: http://bugzilla.redhat.com/2159905

commit 44af0b45d58d7b6f09ebb9081aa89b8bdc33a630
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Nov 11 11:51:35 2022 +1100

    mm/migrate_device: return number of migrating pages in args->cpages

    migrate_vma->cpages originally contained a count of the number of pages
    migrating including non-present pages which can be populated directly on
    the target.

    Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and
    migrate_device_coherent_page()") inadvertently changed this to contain
    just the number of pages that were unmapped.  Usage of migrate_vma->cpages
    isn't documented, but most drivers use it to see if all the requested
    addresses can be migrated so restore the original behaviour.

    Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
    Fixes: 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reported-by: Ralph Campbell <rcampbell@nvidia.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:27 -04:00
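
Drivers typically compare cpages against npages after migrate_vma_setup() to decide whether the whole request can be satisfied; this commit restores cpages to that meaning. An illustrative driver-side check (the owner cookie and flags are assumptions, not taken from the patch):

    struct migrate_vma args = {
            .vma         = vma,
            .start       = start,
            .end         = end,
            .src         = src_pfns,
            .dst         = dst_pfns,
            .pgmap_owner = drv_private,                 /* hypothetical owner */
            .flags       = MIGRATE_VMA_SELECT_SYSTEM,
    };

    if (migrate_vma_setup(&args))
            return -EFAULT;

    /* cpages again counts every page that can migrate or be populated */
    if (args.cpages != args.npages)
            pr_debug("only %lu of %lu pages can be migrated\n",
                     args.cpages, args.npages);
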
Donald Dutile 39d0194dba mm/migrate_device.c: add migrate_device_range()
Bugzilla: http://bugzilla.redhat.com/2159905

commit e778406b40dbb1342a1888cd751ca9d2982a12e2
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Sep 28 22:01:19 2022 +1000

    mm/migrate_device.c: add migrate_device_range()

    Device drivers can use the migrate_vma family of functions to migrate
    existing private anonymous mappings to device private pages.  These pages
    are backed by memory on the device with drivers being responsible for
    copying data to and from device memory.

    Device private pages are freed via the pgmap->page_free() callback when
    they are unmapped and their refcount drops to zero.  Alternatively they
    may be freed indirectly via migration back to CPU memory in response to a
    pgmap->migrate_to_ram() callback called whenever the CPU accesses an
    address mapped to a device private page.

    In other words drivers cannot control the lifetime of data allocated on
    the devices and must wait until these pages are freed from userspace.
    This causes issues when memory needs to reclaimed on the device, either
    because the device is going away due to a ->release() callback or because
    another user needs to use the memory.

    Drivers could use the existing migrate_vma functions to migrate data off
    the device.  However this would require them to track the mappings of each
    page which is both complicated and not always possible.  Instead drivers
    need to be able to migrate device pages directly so they can free up
    device memory.

    To allow that this patch introduces the migrate_device family of functions
    which are functionally similar to migrate_vma but which skips the initial
    lookup based on mapping.

    Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:26 -04:00
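
A hedged sketch of how a driver might use the new family to evacuate a range of its own device-private PFNs (allocation failure handling and the actual device-to-system copy are elided; src_pfns/dst_pfns are driver-allocated arrays of npages entries):

    unsigned long i;

    if (migrate_device_range(src_pfns, start_pfn, npages))
            return -EBUSY;

    for (i = 0; i < npages; i++) {
            struct page *dpage;

            if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
                    continue;

            dpage = alloc_page(GFP_HIGHUSER_MOVABLE);   /* system-memory target */
            if (!dpage)
                    continue;                           /* leave dst_pfns[i] empty */
            lock_page(dpage);                           /* destinations are handed over locked */
            dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
    }

    migrate_device_pages(src_pfns, dst_pfns, npages);
    /* copy data from device memory into the destination pages here */
    migrate_device_finalize(src_pfns, dst_pfns, npages);
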
Donald Dutile 175d1121e6 mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
Bugzilla: http://bugzilla.redhat.com/2159905

commit 241f68859656836ae3e85179cc224cc4c5e4e6a7
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Sep 28 22:01:18 2022 +1000

    mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()

    migrate_device_coherent_page() reuses the existing migrate_vma family of
    functions to migrate a specific page without providing a valid mapping or
    vma.  This looks a bit odd because it means we are calling migrate_vma_*()
    without setting a valid vma, however it was considered acceptable at the
    time because the details were internal to migrate_device.c and there was
    only a single user.

    One of the reasons the details could be kept internal was that this was
    strictly for migrating device coherent memory.  Such memory can be copied
    directly by the CPU without intervention from a driver.  However this
    isn't true for device private memory, and a future change requires similar
    functionality for device private memory.  So refactor the code into
    something more sensible for migrating device memory without a vma.

    Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:26 -04:00
Donald Dutile 6217af187c mm/memory.c: fix race when faulting a device private page
Bugzilla: http://bugzilla.redhat.com/2159905

commit 16ce101db85db694a91380aa4c89b25530871d33
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Sep 28 22:01:15 2022 +1000

    mm/memory.c: fix race when faulting a device private page

    Patch series "Fix several device private page reference counting issues",
    v2

    This series aims to fix a number of page reference counting issues in
    drivers dealing with device private ZONE_DEVICE pages.  These result in
    use-after-free type bugs, either from accessing a struct page which no
    longer exists because it has been removed or accessing fields within the
    struct page which are no longer valid because the page has been freed.

    During normal usage it is unlikely these will cause any problems.  However
    without these fixes it is possible to crash the kernel from userspace.
    These crashes can be triggered either by unloading the kernel module or
    unbinding the device from the driver prior to a userspace task exiting.
    In modules such as Nouveau it is also possible to trigger some of these
    issues by explicitly closing the device file-descriptor prior to the task
    exiting and then accessing device private memory.

    This involves some minor changes to both PowerPC and AMD GPU code.
    Unfortunately I lack hardware to test either of those so any help there
    would be appreciated.  The changes mimic what is done for both Nouveau
    and hmm-tests, though, so I doubt they will cause problems.

    This patch (of 8):

    When the CPU tries to access a device private page the migrate_to_ram()
    callback associated with the pgmap for the page is called.  However no
    reference is taken on the faulting page.  Therefore a concurrent migration
    of the device private page can free the page and possibly the underlying
    pgmap.  This results in a race which can crash the kernel due to the
    migrate_to_ram() function pointer becoming invalid.  It also means drivers
    can't reliably read the zone_device_data field because the page may have
    been freed with memunmap_pages().

    Close the race by getting a reference on the page while holding the ptl to
    ensure it has not been freed.  Unfortunately the elevated reference count
    will cause the migration required to handle the fault to fail.  To avoid
    this failure pass the faulting page into the migrate_vma functions so that
    if an elevated reference count is found it can be checked to see if it's
    expected or not.

    [mpe@ellerman.id.au: fix build]
      Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
    Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
    Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:26 -04:00
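
The driver-visible part of this fix is the new fault_page member of struct migrate_vma: a migrate_to_ram() callback passes the faulting page in so the core code can tolerate the extra reference taken while holding the PTL. A hedged sketch of such a handler ("drvdata" is a hypothetical owner cookie; allocation and copy are elided):

    static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
    {
            unsigned long src_pfn = 0, dst_pfn = 0;
            struct migrate_vma args = {
                    .vma         = vmf->vma,
                    .start       = vmf->address,
                    .end         = vmf->address + PAGE_SIZE,
                    .src         = &src_pfn,
                    .dst         = &dst_pfn,
                    .pgmap_owner = drvdata,         /* hypothetical owner cookie */
                    .flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
                    .fault_page  = vmf->page,       /* new: the page the CPU faulted on */
            };

            if (migrate_vma_setup(&args))
                    return VM_FAULT_SIGBUS;
            /* allocate a system page, copy the data, fill dst_pfn ... */
            migrate_vma_pages(&args);
            migrate_vma_finalize(&args);
            return 0;
    }
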
Rafael Aquini 9f8a34b521 mm: remember young/dirty bit for page migrations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168392

This patch is a backport of the following upstream commit:
commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Aug 11 12:13:29 2022 -0400

    mm: remember young/dirty bit for page migrations

    When page migration happens, we always ignore the young/dirty bit settings
    in the old pgtable, and marking the page as old in the new page table
    using either pte_mkold() or pmd_mkold(), and keeping the pte clean.

    That's fine from functional-wise, but that's not friendly to page reclaim
    because the moving page can be actively accessed within the procedure.
    Not to mention hardware setting the young bit can bring quite some
    overhead on some systems, e.g.  x86_64 needs a few hundreds nanoseconds to
    set the bit.  The same slowdown problem to dirty bits when the memory is
    first written after page migration happened.

    Actually we can easily remember the A/D bit configuration and recover the
    information after the page is migrated.  To achieve it, define a new set
    of bits in the migration swap offset field to cache the A/D bits for old
    pte.  Then when removing/recovering the migration entry, we can recover
    the A/D bits even if the page changed.

    One thing to mention is that here we used max_swapfile_size() to detect
    how many swp offset bits we have, and we'll only enable this feature if we
    know the swp offset is big enough to store both the PFN value and the A/D
    bits.  Otherwise the A/D bits are dropped like before.

    Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andi Kleen <andi.kleen@intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-04-03 10:16:25 -04:00
Chris von Recklinghausen 30ef2db262 mm/migrate: Convert migrate_page() to migrate_folio()
Conflicts:
	drivers/gpu/drm/i915/gem/i915_gem_userptr.c - We already have
		7a3deb5bcc ("Merge DRM changes from upstream v5.19..v6.0")
		so it already has the change from this patch.
	drop changes to fs/btrfs/disk-io.c - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit 541846502f4fe826cd7c16e4784695ac90736585
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 10:27:41 2022 -0400

    mm/migrate: Convert migrate_page() to migrate_folio()

    Convert all callers to pass a folio.  Most have the folio
    already available.  Switch all users from aops->migratepage to
    aops->migrate_folio.  Also turn the documentation into kerneldoc.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen 6573caddd5 mm/gup: migrate device coherent pages when pinning instead of failing
Bugzilla: https://bugzilla.redhat.com/2160210

commit b05a79d4377f6dcc30683008ffd1c531ea965393
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Jul 15 10:05:13 2022 -0500

    mm/gup: migrate device coherent pages when pinning instead of failing

    Currently any attempts to pin a device coherent page will fail.  This is
    because device coherent pages need to be managed by a device driver, and
    pinning them would prevent a driver from migrating them off the device.

    However this is no reason to fail pinning of these pages.  These are
    coherent and accessible from the CPU so can be migrated just like pinning
    ZONE_MOVABLE pages.  So instead of failing all attempts to pin them first
    try migrating them out of ZONE_DEVICE.

    [hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c]
    Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen ff09b78a0b mm: add device coherent vma selection for memory migration
Bugzilla: https://bugzilla.redhat.com/2160210

commit dd19e6d8ffaa1289d75d7833de97faf1b6b2c8e4
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:12 2022 -0500

    mm: add device coherent vma selection for memory migration

    This case is used to migrate pages from device memory, back to system
    memory.  Device coherent type memory is cache coherent from device and CPU
    point of view.

    Link: https://lkml.kernel.org/r/20220715150521.18165-6-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Reviewed-by: Alistair Poppple <apopple@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 279ba99c6b mm: add zone device coherent type memory support
Bugzilla: https://bugzilla.redhat.com/2160210

commit f25cbb7a95a24ff9a2a3bebd308e303942ae6b2c
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:10 2022 -0500

    mm: add zone device coherent type memory support

    Device memory that is cache coherent from device and CPU point of view.
    This is used on platforms that have an advanced system bus (like CAPI or
    CXL).  Any page of a process can be migrated to such memory.  However, no
    one should be allowed to pin such memory so that it can always be evicted.

    [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page]
    Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Nico Pache 2a1907b9d5 mm/migrate_device.c: fix a misleading and outdated comment
commit 0742e49026121371c7ce1f640628c68c7da175d6
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Aug 30 12:01:38 2022 +1000

    mm/migrate_device.c: fix a misleading and outdated comment

    Commit ab09243aa95a ("mm/migrate.c: remove MIGRATE_PFN_LOCKED") changed
    the way trylock_page() in migrate_vma_collect_pmd() works without updating
    the comment.  Reword the comment to be less misleading and a better
    reflection of what happens.

    Link: https://lkml.kernel.org/r/20220830020138.497063-1-apopple@nvidia.com
    Fixes: ab09243aa95a ("mm/migrate.c: remove MIGRATE_PFN_LOCKED")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reported-by: Peter Xu <peterx@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00
Nico Pache 2acf489940 mm/migrate_device.c: copy pte dirty bit to page
commit fd35ca3d12cc9922d7d9a35f934e72132dbc4853
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Sep 2 10:35:53 2022 +1000

    mm/migrate_device.c: copy pte dirty bit to page

    migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
    installs migration entries directly if it can lock the migrating page.
    When removing a dirty pte the dirty bit is supposed to be carried over to
    the underlying page to prevent it being lost.

    Currently migrate_vma_*() can only be used for private anonymous mappings.
    That means loss of the dirty bit usually doesn't result in data loss
    because these pages are typically not file-backed.  However pages may be
    backed by swap storage which can result in data loss if an attempt is made
    to migrate a dirty page that doesn't yet have the PageDirty flag set.

    In this case migration will fail due to unexpected references but the
    dirty pte bit will be lost.  If the page is subsequently reclaimed data
    won't be written back to swap storage as it is considered uptodate,
    resulting in data loss if the page is subsequently accessed.

    Prevent this by copying the dirty bit to the page when removing the pte to
    match what try_to_migrate_one() does.

    Link: https://lkml.kernel.org/r/dd48e4882ce859c295c1a77612f66d198b0403f9.1662078528.git-series.apopple@nvidia.com
    Fixes: 8c3328f1f3 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reported-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: huang ying <huang.ying.caritas@gmail.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Paul Mackerras <paulus@ozlabs.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
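
The core of the fix is a small ordering detail in the collect path: once the dirty PTE has been cleared, its dirty bit must be transferred to the page so that later reclaim still writes the data back to swap. A hedged sketch (the helper used in a given kernel may be folio- or page-based):

    pte = ptep_get_and_clear(mm, addr, ptep);

    /* The PTE is gone: preserve its dirty bit on the page itself. */
    if (pte_dirty(pte))
            folio_mark_dirty(page_folio(page));
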
Nico Pache a3cdb74655 mm/migrate_device.c: add missing flush_cache_page()
commit a3589e1d5fe39c3d9fdd291b111524b93d08bc32
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Sep 2 10:35:52 2022 +1000

    mm/migrate_device.c: add missing flush_cache_page()

    Currently we only call flush_cache_page() for the anon_exclusive case,
    however in both cases we clear the pte so should flush the cache.

    Link: https://lkml.kernel.org/r/5676f30436ab71d1a587ac73f835ed8bd2113ff5.1662078528.git-series.apopple@nvidia.com
    Fixes: 8c3328f1f3 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: huang ying <huang.ying.caritas@gmail.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Paul Mackerras <paulus@ozlabs.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache 2b48e10d05 mm/migrate_device.c: flush TLB while holding PTL
commit 60bae73708963de4a17231077285bd9ff2f41c44
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Sep 2 10:35:51 2022 +1000

    mm/migrate_device.c: flush TLB while holding PTL

    When clearing a PTE the TLB should be flushed whilst still holding the PTL
    to avoid a potential race with madvise/munmap/etc.  For example consider
    the following sequence:

      CPU0                          CPU1
      ----                          ----

      migrate_vma_collect_pmd()
      pte_unmap_unlock()
                                    madvise(MADV_DONTNEED)
                                    -> zap_pte_range()
                                    pte_offset_map_lock()
                                    [ PTE not present, TLB not flushed ]
                                    pte_unmap_unlock()
                                    [ page is still accessible via stale TLB ]
      flush_tlb_range()

    In this case the page may still be accessed via the stale TLB entry after
    madvise returns.  Fix this by flushing the TLB while holding the PTL.
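
    A condensed sketch of the resulting ordering at the end of
    migrate_vma_collect_pmd() (illustrative; "unmapped" counts the ptes
    cleared in the loop above it):

      /* flush the TLB while the page table lock is still held */
      if (unmapped)
              flush_tlb_range(walk->vma, start, end);

      arch_leave_lazy_mmu_mode();
      pte_unmap_unlock(ptep - 1, ptl);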

    Fixes: 8c3328f1f3 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
    Link: https://lkml.kernel.org/r/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reported-by: Nadav Amit <nadav.amit@gmail.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: huang ying <huang.ying.caritas@gmail.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Paul Mackerras <paulus@ozlabs.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Chris von Recklinghausen 30e9a2455a mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c287605fd56466e645693eff3ae7c08fba56e0a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

    Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
    exclusive, and use that information to make GUP pins reliable and stay
    consistent with the page mapped into the page table even if the page table
    entry gets write-protected.

    With that information at hand, we can extend our COW logic to always reuse
    anonymous pages that are exclusive.  For anonymous pages that might be
    shared, the existing logic applies.

    As already documented, PG_anon_exclusive is usually only expressive in
    combination with a page table entry.  Especially PTE- vs. PMD-mapped
    anonymous pages require more thought; some examples: due to mremap() we
    can easily have a single compound page PTE-mapped into multiple page
    tables exclusively in a single process -- multiple page table locks apply.
    Further, due to MADV_WIPEONFORK we might not necessarily write-protect
    all PTEs, and only some subpages might be pinned.  Long story short: once
    PTE-mapped, we have to track information about exclusivity per sub-page,
    but until then, we can just track it for the compound page in the head
    page, without having to update a whole bunch of subpages all of the time
    for a simple PMD mapping of a THP.

    For simplicity, this commit mostly talks about "anonymous pages", while
    for THP this actually means "the part of an anonymous folio referenced via
    a page table entry".

    To avoid spilling PG_anon_exclusive code all over the mm code-base, we let
    the anon rmap code handle all the PG_anon_exclusive logic it can easily
    handle.

    If a writable, present page table entry points at an anonymous (sub)page,
    that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
    pin (FOLL_PIN) on an anonymous page referenced via a present page table
    entry, it must only pin if PG_anon_exclusive is set for the mapped
    (sub)page.

    This commit doesn't adjust GUP, so this is only implicitly handled for
    FOLL_WRITE, follow-up commits will teach GUP to also respect it for
    FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
    reliable.

    Whenever an anonymous page is to be shared (fork(), KSM), or when
    temporarily unmapping an anonymous page (swap, migration), the relevant
    PG_anon_exclusive bit has to be cleared to mark the anonymous page
    possibly shared.  Clearing will fail if there are GUP pins on the page:

    * For fork(), this means having to copy the page and not being able to
      share it.  fork() protects against concurrent GUP using the PT lock and
      the src_mm->write_protect_seq.

    * For KSM, this means sharing will fail.  For swap, this means unmapping
      will fail.  For migration, this means migration will fail early.  All
      three cases protect against concurrent GUP using the PT lock and a
      proper clear/invalidate+flush of the relevant page table entry.

    This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
    pinned page gets mapped R/O and the successive write fault ends up
    replacing the page instead of reusing it.  It improves the situation for
    O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
    fork() is *not* involved; however, swapout and fork() are still
    problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
    users will fix the issue for them.

    I. Details about basic handling

    I.1. Fresh anonymous pages

    page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
    given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
    the mechanism by which fresh anonymous pages come into existence (besides
    the migration code, where we copy page->mapping), all fresh anonymous
    pages will start out
    as exclusive.

    I.2. COW reuse handling of anonymous pages

    When a COW handler stumbles over a (sub)page that's marked exclusive, it
    simply reuses it.  Otherwise, the handler tries harder under page lock to
    detect if the (sub)page is exclusive and can be reused.  If exclusive,
    page_move_anon_rmap() will mark the given (sub)page exclusive.

    Note that hugetlb code does not yet check for PageAnonExclusive(), as it
    still uses the old COW logic that is prone to the COW security issue
    because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
    pages are a scarce resource.
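
    A condensed sketch of that reuse decision, loosely modeled on do_wp_page()
    (the real code performs additional checks -- swap cache, KSM, lock
    ordering; the refcount test below is a simplified stand-in):

      if (!PageAnonExclusive(page) && trylock_page(page)) {
              /* try harder under the page lock to prove exclusivity */
              if (page_count(page) == 1)                /* simplified stand-in */
                      page_move_anon_rmap(page, vma);   /* sets PG_anon_exclusive */
              unlock_page(page);
      }

      if (PageAnonExclusive(page))
              return wp_page_reuse(vmf);        /* exclusive: just reuse */
      return wp_page_copy(vmf);                 /* possibly shared: break COW */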

    I.3. Migration handling

    try_to_migrate() has to try marking an exclusive anonymous page shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  migrate_vma_collect_pmd() and
    __split_huge_pmd_locked() are handled similarly.

    Writable migration entries implicitly point at exclusive anonymous pages.
    For readable migration entries that information is stored via a new
    "readable-exclusive" migration entry, specific to anonymous pages.

    When restoring a migration entry in remove_migration_pte(), information
    about exclusivity is detected via the migration entry type, and
    RMAP_EXCLUSIVE is set accordingly for
    page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
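
    A condensed sketch of the anonymous-page path in try_to_migrate_one()
    (abbreviated; the real code walks the mapping via page_vma_mapped_walk()
    and also handles dirty bits, huge pages and mmu notifiers):

      anon_exclusive = PageAnon(page) && PageAnonExclusive(page);

      flush_cache_page(vma, address, pte_pfn(*ptep));
      pteval = ptep_clear_flush(vma, address, ptep);

      if (anon_exclusive && page_try_share_anon_rmap(page)) {
              /* GUP pins exist: restore the pte and fail the unmap */
              set_pte_at(mm, address, ptep, pteval);
              return false;
      }

      if (pte_write(pteval))
              entry = make_writable_migration_entry(page_to_pfn(page));
      else if (anon_exclusive)
              entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
      else
              entry = make_readable_migration_entry(page_to_pfn(page));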

    I.4. Swapout handling

    try_to_unmap() has to try marking the mapped page possibly shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  For now, information about exclusivity is lost.  In
    the future, we might want to remember that information in the swap entry
    in some cases, however, it requires more thought, care, and a way to store
    that information in swap entries.

    I.5. Swapin handling

    do_swap_page() will never stumble over exclusive anonymous pages in the
    swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
    to detect manually if an anonymous page is exclusive and has to set
    RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
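
    In code terms this is roughly (condensed from do_swap_page(); the checks
    that actually establish "exclusive" -- swap count, swap cache, page lock
    -- are elided here):

      rmap_t rmap_flags = RMAP_NONE;

      if (exclusive)          /* determined manually, as described above */
              rmap_flags |= RMAP_EXCLUSIVE;

      page_add_anon_rmap(page, vma, vmf->address, rmap_flags);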

    I.6. THP handling

    __split_huge_pmd_locked() has to move the information about exclusivity
    from the PMD to the PTEs.

    a) In case we have a readable-exclusive PMD migration entry, simply
       insert readable-exclusive PTE migration entries.

    b) In case we have a present PMD entry and we don't want to freeze
       ("convert to migration entries"), simply forward PG_anon_exclusive to
       all sub-pages, no need to temporarily clear the bit.

    c) In case we have a present PMD entry and want to freeze, handle it
       similar to try_to_migrate(): try marking the page shared first.  In
       case we fail, we ignore the "freeze" instruction and simply split
       ordinarily.  try_to_migrate() will properly fail because the THP is
       still mapped via PTEs.

    When splitting a compound anonymous folio (THP), the information about
    exclusivity is implicitly handled via the migration entries: no need to
    replicate PG_anon_exclusive manually.
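
    A condensed sketch of how __split_huge_pmd_locked() handles cases (b) and
    (c) above (abbreviated; "i" iterates over the subpages when the individual
    ptes are written out):

      anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
      if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
              freeze = false;   /* pinned: split only, try_to_migrate() fails later */

      /* case (b): present PMD, no freeze -- forward the bit to each subpage */
      if (!freeze && anon_exclusive)
              SetPageAnonExclusive(page + i);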

    I.7. fork() handling

    fork() handling is relatively easy, because PG_anon_exclusive is only
    expressive for some page table entry types.

    a) Present anonymous pages

    page_try_dup_anon_rmap() will mark the given subpage shared -- which will
    fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
    PMD to handle it on the PTE level).

    Note that device exclusive entries are just a pointer at a PageAnon()
    page.  fork() will first convert a device exclusive entry to a present
    page table entry and handle it just like present anonymous pages.

    b) Device private entry

    Device private entries point at PageAnon() pages that cannot be mapped
    directly and, therefore, cannot get pinned.

    page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
    fail because such pages cannot get pinned.

    c) HW poison entries

    PG_anon_exclusive will remain untouched and is stale -- the page table
    entry is just a placeholder after all.

    d) Migration entries

    Writable and readable-exclusive entries are converted to readable entries:
    possibly shared.
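
    Returning to case (a), the core of the fork() path boils down to roughly
    the following (a condensed sketch of copy_present_pte(); argument lists
    are approximate and the surrounding rss accounting is omitted):

      get_page(page);
      if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
              /* page may be pinned: PG_anon_exclusive can't be cleared, copy it */
              put_page(page);
              return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
                                       addr, rss, prealloc, page);
      }
      /* shared successfully: the (sub)page is now marked possibly shared */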

    I.8. mprotect() handling

    mprotect() only has to properly handle the new readable-exclusive
    migration entry:

    When write-protecting a migration entry that points at an anonymous page,
    remember the information about exclusivity via the "readable-exclusive"
    migration entry type.
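
    In code terms the change_pte_range() handling becomes roughly (condensed;
    the real code also carries over soft-dirty and uffd-wp bits):

      if (is_writable_migration_entry(entry)) {
              struct page *page = pfn_swap_entry_to_page(entry);

              /* a writable entry implies the anon page was exclusive */
              if (PageAnon(page))
                      entry = make_readable_exclusive_migration_entry(swp_offset(entry));
              else
                      entry = make_readable_migration_entry(swp_offset(entry));
              newpte = swp_entry_to_pte(entry);
      }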

    II. Migration and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a migration entry, we have to mark the page possibly
    shared and synchronize against GUP-fast by a proper clear/invalidate+flush
    to make the following scenario impossible:

    1. try_to_migrate() places a migration entry after checking for GUP pins
       and marks the page possibly shared.

    2. GUP-fast pins the page due to lack of synchronization

    3. fork() converts the "writable/readable-exclusive" migration entry into a
       readable migration entry

    4. Migration fails due to the GUP pin (failing to freeze the refcount)

    5. Migration entries are restored. PG_anon_exclusive is lost

    -> We have a pinned page that is not marked exclusive anymore.

    Note that we move information about exclusivity from the page to the
    migration entry as it otherwise highly overcomplicates fork() and
    PTE-mapping a THP.

    III. Swapout and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a swap entry, we have to mark the page possibly shared
    and synchronize against GUP-fast by a proper clear/invalidate+flush to
    make the following scenario impossible:

    1. try_to_unmap() places a swap entry after checking for GUP pins and
       clears exclusivity information on the page.

    2. GUP-fast pins the page due to lack of synchronization.

    -> We have a pinned page that is not marked exclusive anymore.

    If we'd ever store information about exclusivity in the swap entry,
    similar to migration handling, the same considerations as in II would
    apply.  This is future work.

    Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen f0a431e143 mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40f2bbf71161fa9195c7869004290003af152375
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()

    New anonymous pages are always mapped natively: only THP/khugepaged code
    maps a new compound anonymous page and passes "true".  Otherwise, we're
    just dealing with simple, non-compound pages.

    Let's give the interface clearer semantics and document these.  Remove the
    PageTransCompound() sanity check from page_add_new_anon_rmap().
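
    At the call sites the change looks like this (illustrative; the function
    now derives the compound case from the page itself):

      /* before: callers had to spell out whether the page is compound */
      page_add_new_anon_rmap(page, vma, addr, false);

      /* after: no "compound" argument, PageCompound(page) is checked internally */
      page_add_new_anon_rmap(page, vma, addr);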

    Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen cf5ab9070a mm: move the migrate_vma_* device migration code into its own file
Conflicts:
        mm/migrate.c - The backport of
                ab09243aa95a ("mm/migrate.c: remove MIGRATE_PFN_LOCKED")
                had a conflict because of the backports of
                413248faac ("mm/rmap: Convert try_to_migrate() to folios")
                and
                4eecb8b9163d ("mm/migrate: Convert remove_migration_ptes() to folios")
                which leads to a difference in deleted code.
        mm/migrate_device.c - because of 413248faac and 4eecb8b9163d add
                code to use folios for calls to try_to_migrate and
                remove_migration_ptes

Bugzilla: https://bugzilla.redhat.com/2120352

commit 76cbbead253ddcae9878be0d702208bb1e4fac6f
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:38 2022 +1100

    mm: move the migrate_vma_* device migration code into its own file

    Split the code used to migrate to and from ZONE_DEVICE memory from
    migrate.c into a new file.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-14-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00