Commit Graph

687 Commits

Rafael Aquini 683fe6564d mm: fix crashes from deferred split racing folio migration
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2024-42234

This patch is a backport of the following upstream commit:
commit be9581ea8c058d81154251cb0695987098996cad
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Jul 2 00:40:55 2024 -0700

    mm: fix crashes from deferred split racing folio migration

    Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
    flags when freeing, yet the flags shown are not bad: PG_locked had been
    set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
    deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
    symptoms implying double free by deferred split and large folio migration.

    6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
    folio migration") was right to fix the memcg-dependent locking broken in
    85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
    but missed a subtlety of deferred_split_scan(): it moves folios to its own
    local list to work on them without split_queue_lock, during which time
    folio->_deferred_list is not empty, but even the "right" lock does nothing
    to secure the folio and the list it is on.

    Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
    folio_migrate_mapping() can avoid the race by calling
    folio_undo_large_rmappable() while the old folio's reference count is
    temporarily frozen to 0 - adding such a freeze in the !mapping case too
    (originally, folio lock and unmapping and no swap cache left an anon folio
    unreachable, so no freezing was needed there: but the deferred split queue
    offers a way to reach it).
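    [Editorial aside, not part of the upstream commit: a minimal userspace
    model of the refcount-freeze pattern described above.  A scanner modelled
    on folio_try_get() can only take a new reference while the count is
    non-zero, so freezing the count to 0 (as folio_ref_freeze() does) shuts
    the deferred-split path out while migration rewrites the folio.]

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        static _Atomic int refcount = 2;  /* e.g. migration ref + deferred list ref */

        /* Model of folio_try_get(): increment only while the count is > 0. */
        static bool try_get(void)
        {
            int old = atomic_load(&refcount);
            while (old > 0) {
                if (atomic_compare_exchange_weak(&refcount, &old, old + 1))
                    return true;
            }
            return false;
        }

        /* Model of folio_ref_freeze(): atomically drop an expected count to 0. */
        static bool ref_freeze(int expected)
        {
            return atomic_compare_exchange_strong(&refcount, &expected, 0);
        }

        int main(void)
        {
            if (ref_freeze(2))                  /* migrator freezes the count */
                printf("frozen: try_get() -> %d (fails)\n", try_get());
            atomic_store(&refcount, 2);         /* model of ref_unfreeze()    */
            printf("unfrozen: try_get() -> %d (succeeds)\n", try_get());
            return 0;
        }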

    Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
    Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:40:00 -04:00
Rafael Aquini 5f4fc24a99 mm/migrate: fix deadlock in migrate_pages_batch() on large folios
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
  * This backport drops the 2nd hunk from the upstream commit because RHEL-9
    misses commit 7262f208ca68 ("mm/migrate: split source folio if it is on
    deferred split list"), which itself is a follow-up fix for v6.10 code
    that was never backported into RHEL-9

This patch is a backport of the following upstream commit:
commit 2e6506e1c4eed2676a8412231046f31e10e240da
Author: Gao Xiang <hsiangkao@linux.alibaba.com>
Date:   Mon Jul 29 10:13:06 2024 +0800

    mm/migrate: fix deadlock in migrate_pages_batch() on large folios

    Currently, migrate_pages_batch() can end up holding the locks of multiple
    folios in an arbitrary order.  Although folio_trylock() is used to avoid
    deadlock, as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
    asynchronously firstly") mentioned, it seems try_split_folio() was still
    missed.

    It was found by a compaction stress test after I explicitly enabled EROFS
    compressed files to use large folios; I cannot reproduce the case with the
    same workload if large folio support is off (current mainline).
    Typically, filesystem reads (with locked file-backed folios) could use
    another bdev/meta inode to issue some other I/O (e.g. for inode extent
    metadata or cached compressed data), so the locking order will be:

      file-backed folios  (A)
         bdev/meta folios (B)

    The following calltrace shows the deadlock:
       Thread 1 takes (B) lock and tries to take folio (A) lock
       Thread 2 takes (A) lock and tries to take folio (B) lock

    [Thread 1]
    INFO: task stress:1824 blocked for more than 30 seconds.
          Tainted: G           OE      6.10.0-rc7+ #6
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:stress          state:D stack:0     pid:1824  tgid:1824  ppid:1822   flags:0x0000000c
    Call trace:
     __switch_to+0xec/0x138
     __schedule+0x43c/0xcb0
     schedule+0x54/0x198
     io_schedule+0x44/0x70
     folio_wait_bit_common+0x184/0x3f8
                            <-- folio mapping ffff00036d69cb18 index 996  (**)
     __folio_lock+0x24/0x38
     migrate_pages_batch+0x77c/0xea0        // try_split_folio (mm/migrate.c:1486:2)
                                            // migrate_pages_batch (mm/migrate.c:1734:16)
                    <--- LIST_HEAD(unmap_folios) has
                            ..
                            folio mapping 0xffff0000d184f1d8 index 1711;   (*)
                            folio mapping 0xffff0000d184f1d8 index 1712;
                            ..
     migrate_pages+0xb28/0xe90
     compact_zone+0xa08/0x10f0
     compact_node+0x9c/0x180
     sysctl_compaction_handler+0x8c/0x118
     proc_sys_call_handler+0x1a8/0x280
     proc_sys_write+0x1c/0x30
     vfs_write+0x240/0x380
     ksys_write+0x78/0x118
     __arm64_sys_write+0x24/0x38
     invoke_syscall+0x78/0x108
     el0_svc_common.constprop.0+0x48/0xf0
     do_el0_svc+0x24/0x38
     el0_svc+0x3c/0x148
     el0t_64_sync_handler+0x100/0x130
     el0t_64_sync+0x190/0x198

    [Thread 2]
    INFO: task stress:1825 blocked for more than 30 seconds.
          Tainted: G           OE      6.10.0-rc7+ #6
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:stress          state:D stack:0     pid:1825  tgid:1825  ppid:1822   flags:0x0000000c
    Call trace:
     __switch_to+0xec/0x138
     __schedule+0x43c/0xcb0
     schedule+0x54/0x198
     io_schedule+0x44/0x70
     folio_wait_bit_common+0x184/0x3f8
                            <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
     __folio_lock+0x24/0x38
     z_erofs_runqueue+0x384/0x9c0 [erofs]
     z_erofs_readahead+0x21c/0x350 [erofs]       <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
     read_pages+0x74/0x328
     page_cache_ra_order+0x26c/0x348
     ondemand_readahead+0x1c0/0x3a0
     page_cache_sync_ra+0x9c/0xc0
     filemap_get_pages+0xc4/0x708
     filemap_read+0x104/0x3a8
     generic_file_read_iter+0x4c/0x150
     vfs_read+0x27c/0x330
     ksys_pread64+0x84/0xd0
     __arm64_sys_pread64+0x28/0x40
     invoke_syscall+0x78/0x108
     el0_svc_common.constprop.0+0x48/0xf0
     do_el0_svc+0x24/0x38
     el0_svc+0x3c/0x148
     el0t_64_sync_handler+0x100/0x130
     el0t_64_sync+0x190/0x198
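    [Editorial aside, not part of the upstream commit: a minimal userspace
    model of the general trylock-and-back-off rule that avoids the A/B vs B/A
    deadlock shown in the calltraces above; it does not reproduce the kernel
    fix itself.]

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* file-backed folio */
        static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* bdev/meta folio   */

        /* Take 'first', then try 'second'; on failure drop 'first' and retry. */
        static void lock_pair(pthread_mutex_t *first, pthread_mutex_t *second)
        {
            for (;;) {
                pthread_mutex_lock(first);
                if (pthread_mutex_trylock(second) == 0)
                    return;                     /* got both, no deadlock */
                pthread_mutex_unlock(first);    /* back off and retry    */
                usleep(1000);
            }
        }

        static void *thread1(void *arg)         /* locks B then A */
        {
            lock_pair(&lock_b, &lock_a);
            puts("thread1 holds B+A");
            pthread_mutex_unlock(&lock_a);
            pthread_mutex_unlock(&lock_b);
            return NULL;
        }

        static void *thread2(void *arg)         /* locks A then B */
        {
            lock_pair(&lock_a, &lock_b);
            puts("thread2 holds A+B");
            pthread_mutex_unlock(&lock_b);
            pthread_mutex_unlock(&lock_a);
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, thread1, NULL);
            pthread_create(&t2, NULL, thread2, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return 0;
        }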

    Link: https://lkml.kernel.org/r/20240729021306.398286-1-hsiangkao@linux.alibaba.com
    Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
    Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:58 -04:00
Rafael Aquini e3c54fdd26 mm/migrate: fix shmem xarray update during migration
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
  * notable context difference due to RHEL-9 missing support
    for mTHP and its follow-up commits

This patch is a backport of the following upstream commit:
commit 60cf233b585cdf1f3c5e52d1225606b86acd08b0
Author: Zi Yan <ziy@nvidia.com>
Date:   Wed Mar 5 15:04:03 2025 -0500

    mm/migrate: fix shmem xarray update during migration

    A shmem folio can be either in page cache or in swap cache, but not at the
    same time.  Namely, once it is in swap cache, folio->mapping should be
    NULL, and the folio is no longer in a shmem mapping.

    In __folio_migrate_mapping(), folio_test_swapbacked() is used to determine
    the number of xarray entries to update, but that conflates the shmem
    in-page-cache case with the shmem in-swap-cache case.  It leads to xarray
    multi-index entry corruption, since it turns a sibling entry into a normal
    entry during xas_store() (see [1] for a userspace reproduction).  Fix it by
    using only folio_test_swapcache() to determine whether the xarray is
    storing swap cache entries, and thus choose the right number of xarray
    entries to update.
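    [Editorial aside: a simplified sketch of the rule above, not the literal
    upstream diff.  Swap cache stores one xarray entry per base page, while
    the page cache stores a single multi-index entry for a large folio, so
    the slot count must key off folio_test_swapcache(), not
    folio_test_swapbacked().]

        /* Hypothetical helper illustrating the entry-count decision. */
        static long entries_to_update(struct folio *folio)
        {
                if (folio_test_swapcache(folio))
                        return folio_nr_pages(folio);   /* one slot per subpage  */
                return 1;                               /* one multi-index entry */
        }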

    [1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/

    Note:
    In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is
    used to get the swap_cache address space, but that ignores the shmem folio
    in swap cache case.  It could lead to a NULL pointer dereference when an
    in-swap-cache shmem folio is split at __xa_store(), since
    !folio_test_anon() is true and folio->mapping is NULL.  But fortunately,
    its caller split_huge_page_to_list_to_order() bails out early with EBUSY
    when folio->mapping is NULL.  So no need to take care of it here.

    Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com
    Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: Liu Shixin <liushixin2@huawei.com>
    Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/
    Suggested-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Charan Teja Kalla <quic_charante@quicinc.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Lance Yang <ioworker0@gmail.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:55 -04:00
Rafael Aquini 737fb311e8 memory tiering: count PGPROMOTE_SUCCESS when mem tiering is enabled.
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
  * context differences from upstream due to RHEL-9 missing commit
    7262f208ca68 ("mm/migrate: split source folio if it is on
    deferred split list") and its follow-up fix commit 6e49019db5f7
    ("mm/migrate: putback split folios when numa hint migration fails"),
    with none of them being relevant to this backport

This patch is a backport of the following upstream commit:
commit ac59a1f0146f46bad7d5f8d1b20756ece43122ec
Author: Zi Yan <ziy@nvidia.com>
Date:   Wed Jul 24 09:01:15 2024 -0400

    memory tiering: count PGPROMOTE_SUCCESS when mem tiering is enabled.

    memory tiering can be enabled/disabled at runtime and
    sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING is used to
    check it.  In migrate_misplaced_folio(), the check is missing when
    PGPROMOTE_SUCCESS is incremented.  Add the missing check.
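    [Editorial aside: a minimal sketch of the added guard, using the names
    from the description above; it is not the literal upstream diff.]

        if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
            !node_is_toptier(folio_nid(folio)) && node_is_toptier(node))
                mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS,
                                    nr_succeeded);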

    Link: https://lkml.kernel.org/r/20240724130115.793641-4-ziy@nvidia.com
    Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
    Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Closes: https://lore.kernel.org/linux-mm/f4ae2c9c-fe40-4807-bdb2-64cf2d716c1a@huawei.com/
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:55 -04:00
Rafael Aquini 761a707e88 vmscan,migrate: fix page count imbalance on node stats when demoting pages
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 35e41024c4c2b02ef8207f61b9004f6956cf037b
Author: Gregory Price <gourry@gourry.net>
Date:   Fri Oct 25 10:17:24 2024 -0400

    vmscan,migrate: fix page count imbalance on node stats when demoting pages

    When numa balancing is enabled with demotion, vmscan will call
    migrate_pages when shrinking LRUs.  migrate_pages will decrement the
    node's isolated page count, leading to an imbalanced count when
    invoked from (MG)LRU code.

    The result is dmesg output like such:

    $ cat /proc/sys/vm/stat_refresh

    [77383.088417] vmstat_refresh: nr_isolated_anon -103212
    [77383.088417] vmstat_refresh: nr_isolated_file -899642

    This negative value may impact compaction and reclaim throttling.

    The following path produces the decrement:

    shrink_folio_list
      demote_folio_list
        migrate_pages
          migrate_pages_batch
            migrate_folio_move
              migrate_folio_done
                mod_node_page_state(-ve) <- decrement

    This path happens for SUCCESSFUL migrations, not failures.  Typically
    callers to migrate_pages are required to handle putback/accounting for
    failures, but this is already handled in the shrink code.

    When accounting for migrations, instead do not decrement the count when
    the migration reason is MR_DEMOTION.  As of v6.11, this demotion logic
    is the only source of MR_DEMOTION.
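    [Editorial aside: a simplified sketch of the accounting rule described
    above, not the literal upstream diff.]

        /* Skip the isolated-page decrement for successful demotions: the
         * reclaiming caller never incremented the counter for them. */
        if (reason != MR_DEMOTION)
                mod_node_page_state(folio_pgdat(folio),
                                    NR_ISOLATED_ANON + folio_is_file_lru(folio),
                                    -folio_nr_pages(folio));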

    Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net
    Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim")
    Signed-off-by: Gregory Price <gourry@gourry.net>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:56 -05:00
Rafael Aquini 92f5738ef9 mm: migrate: annotate data-race in migrate_folio_unmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 8001070cfbec5cd4ea00b8b48ea51df91122f265
Author: Jeongjun Park <aha310510@gmail.com>
Date:   Tue Sep 24 22:00:53 2024 +0900

    mm: migrate: annotate data-race in migrate_folio_unmap()

    I found a report from syzbot [1]

    This report shows that the value can change, but in reality the mapping
    set by __folio_set_movable() cannot change, because we hold the folio
    refcount.

    Therefore, it is appropriate to add an annotation to make KCSAN
    ignore that data race.
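    [Editorial aside: a minimal sketch of the kind of annotation described,
    assuming the racy read sits in migrate_folio_unmap(); not necessarily the
    literal upstream one-liner.]

        /* The mapping cannot really change while we hold a folio reference;
         * tell KCSAN this racy read is intentional and benign. */
        is_lru = data_race(!__folio_test_movable(src));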

    [1]

    ==================================================================
    BUG: KCSAN: data-race in __filemap_remove_folio / migrate_pages_batch

    write to 0xffffea0004b81dd8 of 8 bytes by task 6348 on cpu 0:
     page_cache_delete mm/filemap.c:153 [inline]
     __filemap_remove_folio+0x1ac/0x2c0 mm/filemap.c:233
     filemap_remove_folio+0x6b/0x1f0 mm/filemap.c:265
     truncate_inode_folio+0x42/0x50 mm/truncate.c:178
     shmem_undo_range+0x25b/0xa70 mm/shmem.c:1028
     shmem_truncate_range mm/shmem.c:1144 [inline]
     shmem_evict_inode+0x14d/0x530 mm/shmem.c:1272
     evict+0x2f0/0x580 fs/inode.c:731
     iput_final fs/inode.c:1883 [inline]
     iput+0x42a/0x5b0 fs/inode.c:1909
     dentry_unlink_inode+0x24f/0x260 fs/dcache.c:412
     __dentry_kill+0x18b/0x4c0 fs/dcache.c:615
     dput+0x5c/0xd0 fs/dcache.c:857
     __fput+0x3fb/0x6d0 fs/file_table.c:439
     ____fput+0x1c/0x30 fs/file_table.c:459
     task_work_run+0x13a/0x1a0 kernel/task_work.c:228
     resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
     exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
     __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
     syscall_exit_to_user_mode+0xbe/0x130 kernel/entry/common.c:218
     do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
     entry_SYSCALL_64_after_hwframe+0x77/0x7f

    read to 0xffffea0004b81dd8 of 8 bytes by task 6342 on cpu 1:
     __folio_test_movable include/linux/page-flags.h:699 [inline]
     migrate_folio_unmap mm/migrate.c:1199 [inline]
     migrate_pages_batch+0x24c/0x1940 mm/migrate.c:1797
     migrate_pages_sync mm/migrate.c:1963 [inline]
     migrate_pages+0xff1/0x1820 mm/migrate.c:2072
     do_mbind mm/mempolicy.c:1390 [inline]
     kernel_mbind mm/mempolicy.c:1533 [inline]
     __do_sys_mbind mm/mempolicy.c:1607 [inline]
     __se_sys_mbind+0xf76/0x1160 mm/mempolicy.c:1603
     __x64_sys_mbind+0x78/0x90 mm/mempolicy.c:1603
     x64_sys_call+0x2b4d/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:238
     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
     do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f

    value changed: 0xffff888127601078 -> 0x0000000000000000

    Link: https://lkml.kernel.org/r/20240924130053.107490-1-aha310510@gmail.com
    Fixes: 7e2a5e5ab217 ("mm: migrate: use __folio_test_movable()")
    Signed-off-by: Jeongjun Park <aha310510@gmail.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:42 -05:00
Rafael Aquini 288fab6492 mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * virt/kvm/guest_memfd.c: difference in the hunk due to RHEL missing upstream
    commit 1d23040caa8b ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating
    guest_memfd inode") which would end up being reverted with this follow-up fix.

This patch is a backport of the following upstream commit:
commit 27e6a24a4cf3d25421c0f6ebb7c39f45fc14d20f
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Thu Jul 11 13:56:54 2024 -0400

    mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE

    The flags AS_UNMOVABLE and AS_INACCESSIBLE were both added just for guest_memfd;
    AS_UNMOVABLE is already in existing versions of Linux, while AS_INACCESSIBLE was
    acked for inclusion in 6.11.

    But really, they are the same thing: only guest_memfd uses them, at least for
    now, and guest_memfd pages are unmovable because they should not be
    accessed by the CPU.

    So merge them into one; use the AS_INACCESSIBLE name which is more comprehensive.
    At the same time, this fixes an embarrassing bug where AS_INACCESSIBLE was used
    as a bit mask, despite it being just a bit index.

    The bug was mostly benign, because AS_INACCESSIBLE's bit representation (1010)
    corresponded to setting AS_UNEVICTABLE (which is already set) and AS_ENOSPC
    (except no async writes can happen on the guest_memfd).  So the AS_INACCESSIBLE
    flag simply had no effect.
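    [Editorial illustration, not kernel code: why using the bit *index* as if
    it were a bit *mask* was "mostly benign".  With the index value 10
    (binary 1010) mentioned above, OR-ing the index itself sets bits 1 and 3
    instead of bit 10.]

        #include <stdio.h>

        #define AS_INACCESSIBLE_IDX 10   /* value per the description above */

        int main(void)
        {
            unsigned long wrong = 0 | AS_INACCESSIBLE_IDX;          /* index used as mask */
            unsigned long right = 0 | (1UL << AS_INACCESSIBLE_IDX);

            printf("index as mask: 0x%lx (bits 1 and 3)\n", wrong); /* 0xa   */
            printf("proper mask:   0x%lx (bit 10)\n", right);       /* 0x400 */
            return 0;
        }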

    Fixes: 1d23040caa8b ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode")
    Fixes: c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory")
    Cc: linux-mm@kvack.org
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Tested-by: Michael Roth <michael.roth@amd.com>
    Reviewed-by: Michael Roth <michael.roth@amd.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:25 -05:00
Rafael Aquini 4c7a04bd61 mm: migrate: fix getting incorrect page mapping during page migration
JIRA: https://issues.redhat.com/browse/RHEL-27745
JIRA: https://issues.redhat.com/browse/RHEL-28873
CVE: CVE-2023-52490

This patch is a backport of the following upstream commit:
commit d1adb25df7111de83b64655a80b5a135adbded61
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Dec 15 20:07:52 2023 +0800

    mm: migrate: fix getting incorrect page mapping during page migration

    When running stress-ng testing, we found the following kernel crash after a few hours:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
    pc : dentry_name+0xd8/0x224
    lr : pointer+0x22c/0x370
    sp : ffff800025f134c0
    ......
    Call trace:
      dentry_name+0xd8/0x224
      pointer+0x22c/0x370
      vsnprintf+0x1ec/0x730
      vscnprintf+0x2c/0x60
      vprintk_store+0x70/0x234
      vprintk_emit+0xe0/0x24c
      vprintk_default+0x3c/0x44
      vprintk_func+0x84/0x2d0
      printk+0x64/0x88
      __dump_page+0x52c/0x530
      dump_page+0x14/0x20
      set_migratetype_isolate+0x110/0x224
      start_isolate_page_range+0xc4/0x20c
      offline_pages+0x124/0x474
      memory_block_offline+0x44/0xf4
      memory_subsys_offline+0x3c/0x70
      device_offline+0xf0/0x120
      ......

    After analyzing the vmcore, I found this issue is caused by page migration.
    The scenario is that one thread is doing page migration, and we use the
    target page's ->mapping field to save the 'anon_vma' pointer between page
    unmap and page move; at this point the target page is locked and its
    refcount is 1.

    Meanwhile, another stress-ng thread is performing memory hotplug,
    attempting to offline the target page that is being migrated.  It discovers
    that the refcount of this target page is 1, which prevents the offline
    operation, so it proceeds to dump the page.  However, page_mapping() of the
    target page may return an incorrect file mapping and crash the system in
    dump_mapping(), since the target page->mapping only saves the 'anon_vma'
    pointer without setting the PAGE_MAPPING_ANON flag.

    There are several ways to fix this issue:
    (1) Setting the PAGE_MAPPING_ANON flag for the target page's ->mapping when
    saving 'anon_vma', but this can confuse PageAnon() for PFN walkers, since
    the target page has not built its mappings yet.
    (2) Taking the page lock before calling page_mapping() in __dump_page() to
    avoid crashing the system; however, there are still some PFN walkers that
    call page_mapping() without holding the page lock, such as compaction.
    (3) Using the target page's ->private field to save the 'anon_vma' pointer
    and 2 bits of page state, just as page->mapping records an anonymous page,
    which removes the page_mapping() impact for PFN walkers and also seems a
    simple way.

    So I choose option 3 to fix this issue, and this can also fix other potential
    issues for PFN walkers, such as compaction.
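    [Editorial illustration of option (3): packing a pointer together with two
    low "state" bits into one word, relying on pointer alignment.  Names and
    values are illustrative, not the kernel's.]

        #include <assert.h>
        #include <stdint.h>
        #include <stdio.h>

        #define MIGRATE_STATE_MASK 0x3UL     /* two low bits reserved for state */

        static uintptr_t pack(void *anon_vma, unsigned long state)
        {
            assert(((uintptr_t)anon_vma & MIGRATE_STATE_MASK) == 0);
            return (uintptr_t)anon_vma | (state & MIGRATE_STATE_MASK);
        }

        int main(void)
        {
            static long dummy_anon_vma;          /* aligned stand-in object */
            uintptr_t priv = pack(&dummy_anon_vma, 0x2);

            void *ptr = (void *)(priv & ~MIGRATE_STATE_MASK);
            unsigned long state = priv & MIGRATE_STATE_MASK;

            printf("pointer recovered: %d, state bits: %lu\n",
                   ptr == (void *)&dummy_anon_vma, state);
            return 0;
        }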

    Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.1702641709.git.baolin.wang@linux.alibaba.com
    Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Xu Yu <xuyu@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:09 -05:00
Rafael Aquini 889ae878e7 mm: migrate: record the mlocked page status to remove unnecessary lru drain
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit eebb3dabbb5cc590afe32880b5d3726d0fbf88db
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Sat Oct 21 12:33:22 2023 +0800

    mm: migrate: record the mlocked page status to remove unnecessary lru drain

    When doing compaction, I found the lru_add_drain() is an obvious hotspot
    when migrating pages. The distribution of this hotspot is as follows:
       - 18.75% compact_zone
          - 17.39% migrate_pages
             - 13.79% migrate_pages_batch
                - 11.66% migrate_folio_move
                   - 7.02% lru_add_drain
                      + 7.02% lru_add_drain_cpu
                   + 3.00% move_to_new_folio
                     1.23% rmap_walk
                + 1.92% migrate_folio_unmap
             + 3.20% migrate_pages_sync
          + 0.90% isolate_migratepages

    The lru_add_drain() was added by commit c3096e6782b7 ("mm/migrate:
    __unmap_and_move() push good newpage to LRU") to drain the newpage to LRU
    immediately, to help build up the correct newpage->mlock_count in
    remove_migration_ptes() for mlocked pages.  However, if no mlocked pages
    are migrating, then we can avoid this lru drain operation, especially in
    heavily concurrent scenarios.

    So we can record the source pages' mlocked status in
    migrate_folio_unmap(), and only drain the lru list when the mlocked status
    is set in migrate_folio_move().

    In addition, the page was already isolated from lru when migrating, so
    checking the mlocked status is stable by folio_test_mlocked() in
    migrate_folio_unmap().
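    [Editorial aside: a minimal sketch of the idea; the variable name is
    illustrative and this is not the literal upstream diff.]

        /* In migrate_folio_unmap(), while the source folio is isolated: */
        bool page_was_mlocked = folio_test_mlocked(src);

        /* ... later, in migrate_folio_move(), after remove_migration_ptes(): */
        if (page_was_mlocked)
                lru_add_drain();        /* only pay the drain cost when needed */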

    After this patch, I can see the hotspot of the lru_add_drain() is gone:
       - 9.41% migrate_pages_batch
          - 6.15% migrate_folio_move
             - 3.64% move_to_new_folio
                + 1.80% migrate_folio_extra
                + 1.70% buffer_migrate_folio
             + 1.41% rmap_walk
             + 0.62% folio_add_lru
          + 3.07% migrate_folio_unmap

    Meanwhile, the compaction latency shows some improvements when running
    thpscale:
                                base                   patched
    Amean     fault-both-1      1131.22 (   0.00%)     1112.55 *   1.65%*
    Amean     fault-both-3      2489.75 (   0.00%)     2324.15 *   6.65%*
    Amean     fault-both-5      3257.37 (   0.00%)     3183.18 *   2.28%*
    Amean     fault-both-7      4257.99 (   0.00%)     4079.04 *   4.20%*
    Amean     fault-both-12     6614.02 (   0.00%)     6075.60 *   8.14%*
    Amean     fault-both-18    10607.78 (   0.00%)     8978.86 *  15.36%*
    Amean     fault-both-24    14911.65 (   0.00%)    11619.55 *  22.08%*
    Amean     fault-both-30    14954.67 (   0.00%)    14925.66 *   0.19%*
    Amean     fault-both-32    16654.87 (   0.00%)    15580.31 *   6.45%*

    Link: https://lkml.kernel.org/r/06e9153a7a4850352ec36602df3a3a844de45698.1697859741.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yin Fengwei <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:07 -05:00
Rafael Aquini a3cdfbdc13 mm/migrate: add nr_split to trace_mm_migrate_pages stats.
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 49cac03a8f0a56cafa5329911564c97c130ced43
Author: Zi Yan <ziy@nvidia.com>
Date:   Tue Oct 17 12:31:29 2023 -0400

    mm/migrate: add nr_split to trace_mm_migrate_pages stats.

    Add nr_split to trace_mm_migrate_pages for large folio (including THP)
    split events.

    [akpm@linux-foundation.org: cleanup per Huang, Ying]
    Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:02 -05:00
Rafael Aquini ef82979096 mm/migrate: correct nr_failed in migrate_pages_sync()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit a259945efe6ada94087ef666e9b38f8e34ea34ba
Author: Zi Yan <ziy@nvidia.com>
Date:   Tue Oct 17 12:31:28 2023 -0400

    mm/migrate: correct nr_failed in migrate_pages_sync()

    nr_failed was missing the large folio splits from migrate_pages_batch()
    and can cause a mismatch between the migrate_pages() return value and the
    number of not-migrated pages, i.e., when the return value of
    migrate_pages() is 0, there are still pages left in the from page list.
    It will happen when a non-PMD THP large folio fails to migrate due to
    -ENOMEM and is split successfully, but not all of the split pages are
    migrated; migrate_pages_batch() would return non-zero, but
    astats.nr_thp_split = 0.  nr_failed would be 0 and returned to the caller
    of migrate_pages(), but the not-migrated pages are left in the from page
    list without being added back to the LRU lists.

    Fix it by adding a new nr_split counter for large folio splits and adding
    it to nr_failed in migrate_pages_sync() after migrate_pages_batch() is
    done.
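    [Editorial aside: a simplified sketch of the accounting fix, not the
    literal upstream diff.]

        /* In migrate_pages_sync(), after migrate_pages_batch() returns,
         * folios that were split also count toward the reported failures: */
        nr_failed += astats.nr_split;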

    Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com
    Fixes: 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Huang Ying <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:01 -05:00
Rafael Aquini 32b576f804 mm/filemap: remove hugetlb special casing in filemap.c
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Tue Sep 26 12:20:17 2023 -0700

    mm/filemap: remove hugetlb special casing in filemap.c

    Remove special cased hugetlb handling code within the page cache by
    changing the granularity of ->index to the base page size rather than the
    huge page size.  The motivation of this patch is to reduce complexity
    within the filemap code while also increasing performance by removing
    branches that are evaluated on every page cache lookup.

    To support the change in index, new wrappers for hugetlb page cache
    interactions are added.  These wrappers perform the conversion to a linear
    index which is now expected by the page cache for huge pages.
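    [Editorial sketch of the index conversion the new wrappers perform; the
    helper name is illustrative, not the kernel's.]

        /* ->index is now in base-page units, so a hugetlb offset is scaled
         * by the huge page order before the page cache lookup. */
        static pgoff_t hugetlb_linear_index(struct hstate *h, pgoff_t huge_index)
        {
                return huge_index << huge_page_order(h); /* 2MB pages: << 9 with 4K base pages */
        }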

    ========================= PERFORMANCE ======================================

    Perf was used to check the performance differences after the patch.
    Overall the performance is similar to mainline, with a very small
    additional overhead that occurs in __filemap_add_folio() and
    hugetlb_add_to_page_cache().  This is because of the larger overhead that
    occurs in xa_load() and xa_store(), as the xarray now uses more entries
    to store hugetlb folios in the page cache.

    Timing

    aarch64
        2MB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                real    1m49.568s
                user    0m0.000s
                sys     1m49.461s

            6.5-rc3:
                [root]# time fallocate -l 700GB test.txt
                real    1m47.495s
                user    0m0.000s
                sys     1m47.370s
        1GB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                real    1m47.024s
                user    0m0.000s
                sys     1m46.921s

            6.5-rc3:
                [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                real    1m44.551s
                user    0m0.000s
                sys     1m44.438s

    x86
        2MB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                real    0m22.383s
                user    0m0.000s
                sys     0m22.255s

            6.5-rc3:
                [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                real    0m22.735s
                user    0m0.038s
                sys     0m22.567s

        1GB Page Size
            6.5-rc3 + this patch:
                [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                real    0m25.786s
                user    0m0.001s
                sys     0m25.589s

            6.5-rc3:
                [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                real    0m33.454s
                user    0m0.001s
                sys     0m33.193s

    aarch64:
        workload - fallocate a 700GB file backed by huge pages

        6.5-rc3 + this patch:
            2MB Page Size:
                --100.00%--__arm64_sys_fallocate
                              ksys_fallocate
                              vfs_fallocate
                              hugetlbfs_fallocate
                              |
                              |--95.04%--__pi_clear_page
                              |
                              |--3.57%--clear_huge_page
                              |          |
                              |          |--2.63%--rcu_all_qs
                              |          |
                              |           --0.91%--__cond_resched
                              |
                               --0.67%--__cond_resched
                0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio

        6.5-rc3
            2MB Page Size:
                    --100.00%--__arm64_sys_fallocate
                              ksys_fallocate
                              vfs_fallocate
                              hugetlbfs_fallocate
                              |
                              |--94.91%--__pi_clear_page
                              |
                              |--4.11%--clear_huge_page
                              |          |
                              |          |--3.00%--rcu_all_qs
                              |          |
                              |           --1.10%--__cond_resched
                              |
                               --0.59%--__cond_resched
                0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio

    x86
        workload - fallocate a 100GB file backed by huge pages

        6.5-rc3 + this patch:
            2MB Page Size:
                hugetlbfs_fallocate
                |
                --99.57%--clear_huge_page
                    |
                    --98.47%--clear_page_erms
                        |
                        --0.53%--asm_sysvec_apic_timer_interrupt

                0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store

        6.5-rc3
            2MB Page Size:
                    --99.93%--__x64_sys_fallocate
                              vfs_fallocate
                              hugetlbfs_fallocate
                              |
                               --99.38%--clear_huge_page
                                         |
                                         |--98.40%--clear_page_erms
                                         |
                                          --0.59%--__cond_resched
                0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio

    ========================= TESTING ======================================

    This patch passes libhugetlbfs tests and LTP hugetlb tests

    ********** TEST SUMMARY
    *                      2M
    *                      32-bit 64-bit
    *     Total testcases:   110    113
    *             Skipped:     0      0
    *                PASS:   107    113
    *                FAIL:     0      0
    *    Killed by signal:     3      0
    *   Bad configuration:     0      0
    *       Expected FAIL:     0      0
    *     Unexpected PASS:     0      0
    *    Test not present:     0      0
    * Strange test result:     0      0
    **********

        Done executing testcases.
        LTP Version:  20220527-178-g2761a81c4

    page migration was also tested using Mike Kravetz's test program.[8]

    [dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
      Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
    Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
    Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
    Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:37 -05:00
Rafael Aquini 1ddcdf9da9 mm: migrate: remove isolated variable in add_page_for_migration()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit fa1df3f6287e1e1fd8b5309828238e2c728e985f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:31 2023 +0800

    mm: migrate: remove isolated variable in add_page_for_migration()

    Directly check the return values of isolate_hugetlb() and
    folio_isolate_lru() to remove the isolated variable; also set
    err = -EBUSY in advance of isolation, and update err only when the page is
    successfully queued for migration, which helps unify and simplify the code
    a bit.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-9-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:28 -05:00
Rafael Aquini f160331780 mm: migrate: remove PageHead() check for HugeTLB in add_page_for_migration()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit b426ed7889be80359cb4edef142e5c5fa697b068
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:30 2023 +0800

    mm: migrate: remove PageHead() check for HugeTLB in add_page_for_migration()

    There is a difference in behavior between hugeTLB and THP when passed the
    address of a tail page: for THP, the entire THP page will be migrated, but
    for HugeTLB, -EACCES will be returned (or -ENOENT before commit e66f17ff71
    ("mm/hugetlb: take page table lock in follow_huge_pmd()")):

      -EACCES The page is mapped by multiple processes and can be moved
              only if MPOL_MF_MOVE_ALL is specified.
      -ENOENT The page is not present.

    But checking the manual [1], neither of the two errnos is suitable; it is
    better to keep the same behavior between hugetlb and THP when passed the
    address of a tail page, so let's just remove the PageHead() check for
    HugeTLB.

    [1] https://man7.org/linux/man-pages/man2/move_pages.2.html

    Link: https://lkml.kernel.org/r/20230913095131.2426871-8-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Zi Yan <ziy@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:27 -05:00
Rafael Aquini 335a9babfb mm: migrate: use a folio in add_page_for_migration()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit d64cfccbc805663a2c5691f638cf9198b9676a9f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:29 2023 +0800

    mm: migrate: use a folio in add_page_for_migration()

    Use a folio in add_page_for_migration() to save compound_head() calls.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-7-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:26 -05:00
Rafael Aquini 7f689eb1e5 mm: migrate: use __folio_test_movable()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 7e2a5e5ab217d5e4166cdbdf4af8c5e34b6200bb
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:28 2023 +0800

    mm: migrate: use __folio_test_movable()

    Use __folio_test_movable(), no need to convert from folio to page again.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-6-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:26 -05:00
Rafael Aquini 27ca54790a mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 73eab3ca481e5be0f1fd8140365d604482f84ee1
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:27 2023 +0800

    mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()

    At present, numa balancing only supports base pages and PMD-mapped THP,
    but we will expand it to migrate large folios / pte-mapped THP in the
    future, so it is better to make migrate_misplaced_page() take a folio
    instead of a page and rename it to migrate_misplaced_folio().  This is a
    preparation, and it also removes several compound_head() calls.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:25 -05:00
Rafael Aquini 829524ec07 mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * Minor context conflict on the 2nd hunk due to out-of-order backport
    of commit 774f256e7c0 ("mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index")

This patch is a backport of the following upstream commit:
commit 2ac9e99f3b21b2864305fbfba4bae5913274c409
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:26 2023 +0800

    mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio()

    Rename numamigrate_isolate_page() to numamigrate_isolate_folio(), then
    make it take a folio and use the folio API to save compound_head() calls.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:24 -05:00
Rafael Aquini 0385689f3a mm: migrate: remove THP mapcount check in numamigrate_isolate_page()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 728be28fae8c838d52c91dce4867133798146357
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:25 2023 +0800

    mm: migrate: remove THP mapcount check in numamigrate_isolate_page()

    The check for THP mapped by multiple processes was introduced by commit
    04fa5d6a65 ("mm: migrate: check page_count of THP before migrating") and
    refactored by commit 340ef3902c ("mm: numa: cleanup flow of transhuge page
    migration"), but it is out of date: migrate_misplaced_page() is now using
    the standard migrate_pages() for small pages and THPs, and the reference
    count check is in folio_migrate_mapping(), so let's remove the special
    check for THP.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:23 -05:00
Rafael Aquini 4a800d052d mm: migrate: remove PageTransHuge check in numamigrate_isolate_page()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit a8ac4a767dcd9d87d8229045904d9fe15ea5e0e8
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:24 2023 +0800

    mm: migrate: remove PageTransHuge check in numamigrate_isolate_page()

    Patch series "mm: migrate: more folio conversion and unification", v3.

    Convert more migrate functions to use a folio, it is also a preparation
    for large folio migration support when balancing numa.

    This patch (of 8):

    The assert VM_BUG_ON_PAGE(order && !PageTransHuge(page), page) is not very
    useful,

       1) for a tail/base page, order = 0, for a head page, the order > 0 &&
          PageTransHuge() is true
       2) there is a PageCompound() check and only base page is handled in
          do_numa_page(), and do_huge_pmd_numa_page() only handle PMD-mapped
          THP
       3) even if the page is a tail page, isolate_lru_page() will post
          a warning, and fail to isolate the page
       4) if large folio/pte-mapped THP migration supported in the future,
          we could migrate the entire folio if numa fault on a tail page

    so just remove the check.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20230913095131.2426871-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:22 -05:00
Rafael Aquini 2a9317ff87 mm/rmap: pass folio to hugepage_add_anon_rmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 09c550508a4b8f7844b197cc16877dd0f7c42d8f
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Sep 13 14:51:13 2023 +0200

    mm/rmap: pass folio to hugepage_add_anon_rmap()

    Let's pass a folio; we are always mapping the entire thing.

    Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:22 -05:00
Rado Vrbovsky 570a71d7db Merge: mm: update core code to v6.6 upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252

JIRA: https://issues.redhat.com/browse/RHEL-27743  
JIRA: https://issues.redhat.com/browse/RHEL-59459    
CVE: CVE-2024-46787    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961  
  
This MR brings the RHEL9 core MM code up to upstream's v6.6 LTS level.
This work follows up on the previous v6.5 update (RHEL-27742) and, as such,
the bulk of this changeset comprises refactoring and clean-ups of
the internal implementation of several APIs as it further advances the
conversion to folios and follows up on the per-VMA locking changes.

Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,
and we add a potential extra level of protection (assessment pending) to help
mitigate the kernel heap exploits dubbed "SlubStick".
    
Follow-up fixes are omitted from this series either because they are irrelevant to     
the bits we support on RHEL or because they depend on bigger changesets introduced     
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.    

Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")    
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")   
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")    
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")    
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")    
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")    
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")    
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")    
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")    
    
Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-30 07:22:28 +00:00
Rafael Aquini e66e65400a mm: hugetlb: add huge page size param to set_huge_pte_at()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * arch/parisc/include/asm/hugetlb.h: hunks dropped (unsupported arch)
  * arch/parisc/mm/hugetlbpage.c:  hunks dropped (unsupported arch)
  * arch/riscv/include/asm/hugetlb.h: hunks dropped (unsupported arch)
  * arch/riscv/mm/hugetlbpage.c: hunks dropped (unsupported arch)
  * arch/sparc/mm/hugetlbpage.c: hunks dropped (unsupported arch)
  * mm/rmap.c: minor context conflict on the 7th hunk due to backport of
      upstream commit 322842ea3c72 ("mm/rmap: fix missing swap_free() in
      try_to_unmap() after arch_unmap_one() failed")

This patch is a backport of the following upstream commit:
commit 935d4f0c6dc8b3533e6e39346de7389a84490178
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Fri Sep 22 12:58:03 2023 +0100

    mm: hugetlb: add huge page size param to set_huge_pte_at()

    Patch series "Fix set_huge_pte_at() panic on arm64", v2.

    This series fixes a bug in arm64's implementation of set_huge_pte_at(),
    which can result in an unprivileged user causing a kernel panic.  The
    problem was triggered when running the new uffd poison mm selftest for
    HUGETLB memory.  This test (and the uffd poison feature) was merged for
    v6.5-rc7.

    Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
    (correctly this time) to get it backported to v6.5, where the issue first
    showed up.

    Description of Bug
    ==================

    arm64's huge pte implementation supports multiple huge page sizes, some of
    which are implemented in the page table with multiple contiguous entries.
    So set_huge_pte_at() needs to work out how big the logical pte is, so that
    it can also work out how many physical ptes (or pmds) need to be written.
    It previously did this by grabbing the folio out of the pte and querying
    its size.

    However, there are cases when the pte being set is actually a swap entry.
    But this also used to work fine, because for huge ptes, we only ever saw
    migration entries and hwpoison entries.  And both of these types of swap
    entries have a PFN embedded, so the code would grab that and everything
    still worked out.

    But over time, more calls to set_huge_pte_at() have been added that set
    swap entry types that do not embed a PFN.  And this causes the code to go
    bang.  The triggering case is for the uffd poison test, commit
    99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
    causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
    8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
    added in v6.5-rc7.  Although review shows that there are other call sites
    that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
    on arm64 because arm64 doesn't support UFFD WP.

    If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
    it will dereference a bad pointer in page_folio():

        static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
        {
            VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));

            return page_folio(pfn_to_page(swp_offset_pfn(entry)));
        }

    Fix
    ===

    The simplest fix would have been to revert the dodgy cleanup commit
    18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
    things have moved on, this would have required an audit of all the new
    set_huge_pte_at() call sites to see if they should be converted to
    set_huge_swap_pte_at().  As per the original intent of the change, it
    would also leave us open to future bugs when people invariably get it
    wrong and call the wrong helper.

    So instead, I've added a huge page size parameter to set_huge_pte_at().
    This means that the arm64 code has the size in all cases.  It's a bigger
    change, due to needing to touch the arches that implement the function,
    but it is entirely mechanical, so in my view, low risk.

    I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
    s390, sparc (and additionally x86_64).  I've additionally booted and run
    mm selftests against arm64, where I observe the uffd poison test is fixed,
    and there are no other regressions.

    This patch (of 2):

    In order to fix a bug, arm64 needs to be told the size of the huge page
    for which the pte is being set in set_huge_pte_at().  Provide for this by
    adding an `unsigned long sz` parameter to the function.  This follows the
    same pattern as huge_pte_clear().

    This commit makes the required interface modifications to the core mm as
    well as all arches that implement this function (arm64, parisc, powerpc,
    riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
    commit.

    No behavioral changes intended.
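
    As a rough sketch, the resulting generic interface mirrors huge_pte_clear()
    (prototypes shown for illustration only; arch implementations take the same
    extra argument):

        void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                             pte_t *ptep, pte_t pte, unsigned long sz);

        /* the existing pattern being followed: */
        void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
                            pte_t *ptep, unsigned long sz);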

    Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
    Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
    Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>     [powerpc 8xx]
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>       [vmalloc change]
    Cc: Alexandre Ghiti <alex@ghiti.fr>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <stable@vger.kernel.org>    [6.5+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:19 -04:00
Rafael Aquini 2751ffd905 migrate: use folio_set_bh() instead of set_bh_page()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit d5db4f9df9397d398256a2e33ad63c39c213b990
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Jul 13 04:55:09 2023 +0100

    migrate: use folio_set_bh() instead of set_bh_page()

    This function was converted before folio_set_bh() existed.  Catch up to
    the new API.
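
    A minimal sketch of the call-site change this implies in the migration code
    (illustrative only; the surrounding loop over the buffer heads is omitted):

        /* before: operate on the folio's first page */
        set_bh_page(bh, &dst->page, bh_offset(bh));

        /* after: pass the folio directly */
        folio_set_bh(bh, dst, bh_offset(bh));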

    Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: David Sterba <dsterba@suse.com>
    Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Pankaj Raghav <p.raghav@samsung.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Tom Rix <trix@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:48 -04:00
Waiman Long 18cf897167 hugetlb: memcg: account hugetlb-backed memory in memory controller
JIRA: https://issues.redhat.com/browse/RHEL-56023
Conflicts: A context diff in alloc_hugetlb_folio() hunk of mm/hugetlb.c
	   due to the presence of a later upstream commit b76b46902c2d
	   ("mm/hugetlb: fix missing hugetlb_lock for resv uncharge").

commit 8cba9576df601c384abd334a503c3f6e1e29eefb
Author: Nhat Pham <nphamcs@gmail.com>
Date:   Fri, 6 Oct 2023 11:46:28 -0700

    hugetlb: memcg: account hugetlb-backed memory in memory controller

    Currently, hugetlb memory usage is not accounted for in the memory
    controller, which could lead to memory overprotection for cgroups with
    hugetlb-backed memory.  This has been observed in our production system.

    For instance, here is one of our usecases: suppose there are two 32G
    containers.  The machine is booted with hugetlb_cma=6G, and each container
    may or may not use up to 3 gigantic pages, depending on the workload within
    it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
    limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
    difficult to configure memory.max to keep overall consumption, including
    anon, cache, slab, etc., fair.

    What we have had to resort to is to constantly poll hugetlb usage and
    readjust memory.max.  A similar procedure is applied to other memory limits
    (e.g., memory.low).  However, this is rather cumbersome and buggy.
    Furthermore, when there is a delay in memory limit correction (e.g.,
    when hugetlb usage changes within consecutive runs of the userspace
    agent), the system could be in an over/underprotected state.

    This patch rectifies this issue by charging the memcg when the hugetlb
    folio is utilized, and uncharging when the folio is freed (analogous to
    the hugetlb controller).  Note that we do not charge when the folio is
    allocated to the hugetlb pool, because at this point it is not owned by
    any memcg.

    Some caveats to consider:
      * This feature is only available on cgroup v2.
      * There is no hugetlb pool management involved in the memory
        controller. As stated above, hugetlb folios are only charged towards
        the memory controller when they are used. Host overcommit management
        has to consider it when configuring hard limits.
      * Failure to charge towards the memcg results in SIGBUS. This could
        happen even if the hugetlb pool still has pages (but the cgroup
        limit is hit and reclaim attempt fails).
      * When this feature is enabled, hugetlb pages contribute to memory
        reclaim protection. Tuning of the low and min limits must take
        hugetlb memory into account.
      * Hugetlb pages utilized while this option is not selected will not
        be tracked by the memory controller (even if cgroup v2 is remounted
        later on).

    Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
    Signed-off-by: Nhat Pham <nphamcs@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Frank van der Linden <fvdl@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-09-30 09:46:59 -04:00
Rafael Aquini a726366716 mm: remove unnecessary pagevec includes
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 994ec4e29b3de188d11fe60d17403285fcc8917a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jun 21 17:45:57 2023 +0100

    mm: remove unnecessary pagevec includes

    These files no longer need pagevec.h, mostly due to function declarations
    being moved out of it.

    Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:33 -04:00
Rafael Aquini a8c6b788e8 mm: fix shmem THP counters on migration
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 0b52c420350e8f9873ba62768cd8246827184408
Author: Jan Glauber <jglauber@digitalocean.com>
Date:   Mon Jun 19 12:33:51 2023 +0200

    mm: fix shmem THP counters on migration

    The per node numa_stat values for shmem don't change on page migration for
    THP:

      grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:

        shmem N0=1092616192 N1=10485760
        shmem_thp N0=1092616192 N1=10485760

      migratepages 9181 0 1:

        shmem N0=0 N1=1103101952
        shmem_thp N0=1092616192 N1=10485760

    Fix that by updating shmem_thp counters likewise to shmem counters on page
    migration.
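
    A sketch of the accounting update described above, in the spirit of the fix
    to folio_migrate_mapping() (simplified, not the verbatim hunk):

        if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
                __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
                __mod_lruvec_state(new_lruvec, NR_SHMEM, nr);

                /* keep the THP counter in sync with the base shmem counter */
                if (folio_test_pmd_mappable(folio)) {
                        __mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
                        __mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
                }
        }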

    [jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge]
      Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com
    Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com
    Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:22 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.
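
    For reference, the generic helper is essentially a READ_ONCE() of the pte,
    and the conversion pattern caches its result instead of dereferencing the
    pointer repeatedly (sketch; arch code may override the helper):

        static inline pte_t ptep_get(pte_t *ptep)
        {
                return READ_ONCE(*ptep);
        }

        pte_t pte = ptep_get(ptep);     /* was: pte_t pte = *ptep; */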

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
Rafael Aquini 553573f4b1 mm: convert migrate_pages() to work on folios
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * dropped hunk for Documentation/translations/zh_CN/mm/page_migration.rst.
    This doc file was introduced upstream via pre-v6.0 (v6.0-rc1) merge
    commit 6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") which was never
    picked by previous backport attempts.

This patch is a backport of the following upstream commit:
commit 4e096ae1801e24b338e02715c65c3ffa8883ba5d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat May 13 01:11:01 2023 +0100

    mm: convert migrate_pages() to work on folios

    Almost all of the callers & implementors of migrate_pages() were already
    converted to use folios.  compaction_alloc() & compaction_free() are
    trivial to convert a part of this patch and not worth splitting out.

    Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:25 -04:00
Rafael Aquini 2f06e66606 migrate_pages_batch: simplify retrying and failure counting of large folios
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 124abced647306aa3badb5d472c3616de23f180a
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed May 10 11:18:29 2023 +0800

    migrate_pages_batch: simplify retrying and failure counting of large folios

    After recent changes to the retrying and failure counting in
    migrate_pages_batch(), it was found that it's unnecessary to count
    retrying and failure for normal, large, and THP folios separately.
    This is because we don't use the retry and failure counts of large folios
    directly.  So, in this patch, retry and failure counting of large folios is
    simplified by counting retries and failures of normal and large folios
    together.  This reduces the line count.

    Previously, in migrate_pages_batch we need to track whether the source
    folio is large/THP before splitting.  So is_large is used to cache
    folio_test_large() result.  Now, we don't need that variable any more
    because we don't count retrying and failure of large folios (only counting
    that of THP folios).  So, in this patch, is_large is removed to simplify
    the code.

    This is just code cleanup, no functionality changes are expected.

    Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:22 -04:00
Rafael Aquini 66ba90cfdc migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 4bb6dc79d987b243d65c70c5029e51e719cfb94b
Author: Douglas Anderson <dianders@chromium.org>
Date:   Fri Apr 28 13:54:38 2023 -0700

    migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT

    The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
    finish quickly but not for things that will take a long time.  Exactly how
    long is too long is not well defined, but waits of tens of milliseconds are
    likely non-ideal.

    When putting a Chromebook under memory pressure (opening over 90 tabs on a
    4GB machine) it was fairly easy to see delays waiting for some locks in
    the kcompactd code path of > 100 ms.  While the laptop wasn't amazingly
    usable in this state, it was still limping along and this state isn't
    something artificial.  Sometimes we simply end up with a lot of memory
    pressure.

    Putting the same Chromebook under memory pressure while it was running
    Android apps (though not stressing them) showed a much worse result (NOTE:
    this was on an older kernel but the codepaths here are similar).  Android
    apps on ChromeOS currently run from a 128K-block, zlib-compressed,
    loopback-mounted squashfs disk.  If we get a page fault from something
    backed by the squashfs filesystem we could end up holding a folio lock
    while reading enough from disk to decompress 128K (and then decompressing
    it using the somewhat slow zlib algorithms).  That reading goes through
    the ext4 subsystem (because it's a loopback mount) before eventually
    ending up in the block subsystem.  This extra jaunt adds extra overhead.
    Without much work I could see cases where we ended up blocked on a folio
    lock for over a second.  With more extreme memory pressure I could see up
    to 25 seconds.

    We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
    two locks that were seen to be slow [1] and that generated much
    discussion.  After discussion, it was decided that we should avoid waiting
    for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
    IO.  We'll continue with the unbounded wait for the more full SYNC modes.

    With this change, I couldn't see any slow waits on these locks with my
    previous testcases.
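
    A hedged sketch of the resulting folio-lock policy in migrate_folio_unmap()
    (simplified; the buffer-head lock gets analogous trylock treatment):

        if (!folio_trylock(src)) {
                if (mode == MIGRATE_ASYNC)
                        goto out;
                /*
                 * In "light" mode we can wait for transient lock holders, but
                 * not for I/O: a folio that is not uptodate has its lock held
                 * while it is being read in from storage.
                 */
                if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
                        goto out;
                folio_lock(src);
        }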

    NOTE: The reason I started digging into this originally isn't because some
    benchmark had gone awry, but because we've received in-the-field crash
    reports where we have a hung task waiting on the page lock (which is the
    equivalent code path on old kernels).  While the root cause of those
    crashes is likely unrelated and won't be fixed by this patch, analyzing
    those crash reports did point out these very long waits seemed like
    something good to fix.  With this patch we should no longer hang waiting
    on these locks, but presumably the system will still be in a bad shape and
    hang somewhere else.

    [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid

    Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
    Signed-off-by: Douglas Anderson <dianders@chromium.org>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:18 -04:00
Rafael Aquini dd80b061ca mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit 3c811f7883c4ee5a34ba4354381bde062888dd31
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Apr 11 16:25:10 2023 +0200

    mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"

    This reverts commit 96a9c287e25d ("mm/migrate: fix wrongly apply write bit
    after mkdirty on sparc64").

    Now that sparc64 mkdirty handling is fixed and no longer sets a PTE/PMD
    writable that shouldn't be writable, let's revert the temporary fix.

    The mkdirty mm selftest still passes with this change on sparc64.

    Note that loongarch handling was fixed in commit bf2f34a506e6 ("LoongArch:
    Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()").

    Link: https://lkml.kernel.org/r/20230411142512.438404-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:11 -04:00
Lucas Zampieri 9b8174fe29 Merge: mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4026

JIRA: https://issues.redhat.com/browse/RHEL-31840  
CVE: CVE-2024-26783  
  
Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Steve Best <sbest@redhat.com>
Approved-by: Nico Pache <npache@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:33:33 +00:00
Nico Pache 562be233ef mm/migrate: fix do_pages_move for compat pointers
commit 229e2253766c7cdfe024f1fe280020cc4711087c
Author: Gregory Price <gourry.memverge@gmail.com>
Date:   Tue Oct 3 10:48:56 2023 -0400

    mm/migrate: fix do_pages_move for compat pointers

    do_pages_move does not handle compat pointers for the page list
    correctly.  Add an in_compat_syscall check and appropriate get_user fetch
    when iterating the page list.

    It makes the syscall in compat mode (32-bit userspace, 64-bit kernel)
    work the same way as the native 32-bit syscall again, restoring the
    behavior before my broken commit 5b1b561ba73c ("mm: simplify
    compat_sys_move_pages").

    More specifically, my patch moved the parsing of the 'pages' array from
    the main entry point into do_pages_stat(), which left the syscall
    working correctly for the 'stat' operation (nodes = NULL), while the
    'move' operation (nodes != NULL) is now missing the conversion and
    interprets 'pages' as an array of 64-bit pointers instead of the
    intended 32-bit userspace pointers.

    It is possible that nobody noticed this bug because the few
    applications that actually call move_pages are unlikely to run in
    compat mode because of their large memory requirements, but this
    clearly fixes a user-visible regression and should have been caught by
    ltp.
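
    A sketch of the compat-aware fetch described above (illustrative; error
    handling simplified relative to the upstream loop in do_pages_move()):

        const void __user *p;

        if (in_compat_syscall()) {
                compat_uptr_t cp;

                /* the userspace array holds 32-bit pointers */
                if (get_user(cp, (compat_uptr_t __user *)pages + i))
                        return -EFAULT;
                p = compat_ptr(cp);
        } else {
                if (get_user(p, pages + i))
                        return -EFAULT;
        }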

    Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com
    Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages")
    Signed-off-by: Gregory Price <gregory.price@memverge.com>
    Reported-by: Arnd Bergmann <arnd@arndb.de>
    Co-developed-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:28 -06:00
Chris von Recklinghausen d88eb62a7f mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
Conflicts: mm/huge_memory.c - We already have
	96a9c287e25d ("mm/migrate: fix wrongly apply write bit after mkdirty on sparc64")
	so don't add check for pmd_swp_uffd_wp or call pmd_wrprotect
	We also have
	161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")
	so call pte_mkwrite with a vma

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit f3ebdf042df4e08bab1d5f8bf1c4b959d8741c10
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Apr 18 16:21:13 2023 +0200

    mm: don't check VMA write permissions if the PTE/PMD indicates write permissions

    Staring at the comment "Recheck VMA as permissions can change since
    migration started" in remove_migration_pte() can result in confusion,
    because if the source PTE/PMD indicates write permissions, then there
    should be no need to check VMA write permissions when restoring migration
    entries or PTE-mapping a PMD.

    Commit d3cb8bf608 ("mm: migrate: Close race between migration completion
    and mprotect") introduced the maybe_mkwrite() handling in
    remove_migration_pte() in 2014, stating that a race between mprotect() and
    migration finishing would be possible, and that we could end up with a
    writable PTE that should be readable.

    However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
    and then walks the page tables to (a) set all present writable PTEs to
    read-only and (b) convert all writable migration entries to readable
    migration entries.  While walking the page tables and modifying the
    entries, migration code has to grab the PT locks to synchronize against
    concurrent page table modifications.

    Assuming migration would find a writable migration entry (while holding
    the PT lock) and replace it with a writable present PTE, surely mprotect()
    code didn't stumble over the writable migration entry yet (converting it
    into a readable migration entry) and would instead wait for the PT lock to
    convert the now present writable PTE into a read-only PTE.  As mprotect()
    didn't finish yet, the behavior is just like migration didn't happen: a
    writable PTE will be converted to a read-only PTE.

    So it's fine to rely on the writability information in the source PTE/PMD
    and not recheck against the VMA as long as we're holding the PT lock to
    synchronize with anyone who concurrently wants to downgrade write
    permissions (like mprotect()) by first adjusting vma->vm_flags /
    vma->vm_page_prot to then walk over the page tables to adjust the page
    table entries.

    Running test cases that should reveal such races -- mprotect(PROT_READ)
    racing with page migration or THP splitting -- for multiple hours did not
    reveal an issue with this cleanup.
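
    In code terms, the cleanup boils down to trusting the writability encoded in
    the migration entry (sketch of the PTE path in remove_migration_pte(); the
    RHEL backport passes the VMA to pte_mkwrite(), per the conflict note above):

        if (is_writable_migration_entry(entry))
                pte = pte_mkwrite(pte, vma);    /* was: maybe_mkwrite(pte, vma) */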

    Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:03 -04:00
Chris von Recklinghausen bc39c71195 migrate_pages_batch: fix statistics for longterm pin retry
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 851ae6424697d1c4f085cb878c88168923ebcad1
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Apr 17 07:59:29 2023 +0800

    migrate_pages_batch: fix statistics for longterm pin retry

    In commit fd4a7ac32918 ("mm: migrate: try again if THP split is failed due
    to page refcnt"), if the THP splitting fails due to page reference count,
    we will retry to improve the migration success rate.  But the failed
    splitting is counted as both a migration failure and a migration retry, which
    causes duplicated failure counting.  So, in this patch, this is fixed by
    undoing the failure counting if we decide to retry.  The patch is tested
    via failure injection.

    Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com
    Fixes: fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:03 -04:00
Aristeu Rozanski 33cb4bb71d migrate_pages: try migrate in batch asynchronously firstly
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 2ef7dbb269902bde34c82f027806992195d1d1ee
Author: Huang Ying <ying.huang@intel.com>
Date:   Fri Mar 3 11:01:55 2023 +0800

    migrate_pages: try migrate in batch asynchronously firstly

    When we have locked more than one folio, we cannot wait for a lock or bit
    (e.g., page lock, buffer head lock, writeback bit) synchronously.
    Otherwise a deadlock may be triggered.  This makes it hard to batch
    synchronous migration directly.

    This patch re-enables batching for synchronous migration by first trying to
    migrate the batch asynchronously.  Any folios that fail to migrate
    asynchronously are then migrated synchronously, one by one.

    Test shows that this can restore the TLB flushing batching performance for
    synchronous migration effectively.
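
    A rough sketch of the resulting control flow (simplified pseudocode, not the
    literal upstream helper; allocation callbacks and statistics omitted):

        LIST_HEAD(single);

        /* 1) try the whole batch without blocking */
        migrate_pages_batch(from, ..., MIGRATE_ASYNC, ...);

        /* 2) anything still on the list failed asynchronously: migrate it in
         *    the caller's synchronous mode, one folio at a time, so we never
         *    wait while holding other folios locked. */
        list_for_each_entry_safe(folio, next, from, lru) {
                list_move(&folio->lru, &single);
                migrate_pages_batch(&single, ..., mode, ...);
        }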

    Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
    Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:25 -04:00
Aristeu Rozanski d7a2c854eb migrate_pages: move split folios processing out of migrate_pages_batch()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit a21d2133215b58fbf254ea2bb77eb3143ffedf60
Author: Huang Ying <ying.huang@intel.com>
Date:   Fri Mar 3 11:01:54 2023 +0800

    migrate_pages: move split folios processing out of migrate_pages_batch()

    To simplify the code logic and reduce the line number.

    Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
    Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:25 -04:00
Aristeu Rozanski f3579f9e12 migrate_pages: fix deadlock in batched migration
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit fb3592c41a4427601f9643b2a84e55bb99f5cd7c
Author: Huang Ying <ying.huang@intel.com>
Date:   Fri Mar 3 11:01:53 2023 +0800

    migrate_pages: fix deadlock in batched migration

    Patch series "migrate_pages: fix deadlock in batched synchronous
    migration", v2.

    Two deadlock bugs were reported for the migrate_pages() batching series.
    Thanks Hugh and Pengfei.  Analysis shows that if we have locked folios other
    than the one we are migrating, it's not safe in general to wait
    synchronously, for example, to wait for writeback to complete or to wait to
    lock a buffer head.

    So 1/3 fixes the deadlock in a simple way, where batching support for
    synchronous migration is disabled.  The change is straightforward and easy
    to understand.  Then 3/3 re-introduces batching for synchronous migration by
    optimistically trying to migrate in batch asynchronously, falling back to
    migrating the failed folios synchronously, one by one.  Tests show that this
    effectively restores the TLB flushing batching performance for synchronous
    migration.

    This patch (of 3):

    Two deadlock bugs were reported for the migrate_pages() batching series.
    Thanks Hugh and Pengfei!  For example, in the following deadlock trace
    snippet,

     INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
           Not tainted 6.2.0-rc4-kvm+ #1314
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
     Workqueue: loop4 loop_rootcg_workfn
     Call Trace:
      <TASK>
      __schedule+0x43b/0xd00
      schedule+0x6a/0xf0
      io_schedule+0x4a/0x80
      folio_wait_bit_common+0x1b5/0x4e0
      ? __pfx_wake_page_function+0x10/0x10
      __filemap_get_folio+0x73d/0x770
      shmem_get_folio_gfp+0x1fd/0xc80
      shmem_write_begin+0x91/0x220
      generic_perform_write+0x10e/0x2e0
      __generic_file_write_iter+0x17e/0x290
      ? generic_write_checks+0x12b/0x1a0
      generic_file_write_iter+0x97/0x180
      ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      do_iter_readv_writev+0x13c/0x210
      ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      do_iter_write+0xf6/0x330
      vfs_iter_write+0x46/0x70
      loop_process_work+0x723/0xfe0
      loop_rootcg_workfn+0x28/0x40
      process_one_work+0x3cc/0x8d0
      worker_thread+0x66/0x630
      ? __pfx_worker_thread+0x10/0x10
      kthread+0x153/0x190
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x29/0x50
      </TASK>

     INFO: task repro:1023 blocked for more than 147 seconds.
           Not tainted 6.2.0-rc4-kvm+ #1314
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
     Call Trace:
      <TASK>
      __schedule+0x43b/0xd00
      schedule+0x6a/0xf0
      io_schedule+0x4a/0x80
      folio_wait_bit_common+0x1b5/0x4e0
      ? compaction_alloc+0x77/0x1150
      ? __pfx_wake_page_function+0x10/0x10
      folio_wait_bit+0x30/0x40
      folio_wait_writeback+0x2e/0x1e0
      migrate_pages_batch+0x555/0x1ac0
      ? __pfx_compaction_alloc+0x10/0x10
      ? __pfx_compaction_free+0x10/0x10
      ? __this_cpu_preempt_check+0x17/0x20
      ? lock_is_held_type+0xe6/0x140
      migrate_pages+0x100e/0x1180
      ? __pfx_compaction_free+0x10/0x10
      ? __pfx_compaction_alloc+0x10/0x10
      compact_zone+0xe10/0x1b50
      ? lock_is_held_type+0xe6/0x140
      ? check_preemption_disabled+0x80/0xf0
      compact_node+0xa3/0x100
      ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      ? _find_first_bit+0x7b/0x90
      sysctl_compaction_handler+0x5d/0xb0
      proc_sys_call_handler+0x29d/0x420
      proc_sys_write+0x2b/0x40
      vfs_write+0x3a3/0x780
      ksys_write+0xb7/0x180
      __x64_sys_write+0x26/0x30
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x72/0xdc
     RIP: 0033:0x7f3a2471f59d
     RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
     RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
     RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
     RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
     R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
     R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
      </TASK>

    The page migration task holds the lock of the shmem folio A and is waiting
    for the writeback of folio B of the file system on the loop block device to
    complete.  Meanwhile, the loop worker task that writes back folio B is
    waiting to lock the shmem folio A, because folio A backs folio B in the loop
    device.  Thus a deadlock is triggered.

    In general, if we have locked folios other than the one we are migrating,
    it's not safe to wait synchronously, for example, to wait for writeback to
    complete or to wait to lock a buffer head.

    To fix the deadlock, in this patch we avoid batching the page migration
    except in MIGRATE_ASYNC mode.  In MIGRATE_ASYNC mode, synchronous waiting is
    avoided.

    The fix can be improved further.  We will do that as soon as possible.

    Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
    Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
    Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
    Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
    Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
    Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reported-by: Hugh Dickins <hughd@google.com>
    Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:25 -04:00
Aristeu Rozanski 4077e13de2 mm: avoid gcc complaint about pointer casting
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit e77d587a2c04e82c6a0dffa4a32c874a4029385d
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat Mar 4 14:03:27 2023 -0800

    mm: avoid gcc complaint about pointer casting

    The migration code ends up temporarily stashing information of the wrong
    type in unused fields of the newly allocated destination folio.  That
    all works fine, but gcc does complain about the pointer type mis-use:

        mm/migrate.c: In function ‘__migrate_folio_extract’:
        mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’

         1050 |         *anon_vmap = (void *)dst->mapping;
              |         ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~

    and gcc is actually right to complain since it really doesn't understand
    that this is a very temporary special case where this is ok.

    This could be fixed in different ways by just obfuscating the assignment
    sufficiently that gcc doesn't see what is going on, but the truly
    "proper C" way to do this is by explicitly using a union.

    Using unions for type conversions like this is normally hugely ugly and
    syntactically nasty, but this really is one of the few cases where we
    want to make it clear that we're not doing type conversion, we're really
    re-using the value bit-for-bit just using another type.

    IOW, this should not become a common pattern, but in this one case using
    that odd union is probably the best way to document to the compiler what
    is conceptually going on here.

    [ Side note: there are valid cases where we convert pointers to other
      pointer types, notably the whole "folio vs page" situation, where the
      types actually have fundamental commonalities.

      The fact that the gcc note is limited to just randomized structures
      means that we don't see equivalent warnings for those cases, but it
      might also mean that we miss other cases where we do play these kinds
      of dodgy games, and this kind of explicit conversion might be a good
      idea. ]

    I verified that at least for an allmodconfig build on x86-64, this
    generates the exact same code, apart from line numbers and assembler
    comment changes.
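
    A sketch of the union-based stash (type and field names as used by the
    migration code; simplified):

        union migration_ptr {
                struct anon_vma *anon_vma;
                struct address_space *mapping;
        };

        /* record: reuse dst->mapping to carry the anon_vma across stages */
        union migration_ptr ptr = { .anon_vma = anon_vma };
        dst->mapping = ptr.mapping;

        /* extract: read the value back bit-for-bit, no cast needed */
        ptr.mapping = dst->mapping;
        anon_vma = ptr.anon_vma;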

    Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:25 -04:00
Aristeu Rozanski 837cf9f325 mm: change to return bool for isolate_movable_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit cd7755800eb54e8522f5e51f4e71e6494c1f1572
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:37 2023 +0800

    mm: change to return bool for isolate_movable_page()

    Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
    will care about the negative return value, thus we can convert the
    isolate_movable_page() to return a boolean value to make the code more
    clear when checking the movable page isolation state.

    No functional changes intended.

    [akpm@linux-foundation.org: remove unneeded comment, per Matthew]
    Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski 39263f3448 mm: hugetlb: change to return bool for isolate_hugetlb()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9747b9e92418b61c2281561e0651803f1fad0159
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:36 2023 +0800

    mm: hugetlb: change to return bool for isolate_hugetlb()

    Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
    care about the negative value, thus we can convert the isolate_hugetlb()
    to return a boolean value to make code more clear when checking the
    hugetlb isolation state.  Moreover, it converts 2 users that do consider
    the negative value returned by isolate_hugetlb().

    No functional changes intended.

    [akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
    Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski 4c96f5154f mm: change to return bool for isolate_lru_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:35 2023 +0800

    mm: change to return bool for isolate_lru_page()

    The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
    care about the negative error of isolate_lru_page(), except one user in
    add_page_for_migration().  So we can convert the isolate_lru_page() to
    return a boolean value, which can help to make the code more clear when
    checking the return value of isolate_lru_page().

    Also convert all users' logic of checking the isolation state.

    No functional changes intended.
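
    The call-site conversion is purely mechanical (sketch):

        /* before: 0 on success, -EBUSY on failure */
        if (isolate_lru_page(page))
                goto out;

        /* after: true on success, false on failure */
        if (!isolate_lru_page(page))
                goto out;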

    Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 663359b4c3 migrate_pages: move THP/hugetlb migration support check to simplify code
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 6f7d760e86fa84862d749e36ebd29abf31f4f883
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:44 2023 +0800

    migrate_pages: move THP/hugetlb migration support check to simplify code

    This is a code cleanup patch, no functionality change is expected.  After
    the change, the line count is reduced, especially in the long
    migrate_pages_batch().

    Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Suggested-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 5784c8749c migrate_pages: batch flushing TLB
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7e12beb8ca2ac98b2ec42e0ea4b76cdc93b58654
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:43 2023 +0800

    migrate_pages: batch flushing TLB

    TLB flushing can cost quite a few CPU cycles during folio migration in some
    situations, for example, when migrating a folio of a process with multiple
    active threads that run on multiple CPUs.  After
    batching the _unmap and _move in migrate_pages(), the TLB flushing can be
    batched easily with the existing TLB flush batching mechanism.  This patch
    implements that.

    We use the following test case to test the patch.

    On a 2-socket Intel server,

    - Run pmbench memory accessing benchmark

    - Run `migratepages` to migrate pages of pmbench between node 0 and
      node 1 back and forth.

    With the patch, TLB flushing IPIs are reduced by 99.1% during the test, and
    the number of pages migrated successfully per second increases by 291.7%.

    Haoxin helped to test the patchset on an ARM64 server with 128 cores, 2
    NUMA nodes.  Test results show that the page migration performance
    increases up to 78%.

    NOTE: TLB flushing is batched only for normal folios, not for THP folios.
    Because the overhead of TLB flushing for THP folios is much lower than
    that for normal folios (about 1/512 on x86 platform).
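
    A hedged sketch of how the existing flush-batching machinery is wired in
    (flag and helper names from mm/rmap; exact call sites simplified):

        /* _unmap stage: defer the per-PTE TLB flush IPIs */
        try_to_migrate(src, TTU_BATCH_FLUSH);

        /* flush once per batch, before the _move stage copies folio contents */
        try_to_unmap_flush();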

    Link: https://lkml.kernel.org/r/20230213123444.155149-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 40d2a594af migrate_pages: share more code between _unmap and _move
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit ebe75e4751063dce6f61b579b43de86dcf7b7462
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:42 2023 +0800

    migrate_pages: share more code between _unmap and _move

    This is a code cleanup patch to reduce the duplicated code between the
    _unmap and _move stages of migrate_pages().  No functionality change is
    expected.

    Link: https://lkml.kernel.org/r/20230213123444.155149-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski b54af45b6c migrate_pages: move migrate_folio_unmap()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 80562ba0d8378e89fe5836c28ea56c2aab3014e8
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:41 2023 +0800

    migrate_pages: move migrate_folio_unmap()

    Just move the position of the functions.  There is no functionality
    change.  This is to make it easier to review the next patch by putting the
    code near its position in the next patch.

    Link: https://lkml.kernel.org/r/20230213123444.155149-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski b9eacb8530 migrate_pages: batch _unmap and _move
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 5dfab109d5193e6c224d96cabf90e9cc2c039884
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:40 2023 +0800

    migrate_pages: batch _unmap and _move

    In this patch the _unmap and _move stages of the folio migration are
    batched.  That is, previously it was:

      for each folio
        _unmap()
        _move()

    Now, it is,

      for each folio
        _unmap()
      for each folio
        _move()

    Based on this, we can batch the TLB flushing and use a hardware
    accelerator to copy folios between the batched _unmap and batched _move
    stages.

    Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 83c5bdf4e3 migrate_pages: split unmap_and_move() to _unmap() and _move()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 64c8902ed4418317cd416c566f896bd4a92b2efc
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:39 2023 +0800

    migrate_pages: split unmap_and_move() to _unmap() and _move()

    This is a preparation patch to batch the folio unmapping and moving.

    In this patch, unmap_and_move() is split to migrate_folio_unmap() and
    migrate_folio_move().  So, we can batch _unmap() and _move() in different
    loops later.  To pass some information between unmap and move, the
    original unused dst->mapping and dst->private are used.

    Link: https://lkml.kernel.org/r/20230213123444.155149-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 7540e12101 migrate_pages: restrict number of pages to migrate in batch
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 42012e0436d44aeb2e68f11a28ddd0ad3f38b61f
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:38 2023 +0800

    migrate_pages: restrict number of pages to migrate in batch

    This is a preparation patch to batch the folio unmapping and moving for
    non-hugetlb folios.

    If we batched the folio unmapping, all folios to be migrated would be
    unmapped before copying the contents and flags of the folios.  If the folios
    passed to migrate_pages() spanned too many pages, the execution of the
    processes would be stopped for too long, causing excessive latency.  For
    example, the migrate_pages() syscall will call migrate_pages() with all
    folios of a process.  To avoid this possible
    issue, in this patch, we restrict the number of pages to be migrated to be
    no more than HPAGE_PMD_NR.  That is, the impact is at the same level as
    THP migration.
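
    A sketch of the limit described above (constant as added to mm/migrate.c;
    shown for illustration):

        /* cap one batch at a PMD-sized huge page worth of base pages */
        #define NR_MAX_BATCHED_MIGRATION       HPAGE_PMD_NR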

    Link: https://lkml.kernel.org/r/20230213123444.155149-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:22 -04:00