JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2024-42234
This patch is a backport of the following upstream commit:
commit be9581ea8c058d81154251cb0695987098996cad
Author: Hugh Dickins <hughd@google.com>
Date: Tue Jul 2 00:40:55 2024 -0700
mm: fix crashes from deferred split racing folio migration
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.
6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.
Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
while the old folio's reference count is temporarily frozen to 0 - adding
such a freeze in the !mapping case too (originally, folio lock and
unmapping and no swap cache left an anon folio unreachable, so no freezing
was needed there: but the deferred split queue offers a way to reach it).
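
The interplay between folio_try_get() and a reference count frozen at 0 can be sketched in userspace C11. This is only an illustrative model under stated assumptions, not the kernel implementation: struct toy_folio and every toy_* helper are hypothetical names invented here, and the list handling is reduced to a single flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a folio; not a kernel type. */
struct toy_folio {
	atomic_int refcount;
	bool on_deferred_list;
};

/* In the spirit of folio_try_get(): take a reference unless the count is 0. */
static bool toy_try_get(struct toy_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Freeze: swing the count from the expected value to 0, or fail. */
static bool toy_ref_freeze(struct toy_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Unfreeze: restore the count; only valid while the count is frozen. */
static void toy_ref_unfreeze(struct toy_folio *f, int count)
{
	assert(atomic_load(&f->refcount) == 0);
	atomic_store(&f->refcount, count);
}

/*
 * Migration side of the sketch: the folio is taken off the (modelled)
 * deferred list only while its count is frozen, so a concurrent
 * toy_try_get() from a scanner cannot succeed in that window.
 */
static bool toy_migrate_unqueue(struct toy_folio *f, int expected)
{
	if (!toy_ref_freeze(f, expected))
		return false;	/* someone else holds an extra reference */
	f->on_deferred_list = false;
	toy_ref_unfreeze(f, expected);
	return true;
}
```

In this model, a scanner that calls toy_try_get() while the count is frozen simply gets false and skips the folio, which is the property the fix relies on.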
Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
* This backport drops the 2nd hunk from the upstream commit because RHEL-9
misses commit 7262f208ca68 ("mm/migrate: split source folio if it is on
deferred split list"), which itself is a follow-up fix for v6.10 code
that was never backported into RHEL-9
This patch is a backport of the following upstream commit:
commit 2e6506e1c4eed2676a8412231046f31e10e240da
Author: Gao Xiang <hsiangkao@linux.alibaba.com>
Date: Mon Jul 29 10:13:06 2024 +0800
mm/migrate: fix deadlock in migrate_pages_batch() on large folios
Currently, migrate_pages_batch() can lock multiple locked folios with an
arbitrary order. Although folio_trylock() is used to avoid deadlock as
commit 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously
firstly") mentioned, it seems try_split_folio() is still missing.
It was found by a compaction stress test when I explicitly enabled EROFS
compressed files to use large folios; I cannot reproduce it with
the same workload if large folio support is off (current mainline).
Typically, filesystem reads (with locked file-backed folios) could use
another bdev/meta inode to load some other I/Os (e.g. inode extent
metadata or caching compressed data), so the locking order will be:
file-backed folios (A)
bdev/meta folios (B)
The following calltrace shows the deadlock:
Thread 1 takes (B) lock and tries to take folio (A) lock
Thread 2 takes (A) lock and tries to take folio (B) lock
[Thread 1]
INFO: task stress:1824 blocked for more than 30 seconds.
Tainted: G OE 6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress state:D stack:0 pid:1824 tgid:1824 ppid:1822 flags:0x0000000c
Call trace:
__switch_to+0xec/0x138
__schedule+0x43c/0xcb0
schedule+0x54/0x198
io_schedule+0x44/0x70
folio_wait_bit_common+0x184/0x3f8
<-- folio mapping ffff00036d69cb18 index 996 (**)
__folio_lock+0x24/0x38
migrate_pages_batch+0x77c/0xea0 // try_split_folio (mm/migrate.c:1486:2)
// migrate_pages_batch (mm/migrate.c:1734:16)
<--- LIST_HEAD(unmap_folios) has
..
folio mapping 0xffff0000d184f1d8 index 1711; (*)
folio mapping 0xffff0000d184f1d8 index 1712;
..
migrate_pages+0xb28/0xe90
compact_zone+0xa08/0x10f0
compact_node+0x9c/0x180
sysctl_compaction_handler+0x8c/0x118
proc_sys_call_handler+0x1a8/0x280
proc_sys_write+0x1c/0x30
vfs_write+0x240/0x380
ksys_write+0x78/0x118
__arm64_sys_write+0x24/0x38
invoke_syscall+0x78/0x108
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x148
el0t_64_sync_handler+0x100/0x130
el0t_64_sync+0x190/0x198
[Thread 2]
INFO: task stress:1825 blocked for more than 30 seconds.
Tainted: G OE 6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress state:D stack:0 pid:1825 tgid:1825 ppid:1822 flags:0x0000000c
Call trace:
__switch_to+0xec/0x138
__schedule+0x43c/0xcb0
schedule+0x54/0x198
io_schedule+0x44/0x70
folio_wait_bit_common+0x184/0x3f8
<-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
__folio_lock+0x24/0x38
z_erofs_runqueue+0x384/0x9c0 [erofs]
z_erofs_readahead+0x21c/0x350 [erofs] <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
read_pages+0x74/0x328
page_cache_ra_order+0x26c/0x348
ondemand_readahead+0x1c0/0x3a0
page_cache_sync_ra+0x9c/0xc0
filemap_get_pages+0xc4/0x708
filemap_read+0x104/0x3a8
generic_file_read_iter+0x4c/0x150
vfs_read+0x27c/0x330
ksys_pread64+0x84/0xd0
__arm64_sys_pread64+0x28/0x40
invoke_syscall+0x78/0x108
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x148
el0t_64_sync_handler+0x100/0x130
el0t_64_sync+0x190/0x198
Link: https://lkml.kernel.org/r/20240729021306.398286-1-hsiangkao@linux.alibaba.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
* notable context difference due to RHEL-9 missing support
for mTHP and its follow-up commits
This patch is a backport of the following upstream commit:
commit 60cf233b585cdf1f3c5e52d1225606b86acd08b0
Author: Zi Yan <ziy@nvidia.com>
Date: Wed Mar 5 15:04:03 2025 -0500
mm/migrate: fix shmem xarray update during migration
A shmem folio can be either in page cache or in swap cache, but not at the
same time. Namely, once it is in swap cache, folio->mapping should be
NULL, and the folio is no longer in a shmem mapping.
In __folio_migrate_mapping(), to determine the number of xarray entries to
update, folio_test_swapbacked() is used, but that conflates shmem in page
cache case and shmem in swap cache case. It leads to xarray multi-index
entry corruption, since it turns a sibling entry to a normal entry during
xas_store() (see [1] for a userspace reproduction). Fix it by only using
folio_test_swapcache() to determine whether xarray is storing swap cache
entries or not to choose the right number of xarray entries to update.
[1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/
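
The difference between the two selection rules described above can be illustrated with a small userspace model. It is a sketch under assumptions: struct toy_folio and the toy_* functions are hypothetical, the flags are plain booleans, and the "old" rule is reconstructed only from the description that folio_test_swapbacked() was used to pick the entry count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the folio state relevant to the entry count. */
struct toy_folio {
	bool swapbacked;	/* true for shmem, in page cache or swap cache */
	bool swapcache;		/* true only while the folio is in swap cache */
	long nr_pages;
};

/* Rule before the fix (as described): keyed on swapbacked. */
static long toy_entries_old(const struct toy_folio *f)
{
	assert(f->nr_pages >= 1);
	return f->swapbacked ? f->nr_pages : 1;
}

/* Rule after the fix (as described): keyed on swapcache only. */
static long toy_entries_new(const struct toy_folio *f)
{
	assert(f->nr_pages >= 1);
	return f->swapcache ? f->nr_pages : 1;
}
```

For a 16-page shmem folio in page cache (swapbacked set, swapcache clear), the old rule yields 16 entries while the new rule yields 1, which is the case the fix targets; for a folio in swap cache both rules agree.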
Note:
In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is
used to get swap_cache address space, but that ignores the shmem folio in
swap cache case. It could lead to NULL pointer dereferencing when an
in-swap-cache shmem folio is split at __xa_store(), since
!folio_test_anon() is true and folio->mapping is NULL. But fortunately,
its caller split_huge_page_to_list_to_order() bails out early with EBUSY
when folio->mapping is NULL. So no need to take care of it here.
Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com
Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: Liu Shixin <liushixin2@huawei.com>
Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/
Suggested-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
* context differences from upstream due to RHEL-9 missing commit
7262f208ca68 ("mm/migrate: split source folio if it is on
deferred split list") and its follow-up fix commit 6e49019db5f7
("mm/migrate: putback split folios when numa hint migration fails"),
with none of them being relevant to this backport
This patch is a backport of the following upstream commit:
commit ac59a1f0146f46bad7d5f8d1b20756ece43122ec
Author: Zi Yan <ziy@nvidia.com>
Date: Wed Jul 24 09:01:15 2024 -0400
memory tiering: count PGPROMOTE_SUCCESS when mem tiering is enabled.
memory tiering can be enabled/disabled at runtime and
sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING is used to
check it. In migrate_misplaced_folio(), the check is missing when
PGPROMOTE_SUCCESS is incremented. Add the missing check.
Link: https://lkml.kernel.org/r/20240724130115.793641-4-ziy@nvidia.com
Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency")
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Closes: https://lore.kernel.org/linux-mm/f4ae2c9c-fe40-4807-bdb2-64cf2d716c1a@huawei.com/
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 35e41024c4c2b02ef8207f61b9004f6956cf037b
Author: Gregory Price <gourry@gourry.net>
Date: Fri Oct 25 10:17:24 2024 -0400
vmscan,migrate: fix page count imbalance on node stats when demoting pages
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. migrate_pages will decrement the
node's isolated page count, leading to an imbalanced count when
invoked from (MG)LRU code.
The result is dmesg output such as:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The following path produces the decrement:
shrink_folio_list
demote_folio_list
migrate_pages
migrate_pages_batch
migrate_folio_move
migrate_folio_done
mod_node_page_state(-ve) <- decrement
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
Instead, when accounting for migrations, do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic
is the only source of MR_DEMOTION.
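
The imbalance and the fix can be modelled with a toy counter. This is a hedged sketch, not the kernel code: struct toy_node, enum toy_reason and the toy_* functions are hypothetical, and the model assumes (as the text above states) that the reclaim path does its own accounting for the folios it isolated.

```c
#include <assert.h>

/* Hypothetical migration reasons; only the demotion case matters here. */
enum toy_reason { TOY_MR_COMPACTION, TOY_MR_DEMOTION };

/* Hypothetical per-node counter standing in for nr_isolated_*. */
struct toy_node {
	long nr_isolated;
};

/*
 * Completion of a successful migration. With 'fixed' set, the
 * decrement is skipped for demotion, because the caller accounts
 * for those folios itself.
 */
static void toy_migrate_done(struct toy_node *n, long nr,
			     enum toy_reason reason, int fixed)
{
	if (fixed && reason == TOY_MR_DEMOTION)
		return;
	n->nr_isolated -= nr;
}

/* Reclaim-side path: isolate, demote via migration, then un-account. */
static long toy_reclaim_demote(long nr, int fixed)
{
	struct toy_node node = { 0 };

	node.nr_isolated += nr;		/* reclaim isolates the folios */
	toy_migrate_done(&node, nr, TOY_MR_DEMOTION, fixed);
	node.nr_isolated -= nr;		/* reclaim un-accounts its isolation */

	/* With the fix, the counter must balance back to zero. */
	assert(!fixed || node.nr_isolated == 0);
	return node.nr_isolated;
}
```

Without the fix the model ends at -nr, mirroring the negative nr_isolated values in the dmesg output above; with the fix it ends at 0.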
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net
Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim")
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Xu <weixugc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 8001070cfbec5cd4ea00b8b48ea51df91122f265
Author: Jeongjun Park <aha310510@gmail.com>
Date: Tue Sep 24 22:00:53 2024 +0900
mm: migrate: annotate data-race in migrate_folio_unmap()
I found a report from syzbot [1]
This report shows that the value can be changed, but in reality, the
value of __folio_set_movable() cannot be changed because it holds the
folio refcount.
Therefore, it is appropriate to add an annotation to make KCSAN
ignore that data-race.
[1]
==================================================================
BUG: KCSAN: data-race in __filemap_remove_folio / migrate_pages_batch
write to 0xffffea0004b81dd8 of 8 bytes by task 6348 on cpu 0:
page_cache_delete mm/filemap.c:153 [inline]
__filemap_remove_folio+0x1ac/0x2c0 mm/filemap.c:233
filemap_remove_folio+0x6b/0x1f0 mm/filemap.c:265
truncate_inode_folio+0x42/0x50 mm/truncate.c:178
shmem_undo_range+0x25b/0xa70 mm/shmem.c:1028
shmem_truncate_range mm/shmem.c:1144 [inline]
shmem_evict_inode+0x14d/0x530 mm/shmem.c:1272
evict+0x2f0/0x580 fs/inode.c:731
iput_final fs/inode.c:1883 [inline]
iput+0x42a/0x5b0 fs/inode.c:1909
dentry_unlink_inode+0x24f/0x260 fs/dcache.c:412
__dentry_kill+0x18b/0x4c0 fs/dcache.c:615
dput+0x5c/0xd0 fs/dcache.c:857
__fput+0x3fb/0x6d0 fs/file_table.c:439
____fput+0x1c/0x30 fs/file_table.c:459
task_work_run+0x13a/0x1a0 kernel/task_work.c:228
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0xbe/0x130 kernel/entry/common.c:218
do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffffea0004b81dd8 of 8 bytes by task 6342 on cpu 1:
__folio_test_movable include/linux/page-flags.h:699 [inline]
migrate_folio_unmap mm/migrate.c:1199 [inline]
migrate_pages_batch+0x24c/0x1940 mm/migrate.c:1797
migrate_pages_sync mm/migrate.c:1963 [inline]
migrate_pages+0xff1/0x1820 mm/migrate.c:2072
do_mbind mm/mempolicy.c:1390 [inline]
kernel_mbind mm/mempolicy.c:1533 [inline]
__do_sys_mbind mm/mempolicy.c:1607 [inline]
__se_sys_mbind+0xf76/0x1160 mm/mempolicy.c:1603
__x64_sys_mbind+0x78/0x90 mm/mempolicy.c:1603
x64_sys_call+0x2b4d/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:238
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0xffff888127601078 -> 0x0000000000000000
Link: https://lkml.kernel.org/r/20240924130053.107490-1-aha310510@gmail.com
Fixes: 7e2a5e5ab217 ("mm: migrate: use __folio_test_movable()")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
* virt/kvm/guest_memfd.c: difference in the hunk due to RHEL missing upstream
commit 1d23040caa8b ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating
guest_memfd inode") which would end up being reverted with this follow-up fix.
This patch is a backport of the following upstream commit:
commit 27e6a24a4cf3d25421c0f6ebb7c39f45fc14d20f
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Jul 11 13:56:54 2024 -0400
mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
The flags AS_UNMOVABLE and AS_INACCESSIBLE were both added just for guest_memfd;
AS_UNMOVABLE is already in existing versions of Linux, while AS_INACCESSIBLE was
acked for inclusion in 6.11.
But really, they are the same thing: only guest_memfd uses them, at least for
now, and guest_memfd pages are unmovable because they should not be
accessed by the CPU.
So merge them into one; use the AS_INACCESSIBLE name which is more comprehensive.
At the same time, this fixes an embarrassing bug where AS_INACCESSIBLE was used
as a bit mask, despite it being just a bit index.
The bug was mostly benign, because AS_INACCESSIBLE's bit representation (1010)
corresponded to setting AS_UNEVICTABLE (which is already set) and AS_ENOSPC
(except no async writes can happen on the guest_memfd). So the AS_INACCESSIBLE
flag simply had no effect.
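
The index-versus-mask confusion is easy to reproduce in a few lines of userspace C. The sketch below is illustrative only: the toy_* names are hypothetical, and the enum values are assumptions chosen to match the 1010 bit pattern quoted above (index 10 for the inaccessible flag, index 1 for ENOSPC, index 3 for unevictable).

```c
#include <assert.h>

/* Bit indexes, not masks. Values are assumed, to match the text above. */
enum toy_mapping_flags {
	TOY_AS_ENOSPC       = 1,
	TOY_AS_UNEVICTABLE  = 3,
	TOY_AS_INACCESSIBLE = 10,
};

/* Buggy pattern: ORs the index itself into the flags word. */
static unsigned long toy_set_flag_buggy(unsigned long flags, unsigned int idx)
{
	return flags | idx;
}

/* Correct pattern: converts the index to a single-bit mask first. */
static unsigned long toy_set_flag(unsigned long flags, unsigned int idx)
{
	assert(idx < 8 * sizeof(flags));
	return flags | (1UL << idx);
}

/* Test one bit by index. */
static int toy_test_flag(unsigned long flags, unsigned int idx)
{
	return (int)((flags >> idx) & 1UL);
}
```

With these values, toy_set_flag_buggy(0, TOY_AS_INACCESSIBLE) produces 0xa, which sets bits 1 and 3 and leaves bit 10 clear, while toy_set_flag() produces 0x400.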
Fixes: 1d23040caa8b ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode")
Fixes: c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory")
Cc: linux-mm@kvack.org
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
JIRA: https://issues.redhat.com/browse/RHEL-28873
CVE: CVE-2023-52490
This patch is a backport of the following upstream commit:
commit d1adb25df7111de83b64655a80b5a135adbded61
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Fri Dec 15 20:07:52 2023 +0800
mm: migrate: fix getting incorrect page mapping during page migration
When running stress-ng testing, we found below kernel crash after a few hours:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
dentry_name+0xd8/0x224
pointer+0x22c/0x370
vsnprintf+0x1ec/0x730
vscnprintf+0x2c/0x60
vprintk_store+0x70/0x234
vprintk_emit+0xe0/0x24c
vprintk_default+0x3c/0x44
vprintk_func+0x84/0x2d0
printk+0x64/0x88
__dump_page+0x52c/0x530
dump_page+0x14/0x20
set_migratetype_isolate+0x110/0x224
start_isolate_page_range+0xc4/0x20c
offline_pages+0x124/0x474
memory_block_offline+0x44/0xf4
memory_subsys_offline+0x3c/0x70
device_offline+0xf0/0x120
......
After analyzing the vmcore, I found this issue is caused by page migration.
The scenario is that, one thread is doing page migration, and we will use the
target page's ->mapping field to save 'anon_vma' pointer between page unmap and
page move, and now the target page is locked and refcount is 1.
Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers that
the refcount of this target page is 1, preventing the offline operation, thus
proceeding to dump the page. However, page_mapping() of the target page may
return an incorrect file mapping to crash the system in dump_mapping(), since
the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.
There are several ways to fix this issue:
(1) Setting the PAGE_MAPPING_ANON flag for target page's ->mapping when saving
'anon_vma', but this can confuse PageAnon() for PFN walkers, since the target
page has not built mappings yet.
(2) Getting the page lock to call page_mapping() in __dump_page() to avoid crashing
the system, however, there are still some PFN walkers that call page_mapping()
without holding the page lock, such as compaction.
(3) Using target page->private field to save the 'anon_vma' pointer and 2 bits
page state, just as page->mapping records an anonymous page, which can remove
the page_mapping() impact for PFN walkers and also seems a simple way.
So I choose option 3 to fix this issue, and this can also fix other potential
issues for PFN walkers, such as compaction.
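
Option 3 amounts to packing a pointer and a 2-bit state into one pointer-sized field. A minimal userspace sketch of that technique follows; it assumes the pointer is at least 4-byte aligned so its two low bits are free. All toy_* names and the TOY_* flag names are hypothetical and are not claimed to match the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two hypothetical state bits stored in the low bits of the pointer. */
#define TOY_WAS_MAPPED   0x1u
#define TOY_WAS_MLOCKED  0x2u
#define TOY_OLD_STATES   (TOY_WAS_MAPPED | TOY_WAS_MLOCKED)

struct toy_anon_vma { long dummy; };	/* aligned to at least 4 bytes */
struct toy_page { void *private; };	/* stand-in for page->private */

/* Record the pointer and state together in ->private. */
static void toy_record(struct toy_page *dst, unsigned int state,
		       struct toy_anon_vma *av)
{
	assert(((uintptr_t)av & TOY_OLD_STATES) == 0);	/* low bits free */
	assert((state & ~TOY_OLD_STATES) == 0);		/* only 2 bits */
	dst->private = (void *)((uintptr_t)av | state);
}

/* Split ->private back into pointer and state, then clear it. */
static void toy_read(struct toy_page *dst, unsigned int *state,
		     struct toy_anon_vma **av)
{
	uintptr_t v = (uintptr_t)dst->private;

	*av = (struct toy_anon_vma *)(v & ~(uintptr_t)TOY_OLD_STATES);
	*state = (unsigned int)(v & TOY_OLD_STATES);
	dst->private = NULL;
}
```

Because nothing is written to a ->mapping-like field in this scheme, a walker that inspects the mapping never sees the temporary pointer, which is the motivation given for option 3.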
Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.1702641709.git.baolin.wang@linux.alibaba.com
Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Xu Yu <xuyu@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit eebb3dabbb5cc590afe32880b5d3726d0fbf88db
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Sat Oct 21 12:33:22 2023 +0800
mm: migrate: record the mlocked page status to remove unnecessary lru drain
When doing compaction, I found the lru_add_drain() is an obvious hotspot
when migrating pages. The distribution of this hotspot is as follows:
- 18.75% compact_zone
- 17.39% migrate_pages
- 13.79% migrate_pages_batch
- 11.66% migrate_folio_move
- 7.02% lru_add_drain
+ 7.02% lru_add_drain_cpu
+ 3.00% move_to_new_folio
1.23% rmap_walk
+ 1.92% migrate_folio_unmap
+ 3.20% migrate_pages_sync
+ 0.90% isolate_migratepages
The lru_add_drain() was added by commit c3096e6782b7 ("mm/migrate:
__unmap_and_move() push good newpage to LRU") to drain the newpage to LRU
immediately, to help to build up the correct newpage->mlock_count in
remove_migration_ptes() for mlocked pages. However, if no mlocked pages
are migrating, then we can avoid this lru drain operation, especially in
heavily concurrent scenarios.
So we can record the source pages' mlocked status in
migrate_folio_unmap(), and only drain the lru list when the mlocked status
is set in migrate_folio_move().
In addition, the page was already isolated from lru when migrating, so
checking the mlocked status is stable by folio_test_mlocked() in
migrate_folio_unmap().
After this patch, I can see the hotspot of the lru_add_drain() is gone:
- 9.41% migrate_pages_batch
- 6.15% migrate_folio_move
- 3.64% move_to_new_folio
+ 1.80% migrate_folio_extra
+ 1.70% buffer_migrate_folio
+ 1.41% rmap_walk
+ 0.62% folio_add_lru
+ 3.07% migrate_folio_unmap
Meanwhile, the compaction latency shows some improvements when running
thpscale:
                           base                patched
Amean  fault-both-1     1131.22 (  0.00%)     1112.55 *  1.65%*
Amean  fault-both-3     2489.75 (  0.00%)     2324.15 *  6.65%*
Amean  fault-both-5     3257.37 (  0.00%)     3183.18 *  2.28%*
Amean  fault-both-7     4257.99 (  0.00%)     4079.04 *  4.20%*
Amean  fault-both-12    6614.02 (  0.00%)     6075.60 *  8.14%*
Amean  fault-both-18   10607.78 (  0.00%)     8978.86 * 15.36%*
Amean  fault-both-24   14911.65 (  0.00%)    11619.55 * 22.08%*
Amean  fault-both-30   14954.67 (  0.00%)    14925.66 *  0.19%*
Amean  fault-both-32   16654.87 (  0.00%)    15580.31 *  6.45%*
Link: https://lkml.kernel.org/r/06e9153a7a4850352ec36602df3a3a844de45698.1697859741.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 49cac03a8f0a56cafa5329911564c97c130ced43
Author: Zi Yan <ziy@nvidia.com>
Date: Tue Oct 17 12:31:29 2023 -0400
mm/migrate: add nr_split to trace_mm_migrate_pages stats.
Add nr_split to trace_mm_migrate_pages for large folio (including THP)
split events.
[akpm@linux-foundation.org: cleanup per Huang, Ying]
Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit a259945efe6ada94087ef666e9b38f8e34ea34ba
Author: Zi Yan <ziy@nvidia.com>
Date: Tue Oct 17 12:31:28 2023 -0400
mm/migrate: correct nr_failed in migrate_pages_sync()
nr_failed was missing the large folio splits from migrate_pages_batch()
and can cause a mismatch between migrate_pages() return value and the
number of not migrated pages, i.e., when the return value of
migrate_pages() is 0, there are still pages left in the from page list.
It will happen when a non-PMD THP large folio fails to migrate due to
-ENOMEM and is split successfully, but not all the split pages are
migrated: migrate_pages_batch() would return non-zero, but
astats.nr_thp_split = 0. nr_failed would be 0 and returned to the caller
of migrate_pages(), but the not migrated pages are left in the from page
list without being added back to LRU lists.
Fix it by adding a new nr_split counter for large folio splits and adding
it to nr_failed in migrate_pages_sync() after migrate_pages_batch() is
done.
Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com
Fixes: 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date: Tue Sep 26 12:20:17 2023 -0700
mm/filemap: remove hugetlb special casing in filemap.c
Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.
To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.
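
The index conversion performed by such wrappers can be sketched as a shift by the huge page order. The example below is a userspace illustration under assumptions: a 4 KiB base page (TOY_PAGE_SHIFT of 12) and hypothetical toy_* helper names; it does not reproduce the actual wrapper signatures.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB base pages */

/* Order of a huge page relative to the base page: log2(huge / base). */
static unsigned int toy_huge_page_order(unsigned int huge_shift)
{
	assert(huge_shift >= TOY_PAGE_SHIFT);
	return huge_shift - TOY_PAGE_SHIFT;
}

/* Convert an index counted in huge pages to one counted in base pages. */
static unsigned long toy_linear_index(unsigned long huge_idx,
				      unsigned int huge_shift)
{
	return huge_idx << toy_huge_page_order(huge_shift);
}
```

Under these assumptions a 2 MB huge page (shift 21) has order 9, so huge page number 3 maps to base-page index 1536; a 1 GB huge page (shift 30) has order 18.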
========================= PERFORMANCE ======================================
Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline, with slightly larger
overhead in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.
Timing
aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s
6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s
6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s
x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s
6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s
6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s
aarch64:
workload - fallocate a 700GB file backed by huge pages
6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio
6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio
x86
workload - fallocate a 100GB file backed by huge pages
6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt
0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store
6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio
========================= TESTING ======================================
This patch passes libhugetlbfs tests and LTP hugetlb tests
********** TEST SUMMARY
*                      2M
*                      32-bit 64-bit
*     Total testcases:   110    113
*             Skipped:     0      0
*                PASS:   107    113
*                FAIL:     0      0
*    Killed by signal:     3      0
*   Bad configuration:     0      0
*       Expected FAIL:     0      0
*     Unexpected PASS:     0      0
*    Test not present:     0      0
* Strange test result:     0      0
**********
Done executing testcases.
LTP Version: 20220527-178-g2761a81c4
page migration was also tested using Mike Kravetz's test program.[8]
[dan.carpenter@linaro.org: fix a NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit fa1df3f6287e1e1fd8b5309828238e2c728e985f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:31 2023 +0800
mm: migrate: remove isolated variable in add_page_for_migration()
Directly check the return of isolate_hugetlb() and folio_isolate_lru() to
remove the isolated variable. Also set err = -EBUSY in advance, before
isolation, and update err only when the page is successfully queued for
migration, which helps to unify and simplify the code a bit.
Link: https://lkml.kernel.org/r/20230913095131.2426871-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit b426ed7889be80359cb4edef142e5c5fa697b068
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:30 2023 +0800
mm: migrate: remove PageHead() check for HugeTLB in add_page_for_migration()
HugeTLB and THP behave differently when passed the address of a tail
page: for THP, the entire THP is migrated, but for HugeTLB, -EACCES is
returned (or -ENOENT before commit e66f17ff71 ("mm/hugetlb: take page
table lock in follow_huge_pmd()")),
-EACCES The page is mapped by multiple processes and can be moved
only if MPOL_MF_MOVE_ALL is specified.
-ENOENT The page is not present.
But according to the manual[1], neither of the two errnos is suitable.
It is better to keep the behaviour of hugetlb and THP the same when
passed the address of a tail page, so let's just remove the PageHead()
check for HugeTLB.
[1] https://man7.org/linux/man-pages/man2/move_pages.2.html
Link: https://lkml.kernel.org/r/20230913095131.2426871-8-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit d64cfccbc805663a2c5691f638cf9198b9676a9f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:29 2023 +0800
mm: migrate: use a folio in add_page_for_migration()
Use a folio in add_page_for_migration() to save compound_head() calls.
Link: https://lkml.kernel.org/r/20230913095131.2426871-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 7e2a5e5ab217d5e4166cdbdf4af8c5e34b6200bb
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:28 2023 +0800
mm: migrate: use __folio_test_movable()
Use __folio_test_movable(), no need to convert from folio to page again.
Link: https://lkml.kernel.org/r/20230913095131.2426871-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 73eab3ca481e5be0f1fd8140365d604482f84ee1
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:27 2023 +0800
mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()
At present, NUMA balancing only supports base pages and PMD-mapped THPs,
but we will expand it to support migrating large folios/pte-mapped THPs
in the future. It is better to make migrate_misplaced_page() take a folio
instead of a page and rename it to migrate_misplaced_folio(). This is a
preparation, and it also removes several compound_head() calls.
Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
* Minor context conflict on the 2nd hunk due to out-of-order backport
of commit 774f256e7c0 ("mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index")
This patch is a backport of the following upstream commit:
commit 2ac9e99f3b21b2864305fbfba4bae5913274c409
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:26 2023 +0800
mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio()
Rename numamigrate_isolate_page() to numamigrate_isolate_folio(), then
make it take a folio and use the folio API to save compound_head() calls.
Link: https://lkml.kernel.org/r/20230913095131.2426871-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 728be28fae8c838d52c91dce4867133798146357
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:25 2023 +0800
mm: migrate: remove THP mapcount check in numamigrate_isolate_page()
The check for a THP mapped by multiple processes was introduced by commit
04fa5d6a65 ("mm: migrate: check page_count of THP before migrating") and
refactored by commit 340ef3902c ("mm: numa: cleanup flow of transhuge page
migration"). It is now out of date: migrate_misplaced_page() uses the
standard migrate_pages() for small pages and THPs, and the reference
count checking is in folio_migrate_mapping(), so let's remove the special
check for THP.
Link: https://lkml.kernel.org/r/20230913095131.2426871-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit a8ac4a767dcd9d87d8229045904d9fe15ea5e0e8
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Wed Sep 13 17:51:24 2023 +0800
mm: migrate: remove PageTransHuge check in numamigrate_isolate_page()
Patch series "mm: migrate: more folio conversion and unification", v3.
Convert more migrate functions to use a folio, it is also a preparation
for large folio migration support when balancing numa.
This patch (of 8):
The assert VM_BUG_ON_PAGE(order && !PageTransHuge(page), page) is not very
useful:
1) for a tail/base page, order = 0; for a head page, order > 0 &&
   PageTransHuge() is true
2) there is a PageCompound() check and only base pages are handled in
   do_numa_page(), and do_huge_pmd_numa_page() only handles PMD-mapped
   THPs
3) even if the page is a tail page, isolate_lru_page() will post
   a warning and fail to isolate the page
4) if large folio/pte-mapped THP migration is supported in the future,
   we could migrate the entire folio if a NUMA fault occurs on a tail page
so just remove the check.
Link: https://lkml.kernel.org/r/20230913095131.2426871-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20230913095131.2426871-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 09c550508a4b8f7844b197cc16877dd0f7c42d8f
Author: David Hildenbrand <david@redhat.com>
Date: Wed Sep 13 14:51:13 2023 +0200
mm/rmap: pass folio to hugepage_add_anon_rmap()
Let's pass a folio; we are always mapping the entire thing.
Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252
JIRA: https://issues.redhat.com/browse/RHEL-27743
JIRA: https://issues.redhat.com/browse/RHEL-59459
CVE: CVE-2024-46787
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.
This work follows up on the previous v6.5 update (RHEL-27742) and as such,
the bulk of this changeset is comprised of refactoring and clean-ups of
the internal implementation of several APIs as it further advances the
conversion to FOLIOS, and follow up on the per-VMA locking changes.
Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,
and we add a potential extra level of protection (assessment pending) to help
mitigate the kernel heap exploits dubbed "SlubStick".
Follow-up fixes are omitted from this series either because they are irrelevant to
the bits we support on RHEL or because they depend on bigger changesets introduced
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.
Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")
Signed-off-by: Rafael Aquini <raquini@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* arch/parisc/include/asm/hugetlb.h: hunks dropped (unsupported arch)
* arch/parisc/mm/hugetlbpage.c: hunks dropped (unsupported arch)
* arch/riscv/include/asm/hugetlb.h: hunks dropped (unsupported arch)
* arch/riscv/mm/hugetlbpage.c: hunks dropped (unsupported arch)
* arch/sparc/mm/hugetlbpage.c: hunks dropped (unsupported arch)
* mm/rmap.c: minor context conflict on the 7th hunk due to backport of
upstream commit 322842ea3c72 ("mm/rmap: fix missing swap_free() in
try_to_unmap() after arch_unmap_one() failed")
This patch is a backport of the following upstream commit:
commit 935d4f0c6dc8b3533e6e39346de7389a84490178
Author: Ryan Roberts <ryan.roberts@arm.com>
Date: Fri Sep 22 12:58:03 2023 +0100
mm: hugetlb: add huge page size param to set_huge_pte_at()
Patch series "Fix set_huge_pte_at() panic on arm64", v2.
This series fixes a bug in arm64's implementation of set_huge_pte_at(),
which can result in an unprivileged user causing a kernel panic. The
problem was triggered when running the new uffd poison mm selftest for
HUGETLB memory. This test (and the uffd poison feature) was merged for
v6.5-rc7.
Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
(correctly this time) to get it backported to v6.5, where the issue first
showed up.
Description of Bug
==================
arm64's huge pte implementation supports multiple huge page sizes, some of
which are implemented in the page table with multiple contiguous entries.
So set_huge_pte_at() needs to work out how big the logical pte is, so that
it can also work out how many physical ptes (or pmds) need to be written.
It previously did this by grabbing the folio out of the pte and querying
its size.
However, there are cases when the pte being set is actually a swap entry.
But this also used to work fine, because for huge ptes, we only ever saw
migration entries and hwpoison entries. And both of these types of swap
entries have a PFN embedded, so the code would grab that and everything
still worked out.
But over time, more calls to set_huge_pte_at() have been added that set
swap entry types that do not embed a PFN. And this causes the code to go
bang. The triggering case is for the uffd poison test, commit
99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
added in v6.5-rc7. Although review shows that there are other call sites
that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
on arm64 because arm64 doesn't support UFFD WP.
If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
it will dereference a bad pointer in page_folio():
static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
	VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
	return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}
Fix
===
The simplest fix would have been to revert the dodgy cleanup commit
18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
things have moved on, this would have required an audit of all the new
set_huge_pte_at() call sites to see if they should be converted to
set_huge_swap_pte_at(). As per the original intent of the change, it
would also leave us open to future bugs when people invariably get it
wrong and call the wrong helper.
So instead, I've added a huge page size parameter to set_huge_pte_at().
This means that the arm64 code has the size in all cases. It's a bigger
change, due to needing to touch the arches that implement the function,
but it is entirely mechanical, so in my view, low risk.
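To illustrate why an explicit size is enough, here is a rough userspace sketch of how a huge page size could be mapped to the number of page table entries to write. It is not the arm64 implementation: the `SKETCH_*` constants and `sketch_num_entries()` are hypothetical names, and the geometry (16 contiguous entries per level, 4K granule) is an assumption that does not hold for other granule sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed geometry for a 4K granule; other granules differ. */
#define SKETCH_PAGE_SIZE	(1UL << 12)		/* 4K */
#define SKETCH_CONT_PTE_SIZE	(16 * SKETCH_PAGE_SIZE)	/* 64K */
#define SKETCH_PMD_SIZE		(1UL << 21)		/* 2M */
#define SKETCH_CONT_PMD_SIZE	(16 * SKETCH_PMD_SIZE)	/* 32M */
#define SKETCH_PUD_SIZE		(1UL << 30)		/* 1G */

/*
 * Hypothetical helper: return how many page table entries back one huge
 * page of size sz, and store the size covered by each entry in *pgsize.
 * Returns 0 (and sets *pgsize to 0) for a size this model does not know.
 * Note that nothing here looks at the pte contents, so the answer is the
 * same whether the pte is present or holds a swap entry without a PFN.
 */
static size_t sketch_num_entries(unsigned long sz, unsigned long *pgsize)
{
	assert(pgsize != NULL);

	switch (sz) {
	case SKETCH_PUD_SIZE:
		*pgsize = SKETCH_PUD_SIZE;
		return 1;
	case SKETCH_CONT_PMD_SIZE:
		*pgsize = SKETCH_PMD_SIZE;
		return 16;
	case SKETCH_PMD_SIZE:
		*pgsize = SKETCH_PMD_SIZE;
		return 1;
	case SKETCH_CONT_PTE_SIZE:
		*pgsize = SKETCH_PAGE_SIZE;
		return 16;
	default:
		*pgsize = 0;
		return 0;
	}
}
```

Because the caller supplies the size, the helper never needs to recover a folio from the entry being written, which is the step that failed for PFN-less swap entries.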
I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
s390, sparc (and additionally x86_64). I've additionally booted and run
mm selftests against arm64, where I observe the uffd poison test is fixed,
and there are no other regressions.
This patch (of 2):
In order to fix a bug, arm64 needs to be told the size of the huge page
for which the pte is being set in set_huge_pte_at(). Provide for this by
adding an `unsigned long sz` parameter to the function. This follows the
same pattern as huge_pte_clear().
This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, parisc, powerpc,
riscv, s390, sparc). The actual arm64 bug will be fixed in a separate
commit.
No behavioral changes intended.
Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx]
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change]
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit d5db4f9df9397d398256a2e33ad63c39c213b990
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Thu Jul 13 04:55:09 2023 +0100
migrate: use folio_set_bh() instead of set_bh_page()
This function was converted before folio_set_bh() existed. Catch up to
the new API.
Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56023
Conflicts: A context diff in alloc_hugetlb_folio() hunk of mm/hugetlb.c
due to the presence of a later upstream commit b76b46902c2d
("mm/hugetlb: fix missing hugetlb_lock for resv uncharge").
commit 8cba9576df601c384abd334a503c3f6e1e29eefb
Author: Nhat Pham <nphamcs@gmail.com>
Date: Fri, 6 Oct 2023 11:46:28 -0700
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic pages, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. A similar procedure is applied to other memory
limits (e.g. memory.low). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in correcting the memory limits (e.g.
when hugetlb usage changes between consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when they are used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
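The charge-on-use, uncharge-on-free rule and the failure caveat above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct sketch_memcg`, `sketch_charge()` and `sketch_uncharge()` are hypothetical names, reclaim is not modelled, and a `false` return stands in for the case where the kernel would deliver SIGBUS to the faulting task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified memory cgroup: usage and a hard limit. */
struct sketch_memcg {
	unsigned long usage;	/* bytes currently charged */
	unsigned long max;	/* hard limit, in the spirit of memory.max */
};

/*
 * Charge bytes when a hugetlb folio is put to use. Returns false when the
 * limit would be exceeded; no reclaim attempt is modelled here.
 */
static bool sketch_charge(struct sketch_memcg *cg, unsigned long bytes)
{
	assert(cg != NULL);
	assert(cg->usage <= cg->max);

	if (bytes > cg->max - cg->usage)
		return false;

	cg->usage += bytes;
	return true;
}

/* Uncharge when the folio is freed back to the hugetlb pool. */
static void sketch_uncharge(struct sketch_memcg *cg, unsigned long bytes)
{
	assert(cg != NULL);
	assert(bytes <= cg->usage);

	cg->usage -= bytes;
}
```

Note that the pool itself never appears in the model: a charge can fail while free huge pages remain, which matches the caveat that the cgroup limit, not the pool size, decides.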
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 994ec4e29b3de188d11fe60d17403285fcc8917a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Jun 21 17:45:57 2023 +0100
mm: remove unnecessary pagevec includes
These files no longer need pagevec.h, mostly due to function declarations
being moved out of it.
Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 0b52c420350e8f9873ba62768cd8246827184408
Author: Jan Glauber <jglauber@digitalocean.com>
Date: Mon Jun 19 12:33:51 2023 +0200
mm: fix shmem THP counters on migration
The per node numa_stat values for shmem don't change on page migration for
THP:
grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:
shmem N0=1092616192 N1=10485760
shmem_thp N0=1092616192 N1=10485760
migratepages 9181 0 1:
shmem N0=0 N1=1103101952
shmem_thp N0=1092616192 N1=10485760
Fix that by updating shmem_thp counters likewise to shmem counters on page
migration.
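The intended accounting can be sketched as a small userspace model. The names `struct sketch_node_stat` and `sketch_move_shmem_stats()` are hypothetical, and the `pmd_mappable` flag merely stands in for the folio test mentioned in the note below; this is not the kernel's statistics code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node counters, in pages. */
struct sketch_node_stat {
	long shmem;
	long shmem_thp;
};

/*
 * Move the accounting for nr pages of one shmem folio from the old node
 * to the new node. The shmem_thp counter follows the shmem counter only
 * when the folio is PMD-mappable.
 */
static void sketch_move_shmem_stats(struct sketch_node_stat *from,
				    struct sketch_node_stat *to,
				    long nr, bool pmd_mappable)
{
	assert(from != NULL && to != NULL);

	from->shmem -= nr;
	to->shmem += nr;

	if (pmd_mappable) {
		from->shmem_thp -= nr;
		to->shmem_thp += nr;
	}
}
```

Before the fix, only the first pair of updates had an equivalent, which is why shmem_thp stayed on N0 in the output above while shmem moved to N1.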
[jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge]
Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com
Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
* drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
these are already applied via RHEL commit 26418f1a34 ("Merge DRM
changes from upstream v6.4..v6.5")
* kernel/events/uprobes.c: minor context difference due to backport of upstream
commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
as part of mmu_notifier_invalidate_range_end()")
* mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
* mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
within a BUG(): inconsistent pte comparison")
* mm/ksm.c: context conflicts and differences on the 1st hunk are due to
out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
give up if pte_offset_map[_lock]() fails") being compensated for only now.
* mm/memory.c: minor context difference on the 35th hunk due to backport of
upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
* mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
* mm/migrate.c: minor context difference on the 2nd hunk due to backport of
upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
indicates write permissions")
* mm/migrate_device.c: minor context difference on the 5th hunk due to backport
of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
TLBs as part of mmu_notifier_invalidate_range_end()")
* mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
entry type for hwpoisoned swapcache page")
* mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
lru_gen_look_around()")
This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date: Mon Jun 12 16:15:45 2023 +0100
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
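As an illustration of the two ideas above (a single READ_ONCE()-style load, and caching the result instead of dereferencing repeatedly), here is a userspace sketch. The type `sketch_pte_t`, the bit definitions and both functions are hypothetical; the volatile cast only approximates what READ_ONCE() does for a word-sized value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pte type and bit layout, for illustration only. */
typedef struct { unsigned long val; } sketch_pte_t;

#define SKETCH_PTE_PRESENT	(1UL << 0)
#define SKETCH_PTE_DIRTY	(1UL << 1)

/*
 * Default-style accessor: one volatile load of the entry, so the compiler
 * cannot re-read or tear the value if hardware updates it concurrently.
 */
static sketch_pte_t sketch_ptep_get(const sketch_pte_t *ptep)
{
	sketch_pte_t pte;

	assert(ptep != NULL);
	pte.val = *(const volatile unsigned long *)&ptep->val;
	return pte;
}

/*
 * Caller pattern after the hand-editing step: read the entry once into a
 * local and test the local, instead of calling the accessor per check.
 */
static bool sketch_present_and_dirty(const sketch_pte_t *ptep)
{
	sketch_pte_t pte = sketch_ptep_get(ptep);

	return (pte.val & SKETCH_PTE_PRESENT) && (pte.val & SKETCH_PTE_DIRTY);
}
```

Testing a local copy also guarantees both checks see the same snapshot of the entry, which matters more once an architecture overrides the accessor with something costlier than a plain load.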
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
* dropped hunk for Documentation/translations/zh_CN/mm/page_migration.rst.
This doc file was introduced upstream via pre-v6.0 (v6.0-rc1) merge
commit 6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") which was never
picked by previous backport attempts.
This patch is a backport of the following upstream commit:
commit 4e096ae1801e24b338e02715c65c3ffa8883ba5d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Sat May 13 01:11:01 2023 +0100
mm: convert migrate_pages() to work on folios
Almost all of the callers & implementors of migrate_pages() were already
converted to use folios. compaction_alloc() & compaction_free() are
trivial to convert as part of this patch and not worth splitting out.
Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 124abced647306aa3badb5d472c3616de23f180a
Author: Huang Ying <ying.huang@intel.com>
Date: Wed May 10 11:18:29 2023 +0800
migrate_pages_batch: simplify retrying and failure counting of large folios
After recent changes to the retrying and failure counting in
migrate_pages_batch(), it was found that it's unnecessary to count
retrying and failure for normal, large, and THP folios separately,
because we don't use the retry and failure counts of large folios directly.
So, in this patch, we simplify the retry and failure counting of large
folios by counting retries and failures of normal and large folios
together. This reduces the line count.
Previously, in migrate_pages_batch() we needed to track whether the source
folio is large/THP before splitting. So is_large is used to cache
folio_test_large() result. Now, we don't need that variable any more
because we don't count retrying and failure of large folios (only counting
that of THP folios). So, in this patch, is_large is removed to simplify
the code.
This is just code cleanup, no functionality changes are expected.
Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 4bb6dc79d987b243d65c70c5029e51e719cfb94b
Author: Douglas Anderson <dianders@chromium.org>
Date: Fri Apr 28 13:54:38 2023 -0700
migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
finish quickly but not for things that will take a long time. Exactly how
long is too long is not well defined, but waits of tens of milliseconds are
likely non-ideal.
When putting a Chromebook under memory pressure (opening over 90 tabs on a
4GB machine) it was fairly easy to see delays waiting for some locks in
the kcompactd code path of > 100 ms. While the laptop wasn't amazingly
usable in this state, it was still limping along and this state isn't
something artificial. Sometimes we simply end up with a lot of memory
pressure.
Putting the same Chromebook under memory pressure while it was running
Android apps (though not stressing them) showed a much worse result (NOTE:
this was on an older kernel but the codepaths here are similar). Android
apps on ChromeOS currently run from a 128K-block, zlib-compressed,
loopback-mounted squashfs disk. If we get a page fault from something
backed by the squashfs filesystem we could end up holding a folio lock
while reading enough from disk to decompress 128K (and then decompressing
it using the somewhat slow zlib algorithms). That reading goes through
the ext4 subsystem (because it's a loopback mount) before eventually
ending up in the block subsystem. This extra jaunt adds extra overhead.
Without much work I could see cases where we ended up blocked on a folio
lock for over a second. With more extreme memory pressure I could see up
to 25 seconds.
We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
two locks that were seen to be slow [1] and that generated much
discussion. After discussion, it was decided that we should avoid waiting
for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
IO. We'll continue with the unbounded wait for the more full SYNC modes.
With this change, I couldn't see any slow waits on these locks with my
previous testcases.
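The lock-wait decision described above can be sketched in userspace C as follows. This is only an illustration, not the patch itself: the enum, the function name, and the use of an "uptodate" flag as the hint that a lock is held for IO are assumptions made for the example.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's migrate_mode values. */
enum migrate_mode_sketch {
	SKETCH_MIGRATE_ASYNC,
	SKETCH_MIGRATE_SYNC_LIGHT,
	SKETCH_MIGRATE_SYNC,
};

/*
 * Decide whether migration should sleep on a contended lock.
 * "uptodate" is an assumed hint: a locked object that is not yet
 * uptodate is treated as being locked for IO.
 */
static bool should_wait_for_lock(enum migrate_mode_sketch mode, bool uptodate)
{
	if (mode == SKETCH_MIGRATE_ASYNC)
		return false;	/* async mode never blocks */
	if (mode == SKETCH_MIGRATE_SYNC_LIGHT && !uptodate)
		return false;	/* lock looks held for IO: skip it */
	return true;		/* short wait (light) or full sync wait */
}
```

With this shape, only the full SYNC modes keep the unbounded wait, which matches the behaviour described in the message.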
NOTE: The reason I started digging into this originally isn't because some
benchmark had gone awry, but because we've received in-the-field crash
reports where we have a hung task waiting on the page lock (which is the
equivalent code path on old kernels). While the root cause of those
crashes is likely unrelated and won't be fixed by this patch, analyzing
those crash reports did point out these very long waits seemed like
something good to fix. With this patch we should no longer hang waiting
on these locks, but presumably the system will still be in a bad shape and
hang somewhere else.
[1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid
Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-48221
This patch is a backport of the following upstream commit:
commit 3c811f7883c4ee5a34ba4354381bde062888dd31
Author: David Hildenbrand <david@redhat.com>
Date: Tue Apr 11 16:25:10 2023 +0200
mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"
This reverts commit 96a9c287e25d ("mm/migrate: fix wrongly apply write bit
after mkdirty on sparc64").
Now that sparc64 mkdirty handling is fixed and no longer sets a PTE/PMD
writable that shouldn't be writable, let's revert the temporary fix.
The mkdirty mm selftest still passes with this change on sparc64.
Note that loongarch handling was fixed in commit bf2f34a506e6 ("LoongArch:
Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()").
Link: https://lkml.kernel.org/r/20230411142512.438404-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
commit 229e2253766c7cdfe024f1fe280020cc4711087c
Author: Gregory Price <gourry.memverge@gmail.com>
Date: Tue Oct 3 10:48:56 2023 -0400
mm/migrate: fix do_pages_move for compat pointers
do_pages_move does not handle compat pointers for the page list
correctly. Add in_compat_syscall check and appropriate get_user fetch
when iterating the page list.
It makes the syscall in compat mode (32-bit userspace, 64-bit kernel)
work the same way as the native 32-bit syscall again, restoring the
behavior before my broken commit 5b1b561ba73c ("mm: simplify
compat_sys_move_pages").
More specifically, my patch moved the parsing of the 'pages' array from
the main entry point into do_pages_stat(), which left the syscall
working correctly for the 'stat' operation (nodes = NULL), while the
'move' operation (nodes != NULL) is now missing the conversion and
interprets 'pages' as an array of 64-bit pointers instead of the
intended 32-bit userspace pointers.
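The width mismatch can be illustrated with a small userspace C sketch. It is not the kernel code: fetch_page_addr() is a hypothetical helper, and memcpy() stands in for the get_user() fetch mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fetch entry i of the 'pages' array. In compat mode each entry is a
 * 32-bit user pointer; otherwise it is a native 64-bit pointer.
 * memcpy() is a stand-in for the kernel's get_user().
 */
static uint64_t fetch_page_addr(const void *pages, size_t i, bool compat)
{
	const unsigned char *base = pages;
	uint64_t p64;

	if (compat) {
		uint32_t p32;

		memcpy(&p32, base + i * sizeof(p32), sizeof(p32));
		return p32;	/* zero-extend the 32-bit pointer */
	}
	memcpy(&p64, base + i * sizeof(p64), sizeof(p64));
	return p64;
}
```

Reading a compat array with the 64-bit stride fuses two neighbouring 32-bit entries into one bogus address, which is the regression described above.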
It is possible that nobody noticed this bug because the few
applications that actually call move_pages are unlikely to run in
compat mode because of their large memory requirements, but this
clearly fixes a user-visible regression and should have been caught by
ltp.
Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com
Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages")
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
Conflicts: mm/huge_memory.c - We already have
96a9c287e25d ("mm/migrate: fix wrongly apply write bit after mkdirty on sparc64")
so don't add check for pmd_swp_uffd_wp or call pmd_wrprotect
We also have
161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")
so call pte_mkwrite with a vma
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit f3ebdf042df4e08bab1d5f8bf1c4b959d8741c10
Author: David Hildenbrand <david@redhat.com>
Date: Tue Apr 18 16:21:13 2023 +0200
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
Staring at the comment "Recheck VMA as permissions can change since
migration started" in remove_migration_pte() can result in confusion,
because if the source PTE/PMD indicates write permissions, then there
should be no need to check VMA write permissions when restoring migration
entries or PTE-mapping a PMD.
Commit d3cb8bf608 ("mm: migrate: Close race between migration completion
and mprotect") introduced the maybe_mkwrite() handling in
remove_migration_pte() in 2014, stating that a race between mprotect() and
migration finishing would be possible, and that we could end up with a
writable PTE that should be readable.
However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
and then walks the page tables to (a) set all present writable PTEs to
read-only and (b) convert all writable migration entries to readable
migration entries. While walking the page tables and modifying the
entries, migration code has to grab the PT locks to synchronize against
concurrent page table modifications.
Assuming migration would find a writable migration entry (while holding
the PT lock) and replace it with a writable present PTE, surely mprotect()
code didn't stumble over the writable migration entry yet (converting it
into a readable migration entry) and would instead wait for the PT lock to
convert the now present writable PTE into a read-only PTE. As mprotect()
didn't finish yet, the behavior is just like migration didn't happen: a
writable PTE will be converted to a read-only PTE.
So it's fine to rely on the writability information in the source PTE/PMD
and not recheck against the VMA as long as we're holding the PT lock to
synchronize with anyone who concurrently wants to downgrade write
permissions (like mprotect()) by first adjusting vma->vm_flags /
vma->vm_page_prot to then walk over the page tables to adjust the page
table entries.
Running test cases that should reveal such races -- mprotect(PROT_READ)
racing with page migration or THP splitting -- for multiple hours did not
reveal an issue with this cleanup.
Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit 851ae6424697d1c4f085cb878c88168923ebcad1
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Apr 17 07:59:29 2023 +0800
migrate_pages_batch: fix statistics for longterm pin retry
In commit fd4a7ac32918 ("mm: migrate: try again if THP split is failed due
to page refcnt"), if the THP splitting fails due to page reference count,
we will retry to improve the migration success rate. But the failed
splitting is counted as both a migration failure and a migration retry,
which causes duplicated failure counting. So, in this patch, this is fixed by
undoing the failure counting if we decide to retry. The patch is tested
via failure injection.
Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com
Fixes: fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 2ef7dbb269902bde34c82f027806992195d1d1ee
Author: Huang Ying <ying.huang@intel.com>
Date: Fri Mar 3 11:01:55 2023 +0800
migrate_pages: try migrate in batch asynchronously firstly
When we have locked more than one folio, we cannot wait for the lock or bit
(e.g., page lock, buffer head lock, writeback bit) synchronously.
Otherwise deadlock may be triggered. This makes it hard to batch the
synchronous migration directly.
This patch re-enables batching synchronous migration by first trying to
migrate in batch asynchronously. Any folios that fail to be migrated
asynchronously are then migrated synchronously one by one.
Tests show that this can restore the TLB flushing batching performance for
synchronous migration effectively.
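The two-pass strategy can be sketched as a userspace C model. All names here (struct folio_sketch, migrate_one(), migrate_sync_sketch()) are hypothetical and only model the control flow, not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical folio stand-in: needs_wait means an async attempt fails. */
struct folio_sketch {
	bool needs_wait;
	bool migrated;
};

/* Try one folio; an attempt that may not wait gives up instead of blocking. */
static bool migrate_one(struct folio_sketch *f, bool may_wait)
{
	if (f->needs_wait && !may_wait)
		return false;
	f->migrated = true;
	return true;
}

/* Returns how many folios needed the synchronous fallback. */
static size_t migrate_sync_sketch(struct folio_sketch *folios, size_t n)
{
	size_t i, nr_fallback = 0;

	/* Pass 1: the whole batch, asynchronously, so nothing waits
	 * while other folios of the batch are locked. */
	for (i = 0; i < n; i++)
		migrate_one(&folios[i], false);

	/* Pass 2: leftovers one by one, where waiting is safe. */
	for (i = 0; i < n; i++) {
		if (folios[i].migrated)
			continue;
		migrate_one(&folios[i], true);
		nr_fallback++;
	}
	return nr_fallback;
}
```

In this model the common case stays batched, and only the folios that would have required waiting pay the one-by-one cost.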
Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit a21d2133215b58fbf254ea2bb77eb3143ffedf60
Author: Huang Ying <ying.huang@intel.com>
Date: Fri Mar 3 11:01:54 2023 +0800
migrate_pages: move split folios processing out of migrate_pages_batch()
This simplifies the code logic and reduces the line count.
Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit fb3592c41a4427601f9643b2a84e55bb99f5cd7c
Author: Huang Ying <ying.huang@intel.com>
Date: Fri Mar 3 11:01:53 2023 +0800
migrate_pages: fix deadlock in batched migration
Patch series "migrate_pages: fix deadlock in batched synchronous
migration", v2.
Two deadlock bugs were reported for the migrate_pages() batching series.
Thanks Hugh and Pengfei. Analysis shows that if we have locked some other
folios besides the one we are migrating, it's not safe in general to wait
synchronously, for example, to wait for the writeback to complete or to wait
to lock the buffer head.
So 1/3 fixes the deadlock in a simple way, where the batching support for
the synchronous migration is disabled. The change is straightforward and
easy to understand. Then 3/3 re-introduces the batching for
synchronous migration by trying to migrate asynchronously in batch
optimistically, then falling back to migrating synchronously one by one for
fail-to-migrate folios. Tests show that this can restore the TLB flushing
batching performance for synchronous migration effectively.
This patch (of 3):
Two deadlock bugs were reported for the migrate_pages() batching series.
Thanks Hugh and Pengfei! For example, in the following deadlock trace
snippet,
INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
Workqueue: loop4 loop_rootcg_workfn
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? __pfx_wake_page_function+0x10/0x10
__filemap_get_folio+0x73d/0x770
shmem_get_folio_gfp+0x1fd/0xc80
shmem_write_begin+0x91/0x220
generic_perform_write+0x10e/0x2e0
__generic_file_write_iter+0x17e/0x290
? generic_write_checks+0x12b/0x1a0
generic_file_write_iter+0x97/0x180
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_readv_writev+0x13c/0x210
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_write+0xf6/0x330
vfs_iter_write+0x46/0x70
loop_process_work+0x723/0xfe0
loop_rootcg_workfn+0x28/0x40
process_one_work+0x3cc/0x8d0
worker_thread+0x66/0x630
? __pfx_worker_thread+0x10/0x10
kthread+0x153/0x190
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>
INFO: task repro:1023 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? compaction_alloc+0x77/0x1150
? __pfx_wake_page_function+0x10/0x10
folio_wait_bit+0x30/0x40
folio_wait_writeback+0x2e/0x1e0
migrate_pages_batch+0x555/0x1ac0
? __pfx_compaction_alloc+0x10/0x10
? __pfx_compaction_free+0x10/0x10
? __this_cpu_preempt_check+0x17/0x20
? lock_is_held_type+0xe6/0x140
migrate_pages+0x100e/0x1180
? __pfx_compaction_free+0x10/0x10
? __pfx_compaction_alloc+0x10/0x10
compact_zone+0xe10/0x1b50
? lock_is_held_type+0xe6/0x140
? check_preemption_disabled+0x80/0xf0
compact_node+0xa3/0x100
? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
? _find_first_bit+0x7b/0x90
sysctl_compaction_handler+0x5d/0xb0
proc_sys_call_handler+0x29d/0x420
proc_sys_write+0x2b/0x40
vfs_write+0x3a3/0x780
ksys_write+0xb7/0x180
__x64_sys_write+0x26/0x30
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f3a2471f59d
RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The page migration task holds the lock of the shmem folio A, and is
waiting for the writeback of the folio B of the file system on the loop block
device to complete. Meanwhile, the loop worker task which writes back the
folio B is waiting to lock the shmem folio A, because the folio A backs
the folio B in the loop device. Thus deadlock is triggered.
In general, if we have locked some other folios besides the one we are
migrating, it's not safe to wait synchronously, for example, to wait for the
writeback to complete or to wait to lock the buffer head.
To fix the deadlock, in this patch, we avoid batching the page migration
except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting
is avoided.
The fix can be improved further. We will do that as soon as possible.
Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit e77d587a2c04e82c6a0dffa4a32c874a4029385d
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat Mar 4 14:03:27 2023 -0800
mm: avoid gcc complaint about pointer casting
The migration code ends up temporarily stashing information of the wrong
type in unused fields of the newly allocated destination folio. That
all works fine, but gcc does complain about the pointer type mis-use:
mm/migrate.c: In function ‘__migrate_folio_extract’:
mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’
1050 | *anon_vmap = (void *)dst->mapping;
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
and gcc is actually right to complain since it really doesn't understand
that this is a very temporary special case where this is ok.
This could be fixed in different ways by just obfuscating the assignment
sufficiently that gcc doesn't see what is going on, but the truly
"proper C" way to do this is by explicitly using a union.
Using unions for type conversions like this is normally hugely ugly and
syntactically nasty, but this really is one of the few cases where we
want to make it clear that we're not doing type conversion, we're really
re-using the value bit-for-bit just using another type.
IOW, this should not become a common pattern, but in this one case using
that odd union is probably the best way to document to the compiler what
is conceptually going on here.
[ Side note: there are valid cases where we convert pointers to other
pointer types, notably the whole "folio vs page" situation, where the
types actually have fundamental commonalities.
The fact that the gcc note is limited to just randomized structures
means that we don't see equivalent warnings for those cases, but it
might also mean that we miss other cases where we do play these kinds
of dodgy games, and this kind of explicit conversion might be a good
idea. ]
I verified that at least for an allmodconfig build on x86-64, this
generates the exact same code, apart from line numbers and assembler
comment changes.
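A minimal userspace sketch of the union approach follows. The struct and function names are hypothetical stand-ins chosen for the example; only the idea of reusing the pointer value bit-for-bit through a union comes from the text above.

```c
#include <stddef.h>

/* Opaque stand-ins for the two unrelated kernel types. */
struct anon_vma;
struct address_space;

/* One storage slot, viewed as either pointer type. */
union migration_ptr_sketch {
	struct anon_vma *anon_vma;
	struct address_space *mapping;
};

/* Hypothetical destination folio with a temporarily unused field. */
struct dst_folio_sketch {
	struct address_space *mapping;
};

/* Stash an anon_vma pointer in the mapping field without a cast. */
static void record_anon_vma(struct dst_folio_sketch *dst, struct anon_vma *av)
{
	union migration_ptr_sketch ptr = { .anon_vma = av };

	dst->mapping = ptr.mapping;
}

/* Recover the stashed pointer and clear the field again. */
static struct anon_vma *extract_anon_vma(struct dst_folio_sketch *dst)
{
	union migration_ptr_sketch ptr = { .mapping = dst->mapping };

	dst->mapping = NULL;
	return ptr.anon_vma;
}
```

The union makes the reuse explicit in the type system, so no pointer cast between the two structure types appears in the code.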
Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit cd7755800eb54e8522f5e51f4e71e6494c1f1572
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed Feb 15 18:39:37 2023 +0800
mm: change to return bool for isolate_movable_page()
Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
will care about the negative return value, thus we can convert the
isolate_movable_page() to return a boolean value to make the code more
clear when checking the movable page isolation state.
No functional changes intended.
[akpm@linux-foundation.org: remove unneeded comment, per Matthew]
Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 9747b9e92418b61c2281561e0651803f1fad0159
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed Feb 15 18:39:36 2023 +0800
mm: hugetlb: change to return bool for isolate_hugetlb()
Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
care about the negative value, thus we can convert the isolate_hugetlb()
to return a boolean value to make code more clear when checking the
hugetlb isolation state. Also convert the 2 users which do consider
the negative value returned by isolate_hugetlb().
No functional changes intended.
[akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed Feb 15 18:39:35 2023 +0800
mm: change to return bool for isolate_lru_page()
The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
care about the negative error of isolate_lru_page(), except one user in
add_page_for_migration(). So we can convert the isolate_lru_page() to
return a boolean value, which can help to make the code more clear when
checking the return value of isolate_lru_page().
Also convert all users' logic of checking the isolation state.
No functional changes intended.
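The calling convention change can be sketched in userspace C as below. The types and function names are hypothetical; the sketch only shows how a caller that still needs an errno can map the boolean result to -EBUSY itself.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page stand-in: on_lru says whether isolation can succeed. */
struct page_sketch {
	bool on_lru;
};

/* New style: true when the page was isolated, false otherwise. */
static bool isolate_lru_page_sketch(struct page_sketch *page)
{
	if (!page->on_lru)
		return false;
	page->on_lru = false;	/* taken off the LRU */
	return true;
}

/* A caller that must report an error code converts the result itself. */
static int add_page_sketch(struct page_sketch *page)
{
	if (!isolate_lru_page_sketch(page))
		return -EBUSY;
	return 0;
}
```

Callers that only care about success or failure can test the boolean directly, which is the readability gain the message describes.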
Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 6f7d760e86fa84862d749e36ebd29abf31f4f883
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:44 2023 +0800
migrate_pages: move THP/hugetlb migration support check to simplify code
This is a code cleanup patch, no functionality change is expected. After
the change, the line count is reduced, especially in the long
migrate_pages_batch().
Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 7e12beb8ca2ac98b2ec42e0ea4b76cdc93b58654
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:43 2023 +0800
migrate_pages: batch flushing TLB
The TLB flushing will cost quite some CPU cycles during the folio
migration in some situations, for example, when migrating a folio of a
process with multiple active threads that run on multiple CPUs. After
batching the _unmap and _move in migrate_pages(), the TLB flushing can be
batched easily with the existing TLB flush batching mechanism. This patch
implements that.
We use the following test case to test the patch.
On a 2-socket Intel server,
- Run pmbench memory accessing benchmark
- Run `migratepages` to migrate pages of pmbench between node 0 and
node 1 back and forth.
With the patch, the number of TLB flushing IPIs drops by 99.1% during the
test and the number of pages migrated successfully per second increases by
291.7%.
Haoxin helped to test the patchset on an ARM64 server with 128 cores, 2
NUMA nodes. Test results show that the page migration performance
increases up to 78%.
NOTE: TLB flushing is batched only for normal folios, not for THP folios,
because the overhead of TLB flushing for THP folios is much lower than
that for normal folios (about 1/512 on x86 platform).
Link: https://lkml.kernel.org/r/20230213123444.155149-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit ebe75e4751063dce6f61b579b43de86dcf7b7462
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:42 2023 +0800
migrate_pages: share more code between _unmap and _move
This is a code cleanup patch to reduce the duplicated code between the
_unmap and _move stages of migrate_pages(). No functionality change is
expected.
Link: https://lkml.kernel.org/r/20230213123444.155149-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 80562ba0d8378e89fe5836c28ea56c2aab3014e8
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:41 2023 +0800
migrate_pages: move migrate_folio_unmap()
Just move the position of the functions. There is no functional
change. This makes the next patch easier to review by putting the
code near its position in that patch.
Link: https://lkml.kernel.org/r/20230213123444.155149-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 5dfab109d5193e6c224d96cabf90e9cc2c039884
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:40 2023 +0800
migrate_pages: batch _unmap and _move
In this patch, the _unmap and _move stages of folio migration are
batched. That is, previously the flow was,
  for each folio
    _unmap()
    _move()
Now, it is,
  for each folio
    _unmap()
  for each folio
    _move()
This makes it possible to batch the TLB flushing and to use a hardware
accelerator to copy folios between the batched _unmap and batched _move
stages.
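The two loop shapes above can be sketched in userspace C. This is only an
illustration, not the kernel code: struct sim_folio, the sim_* helpers and
the sim_tlb_flushes counter are hypothetical names, and the single flush
in the batched variant models the opportunity the message describes rather
than anything this patch itself implements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct sim_folio {
	int mapped;	/* 1 while page table entries still point at it */
	int moved;	/* 1 once the contents were copied to the destination */
};

/* Counts simulated TLB flushes so the two schemes can be compared. */
static int sim_tlb_flushes;

static void sim_unmap(struct sim_folio *f)
{
	f->mapped = 0;
}

static void sim_move(struct sim_folio *f)
{
	assert(!f->mapped);	/* a folio must be unmapped before it is moved */
	f->moved = 1;
}

/* Old shape: unmap and move each folio in turn, flushing every time. */
static void sim_migrate_one_by_one(struct sim_folio *folios, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		sim_unmap(&folios[i]);
		sim_tlb_flushes++;
		sim_move(&folios[i]);
	}
}

/* New shape: unmap the whole batch, flush once, then move the batch. */
static void sim_migrate_batched(struct sim_folio *folios, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sim_unmap(&folios[i]);
	if (n)
		sim_tlb_flushes++;
	for (size_t i = 0; i < n; i++)
		sim_move(&folios[i]);
}
```

In this model, four folios cost four flushes with the first function and
one flush with the second, while every folio still ends up moved.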
Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 64c8902ed4418317cd416c566f896bd4a92b2efc
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:39 2023 +0800
migrate_pages: split unmap_and_move() to _unmap() and _move()
This is a preparation patch to batch the folio unmapping and moving.
In this patch, unmap_and_move() is split into migrate_folio_unmap() and
migrate_folio_move(), so that _unmap() and _move() can be batched in
different loops later. To pass some information between unmap and move,
the previously unused dst->mapping and dst->private are used.
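The hand-off through the two destination fields can be sketched as a
record/extract pair. This is a userspace sketch under assumptions: the
sim_* types and helpers are hypothetical, and the choice of which field
carries the anon_vma pointer and which carries the "was mapped" flag is
an assumption of the sketch, not something stated in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the kernel's anon_vma or folio types. */
struct sim_anon_vma {
	int id;
};

struct sim_folio {
	void *mapping;	/* otherwise unused on a not-yet-visible destination */
	void *private;	/* likewise unused until the move completes */
};

/* Unmap stage: stash the state the move stage will need later. */
static void sim_record(struct sim_folio *dst, int was_mapped,
		       struct sim_anon_vma *anon_vma)
{
	assert(dst != NULL);
	dst->mapping = anon_vma;
	dst->private = (void *)(uintptr_t)was_mapped;
}

/* Move stage: read the state back and leave the fields clean. */
static void sim_extract(struct sim_folio *dst, int *was_mapped,
			struct sim_anon_vma **anon_vma)
{
	assert(dst != NULL);
	*anon_vma = dst->mapping;
	*was_mapped = (int)(uintptr_t)dst->private;
	dst->mapping = NULL;
	dst->private = NULL;
}
```

Clearing both fields on extract keeps the borrowed storage from leaking
stale values into the destination once it becomes a live folio.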
Link: https://lkml.kernel.org/r/20230213123444.155149-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 42012e0436d44aeb2e68f11a28ddd0ad3f38b61f
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:38 2023 +0800
migrate_pages: restrict number of pages to migrate in batch
This is a preparation patch to batch the folio unmapping and moving for
non-hugetlb folios.
If folio unmapping is batched, all folios to be migrated are unmapped
before the contents and flags of the folios are copied. If too many
pages are passed to migrate_pages() at once, the affected processes are
stopped for too long, which causes excessive latency. For example, the
migrate_pages() syscall calls migrate_pages() with all folios of a
process. To avoid this, this patch restricts the number of pages
migrated in one batch to no more than HPAGE_PMD_NR. That is, the impact
is at the same level as THP migration.
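The batch limit can be sketched as a helper that walks a list of folio
sizes and cuts it into batches by total page count. This is a userspace
sketch, not the kernel loop: sim_next_batch(), sim_count_batches() and
SIM_MAX_BATCH_PAGES are hypothetical names, and the value 512 assumes
HPAGE_PMD_NR on a configuration with 4 KiB base pages and 2 MiB THPs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for HPAGE_PMD_NR (4 KiB pages, 2 MiB THP). */
#define SIM_MAX_BATCH_PAGES 512

/*
 * Return how many leading folios fit in one batch. nr_pages[i] is the
 * size of folio i in pages. A batch always takes at least one folio so
 * that the caller makes progress.
 */
static size_t sim_next_batch(const unsigned int *nr_pages, size_t n)
{
	size_t count = 0;
	unsigned int total = 0;

	while (count < n) {
		assert(nr_pages[count] > 0);
		if (count > 0 && total + nr_pages[count] > SIM_MAX_BATCH_PAGES)
			break;
		total += nr_pages[count];
		count++;
	}
	return count;
}

/* Count the batches needed to migrate every folio in the array. */
static size_t sim_count_batches(const unsigned int *nr_pages, size_t n)
{
	size_t batches = 0;

	while (n > 0) {
		size_t taken = sim_next_batch(nr_pages, n);

		nr_pages += taken;
		n -= taken;
		batches++;
	}
	return batches;
}
```

With this cap, the number of pages unmapped at any one time is bounded by
the limit no matter how many folios the caller passes in.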
Link: https://lkml.kernel.org/r/20230213123444.155149-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>