JIRA: https://issues.redhat.com/browse/RHEL-84184
JIRA: https://issues.redhat.com/browse/RHEL-83249
This patch is a backport of the following upstream commit:
commit 5c8525a37b78ddfee84ff6927b8838013ff2521e
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Mon Aug 26 14:58:09 2024 +0800
mm: migrate_device: convert to migrate_device_coherent_folio()
Patch series "mm: finish isolate/putback_lru_page()".
Convert to use more folios in migrate_device.c, then we could remove
isolate_lru_page() and putback_lru_page().
This patch (of 6):
Save a few calls to compound_head() and use folio throughout.
Link: https://lkml.kernel.org/r/20240826065814.1336616-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240826065814.1336616-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
This patch is a backport of the following upstream commit:
commit 62e72d2cf702a5e2fb53d9c46ed900d9384e4a06
Author: Kairui Song <kasong@tencent.com>
Date: Sun Dec 22 20:29:36 2024 +0800
mm, madvise: fix potential workingset node list_lru leaks
Since commit 5abc1e37afa0 ("mm: list_lru: allocate list_lru_one only when
needed"), all list_lru users need to allocate the items using the new
infrastructure that provides list_lru info for slab allocation, ensuring
that the corresponding memcg list_lru is allocated before use.
For workingset shadow nodes (which are xa_node), users are converted to
use the new infrastructure by commit 9bbdc0f32409 ("xarray: use
kmem_cache_alloc_lru to allocate xa_node"). The xas->xa_lru will be set
correctly for filemap users. However, there is a missing case: xa_node
allocations caused by madvise(..., MADV_COLLAPSE).
madvise(..., MADV_COLLAPSE) will also read in the absent parts of file
map, and there will be xa_nodes allocated for the caller's memcg (assuming
it's not rootcg). However, these allocations won't trigger memcg list_lru
allocation because the proper xas info was not set.
If nothing else has allocated other xa_nodes for that memcg to trigger
list_lru creation, and memory pressure starts to evict file pages,
workingset_update_node will try to add these xa_nodes to their
corresponding memcg list_lru, and it does not exist (NULL). So they will
be added to rootcg's list_lru instead.
This shouldn't be a significant issue in practice, but it is indeed
unexpected behavior, and these xa_nodes will not be reclaimed effectively.
And may lead to incorrect counting of the list_lru->nr_items counter.
This problem wasn't exposed until recent commit 28e98022b31ef
("mm/list_lru: simplify reparenting and initial allocation") added a
sanity check: only dying memcg could have a NULL list_lru when
list_lru_{add,del} is called. This problem triggered this WARNING.
So make madvise(..., MADV_COLLAPSE) also call xas_set_lru() to pass the
list_lru which we may want to insert xa_node into later. And move
mapping_set_update to mm/internal.h, and turn into a macro to avoid
including extra headers in mm/internal.h.
Link: https://lkml.kernel.org/r/20241222122936.67501-1-ryncsn@gmail.com
Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node")
Reported-by: syzbot+38a0cbd267eff2d286ff@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/675d01e9.050a0220.37aaf.00be.GAE@google.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
This patch is a backport of the following upstream commit:
commit b64e74e95aa6491b31477e9002aab1d8df3995bf
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Feb 19 07:27:09 2024 +0100
mm: move mapping_set_update out of <linux/swap.h>
mapping_set_update is only used inside mm/. Move mapping_set_update to
mm/internal.h and turn it into an inline function instead of a macro.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-68912
Conflicts: this is a straight backport from v6.6 LTS branch, commit a97fe6889b25
at git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
commit 4080ef1579b2413435413988d14ac8c68e4d42c8
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Fri Nov 15 12:41:55 2024 +0000
mm: unconditionally close VMAs on error
[ Upstream commit 4080ef1579b2413435413988d14ac8c68e4d42c8 ]
Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.
With regards to the important vm_ops->close() callback We have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-68912
Conflicts: this is a straight backport from v6.6 LTS branch, commit cd3ed99fca8c
at git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
commit 3dd6ed34ce1f2356a77fb88edafb5ec96784e3cf
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Fri Nov 15 12:41:54 2024 +0000
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
[ Upstream commit 3dd6ed34ce1f2356a77fb88edafb5ec96784e3cf ]
Patch series "fix error handling in mmap_region() and refactor
(hotfixes)", v4.
mmap_region() is somewhat terrifying, with spaghetti-like control flow and
numerous means by which issues can arise and incomplete state, memory
leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the
VMA is in an inconsistent state.
The patches in this series comprise the minimal changes required to
resolve existing issues in mmap_region() error handling, in order that
they can be hotfixed and backported. There is additionally a follow up
series which goes further, separated out from the v1 series and sent and
updated separately.
This patch (of 5):
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced,
meaning that we need complicated handling to ensure we do not incorrectly
call these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment
that the file->f_ops->mmap() function reports an error by replacing
whatever VMA operations were installed with a dummy empty set of VMA
operations.
We do so through a new helper function internal to mm - mmap_file() -
which is both more logically named than the existing call_mmap() function
and correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
coda_file_operations -> mmap -> coda_file_mmap()
shm_file_operations -> shm_mmap()
shm_file_operations_huge -> shm_mmap()
dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
-> i915_gem_dmabuf_mmap()
None of these callers interact with vm_ops or mappings in a problematic
way on error, quickly exiting out.
Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
* this is a direct port from v6.6 LTS branch backport commit 26273f5f4cf6
("mm: gup: stop abusing try_grab_folio"), due to RHEL9 missing upstream
commit 53e45c4f6d4f ("mm: convert put_devmap_managed_page_refs() to
put_devmap_managed_folio_refs()") and its big accompanying series.
This patch is a backport of the following upstream commit:
commit f442fa6141379a20b48ae3efabee827a3d260787
Author: Yang Shi <yang@os.amperecomputing.com>
Date: Fri Jun 28 12:14:58 2024 -0700
mm: gup: stop abusing try_grab_folio
A kernel warning was reported when pinning folio in CMA memory when
launching SEV virtual machine. The splat looks like:
[ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[ 464.325515] Call Trace:
[ 464.325520] <TASK>
[ 464.325523] ? __get_user_pages+0x423/0x520
[ 464.325528] ? __warn+0x81/0x130
[ 464.325536] ? __get_user_pages+0x423/0x520
[ 464.325541] ? report_bug+0x171/0x1a0
[ 464.325549] ? handle_bug+0x3c/0x70
[ 464.325554] ? exc_invalid_op+0x17/0x70
[ 464.325558] ? asm_exc_invalid_op+0x1a/0x20
[ 464.325567] ? __get_user_pages+0x423/0x520
[ 464.325575] __gup_longterm_locked+0x212/0x7a0
[ 464.325583] internal_get_user_pages_fast+0xfb/0x190
[ 464.325590] pin_user_pages_fast+0x47/0x60
[ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd]
[ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd]
Per the analysis done by yangge, when starting the SEV virtual machine, it
will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory.
But the page is in CMA area, so fast GUP will fail then fallback to the
slow path due to the longterm pinnalbe check in try_grab_folio().
The slow path will try to pin the pages then migrate them out of CMA area.
But the slow path also uses try_grab_folio() to pin the page, it will
also fail due to the same check then the above warning is triggered.
In addition, the try_grab_folio() is supposed to be used in fast path and
it elevates folio refcount by using add ref unless zero. We are guaranteed
to have at least one stable reference in slow path, so the simple atomic add
could be used. The performance difference should be trivial, but the
misuse may be confusing and misleading.
Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page()
to try_grab_folio(), and use them in the proper paths. This solves both
the abuse and the kernel warning.
The proper naming makes their usecase more clear and should prevent from
abusing in the future.
peterx said:
: The user will see the pin fails, for gpu-slow it further triggers the WARN
: right below that failure (as in the original report):
:
: folio = try_grab_folio(page, page_increm - 1,
: foll_flags);
: if (WARN_ON_ONCE(!folio)) { <------------------------ here
: /*
: * Release the 1st page ref if the
: * folio is problematic, fail hard.
: */
: gup_put_folio(page_folio(page), 1,
: foll_flags);
: ret = -EFAULT;
: goto out;
: }
[1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/
[shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast]
Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com
Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reported-by: yangge <yangge1116@126.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
* this is a direct port from v6.6 LTS branch backport commit eb6b6d3e1f1e
("mm: refactor folio_undo_large_rmappable()"), due to RHEL9 missing
upstream commit 90491d87dd46 ("mm: add free_unref_folios()") and its
big accompanying series.
This patch is a backport of the following upstream commit:
commit 593a10dabe08dcf93259fce2badd8dc2528859a8
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue May 21 21:03:15 2024 +0800
mm: refactor folio_undo_large_rmappable()
Folios of order <= 1 are not in deferred list, the check of order is added
into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support
order-1 folios in the page cache"), but there is a repeated check for
small folio (order 0) during each call of the
folio_undo_large_rmappable(), so only keep folio_order() check inside the
function.
In addition, move all the checks into header file to save a function call
for non-large-rmappable or empty deferred_list folio.
Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 631426ba1d45a8672b177ee85ad4cabe760dd131
Author: David Hildenbrand <david@redhat.com>
Date: Thu Mar 14 17:12:59 2024 +0100
mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.
A reproducer can be found at [1].
The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.
So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().
But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.
As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.
With this change, Darricks reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
Fixes: 4ca9b3859d ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 8897277acfef7f70fdecc054073bea2542fc7a1b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Feb 26 15:55:28 2024 -0500
mm: support order-1 folios in the page cache
Folios of order 1 have no space to store the deferred list. This is not a
problem for the page cache as file-backed folios are never placed on the
deferred list. All we need to do is prevent the core MM from touching the
deferred list for order 1 folios and remove the code which prevented us
from allocating order 1 folios.
Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/
Link: https://lkml.kernel.org/r/20240226205534.1603748-3-zi.yan@sent.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
* arch/*/Kconfig: all hunks dropped as there were only text blurbs and comments
being changed with no functional changes whatsoever, and RHEL9 is missing
several (unrelated) commits to these arches that tranform the text blurbs in
the way these non-functional hunks were expecting;
* drivers/accel/qaic/qaic_data.c: hunk dropped due to RHEL-only commit
083c0cdce2 ("Merge DRM changes from upstream v6.8..v6.9");
* drivers/gpu/drm/i915/gem/selftests/huge_pages.c: hunk dropped due to RHEL-only
commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
* drivers/gpu/drm/ttm/tests/ttm_pool_test.c: all hunks dropped due to RHEL-only
commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
* drivers/video/fbdev/vermilion/vermilion.c: hunk dropped as RHEL9 misses
commit dbe7e429fe ("vmlfb: framebuffer driver for Intel Vermilion Range");
* include/linux/pageblock-flags.h: differences due to out-of-order backport
of upstream commits 72801513b2bf ("mm: set pageblock_order to HPAGE_PMD_ORDER
in case with !CONFIG_HUGETLB_PAGE but THP enabled"), and 3a7e02c040b1
("minmax: avoid overly complicated constant expressions in VM code");
* mm/mm_init.c: differences on the 3rd, and 4th hunks are due to RHEL
backport commit 1845b92dcf ("mm: move most of core MM initialization to
mm/mm_init.c") ignoring the out-of-order backport of commit 3f6dac0fd1b8
("mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks")
thus partially reverting the changes introduced by the latter;
This patch is a backport of the following upstream commit:
commit 5e0a760b44417f7cadd79de2204d6247109558a0
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date: Thu Dec 28 17:47:04 2023 +0300
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 067311d33e650adfe7ae23765959ddcc1ba18510
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Wed Nov 1 13:16:25 2023 -0400
maple_tree: separate ma_state node from status
The maple tree node is overloaded to keep status as well as the active
node. This, unfortunately, results in a re-walk on underflow or overflow.
Since the maple state has room, the status can be placed in its own enum
in the structure. Once an underflow/overflow is detected, certain modes
can restore the status to active and others may need to re-walk just that
one node to see the entry.
The status being an enum has the benefit of detecting unhandled status in
switch statements.
[Liam.Howlett@oracle.com: fix comments about MAS_*]
Link: https://lkml.kernel.org/r/20231106154124.614247-1-Liam.Howlett@oracle.com
[Liam.Howlett@oracle.com: update forking to separate maple state and node]
Link: https://lkml.kernel.org/r/20231106154551.615042-1-Liam.Howlett@oracle.com
[Liam.Howlett@oracle.com: fix mas_prev() state separation code]
Link: https://lkml.kernel.org/r/20231207193319.4025462-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20231101171629.3612299-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit bf857ddd21d0bffc1edafc317e8e2ce0d6d5950c
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Wed Nov 1 13:16:20 2023 -0400
maple_tree: move debug check to __mas_set_range()
__mas_set_range() was created to shortcut resetting the maple state and a
debug check was added to the caller (the vma iterator) to ensure the
internal maple state remains safe to use. Move the debug check from the
vma iterator into the maple tree itself so other users do not incorrectly
use the advanced maple state modification.
Fallout from this change include a large amount of debug setup needed to
be moved to earlier in the header, and the maple_tree.h radix-tree test
code needed to move the inclusion of the header to after the atomic
define. None of those changes have functional changes.
Link: https://lkml.kernel.org/r/20231101171629.3612299-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 2033c98cce666b0d125ae956613ab5111bb8d202
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Nov 8 18:28:09 2023 +0000
mm: remove invalidate_inode_page()
All callers are now converted to call mapping_evict_folio().
Link: https://lkml.kernel.org/r/20231108182809.602073-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 1e12cbb9f69541181afab6b1ff358b4f1dd3e253
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Nov 8 18:28:04 2023 +0000
mm: make mapping_evict_folio() the preferred way to evict clean folios
Patch series "Fix fault handler's handling of poisoned tail pages".
Since introducing the ability to have large folios in the page cache, it's
been possible to have a hwpoisoned tail page returned from the fault
handler. We handle this situation poorly; failing to remove the affected
page from use.
This isn't a minimal patch to fix it, it's a full conversion of all the
code surrounding it.
This patch (of 6):
invalidate_inode_page() does very little beyond calling
mapping_evict_folio(). Move the check for mapping being NULL into
mapping_evict_folio() and make it available to the rest of the MM for use
in the next few patches.
Link: https://lkml.kernel.org/r/20231108182809.602073-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231108182809.602073-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
* kernel/fork.c: differences on the 3rd and 4th hunks are due to out-of-order
backport of commit 35e351780fa9 ("fork: defer linking file vma until vma is
fully initialized"), and we have one RHEL-only hunk here that reverts commit
2b4f3b4987b5 ("fork: lock VMAs of the parent process when forking") which was
a temporary measure upstream that ended up in a dead branch after v6.6;
This patch is a backport of the following upstream commit:
commit d2406291483775ecddaee929231a39c70c08fda2
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date: Fri Oct 27 11:38:45 2023 +0800
fork: use __mt_dup() to duplicate maple tree in dup_mmap()
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance. __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.
The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare). Since it is only a replacement operation for each element,
this process is also O(n).
Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes. The frequency of
each action is difficult to calculate. The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level. If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound. However, based on the test data, it appears that
the actual time complexity is likely to be O(n).
As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and the this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
[1] https://github.com/kdlucas/byte-unixbench/tree/master
Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 23e4883248f0472d806c8b3422ba6257e67bf1a5
Author: Hugh Dickins <hughd@google.com>
Date: Tue Oct 3 02:25:33 2023 -0700
mm: add page_rmappable_folio() wrapper
folio_prep_large_rmappable() is being used repeatedly along with a
conversion from page to folio, a check non-NULL, a check order > 1: wrap
it all up into struct folio *page_rmappable_folio(struct page *).
Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun heo <tj@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 93bf5d4aa27d4ba528f71483ae51fbd70edb3ce8
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date: Wed Oct 11 18:04:31 2023 +0100
mm: abstract VMA merge and extend into vma_merge_extend() helper
mremap uses vma_merge() in the case where a VMA needs to be extended. This
can be significantly simplified and abstracted.
This makes it far easier to understand what the actual function is doing,
avoids future mistakes in use of the confusing vma_merge() function and
importantly allows us to make future changes to how vma_merge() is
implemented by knowing explicitly which merge cases each invocation uses.
Note that in the mremap() extend case, we perform this merge only when
old_len == vma->vm_end - addr. The extension_start, i.e. the start of the
extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end.
With this refactoring, vma_merge() is no longer required anywhere except
mm/mmap.c, so mark it static.
Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit adb20b0c785e9e8d5e4d4a07b9c3340d7de0e5dc
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date: Wed Oct 11 18:04:29 2023 +0100
mm: make vma_merge() and split_vma() internal
Now the common pattern of - attempting a merge via vma_merge() and should
this fail splitting VMAs via split_vma() - has been abstracted, the former
can be placed into mm/internal.h and the latter made static.
In addition, the split_vma() nommu variant also need not be exported.
Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 0f20bba1688bdf3b32df0162511a67d4eda15790
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date: Tue Oct 3 00:14:52 2023 +0100
mm/gup: explicitly define and check internal GUP flags, disallow FOLL_TOUCH
Rather than open-coding a list of internal GUP flags in
is_valid_gup_args(), define which ones are internal.
In addition, explicitly check to see if the user passed in FOLL_TOUCH
somehow, as this appears to have been accidentally excluded.
Link: https://lkml.kernel.org/r/971e013dfe20915612ea8b704e801d7aef9a66b6.1696288092.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit cda6d93672ac5dd8af778a3f3e6082e12233b65b
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Thu Sep 21 15:44:15 2023 +0800
mm: memory: make numa_migrate_prep() to take a folio
In preparation for large folio numa balancing, make numa_migrate_prep() to
take a folio, no functional change intended.
Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit f04eba134e598d4a5af2eeecff0562df5042cbfc
Author: Lucy Mielke <lucymielke@icloud.com>
Date: Fri Oct 6 22:30:51 2023 +0200
mm: add printf attribute to shrinker_debugfs_name_alloc
This fixes a compiler warning when compiling an allyesconfig with W=1:
mm/internal.h:1235:9: error: function might be a candidate for `gnu_printf'
format attribute [-Werror=suggest-attribute=format]
[akpm@linux-foundation.org: fix shrinker_alloc() as welll per Qi Zheng]
Link: https://lkml.kernel.org/r/822387b7-4895-4e64-5806-0f56b5d6c447@bytedance.com
Link: https://lkml.kernel.org/r/ZSBue-3kM6gI6jCr@mainframe
Fixes: c42d50aefd17 ("mm: shrinker: add infrastructure for dynamically allocating shrinker")
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit c42d50aefd17a6bad3ed617769edbbb579137545
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date: Mon Sep 11 17:44:00 2023 +0800
mm: shrinker: add infrastructure for dynamically allocating shrinker
Patch series "use refcount+RCU method to implement lockless slab shrink",
v6.
1. Background
=============
We used to implement the lockless slab shrink with SRCU [1], but then kernel
test robot reported -88.8% regression in stress-ng.ramfs.ops_per_sec test
case [2], so we reverted it [3].
This patch series aims to re-implement the lockless slab shrink using the
refcount+RCU method proposed by Dave Chinner [4].
[1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
2. Implementation
=================
Currently, the shrinker instances can be divided into the following three types:
a) global shrinker instance statically defined in the kernel, such as
workingset_shadow_shrinker.
b) global shrinker instance statically defined in the kernel modules, such as
mmu_shrinker in x86.
c) shrinker instance embedded in other structures.
For case a, the memory of shrinker instance is never freed. For case b, the
memory of shrinker instance will be freed after synchronize_rcu() when the
module is unloaded. For case c, the memory of shrinker instance will be freed
along with the structure it is embedded in.
In preparation for implementing lockless slab shrink, we need to dynamically
allocate those shrinker instances in case c, then the memory can be dynamically
freed alone by calling kfree_rcu().
This patchset adds the following new APIs for dynamically allocating shrinker,
and add a private_data field to struct shrinker to record and get the original
embedded structure.
1. shrinker_alloc()
2. shrinker_register()
3. shrinker_free()
In order to simplify shrinker-related APIs and make shrinker more independent of
other kernel mechanisms, this patchset uses the above APIs to convert all
shrinkers (including case a and b) to dynamically allocated, and then remove all
existing APIs. This will also have another advantage mentioned by Dave Chinner:
```
The other advantage of this is that it will break all the existing out of tree
code and third party modules using the old API and will no longer work with a
kernel using lockless slab shrinkers. They need to break (both at the source and
binary levels) to stop bad things from happening due to using uncoverted
shrinkers in the new setup.
```
Then we free the shrinker by calling call_rcu(), and use rcu_read_{lock,unlock}()
to ensure that the shrinker instance is valid. And the shrinker::refcount
mechanism ensures that the shrinker instance will not be run again after
unregistration. So the structure that records the pointer of shrinker instance
can be safely freed without waiting for the RCU read-side critical section.
In this way, while we implement the lockless slab shrink, we don't need to be
blocked in unregister_shrinker() to wait RCU read-side critical section.
PATCH 1: introduce new APIs
PATCH 2~38: convert all shrinnkers to use new APIs
PATCH 39: remove old APIs
PATCH 40~41: some cleanups and preparations
PATCH 42-43: implement the lockless slab shrink
PATCH 44~45: convert shrinker_rwsem to mutex
3. Testing
==========
3.1 slab shrink stress test
---------------------------
We can reproduce the down_read_trylock() hotspot through the following script:
```
DIR="/root/shrinker/memcg/mnt"
do_create()
{
mkdir -p /sys/fs/cgroup/memory/test
echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
for i in `seq 0 $1`;
do
mkdir -p /sys/fs/cgroup/memory/test/$i;
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
mkdir -p $DIR/$i;
done
}
do_mount()
{
for i in `seq $1 $2`;
do
mount -t tmpfs $i $DIR/$i;
done
}
do_touch()
{
for i in `seq $1 $2`;
do
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
done
}
case "$1" in
touch)
do_touch $2 $3
;;
test)
do_create 4000
do_mount 0 4000
do_touch 0 3000
;;
*)
exit 1
;;
esac
```
Save the above script, then run test and touch commands. Then we can use the
following perf command to view hotspots:
perf top -U -F 999
1) Before applying this patchset:
33.15% [kernel] [k] down_read_trylock
25.38% [kernel] [k] shrink_slab
21.75% [kernel] [k] up_read
4.45% [kernel] [k] _find_next_bit
2.27% [kernel] [k] do_shrink_slab
1.80% [kernel] [k] intel_idle_irq
1.79% [kernel] [k] shrink_lruvec
0.67% [kernel] [k] xas_descend
0.41% [kernel] [k] mem_cgroup_iter
0.40% [kernel] [k] shrink_node
0.38% [kernel] [k] list_lru_count_one
2) After applying this patchset:
64.56% [kernel] [k] shrink_slab
12.18% [kernel] [k] do_shrink_slab
3.30% [kernel] [k] __rcu_read_unlock
2.61% [kernel] [k] shrink_lruvec
2.49% [kernel] [k] __rcu_read_lock
1.93% [kernel] [k] intel_idle_irq
0.89% [kernel] [k] shrink_node
0.81% [kernel] [k] mem_cgroup_iter
0.77% [kernel] [k] mem_cgroup_calculate_protection
0.66% [kernel] [k] list_lru_count_one
We can see that the first perf hotspot becomes shrink_slab, which is what we
expect.
3.2 registration and unregistration stress test
-----------------------------------------------
Run the command below to test:
stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
1) Before applying this patchset:
setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 473062 60.00 8.00 279.13 7884.12 1647.59
for a 60.01s run time:
1440.34s available CPU time
7.99s user time ( 0.55%)
279.13s system time ( 19.38%)
287.12s total time ( 19.93%)
load average: 7.12 2.99 1.15
successful run completed in 60.01s (1 min, 0.01 secs)
2) After applying this patchset:
setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 477165 60.00 8.13 281.34 7952.55 1648.40
for a 60.01s run time:
1440.33s available CPU time
8.12s user time ( 0.56%)
281.34s system time ( 19.53%)
289.46s total time ( 20.10%)
load average: 6.98 3.03 1.19
successful run completed in 60.01s (1 min, 0.01 secs)
We can see that the ops/s has hardly changed.
This patch (of 45):
Currently, the shrinker instances can be divided into the following three
types:
a) global shrinker instance statically defined in the kernel, such as
workingset_shadow_shrinker.
b) global shrinker instance statically defined in the kernel modules, such
as mmu_shrinker in x86.
c) shrinker instance embedded in other structures.
For case a, the memory of shrinker instance is never freed. For case b,
the memory of shrinker instance will be freed after synchronize_rcu() when
the module is unloaded. For case c, the memory of shrinker instance will
be freed along with the structure it is embedded in.
In preparation for implementing lockless slab shrink, we need to
dynamically allocate those shrinker instances in case c, then the memory
can be dynamically freed alone by calling kfree_rcu().
So this commit adds the following new APIs for dynamically allocating
shrinker, and add a private_data field to struct shrinker to record and
get the original embedded structure.
1. shrinker_alloc()
Used to allocate shrinker instance itself and related memory, it will
return a pointer to the shrinker instance on success and NULL on failure.
2. shrinker_register()
Used to register the shrinker instance, which is same as the current
register_shrinker_prepared().
3. shrinker_free()
Used to unregister (if needed) and free the shrinker instance.
In order to simplify shrinker-related APIs and make shrinker more
independent of other kernel mechanisms, subsequent submissions will use
the above API to convert all shrinkers (including case a and b) to
dynamically allocated, and then remove all existing APIs.
This will also have another advantage mentioned by Dave Chinner:
```
The other advantage of this is that it will break all the existing
out of tree code and third party modules using the old API and will
no longer work with a kernel using lockless slab shrinkers. They
need to break (both at the source and binary levels) to stop bad
things from happening due to using unconverted shrinkers in the new
setup.
```
[zhengqi.arch@bytedance.com: mm: shrinker: some cleanup]
Link: https://lkml.kernel.org/r/20230919024607.65463-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20230911094444.68966-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20230911094444.68966-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <cel@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 96f7b2b9bbb1d680fab1fa96a3958b2d759b881f
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date: Mon Sep 11 17:25:15 2023 +0800
mm: vmscan: move shrinker-related code into a separate file
The mm/vmscan.c file is too large, so separate the shrinker-related code
from it into a separate file. No functional changes.
Link: https://lkml.kernel.org/r/20230911092517.64141-3-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 3ee0aa9f06756e959633ffda37856c6741d948ed
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date: Mon Sep 11 17:25:14 2023 +0800
mm: move some shrinker-related function declarations to mm/internal.h
Patch series "cleanups for lockless slab shrink", v4.
This series is some cleanups for lockless slab shrink.
This patch (of 4):
The following functions are only used inside the mm subsystem, so it's
better to move their declarations to the mm/internal.h file.
1. shrinker_debugfs_add()
2. shrinker_debugfs_detach()
3. shrinker_debugfs_remove()
Link: https://lkml.kernel.org/r/20230911092517.64141-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20230911092517.64141-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit ebc1baf5c9b46c2240c580a2fd992b2e48606dfa
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Aug 16 16:11:58 2023 +0100
mm: free up a word in the first tail page
Store the folio order in the low byte of the flags word in the first tail
page. This frees up the word that was being used to store the order and
dtor bytes previously.
Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit de53c05f2ae3d47d30db58e9c4e54e3bbc868377
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Aug 16 16:11:56 2023 +0100
mm: add large_rmappable page flag
Stored in the first tail page's flags, this flag replaces the destructor.
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.
Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 8dc4a8f1e038189cb575f89bcd23364698b88cc1
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Aug 16 16:11:52 2023 +0100
mm: convert free_transhuge_folio() to folio_undo_large_rmappable()
Indirect calls are expensive, thanks to Spectre. Test for
TRANSHUGE_PAGE_DTOR and destroy the folio appropriately. Move the
free_compound_page() call into destroy_large_folio() to simplify later
patches.
Link: https://lkml.kernel.org/r/20230816151201.3655946-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 0790e1e2b1b71ba357e89e779451efe79dff28e6
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Sat Aug 12 01:20:33 2023 +0100
mm: allow fault_dirty_shared_page() to be called under the VMA lock
By making maybe_unlock_mmap_for_io() handle the VMA lock correctly, we
make fault_dirty_shared_page() safe to be called without the mmap lock
held.
Link: https://lkml.kernel.org/r/20230812002033.1002367-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: David Hildenbrand <david@redhat.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* mm/internal.h: context difference due to a series of conflict resolutions
given a series of out-of-order backports which made "mirrored_kernelcore"
to end up in a slightly different context when comparing with its upstream
placement. We leverage this backport to make it go into the "right" place.
This patch is a backport of the following upstream commit:
commit 0db31d63f27e5b8ca84b9fd5a3cff5b12ac88abf
Author: Ma Wupeng <mawupeng1@huawei.com>
Date: Wed Aug 2 15:23:28 2023 +0800
mm: disable kernelcore=mirror when no mirror memory
For system with kernelcore=mirror enabled while no mirrored memory is
reported by efi. This could lead to kernel OOM during startup since all
memory beside zone DMA are in the movable zone and this prevents the
kernel to use it.
Zone DMA/DMA32 initialization is independent of mirrored memory and their
max pfn is set in zone_sizes_init(). Since kernel can fallback to zone
DMA/DMA32 if there is no memory in zone Normal, these zones are seen as
mirrored memory no mather their memory attributes are.
To solve this problem, disable kernelcore=mirror when there is no real
mirrored memory exists.
Link: https://lkml.kernel.org/r/20230802072328.2107981-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Levi Yun <ppbuk5246@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* context differences on the 1st, 3rd, and 15th hunks due to out-of-order
backport of upstream commits ad9f006351c3 ("mm: always lock new vma
before inserting into vma tree") and c9d6e982c3f8 ("mm: move vma
locking out of vma_prepare and dup_anon_vma");
* context difference on the 4th hunk due to out-of-order backport of upstream
commit 1419430c8abb ("mmap: fix vma_iterator in error path of vma_merge()")
This patch is a backport of the following upstream commit:
commit b5df09226450165c434084d346fcb6d4858b0d52
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Mon Jul 24 14:31:52 2023 -0400
mm: set up vma iterator for vma_iter_prealloc() calls
Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.
Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit f72cf24a868675280bc555e95abc3639777f8d82
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Mon Jul 24 14:31:51 2023 -0400
mm: use vma_iter_clear_gfp() in nommu
Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so
it can be used in the nommu code. This will reduce node preallocations
in nommu.
Link: https://lkml.kernel.org/r/20230724183157.3939892-10-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit da0892547b101df6e13255b378380d077975368d
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Mon Jul 24 14:31:49 2023 -0400
maple_tree: re-introduce entry to mas_preallocate() arguments
The current preallocation strategy is to preallocate the absolute
worst-case allocation for a tree modification. The entry (or NULL) is
needed to know how many nodes are needed to write to the tree. Start by
adding the argument to the mas_preallocate() definition.
Link: https://lkml.kernel.org/r/20230724183157.3939892-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* mm/internal.h: minor context difference due to the out-of-order backport of
upstream commit 902c2d91582c ("memblock: Disable mirror feature if
kernelcore is not specified")
This patch is a backport of the following upstream commit:
commit 53bee98d004fcd7062f2fe056720704e947e6000
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Mon Jul 24 14:31:48 2023 -0400
mm: remove re-walk from mmap_region()
Using vma_iter_set() will reset the tree and cause a re-walk. Use
vmi_iter_config() to set the write to a sub-set of the range. Change
the file case to also use vmi_iter_config() so that the end is correctly
set.
Link: https://lkml.kernel.org/r/20230724183157.3939892-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* mm/memory.c: minor context difference due to the backport of upstream
commit 2820b0f09be9 ("hugetlbfs: close race between MADV_DONTNEED and page fault")
This patch is a backport of the following upstream commit:
commit fd892593d44d8b649caf30a67f0c7696d976d901
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Mon Jul 24 14:31:45 2023 -0400
mm: change do_vmi_align_munmap() tracking of VMAs to remove
The majority of the calls to munmap a vm range is within a single vma.
The maple tree is able to store a single entry at 0, with a size of 1 as
a pointer and avoid any allocations. Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count. This
will leverage the ability to store the first entry without a node
allocation.
Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries. Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.
Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 1279aa0656bbebbeecbd3bec0dd50faf35e5db38
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Fri Jun 30 14:22:53 2023 +0800
mm: make show_free_areas() static
All callers of show_free_areas() pass 0 and NULL, so we can directly use
show_mem() instead of show_free_areas(0, NULL), which could make
show_free_areas() a static function.
Link: https://lkml.kernel.org/r/20230630062253.189440-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 00547ef73ff282eea59cd33b8952f1dcbb4bd4b8
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date: Mon Jun 12 17:10:45 2023 -0700
mm/mmap: Add shadow stack pages to memory accounting
The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-20-rick.p.edgecombe%40intel.com
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 1a0fc811f5f5addf54499826bd1b6e34e917491c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Jun 21 17:45:55 2023 +0100
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
We don't use pagevecs for the LRU cache any more, and we don't know that
the failed invalidations were due to the folio being in an LRU cache. So
rename it to be more accurate.
Link: https://lkml.kernel.org/r/20230621164557.3510324-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 1e3be4856f49d55c60b6cd500297b06acfe216a9
Author: Tarun Sahu <tsahu@linux.ibm.com>
Date: Mon Jun 12 15:05:14 2023 +0530
mm/folio: replace set_compound_order with folio_set_order
The patch ("mm/folio: Avoid special handling for order value 0 in
folio_set_order") [1] removed the need for special handling of order = 0
in folio_set_order. Now, folio_set_order and set_compound_order becomes
similar function. This patch removes the set_compound_order and uses
folio_set_order instead.
[1] https://lore.kernel.org/all/20230609183032.13E08C433D2@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20230612093514.689846-1-tsahu@linux.ibm.com
Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com>
Reviewed-by Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit b95826c9aa48b2997b3973b42a8716ba132b920e
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date: Mon Jun 12 09:34:05 2023 -0700
mm: remove set_compound_page_dtor()
All users can use the folio equivalent so this function can be safely
removed.
Link: https://lkml.kernel.org/r/20230612163405.99345-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tarun Sahu <tsahu@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit e3b7bf972d632288ccad95b116628e3141be676e
Author: Tarun Sahu <tsahu@linux.ibm.com>
Date: Fri Jun 9 21:59:07 2023 +0530
mm/folio: avoid special handling for order value 0 in folio_set_order
folio_set_order(folio, 0) is used in kernel at two places
__destroy_compound_gigantic_folio and __prep_compound_gigantic_folio.
Currently, It is called to clear out the folio->_folio_nr_pages and
folio->_folio_order.
For __destroy_compound_gigantic_folio:
In past, folio_set_order(folio, 0) was needed because page->mapping used
to overlap with _folio_nr_pages and _folio_order. So if these fields were
left uncleared during freeing gigantic hugepages, they were causing
"BUG: bad page state" due to non-zero page->mapping. Now, After
Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to
CMA") page->mapping has explicitly been cleared out for tail pages. Also,
_folio_order and _folio_nr_pages no longer overlaps with page->mapping.
So, folio_set_order(folio, 0) can be removed from freeing gigantic
folio path (__destroy_compound_gigantic_folio).
Another place, folio_set_order(folio, 0) is called inside
__prep_compound_gigantic_folio during error path. Here,
folio_set_order(folio, 0) can also be removed if we move
folio_set_order(folio, order) after for loop.
The patch also moves _folio_set_head call in __prep_compound_gigantic_folio()
such that we avoid clearing them in the error path.
Also, as Mike pointed out:
"It would actually be better to move the calls _folio_set_head and
folio_set_order in __prep_compound_gigantic_folio() as suggested here. Why?
In the current code, the ref count on the 'head page' is still 1 (or more)
while those calls are made. So, someone could take a speculative ref on the
page BEFORE the tail pages are set up."
This way, folio_set_order(folio, 0) is no more needed. And it will also
helps removing the confusion of folio order being set to 0 (as _folio_order
field is part of first tail page).
Testing: I have run LTP tests, which all passes. and also I have written
the test in LTP which tests the bug caused by compound_nr and page->mapping
overlapping.
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c
Running on older kernel ( < 5.10-rc7) with the above bug this fails while
on newer kernel and, also with this patch it passes.
Link: https://lkml.kernel.org/r/20230609162907.111756-1-tsahu@linux.ibm.com
Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 4d312ac057da57b4a844ec8af14236e74b652efe
Author: Arnd Bergmann <arnd@arndb.de>
Date: Tue May 16 21:35:47 2023 +0200
x86/mm: Add early_memremap_pgprot_adjust() prototype
early_memremap_pgprot_adjust() is a __weak function with a local
prototype, but x86 has a custom implementation that does not
see the prototype, causing a W=1 warning:
arch/x86/mm/ioremap.c:785:17: error: no previous prototype for 'early_memremap_pgprot_adjust' [-Werror=missing-prototypes]
Move the declaration into the global linux/io.h header to avoid this.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-19-arnd%40kernel.org
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-48221
This patch is a backport of the following upstream commit:
commit 4bf4f155bfbc77a12ad2c9e12d9f57718adb9a5c
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Mon Apr 17 19:48:07 2023 +0800
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
Both of them change the arg from page_list to folio_list when convert them
to use a folio, but not the declaration, let's correct it, also move the
reclaim_pages() from swap.h to internal.h as it only used in mm.
Link: https://lkml.kernel.org/r/20230417114807.186786-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviwed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-20141
commit ecbb490d8ee38fd84a0d682282589ff723dc62c0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue May 16 14:38:21 2023 +0800
mm: page_alloc: move is_check_pages_enabled() into page_alloc.c
The is_check_pages_enabled() only used in page_alloc.c, move it into
page_alloc.c, also use it in free_tail_page_prepare().
Link: https://lkml.kernel.org/r/20230516063821.121844-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-20141
commit e95d372c4cd46b6ec4eeacc07adcb7260ab4cfa0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue May 16 14:38:20 2023 +0800
mm: page_alloc: move sysctls into it own fils
This moves all page alloc related sysctls to its own file, as part of the
kernel/sysctl.c spring cleaning, also move some functions declarations
from mm.h into internal.h.
Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-20141
commit 904d58578fce531be07619a2bc2cdc16c9fd49b6
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue May 16 14:38:11 2023 +0800
mm: page_alloc: move set_zone_contiguous() into mm_init.c
set_zone_contiguous() is only used in mm init/hotplug, and
clear_zone_contiguous() only used in hotplug, move them from page_alloc.c
to the more appropriate file.
Link: https://lkml.kernel.org/r/20230516063821.121844-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
commit 5805192c7b7257d290474cb1a3897d0567281bbc
Author: David Hildenbrand <david@redhat.com>
Date: Sat Aug 5 12:12:56 2023 +0200
mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast
In contrast to most other GUP code, GUP-fast common page table walking
code like gup_pte_range() also handles hugetlb pages. But in contrast to
other hugetlb page table walking code, it does not look at the hugetlb PTE
abstraction whereby we have only a single logical hugetlb PTE per hugetlb
page, even when using multiple cont-PTEs underneath -- which is for
example what huge_ptep_get() abstracts.
So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast
might stumble over a PTE that does not map the head page of a hugetlb page
-- not the first "head" PTE of such a cont mapping.
Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we
might end up calling gup_must_unshare() with a tail page of a hugetlb
page.
We only maintain a single PageAnonExclusive flag per hugetlb page (as
hugetlb pages cannot get partially COW-shared), stored for the head page.
That flag is clear for all tail pages.
So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail
page of a hugetlb page:
1) With CONFIG_DEBUG_VM_PGFLAGS
Stumbles over the:
VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
For example, when executing the COW selftests with 64k hugetlb pages on
arm64:
[ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11
[ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2
[ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff)
[ 61.084101] page_type: 0xffffffff()
[ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000
[ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000
[ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71
[ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000
[ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page))
[ 61.086914] ------------[ cut here ]------------
[ 61.087220] kernel BUG at include/linux/page-flags.h:990!
[ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 61.087999] Modules linked in: ...
[ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3
[ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98
[ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98
[ 61.091592] sp : ffff8000825eb940
[ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440
[ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000
[ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58
[ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff
[ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
[ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
[ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0
[ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8
[ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000
[ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046
[ 61.096894] Call trace:
[ 61.097080] gup_must_unshare.part.0+0x64/0x98
[ 61.097392] gup_pte_range+0x3a8/0x3f0
[ 61.097662] gup_pgd_range+0x1ec/0x280
[ 61.097942] lockless_pages_from_mm+0x64/0x1a0
[ 61.098258] internal_get_user_pages_fast+0xe4/0x1d0
[ 61.098612] pin_user_pages_fast+0x58/0x78
[ 61.098917] pin_longterm_test_start+0xf4/0x2b8
[ 61.099243] gup_test_ioctl+0x170/0x3b0
[ 61.099528] __arm64_sys_ioctl+0xa8/0xf0
[ 61.099822] invoke_syscall.constprop.0+0x7c/0xd0
[ 61.100160] el0_svc_common.constprop.0+0xe8/0x100
[ 61.100500] do_el0_svc+0x38/0xa0
[ 61.100736] el0_svc+0x3c/0x198
[ 61.100971] el0t_64_sync_handler+0x134/0x150
[ 61.101280] el0t_64_sync+0x17c/0x180
[ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000)
2) Without CONFIG_DEBUG_VM_PGFLAGS
Always detects "not exclusive" for passed tail pages and refuses to PIN
the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback
to ordinary GUP. As ordinary GUP properly considers the logical hugetlb
PTE abstraction in hugetlb_follow_page_mask(), pinning the page will
succeed when looking at the PageAnonExclusive on the head page only.
So the only real effect of this is that with cont-PTE hugetlb pages, we'll
always fallback from GUP-fast to ordinary GUP when not working on the head
page, which ends up checking the head page and do the right thing.
Consequently, the cow selftests pass with cont-PTE hugetlb pages as well
without CONFIG_DEBUG_VM_PGFLAGS.
Note that this only applies to anon hugetlb pages that are mapped using
cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel.
... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into
the page table R/O using GUP-fast.
On production kernels (and even most debug kernels, that don't set
CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required
to be backported. But of course, it does not hurt.
Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com
Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit 8b9c1cc0418a43196477083e7082568e7a4c9418
Author: David Hildenbrand <david@redhat.com>
Date: Thu Aug 3 16:32:03 2023 +0200
smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
We shouldn't be using a GUP-internal helper if it can be avoided.
Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
(1) Will always return the head page, not a tail page of a THP.
If we'd ever call smaps_account with a tail page while setting "compound
= true", we could be in trouble, because smaps_account() would look at
the memmap of unrelated pages.
If we're unlucky, that memmap does not exist at all. Before we removed
PG_doublemap, we could have triggered something similar as in
commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
migration entry").
This can theoretically happen ever since commit ff9f47f6f0 ("mm: proc:
smaps_rollup: do not stall write attempts on mmap_lock"):
(a) We're in show_smaps_rollup() and processed a VMA
(b) We release the mmap lock in show_smaps_rollup() because it is
contended
(c) We merged that VMA with another VMA
(d) We collapsed a THP in that merged VMA at that position
If the end address of the original VMA falls into the middle of a THP
area, we would call smap_gather_stats() with a start address that falls
into a PMD-mapped THP. It's probably very rare to trigger when not
really forced.
(2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
If such pages would be anonymous, we most certainly would want to
account them.
(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
entries.
Especially, follow_pmd_mask() never ends up calling
follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
is set.
So skipping pmd_devmap() pages seems to be the right thing to do.
(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
We won't be returning a memmap that should be ignored by core-mm, or
worse, a memmap that does not even exist. Note that while
walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
Most probably this case doesn't currently really happen on the PMD level,
otherwise we'd already be able to trigger kernel crashes when reading
smaps / smaps_rollup.
So most probably only (1) is relevant in practice as of now, but could only
cause trouble in extreme corner cases.
Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
reuse in wrong context.
Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com
Fixes: ff9f47f6f0 ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit b0cc5e89caadc21c2d8f7dc4c97947761b358876
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Mon May 22 13:52:10 2023 -0700
mm/mlock: rename mlock_future_check() to mlock_future_ok()
It is felt that the name mlock_future_check() is vague - it doesn't
particularly convey the function's operation. mlock_future_ok() is a
clearer name for a predicate function.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit 3c54a298db4c6b38bbd5f86216ce0f5ad4596ccf
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date: Mon May 22 09:24:12 2023 +0100
mm/mmap: refactor mlock_future_check()
In all but one instance, mlock_future_check() is treated as a boolean
function despite returning an error code. In one instance, this error
code is ignored and replaced with -ENOMEM.
This is confusing, and the inversion of true -> failure, false -> success
is not warranted. Convert the function to a bool, lightly refactor and
return true if the check passes, false if not.
Link: https://lkml.kernel.org/r/20230522082412.56685-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
Conflicts:
mm/internal.h: mt_dump had two params
commit 36bd931049bf93a4eaec4a558beca477db152881
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Thu May 18 10:55:27 2023 -0400
mm: update vma_iter_store() to use MAS_WARN_ON()
MAS_WARN_ON() will provide more information on the maple state and can be
more useful for debugging. Use this version of WARN_ON() in the debugging
code when storing to the tree.
Update the printk to a pr_warn(), but this will only be printed when maple
tree debug is enabled anyways.
Making all print statements into one will keep them together on a busy
terminal.
Link: https://lkml.kernel.org/r/20230518145544.1722059-19-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>