Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Rafael Aquini	8f8ad3c6e9	mm/memory_hotplug: prevent accessing by index=-1 JIRA: https://issues.redhat.com/browse/RHEL-27745 This patch is a backport of the following upstream commit: commit 5958d35917e1296f46dfc8b8c959732efd6d8d5d Author: Anastasia Belova <abelova@astralinux.ru> Date: Thu Jun 6 11:06:59 2024 +0300 mm/memory_hotplug: prevent accessing by index=-1 nid may be equal to NUMA_NO_NODE=-1. Prevent accessing node_data array by invalid index with check for nid. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lkml.kernel.org/r/20240606080659.18525-1-abelova@astralinux.ru Fixes: e83a437faa62 ("mm/memory_hotplug: introduce "auto-movable" online policy") Signed-off-by: Anastasia Belova <abelova@astralinux.ru> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-12-09 12:25:20 -05:00
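The guard described in the commit above can be sketched in user space as follows. This is only an illustration: the table `node_present`, the bound `MAX_NODES`, and the helper `node_present_pages_checked()` are hypothetical stand-ins and not kernel code; only the value `NUMA_NO_NODE == -1` comes from the commit message.

```c
#include <stddef.h>

#define NUMA_NO_NODE (-1)  /* value stated in the commit message */
#define MAX_NODES 4        /* hypothetical table size for this sketch */

/* Hypothetical stand-in for a per-node table such as node_data[]. */
static const unsigned long node_present[MAX_NODES] = { 100, 200, 0, 50 };

/*
 * Return the present-page count for nid, or 0 when nid cannot be used
 * as an array index (NUMA_NO_NODE or any other out-of-range value).
 */
unsigned long node_present_pages_checked(int nid)
{
	if (nid == NUMA_NO_NODE || nid < 0 || nid >= MAX_NODES)
		return 0;
	return node_present[nid];
}
```

The point is simply that the index is validated before the array is touched, which is what the backported fix is described as doing for `nid`.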
Rafael Aquini	c8c9c0b259	mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER JIRA: https://issues.redhat.com/browse/RHEL-27745 Conflicts: * arch//Kconfig: all hunks dropped as there were only text blurbs and comments being changed with no functional changes whatsoever, and RHEL9 is missing several (unrelated) commits to these arches that transform the text blurbs in the way these non-functional hunks were expecting; drivers/accel/qaic/qaic_data.c: hunk dropped due to RHEL-only commit `083c0cdce2` ("Merge DRM changes from upstream v6.8..v6.9"); * drivers/gpu/drm/i915/gem/selftests/huge_pages.c: hunk dropped due to RHEL-only commit `ca8b16c11b` ("Merge DRM changes from upstream v6.7..v6.8"); * drivers/gpu/drm/ttm/tests/ttm_pool_test.c: all hunks dropped due to RHEL-only commit `ca8b16c11b` ("Merge DRM changes from upstream v6.7..v6.8"); * drivers/video/fbdev/vermilion/vermilion.c: hunk dropped as RHEL9 misses commit `dbe7e429fe` ("vmlfb: framebuffer driver for Intel Vermilion Range"); * include/linux/pageblock-flags.h: differences due to out-of-order backport of upstream commits 72801513b2bf ("mm: set pageblock_order to HPAGE_PMD_ORDER in case with !CONFIG_HUGETLB_PAGE but THP enabled"), and 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code"); * mm/mm_init.c: differences on the 3rd, and 4th hunks are due to RHEL backport commit `1845b92dcf` ("mm: move most of core MM initialization to mm/mm_init.c") ignoring the out-of-order backport of commit 3f6dac0fd1b8 ("mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks") thus partially reverting the changes introduced by the latter; This patch is a backport of the following upstream commit: commit 5e0a760b44417f7cadd79de2204d6247109558a0 Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Date: Thu Dec 28 17:47:04 2023 +0300 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-12-09 12:24:17 -05:00
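The hazard behind the rename in the commit above, an exclusive bound silently becoming an inclusive one, can be shown with a short sketch. The numeric values 11 and 10 are illustrative assumptions, and the three counting helpers are hypothetical; none of this is taken from the kernel sources.

```c
/* Illustrative values only: an exclusive bound of 11 and an inclusive
 * bound of 10 both describe orders 0..10. */
#define OLD_MAX_ORDER_EXCLUSIVE 11
#define MAX_PAGE_ORDER          10

/* Loop written against the old, exclusive definition. */
unsigned int count_orders_exclusive(void)
{
	unsigned int order, n = 0;

	for (order = 0; order < OLD_MAX_ORDER_EXCLUSIVE; order++)
		n++;
	return n;
}

/* Loop written against the new, inclusive definition. */
unsigned int count_orders_inclusive(void)
{
	unsigned int order, n = 0;

	for (order = 0; order <= MAX_PAGE_ORDER; order++)
		n++;
	return n;
}

/* Stale loop: old "<" comparison applied to the new inclusive value,
 * so the largest order is never visited. */
unsigned int count_orders_stale(void)
{
	unsigned int order, n = 0;

	for (order = 0; order < MAX_PAGE_ORDER; order++)
		n++;
	return n;
}
```

Out-of-tree code that kept the old comparison would compile unchanged against the redefined constant, which is why a rename that forces a build failure is a reasonable way to draw attention to the change.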
Rado Vrbovsky	570a71d7db	Merge: mm: update core code to v6.6 upstream MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252 JIRA: https://issues.redhat.com/browse/RHEL-27743 JIRA: https://issues.redhat.com/browse/RHEL-59459 CVE: CVE-2024-46787 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961 This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level. This work follows up on the previous v6.5 update (RHEL-27742) and as such, the bulk of this changeset is comprised of refactoring and clean-ups of the internal implementation of several APIs as it further advances the conversion to FOLIOS, and follow up on the per-VMA locking changes. Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds, and we add a potential extra level of protection (assessment pending) to help on mitigating kernel heap exploits dubbed as "SlubStick". Follow-up fixes are omitted from this series either because they are irrelevant to the bits we support on RHEL or because they depend on bigger changesets introduced upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately. 
Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot") Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources") Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()") Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros") Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages") Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType") Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()") Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio") Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling") Signed-off-by: Rafael Aquini <raquini@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Mark Salter <msalter@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Approved-by: David Airlie <airlied@redhat.com> Approved-by: Michal Schmidt <mschmidt@redhat.com> Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-10-30 07:22:28 +00:00
Rado Vrbovsky	45cc2e1e72	Merge: x86/kaslr: Expose and use the end of the physical memory address space MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5273 JIRA: https://issues.redhat.com/browse/RHEL-55130 JIRA: https://issues.redhat.com/browse/RHEL-55196 JIRA: https://issues.redhat.com/browse/RHEL-58584 commit ea72ce5da22806d5713f3ffb39a6d5ae73841f93 Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed, 14 Aug 2024 00:29:36 +0200 x86/kaslr: Expose and use the end of the physical memory address space iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. 
All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hickups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: `0483e1fa6e` ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: David Arcari <darcari@redhat.com> Approved-by: Lenny Szubowicz <lszubowi@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-10-25 16:19:39 +00:00
Waiman Long	5115d29b92	x86/kaslr: Expose and use the end of the physical memory address space JIRA: https://issues.redhat.com/browse/RHEL-55130 JIRA: https://issues.redhat.com/browse/RHEL-55196 JIRA: https://issues.redhat.com/browse/RHEL-58584 commit ea72ce5da22806d5713f3ffb39a6d5ae73841f93 Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed, 14 Aug 2024 00:29:36 +0200 x86/kaslr: Expose and use the end of the physical memory address space iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. 
Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hickups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: `0483e1fa6e` ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx Signed-off-by: Waiman Long <longman@redhat.com>	2024-10-03 13:24:19 -04:00
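The range check that the commit above describes adding to add_pages() might look roughly like the following user-space sketch. The helper name `check_add_range()`, the `PAGE_SHIFT` value of 12, and passing the limit as a parameter are assumptions made for illustration; the kernel's actual implementation may differ.

```c
#include <errno.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB pages for this sketch */

/*
 * Hypothetical helper: reject a hot-add request whose last byte lies
 * above physmem_end (standing in for PHYSMEM_END).
 * Returns 0 when the range fits, a negative errno value otherwise.
 */
int check_add_range(uint64_t start_pfn, uint64_t nr_pages,
		    uint64_t physmem_end)
{
	uint64_t end;

	if (nr_pages == 0)
		return -EINVAL;

	/* Physical address of the last byte in the range. */
	end = ((start_pfn + nr_pages) << PAGE_SHIFT) - 1;
	if (end > physmem_end)
		return -ERANGE;
	return 0;
}
```

With a limit of 1 GiB minus one byte, a request for exactly 262144 pages starting at PFN 0 fits, while one more page does not.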
Rafael Aquini	e97e099835	mm/memory_hotplug: document the signal_pending() check in offline_pages() JIRA: https://issues.redhat.com/browse/RHEL-27743 This patch is a backport of the following upstream commit: commit de7cb03db05a4b460edefff266bbaead70a11634 Author: David Hildenbrand <david@redhat.com> Date: Tue Jul 11 19:40:50 2023 +0200 mm/memory_hotplug: document the signal_pending() check in offline_pages() Let's update the documentation that any signal is sufficient, and add a comment that not only checking for fatal signals is historical baggage: changing it now could break existing user space, although unlikely. For example, when an app provides a custom SIGALRM handler and triggers memory offlining, the timeout cmd would no longer stop memory offlining, because SIGALRM would no longer be considered a fatal signal. Note that using signal_pending() instead of fatal_signal_pending() is an anti-pattern, but slowly deprecating that behavior to eventually change it in the far future is probably not worth the effort. If this ever becomes relevant for user-space, we might want to rethink. Link: https://lkml.kernel.org/r/20230711174050.603820-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-10-01 11:18:28 -04:00
Rafael Aquini	a726366716	mm: remove unnecessary pagevec includes JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit 994ec4e29b3de188d11fe60d17403285fcc8917a Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Wed Jun 21 17:45:57 2023 +0100 mm: remove unnecessary pagevec includes These files no longer need pagevec.h, mostly due to function declarations being moved out of it. Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:37:33 -04:00
Rafael Aquini	9bcffbf7e3	mm/mm_init.c: remove reset_node_present_pages() JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit 32b6a4a1745a46918f748f6fb7641e588fbec6f2 Author: Haifeng Xu <haifeng.xu@shopee.com> Date: Wed Jun 7 02:50:56 2023 +0000 mm/mm_init.c: remove reset_node_present_pages() reset_node_present_pages() only get called in hotadd_init_pgdat(), move the action that clear present pages to free_area_init_core_hotplug(), so the helper can be removed. Link: https://lkml.kernel.org/r/20230607025056.1348-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:36:36 -04:00
Rafael Aquini	ae97c9af04	mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat() JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit a668968f84265e698a122656c433809ab9f023fa Author: Haifeng Xu <haifeng.xu@shopee.com> Date: Wed Jun 7 02:45:48 2023 +0000 mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat() managed pages has already been set to 0 in free_area_init_core_hotplug(), via zone_init_internals() on each zone. It's pointless to reset again. Furthermore, reset_node_managed_pages() no longer needs to be exposed outside of mm/memblock.c. Remove declaration in include/linux/memblock.h and define it as static. In addition to this, the only caller of reset_node_managed_pages() is reset_all_zones_managed_pages(), which is annotated with __init, so it should be safe to also mark reset_node_managed_pages() as __init. Link: https://lkml.kernel.org/r/20230607024548.1240-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:36:36 -04:00
Rafael Aquini	d33ec3fa04	mm/sparse: remove unused parameters in sparse_remove_section() JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit bd5f79ab39367665f40e10c2486aa15e7a841490 Author: Yajun Deng <yajun.deng@linux.dev> Date: Wed Jun 7 10:39:52 2023 +0800 mm/sparse: remove unused parameters in sparse_remove_section() These parameters ms and map_offset are not used in sparse_remove_section(), so remove them. The __remove_section() is only called by __remove_pages(), remove it. And put the WARN_ON_ONCE() in sparse_remove_section(). Link: https://lkml.kernel.org/r/20230607023952.2247489-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:36:34 -04:00
Rafael Aquini	1fcdb36810	mm: memory_hotplug: fix format string in warnings JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit 501350459b1fe7a0da6d089484fa112ff48f5252 Author: Rick Wertenbroek <rick.wertenbroek@gmail.com> Date: Wed May 10 11:07:57 2023 +0200 mm: memory_hotplug: fix format string in warnings The format string in __add_pages and __remove_pages has a typo and prints e.g., "Misaligned __add_pages start: 0xfc605 end: #fc609" instead of "Misaligned __add_pages start: 0xfc605 end: 0xfc609" Fix "#%lx" => "%#lx" Link: https://lkml.kernel.org/r/20230510090758.3537242-1-rick.wertenbroek@gmail.com Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:35:23 -04:00
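The difference between the two format strings named in the commit above can be reproduced with standard C `snprintf()`. The helper names `format_end_buggy()` and `format_end_fixed()` are hypothetical and exist only for this sketch; the kernel prints through its own logging helpers rather than `snprintf()` into a caller buffer.

```c
#include <stddef.h>
#include <stdio.h>

/* "#%lx": a literal '#' followed by plain hex, so no "0x" prefix. */
int format_end_buggy(char *buf, size_t len, unsigned long pfn)
{
	return snprintf(buf, len, "end: #%lx", pfn);
}

/* "%#lx": the '#' flag asks for the alternate form, which adds "0x"
 * for non-zero values. */
int format_end_fixed(char *buf, size_t len, unsigned long pfn)
{
	return snprintf(buf, len, "end: %#lx", pfn);
}
```

For the value 0xfc609 quoted in the message, the first helper produces `end: #fc609` and the second `end: 0xfc609`.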
Jeff Moyer	6c95b6a7d6	mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval JIRA: https://issues.redhat.com/browse/RHEL-23824 commit 11684134140bb708b6e6de969a060535630b1b53 Author: Sumanth Korikkar <sumanthk@linux.ibm.com> Date: Wed Jan 10 15:01:27 2024 +0100 mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval set_memmap_mode() stores the kernel parameter memmap mode as an integer. However, the get_memmap_mode() function utilizes param_get_bool() to fetch the value as a boolean, leading to potential endianness issue. On Big-endian architectures, the memmap_on_memory is consistently displayed as 'N' regardless of its actual status. To address this endianness problem, the solution involves obtaining the mode as an integer. This adjustment ensures the proper display of the memmap_on_memory parameter, presenting it as one of the following options: Force, Y, or N. Link: https://lkml.kernel.org/r/20240110140127.241451-1-sumanthk@linux.ibm.com Fixes: 2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks") Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Suggested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:32 -04:00
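The endianness problem described in the commit above can be modelled without big-endian hardware by laying out the bytes explicitly. This sketch assumes a 4-byte integer and a 1-byte boolean read from the lowest address; the helpers `store_u32_le()`, `store_u32_be()`, and `read_first_byte_as_bool()` are hypothetical and do not reproduce the kernel's parameter code.

```c
#include <stdint.h>

/* Store v in little-endian byte order (least significant byte first). */
void store_u32_le(unsigned char out[4], uint32_t v)
{
	out[0] = (unsigned char)(v & 0xff);
	out[1] = (unsigned char)((v >> 8) & 0xff);
	out[2] = (unsigned char)((v >> 16) & 0xff);
	out[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Store v in big-endian byte order (most significant byte first). */
void store_u32_be(unsigned char out[4], uint32_t v)
{
	out[0] = (unsigned char)((v >> 24) & 0xff);
	out[1] = (unsigned char)((v >> 16) & 0xff);
	out[2] = (unsigned char)((v >> 8) & 0xff);
	out[3] = (unsigned char)(v & 0xff);
}

/* Model of reading integer storage through a one-byte boolean:
 * only the byte at the lowest address is inspected. */
int read_first_byte_as_bool(const unsigned char storage[4])
{
	return storage[0] != 0;
}
```

A stored mode of 1 reads back as true from the little-endian layout and as false from the big-endian one, which matches the symptom of the parameter always showing 'N' on big-endian systems.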
Jeff Moyer	64f47c3802	mm/memory_hotplug: export mhp_supports_memmap_on_memory() JIRA: https://issues.redhat.com/browse/RHEL-23824 Conflicts: Contextual difference due to missing commit c5f1e2d18909 ("mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers"). commit 42d9358252e5d055223487d9f653c2a2ac859a2a Author: Vishal Verma <vishal.l.verma@intel.com> Date: Wed Jan 24 12:03:49 2024 -0800 mm/memory_hotplug: export mhp_supports_memmap_on_memory() In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-4-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:31 -04:00
Jeff Moyer	3668a22013	mm/memory_hotplug: split memmap_on_memory requests across memblocks JIRA: https://issues.redhat.com/browse/RHEL-23824 commit 6b8f0798b85aa529011570369db985a788f3003f Author: Vishal Verma <vishal.l.verma@intel.com> Date: Tue Nov 7 00:22:42 2023 -0700 mm/memory_hotplug: split memmap_on_memory requests across memblocks The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. 
[1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752b19@redhat.com/ Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-2-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Fan Ni <fan.ni@samsung.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:31 -04:00
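The chunking loop described in the commit above can be sketched as a plain iteration over memblock-sized pieces. The function `split_into_memblocks()` and its output array are hypothetical; the real add_memory_resource() does far more per chunk than record its start address.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: split [start, start + size) into chunks of
 * memblock_size bytes and record each chunk's start address.
 * Returns the number of chunks, or 0 if the range is not aligned to
 * memblock_size or does not fit into chunk_starts[].
 */
size_t split_into_memblocks(uint64_t start, uint64_t size,
			    uint64_t memblock_size,
			    uint64_t *chunk_starts, size_t max_chunks)
{
	size_t n = 0;
	uint64_t cur;

	if (memblock_size == 0 || start % memblock_size ||
	    size % memblock_size)
		return 0;

	for (cur = start; cur < start + size; cur += memblock_size) {
		if (n == max_chunks)
			return 0;
		chunk_starts[n++] = cur;
	}
	return n;
}
```

A removal path would walk the same boundaries, which is why the message says try_remove_memory() must also expect a range that was split on addition.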
Jeff Moyer	99cb851f91	mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource() JIRA: https://issues.redhat.com/browse/RHEL-23824 commit 82b8a3b49ebde4e7246319884deeb29d6dc1b0cf Author: Vishal Verma <vishal.l.verma@intel.com> Date: Tue Nov 7 00:22:41 2023 -0700 mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource() Patch series "mm: use memmap_on_memory semantics for dax/kmem", v10. The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. The main hurdle for accomplishing this for kmem is that memmap_on_memory can only be done if the memory being added is equal to the size of one memblock. To overcome this, allow the hotplug code to split an add_memory() request into memblock-sized chunks, and try_remove_memory() to also expect and handle such a scenario. Patch 1 replaces an open-coded kmemdup() Patch 2 teaches the memory_hotplug code to allow for splitting add_memory() and remove_memory() requests over memblock sized chunks. Patch 3 allows the dax region drivers to request memmap_on_memory semantics. CXL dax regions default this to 'on', all others default to off to keep existing behavior unchanged. This patch (of 3): A review of the memmap_on_memory modifications to add_memory_resource() revealed an instance of an open-coded kmemdup(). Replace it with kmemdup(). 
Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-0-1253ec050ed0@intel.com Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-1-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:31 -04:00
Jeff Moyer	f43d3fce8e	mm/memory_hotplug: embed vmem_altmap details in memory block JIRA: https://issues.redhat.com/browse/RHEL-23824 Conflicts: Context differences due to different patch application order in RHEL as compared to upstream. Specifically, commits f42ce5f087eb ("mm/memory_hotplug: fix error handling in add_memory_resource()") and 001002e73712 ("mm/memory_hotplug: add missing mem_hotplug_lock"). commit 1a8c64e110435e44e71bcd50a75663174b575f22 Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Date: Tue Aug 8 14:45:01 2023 +0530 mm/memory_hotplug: embed vmem_altmap details in memory block With memmap on memory, some architecture needs more details w.r.t altmap such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of computing them again when we remove a memory block, embed vmem_altmap details in struct memory_block if we are using memmap on memory block feature. [yangyingliang@huawei.com: fix error return code in add_memory_resource()] Link: https://lkml.kernel.org/r/20230809081552.1351184-1-yangyingliang@huawei.com Link: https://lkml.kernel.org/r/20230808091501.287660-7-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:31 -04:00
Jeff Moyer	fe1605fa56	mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks JIRA: https://issues.redhat.com/browse/RHEL-23824 commit 2d1f649c7c0855751c7ff43f4e34784061bc72f7 Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Date: Tue Aug 8 14:44:59 2023 +0530 mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks Currently, memmap_on_memory feature is only supported with memory block sizes that result in vmemmap pages covering full page blocks. This is because memory onlining/offlining code requires applicable ranges to be pageblock-aligned, for example, to set the migratetypes properly. This patch helps to lift that restriction by reserving more pages than required for vmemmap space. This helps the start address to be page block aligned with different memory block sizes. Using this facility implies the kernel will be reserving some pages for every memoryblock. This allows the memmap on memory feature to be widely useful with different memory block size values. For ex: with 64K page size and 256MiB memory block size, we require 4 pages to map vmemmap pages, To align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Link: https://lkml.kernel.org/r/20230808091501.287660-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:31 -04:00
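The figures in the commit above (4096 pages per block, 4 vmemmap pages, 28 reserved pages) can be reproduced with a little arithmetic. Two inputs are assumptions that the message does not state: a 64-byte struct page and a pageblock of 32 pages. They are chosen here because they yield the quoted numbers; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct memmap_layout {
	uint64_t block_pages;    /* pages covered by one memory block */
	uint64_t vmemmap_pages;  /* pages needed to hold the memmap */
	uint64_t reserved_pages; /* extra pages reserved for alignment */
};

/* Compute the memmap footprint for one memory block and the padding
 * needed to round it up to a whole number of pageblocks. */
struct memmap_layout memmap_on_memory_layout(uint64_t page_size,
					     uint64_t block_size,
					     uint64_t struct_page_size,
					     uint64_t pageblock_pages)
{
	struct memmap_layout l;
	uint64_t memmap_bytes, aligned;

	l.block_pages = block_size / page_size;
	memmap_bytes = l.block_pages * struct_page_size;
	/* Round the memmap size up to whole pages. */
	l.vmemmap_pages = (memmap_bytes + page_size - 1) / page_size;
	/* Round the page count up to a pageblock boundary. */
	aligned = (l.vmemmap_pages + pageblock_pages - 1) /
		  pageblock_pages * pageblock_pages;
	l.reserved_pages = aligned - l.vmemmap_pages;
	return l;
}
```

With 64 KiB pages and a 256 MiB block, the block spans 4096 pages, the memmap occupies 4096 × 64 bytes = 256 KiB = 4 pages, and rounding 4 up to 32 leaves 28 reserved pages.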
Jeff Moyer	b12e31cced	mm/memory_hotplug: allow architecture to override memmap on memory support check JIRA: https://issues.redhat.com/browse/RHEL-23824 commit 85a2b4b08f202d67be81e2453064e01572ec19c8 Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Date: Tue Aug 8 14:44:58 2023 +0530 mm/memory_hotplug: allow architecture to override memmap on memory support check Some architectures would want different restrictions. Hence add an architecture-specific override. The PMD_SIZE check is moved there. Link: https://lkml.kernel.org/r/20230808091501.287660-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2024-07-26 14:57:31 -04:00
Mark Langsdorf	96e656309c	mm/memory_hotplug: fix error handling in add_memory_resource() JIRA: https://issues.redhat.com/browse/RHEL-26183 Conflicts: mm/memory_hotplug.c - minor context differences commit f42ce5f087eb69e47294ababd2e7e6f88a82d308 Author: Sumanth Korikkar <sumanthk@linux.ibm.com> Date: Mon Nov 20 15:53:53 2023 +0100 In add_memory_resource(), creation of memory block devices occurs after successful call to arch_add_memory(). However, creation of memory block devices could fail. In that case, arch_remove_memory() is called to perform necessary cleanup. Currently with or without altmap support, arch_remove_memory() is always passed with altmap set to NULL during error handling. This leads to freeing of struct pages using free_pages(), eventhough the allocation might have been performed with altmap support via altmap_alloc_block_buf(). Fix the error handling by passing altmap in arch_remove_memory(). This ensures the following: * When altmap is disabled, deallocation of the struct pages array occurs via free_pages(). * When altmap is enabled, deallocation occurs via vmem_altmap_free(). Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com Fixes: `a08a2ae346` ("mm,memory_hotplug: allocate memmap from the added memory range") Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>	2024-05-31 16:19:45 -04:00
Mark Langsdorf	250213a584	mm/memory_hotplug: add missing mem_hotplug_lock JIRA: https://issues.redhat.com/browse/RHEL-26183 Conflicts: drivers/base/memory.c - minor context differences commit 001002e73712cdf6b8d9a103648cda3040ad7647 Author: Sumanth Korikkar <sumanthk@linux.ibm.com> Date: Wed, 06 Dec 2023 16:12:46 +0000 From Documentation/core-api/memory-hotplug.rst: When adding/removing/onlining/offlining memory or adding/removing heterogeneous/device memory, we should always hold the mem_hotplug_lock in write mode to serialise memory hotplug (e.g. access to global/zone variables). mhp_(de)init_memmap_on_memory() functions can change zone stats and struct page content, but they are currently called w/o the mem_hotplug_lock. When memory block is being offlined and when kmemleak goes through each populated zone, the following theoretical race conditions could occur:
CPU 0:                                        | CPU 1:
memory_offline()                              |
-> offline_pages()                            |
   -> mem_hotplug_begin()                     |
      ...                                     |
   -> mem_hotplug_done()                      |
                                              | kmemleak_scan()
                                              | -> get_online_mems()
                                              |    ...
-> mhp_deinit_memmap_on_memory()              |
   [not protected by mem_hotplug_begin/done()]|
   Marks memory section as offline,           |    Retrieves zone_start_pfn
   poisons vmemmap struct pages and updates   |    and struct page members.
   the zone related data                      |
                                              |    ...
                                              | -> put_online_mems()
Fix this by ensuring mem_hotplug_lock is taken before performing mhp_init_memmap_on_memory(). Also ensure that mhp_deinit_memmap_on_memory() holds the lock. online/offline_pages() are currently only called from memory_block_online/offline(), so it is safe to move the locking there. 
Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com Fixes: `a08a2ae346` ("mm,memory_hotplug: allocate memmap from the added memory range") Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>	2024-05-31 15:37:45 -04:00
Nico Pache	8f27608f4d	mm/memory_hotplug: use pfn math in place of direct struct page manipulation commit 1640a0ef80f6d572725f5b0330038c18e98ea168 Author: Zi Yan <ziy@nvidia.com> Date: Wed Sep 13 16:12:46 2023 -0400 mm/memory_hotplug: use pfn math in place of direct struct page manipulation When dealing with hugetlb pages, manipulating struct page pointers directly can get to wrong struct page, since struct page is not guaranteed to be contiguous on SPARSEMEM without VMEMMAP. Use pfn calculation to handle it properly. Without the fix, a wrong number of page might be skipped. Since skip cannot be negative, scan_movable_page() will end early and might miss a movable page with -ENOENT. This might fail offline_pages(). No bug is reported. The fix comes from code inspection. Link: https://lkml.kernel.org/r/20230913201248.452081-4-zi.yan@sent.com Fixes: `eeb0efd071` ("mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages") Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> JIRA: https://issues.redhat.com/browse/RHEL-5619 Signed-off-by: Nico Pache <npache@redhat.com>	2024-04-30 17:51:28 -06:00
Aristeu Rozanski	d378590287	mm/memory_hotplug: cleanup return value handing in do_migrate_range() JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit 32cf666eab720b597650d9dea6ff07e99cd36b3d Author: SeongJae Park <sj@kernel.org> Date: Thu Feb 16 17:07:03 2023 +0000 mm/memory_hotplug: cleanup return value handing in do_migrate_range() Return value mechanism of do_migrate_range() is not very simple, while no caller of the function checks the return value. Make the function return nothing to be more simple, and cleanup related unnecessary code. Link: https://lkml.kernel.org/r/20230216170703.64574-1-sj@kernel.org Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:24 -04:00
Aristeu Rozanski	837cf9f325	mm: change to return bool for isolate_movable_page() JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit cd7755800eb54e8522f5e51f4e71e6494c1f1572 Author: Baolin Wang <baolin.wang@linux.alibaba.com> Date: Wed Feb 15 18:39:37 2023 +0800 mm: change to return bool for isolate_movable_page() Now the isolate_movable_page() can only return 0 or -EBUSY, and no users will care about the negative return value, thus we can convert the isolate_movable_page() to return a boolean value to make the code more clear when checking the movable page isolation state. No functional changes intended. [akpm@linux-foundation.org: remove unneeded comment, per Matthew] Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:24 -04:00
Aristeu Rozanski	4c96f5154f	mm: change to return bool for isolate_lru_page() JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed Author: Baolin Wang <baolin.wang@linux.alibaba.com> Date: Wed Feb 15 18:39:35 2023 +0800 mm: change to return bool for isolate_lru_page() The isolate_lru_page() can only return 0 or -EBUSY, and most users did not care about the negative error of isolate_lru_page(), except one user in add_page_for_migration(). So we can convert the isolate_lru_page() to return a boolean value, which can help to make the code more clear when checking the return value of isolate_lru_page(). Also convert all users' logic of checking the isolation state. No functional changes intended. Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:23 -04:00
Aristeu Rozanski	edf79d9715	mm/hugetlb: convert isolate_hugetlb to folios JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit 6aa3a920125e9f58891e2b5dc2efd4d0c1ff05a6 Author: Sidhartha Kumar <sidhartha.kumar@oracle.com> Date: Fri Jan 13 16:30:50 2023 -0600 mm/hugetlb: convert isolate_hugetlb to folios Patch series "continue hugetlb folio conversion", v3. This series continues the conversion of core hugetlb functions to use folios. This series converts many helper funtions in the hugetlb fault path. This is in preparation for another series to convert the hugetlb fault code paths to operate on folios. This patch (of 8): Convert isolate_hugetlb() to take in a folio and convert its callers to pass a folio. Use page_folio() to convert the callers to use a folio is safe as isolate_hugetlb() operates on a head page. Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:20 -04:00
Mark Langsdorf	59b0c0257c	mm/memory_hotplug: allow memmap on memory hotplug request to fallback JIRA: https://issues.redhat.com/browse/RHEL-26871 commit e3c2bfdd33a30b34674fb8839f5476ab2702c1c1 Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Date: Mon, 21 Aug 2023 13:37:48 +0000 If not supported, fall back to not using memmap on memory. This avoids the need for callers to do the fallback. Link: https://lkml.kernel.org/r/20230808091501.287660-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>	2024-04-05 17:00:40 -04:00
Paolo Bonzini	0a2ad02005	mm: avoid passing 0 to __ffs() JIRA: https://issues.redhat.com/browse/RHEL-10059 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") results in various boot failures (hang) on arm targets Debug messages reveal the reason. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If start==0, __ffs(start) returns 0xfffffff or (as int) -1, which min_t() interprets as such, while min() apparently uses the returned unsigned long value. Obviously a negative order isn't received well by the rest of the code. [akpm@linux-foundation.org: fix comment, per Mike] Link: https://lkml.kernel.org/r/ZDBa7HWZK69dKKzH@kernel.org Link: https://lkml.kernel.org/r/20230406072529.vupqyrzqnhyozeyh@box.shutemov.name Fixes: 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name> Reported-by: Guenter Roeck <linux@roeck-us.net> Link: https://lkml.kernel.org/r/9460377a-38aa-4f39-ad57-fb73725f92db@roeck-us.net Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 59f876fb9d68a4d8c20305d7a7a0daf4ee9478a8) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-10-30 09:12:42 +01:00
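The hazard described above, computing an order from `__ffs(start)` when `start` is 0, can be modelled in user space. The sketch below is an illustration only, not the backported patch: `ffs_index` and `free_order` are hypothetical names, and treating a zero start as aligned to the largest order is an assumption about how such a guard might look under the inclusive MAX_ORDER definition.

```python
def ffs_index(x):
    """Index of the least significant set bit; undefined (here: an error) for 0."""
    if x == 0:
        raise ValueError("find-first-set is undefined for 0")
    return (x & -x).bit_length() - 1


def free_order(start_pfn, max_order):
    """Largest order that the alignment of start_pfn allows, capped at max_order."""
    if start_pfn == 0:
        # Assumed guard: pfn 0 is aligned to every order, so use the cap directly
        # instead of asking for the first set bit of 0.
        return max_order
    return min(max_order, ffs_index(start_pfn))


print(free_order(0, 10), free_order(8, 10), free_order(1 << 20, 10))  # 10 3 10
```

Without the guard, the zero case would depend on whatever the find-first-set primitive happens to return for 0, which is the failure mode the commit message reports.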
Paolo Bonzini	538bf6f332	mm, treewide: redefine MAX_ORDER sanely JIRA: https://issues.redhat.com/browse/RHEL-10059 MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 23baf831a32c04f9a968812511540b1b3e648bf5) [RHEL: Fix conflicts by changing MAX_ORDER - 1 to MAX_ORDER, ">= MAX_ORDER" to "> MAX_ORDER", etc.] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-10-30 09:12:37 +01:00
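The change in meaning described above can be illustrated with a short sketch. The value 11 is only an example of an old-style MAX_ORDER, and both helper names are hypothetical; the point is that the set of orders the allocator accepts stays the same while the constant drops by one.

```python
def orders_exclusive(max_order):
    """Old meaning: valid orders are 0 .. MAX_ORDER-1."""
    return list(range(0, max_order))


def orders_inclusive(max_order):
    """New meaning: valid orders are 0 .. MAX_ORDER."""
    return list(range(0, max_order + 1))


OLD_MAX_ORDER = 11                 # illustrative value, old exclusive definition
NEW_MAX_ORDER = OLD_MAX_ORDER - 1  # same allocator limit, inclusive definition

# The accepted orders are identical under both definitions.
assert orders_exclusive(OLD_MAX_ORDER) == orders_inclusive(NEW_MAX_ORDER)

# Bounds checks change shape accordingly: "order >= MAX_ORDER" becomes "order > MAX_ORDER".
for order in range(0, 13):
    assert (order >= OLD_MAX_ORDER) == (order > NEW_MAX_ORDER)
```

This mirrors the conflict-resolution note in the backport: `MAX_ORDER - 1` becomes `MAX_ORDER`, and `>= MAX_ORDER` becomes `> MAX_ORDER`.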
Chris von Recklinghausen	8d6289d4f0	mm: add pageblock_aligned() macro JIRA: https://issues.redhat.com/browse/RHEL-1848 commit ee0913c4719610204315a0d8a35122c6233249e0 Author: Kefeng Wang <wangkefeng.wang@huawei.com> Date: Wed Sep 7 14:08:44 2022 +0800 mm: add pageblock_aligned() macro Add pageblock_aligned() and use it to simplify code. Link: https://lkml.kernel.org/r/20220907060844.126891-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:14:19 -04:00
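As a rough user-space model of what a pageblock alignment check does: a pfn is pageblock-aligned when it is a multiple of the number of pages per pageblock. The default of 512 pages per pageblock is an assumption for illustration (order 9 with 4K pages), and these Python helpers only mirror the idea of the macro, not its kernel definition.

```python
def is_aligned(value, alignment):
    """True if value is a multiple of alignment (alignment must be a power of two)."""
    return (value & (alignment - 1)) == 0


def pageblock_aligned(pfn, pageblock_nr_pages=512):
    """Model of a pageblock alignment test; 512 is an assumed pageblock size."""
    return is_aligned(pfn, pageblock_nr_pages)


print(pageblock_aligned(1024), pageblock_aligned(1000))  # True False
```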
Chris von Recklinghausen	39091324ad	mm: fix null-ptr-deref in kswapd_is_running() JIRA: https://issues.redhat.com/browse/RHEL-1848 commit b4a0215e11dcfe23a48c65c6d6c82c0c2c551a48 Author: Kefeng Wang <wangkefeng.wang@huawei.com> Date: Sat Aug 27 19:19:59 2022 +0800 mm: fix null-ptr-deref in kswapd_is_running() kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with kswapd_is_running() in kcompactd(),
kswapd_run/stop()                       kcompactd()
                                          kswapd_is_running()
  pgdat->kswapd // error or normal ptr
                                          verify pgdat->kswapd
                                            // load non-NULL pgdat->kswapd
  pgdat->kswapd = NULL
                                          task_is_running(pgdat->kswapd)
                                            // Null pointer dereference
KASAN reports the null-ptr-deref shown below,
  vmscan: Failed to start kswapd on node 0
  ...
  BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
  Read of size 8 at addr 0000000000000024 by task kcompactd0/37
  CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G OE 5.10.60 #1
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Call trace:
   dump_backtrace+0x0/0x394
   show_stack+0x34/0x4c
   dump_stack+0x158/0x1e4
   __kasan_report+0x138/0x140
   kasan_report+0x44/0xdc
   __asan_load8+0x94/0xd0
   kcompactd+0x440/0x504
   kthread+0x1a4/0x1f0
   ret_from_fork+0x10/0x18
At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected by mem_hotplug_begin/done(), but without kcompactd(). There is no need to involve memory hotplug lock in kcompactd(), so let's add a new mutex to protect pgdat->kswapd accesses. Also, because the kcompactd task will check the state of kswapd task, it's better to call kcompactd_stop() before kswapd_stop() to reduce lock conflicts. 
[akpm@linux-foundation.org: add comments] Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:13:38 -04:00
Mark Langsdorf	951d600205	mm: kill is_memblock_offlined() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178302 commit 639118d1571f70b1157b4bb5ac574b0ab0f38099 Author: Kefeng Wang <wangkefeng.wang@huawei.com> Date: Sat, 27 Aug 2022 19:20:43 +0800 Directly check state of struct memory_block, no need a single function. Link: https://lkml.kernel.org/r/20220827112043.187028-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>	2023-06-08 12:20:27 -04:00
Chris von Recklinghausen	e8d3762b34	mm: use is_zone_movable_page() helper Bugzilla: https://bugzilla.redhat.com/2160210 commit 07252dfea2c7089bca68949710268cbbb0ce509e Author: Kefeng Wang <wangkefeng.wang@huawei.com> Date: Tue Jul 26 21:11:35 2022 +0800 mm: use is_zone_movable_page() helper Use is_zone_movable_page() helper to simplify code. Link: https://lkml.kernel.org/r/20220726131135.146912-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:29 -04:00
Chris von Recklinghausen	5b2e8fce9b	mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory Conflicts: - Documentation/admin-guide/kernel-parameters.txt - We don't have 376e3fdecb0d ("m68k: Enable memtest functionality") so there is a difference in surrounding context Bugzilla: https://bugzilla.redhat.com/2160210 commit 66361095129b3b5d065e6c09cf0c085ef4a8c40f Author: Muchun Song <songmuchun@bytedance.com> Date: Fri Jun 17 21:56:50 2022 +0800 mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory takes precedence over hugetlb_free_vmemmap since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is not wise and elegant. The proper approach is to have hugetlb_vmemmap.c do the check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page can be optimized. [songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive] Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:20 -04:00
Chris von Recklinghausen	8766beb1af	mm: memory_hotplug: enumerate all supported section flags Bugzilla: https://bugzilla.redhat.com/2160210 commit ed7802dd48f7a507213cbb95bb4c6f1fe134eb5d Author: Muchun Song <songmuchun@bytedance.com> Date: Fri Jun 17 21:56:49 2022 +0800 mm: memory_hotplug: enumerate all supported section flags Patch series "make hugetlb_optimize_vmemmap compatible with memmap_on_memory", v3. This series makes hugetlb_optimize_vmemmap compatible with memmap_on_memory. This patch (of 2): We are almost running out of section flags, only one bit is available in the worst case (powerpc with 256k pages). However, there are still some free bits (in ->section_mem_map) on other architectures (e.g. x86_64 has 10 bits available, arm64 has 8 bits available with worst case of 64K pages). We have hard coded those numbers in code, it is inconvenient to use those bits on other architectures except powerpc. So transfer those section flags to enumeration to make it easy to add new section flags in the future. Also, move SECTION_TAINT_ZONE_DEVICE into the scope of CONFIG_ZONE_DEVICE to save a bit on non-zone-device case. [songmuchun@bytedance.com: replace enum with defines per David] Link: https://lkml.kernel.org/r/20220620110616.12056-2-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:20 -04:00
Chris von Recklinghausen	a38734dae1	mm/memory_hotplug: drop 'reason' argument from check_pfn_span() Bugzilla: https://bugzilla.redhat.com/2160210 commit 943189db4f3ed1445dd630dc0b96e115357c4330 Author: Anshuman Khandual <anshuman.khandual@arm.com> Date: Tue May 31 14:34:41 2022 +0530 mm/memory_hotplug: drop 'reason' argument from check_pfn_span() In check_pfn_span(), a 'reason' string is being used to recreate the caller function name, while printing the warning message. It is really unnecessary as the warning message could just be printed inside the caller depending on the return code. Currently there are just two callers for check_pfn_span() i.e __add_pages() and __remove_pages(). Let's clean this up. Link: https://lkml.kernel.org/r/20220531090441.170650-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:14 -04:00
Chris von Recklinghausen	69bfe65709	mm: make alloc_contig_range work at pageblock granularity Bugzilla: https://bugzilla.redhat.com/2160210 commit b2c9e2fbba32539626522b6aed30d1dde7b7e971 Author: Zi Yan <ziy@nvidia.com> Date: Thu May 12 20:22:58 2022 -0700 mm: make alloc_contig_range work at pageblock granularity alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid merging pageblocks with different migratetypes. It might unnecessarily convert extra pageblocks at the beginning and at the end of the range. Change alloc_contig_range() to work at pageblock granularity. Special handling is needed for free pages and in-use pages across the boundaries of the range specified by alloc_contig_range(). Because these partially isolated pages causes free page accounting issues. The free pages will be split and freed into separate migratetype lists; the in-use pages will be migrated then the freed pages will be handled in the aforementioned way. [ziy@nvidia.com: fix deadlock/crash] Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:05 -04:00
Chris von Recklinghausen	c434979693	mm/memory_hotplug: use pgprot_val to get value of pgprot Bugzilla: https://bugzilla.redhat.com/2160210 commit 6366238b8dfc383723c211c1ffe8c8d7914107e5 Author: liusongtang <liusongtang@huawei.com> Date: Mon May 9 18:20:52 2022 -0700 mm/memory_hotplug: use pgprot_val to get value of pgprot pgprot.pgprot is non-portable code. It should be replaced by portable macro pgprot_val. Link: https://lkml.kernel.org/r/20220426071302.220646-1-liusongtang@huawei.com Signed-off-by: liusongtang <liusongtang@huawei.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:02 -04:00
Chris von Recklinghausen	6425f9d3c8	mm/sparse-vmemmap: add a pgmap argument to section activation Bugzilla: https://bugzilla.redhat.com/2160210 commit e3246d8f52173a798710314a42fea83223036fc8 Author: Joao Martins <joao.m.martins@oracle.com> Date: Thu Apr 28 23:16:15 2022 -0700 mm/sparse-vmemmap: add a pgmap argument to section activation Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9. This series minimizes 'struct page' overhead by pursuing a similar approach as Muchun Song series "Free some vmemmap pages of hugetlb page" (now merged since v5.14), but applied to devmap with @vmemmap_shift (device-dax). The vmemmap dedpulication original idea (already used in HugeTLB) is to reuse/deduplicate tail page vmemmap areas, particular the area which only describes tail pages. So a vmemmap page describes 64 struct pages, and the first page for a given ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second vmemmap page would contain only tail pages, and that's what gets reused across the rest of the subsection/section. The bigger the page size, the bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages). This is done for PMEM /specifically only/ on device-dax configured namespaces, not fsdax. In other words, a devmap with a @vmemmap_shift. In terms of savings, per 1Tb of memory, the struct page cost would go down with compound devmap: * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory) * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of total memory) The series is mostly summed up by patch 4, and to summarize what the series does: Patches 1 - 3: Minor cleanups in preparation for patch 4. Move the very nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry. Patch 4: Patch 4 is the one that takes care of the struct page savings (also referred to here as tail-page/vmemmap deduplication). 
Much like Muchun series, we reuse the second PTE tail page vmemmap areas across a given @vmemmap_shift On important difference though, is that contrary to the hugetlbfs series, there's no vmemmap for the area because we are late-populating it as opposed to remapping a system-ram range. IOW no freeing of pages of already initialized vmemmap like the case for hugetlbfs, which greatly simplifies the logic (besides not being arch-specific). altmap case unchanged and still goes via the vmemmap_populate(). Also adjust the newly added docs to the device-dax case. [Note that device-dax is still a little behind HugeTLB in terms of savings. I have an additional simple patch that reuses the head vmemmap page too, as a follow-up. That will double the savings and namespaces initialization.] Patch 5: Initialize fewer struct pages depending on the page size with DRAM backed struct pages -- because fewer pages are unique and most tail pages (with bigger vmemmap_shift). NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectivally. And struct page needed capacity will be 3.8x / 1071x smaller for 2M and 1G respectivelly. Tested on x86 with 1.5Tb of pmem (including pinning, and RDMA registration/deregistration scalability with 2M MRs) This patch (of 5): In support of using compound pages for devmap mappings, plumb the pgmap down to the vmemmap_populate implementation. Note that while altmap is retrievable from pgmap the memory hotplug code passes altmap without pgmap[], so both need to be independently plumbed. 
So in addition to @altmap, pass @pgmap to sparse section populate functions namely: sparse_add_section section_activate populate_section_memmap __populate_section_memmap Passing @pgmap allows __populate_section_memmap() to both fetch the vmemmap_shift in which memmap metadata is created for and also to let sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick whether to just reuse tail pages from past onlined sections. While at it, fix the kdoc for @altmap for sparse_add_section(). [] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/ Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:18:55 -04:00
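The per-huge-page savings quoted in the series description above (6 vmemmap pages for a 2M page, 4094 for a 1G page) follow from simple arithmetic, sketched below in user space. The 4 KiB base page is an assumption; the 64-byte `struct page` follows from the statement that one vmemmap page describes 64 struct pages, and keeping 2 vmemmap pages per compound page reflects the head page plus the one reused tail page described in the message. `vmemmap_pages_saved` is a hypothetical helper name.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def vmemmap_pages_saved(hpage_size, base_page_size=4 * KIB,
                        struct_page_size=64, kept_pages=2):
    """Vmemmap pages that need no unique backing for one compound page."""
    struct_pages = hpage_size // base_page_size
    vmemmap_pages = struct_pages * struct_page_size // base_page_size
    # The first vmemmap page (head + tails) and one shared tail page are kept.
    return vmemmap_pages - kept_pages


print(vmemmap_pages_saved(2 * MIB), vmemmap_pages_saved(1 * GIB))  # 6 4094
```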
Nico Pache	aa28eb0c17	mm/migration: return errno when isolate_huge_page failed commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8 Author: Miaohe Lin <linmiaohe@huawei.com> Date: Mon May 30 19:30:15 2022 +0800 mm/migration: return errno when isolate_huge_page failed We might fail to isolate huge page due to e.g. the page is under migration which cleared HPageMigratable. We should return errno in this case rather than always return 1 which could confuse the user, i.e. the caller might think all of the memory is migrated while the hugetlb page is left behind. We make the prototype of isolate_huge_page consistent with isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve the readability. Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com Fixes: `e8db67eb0d` ("mm: migrate: move_pages() supports thp migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang Ying <ying.huang@intel.com> Reported-by: kernel test robot <lkp@intel.com> (build error) Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498 Signed-off-by: Nico Pache <npache@redhat.com>	2022-11-08 10:11:39 -07:00
Chris von Recklinghausen	0a4dfa42a3	mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl Bugzilla: https://bugzilla.redhat.com/2120352 commit 78f39084b41d287aedb2ea55f2c1895cfa11d61a Author: Muchun Song <songmuchun@bytedance.com> Date: Fri May 13 16:48:56 2022 -0700 mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and reboot the server to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. However, rebooting usually takes a long time. So add a sysctl to enable or disable the feature at runtime without rebooting. Why we need this? There are 3 use cases. 1) The feature of minimizing overhead of struct page associated with each HugeTLB is disabled by default without passing "hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance) deliver the servers to the users who want to enable this feature, they have to configure the grub (change boot cmdline) and reboot the servers, whereas rebooting usually takes a long time (we have thousands of servers). It's a very bad experience for the users. So we need a approach to enable this feature after rebooting. This is a use case in our practical environment. 2) Some use cases are that HugeTLB pages are allocated 'on the fly' instead of being pulled from the HugeTLB pool, those workloads would be affected with this feature enabled. Those workloads could be identified by the characteristics of they never explicitly allocating huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let the pages be allocated from the buddy allocator at fault time. We can confirm it is a real use case from the commit `099730d674`. For those workloads, the page fault time could be ~2x slower than before. We suspect those users want to disable this feature if the system has enabled this before and they don't think the memory savings benefit is enough to make up for the performance drop. 
3) If the workload which wants vmemmap pages to be optimized and the workload which wants to set 'nr_overcommit_hugepages' and does not want the extera overhead at fault time when the overcommitted pages be allocated from the buddy allocator are deployed in the same server. The user could enable this feature and set 'nr_hugepages' and 'nr_overcommit_hugepages', then disable the feature. In this case, the overcommited HugeTLB pages will not encounter the extra overhead at fault time. Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:28:08 -04:00
Chris von Recklinghausen	9c778c3f93	mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on Bugzilla: https://bugzilla.redhat.com/2120352 commit 6e02c46b4d970f24eb51197d451b00f08a8a4186 Author: Muchun Song <songmuchun@bytedance.com> Date: Fri May 13 16:48:56 2022 -0700 mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on Optimizing HugeTLB vmemmap pages is not compatible with allocating memmap on hot added memory. If "hugetlb_free_vmemmap=on" and memory_hotplug.memmap_on_memory" are both passed on the kernel command line, optimizing hugetlb pages takes precedence. However, the global variable memmap_on_memory will still be set to 1, even though we will not try to allocate memmap on hot added memory. Also introduce mhp_memmap_on_memory() helper to move the definition of "memmap_on_memory" to the scope of CONFIG_MHP_MEMMAP_ON_MEMORY. In the next patch, mhp_memmap_on_memory() will also be exported to be used in hugetlb_vmemmap.c. Link: https://lkml.kernel.org/r/20220512041142.39501-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:28:08 -04:00
Chris von Recklinghausen	fcee10fb9d	mm/memory_hotplug: fix misplaced comment in offline_pages Bugzilla: https://bugzilla.redhat.com/2120352 commit 36ba30bc1df252ed65b8cae514b514985a7593c9 Author: Miaohe Lin <linmiaohe@huawei.com> Date: Tue Mar 22 14:47:24 2022 -0700 mm/memory_hotplug: fix misplaced comment in offline_pages It's misplaced since commit `7960509329` ("mm, memory_hotplug: print reason for the offlining failure"). Move it to the right place. Link: https://lkml.kernel.org/r/20220207133643.23427-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:55 -04:00
Chris von Recklinghausen	c39a39d7a0	mm/memory_hotplug: clean up try_offline_node Bugzilla: https://bugzilla.redhat.com/2120352 commit b27340a5bda4e35453d186e25622bacc3cf595c2 Author: Miaohe Lin <linmiaohe@huawei.com> Date: Tue Mar 22 14:47:22 2022 -0700 mm/memory_hotplug: clean up try_offline_node We can use helper macro node_spanned_pages to check whether node spans pages. And we can change the parameter of check_cpu_on_node to nid as that's what it really cares. Thus we can further get rid of the local variable pgdat and improve the readability a bit. Link: https://lkml.kernel.org/r/20220207133643.23427-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:54 -04:00
Chris von Recklinghausen	2b96934bcf	mm/memory_hotplug: avoid calling zone_intersects() for ZONE_NORMAL Bugzilla: https://bugzilla.redhat.com/2120352 commit d6aad2016a3f902153d7b8b7e02da2c7c50c10a4 Author: Miaohe Lin <linmiaohe@huawei.com> Date: Tue Mar 22 14:47:19 2022 -0700 mm/memory_hotplug: avoid calling zone_intersects() for ZONE_NORMAL If zid reaches ZONE_NORMAL, the caller will always get the NORMAL zone no matter what zone_intersects() returns. So we can save some possible cpu cycles by avoid calling zone_intersects() for ZONE_NORMAL. Link: https://lkml.kernel.org/r/20220207133643.23427-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:54 -04:00
Chris von Recklinghausen	ceef7638ac	mm/memory_hotplug: remove obsolete comment of __add_pages Bugzilla: https://bugzilla.redhat.com/2120352 commit 2b6bf15f464620bbee08e8257d11f8f9f7f8725f Author: Miaohe Lin <linmiaohe@huawei.com> Date: Tue Mar 22 14:47:16 2022 -0700 mm/memory_hotplug: remove obsolete comment of __add_pages Patch series "A few cleanup patches around memory_hotplug". This series contains a few patches to fix obsolete and misplaced comments, clean up the try_offline_node function and so on. This patch (of 4): Since commit `f1dd2cd13c` ("mm, memory_hotplug: do not associate hotadded memory to zones until online"), there is no need to pass in the zone. [akpm@linux-foundation.org: remove the comment altogether, per David] Link: https://lkml.kernel.org/r/20220207133643.23427-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220207133643.23427-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:54 -04:00
Chris von Recklinghausen	0bf26d7297	mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED Bugzilla: https://bugzilla.redhat.com/2120352 commit 32befe9e27859b87e70e0aba9b60bfb8000d9a66 Author: David Hildenbrand <david@redhat.com> Date: Fri Nov 5 13:44:56 2021 -0700 mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED Let's communicate driver-managed regions to memblock, to properly teach kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on these memory regions. Link: https://lkml.kernel.org/r/20211004093605.5830-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jianyong Wu <Jianyong.Wu@arm.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shahab Vahedi <shahab@synopsys.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:30 -04:00
Chris von Recklinghausen	3cbb272e07	memblock: allow to specify flags with memblock_add_node() Bugzilla: https://bugzilla.redhat.com/2120352 commit 952eea9b01e4bbb7011329f1b7240844e61e5128 Author: David Hildenbrand <david@redhat.com> Date: Fri Nov 5 13:44:49 2021 -0700 memblock: allow to specify flags with memblock_add_node() We want to specify flags when hotplugging memory. Let's prepare to pass flags to memblock_add_node() by adjusting all existing users. Note that when hotplugging memory the system is already up and running and we might have concurrent memblock users: for example, while we're hotplugging memory, kexec_file code might search for suitable memory regions to place kexec images. It's important to add the memory directly to memblock via a single call with the right flags, instead of adding the memory first and apply flags later: otherwise, concurrent memblock users might temporarily stumble over memblocks with wrong flags, which will be important in a follow-up patch that introduces a new flag to properly handle add_memory_driver_managed(). Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Shahab Vahedi <shahab@synopsys.com> [arch/arc] Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jianyong Wu <Jianyong.Wu@arm.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:30 -04:00
Chris von Recklinghausen	6e1642b5da	mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() Bugzilla: https://bugzilla.redhat.com/2120352 commit 53d38316ab2017a7c0d733765b521700aa357ec9 Author: David Hildenbrand <david@redhat.com> Date: Fri Nov 5 13:44:42 2021 -0700 mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() Patch series "mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK", v2. Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64, don't cleanly support add_memory_driver_managed() yet. Most prominently, kexec_file can still end up placing kexec images on such driver-managed memory, resulting in undesired behavior, for example, having kexec images located on memory not part of the firmware-provided memory map. Teaching kexec to not place images on driver-managed memory is especially relevant for virtio-mem. Details can be found in commit `7b7b27214b` ("mm/memory_hotplug: introduce add_memory_driver_managed()"). Extend memblock with a new flag and set it from memory hotplug code when applicable. This is required to fully support virtio-mem on arm64, making also kexec_file behave like on x86-64. This patch (of 2): If memblock_add_node() fails, we're most probably running out of memory. While this is unlikely to happen, it can happen and having memory added without a memblock can be problematic for architectures that use memblock to detect valid memory. Let's fail in a nice way instead of silently ignoring the error. Link: https://lkml.kernel.org/r/20211004093605.5830-1-david@redhat.com Link: https://lkml.kernel.org/r/20211004093605.5830-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Jianyong Wu <Jianyong.Wu@arm.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Shahab Vahedi <shahab@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:30 -04:00
Chris von Recklinghausen	efa841a1c2	mm/memory_hotplug: remove HIGHMEM leftovers Bugzilla: https://bugzilla.redhat.com/2120352 commit 6b740c6c3aa371cd70ac07f8d071f2a8af28c51c Author: David Hildenbrand <david@redhat.com> Date: Fri Nov 5 13:44:31 2021 -0700 mm/memory_hotplug: remove HIGHMEM leftovers We don't support CONFIG_MEMORY_HOTPLUG on 32 bit and consequently not HIGHMEM. Let's remove any leftover code -- including the unused "status_change_nid_high" field part of the memory notifier. Link: https://lkml.kernel.org/r/20210929143600.49379-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alex Shi <alexs@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:30 -04:00
Chris von Recklinghausen	639c8559af	mm/memory_hotplug: add static qualifier for online_policy_to_str() Bugzilla: https://bugzilla.redhat.com/2120352 commit ac62554ba7060c8824d7821c1342673f1e13c31d Author: Tang Yizhou <tangyizhou@huawei.com> Date: Fri Nov 5 13:44:08 2021 -0700 mm/memory_hotplug: add static qualifier for online_policy_to_str() online_policy_to_str is only used in memory_hotplug.c and should be defined as static. Link: https://lkml.kernel.org/r/20210913024534.26161-1-tangyizhou@huawei.com Signed-off-by: Tang Yizhou <tangyizhou@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:30 -04:00

576 Commits