Commit Graph

313 Commits

Author SHA1 Message Date
Rafael Aquini c8c9c0b259 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * arch/*/Kconfig: all hunks dropped as there were only text blurbs and comments
     being changed with no functional changes whatsoever, and RHEL9 is missing
     several (unrelated) commits to these arches that transform the text blurbs in
     the way these non-functional hunks were expecting;
  * drivers/accel/qaic/qaic_data.c: hunk dropped due to RHEL-only commit
     083c0cdce2 ("Merge DRM changes from upstream v6.8..v6.9");
  * drivers/gpu/drm/i915/gem/selftests/huge_pages.c: hunk dropped due to RHEL-only
     commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * drivers/gpu/drm/ttm/tests/ttm_pool_test.c: all hunks dropped due to RHEL-only
     commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * drivers/video/fbdev/vermilion/vermilion.c: hunk dropped as RHEL9 misses
     commit dbe7e429fe ("vmlfb: framebuffer driver for Intel Vermilion Range");
  * include/linux/pageblock-flags.h: differences due to out-of-order backport
    of upstream commits 72801513b2bf ("mm: set pageblock_order to HPAGE_PMD_ORDER
    in case with !CONFIG_HUGETLB_PAGE but THP enabled"), and 3a7e02c040b1
    ("minmax: avoid overly complicated constant expressions in VM code");
  * mm/mm_init.c: differences in the 3rd and 4th hunks are due to RHEL
     backport commit 1845b92dcf ("mm: move most of core MM initialization to
     mm/mm_init.c") ignoring the out-of-order backport of commit 3f6dac0fd1b8
     ("mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks")
     thus partially reverting the changes introduced by the latter;

This patch is a backport of the following upstream commit:
commit 5e0a760b44417f7cadd79de2204d6247109558a0
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Thu Dec 28 17:47:04 2023 +0300

    mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

    commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
    changed the definition of MAX_ORDER to be inclusive.  This has caused
    issues with code that was not yet upstream and depended on the previous
    definition.

    To draw attention to the altered meaning of the define, rename MAX_ORDER
    to MAX_PAGE_ORDER.

    Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:17 -05:00
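
For context, a minimal sketch (not from the patch) of how the two treewide
changes affect a typical loop over buddy orders; MAX_ORDER was first made
inclusive by 23baf831a32c and then renamed by this commit (try_order() is a
placeholder for the per-order work):

    /* before 23baf831a32c: MAX_ORDER is exclusive */
    for (order = 0; order < MAX_ORDER; order++)
            try_order(order);

    /* after 23baf831a32c: MAX_ORDER is inclusive */
    for (order = 0; order <= MAX_ORDER; order++)
            try_order(order);

    /* after this rename: same inclusive semantics, clearer name */
    for (order = 0; order <= MAX_PAGE_ORDER; order++)
            try_order(order);
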
Rafael Aquini 51ea23f932 mm: disable kernelcore=mirror when no mirror memory
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * mm/internal.h: context difference due to a series of conflict resolutions
      for out-of-order backports, which made "mirrored_kernelcore" end up in a
      slightly different context than its upstream placement. We leverage this
      backport to move it into the "right" place.

This patch is a backport of the following upstream commit:
commit 0db31d63f27e5b8ca84b9fd5a3cff5b12ac88abf
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Wed Aug 2 15:23:28 2023 +0800

    mm: disable kernelcore=mirror when no mirror memory

    For a system with kernelcore=mirror enabled but no mirrored memory
    reported by EFI, the kernel can OOM during startup, since all memory
    besides zone DMA is placed in the movable zone, which prevents the
    kernel from using it.

    Zone DMA/DMA32 initialization is independent of mirrored memory and their
    max pfn is set in zone_sizes_init().  Since the kernel can fall back to
    zone DMA/DMA32 if there is no memory in zone Normal, these zones are
    treated as mirrored memory no matter what their memory attributes are.

    To solve this problem, disable kernelcore=mirror when no real mirrored
    memory exists.

    Link: https://lkml.kernel.org/r/20230802072328.2107981-1-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Suggested-by: Mike Rapoport <rppt@kernel.org>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Levi Yun <ppbuk5246@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:14 -04:00
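
The fix amounts to an early bail-out plus a small helper; a sketch, using the
memblock_has_mirror() helper this commit introduces:

    /* mm/memblock.c: helper added by this commit (sketch) */
    bool __init_memblock memblock_has_mirror(void)
    {
            return system_has_some_mirror;
    }

    /* mm/mm_init.c (sketch): give up on mirror handling early when
     * firmware reported no mirrored memory at all */
    if (mirrored_kernelcore && !memblock_has_mirror()) {
            pr_warn("The system has no mirror memory, ignore kernelcore=mirror.\n");
            goto out;
    }
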
Rafael Aquini 51d3fa6c4d Revert "mm,memblock: reset memblock.reserved to system init state to prevent UAF"
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit c442a957b2f4e116f28aeb55bf2719cb7bb2ad60
Author: Mike Rapoport (IBM) <rppt@kernel.org>
Date:   Fri Jul 28 13:55:12 2023 +0300

    Revert "mm,memblock: reset memblock.reserved to system init state to prevent UAF"

    This reverts commit 9e46e4dcd9d6cd88342b028dbfa5f4fb7483d39c.

    kbuild reports a warning in memblock_remove_region() because of a false
    positive caused by partial reset of the memblock state.

    Doing the full reset will remove the false positives, but will allow
    late use of memblock_free() to go unnoticed, so it is better to revert
    the offending commit.

       WARNING: CPU: 0 PID: 1 at mm/memblock.c:352 memblock_remove_region (kbuild/src/x86_64/mm/memblock.c:352 (discriminator 1))
       Modules linked in:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc3-00001-g9e46e4dcd9d6 #2
       RIP: 0010:memblock_remove_region (kbuild/src/x86_64/mm/memblock.c:352 (discriminator 1))
       Call Trace:
         memblock_discard (kbuild/src/x86_64/mm/memblock.c:383)
         page_alloc_init_late (kbuild/src/x86_64/include/linux/find.h:208 kbuild/src/x86_64/include/linux/nodemask.h:266 kbuild/src/x86_64/mm/mm_init.c:2405)
         kernel_init_freeable (kbuild/src/x86_64/init/main.c:1325 kbuild/src/x86_64/init/main.c:1546)
         kernel_init (kbuild/src/x86_64/init/main.c:1439)
         ret_from_fork (kbuild/src/x86_64/arch/x86/kernel/process.c:145)
         ret_from_fork_asm (kbuild/src/x86_64/arch/x86/entry/entry_64.S:298)

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Closes: https://lore.kernel.org/oe-lkp/202307271656.447aa17e-oliver.sang@intel.com
    Signed-off-by: "Mike Rapoport (IBM)" <rppt@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:45 -04:00
Rafael Aquini c3a75b7a7a mm,memblock: reset memblock.reserved to system init state to prevent UAF
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 9e46e4dcd9d6cd88342b028dbfa5f4fb7483d39c
Author: Rik van Riel <riel@surriel.com>
Date:   Wed Jul 19 15:41:37 2023 -0400

    mm,memblock: reset memblock.reserved to system init state to prevent UAF

    The memblock_discard function frees the memblock.reserved.regions
    array, which is good.

    However, if a subsequent memblock_free() (or memblock_phys_free()) comes
    in later, from for example ima_free_kexec_buffer, the result is a
    use-after-free bug in memblock_isolate_range().

    When running a kernel with CONFIG_KASAN enabled, this will cause a
    kernel panic very early in boot. Without CONFIG_KASAN, there is
    a chance that memblock_isolate_range might scribble on memory
    that is now in use by somebody else.

    Avoid those issues by making sure that memblock_discard points
    memblock.reserved.regions back at the static buffer.

    If memblock_free is called after memblock memory is discarded, that will
    print a warning in memblock_remove_region.

    Signed-off-by: Rik van Riel <riel@surriel.com>
    Link: https://lore.kernel.org/r/20230719154137.732d8525@imladris.surriel.com
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:44 -04:00
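
A sketch of the core of the change (simplified from the patch): after
memblock_discard() frees the dynamically allocated array, point
memblock.reserved back at the static init array, so that a late
memblock_free() trips the warning in memblock_remove_region() instead of
touching freed memory:

    /* mm/memblock.c, end of memblock_discard(), sketch */
    memblock.reserved.regions = memblock_reserved_init_regions;
    memblock.reserved.cnt = 1;
    memblock_remove_region(&memblock.reserved, 0);

As the revert above explains, this partial reset later produced false-positive
warnings and the commit was backed out.
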
Rafael Aquini ae97c9af04 mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit a668968f84265e698a122656c433809ab9f023fa
Author: Haifeng Xu <haifeng.xu@shopee.com>
Date:   Wed Jun 7 02:45:48 2023 +0000

    mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat()

    Managed pages have already been set to 0 in free_area_init_core_hotplug(),
    via zone_init_internals() on each zone.  It's pointless to reset them again.

    Furthermore, reset_node_managed_pages() no longer needs to be exposed
    outside of mm/memblock.c.  Remove declaration in include/linux/memblock.h
    and define it as static.

    In addition to this, the only caller of reset_node_managed_pages() is
    reset_all_zones_managed_pages(), which is annotated with __init, so it
    should be safe to also mark reset_node_managed_pages() as __init.

    Link: https://lkml.kernel.org/r/20230607024548.1240-1-haifeng.xu@shopee.com
    Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:36 -04:00
Rafael Aquini b0059d1858 memblock: Update nid info in memblock debugfs
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit de649e7f5edb2e61dbd3d64deae44cb165e657ad
Author: Yuwei Guan <ssawgyw@gmail.com>
Date:   Thu Jun 1 21:31:49 2023 +0800

    memblock: Update nid info in memblock debugfs

    The node id shown for memblock reserved regions would otherwise be wrong,
    so let's show 'x' for reg->nid == MAX_NUMNODES in debugfs to keep the
    output aligned.

    Suggested-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Co-developed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Yuwei Guan <ssawgyw@gmail.com>
    Link: https://lore.kernel.org/r/20230601133149.37160-1-ssawgyw@gmail.com
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:23 -04:00
Rafael Aquini e7d4f1bda1 memblock: Add flags and nid info in memblock debugfs
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 493f349e38d022057b3b6e13f589f108269c42b0
Author: Yuwei Guan <ssawgyw@gmail.com>
Date:   Fri May 19 18:53:21 2023 +0800

    memblock: Add flags and nid info in memblock debugfs

    Currently, the memblock debugfs can display the count of memblock_type and
    the base and end of the reg. However, when memblock_mark_*() or
    memblock_set_node() is executed on some range, the information in the
    existing debugfs cannot make it clear why the address is not consecutive.

    For example,
    cat /sys/kernel/debug/memblock/memory
       0: 0x0000000080000000..0x00000000901fffff
       1: 0x0000000090200000..0x00000000905fffff
       2: 0x0000000090600000..0x0000000092ffffff
       3: 0x0000000093000000..0x00000000973fffff
       4: 0x0000000097400000..0x00000000b71fffff
       5: 0x00000000c0000000..0x00000000dfffffff
       6: 0x00000000e2500000..0x00000000f87fffff
       7: 0x00000000f8800000..0x00000000fa7fffff
       8: 0x00000000fa800000..0x00000000fd3effff
       9: 0x00000000fd3f0000..0x00000000fd3fefff
      10: 0x00000000fd3ff000..0x00000000fd7fffff
      11: 0x00000000fd800000..0x00000000fd901fff
      12: 0x00000000fd902000..0x00000000fd909fff
      13: 0x00000000fd90a000..0x00000000fd90bfff
      14: 0x00000000fd90c000..0x00000000ffffffff
      15: 0x0000000880000000..0x0000000affffffff

    So we can add flags and nid to this debugfs.

    For example,
    cat /sys/kernel/debug/memblock/memory
       0: 0x0000000080000000..0x00000000901fffff    0 NONE
       1: 0x0000000090200000..0x00000000905fffff    0 NOMAP
       2: 0x0000000090600000..0x0000000092ffffff    0 NONE
       3: 0x0000000093000000..0x00000000973fffff    0 NOMAP
       4: 0x0000000097400000..0x00000000b71fffff    0 NONE
       5: 0x00000000c0000000..0x00000000dfffffff    0 NONE
       6: 0x00000000e2500000..0x00000000f87fffff    0 NONE
       7: 0x00000000f8800000..0x00000000fa7fffff    0 NOMAP
       8: 0x00000000fa800000..0x00000000fd3effff    0 NONE
       9: 0x00000000fd3f0000..0x00000000fd3fefff    0 NOMAP
      10: 0x00000000fd3ff000..0x00000000fd7fffff    0 NONE
      11: 0x00000000fd800000..0x00000000fd901fff    0 NOMAP
      12: 0x00000000fd902000..0x00000000fd909fff    0 NONE
      13: 0x00000000fd90a000..0x00000000fd90bfff    0 NOMAP
      14: 0x00000000fd90c000..0x00000000ffffffff    0 NONE
      15: 0x0000000880000000..0x0000000affffffff    0 NONE

    Signed-off-by: Yuwei Guan <ssawgyw@gmail.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Link: https://lore.kernel.org/r/20230519105321.333-1-ssawgyw@gmail.com
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:40 -04:00
Rafael Aquini 14978763a6 Fix some coding style errors in memblock.c
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit fc493f83a25835c14cd96379c1a07459230881bc
Author: Claudio Migliorelli <claudio.migliorelli@mail.polimi.it>
Date:   Sun Apr 23 15:29:35 2023 +0200

    Fix some coding style errors in memblock.c

    This patch removes the initialization of some static variables to 0 and
    `false` in the memblock source file, according to the coding style
    guidelines.

    Signed-off-by: Claudio Migliorelli <claudio.migliorelli@mail.polimi.it>
    Link: https://lore.kernel.org/r/87r0sa7mm8.fsf@mail.polimi.it
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:13 -04:00
Eric Chanudet 5eadecd371 memblock: fix crash when reserved memory is not added to memory
JIRA: https://issues.redhat.com/browse/RHEL-36126
Conflicts: backported out-of-order, before
    commit 77e6c43e137c ("memblock: introduce MEMBLOCK_RSRV_NOINIT flag"),
    which adds MEMBLOCK_RSRV_NOINIT and checks it in the loop.

commit 6a9531c3a88096a26cf3ac582f7ec44f94a7dcb2
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Thu Jan 18 14:18:53 2024 +0800

    memblock: fix crash when reserved memory is not added to memory

    After commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")
    nid of a reserved region is used by init_reserved_page() (with
    CONFIG_DEFERRED_STRUCT_PAGE_INIT=y) to access node strucure.
    In many cases the nid of the reserved memory is not set and this causes
    a crash.

    When the nid of a reserved region is not set, fall back to
    early_pfn_to_nid(), so that nid of the first_online_node will be passed
    to init_reserved_page().

    Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")
    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Link: https://lore.kernel.org/r/20240118061853.2652295-1-yajun.deng@linux.dev
    [rppt: massaged the commit message]
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Eric Chanudet <echanude@redhat.com>
2024-05-21 14:18:30 -04:00
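
A sketch of the fallback described above, using names from mm/memblock.c
(simplified; the RHEL backport drops the MEMBLOCK_RSRV_NOINIT check as noted
in the conflicts):

    /* memmap_init_reserved_pages(), sketch: resolve a usable nid per
     * reserved region before calling reserve_bootmem_region() */
    for_each_reserved_mem_region(region) {
            nid = memblock_get_region_node(region);
            start = region->base;
            end = start + region->size;

            if (nid == NUMA_NO_NODE || nid >= MAX_NUMNODES)
                    nid = early_pfn_to_nid(PFN_DOWN(start));

            reserve_bootmem_region(start, end, nid);
    }
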
Eric Chanudet 2a44683bfc mm: pass nid to reserve_bootmem_region()
JIRA: https://issues.redhat.com/browse/RHEL-36126

commit 61167ad5fecdeaa037f3df1ba354dddd5f66a1ed
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Mon Jun 19 10:34:06 2023 +0800

    mm: pass nid to reserve_bootmem_region()

    early_pfn_to_nid() is called frequently in init_reserved_page(), it
    returns the node id of the PFN.  These PFN are probably from the same
    memory region, they have the same node id.  It's not necessary to call
    early_pfn_to_nid() for each PFN.

    Pass nid to reserve_bootmem_region() and drop the call to
    early_pfn_to_nid() in init_reserved_page().  Also, set nid on all reserved
    pages before doing this, as some reserved memory regions may not have
    their nid set.

    The most beneficial function is memmap_init_reserved_pages() if
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

    The following data was tested on an x86 machine with 190GB of RAM.

    before:
    memmap_init_reserved_pages()  67ms

    after:
    memmap_init_reserved_pages()  20ms

    Link: https://lkml.kernel.org/r/20230619023406.424298-1-yajun.deng@linux.dev
    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Eric Chanudet <echanude@redhat.com>
2024-05-21 14:18:30 -04:00
Aristeu Rozanski 43880a624f memblock: Avoid useless checks in memblock_merge_regions().
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 2fe03412e2e1be3d5ab37b8351a37c3aec506556
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date:   Sun Jan 29 17:00:34 2023 +0800

    memblock: Avoid useless checks in memblock_merge_regions().

    memblock_merge_regions() is called after regions have been modified to
    merge the neighboring compatible regions. That will check all regions
    but most checks are useless.

    Most of the time we only insert one or a few new regions, or modify one or
    a few regions. At this time, we don't need to check all the regions. We
    only need to check the changed regions, because other not related regions
    cannot be merged.

    Add two parameters to memblock_merge_regions() to indicate the lower and
    upper boundary to scan.

    Debug code that counts the number of total iterations in
    memblock_merge_regions(), like for instance

    void memblock_merge_regions(struct memblock_type *type)
    {
            static int iteration_count = 0;
            static int max_nr_regions = 0;

            max_nr_regions = max(max_nr_regions, (int)type->cnt);
            ...
            while (i < type->cnt - 1) {    /* the merge scan loop */
                    iteration_count++;
                    ...
            }
            pr_info("iteration_count: %d max_nr_regions %d", iteration_count,
                    max_nr_regions);
    }

    Produces the following numbers on a physical machine with 1T of memory:

    before: [2.472243] iteration_count: 45410 max_nr_regions 178
    after:  [2.470869] iteration_count: 923 max_nr_regions 176

    The actual startup speed seems to change little, but it does reduce the
    scan overhead.

    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Link: https://lore.kernel.org/r/20230129090034.12310-3-zhangpeng.00@bytedance.com
    [rppt: massaged the changelog]
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:04 -04:00
Aristeu Rozanski fa97bec97f memblock: Make a boundary tighter in memblock_add_range().
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit ad500fb2d11b3739dcbc17a31976828b9161ecf5
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date:   Sun Jan 29 17:00:33 2023 +0800

    memblock: Make a boundary tighter in memblock_add_range().

    When type->cnt * 2 + 1 is less than or equal to type->max, there are
    enough empty region slots to insert into directly.

    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Link: https://lore.kernel.org/r/20230129090034.12310-2-zhangpeng.00@bytedance.com
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:04 -04:00
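
For reference, a sketch of the fast-path check in memblock_add_range() that
this commit tightens (the comparison was previously a strict <):

    /*
     * Worst case: the new range overlaps all existing regions, so it is
     * split around each of them and at most type->cnt + 1 regions are
     * added, for a total of type->cnt * 2 + 1 occupied slots.
     */
    if (type->cnt * 2 + 1 <= type->max)
            insert = true;  /* enough free slots, insert directly */
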
Lucas Zampieri 6f794c0e0b
Merge: MM update to v6.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3738

JIRA: https://issues.redhat.com/browse/RHEL-27739

Depends: !3662

Dropped Patches and the reason they were dropped:

Needs to be evaluated by the FS team:
138060ba92b3 ("fs: pass dentry to set acl method")
3b4c7bc01727 ("xattr: use rbtree for simple_xattrs")

Needs to be evaluated by the NVME team:
4003f107fa2e ("mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages")

Needs to be evaluated by the ZRAM team:
7c2af309abd2 ("zram: add size class equals check into recompression")

Signed-off-by: Audra Mitchell <audra@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-17 10:14:56 -03:00
Audra Mitchell 2e80e2d6e8 Revert "mm: Always release pages to the buddy allocator in memblock_free_late()."
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 647037adcad00f2bab8828d3d41cd0553d41f3bd
Author: Aaron Thompson <dev@aaront.org>
Date:   Tue Feb 7 08:21:51 2023 +0000

    Revert "mm: Always release pages to the buddy allocator in memblock_free_late()."

    This reverts commit 115d9d77bb0f9152c60b6e8646369fa7f6167593.

    The pages being freed by memblock_free_late() have already been
    initialized, but if they are in the deferred init range,
    __free_one_page() might access nearby uninitialized pages when trying to
    coalesce buddies. This can, for example, trigger this BUG:

      BUG: unable to handle page fault for address: ffffe964c02580c8
      RIP: 0010:__list_del_entry_valid+0x3f/0x70
       <TASK>
       __free_one_page+0x139/0x410
       __free_pages_ok+0x21d/0x450
       memblock_free_late+0x8c/0xb9
       efi_free_boot_services+0x16b/0x25c
       efi_enter_virtual_mode+0x403/0x446
       start_kernel+0x678/0x714
       secondary_startup_64_no_verify+0xd2/0xdb
       </TASK>

    A proper fix will be more involved so revert this change for the time
    being.

    Fixes: 115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
    Signed-off-by: Aaron Thompson <dev@aaront.org>
    Link: https://lore.kernel.org/r/20230207082151.1303-1-dev@aaront.org
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:03 -04:00
Audra Mitchell d3c2b38fbd mm: Always release pages to the buddy allocator in memblock_free_late().
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 115d9d77bb0f9152c60b6e8646369fa7f6167593
Author: Aaron Thompson <dev@aaront.org>
Date:   Fri Jan 6 22:22:44 2023 +0000

    mm: Always release pages to the buddy allocator in memblock_free_late().

    If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
    only releases pages to the buddy allocator if they are not in the
    deferred range. This is correct for free pages (as defined by
    for_each_free_mem_pfn_range_in_zone()) because free pages in the
    deferred range will be initialized and released as part of the deferred
    init process. memblock_free_pages() is called by memblock_free_late(),
    which is used to free reserved ranges after memblock_free_all() has
    run. All pages in reserved ranges have been initialized at that point,
    and accordingly, those pages are not touched by the deferred init
    process. This means that currently, if the pages that
    memblock_free_late() intends to release are in the deferred range, they
    will never be released to the buddy allocator. They will forever be
    reserved.

    In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
    which is also correct for free pages but is not correct for reserved
    pages. KMSAN metadata for reserved pages is initialized by
    kmsan_init_shadow(), which runs shortly before memblock_free_all().

    For both of these reasons, memblock_free_pages() should only be called
    for free pages, and memblock_free_late() should call __free_pages_core()
    directly instead.

    One case where this issue can occur in the wild is EFI boot on
    x86_64. The x86 EFI code reserves all EFI boot services memory ranges
    via memblock_reserve() and frees them later via memblock_free_late()
    (efi_reserve_boot_services() and efi_free_boot_services(),
    respectively). If any of those ranges happens to fall within the
    deferred init range, the pages will not be released and that memory will
    be unavailable.

    For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:

    v6.2-rc2:
      # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
      Node 0, zone      DMA
              spanned  4095
              present  3999
              managed  3840
      Node 0, zone    DMA32
              spanned  246652
              present  245868
              managed  178867

    v6.2-rc2 + patch:
      # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
      Node 0, zone      DMA
              spanned  4095
              present  3999
              managed  3840
      Node 0, zone    DMA32
              spanned  246652
              present  245868
              managed  222816   # +43,949 pages

    Fixes: 3a80a7fa79 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
    Signed-off-by: Aaron Thompson <dev@aaront.org>
    Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1ba989-000000@us-west-2.amazonses.com
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:02 -04:00
Audra Mitchell 990c8d6286 memblock: Fix doc for memblock_phys_free
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit fa81ab49bbe4e1ce756581c970486de0ddb14309
Author: Miaoqian Lin <linmq006@gmail.com>
Date:   Fri Dec 16 14:03:03 2022 +0400

    memblock: Fix doc for memblock_phys_free

    memblock_phys_free() is the counterpart to memblock_phys_alloc().
    Replace memblock_alloc_xx() with memblock_phys_alloc_xx() in its
    documentation to keep things consistent.

    Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
    Link: https://lore.kernel.org/r/20221216100304.688209-1-linmq006@gmail.com
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:02 -04:00
Mark Langsdorf 7232a1f776 x86/numa: Fix the address overlap check in numa_fill_memblks()
JIRA: https://issues.redhat.com/browse/RHEL-26871

commit 9b99c17f7510bed2adbe17751fb8abddba5620bc
Author: Alison Schofield <alison.schofield@intel.com>
Date:   Fri Jan 12 12:09:50 2024 -0800

numa_fill_memblks() fills in the gaps in numa_meminfo memblks over a
physical address range. To do so, it first creates a list of existing
memblks that overlap that address range. The issue is that it is off
by one when comparing to the end of the address range, so memblks
that do not overlap are selected.

The impact of selecting a memblk that does not actually overlap is
that an existing memblk may be filled when the expected action is to
do nothing and return NUMA_NO_MEMBLK to the caller. The caller can
then add a new NUMA node and memblk.

Replace the broken open-coded search for address overlap with the
memblock helper memblock_addrs_overlap(). Update the kernel doc
and in code comments.

Suggested by: "Huang, Ying" <ying.huang@intel.com>

Fixes: 8f012db27c95 ("x86/numa: Introduce numa_fill_memblks()")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/10a3e6109c34c21a8dd4c513cf63df63481a2b07.1705085543.git.alison.schofield@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-04-05 17:00:51 -04:00
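
The replacement helper makes the boundary handling explicit; its definition in
mm/memblock.c is essentially:

    /* true iff [base1, base1 + size1) and [base2, base2 + size2) intersect */
    unsigned long memblock_addrs_overlap(phys_addr_t base1, phys_addr_t size1,
                                         phys_addr_t base2, phys_addr_t size2)
    {
            return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
    }
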
Paolo Bonzini 13262962e2 mm: Add support for unaccepted memory
JIRA: https://issues.redhat.com/browse/RHEL-10059

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time the kernel steps
    onto a new memory block.  The spikes will go away once the workload data
    set size stabilizes or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively.  It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
(cherry picked from commit dcdfdd40fa82b6704d2841938e5c8ec3051eb0d6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[RHEL: upstream has mm/mm_init.c split out of mm/page_alloc.c]
2023-10-30 09:14:17 +01:00
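
A minimal sketch of how a consumer can combine the two EFI helpers named
above (accept_block_if_needed() is a hypothetical wrapper, not part of the
patch):

    /* accept a physical range lazily, only if part of it still needs it */
    static void accept_block_if_needed(phys_addr_t start, phys_addr_t end)
    {
            if (range_contains_unaccepted_memory(start, end))
                    accept_memory(start, end);  /* platform-specific protocol */
    }
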
Paolo Bonzini 0a2ad02005 mm: avoid passing 0 to __ffs()
JIRA: https://issues.redhat.com/browse/RHEL-10059

23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") results in
various boot failures (hang) on arm targets. Debug messages reveal the
reason.

########### MAX_ORDER=10 start=0 __ffs(start)=-1 min()=10 min_t=-1
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If start==0, __ffs(start) returns 0xfffffff or (as int) -1, which min_t()
interprets as such, while min() apparently uses the returned unsigned long
value. Obviously a negative order isn't received well by the rest of the
code.

[akpm@linux-foundation.org: fix comment, per Mike]
  Link: https://lkml.kernel.org/r/ZDBa7HWZK69dKKzH@kernel.org
Link: https://lkml.kernel.org/r/20230406072529.vupqyrzqnhyozeyh@box.shutemov.name
Fixes: 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely")
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Guenter Roeck <linux@roeck-us.net>
  Link: https://lkml.kernel.org/r/9460377a-38aa-4f39-ad57-fb73725f92db@roeck-us.net
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 59f876fb9d68a4d8c20305d7a7a0daf4ee9478a8)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-30 09:12:42 +01:00
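
The fix, in __free_pages_memory() in mm/memblock.c, guards the order
computation so that __ffs() never sees zero; essentially:

    /*
     * __ffs() behaviour is undefined for 0.  start == 0 is
     * MAX_ORDER-aligned, so use MAX_ORDER for that case.
     */
    if (start)
            order = min_t(int, MAX_ORDER, __ffs(start));
    else
            order = MAX_ORDER;
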
Paolo Bonzini 538bf6f332 mm, treewide: redefine MAX_ORDER sanely
JIRA: https://issues.redhat.com/browse/RHEL-10059

MAX_ORDER is currently defined as the number of orders the page allocator
supports: the user can ask the buddy allocator for page orders between 0 and
MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all
over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders the
user can ask from the buddy allocator is now 0..MAX_ORDER.

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 23baf831a32c04f9a968812511540b1b3e648bf5)

[RHEL: Fix conflicts by changing MAX_ORDER - 1 to MAX_ORDER,
       ">= MAX_ORDER" to "> MAX_ORDER", etc.]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-30 09:12:37 +01:00
Chris von Recklinghausen 8ba788a619 mm: add pageblock_align() macro
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 7 14:08:43 2022 +0800

    mm: add pageblock_align() macro

    Add pageblock_align() macro and use it to simplify code.

    Link: https://lkml.kernel.org/r/20220907060844.126891-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:19 -04:00
Chris von Recklinghausen 556f683f8e mm: reuse pageblock_start/end_pfn() macro
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4f9bc69ac5ce34071a9a51343bc81ca76cb2e3f1
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 7 14:08:42 2022 +0800

    mm: reuse pageblock_start/end_pfn() macro

    Move pageblock_start_pfn/pageblock_end_pfn() into pageblock-flags.h, then
    they could be used somewhere else, not only in compaction, also use
    ALIGN_DOWN() instead of round_down() to be pair with ALIGN(), which should
    be same for pageblock usage.

    Link: https://lkml.kernel.org/r/20220907060844.126891-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:18 -04:00
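
Taken together, the two commits above leave include/linux/pageblock-flags.h
with a small family of alignment helpers; their definitions are essentially:

    #define pageblock_align(pfn)      ALIGN((pfn), pageblock_nr_pages)
    #define pageblock_start_pfn(pfn)  ALIGN_DOWN((pfn), pageblock_nr_pages)
    #define pageblock_end_pfn(pfn)    ALIGN((pfn) + 1, pageblock_nr_pages)
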
Chris von Recklinghausen 6423549ae9 memblock,arm64: expand the static memblock memory table
Bugzilla: https://bugzilla.redhat.com/2160210

commit 450d0e74d886c172ac2f72518b797a18ee8d1327
Author: Zhou Guanghui <zhouguanghui1@huawei.com>
Date:   Wed Jun 15 10:27:42 2022 +0000

    memblock,arm64: expand the static memblock memory table

    In a system (Huawei Ascend ARM64 SoC) using HBM, when a multi-bit ECC
    error occurs, the BIOS will mark the corresponding area (for example, 2 MB)
    as unusable.  The next time the system restarts, these areas are either not
    reported, or reported as EFI_UNUSABLE_MEMORY.  Both cases lead to an
    increase in the number of memblocks, with EFI_UNUSABLE_MEMORY leading to
    the larger number.

    For example, if the EFI_UNUSABLE_MEMORY type is reported:
    ...
    memory[0x92]    [0x0000200834a00000-0x0000200835bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
    memory[0x93]    [0x0000200835c00000-0x0000200835dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
    memory[0x94]    [0x0000200835e00000-0x00002008367fffff], 0x0000000000a00000 bytes on node 7 flags: 0x0
    memory[0x95]    [0x0000200836800000-0x00002008369fffff], 0x0000000000200000 bytes on node 7 flags: 0x4
    memory[0x96]    [0x0000200836a00000-0x0000200837bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
    memory[0x97]    [0x0000200837c00000-0x0000200837dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
    memory[0x98]    [0x0000200837e00000-0x000020087fffffff], 0x0000000048200000 bytes on node 7 flags: 0x0
    memory[0x99]    [0x0000200880000000-0x0000200bcfffffff], 0x0000000350000000 bytes on node 6 flags: 0x0
    memory[0x9a]    [0x0000200bd0000000-0x0000200bd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
    memory[0x9b]    [0x0000200bd0200000-0x0000200bd07fffff], 0x0000000000600000 bytes on node 6 flags: 0x0
    memory[0x9c]    [0x0000200bd0800000-0x0000200bd09fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
    memory[0x9d]    [0x0000200bd0a00000-0x0000200fcfffffff], 0x00000003ff600000 bytes on node 6 flags: 0x0
    memory[0x9e]    [0x0000200fd0000000-0x0000200fd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
    memory[0x9f]    [0x0000200fd0200000-0x0000200fffffffff], 0x000000002fe00000 bytes on node 6 flags: 0x0
    ...

    The EFI memory map is parsed to construct the memblock arrays before the
    memblock arrays can be resized.  As the result, memory regions beyond
    INIT_MEMBLOCK_REGIONS are lost.

    Add a new macro INIT_MEMBLOCK_MEMORY_REGIONS to replace
    INIT_MEMBLOCK_REGIONS to define the size of the static memblock.memory
    array.

    Allow overriding memblock.memory array size with architecture defined
    INIT_MEMBLOCK_MEMORY_REGIONS and make arm64 set
    INIT_MEMBLOCK_MEMORY_REGIONS to 1024 when CONFIG_EFI is enabled.

    Link: https://lkml.kernel.org/r/20220615102742.96450-1-zhouguanghui1@huawei.com
    Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Tested-by: Darren Hart <darren@os.amperecomputing.com>
    Acked-by: Will Deacon <will@kernel.org>         [arm64]
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Xu Qiang <xuqiang36@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
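
The override mechanism described above boils down to (sketch):

    /* mm/memblock.c: let the architecture size the static memory array */
    #ifndef INIT_MEMBLOCK_MEMORY_REGIONS
    #define INIT_MEMBLOCK_MEMORY_REGIONS    INIT_MEMBLOCK_REGIONS
    #endif

    static struct memblock_region
            memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock;

with arm64 defining INIT_MEMBLOCK_MEMORY_REGIONS as 1024 when CONFIG_EFI is
enabled.
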
Chris von Recklinghausen ad194ef36c memblock: avoid some repeat when add new range
Bugzilla: https://bugzilla.redhat.com/2160210

commit 28e1a8f4b0ff1eafc320ec733b9c61ee7eb633ea
Author: Jinyu Tang <tjytimi@163.com>
Date:   Wed Jun 15 17:40:15 2022 +0800

    memblock: avoid some repeat when add new range

    The worst case is that the new memory range overlaps all existing
    regions, which requires type->cnt + 1 empty struct memblock_region slots in
    the type->regions array.
    So if type->cnt + 1 + type->cnt is less than type->max, we can insert
    regions directly rather than calculating the needed amount before the
    insertion.
    And because of the merge operation at the end of the function, type->cnt
    will increase slowly in many cases.

    This change avoids unnecessary repeated traversal of memblock ranges in
    many cases when adding a new memory range.

    Signed-off-by: Jinyu Tang <tjytimi@163.com>
    [rppt: massaged comment and changelog text]
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:16 -04:00
Chris von Recklinghausen 3fe654e9dd memblock: Disable mirror feature if kernelcore is not specified
Bugzilla: https://bugzilla.redhat.com/2160210

commit 902c2d91582c7ff0cb5f57ffb3766656f9b910c6
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Tue Jun 14 17:21:56 2022 +0800

    memblock: Disable mirror feature if kernelcore is not specified

    If the system has some mirrored memory and the mirror feature is not
    specified in the boot parameters, the basic mirror feature will be
    enabled, and this will lead to the following situations:

    - memblock memory allocation prefers mirrored region. This may have some
      unexpected influence on numa affinity.

    - contiguous memory will be split into several parts if parts of it
      are mirrored memory, via memblock_mark_mirror().

    To fix this, the variable mirrored_kernelcore will be checked in
    memblock_mark_mirror().  Memory is marked with the MEMBLOCK_MIRROR flag
    iff kernelcore=mirror is added to the kernel parameters.

    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Link: https://lore.kernel.org/r/20220614092156.1972846-6-mawupeng1@huawei.com
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
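
The resulting check is small; the post-patch function is essentially:

    /* mm/memblock.c: only honor mirror attributes when the user asked
     * for kernelcore=mirror */
    int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
    {
            if (!mirrored_kernelcore)
                    return 0;

            system_has_some_mirror = true;

            return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
    }
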
Chris von Recklinghausen 4ac28d31c7 mm: Ratelimited mirrored memory related warning messages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 14d9a675fd0d414b7ca3d47d2ff70fbda4f6cfc2
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Tue Jun 14 17:21:53 2022 +0800

    mm: Ratelimited mirrored memory related warning messages

    If the system has mirrored memory, memblock will try to allocate mirrored
    memory first and fall back to non-mirrored memory when that fails; but
    with limited mirrored memory, or some NUMA nodes without mirrored memory,
    lots of warning messages about memblock allocation will occur.

    This patch ratelimits the warning messages to avoid a very long print
    during bootup.

    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Link: https://lore.kernel.org/r/20220614092156.1972846-3-mawupeng1@huawei.com
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
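
A sketch of the affected fallback path in memblock_alloc_range_nid(), with the
plain pr_warn() swapped for its rate-limited variant:

    if (flags & MEMBLOCK_MIRROR) {
            flags &= ~MEMBLOCK_MIRROR;
            pr_warn_ratelimited("Could not allocate %pap bytes of mirrored memory\n",
                                &size);
            goto again;     /* retry without the mirror restriction */
    }
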
Waiman Long 6e40e36730 mm: kmemleak: remove kmemleak_not_leak_phys() and the min_count argument to kmemleak_alloc_phys()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2151065

commit c200d90049dbe08fa8b016f74b713fddefca0479
Author: Patrick Wang <patrick.wang.shcn@gmail.com>
Date:   Sat, 11 Jun 2022 11:55:48 +0800

    mm: kmemleak: remove kmemleak_not_leak_phys() and the min_count argument to kmemleak_alloc_phys()

    Patch series "mm: kmemleak: store objects allocated with physical address
    separately and check when scan", v4.

    The kmemleak_*_phys() interface uses "min_low_pfn" and "max_low_pfn" to
    check address.  But on some architectures, kmemleak_*_phys() is called
    before those two variables initialized.  The following steps will be
    taken:

    1) Add OBJECT_PHYS flag and rbtree for the objects allocated
       with physical address
    2) Store physical address in objects if allocated with OBJECT_PHYS
    3) Check the boundary when scan instead of in kmemleak_*_phys()

    This patch set will solve:
    https://lore.kernel.org/r/20220527032504.30341-1-yee.lee@mediatek.com
    https://lore.kernel.org/r/9dd08bb5-f39e-53d8-f88d-bec598a08c93@gmail.com

    v3: https://lore.kernel.org/r/20220609124950.1694394-1-patrick.wang.shcn@gmail.com
    v2: https://lore.kernel.org/r/20220603035415.1243913-1-patrick.wang.shcn@gmail.com
    v1: https://lore.kernel.org/r/20220531150823.1004101-1-patrick.wang.shcn@gmail.com

    This patch (of 4):

    Remove the unused kmemleak_not_leak_phys() function.  And remove the
    min_count argument to the kmemleak_alloc_phys() function, assuming it's 0.

    Link: https://lkml.kernel.org/r/20220611035551.1823303-1-patrick.wang.shcn@gmail.com
    Link: https://lkml.kernel.org/r/20220611035551.1823303-2-patrick.wang.shcn@gmail.com
    Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
    Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Yee Lee <yee.lee@mediatek.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-02-06 19:29:16 -05:00
Chris von Recklinghausen 291e420033 memblock: __next_mem_pfn_range_in_zone: remove unneeded local variable nid
Bugzilla: https://bugzilla.redhat.com/2120352

commit f30b002ccfee8c60c8feb590e145c0b5e8fa4c67
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Feb 17 22:07:54 2022 +0800

    memblock: __next_mem_pfn_range_in_zone: remove unneeded local variable nid

    The nid is only used as an output parameter of __next_mem_range().
    Since NULL can be passed to __next_mem_range() as out_nid, we can
    remove nid by simply passing NULL here.

    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    [rppt: updated the commit message]
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:44 -04:00
Chris von Recklinghausen 529f8d4da3 memblock: use kfree() to release kmalloced memblock regions
Bugzilla: https://bugzilla.redhat.com/2120352

commit c94afc46cae7ad41b2ad6a99368147879f4b0e56
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Feb 17 22:53:27 2022 +0800

    memblock: use kfree() to release kmalloced memblock regions

    memblock.{reserved,memory}.regions may be allocated using kmalloc() in
    memblock_double_array(). Use kfree() to release these kmalloced regions
    indicated by memblock_{reserved,memory}_in_slab.

    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Fixes: 3010f87650 ("mm: discard memblock data later")
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:44 -04:00
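
A sketch of the fixed release path in memblock_discard(), where the in-slab
bookkeeping decides which free routine matches the original allocation:

    if (memblock.reserved.regions != memblock_reserved_init_regions) {
            addr = __pa(memblock.reserved.regions);
            size = PAGE_ALIGN(sizeof(struct memblock_region) *
                              memblock.reserved.max);
            if (memblock_reserved_in_slab)
                    kfree(memblock.reserved.regions);   /* kmalloc()ed */
            else
                    memblock_free_late(addr, size);     /* memblock-allocated */
    }
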
Chris von Recklinghausen 880e8c868a memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
Bugzilla: https://bugzilla.redhat.com/2120352

commit f7892d8e288d4b090176f26d9bf7943dbbb639a6
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Nov 5 13:44:53 2021 -0700

    memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

    Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
    indicating that we're dealing with a memory region that is never
    indicated in the firmware-provided memory map, but always detected and
    added by a driver.

    Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
    memory regions like ordinary MEMBLOCK_NONE memory regions -- for
    example, when selecting memory regions to add to the vmcore for dumping
    in the crashkernel via for_each_mem_range().

    However, especially kexec_file is not supposed to select such memblocks
    via for_each_free_mem_range() / for_each_free_mem_range_reverse() to
    place kexec images, similar to how we handle
    IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK.

    We'll make sure that memory hotplug code sets the flag where applicable
    (IORESOURCE_SYSRAM_DRIVER_MANAGED) next.  This prepares architectures
    that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
    support.

    Note that kexec *must not* indicate this memory to the second kernel and
    *must not* place kexec-images on this memory.  Let's add a comment to
    kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
    now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
    locate_mem_hole_callback() for kexec_walk_resources().

    Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
    semantics:
            MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
            firmware-provided memory map and added to the system early during
            boot; kexec *has to* indicate this memory to the second kernel and
            can place kexec-images on this memory. After memory hotunplug,
            kexec has to be re-armed. We mostly ignore this flag when
            "movable_node" is not set on the kernel command line, because
            then we're told to not care about hotunpluggability of such
            memory regions.

            MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
            the firmware-provided memory map; this memory is always detected
            and added to the system by a driver; memory might not actually be
            physically hotunpluggable. kexec *must not* indicate this memory to
            the second kernel and *must not* place kexec-images on this memory.

    Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Huacai Chen <chenhuacai@kernel.org>
    Cc: Jianyong Wu <Jianyong.Wu@arm.com>
    Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Shahab Vahedi <shahab@synopsys.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vineet Gupta <vgupta@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:30 -04:00
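
A sketch of how the iteration paths honor the new flag (shape of the hunk in
mm/memblock.c's should_skip_region()):

    /* skip driver-managed memory unless the caller explicitly asked
     * for it by passing MEMBLOCK_DRIVER_MANAGED in @flags */
    if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
            return true;
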
Chris von Recklinghausen 3cbb272e07 memblock: allow to specify flags with memblock_add_node()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 952eea9b01e4bbb7011329f1b7240844e61e5128
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Nov 5 13:44:49 2021 -0700

    memblock: allow to specify flags with memblock_add_node()

    We want to specify flags when hotplugging memory.  Let's prepare to pass
    flags to memblock_add_node() by adjusting all existing users.

    Note that when hotplugging memory the system is already up and running
    and we might have concurrent memblock users: for example, while we're
    hotplugging memory, kexec_file code might search for suitable memory
    regions to place kexec images.  It's important to add the memory
    directly to memblock via a single call with the right flags, instead of
    adding the memory first and applying the flags later: otherwise, concurrent
    memblock users might temporarily stumble over memblocks with wrong
    flags, which will be important in a follow-up patch that introduces a
    new flag to properly handle add_memory_driver_managed().

    Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Heiko Carstens <hca@linux.ibm.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Shahab Vahedi <shahab@synopsys.com>   [arch/arc]
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Huacai Chen <chenhuacai@kernel.org>
    Cc: Jianyong Wu <Jianyong.Wu@arm.com>
    Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vineet Gupta <vgupta@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:30 -04:00
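
A usage sketch of the adjusted signature, as a memory-hotplug caller would use
it once the follow-up flag exists (error handling elided):

    /* add driver-detected memory in a single call, flags included */
    ret = memblock_add_node(start, size, nid, MEMBLOCK_DRIVER_MANAGED);
    if (ret)
            return ret;
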
Chris von Recklinghausen 9bd2c53612 memblock: stop aliasing __memblock_free_late with memblock_free_late
Bugzilla: https://bugzilla.redhat.com/2120352

commit 621d973901cf9fa6c6e31b31bdd36c5c5f3c9c9e
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Nov 5 13:43:16 2021 -0700

    memblock: stop aliasing __memblock_free_late with memblock_free_late

    memblock_free_late() is a NOP wrapper for __memblock_free_late(), there
    is no point to keep this indirection.

    Drop the wrapper and rename __memblock_free_late() to
    memblock_free_late().

    Link: https://lkml.kernel.org/r/20210930185031.18648-5-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:09 -04:00
Al Stone d1fd9d18f4 memblock: use memblock_free for freeing virtual pointers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071840
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for NXP i.MX8 platforms.  At this stage, this
 has been tested by ensuring we can survive the CI/CD loop -- i.e.,
 that we have not broken anything else, and a simple boot test.  When
 sufficient drivers have been brought in for i.MX8M, we will be able
 to run further tests.

Conflicts:
    init/main.c

    This patch is being applied out of order, but is a simple
    function name replacement, so applied manually.

commit 4421cca0a3e4833b3bf0f20de98eb580ab8c7290
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Nov 5 13:43:22 2021 -0700

    memblock: use memblock_free for freeing virtual pointers

    Rename memblock_free_ptr() to memblock_free() and use memblock_free()
    when freeing a virtual pointer so that memblock_free() will be a
    counterpart of memblock_alloc()

    The callers are updated with the below semantic patch and manual
    addition of (void *) casting to pointers that are represented by
    unsigned long variables.

        @@
        identifier vaddr;
        expression size;
        @@
        (
        - memblock_phys_free(__pa(vaddr), size);
        + memblock_free(vaddr, size);
        |
        - memblock_free_ptr(vaddr, size);
        + memblock_free(vaddr, size);
        )

    [sfr@canb.auug.org.au: fixup]
      Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au

    Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    (cherry picked from commit 4421cca0a3e4833b3bf0f20de98eb580ab8c7290)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-07-01 17:07:00 -06:00
Al Stone 14289d8c8f memblock: rename memblock_free to memblock_phys_free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071840
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for NXP i.MX8 platforms.  At this stage, this
 has been tested by ensuring we can survive the CI/CD loop -- i.e.,
 that we have not broken anything else, and a simple boot test.  When
 sufficient drivers have been brought in for i.MX8M, we will be able
 to run further tests.

Conflicts:
    arch/s390/kernel/setup.c
    arch/s390/kernel/smp.c

    These have been modified in ways that no longer strictly
    match the upstream code, throwing off the auto-merge; this
    is a simple function name replacement, however, so easily
    done manually instead.

commit 3ecc68349bbab6bff1d12cbc7951ca6019b2faf6
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Nov 5 13:43:19 2021 -0700

    memblock: rename memblock_free to memblock_phys_free

    Since memblock_free() operates on a physical range, make its name
    reflect it and rename it to memblock_phys_free(), so it will be a
    logical counterpart to memblock_phys_alloc().

    The callers are updated with the below semantic patch:

        @@
        expression addr;
        expression size;
        @@
        - memblock_free(addr, size);
        + memblock_phys_free(addr, size);
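
    After the rename, the physical-range API is symmetric as well; a
    minimal usage sketch (variable names hypothetical):

        /* memblock_phys_alloc() returns a physical address ... */
        phys_addr_t pa = memblock_phys_alloc(buf_size, SMP_CACHE_BYTES);

        /* ... and memblock_phys_free() releases that physical range */
        if (pa && !buf_needed)
                memblock_phys_free(pa, buf_size);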

    Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    (cherry picked from commit 3ecc68349bbab6bff1d12cbc7951ca6019b2faf6)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-07-01 17:06:59 -06:00
Mark Salter 3dcddb7d46 arm64: Track no early_pgtable_alloc() for kmemleak
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076088

commit c6975d7cab5b903aadbc0f78f9af4fae1bd23a50
Author: Qian Cai <quic_qiancai@quicinc.com>
Date: Fri, 5 Nov 2021 11:05:09 -0400

    arm64: Track no early_pgtable_alloc() for kmemleak

    After switching the page size from 64KB to 4KB on several arm64 servers
    here, kmemleak starts to run out of its early memory pool due to a huge
    number of those early_pgtable_alloc() calls:

      kmemleak_alloc_phys()
      memblock_alloc_range_nid()
      memblock_phys_alloc_range()
      early_pgtable_alloc()
      init_pmd()
      alloc_init_pud()
      __create_pgd_mapping()
      __map_memblock()
      paging_init()
      setup_arch()
      start_kernel()

    Increasing the default value of DEBUG_KMEMLEAK_MEM_POOL_SIZE by 4 times
    won't be enough for a server with 200GB+ memory. There isn't much
    interest in checking memory leaks for those early page tables, and those
    early memory mappings should not reference other memory. Hence, there
    are no kmemleak false positives, and we can safely skip tracking those
    early allocations from kmemleak like we did in the commit fed84c7852
    ("mm/memblock.c: skip kmemleak for kasan_init()") without needing to
    introduce complications to automatically scale the value depending on
    the runtime memory size etc. After the patch, the default value of
    DEBUG_KMEMLEAK_MEM_POOL_SIZE becomes sufficient again.
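
    The fix (sketched below, simplified from the arch/arm64/mm/mmu.c hunk)
    passes a dedicated range-end marker so that memblock skips the kmemleak
    registration for these allocations:

        /* MEMBLOCK_ALLOC_NOLEAKTRACE: do not register with kmemleak */
        phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
                                         MEMBLOCK_ALLOC_NOLEAKTRACE);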

    Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Link: https://lore.kernel.org/r/20211105150509.7826-1-quic_qiancai@quicinc.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Mark Salter <msalter@redhat.com>
2022-04-18 10:05:58 -04:00
Rafael Aquini ad73b2cd11 memblock: exclude MEMBLOCK_NOMAP regions from kmemleak
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 658aafc8139c23a6a23f6f4d9a0c4c95476838d4
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Oct 21 10:09:29 2021 +0300

    memblock: exclude MEMBLOCK_NOMAP regions from kmemleak

    Vladimir Zapolskiy reports:

    Commit a7259df76702 ("memblock: make memblock_find_in_range method
    private") invokes a kernel panic while running kmemleak on OF platforms
    with no-map regions:

      Unable to handle kernel paging request at virtual address fff000021e00000
      [...]
        scan_block+0x64/0x170
        scan_gray_list+0xe8/0x17c
        kmemleak_scan+0x270/0x514
        kmemleak_write+0x34c/0x4ac

    The memory allocated from memblock is registered with kmemleak, but if
    it is marked MEMBLOCK_NOMAP it won't have linear map entries so an
    attempt to scan such areas will fault.

    Ideally, memblock_mark_nomap() would inform kmemleak to ignore
    MEMBLOCK_NOMAP memory, but it can be called before kmemleak interfaces
    operating on physical addresses can use __va() conversion.

    Make sure that functions that mark allocated memory as MEMBLOCK_NOMAP
    take care of informing kmemleak to ignore such memory.
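
    The caller-side pattern then looks roughly like this (illustrative
    sketch, not a verbatim hunk from the patch):

        memblock_reserve(base, size);
        memblock_mark_nomap(base, size);
        /* no linear mapping for this range: keep kmemleak away from it */
        kmemleak_ignore_phys(base);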

    Link: https://lore.kernel.org/all/8ade5174-b143-d621-8c8e-dc6a1898c6fb@linaro.org
    Link: https://lore.kernel.org/all/c30ff0a2-d196-c50d-22f0-bd50696b1205@quicinc.com
    Fixes: a7259df76702 ("memblock: make memblock_find_in_range method private")
    Reported-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Tested-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
    Tested-by: Qian Cai <quic_qiancai@quicinc.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:11 -05:00
Rafael Aquini ff0d6b7cfe Revert "memblock: exclude NOMAP regions from kmemleak"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 6c9a54551977ddf2d6e22c21354b4fb88946f96e
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Oct 21 10:09:28 2021 +0300

    Revert "memblock: exclude NOMAP regions from kmemleak"

    Commit 6e44bd6d34d6 ("memblock: exclude NOMAP regions from kmemleak")
    breaks boot on EFI systems with kmemleak and VM_DEBUG enabled:

      efi: Processing EFI memory map:
      efi:   0x000090000000-0x000091ffffff [Conventional|   |  |  |  |  |  |  |  |  |   |WB|WT|WC|UC]
      efi:   0x000092000000-0x0000928fffff [Runtime Data|RUN|  |  |  |  |  |  |  |  |   |WB|WT|WC|UC]
      ------------[ cut here ]------------
      kernel BUG at mm/kmemleak.c:1140!
      Internal error: Oops - BUG: 0 [#1] SMP
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc6-next-20211019+ #104
      pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : kmemleak_free_part_phys+0x64/0x8c
      lr : kmemleak_free_part_phys+0x38/0x8c
      sp : ffff800011eafbc0
      x29: ffff800011eafbc0 x28: 1fffff7fffb41c0d x27: fffffbfffda0e068
      x26: 0000000092000000 x25: 1ffff000023d5f94 x24: ffff800011ed84d0
      x23: ffff800011ed84c0 x22: ffff800011ed83d8 x21: 0000000000900000
      x20: ffff800011782000 x19: 0000000092000000 x18: ffff800011ee0730
      x17: 0000000000000000 x16: 0000000000000000 x15: 1ffff0000233252c
      x14: ffff800019a905a0 x13: 0000000000000001 x12: ffff7000023d5ed7
      x11: 1ffff000023d5ed6 x10: ffff7000023d5ed6 x9 : dfff800000000000
      x8 : ffff800011eaf6b7 x7 : 0000000000000001 x6 : ffff800011eaf6b0
      x5 : 00008ffffdc2a12a x4 : ffff7000023d5ed7 x3 : 1ffff000023dbf99
      x2 : 1ffff000022f0463 x1 : 0000000000000000 x0 : ffffffffffffffff
      Call trace:
       kmemleak_free_part_phys+0x64/0x8c
       memblock_mark_nomap+0x5c/0x78
       reserve_regions+0x294/0x33c
       efi_init+0x2d0/0x490
       setup_arch+0x80/0x138
       start_kernel+0xa0/0x3ec
       __primary_switched+0xc0/0xc8
      Code: 34000041 97d526e7 f9418e80 36000040 (d4210000)
      random: get_random_bytes called from print_oops_end_marker+0x34/0x80 with crng_init=0
      ---[ end trace 0000000000000000 ]---

    The crash happens because kmemleak_free_part_phys() tries to use __va()
    before memstart_addr is initialized and this triggers a VM_BUG_ON() in
    arch/arm64/include/asm/memory.h:
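
    (the guard, shown here simplified, lives in the PHYS_OFFSET definition:)

        /* memstart_addr is preset to an odd marker value (-1) until
         * arm64_memblock_init() runs, so any early __va()/PHYS_OFFSET
         * use trips the check */
        #define PHYS_OFFSET \
                ({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })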

    Revert 6e44bd6d34d6 ("memblock: exclude NOMAP regions from kmemleak");
    the issue it was fixing will be addressed differently.

    Reported-by: Qian Cai <quic_qiancai@quicinc.com>
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:10 -05:00
Rafael Aquini 870fd9faaa memblock: check memory total_size
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5173ed72bcfcddda21ff274ee31c6472fa150f29
Author: Peng Fan <peng.fan@nxp.com>
Date:   Mon Oct 18 15:15:45 2021 -0700

    memblock: check memory total_size

    mem=[X][G|M] is broken on the ARM64 platform: there are cases where
    type.cnt is 1 and yet total_size is not 0, because regions were merged
    into one.  So checking 'cnt' alone is not enough; total_size should be
    used, otherwise the bootarg 'mem=[X][G|M]' no longer works.
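
    A sketch of the resulting guard, which gates on total_size rather than
    on the region count:

        if (!memblock_memory->total_size) {
                pr_warn("%s: No memory registered yet\n", __func__);
                return;
        }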

    Link: https://lkml.kernel.org/r/20210930024437.32598-1-peng.fan@oss.nxp.com
    Fixes: e888fa7bb882 ("memblock: Check memory add/cap ordering")
    Signed-off-by: Peng Fan <peng.fan@nxp.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:04 -05:00
Rafael Aquini 953b2e148a memblock: exclude NOMAP regions from kmemleak
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 6e44bd6d34d659c44cd8e7fc925c8a97f49b3c33
Author: Mike Rapoport <rppt@kernel.org>
Date:   Wed Oct 13 08:36:59 2021 +0300

    memblock: exclude NOMAP regions from kmemleak

    Vladimir Zapolskiy reports:

    Commit a7259df76702 ("memblock: make memblock_find_in_range method
    private") invokes a kernel panic while running kmemleak on OF platforms
    with no-map regions:

      Unable to handle kernel paging request at virtual address fff000021e00000
      [...]
        scan_block+0x64/0x170
        scan_gray_list+0xe8/0x17c
        kmemleak_scan+0x270/0x514
        kmemleak_write+0x34c/0x4ac

    Indeed, NOMAP regions don't have linear map entries so an attempt to scan
    these areas would fault.

    Prevent such faults by excluding NOMAP regions from kmemleak.

    Link: https://lore.kernel.org/all/8ade5174-b143-d621-8c8e-dc6a1898c6fb@linaro.org
    Fixes: a7259df76702 ("memblock: make memblock_find_in_range method private")
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Tested-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:00 -05:00
Rafael Aquini 26cc6f24ba memblock: introduce saner 'memblock_free_ptr()' interface
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 77e02cf57b6cff9919949defb7fd9b8ac16399a2
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Sep 14 13:23:22 2021 -0700

    memblock: introduce saner 'memblock_free_ptr()' interface

    The boot-time allocation interface for memblock is a mess, with
    'memblock_alloc()' returning a virtual pointer, but then you are
    supposed to free it with 'memblock_free()' that takes a _physical_
    address.

    Not only is that all kinds of strange and illogical, but it actually
    causes bugs, when people then use it like a normal allocation function,
    and it fails spectacularly on a NULL pointer:

       https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/

    or just random memory corruption if the debug checks don't catch it:

       https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/

    I really don't want to apply patches that treat the symptoms, when the
    fundamental cause is this horribly confusing interface.

    I started out looking at just automating a sane replacement sequence,
    but because of this mix of virtual and physical addresses, and because
    people have used the "__pa()" macro that can take either a regular
    kernel pointer, or just the raw "unsigned long" address, it's all quite
    messy.

    So this just introduces a new saner interface for freeing a virtual
    address that was allocated using 'memblock_alloc()', and that was kept
    as a regular kernel pointer.  And then it converts a couple of users
    that are obvious and easy to test, including the 'xbc_nodes' case in
    lib/bootconfig.c that caused problems.
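
    A minimal usage sketch of the new interface (variable names
    hypothetical):

        /* memblock_alloc() hands out a regular kernel pointer ... */
        void *buf = memblock_alloc(buf_size, SMP_CACHE_BYTES);

        /* ... and memblock_free_ptr() takes that same pointer back */
        if (buf && setup_failed)
                memblock_free_ptr(buf, buf_size);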

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:49 -05:00
Rafael Aquini ca20b67b01 memblock: make memblock_find_in_range method private
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit a7259df7670240ee03b0cfce8a3e5d3773911e24
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Sep 2 15:00:26 2021 -0700

    memblock: make memblock_find_in_range method private

    There are a lot of uses of memblock_find_in_range() along with
    memblock_reserve() from the times when memblock allocation APIs did not
    exist.

    memblock_find_in_range() is the very core of memblock allocations, so any
    future changes to its internal behaviour would mandate updates of all the
    users outside memblock.

    Replace the calls to memblock_find_in_range() with equivalent calls to
    memblock_phys_alloc() and memblock_phys_alloc_range(), and make
    memblock_find_in_range() a private method of memblock.

    This simplifies the callers, ensures that (unlikely) errors in
    memblock_reserve() are handled and improves maintainability of
    memblock_find_in_range().
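
    The typical conversion is along these lines (schematic, variable names
    hypothetical):

        /* before: open-coded find + reserve, error handling easy to miss */
        addr = memblock_find_in_range(min_addr, max_addr, size, align);
        if (addr)
                memblock_reserve(addr, size);

        /* after: a single call that both finds and reserves the range */
        addr = memblock_phys_alloc_range(size, align, min_addr, max_addr);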

    Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>          [arm64]
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>        [ACPI]
    Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
    Acked-by: Nick Kossifidis <mick@ics.forth.gr>                   [riscv]
    Tested-by: Guenter Roeck <linux@roeck-us.net>
    Acked-by: Rob Herring <robh@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:24 -05:00
Rafael Aquini 07bffc920b memblock: stop poisoning raw allocations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 08678804e0b305bbbf5b756ad365373e5fe885a2
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Sep 2 14:58:05 2021 -0700

    memblock: stop poisoning raw allocations

    Functions memblock_alloc_exact_nid_raw() and memblock_alloc_try_nid_raw()
    are intended for early memory allocation without the overhead of zeroing
    the allocated memory.  Since these functions were used to allocate the
    memory map, they ended up with the addition of a call to
    page_init_poison() that poisoned the allocated memory when
    CONFIG_PAGE_POISONING was set.

    Since the memory map is allocated using a dedicated memmap_alloc()
    function that takes care of the poisoning, remove page poisoning from
    the memblock_alloc_*_raw() functions.
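
    For reference, the dedicated allocator keeps the poisoning in one place;
    a condensed sketch of memmap_alloc() as introduced earlier in this
    series:

        void *__init memmap_alloc(phys_addr_t size, phys_addr_t align,
                                  phys_addr_t min_addr, int nid, bool exact_nid)
        {
                void *ptr;

                if (exact_nid)
                        ptr = memblock_alloc_exact_nid_raw(size, align, min_addr,
                                        MEMBLOCK_ALLOC_ACCESSIBLE, nid);
                else
                        ptr = memblock_alloc_try_nid_raw(size, align, min_addr,
                                        MEMBLOCK_ALLOC_ACCESSIBLE, nid);

                /* poisoning now happens here, once, for all memmap users */
                if (ptr && size > 0)
                        page_init_poison(ptr, size);

                return ptr;
        }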

    Link: https://lkml.kernel.org/r/20210714123739.16493-5-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Michal Simek <monstr@monstr.eu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:52 -05:00
Rafael Aquini b6289f4f7a memblock: Check memory add/cap ordering
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit e888fa7bb882a1f305526d8f49d7016a7bc5f5ca
Author: Geert Uytterhoeven <geert+renesas@glider.be>
Date:   Wed Aug 11 10:55:18 2021 +0200

    memblock: Check memory add/cap ordering

    For memblock_cap_memory_range() to work properly, it should be called
    after memory is detected and added to memblock with memblock_add() or
    memblock_add_node().  If memblock_cap_memory_range() would be called
    before memory is registered, we may silently corrupt memory later
    because the crash kernel will see all memory as available.

    Print a warning and bail out if ordering is not satisfied.
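
    In other words, the only supported calling order is (schematic,
    variable names hypothetical):

        /* 1) register the detected memory first ... */
        memblock_add(base, size);

        /* 2) ... only then cap it, e.g. for mem= or crash kernels */
        memblock_cap_memory_range(new_base, new_size);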

    Suggested-by: Mike Rapoport <rppt@kernel.org>
    Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Link: https://lore.kernel.org/r/aabc5bad008d49f07d542815c6c8d28ec90bb09e.1628672091.git.geert+renesas@glider.be

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:27 -05:00
Rafael Aquini 597b6ab1c9 memblock: Add missing debug code to memblock_add_node()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 00974b9a83cb233d9c8f9758f541d9aa2a80c5cd
Author: Geert Uytterhoeven <geert+renesas@glider.be>
Date:   Wed Aug 11 10:54:36 2021 +0200

    memblock: Add missing debug code to memblock_add_node()

    All other memblock APIs built on top of memblock_add_range() contain
    debug code to print their parameters.
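
    The added line mirrors what the other APIs already print; roughly:

        int __init_memblock memblock_add_node(phys_addr_t base,
                                              phys_addr_t size, int nid)
        {
                phys_addr_t end = base + size - 1;

                memblock_dbg("%s: [%pa-%pa] nid=%d %pS\n", __func__,
                             &base, &end, nid, (void *)_RET_IP_);

                return memblock_add_range(&memblock.memory, base, size, nid, 0);
        }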

    Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Link: https://lore.kernel.org/r/c45e5218b6fcf0e3aeb63d9a9d9792addae0bb7a.1628672041.git.geert+renesas@glider.be

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:26 -05:00
Mike Rapoport 79e482e9c3 memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
Commit b10d6bca87 ("arch, drivers: replace for_each_membock() with
for_each_mem_range()") didn't take into account that when the
movable_node parameter is present on the kernel command line,
for_each_mem_range() skips ranges marked with MEMBLOCK_HOTPLUG.

The page table setup code in POWER uses for_each_mem_range() to create
the linear mapping of the physical memory, and since the regions marked
as MEMBLOCK_HOTPLUG are skipped, they never make it to the linear map.

A later access to the memory in those ranges will fail:

  BUG: Unable to handle kernel data access on write at 0xc000000400000000
  Faulting instruction address: 0xc00000000008a3c0
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
  NIP:  c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040
  REGS: c000000008a57770 TRAP: 0300   Not tainted  (5.13.0)
  MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 84222202  XER: 20040000
  CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0
  GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000
  GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200
  GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300
  GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000
  GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0
  GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680
  NIP clear_user_page+0x50/0x80
  LR __handle_mm_fault+0xc88/0x1910
  Call Trace:
    __handle_mm_fault+0xc44/0x1910 (unreliable)
    handle_mm_fault+0x130/0x2a0
    __get_user_pages+0x248/0x610
    __get_user_pages_remote+0x12c/0x3e0
    get_arg_page+0x54/0xf0
    copy_string_kernel+0x11c/0x210
    kernel_execve+0x16c/0x220
    call_usermodehelper_exec_async+0x1b0/0x2f0
    ret_from_kernel_thread+0x5c/0x70
  Instruction dump:
  79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050
  7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec
  ---[ end trace 490b8c67e6075e09 ]---

Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
traversal fixes this issue.
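
Concretely, the fix passes MEMBLOCK_HOTPLUG in the flags argument of the
underlying iterator (sketch of the updated macro; the reverse iterator
gets the same treatment):

  #define for_each_mem_range(i, p_start, p_end) \
          __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, \
                               MEMBLOCK_HOTPLUG, p_start, p_end, NULL)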

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org
Fixes: b10d6bca87 ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>	[5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Linus Torvalds a412897fb5 memblock, arm: fix crashes caused by holes in the memory map
Merge tag 'memblock-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:
 "Fix arm crashes caused by holes in the memory map.

  The coordination between freeing of unused memory map, pfn_valid() and
  core mm assumptions about validity of the memory map in various ranges
  was not designed for complex layouts of the physical memory with a lot
  of holes all over the place.

  Kefeng Wang reported crashes in move_freepages() on a system with the
  following memory layout [1]:

	node 0: [mem 0x0000000080a00000-0x00000000855fffff]
	node 0: [mem 0x0000000086a00000-0x0000000087dfffff]
	node 0: [mem 0x000000008bd00000-0x000000008c4fffff]
	node 0: [mem 0x000000008e300000-0x000000008ecfffff]
	node 0: [mem 0x0000000090d00000-0x00000000bfffffff]
	node 0: [mem 0x00000000cc000000-0x00000000dc9fffff]
	node 0: [mem 0x00000000de700000-0x00000000de9fffff]
	node 0: [mem 0x00000000e0800000-0x00000000e0bfffff]
	node 0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
	node 0: [mem 0x00000000fda00000-0x00000000ffffefff]

  These crashes can be mitigated by enabling CONFIG_HOLES_IN_ZONE on ARM
  and essentially turning pfn_valid_within() into pfn_valid() instead of
  having it hardwired to 1 on that architecture, but this would require
  keeping CONFIG_HOLES_IN_ZONE solely for this purpose.

  A cleaner approach is to update ARM's implementation of pfn_valid() to
  take into account the rounding of the freed memory map to pageblock
  boundaries and make sure it returns true for PFNs that have memory map
  entries even if there is no physical memory backing those PFNs"

Link: https://lore.kernel.org/lkml/2a1592ad-bc9d-4664-fd19-f7448a37edc0@huawei.com [1]

* tag 'memblock-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  arm: extend pfn_valid to take into account freed memory map alignment
  memblock: ensure there is no overflow in memblock_overlaps_region()
  memblock: align freed memory map on pageblock boundaries with SPARSEMEM
  memblock: free_unused_memmap: use pageblock units instead of MAX_ORDER
2021-07-04 12:23:05 -07:00
Mike Rapoport 9092d4f7a1 memblock: update initialization of reserved pages
The struct pages representing a reserved memory region are initialized
using the reserve_bootmem_region() function.  This function is called for
each reserved region just before the memory is freed from memblock to the
buddy page allocator.
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization which makes it necessary to
have a special treatment for such pages in pfn_valid() and
pfn_valid_within().

Split out initialization of the reserved pages to a function with a
meaningful name, treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions, and mark their struct pages as PageReserved.
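
The split-out helper ends up looking roughly like this (condensed):

  static void __init memmap_init_reserved_pages(void)
  {
          struct memblock_region *region;
          phys_addr_t start, end;
          u64 i;

          /* initialize struct pages for the reserved regions */
          for_each_reserved_mem_range(i, &start, &end)
                  reserve_bootmem_region(start, end);

          /* and also treat struct pages for the NOMAP regions as
           * PageReserved */
          for_each_mem_region(region) {
                  if (memblock_is_nomap(region)) {
                          start = region->base;
                          end = start + region->size;
                          reserve_bootmem_region(start, end);
                  }
          }
  }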

Link: https://lkml.kernel.org/r/20210511100550.28178-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Mike Rapoport 023accf5cd memblock: ensure there is no overflow in memblock_overlaps_region()
There may be an overflow in memblock_overlaps_region() if it is called with
base and size such that

	base + size > PHYS_ADDR_MAX

Make sure that memblock_overlaps_region() caps the size to prevent such
overflow and remove now duplicated call to memblock_cap_size() from
memblock_is_region_reserved().
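
Roughly, the fix caps the size at the top of memblock_overlaps_region()
using the existing helper:

  /* adjust *size to avoid overflow past PHYS_ADDR_MAX */
  static inline phys_addr_t memblock_cap_size(phys_addr_t base,
                                              phys_addr_t *size)
  {
          return *size = min(*size, PHYS_ADDR_MAX - base);
  }

memblock_overlaps_region() calls memblock_cap_size(base, &size) before
scanning the region array, so memblock_is_region_reserved() no longer
needs its own call.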

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
2021-06-30 11:38:56 +03:00
Mike Rapoport f921f53e08 memblock: align freed memory map on pageblock boundaries with SPARSEMEM
When CONFIG_SPARSEMEM=y, the ranges of the memory map that are freed are
not aligned to the pageblock boundaries, which breaks assumptions about
homogeneity of the memory map throughout core mm code.

Make sure that the freed memory map is always aligned on pageblock
boundaries regardless of the memory model selection.
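
The gist, condensed from free_unused_memmap() (schematic, not the exact
diff):

  /* always round the start of the hole down to a pageblock boundary ... */
  start = round_down(start, pageblock_nr_pages);

  /* ... while still never freeing the memory map of SPARSEMEM sections
   * that are not present */
  start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));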

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
2021-06-30 11:38:51 +03:00
Mike Rapoport e2a86800d5 memblock: free_unused_memmap: use pageblock units instead of MAX_ORDER
The code that frees the unused memory map rounds the start and end of the
freed holes to MAX_ORDER_NR_PAGES to preserve continuity of the memory map
for MAX_ORDER regions.

Lots of core memory management functionality relies on homogeneity of the
memory map within each pageblock, whose size may differ from MAX_ORDER in
certain configurations.

Although currently, for the architectures that use free_unused_memmap(),
pageblock_order and MAX_ORDER are equivalent, it is cleaner to have a
common notation throughout mm code.

Replace MAX_ORDER_NR_PAGES with pageblock_nr_pages and update the comments
to make it clearer why the alignment to pageblock boundaries is required.
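
The substitution itself is mechanical; schematically:

  /* was: ALIGN(end, MAX_ORDER_NR_PAGES) */
  prev_end = ALIGN(end, pageblock_nr_pages);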

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
2021-06-30 11:38:33 +03:00