Commit Graph

66 Commits

Author SHA1 Message Date
Rafael Aquini fcacb3f652 mm/vmemmap: allow architectures to override how vmemmap optimization works
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 40135fc7188cee4ae64777148fd462f4ea523181
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Tue Jul 25 00:37:50 2023 +0530

    mm/vmemmap: allow architectures to override how vmemmap optimization works

    Architectures like powerpc would like to use different page table
    allocators and mapping mechanisms to implement vmemmap optimization.
    Similar to vmemmap_populate, allow architectures to implement
    vmemmap_populate_compound_pages.
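
    A minimal sketch of the override shape, assuming a compile-time opt-out
    (the config symbol below is hypothetical, not from the commit): the
    generic definition compiles only when the architecture does not supply
    its own, mirroring how vmemmap_populate() is already per-arch.

        #ifndef CONFIG_ARCH_HAS_VMEMMAP_POPULATE_COMPOUND   /* assumed name */
        static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                        unsigned long start, unsigned long end,
                        int node, struct dev_pagemap *pgmap)
        {
                /* generic page-table based population with tail-page reuse */
                return 0;   /* body elided in this sketch */
        }
        #endif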

    Link: https://lkml.kernel.org/r/20230724190759.483013-5-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Joao Martins <joao.m.martins@oracle.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:45 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to the
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use the
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since, where pgtables are modified by HW (for access/dirty), they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.
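
    As a stand-alone illustration of the before/after (a user-space mock-up;
    pte_t and ptep_get() here are simplified stand-ins for the kernel
    definitions):

        #include <stdio.h>

        typedef struct { unsigned long val; } pte_t;

        /* default helper: a READ_ONCE-style volatile load of the entry */
        static inline pte_t ptep_get(pte_t *ptep)
        {
                pte_t pte = { *(volatile unsigned long *)&ptep->val };
                return pte;
        }

        int main(void)
        {
                pte_t entry = { 0x42 };
                pte_t *ptep = &entry;

                /* before: pte_t pte = *ptep;  (plain C dereference) */
                pte_t pte = ptep_get(ptep);  /* after: one call, result cached */

                printf("pte: %#lx\n", pte.val);
                return 0;
        }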

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.
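
    A minimal sketch of the shape of that fix (paraphrased, not the exact
    hunk):

        static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
                        unsigned long addr, pte_t *ptep)
        {
        #ifdef CONFIG_MMU
                return ptep_get(ptep);
        #else
                /* MMU=n: the ptep helpers (incl. ptep_get()) are not defined,
                   and no arch can be virtualizing ptes here anyway */
                return *ptep;
        #endif
        }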

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
Audra Mitchell b95517bc8e mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    We do not carry files for loongarch, so dropped those hunks.
    Minor context difference in arch/arm64/mm/mmu.c due to an out-of-order backport:
    c3a1ef97db ("arm64/mm: Drop ARM64_KERNEL_USES_PMD_MAPS")

This patch is a backport of the following upstream commit:
commit 2045a3b8911b6ee64dd9b522d61abc468ecdcdb5
Author: Feiyang Chen <chenfeiyang@loongson.cn>
Date:   Thu Oct 27 20:52:52 2022 +0800

    mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()

    Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can
    share its implementation.

    Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn
    Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
    Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    Acked-by: Will Deacon <will@kernel.org>
    Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Guo Ren <guoren@kernel.org>
    Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
    Cc: Min Zhou <zhoumin@loongson.cn>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Xuefeng Li <lixuefeng@loongson.cn>
    Cc: Xuerui Wang <kernel@xen0n.name>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:01 -04:00
Jeff Moyer 8ad32283af mm/vmemmap/devdax: fix kernel crash when probing devdax devices
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217652

Conflicts: RHEL does not have commit 9420f89db2dd ("mm: move most of
  core MM initialization to mm/mm_init.c"), which moves
  compound_nr_pages() and mm_init_zone_device(), so this patch applies
  the changes to the functions in page_alloc.c.

commit 87a7ae75d7383afa998f57656d1d14e2a730cc47
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Tue Apr 11 19:52:13 2023 +0530

    mm/vmemmap/devdax: fix kernel crash when probing devdax devices
    
    commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for
    compound devmaps") added support for using optimized vmmemap for devdax
    devices.  But how vmemmap mappings are created are architecture specific.
    For example, powerpc with hash translation doesn't have vmemmap mappings
    in init_mm page table instead they are bolted table entries in the
    hardware page table
    
    vmemmap_populate_compound_pages() used by the vmemmap optimization code is
    not aware of these architecture-specific mappings.  Hence allow
    architectures to opt in to this feature.  I selected architectures
    supporting the HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting
    this feature.
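
    A minimal sketch of the opt-in gate, assuming a helper named
    vmemmap_can_optimize() keyed off the HUGETLB_PAGE_OPTIMIZE_VMEMMAP arch
    selection (names are illustrative, not the verbatim hunk):

        /* only deduplicate the devdax vmemmap when the arch opted in */
        static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
                                                struct dev_pagemap *pgmap)
        {
                return IS_ENABLED(CONFIG_ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
                       !altmap && pgmap && pgmap->vmemmap_shift;
        }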
    
    This patch fixes the below crash on ppc64.
    
    BUG: Unable to handle kernel data access on write at 0xc00c000100400038
    Faulting instruction address: 0xc000000001269d90
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in:
    CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
    Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
    NIP:  c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
    REGS: c000000003632c30 TRAP: 0300   Not tainted  (6.3.0-rc5-150500.34-default+)
    MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24842228  XER: 00000000
    CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
    ....
    NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
    LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
    Call Trace:
    [c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
    [c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
    [c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
    [c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
    [c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
    [c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
    [c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
    [c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
    [c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
    [c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
    [c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
    [c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
    [c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
    [c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
    [c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
    [c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
    [c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
    [c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
    [c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
    [c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
    [c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
    [c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
    [c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
    [c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
    [c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
    [c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
    [c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
    [c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
    [c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
    [c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
    [c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
    [c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
    
    Link: https://lkml.kernel.org/r/20230411142214.64464-1-aneesh.kumar@linux.ibm.com
    Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reported-by: Tarun Sahu <tsahu@linux.ibm.com>
    Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-07-20 13:18:35 -04:00
Chris von Recklinghausen 031f72cc83 mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
Bugzilla: https://bugzilla.redhat.com/2160210

commit 998a2997885f73e5cc732ac6d661dfa6e0f50654
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Jun 28 17:22:31 2022 +0800

    mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c

    When I first introduced vmemmap manipulation functions related to HugeTLB,
    I thought those functions might be reused by other modules (e.g. using a
    similar approach to optimize vmemmap pages; unfortunately, DAX used the
    same approach but does not use those functions).  After two years, we
    didn't see any other users.  So move those functions to hugetlb_vmemmap.c.
    This is code movement without any functional change.

    Link: https://lkml.kernel.org/r/20220628092235.91270-5-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Will Deacon <will@kernel.org>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:30 -04:00
Chris von Recklinghausen 8c3672300b mm: sparsemem: drop unexpected word 'a' in comments
Bugzilla: https://bugzilla.redhat.com/2160210

commit f673bd7c2654a0e2a1ec59417dcf9b7ceae9c14c
Author: XueBing Chen <chenxuebing@jari.cn>
Date:   Sat Jun 25 16:51:35 2022 +0800

    mm: sparsemem: drop unexpected word 'a' in comments

    There is an unexpected word 'a' in the comments that needs to be dropped.

    Link: https://lkml.kernel.org/r/24fbdae3.c86.1819a0f31b9.Coremail.chenxuebing@jari.cn
    Signed-off-by: XueBing Chen <chenxuebing@jari.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:21 -04:00
Chris von Recklinghausen 267a7a9b62 docs: rename Documentation/vm to Documentation/mm
Conflicts: drop changes to arch/loongarch/Kconfig - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit ee65728e103bb7dd99d8604bf6c7aa89c7d7e446
Author: Mike Rapoport <rppt@kernel.org>
Date:   Mon Jun 27 09:00:26 2022 +0300

    docs: rename Documentation/vm to Documentation/mm

    so it will be consistent with the mm code directory and with
    Documentation/admin-guide/mm, and won't be confused with virtual machines.

    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Tested-by: Ira Weiny <ira.weiny@intel.com>
    Acked-by: Jonathan Corbet <corbet@lwn.net>
    Acked-by: Wu XiangCheng <bobwxc@email.cn>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen c781b7f0ca mm/sparse-vmemmap.c: remove unwanted initialization in vmemmap_populate_compound_pages()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 55896f935a60b919ce699d11754061f6df936a7d
Author: Gautam Menghani <gautammenghani201@gmail.com>
Date:   Sun Jun 12 11:23:20 2022 -0700

    mm/sparse-vmemmap.c: remove unwanted initialization in vmemmap_populate_compound_pages()

    Remove unnecessary initialization for the variable 'next'.  This fixes
    the clang scan warning: Value stored to 'next' during its
    initialization is never read [deadcode.DeadStores]

    Link: https://lkml.kernel.org/r/20220612182320.160651-1-gautammenghani201@gmail.com
    Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen f3b3f27278 mm: use PAGE_ALIGNED instead of IS_ALIGNED
Bugzilla: https://bugzilla.redhat.com/2160210

commit 0b82ade6c042907e4f24bbe826958a896d24700d
Author: Fanjun Kong <bh1scw@gmail.com>
Date:   Thu May 26 22:02:57 2022 +0800

    mm: use PAGE_ALIGNED instead of IS_ALIGNED

    <linux/mm.h> already provides the PAGE_ALIGNED macro.  Let's use this
    macro instead of IS_ALIGNED and passing PAGE_SIZE directly.

    Link: https://lkml.kernel.org/r/20220526140257.1568744-1-bh1scw@gmail.com
    Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
Chris von Recklinghausen d3a08a633d mm: Limit warning message in vmemmap_verify() to once
Bugzilla: https://bugzilla.redhat.com/2160210

commit abd62377c0064302df680ab33f4c05290ba24af8
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Tue Jun 14 17:21:54 2022 +0800

    mm: Limit warning message in vmemmap_verify() to once

    For a system that has only limited mirrored memory, or some NUMA node
    without mirrored memory, the per-node vmemmap page_structs prefer to
    allocate memory from the mirrored region, which leads vmemmap_verify() in
    vmemmap_populate_basepages() to report lots of warning messages.

    This patch changes the frequency of the "potential offnode page_structs"
    warning messages to only once, to avoid a very long print during bootup.
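
    The mechanism is presumably the kernel's print-once idiom; a sketch of the
    check with the limit applied, paraphrasing vmemmap_verify() (assuming the
    pr_warn_once() form):

        void __meminit vmemmap_verify(pte_t *pte, int node,
                                      unsigned long start, unsigned long end)
        {
                unsigned long pfn = pte_pfn(*pte);
                int actual_node = early_pfn_to_nid(pfn);

                if (node_distance(actual_node, node) > LOCAL_DISTANCE)
                        /* once per boot, not once per page: mirrored-memory
                           systems can otherwise hit this thousands of times */
                        pr_warn_once("[%lx-%lx] potential offnode page_structs\n",
                                     start, end - 1);
        }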

    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/r/20220614092156.1972846-4-mawupeng1@huawei.com
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
Chris von Recklinghausen c0b78afd77 mm/sparse-vmemmap: improve memory savings for compound devmaps
Bugzilla: https://bugzilla.redhat.com/2160210

commit 4917f55b4ef963e2d2288fe4eb651728be8db406
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Thu Apr 28 23:16:16 2022 -0700

    mm/sparse-vmemmap: improve memory savings for compound devmaps

    A compound devmap is a dev_pagemap with @vmemmap_shift > 0, meaning that
    pages are mapped at a given huge page alignment and use compound pages as
    opposed to order-0 pages.

    Take advantage of the fact that most tail pages look the same (except the
    first two) to minimize struct page overhead.  Allocate a separate page for
    the vmemmap area which contains the head page, and a separate one for the
    next 64 pages.  The rest of the subsections then reuse this tail vmemmap
    page to initialize the rest of the tail pages.

    Sections are arch-dependent (e.g.  on x86 it's 64M, 128M or 512M) and when
    initializing compound devmap with big enough @vmemmap_shift (e.g.  1G PUD)
    it may cross multiple sections.  The vmemmap code needs to consult @pgmap
    so that multiple sections that all map the same tail data can refer back
    to the first copy of that data for a given gigantic page.

    On compound devmaps with 2M alignment, this mechanism lets 6 pages be
    saved out of the 8 PFNs necessary to map the subsection's 512 struct
    pages.  On a 1G compound devmap it saves 4094 pages.
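
    The arithmetic behind those numbers, as a self-contained check (assuming
    4K base pages and a 64-byte struct page, the values these savings are
    based on):

        #include <stdio.h>

        int main(void)
        {
                const unsigned long base = 4096, sp = 64; /* page, struct page */
                const unsigned long sizes[] = { 2UL << 20, 1UL << 30 };

                for (int i = 0; i < 2; i++) {
                        unsigned long structs = sizes[i] / base;
                        unsigned long vmemmap = structs * sp / base;
                        /* the head page and one reusable tail page survive */
                        printf("%4luM: %lu vmemmap pages, %lu saved\n",
                               sizes[i] >> 20, vmemmap, vmemmap - 2);
                }
                return 0;   /* prints: 8 pages/6 saved, 4096 pages/4094 saved */
        }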

    Altmap isn't supported yet, given various restrictions in the altmap pfn
    allocator; thus fall back to the already-in-use vmemmap_populate().  It is
    worth noting that altmap for devmap mappings was there to relieve the
    pressure of inordinate amounts of memmap space to map terabytes of pmem.
    With compound pages the motivation for altmaps for pmem gets reduced.

    Link: https://lkml.kernel.org/r/20220420155310.9712-5-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen fd3f21f44b mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
Bugzilla: https://bugzilla.redhat.com/2160210

commit 2beea70a3edc03608ecc89b13ba9ba669c56b3fd
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Thu Apr 28 23:16:15 2022 -0700

    mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper

    In preparation for describing a memmap with compound pages, move the
    actual pte population logic into a separate function
    vmemmap_populate_address() and have a new helper vmemmap_populate_range()
    walk through all base pages it needs to populate.

    While doing that, change the helper to use a pte_t* as its return value,
    rather than a hardcoded errno of 0 or -ENOMEM.
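
    A sketch of the resulting shape (simplified; exact signatures in the
    patch may differ):

        static pte_t * __meminit vmemmap_populate_address(unsigned long addr,
                        int node, struct vmem_altmap *altmap);

        static int __meminit vmemmap_populate_range(unsigned long start,
                        unsigned long end, int node, struct vmem_altmap *altmap)
        {
                unsigned long addr;
                pte_t *pte;

                /* walk base pages; NULL from the helper replaces -ENOMEM */
                for (addr = start; addr < end; addr += PAGE_SIZE) {
                        pte = vmemmap_populate_address(addr, node, altmap);
                        if (!pte)
                                return -ENOMEM;
                }
                return 0;
        }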

    Link: https://lkml.kernel.org/r/20220420155310.9712-3-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 6425f9d3c8 mm/sparse-vmemmap: add a pgmap argument to section activation
Bugzilla: https://bugzilla.redhat.com/2160210

commit e3246d8f52173a798710314a42fea83223036fc8
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Thu Apr 28 23:16:15 2022 -0700

    mm/sparse-vmemmap: add a pgmap argument to section activation

    Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9.

    This series minimizes 'struct page' overhead by pursuing a similar
    approach to Muchun Song's series "Free some vmemmap pages of hugetlb page"
    (merged since v5.14), but applied to devmaps with @vmemmap_shift
    (device-dax).

    The original vmemmap deduplication idea (already used in HugeTLB) is to
    reuse/deduplicate tail page vmemmap areas, in particular the area which
    only describes tail pages.  So a vmemmap page describes 64 struct pages, and
    the first page for a given ZONE_DEVICE vmemmap would contain the head page
    and 63 tail pages.  The second vmemmap page would contain only tail pages,
    and that's what gets reused across the rest of the subsection/section.
    The bigger the page size, the bigger the savings (2M hpage -> save 6
    vmemmap pages; 1G hpage -> save 4094 vmemmap pages).

    This is done for PMEM /specifically only/ on device-dax configured
    namespaces, not fsdax.  In other words, a devmap with a @vmemmap_shift.

    In terms of savings, per 1Tb of memory, the struct page cost would go down
    with compound devmap:

    * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of
      total memory)

    * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of
      total memory)

    The series is mostly summed up by patch 4, and to summarize what the
    series does:

    Patches 1 - 3: Minor cleanups in preparation for patch 4.  Move the very
    nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry.

    Patch 4: Patch 4 is the one that takes care of the struct page savings
    (also referred to here as tail-page/vmemmap deduplication).  Much like
    Muchun series, we reuse the second PTE tail page vmemmap areas across a
    given @vmemmap_shift.  One important difference, though, is that contrary
    to the hugetlbfs series, there's no vmemmap for the area yet, because we
    are late-populating it as opposed to remapping a system-RAM range.  IOW,
    no freeing of pages of already-initialized vmemmap as in the hugetlbfs
    case, which greatly simplifies the logic (besides not being
    arch-specific).  The altmap case is unchanged and still goes via
    vmemmap_populate().  Also adjust the newly added docs to the device-dax
    case.

    [Note that device-dax is still a little behind HugeTLB in terms of
    savings.  I have an additional simple patch that reuses the head vmemmap
    page too, as a follow-up.  That will double the savings and speed up
    namespace initialization.]

    Patch 5: Initialize fewer struct pages, depending on the page size, with
    DRAM-backed struct pages -- because fewer pages are unique and most are
    tail pages (with bigger vmemmap_shift).

        NVDIMM namespace bootstrap improves from ~268-358 ms to
        ~80-110ms/<1ms on 128G NVDIMMs with 2M and 1G respectively.  And the
        needed struct page capacity will be 3.8x / 1071x smaller for 2M and
        1G respectively.  Tested on x86 with 1.5Tb of pmem (including pinning,
        and RDMA registration/deregistration scalability with 2M MRs).

    This patch (of 5):

    In support of using compound pages for devmap mappings, plumb the pgmap
    down to the vmemmap_populate implementation.  Note that while altmap is
    retrievable from pgmap, the memory hotplug code passes altmap without
    pgmap[*], so both need to be independently plumbed.

    So in addition to @altmap, pass @pgmap to sparse section populate
    functions namely:

            sparse_add_section
              section_activate
                populate_section_memmap
                  __populate_section_memmap

    Passing @pgmap allows __populate_section_memmap() to fetch the
    vmemmap_shift with which memmap metadata is created, and also lets
    sparse-vmemmap fetch pgmap ranges to correlate to a given section and pick
    whether to just reuse tail pages from past onlined sections.

    While at it, fix the kdoc for @altmap for sparse_add_section().

    [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

    Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 96e53616ee mm: sparsemem: fix missing higher order allocation splitting
Bugzilla: https://bugzilla.redhat.com/2120352

commit 39d35edee4537487e5178f258e23518272a66413
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Mon Jun 20 10:30:19 2022 +0800

    mm: sparsemem: fix missing higher order allocation splitting

    Higher order allocations for vmemmap pages from the buddy allocator must
    be able to be treated as independent small pages, as they can be freed
    individually by the caller.  There is no problem for higher order vmemmap
    pages allocated at boot time, since each individual small page will be
    initialized at boot time.  However, it will be an issue for the memory
    hotplug case, since those higher order vmemmap pages are allocated from
    the buddy allocator without initializing each individual small page's
    refcount.  The
    system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled
    if the vmemmap page is freed.

    Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com
    Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:22 -04:00
Chris von Recklinghausen e1c6813607 mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
Bugzilla: https://bugzilla.redhat.com/2120352

commit 47010c040dec8af6347ec6259104fc13f7e7e30a
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Thu Apr 28 23:16:15 2022 -0700

    mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*

    The word of "free" is not expressive enough to express the feature of
    optimizing vmemmap pages associated with each HugeTLB, rename this keywork
    to "optimize".  In this patch , cheanup configs to make code more
    expressive.

    Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:18 -04:00
Chris von Recklinghausen 1ca2a4d7a5 mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
Bugzilla: https://bugzilla.redhat.com/2120352

commit e54084173487804f5e2f23facf107fd9336e637e
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:45:12 2022 -0700

    mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP

    vmemmap_remap_free/alloc are relevant to HugeTLB, so move those
    functions into the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.

    Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
    Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
    Cc: Chen Huang <chenhuang5@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:17 -04:00
Chris von Recklinghausen 9d42769269 mm: sparsemem: use page table lock to protect kernel pmd operations
Bugzilla: https://bugzilla.redhat.com/2120352

commit d8d55f5616cf3b900a23a72dd24e7b07211e7859
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:45:06 2022 -0700

    mm: sparsemem: use page table lock to protect kernel pmd operations

    The init_mm.page_table_lock is used to protect kernel page tables, so we
    can use it to serialize splitting vmemmap PMD mappings instead of the mmap
    write lock, which can increase the concurrency of vmemmap_remap_free().

    Actually, it increases the concurrency between allocations of HugeTLB
    pages.  But that is not the only benefit.  There are a lot of users of
    the mmap read lock of init_mm.  The mmap write lock was held throughout
    vmemmap_remap_free(); removing the mmap write lock usage means it no
    longer affects other users of the mmap read lock.  It makes nothing worse
    and is always a win.

    Now the kernel page table walker does not hold the page_table_lock when
    walking pmd entries.  There may be a consistency issue with a pmd entry,
    because the pmd entry might change from a huge pmd entry to a PTE page
    table.  There is only one user of the kernel page table walker, namely
    ptdump.  ptdump already considers the consistency, using a local
    variable to cache the value of the pmd entry.  But we also need to update
    ->action to ACTION_CONTINUE to make sure the walker does not walk every
    pte entry again when a concurrent thread has split the huge pmd.
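
    The walker-side pattern being relied on, roughly (paraphrased from the
    ptdump pmd callback, not the literal hunk):

        static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
                                    unsigned long next, struct mm_walk *walk)
        {
                struct ptdump_state *st = walk->private;
                pmd_t val = READ_ONCE(*pmd);  /* snapshot: may be split under us */

                if (pmd_leaf(val)) {
                        st->note_page(st, addr, 4, pmd_val(val));
                        /* skip the pte level: if a concurrent thread splits
                           the huge pmd, don't re-walk every new pte */
                        walk->action = ACTION_CONTINUE;
                }
                return 0;
        }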

    Link: https://lkml.kernel.org/r/20211101031651.75851-4-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Barry Song <song.bao.hua@hisilicon.com>
    Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
    Cc: Chen Huang <chenhuang5@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:16 -04:00
Chris von Recklinghausen f8e0384b9f mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page
Bugzilla: https://bugzilla.redhat.com/2120352

commit e7d324850bfcb30df563d144c0363cc44595277d
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:45:00 2022 -0700

    mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page

    Patch series "Free the 2nd vmemmap page associated with each HugeTLB
    page", v7.

    This series can minimize the overhead of struct page for 2MB HugeTLB
    pages significantly.  It further reduces the overhead of struct page by
    12.5% for a 2MB HugeTLB compared to the previous approach, which means
    2GB per 1TB HugeTLB.  It is a nice gain.  Comments and reviews are
    welcome.  Thanks.

    For the main implementation and details, refer to the commit log of patch
    1.  In this series, I have changed the following four helpers; the
    following table shows the impact on the overhead of those helpers.

            +------------------+-----------------------+
            |       APIs       | head page | tail page |
            +------------------+-----------+-----------+
            |    PageHead()    |     Y     |     N     |
            +------------------+-----------+-----------+
            |    PageTail()    |     Y     |     N     |
            +------------------+-----------+-----------+
            |  PageCompound()  |     N     |     N     |
            +------------------+-----------+-----------+
            |  compound_head() |     Y     |     N     |
            +------------------+-----------+-----------+

            Y: Overhead is increased.
            N: Overhead is _NOT_ increased.

    It shows that the overhead of those helpers on a tail page doesn't change
    between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off".  But the
    overhead on a head page will be increased when "hugetlb_free_vmemmap=on"
    (except PageCompound()).  So I believe that Matthew Wilcox's folio series
    will help with this.

    The users of PageHead() and PageTail() are far fewer than those of
    compound_head(), and most users of PageTail() are VM_BUG_ON(), so I have
    done some tests about the overhead of compound_head() on head pages.

    I have tested the overhead of calling compound_head() on a head page,
    which is 2.11ns (measured by timing 10 million calls to compound_head()
    and averaging).

    For a head page whose address is not aligned with PAGE_SIZE or a
    non-compound page, the overhead of compound_head() is 2.54ns which is
    increased by 20%.  For a head page whose address is aligned with
    PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by
    40%.  Most pages are the former.  I do not think the overhead is
    significant since the overhead of compound_head() itself is low.

    This patch (of 5):

    This patch minimizes the overhead of struct page for 2MB HugeTLB pages
    significantly.  It further reduces the overhead of struct page by 12.5%
    for a 2MB HugeTLB compared to the previous approach, which means 2GB per
    1TB HugeTLB (2MB type).

    After the feature of "Free sonme vmemmap pages of HugeTLB page" is
    enabled, the mapping of the vmemmap addresses associated with a 2MB
    HugeTLB page becomes the figure below.

         HugeTLB                    struct pages(8 pages)         page frame(8 pages)
     +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
     |           |                     |     0     | -------------> |     0     |
     |           |                     +-----------+                +-----------+
     |           |                     |     1     | -------------> |     1     |
     |           |                     +-----------+                +-----------+
     |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
     |           |                     +-----------+                   | | | | |
     |           |                     |     3     | ------------------+ | | | |
     |           |                     +-----------+                     | | | |
     |           |                     |     4     | --------------------+ | | |
     |    2MB    |                     +-----------+                       | | |
     |           |                     |     5     | ----------------------+ | |
     |           |                     +-----------+                         | |
     |           |                     |     6     | ------------------------+ |
     |           |                     +-----------+                           |
     |           |                     |     7     | --------------------------+
     |           |                     +-----------+
     |           |
     |           |
     |           |
     +-----------+

    As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
    remapped.  However, the 2nd vmemmap page frame can also be freed to
    the buddy allocator; then we can change the mapping from the figure
    above to the figure below.

        HugeTLB                    struct pages(8 pages)         page frame(8 pages)
     +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
     |           |                     |     0     | -------------> |     0     |
     |           |                     +-----------+                +-----------+
     |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
     |           |                     +-----------+                  | | | | | |
     |           |                     |     2     | -----------------+ | | | | |
     |           |                     +-----------+                    | | | | |
     |           |                     |     3     | -------------------+ | | | |
     |           |                     +-----------+                      | | | |
     |           |                     |     4     | ---------------------+ | | |
     |    2MB    |                     +-----------+                        | | |
     |           |                     |     5     | -----------------------+ | |
     |           |                     +-----------+                          | |
     |           |                     |     6     | -------------------------+ |
     |           |                     +-----------+                            |
     |           |                     |     7     | ---------------------------+
     |           |                     +-----------+
     |           |
     |           |
     |           |
     +-----------+

    After we do this, all tail vmemmap pages (1-7) are mapped to the head
    vmemmap page frame (0).  In other words, there is more than one page
    struct with PG_head associated with each HugeTLB page.  We __know__ that
    there is only one real head page struct; the tail page structs with
    PG_head are fake head page structs.  We need an approach to distinguish
    between those two different types of page structs so that compound_head(),
    PageHead() and PageTail() can work properly if the parameter is a tail
    page struct but with PG_head.

    The following code snippet describes how to distinguish between real and
    fake head page struct.

            if (test_bit(PG_head, &page->flags)) {
                    unsigned long head = READ_ONCE(page[1].compound_head);

                    if (head & 1) {
                            if (head == (unsigned long)page + 1)
                                    ==> head page struct
                            else
                                    ==> tail page struct
                    } else
                            ==> head page struct
            }

    We can safely access the fields of @page[1] when @page has PG_head set,
    because @page is a compound page composed of at least two contiguous pages.

    [songmuchun@bytedance.com: restore lost comment changes]

    Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Chen Huang <chenhuang5@huawei.com>
    Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:16 -04:00
Chris von Recklinghausen baf876910d mm: remove redundant smp_wmb()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ed33b5a677da33d6e8f959879bb61e9791b80354
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Nov 5 13:38:41 2021 -0700

    mm: remove redundant smp_wmb()

    The smp_wmb() which is in the __pte_alloc() is used to ensure all ptes
    setup is visible before the pte is made visible to other CPUs by being
    put into page tables.  We only need this when the pte is actually
    populated, so move it to pmd_install().  __pte_alloc_kernel(),
    __p4d_alloc(), __pud_alloc() and __pmd_alloc() are similar to this case.

    We can also defer smp_wmb() to the place where the pmd entry is really
    populated by preallocated pte.  There are two kinds of user of
    preallocated pte, one is filemap & finish_fault(), another is THP.  The
    former does not need another smp_wmb() because the smp_wmb() has been
    done by pmd_install().  Fortunately, the latter also does not need
    another smp_wmb() because there is already a smp_wmb() before populating
    the new pte when the THP uses a preallocated pte to split a huge pmd.
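
    The destination of the deferred barrier, roughly (a paraphrase of
    pmd_install() as described above, details simplified):

        void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
        {
                spinlock_t *ptl = pmd_lock(mm, pmd);

                if (likely(pmd_none(*pmd))) {
                        mm_inc_nr_ptes(mm);
                        /* all pte setup must be visible before the pmd entry
                           that publishes the table to other CPUs */
                        smp_wmb();
                        pmd_populate(mm, pmd, *pte);
                        *pte = NULL;
                }
                spin_unlock(ptl);
        }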

    Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mika Penttila <mika.penttila@nextfour.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:08 -04:00
Muchun Song 3bc2b6a725 mm: sparsemem: split the huge PMD mapping of vmemmap pages
Patch series "Split huge PMD mapping of vmemmap pages", v4.

In order to reduce the difficulty of code review in series [1], we disabled
huge PMD mapping of vmemmap pages when that feature was enabled.  In this
series, we do not disable huge PMD mapping of vmemmap pages anymore; we
will split the huge PMD mapping when needed.  When HugeTLB pages are freed
from the pool, we do not attempt to coalesce and move back to a PMD mapping
because it is much more complex.

[1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/

This patch (of 3):

In [1], PMD mappings of vmemmap pages were disabled if the feature
hugetlb_free_vmemmap was enabled.  This was done to simplify the initial
implementation of vmemmap freeing for hugetlb pages.  Now, remove this
simplification by allowing PMD mapping and switching to PTE mappings as
needed for allocated hugetlb pages.

When a hugetlb page is allocated, the vmemmap page tables are walked to
free vmemmap pages.  During this walk, split huge PMD mappings to PTE
mappings as required.  In the unlikely case PTE pages cannot be
allocated, return an error (ENOMEM) and do not optimize the vmemmap of the
hugetlb page.

When HugeTLB pages are freed from the pool, we do not attempt to
coalesce and move back to a PMD mapping because it is much more complex.
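
A sketch of the split step described above (paraphrased; locking and some
details elided):

    static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
    {
            unsigned long addr = start;
            struct page *page = pmd_page(*pmd);
            pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
            pmd_t __pmd;
            int i;

            if (!pgtable)
                    return -ENOMEM;         /* caller skips the optimization */

            pmd_populate_kernel(&init_mm, &__pmd, pgtable);

            /* mirror the huge mapping with base-page ptes */
            for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
                    pte_t entry = mk_pte(page + i, PAGE_KERNEL);
                    pte_t *pte = pte_offset_kernel(&__pmd, addr);

                    set_pte_at(&init_mm, addr, pte, entry);
            }

            /* make the ptes visible before publishing the new pmd */
            smp_wmb();
            pmd_populate_kernel(&init_mm, pmd, pgtable);

            return 0;
    }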

[1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com

Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Muchun Song ad2fa3717b mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:

 1) Failing to free a huge page triggered by the user (decrease nr_pages).

    User needs to try again later.

 2) Failing to free a surplus huge page when freed by the application.

    Try again later when freeing a huge page next time.

 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().

    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.

 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.

    Similar to 3), virtio-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.

Vmemmap pages are allocated from the page freeing context.  In order for
those allocations not to be disruptive (e.g. trigger the OOM killer),
__GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
a non-sleeping allocation would be too fragile and could fail too
easily under memory pressure.  GFP_ATOMIC or other modes to access memory
reserves are not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
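
A sketch of the allocation loop under that policy (paraphrased; the real
function also releases the partial list on failure):

    static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
                                       int nid, struct list_head *list)
    {
            /* __GFP_NORETRY: fail early instead of invoking the OOM killer */
            gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE;
            unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
            struct page *page;

            while (nr_pages--) {
                    page = alloc_pages_node(nid, gfp_mask, 0);
                    if (!page)
                            return -ENOMEM; /* caller refuses to free the page */
                    list_add_tail(&page->lru, list);
            }
            return 0;
    }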

[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:25 -07:00
Muchun Song f41f2ed43c mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
Every HugeTLB has more than one struct page structure.  We __know__ that
we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
store metadata associated with each HugeTLB.

There are a lot of struct page structures associated with each HugeTLB
page.  For tail pages, the value of compound_head is the same.  So we can
reuse the first page of the tail page structures.  We map the virtual
addresses of the remaining pages of tail page structures to the first tail
page struct, and then free those page frames.  Therefore, we only need to
reserve two pages as vmemmap areas.

When we allocate a HugeTLB page from the buddy, we can free some vmemmap
pages associated with each HugeTLB page.  It is more appropriate to do it
in the prep_new_huge_page().

The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
associated with a HugeTLB page can be freed, returns zero for now, which
means the feature is disabled.  We will enable it once all the
infrastructure is there.

[willy@infradead.org: fix documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:25 -07:00
Wei Yang 6cda72047e mm/sparse: only sub-section aligned range would be populated
There are two code paths which invoke __populate_section_memmap():

  * sparse_init_nid()
  * sparse_add_section()

In both cases, we are sure the memory range is sub-section aligned:

  * we pass PAGES_PER_SECTION to sparse_init_nid()
  * we check range by check_pfn_span() before calling
    sparse_add_section()

Also, in the counterpart of __populate_section_memmap(), we don't do such
a calculation and check, since the range is already checked by
check_pfn_span() in __remove_pages().

Remove the calculation and check to keep it simple and consistent with
its counterpart.
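
Purely as an illustration (not code from this patch), the alignment
property being relied on can be written as:

/* hypothetical check: the pfn range is sub-section aligned */
if (!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
    !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION))
        return -EINVAL;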

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200703031828.14645-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:27 -07:00
Anshuman Khandual 56993b4e14 mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf()
There are many instances where vmemmap allocation is switched between
regular memory and device memory based solely on whether an altmap is
available.  vmemmap_alloc_block_buf() is used on various platforms to
allocate vmemmap mappings.  Let's also enable it to handle altmap based
device memory allocation along with the existing regular memory
allocations.  This will help avoid the altmap based allocation switch in
many places.  To summarize, there are two different ways to call
vmemmap_alloc_block_buf():

vmemmap_alloc_block_buf(size, node, NULL)   /* Allocate from system RAM */
vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */

This converts altmap_alloc_block_buf() into a static function, drops its
entry from the header, and updates Documentation/vm/memory-model.rst.
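
After this change, the dispatch inside vmemmap_alloc_block_buf() is
roughly of this shape (a sketch of the resulting logic, not a verbatim
hunk):

void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
                                         struct vmem_altmap *altmap)
{
        void *ptr;

        if (altmap)
                return altmap_alloc_block_buf(size, altmap); /* device memory */

        ptr = sparse_buffer_alloc(size);        /* system RAM */
        if (!ptr)
                ptr = vmemmap_alloc_block(size, node);
        return ptr;
}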

Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jia He <justin.he@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yu Zhao <yuzhao@google.com>
Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:27 -07:00
Anshuman Khandual 1d9cfee753 mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages()
Patch series "arm64: Enable vmemmap mapping from device memory", v4.

This series enables vmemmap backing memory allocation from device memory
ranges on arm64.  But before that, it enables vmemmap_populate_basepages()
and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
allocation requests.

This patch (of 3):

vmemmap_populate_basepages() is used across platforms to allocate backing
memory for the vmemmap mapping.  It serves as the standard default choice
or as a fallback when the intended huge page allocation fails.  It simply
creates the entire vmemmap mapping with base pages (PAGE_SIZE).

On arm64 platforms, vmemmap_populate_basepages() is called instead of the
platform specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS
is not enabled, as is the case for the ARM64_16K_PAGES and
ARM64_64K_PAGES configs.

At present vmemmap_populate_basepages() does not support allocating from
a driver defined struct vmem_altmap while trying to create the vmemmap
mapping for a device memory range.  This prevents the ARM64_16K_PAGES and
ARM64_64K_PAGES configs on arm64 from supporting device memory with
vmem_altmap requests.

This enables vmem_altmap support in vmemmap_populate_basepages(),
unlocking device memory allocation for the vmemmap mapping on arm64
platforms with 16K or 64K base page configs.

Each architecture should evaluate and decide on subscribing to device
memory based base page allocation through vmemmap_populate_basepages().
Hence let's keep it disabled on all archs in order to preserve the
existing semantics.  A subsequent patch enables it on arm64.
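
Roughly, the resulting loop is as follows (a sketch of the end state;
helper names per mm/sparse-vmemmap.c, with vmemmap_verify() elided):

int __meminit vmemmap_populate_basepages(unsigned long start,
                                         unsigned long end, int node,
                                         struct vmem_altmap *altmap)
{
        unsigned long addr = start;
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

        for (; addr < end; addr += PAGE_SIZE) {
                pgd = vmemmap_pgd_populate(addr, node);
                if (!pgd)
                        return -ENOMEM;
                p4d = vmemmap_p4d_populate(pgd, addr, node);
                if (!p4d)
                        return -ENOMEM;
                pud = vmemmap_pud_populate(p4d, addr, node);
                if (!pud)
                        return -ENOMEM;
                pmd = vmemmap_pmd_populate(pud, addr, node);
                if (!pmd)
                        return -ENOMEM;
                /* the new altmap argument reaches the leaf allocation */
                pte = vmemmap_pte_populate(pmd, addr, node, altmap);
                if (!pte)
                        return -ENOMEM;
        }
        return 0;
}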

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jia He <justin.he@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Link: http://lkml.kernel.org/r/1594004178-8861-1-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1594004178-8861-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:27 -07:00
Mike Rapoport e31cf2f4ca mm: don't include asm/pgtable.h if linux/mm.h is already included
Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definitions of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always the
possibility to override the generic version with the usual ifdef magic.
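
For illustration, the override hook in the shared header follows the
usual guard-macro idiom (a sketch using pmd_offset() as the example,
mirroring the generic definition above):

#ifndef pmd_offset
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
#define pmd_offset pmd_offset
#endif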

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are removed with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Dan Williams e9c0a3f054 mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
Allow sub-section sized ranges to be added to the memmap.

populate_section_memmap() takes an explicit pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmemmap_populate().  There should be no sub-section usage in
current deployments.  New warnings are added to clarify which memmap
allocation paths are sub-section capable.
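
The interface change boils down to taking an explicit pfn range instead
of an implicit full section; roughly (a sketch of the new signature):

/* sketch: the section-implicit interface becomes an explicit pfn range */
struct page * __meminit populate_section_memmap(unsigned long pfn,
                unsigned long nr_pages, int nid, struct vmem_altmap *altmap);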

Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Mike Rapoport 57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below, followed by
semi-automated removal of duplicated '#include <linux/memblock.h>' lines:

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport 97ad1087ef memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants
Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of
identical MEMBLOCK definitions.
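
The MEMBLOCK counterparts carry the same values (per
include/linux/memblock.h):

/* flags accepted by the memblock allocation APIs */
#define MEMBLOCK_ALLOC_ANYWHERE		(~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE	0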

Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport eb31d559f1 memblock: remove _virt from APIs returning virtual address
The conversion is done using

sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
	$(git grep -l memblock_virt_alloc)

Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Pavel Tatashin 2a3cb8baef mm/sparse: delete old sparse_init and enable new one
Rename new_sparse_init() to sparse_init(), which enables it.  Delete the
old sparse_init() and all the code that became obsolete with it.

[pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
  Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin afda57bc13 mm/sparse: move buffer init/fini to the common place
Now that both variants of sparse memory use the same buffers to populate
memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
common place.

Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin 35fd1eb1e8 mm/sparse: abstract sparse buffer allocations
Patch series "sparse_init rewrite", v6.

In sparse_init() we allocate two large buffers to temporarily hold the
usemap and memmap for the whole machine.  However, we can avoid doing
that if we change sparse_init() to operate on a per-node basis instead
of doing it for the whole machine beforehand.

As shown by Baoquan
  http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com

The buffers are large enough to prevent the machine from booting on
small memory systems.

Another benefit of these changes is that they also obsolete
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

This patch (of 5):

When struct pages are allocated for the sparse-vmemmap VA layout, we
first try to allocate one large buffer, and then, if that fails, allocate
struct pages for each section as we go.

The code that allocates the buffer uses global variables and is spread
across several call sites.

Cleanup the code by introducing three functions to handle the global
buffer:

sparse_buffer_init()	initialize the buffer
sparse_buffer_fini()	free the remaining part of the buffer
sparse_buffer_alloc()	alloc from the buffer, and if the buffer is
			empty return NULL

Define these functions in sparse.c instead of sparse-vmemmap.c because
later we will use them for non-vmemmap sparse allocations as well.
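
In rough terms, the API surface introduced here is (signatures sketched
from the description above; see mm/sparse.c):

/* sketch of the global-buffer helpers */
void sparse_buffer_init(unsigned long size, int nid);	/* grab one big chunk */
void sparse_buffer_fini(void);		/* release the unused remainder */
void *sparse_buffer_alloc(unsigned long size);	/* carve out, or NULL if empty */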

[akpm@linux-foundation.org: use PTR_ALIGN()]
[akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Baoquan He c98aff6493 mm/sparse: optimize memmap allocation during sparse_init()
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  With the
help of these two arrays, a contiguous memory chunk is allocated for the
usemap and memmap of the memory sections on one node.  This avoids
excessive memory fragmentation.  In the diagram below, '1' indicates a
present memory section, '0' an absent one.  The number 'n' can be much
smaller than NR_MEM_SECTIONS on most systems.

  |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
  -------------------------------------------------------
   0 1 2 3         4 5         i   i+1     n-1   n

If we fail to populate the page tables to map one section's memmap, its
->section_mem_map will finally be cleared to indicate that it's not
present.  After use, these two arrays will be released at the end of
sparse_init().

In 4-level paging mode each array costs 4M, which is negligible.  But in
5-level paging they cost 256M each, 512M altogether.  A kdump kernel
usually reserves very little memory, e.g. 256M.  So even though they are
only temporarily allocated, this is still not acceptable.

In fact, there's no need to allocate them with the size of
NR_MEM_SECTIONS.  Since the ->section_mem_map clearing has been deferred
to the end, the number of present memory sections stays the same during
sparse_init() until we finally clear out the ->section_mem_map of any
memory section whose usemap or memmap was not correctly handled.  Thus,
whenever the for_each_present_section_nr() loop is taken in the middle,
the i-th present memory section is always the same one.

So only allocate usemap_map and map_map with the size of
'nr_present_sections'.  For the i-th present memory section, install its
usemap and memmap into usemap_map[i] and map_map[i] during allocation.
Then, in the last for_each_present_section_nr() loop, which clears the
failed memory sections' ->section_mem_map, fetch the usemap and memmap
from the usemap_map[] and map_map[] arrays and set them into
mem_section[] accordingly.
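
A hypothetical sketch of the sizing change (names taken from the
description above, not a verbatim hunk):

/* before: sized for every possible section */
size = sizeof(struct page *) * NR_MEM_SECTIONS;
/* after: sized only for sections actually marked present */
size = sizeof(struct page *) * nr_present_sections;
map_map = memblock_virt_alloc(size, 0);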

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He 07a34a8c36 mm/sparsemem.c: defer the ms->section_mem_map clearing
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the
system will allocate one contiguous memory chunk for the mem maps on one
node and populate the relevant page tables to map the memory sections one
by one.  If populating a certain mem section fails, a warning is printed
and its ->section_mem_map is cleared to cancel its marking as present.
As a result, the number of mem sections marked as present can decrease
during sparse_init() execution.

Here, just defer the ms->section_mem_map clearing on page table
population failure until the last for_each_present_section_nr() loop.
This is in preparation for later optimizing the mem map allocation.

[akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Christoph Hellwig eb8045335c mm: merge vmem_altmap_alloc into altmap_alloc_block_buf
There is no clear separation between the two, so merge them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig a8fc357b28 mm: split altmap memory map allocation from normal case
No functional changes, just untangling the call chain and documenting
why the altmap is passed around the hotplug code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig 7b73d978a5 mm: pass the vmem_altmap to vmemmap_populate
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Michal Hocko fcdaf842bd mm, sparse: do not swamp log with huge vmemmap allocation failures
While doing memory hotplug tests under heavy memory pressure, we have
noticed too many page allocation failures when allocating the vmemmap
memmap backed by huge pages:

  kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
  [...]
  Call Trace:
    dump_trace+0x59/0x310
    show_stack_log_lvl+0xea/0x170
    show_stack+0x21/0x40
    dump_stack+0x5c/0x7c
    warn_alloc_failed+0xe2/0x150
    __alloc_pages_nodemask+0x3ed/0xb20
    alloc_pages_current+0x7f/0x100
    vmemmap_alloc_block+0x79/0xb6
    __vmemmap_alloc_block_buf+0x136/0x145
    vmemmap_populate+0xd2/0x2b9
    sparse_mem_map_populate+0x23/0x30
    sparse_add_one_section+0x68/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    add_memory_resource+0x89/0x160
    add_memory+0x6d/0xd0
    acpi_memory_device_add+0x181/0x251
    acpi_bus_attach+0xfd/0x19b
    acpi_bus_scan+0x59/0x69
    acpi_device_hotplug+0xd2/0x41f
    acpi_hotplug_work_fn+0x1a/0x23
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xbd/0xe0
    ret_from_fork+0x3f/0x70

and we do see many of those because essentially every allocation fails
for each memory section.  This is an excessive way to tell the user that
there is nothing to really worry about because we do have a fallback
mechanism to use base pages.  The only downside might be a performance
degradation due to TLB pressure.

This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
explicitly once on the first allocation failure.  This will reduce the
noise in the kernel log considerably, while we still have an indication
that performance might be impacted.
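
The pattern is roughly (a sketch of the warn-once logic for the
huge-page-order path, not the exact hunk):

gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
static bool warned;
struct page *page;

page = alloc_pages_node(node, gfp_mask, order);
if (page)
        return page_address(page);

if (!warned) {
        warn_alloc(gfp_mask & ~__GFP_NOWARN, NULL,
                   "vmemmap alloc failure: order:%u", order);
        warned = true;
}
return NULL;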

[mhocko@kernel.org: forgot to git add the follow up fix]
  Link: http://lkml.kernel.org/r/20171107090635.c27thtse2lchjgvb@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20171106092228.31098-1-mhocko@kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:07 -08:00
Pavel Tatashin f7f99100d8 mm: stop zeroing memory during allocation in vmemmap
vmemmap_alloc_block() will no longer zero the block, so zero memory at
its call sites for everything except struct pages.  Struct page memory
is zeroed by struct page initialization.

Replace the allocators in sparse-vmemmap with the non-zeroing versions.
This way we get the performance improvement of zeroing the memory in
parallel when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().
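
Conceptually, the zeroing moves into the per-page initializer (a sketch
eliding some of the field setup):

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
                                         unsigned long zone, int nid)
{
        memset(page, 0, sizeof(*page)); /* zeroing now happens here */
        set_page_links(page, zone, nid, pfn);
        init_page_count(page);
        page_mapcount_reset(page);
        INIT_LIST_HEAD(&page->lru);
}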

These single-thread performance numbers were collected on an Intel(R)
Xeon(R) CPU E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in
8 nodes):

                         BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

[akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:05 -08:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Michal Hocko b95046b047 mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations
Commit f52407ce2d ("memory hotplug: alloc page from other node in
memory online") has introduced N_HIGH_MEMORY checks to only use NUMA
aware allocations when there is some memory present because the
respective node might not have any memory yet at the time and so it
could fail or even OOM.

Things have changed since then though.  Zonelists are now always
initialized before we do any allocations even for hotplug (see
959ecc48fc ("mm/memory_hotplug.c: fix building of node hotplug
zonelist")).

Therefore these checks are not really needed.  In fact, callers of the
allocator should never care whether the node is populated, because that
might change at any time.

Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Michal Hocko dcda9b0471 mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misguided
usage of the __GFP_REPEAT flag has been removed for !costly requests, we
can give the original flag a better name and, more importantly, a more
useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which tells the
user that the allocator will try really hard but there is no promise of
success.  This works independent of the order and overrides the default
allocator behavior.  Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example):

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer), and a user defined fallback
behavior is more sensible than endlessly retrying in the page allocator.
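
For example, a costly allocation that should try hard but leave failure
handling to the caller could look like this (illustrative only):

/* try hard for a high-order block, but never trigger the OOM killer */
page = alloc_pages(GFP_KERNEL | __GFP_RETRY_MAYFAIL, order);
if (!page)
        return -ENOMEM; /* caller-defined fallback instead of endless retry */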

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Kirill A. Shutemov c2febafc67 mm: convert generic code to 5-level paging
Convert all non-architecture-specific code to 5-level paging.

It's mostly mechanical: adding handling of one more page table level in
places where we deal with pud_t.
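
The mechanical shape of the conversion: page-table walks gain one step
between pgd and pud (illustrative):

pgd = pgd_offset_k(addr);
p4d = p4d_offset(pgd, addr);    /* the new level */
pud = pud_offset(p4d, addr);    /* previously took pgd directly */
pmd = pmd_offset(pud, addr);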

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 11:48:47 -08:00
Fabian Frederick bd721ea73e treewide: replace obsolete _refok by __ref
There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Joe Perches 1170532bb4 mm: convert printk(KERN_<LEVEL> to pr_<level>
Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

 - Realign arguments
 - Add missing newline to format
 - kmemleak-test.c has a "kmemleak: " prefix added to the
   "Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joe Perches 756a025f00 mm: coalesce split strings
Kernel style prefers a single string over split strings when the string is
'user-visible'.

Miscellanea:

 - Add a missing newline
 - Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Dan Williams 4b94ffdc41 x86, mm: introduce vmem_altmap to augment vmemmap_populate()
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
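
The structure itself is small; roughly (a sketch of the fields as
introduced, with comments added here):

struct vmem_altmap {
        const unsigned long base_pfn;   /* first pfn of the reserved range */
        const unsigned long reserve;    /* pages set aside before the memmap */
        unsigned long free;             /* pages available for allocation */
        unsigned long align;            /* pages consumed for alignment */
        unsigned long alloc;            /* pages allocated so far */
};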

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Santosh Shilimkar bb016b8416 mm/sparse: use memblock apis for early memory allocations
Switch to memblock interfaces for the early memory allocator instead of
the bootmem allocator.  No functional change in behavior from the
bootmem users' point of view.

Archs already converted to NO_BOOTMEM now directly use memblock
interfaces instead of bootmem wrappers built on top of memblock.  For
the archs which still use bootmem, these new APIs just fall back to the
existing bootmem APIs.
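
The conversion pattern is of this shape (a hypothetical before/after;
exact call sites vary):

/* before: bootmem */
usemap = alloc_bootmem_node(NODE_DATA(nid), size);
/* after: the memblock wrapper of that era (renamed in later commits) */
usemap = memblock_virt_alloc_node(size, nid);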

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:47 -08:00
Johannes Weiner 0aad818b2d sparse-vmemmap: specify vmemmap population range in bytes
The sparse code, when asking the architecture to populate the vmemmap,
specifies the section range as a starting page and a number of pages.

This is an awkward interface, because none of the arch-specific code
actually thinks of the range in terms of 'struct page' units and always
translates it to bytes first.

In addition, later patches mix huge page and regular page backing for
the vmemmap.  For this, they need to call vmemmap_populate_basepages()
on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind.  But these
are not necessarily multiples of the 'struct page' size and so this unit
is too coarse.

Just translate the section range into bytes once in the generic sparse
code, then pass byte ranges down the stack.
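
After the change, the arch hook takes plain byte bounds; roughly (a
sketch of the resulting signature):

/* section units become a [start, end) virtual address range in bytes */
int vmemmap_populate(unsigned long start, unsigned long end, int node);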

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: David S. Miller <davem@davemloft.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:35 -07:00