199 Commits

Author SHA1 Message Date
Baoquan He 930e56cdd6 crash: add a new kexec flag for hotplug support
JIRA: https://issues.redhat.com/browse/RHEL-58641

Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 79365026f86948b52c3cb7bf099dded92c559b4c
Author: Sourabh Jain <sourabhjain@linux.ibm.com>
Date:   Tue Mar 26 11:24:09 2024 +0530

    crash: add a new kexec flag for hotplug support

    Commit a72bbec70da2 ("crash: hotplug support for kexec_load()")
    introduced a new kexec flag, `KEXEC_UPDATE_ELFCOREHDR`. Kexec tool uses
    this flag to indicate to the kernel that it is safe to modify the
    elfcorehdr of the kdump image loaded using the kexec_load system call.

    However, it is possible that architectures may need to update kexec
    segments other than elfcorehdr. For example, FDT (Flattened Device Tree)
    on PowerPC. Introducing a new kexec flag for every new kexec segment
    may not be a good solution. Hence, a generic kexec flag bit,
    `KEXEC_CRASH_HOTPLUG_SUPPORT`, is introduced to share the CPU/Memory
    hotplug support intent between the kexec tool and the kernel for the
    kexec_load system call.

    Now we have two kexec flags that enable crash hotplug support for
    kexec_load system call. First is KEXEC_UPDATE_ELFCOREHDR (only used in
    x86), and second is KEXEC_CRASH_HOTPLUG_SUPPORT (for all architectures).

    To simplify the process of finding and reporting the crash hotplug
    support, the following changes are introduced.

    1. Define arch specific function to process the kexec flags and
       determine crash hotplug support

    2. Rename the @update_elfcorehdr member of struct kimage to
       @hotplug_support and populate it for both kexec_load and
       kexec_file_load syscalls, because architecture can update more than
       one kexec segment

    3. Let generic function crash_check_hotplug_support report hotplug
       support for loaded kdump image based on value of @hotplug_support

    To bring the x86 crash hotplug support in line with the above points,
    the following changes have been made:

    - Introduce the arch_crash_hotplug_support function to process kexec
      flags and determine crash hotplug support

    - Remove the arch_crash_hotplug_[cpu|memory]_support functions
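
The flag handling described above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel's code: the numeric flag values are assumed from the upstream uapi header, `crash_hotplug_support` is a hypothetical name, and `file_mode` stands in for an image loaded through kexec_file_load.

```python
# Assumed flag values; verify against include/uapi/linux/kexec.h.
KEXEC_UPDATE_ELFCOREHDR = 0x00000004
KEXEC_CRASH_HOTPLUG_SUPPORT = 0x00000008


def crash_hotplug_support(kexec_flags: int, file_mode: bool = False) -> bool:
    """Model of an x86-style check: either flag signals hotplug support."""
    if file_mode:
        # Assumption: for kexec_file_load the kernel built the segments
        # itself, so it may always update them.
        return True
    mask = KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT
    return bool(kexec_flags & mask)
```

In this model the result would be stored once at load time (the role of @hotplug_support), so later reporting does not need to look at the flags again.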

    Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
    Acked-by: Baoquan He <bhe@redhat.com>
    Acked-by: Hari Bathini <hbathini@linux.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://msgid.link/20240326055413.186534-3-sourabhjain@linux.ibm.com

Signed-off-by: Baoquan He <bhe@redhat.com>
2024-12-23 09:35:36 +08:00
Jeff Moyer f43d3fce8e mm/memory_hotplug: embed vmem_altmap details in memory block
JIRA: https://issues.redhat.com/browse/RHEL-23824
Conflicts: Context differences due to different patch application
  order in RHEL as compared to upstream.  Specifically, commits
  f42ce5f087eb ("mm/memory_hotplug: fix error handling in
  add_memory_resource()") and 001002e73712 ("mm/memory_hotplug: add
  missing mem_hotplug_lock").

commit 1a8c64e110435e44e71bcd50a75663174b575f22
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Tue Aug 8 14:45:01 2023 +0530

    mm/memory_hotplug: embed vmem_altmap details in memory block
    
    With memmap on memory, some architectures need more details w.r.t altmap
    such as base_pfn, end_pfn, etc to unmap vmemmap memory.  Instead of
    computing them again when we remove a memory block, embed vmem_altmap
    details in struct memory_block if we are using memmap on memory block
    feature.
    
    [yangyingliang@huawei.com: fix error return code in add_memory_resource()]
      Link: https://lkml.kernel.org/r/20230809081552.1351184-1-yangyingliang@huawei.com
    Link: https://lkml.kernel.org/r/20230808091501.287660-7-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-26 14:57:31 -04:00
Mark Langsdorf 250213a584 mm/memory_hotplug: add missing mem_hotplug_lock
JIRA: https://issues.redhat.com/browse/RHEL-26183
Conflicts:
	drivers/base/memory.c - minor context differences

commit 001002e73712cdf6b8d9a103648cda3040ad7647
Author: Sumanth Korikkar <sumanthk@linux.ibm.com>
Date: Wed, 06 Dec 2023 16:12:46 +0000

From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).

mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called w/o the
mem_hotplug_lock.

When memory block is being offlined and when kmemleak goes through each
populated zone, the following theoretical race conditions could occur:
CPU 0:					     | CPU 1:
memory_offline()			     |
-> offline_pages()			     |
	-> mem_hotplug_begin()		     |
	   ...				     |
	-> mem_hotplug_done()		     |
					     | kmemleak_scan()
					     | -> get_online_mems()
					     |    ...
-> mhp_deinit_memmap_on_memory()	     |
  [not protected by mem_hotplug_begin/done()]|
  Marks memory section as offline,	     |   Retrieves zone_start_pfn
  poisons vmemmap struct pages and updates   |   and struct page members.
  the zone related data			     |
   					     |    ...
   					     | -> put_online_mems()

Fix this by ensuring mem_hotplug_lock is taken before performing
mhp_init_memmap_on_memory().  Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.

online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.
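
The locking change can be pictured with a toy model. This is a simplified sketch, not kernel code: a plain mutex stands in for mem_hotplug_lock (the real lock distinguishes readers and writers), and the class and method names are hypothetical.

```python
import threading


class MemoryBlockModel:
    """Toy model: one mutex stands in for mem_hotplug_lock."""

    def __init__(self):
        self.hotplug_lock = threading.Lock()
        self.trace = []

    def _offline_pages(self):
        # In this model the caller must already hold the lock.
        assert self.hotplug_lock.locked()
        self.trace.append("offline_pages")

    def _deinit_memmap_on_memory(self):
        assert self.hotplug_lock.locked()
        self.trace.append("mhp_deinit_memmap_on_memory")

    def memory_block_offline(self):
        # The lock is taken once around both steps, so a scanner cannot
        # run between them and observe half-updated zone data.
        with self.hotplug_lock:
            self._offline_pages()
            self._deinit_memmap_on_memory()

    def scan(self):
        # Stand-in for a reader such as kmemleak_scan().
        with self.hotplug_lock:
            self.trace.append("scan")
```

Moving the lock up into the block-level function is what makes both steps appear as one unit to any reader that takes the same lock.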

Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
Fixes: a08a2ae346 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Cc: <stable@vger.kernel.org>	[5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-05-31 15:37:45 -04:00
Mark Langsdorf 013a5d02a3 crash: memory and CPU hotplug sysfs attributes
JIRA: https://issues.redhat.com/browse/RHEL-26183
Conflicts:
	Documentation/ABI/testing/sysfs-devices-system-cpu -
The crash_hotplug text was added to the end of the file
in upstream, so I did the same here.
	include/linux/kexec.h - minor context differences

commit 88a6f89944216b028d3872b0cec0f51a2f955460
Author: Eric DeVolder <eric.devolder@oracle.com>
Date: Thu, 24 Aug 2023 16:25:14 +0000

Introduce the crash_hotplug attribute for memory and CPUs for use by
userspace.  These attributes directly facilitate the udev rule for
managing userspace re-loading of the crash kernel upon hot un/plug
changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory.  For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="8000000"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}=="  (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.

For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules test if
crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU
and CONFIG_MEMORY_HOTPLUG kernel config options.  If an architecture
supports, for example, memory hotplug but not CPU hotplug, then the
/sys/devices/system/memory/crash_hotplug attribute file is present, but
the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be
present.  Thus the udev rule skips userspace processing of memory hot
un/plug events, but the udev rule will evaluate false for CPU events, thus
allowing userspace to process CPU hot un/plug events (ie the
unload-then-reload of the kdump capture kernel).
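
The same decision the udev rule makes can be sketched from userspace by reading the attribute files directly. This is a minimal sketch: the function names are hypothetical, and the `sysfs_root` parameter exists only so the logic can be exercised against a temporary directory.

```python
from pathlib import Path


def kernel_handles_crash_hotplug(subsystem: str, sysfs_root: str = "/sys") -> bool:
    """True if <sysfs_root>/devices/system/<subsystem>/crash_hotplug reads "1".

    A missing attribute file counts as "not supported", mirroring the udev
    rule evaluating false for that subsystem.
    """
    attr = Path(sysfs_root) / "devices" / "system" / subsystem / "crash_hotplug"
    try:
        return attr.read_text().strip() == "1"
    except OSError:
        return False


def userspace_must_reload(subsystem: str, sysfs_root: str = "/sys") -> bool:
    """True if userspace still has to unload and reload the crash kernel."""
    return not kernel_handles_crash_hotplug(subsystem, sysfs_root)
```

Calling `userspace_must_reload("memory")` and `userspace_must_reload("cpu")` separately reflects the per-subsystem split described above.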

Link: https://lkml.kernel.org/r/20230814214446.6659-5-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-05-31 15:37:43 -04:00
Chris von Recklinghausen 891dbb5790 mm/hwpoison: introduce per-memory_block hwpoison counter
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5033091de814ab4b5623faed2755f3064e19e2d2
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:12 2022 +0900

    mm/hwpoison: introduce per-memory_block hwpoison counter

    Currently PageHWPoison flag does not behave well when experiencing memory
    hotremove/hotplug.  Any data field in struct page is unreliable when the
    associated memory is offlined, and the current mechanism can't tell
    whether a memory block is onlined because a new memory device is
    installed or because previous failed offline operations are undone.
    Especially if there's a hwpoisoned memory, it's unclear what the best
    option is.

    So introduce a new mechanism to make struct memory_block remember that a
    memory block has hwpoisoned memory inside it.  And make any online event
    fail if the onlining memory block contains hwpoison.  struct memory_block
    is freed and reallocated over ACPI-based hotremove/hotplug, but not over
    sysfs-based hotremove/hotplug.  So the new counter can distinguish these
    cases.
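
A toy model of the counter may help. It is a sketch under assumptions, not the kernel structure: the class and method names are hypothetical, and the EHWPOISON value is assumed from the generic Linux errno table.

```python
EHWPOISON = 133  # assumed value from the generic Linux errno table


class MemoryBlock:
    """Toy model of a memory block that remembers hwpoisoned pages."""

    def __init__(self):
        self.nr_hwpoison = 0
        self.online = False

    def record_hwpoison(self):
        self.nr_hwpoison += 1

    def try_online(self) -> int:
        # Refuse to online a block that still remembers poisoned memory.
        if self.nr_hwpoison:
            return -EHWPOISON
        self.online = True
        return 0
```

A block that is only offlined and onlined again keeps the same object and therefore the same counter, while a block that is removed and re-added starts from a fresh object with a zero counter. That is the distinction the commit message draws between the sysfs-based and ACPI-based paths.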

    Link: https://lkml.kernel.org/r/20221024062012.1520887-5-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:19 -04:00
Mark Langsdorf d87cf41db0 drivers/base/memory: Fix comments for phys_index_show()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178302

commit 7c09f4281cb6ec0fc202f53924ed6c389c61bf0e
Author: Gavin Shan <gshan@redhat.com>
Date: Fri, 20 Jan 2023 13:57:26 +0800

According to 'admin-guide/mm/memory-hotplug.rst', the memory block ID,
instead of the section index, is shown by '/sys/devices/system/memory/
memoryX/phys_index'.

Fix the comments to match with 'admin-guide/mm/memory-hotplug.rst'.
Besides, use the existing helper memory_block_id() to convert the section
index to the memory block index.
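
The conversion can be illustrated as follows. This is a sketch with hypothetical signatures: `sections_per_block` is passed explicitly here, and the zero-padded hexadecimal format is assumed from the sample phys_index output shown elsewhere in this log ("00000051" for memory81).

```python
def memory_block_id(section_nr: int, sections_per_block: int) -> int:
    """Map a section index to the index of the memory block containing it."""
    return section_nr // sections_per_block


def phys_index(start_section_nr: int, sections_per_block: int) -> str:
    """Render the block ID the way the sysfs attribute appears to print it."""
    return format(memory_block_id(start_section_nr, sections_per_block), "08x")
```

With one section per block, section 81 maps to block 81; with two sections per block, section 162 maps to the same block.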

No functional change intended.

Signed-off-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230120055727.355483-2-gshan@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2023-06-08 12:33:11 -04:00
Mark Langsdorf 951d600205 mm: kill is_memblock_offlined()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178302

commit 639118d1571f70b1157b4bb5ac574b0ab0f38099
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Sat, 27 Aug 2022 19:20:43 +0800

Directly check the state of struct memory_block; there is no need for a
separate function.

Link: https://lkml.kernel.org/r/20220827112043.187028-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2023-06-08 12:20:27 -04:00
Herton R. Krzesinski 1e9330a19c Merge: Update drivers/base to match Linux v6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1332

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318

Bring drivers/base up to date with Linux v6.0.

Ugly dependencies in the CXL and nvdimm code caused a commit that removed
lockdep to fail to compile, and that commit was dropped from the series.

Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>

Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>

Conflicts:
      MAINTAINERS

There were conflicts in MAINTAINERS file because of previous merge of
IOMMU/DMA API update that did several updates there. Resolution kept
changes in place, so there is no change with this merge.

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-21 12:03:57 -03:00
Mark Langsdorf 325775c446 drivers/base/memory: fix an unlikely reference counting issue in __add_memory_block()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318

commit f47f758cff59c68015d6b9b9c077110df7c2c828
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Thu, 28 Apr 2022 23:16:19 -0700

Omitted-fix: 4e224719f5d9b92abf1e0edfb2a83053208f3026
	Functionally identical to this fix, but with slightly different CCs.

__add_memory_block() calls both put_device() and device_unregister() when
storing the memory block into the xarray.  This is incorrect because
xarray doesn't take an additional reference and device_unregister()
already calls put_device().

Triggering the issue looks really unlikely and its only effect should be
to log a spurious warning about a ref counted issue.
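
The double put can be shown with a toy reference counter. This is a deliberately simplified sketch: the names are hypothetical, and the real device model holds and drops references in more places than shown here.

```python
class Device:
    """Toy device with a single reference held by its creator."""

    def __init__(self):
        self.refcount = 1
        self.warnings = 0

    def put(self):
        if self.refcount == 0:
            # Dropping a reference that is not held: the "spurious warning".
            self.warnings += 1
            return
        self.refcount -= 1

    def unregister(self):
        # Unregistering already drops the creator's reference.
        self.put()


def buggy_error_path(dev: Device) -> None:
    dev.put()
    dev.unregister()


def fixed_error_path(dev: Device) -> None:
    dev.unregister()
```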

Link: https://lkml.kernel.org/r/d44c63d78affe844f020dc02ad6af29abc448fc4.1650611702.git.christophe.jaillet@wanadoo.fr
Fixes: 4fb6eabf10 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Scott Cheloha <cheloha@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-10-25 10:36:23 -04:00
Chris von Recklinghausen 18b123b391 mm/memory-failure: disable unpoison once hw error happens
Bugzilla: https://bugzilla.redhat.com/2120352

commit 67f22ba7750f940bcd7e1b12720896c505c2d63f
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Wed Jun 15 17:32:09 2022 +0800

    mm/memory-failure: disable unpoison once hw error happens

    Currently unpoison_memory(unsigned long pfn) is designed for soft
    poison (hwpoison-inject) only.  Since 17fae1294a, the KPTE gets cleared
    on an x86 platform once hardware memory is corrupted.

    Unpoisoning a hardware-corrupted page only puts the page back into the
    buddy allocator, so the kernel may access the page through a *NOT
    PRESENT* KPTE.  This leads to a BUG when accessing the corrupted KPTE.

    As suggested by David and Naoya, disable the unpoison mechanism when a
    real HW error happens to avoid a BUG like this:

     Unpoison: Software-unpoisoned page 0x61234
     BUG: unable to handle page fault for address: ffff888061234000
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0002) - not-present page
     PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
     Oops: 0002 [#1] PREEMPT SMP NOPTI
     CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G   M       OE     5.18.0.bm.1-amd64 #7
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
     RIP: 0010:clear_page_erms+0x7/0x10
     Code: ...
     RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
     RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
     RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
     RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
     R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
     R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
     FS:  00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     PKRU: 55555554
     Call Trace:
      <TASK>
      prep_new_page+0x151/0x170
      get_page_from_freelist+0xca0/0xe20
      ? sysvec_apic_timer_interrupt+0xab/0xc0
      ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
      __alloc_pages+0x17e/0x340
      __folio_alloc+0x17/0x40
      vma_alloc_folio+0x84/0x280
      __handle_mm_fault+0x8d4/0xeb0
      handle_mm_fault+0xd5/0x2a0
      do_user_addr_fault+0x1d0/0x680
      ? kvm_read_and_reset_apf_flags+0x3b/0x50
      exc_page_fault+0x78/0x170
      asm_exc_page_fault+0x27/0x30

    Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
    Fixes: 847ce401df ("HWPOISON: Add unpoisoning support")
    Fixes: 17fae1294a ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: <stable@vger.kernel.org>    [5.8+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen 281e153b89 mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
Bugzilla: https://bugzilla.redhat.com/2120352

commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c
Author: luofei <luofei@unicloud.com>
Date:   Tue Mar 22 14:44:38 2022 -0700

    mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler

    When the hwpoison page meets the filter conditions, it should not be
    regarded as successful memory_failure() processing by the mce handler.
    memory_failure() should return a distinct value instead; otherwise the
    mce handler assumes the error page has been identified and isolated,
    which may lead to calling set_mce_nospec() to change page attributes,
    etc.

    Here memory_failure() returns -EOPNOTSUPP to indicate that the error
    event is filtered.  The mce handler should not take any action in this
    situation, and the hwpoison injector should treat it as success.
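
How the two callers might treat the distinct return value can be sketched as below. This is a user-space illustration only: the function bodies and the returned strings are hypothetical, and only the use of -EOPNOTSUPP for "filtered" comes from the commit message.

```python
import errno


def memory_failure(pfn: int, filtered_pfns: set) -> int:
    """Return 0 when the page was handled, -EOPNOTSUPP when it was filtered."""
    if pfn in filtered_pfns:
        return -errno.EOPNOTSUPP
    return 0


def mce_handler(pfn: int, filtered_pfns: set) -> str:
    ret = memory_failure(pfn, filtered_pfns)
    if ret == 0:
        return "isolated"   # safe to change page attributes here
    if ret == -errno.EOPNOTSUPP:
        return "no-action"  # filtered: leave the page alone
    return "failed"


def hwpoison_inject(pfn: int, filtered_pfns: set) -> int:
    ret = memory_failure(pfn, filtered_pfns)
    # The injector treats a filtered event as success.
    return 0 if ret == -errno.EOPNOTSUPP else ret
```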

    Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
    Signed-off-by: luofei <luofei@unicloud.com>
    Acked-by: Borislav Petkov <bp@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
David Hildenbrand 233edca76b drivers/base/memory: clarify adding and removing of memory blocks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2077436

commit 2aa065f7afb28aabb475cc27f24cb18c5141173d
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:47:34 2022 -0700

    drivers/base/memory: clarify adding and removing of memory blocks

    Let's make it clearer at which places we actually add and remove memory
    blocks -- streamlining the terminology -- and highlight which memory blocks
    start out online and which start out offline.

     * rename add_memory_block -> add_boot_memory_block
     * rename init_memory_block -> add_memory_block
     * rename unregister_memory -> remove_memory_block
     * rename register_memory -> __add_memory_block
     * add add_hotplug_memory_block
     * mark add_boot_memory_block with __init (suggested by Oscar)

    __add_memory_block() is a pure helper for add_memory_block(); remove
    the somewhat obvious comment.

    Link: https://lkml.kernel.org/r/20220221154531.11382-1-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
2022-04-21 17:04:04 +02:00
David Hildenbrand be1c559fd2 drivers/base/memory: determine and store zone for single-zone memory blocks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2077436

commit 395f6081bad49f9c54abafebab49ee23aa985bbd
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:47:31 2022 -0700

    drivers/base/memory: determine and store zone for single-zone memory blocks

    test_pages_in_a_zone() is just another nasty PFN walker that can easily
    stumble over ZONE_DEVICE memory ranges falling into the same memory block
    as ordinary system RAM: the memmap of parts of these ranges might possibly
    be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:

      UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
      index 7 is out of range for type 'zone [5]'
      CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
      Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
      Call Trace:
       dump_stack+0x9a/0xf0
       ubsan_epilogue+0x9/0x7a
       __ubsan_handle_out_of_bounds+0x13a/0x181
       test_pages_in_a_zone+0x3c4/0x500
       show_valid_zones+0x1fa/0x380
       dev_attr_show+0x43/0xb0
       sysfs_kf_seq_show+0x1c5/0x440
       seq_read+0x49d/0x1190
       vfs_read+0xff/0x300
       ksys_read+0xb8/0x170
       do_syscall_64+0xa5/0x4b0
       entry_SYSCALL_64_after_hwframe+0x6a/0xdf
      RIP: 0033:0x7f01f4439b52

    We seem to stumble over a memmap that contains a garbage zone id.  While
    we could try inserting pfn_to_online_page() calls, it will just make
    memory offlining slower, because we use test_pages_in_a_zone() to make
    sure we're offlining pages that all belong to the same zone.

    Let's just get rid of this PFN walker and determine the single zone of a
    memory block -- if any -- for early memory blocks during boot.  For memory
    onlining, we know the single zone already.  Let's avoid any additional
    memmap scanning and just rely on the zone information available during
    boot.

    For memory hot(un)plug, we only really care about memory blocks that:
    * span a single zone (and, thereby, a single node)
    * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
    If one of these conditions is not met, we reject memory offlining.
    Hotplugged memory blocks (starting out offline), always meet both
    conditions.

    There are three scenarios to handle:

    (1) Memory hot(un)plug

    A memory block with zone == NULL cannot be offlined, corresponding to
    our previous test_pages_in_a_zone() check.

    After successful memory onlining/offlining, we simply set the zone
    accordingly.
    * Memory onlining: set the zone we just used for onlining
    * Memory offlining: set zone = NULL

    So a hotplugged memory block starts with zone = NULL. Once memory
    onlining is done, we set the proper zone.

    (2) Boot memory with !CONFIG_NUMA

    We know that there is just a single pgdat, so we simply scan all zones
    of that pgdat for an intersection with our memory block PFN range when
    adding the memory block. If more than one zone intersects (e.g., DMA and
    DMA32 on x86 for the first memory block) we set zone = NULL and
    consequently mimic what test_pages_in_a_zone() used to do.

    (3) Boot memory with CONFIG_NUMA

    At the point in time we create the memory block devices during boot, we
    don't know yet which nodes *actually* span a memory block. While we could
    scan all zones of all nodes for intersections, overlapping nodes complicate
    the situation and scanning all nodes is possibly expensive. But that
    problem has already been solved by the code that sets the node of a memory
    block and creates the link in the sysfs --
    do_register_memory_block_under_node().

    So, we hook into the code that sets the node id for a memory block. If
    we already have a different node id set for the memory block, we know
    that multiple nodes *actually* have PFNs falling into our memory block:
    we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
    to do. If there is no node id set, we do the same as (2) for the given
    node.

    Note that the call order in driver_init() is:
    -> memory_dev_init(): create memory block devices
    -> node_dev_init(): link memory block devices to the node and set the
                        node id

    So in summary, we detect if there is a single zone responsible for this
    memory block and we consequently store the zone in that case in the
    memory block, updating it during memory onlining/offlining.
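
The boot-time zone lookup for a single node can be sketched as an interval intersection test. This is a simplified model with hypothetical names: zones are plain (name, start_pfn, end_pfn) tuples with an exclusive end, and `None` plays the role of zone == NULL.

```python
from typing import List, Optional, Tuple

Zone = Tuple[str, int, int]  # (name, start_pfn, end_pfn), end exclusive


def zone_for_boot_memory_block(
    zones: List[Zone], block_start: int, block_end: int
) -> Optional[str]:
    """Return the single zone intersecting the block, or None otherwise."""
    found = None
    for name, start, end in zones:
        if start < block_end and block_start < end:
            if found is not None:
                # A second intersecting zone: no single zone exists.
                return None
            found = name
    return found


def can_offline(zone: Optional[str]) -> bool:
    """A block without a single known zone is rejected for offlining."""
    return zone is not None
```

After onlining, the stored value would be replaced by the zone actually used, and after offlining it would be reset to `None`, as described in scenario (1).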

    Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: Rafael Parra <rparrazo@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rafael Parra <rparrazo@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
2022-04-21 17:04:04 +02:00
David Hildenbrand 30ac1afdd3 drivers/base/memory: add memory block to memory group after registration succeeded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2077436

commit 7ea0d2d79da09d1f7d71c96a9c9bc1b5229360b5
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:47:09 2022 -0700

    drivers/base/memory: add memory block to memory group after registration succeeded

    If register_memory() fails, we freed the memory block but already added
    the memory block to the group list, not good.  Let's defer adding the
    block to the memory group to after registering the memory block device.

    We do handle it properly during unregister_memory(), but that's not
    called when the registration fails.

    Link: https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
    Fixes: 028fc57a1c36 ("drivers/base/memory: introduce "memory groups" to logically group memory blocks")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
2022-04-21 16:37:23 +02:00
Rafael Aquini 7247aa5ad7 mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 3fcebf90209a7f52d384ad7701425aa91be309ab
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:55:48 2021 -0700

    mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy

    Currently, the "auto-movable" online policy does not allow for hotplugged
    KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
    can have, primarily, because there is no coordination across memory
    devices and we don't want to create zone-imbalances accidentally when
    unplugging memory.

    However, within a single memory device it's different.  Let's allow for
    KERNEL memory within a dynamic memory group to allow for more MOVABLE
    within the same memory group.  The only thing we have to take care of is
    that the managing driver avoids zone imbalances by unplugging MOVABLE
    memory first, otherwise there can be corner cases where unplug of memory
    could result in (accidental) zone imbalances.

    virtio-mem is the only user of dynamic memory groups and recently added
    support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
    don't need a new toggle to enable it for dynamic memory groups.

    We limit this handling to dynamic memory groups, because:

    * We want to keep the runtime overhead for collecting stats when
      onlining a single memory block small.  We tend to have only a handful of
      dynamic memory groups, but we can have quite some static memory groups
      (e.g., 256 DIMMs).

    * It doesn't make too much sense for static memory groups, as we try
      onlining all applicable memory blocks either completely to ZONE_MOVABLE
      or not.  In ordinary operation, we won't have a mixture of zones within
      a static memory group.

    When adding memory to a dynamic memory group, we'll first online memory to
    ZONE_MOVABLE as long as early KERNEL memory allows for it.  Then, we'll
    online the next unit(s) to ZONE_NORMAL, until we can online the next
    unit(s) to ZONE_MOVABLE.

    For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
    result in a layout like:

      [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
      ^ movable memory due to early kernel memory
                               ^ allows for more movable memory ...
                                  ^-----^ ... here
                                           ^ allows for more movable memory ...
                                              ^-----^ ... here

    While the created layout is sub-optimal when it comes to contiguous zones,
    it gives us the maximum flexibility when dynamically growing/shrinking a
    device; we can grow small VMs really big in small steps, and still shrink
    reliably to e.g., 1/4 of the maximum VM size in this example, removing
    full memory blocks along with meta data more reliably.

    Mark dynamic memory groups in the xarray such that we can efficiently
    iterate over them when collecting stats.  In usual setups, we have one
    virtio-mem device per NUMA node, and usually only a small number of NUMA
    nodes.

    Note: for now, there seems to be no compelling reason to make this
    behavior configurable.

    Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hui Zhu <teawater@gmail.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Marek Kedzierski <mkedzier@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:12 -05:00
Rafael Aquini 6dc7c441ac mm/memory_hotplug: memory group aware "auto-movable" online policy
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 445fcf7c721450dd1d4ec6c217b3c6a932602a44
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:55:45 2021 -0700

    mm/memory_hotplug: memory group aware "auto-movable" online policy

    Use memory groups to improve our "auto-movable" onlining policy:

    1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
       only if all other memory blocks in the group are either MOVABLE or could
       be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.

    2. For dynamic memory groups (e.g., a virtio-mem device), online a
       memory block MOVABLE only if all other memory blocks inside the
       current unit are either MOVABLE or could be onlined MOVABLE. For a
       virtio-mem device with a device block size of 512 MiB, all 128 MiB
       memory blocks within a 512 MiB unit will either be MOVABLE or not, not
       a mixture.

    We have to pass the memory group to zone_for_pfn_range() to take the
    memory group into account.

    Note: for now, there seems to be no compelling reason to make this
    behavior configurable.

    Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hui Zhu <teawater@gmail.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Marek Kedzierski <mkedzier@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:12 -05:00
Rafael Aquini f7399a0e34 mm/memory_hotplug: track present pages in memory groups
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 836809ec75cc07c6d07c43036e3844affbe0d46f
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:55:30 2021 -0700

    mm/memory_hotplug: track present pages in memory groups

    Let's track all present pages in each memory group.  Especially, track
    memory present in ZONE_MOVABLE and memory present in one of the kernel
    zones (which really only is ZONE_NORMAL right now as memory groups only
    apply to hotplugged memory) separately within a memory group, to prepare
    for making smart auto-online decisions for individual memory blocks within
    a memory group based on group statistics.

    Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hui Zhu <teawater@gmail.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Marek Kedzierski <mkedzier@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:09 -05:00
Rafael Aquini d4d97b7673 drivers/base/memory: introduce "memory groups" to logically group memory blocks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 028fc57a1c361116e3bcebfeba4ca87878baaf4f
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:55:26 2021 -0700

    drivers/base/memory: introduce "memory groups" to logically group memory blocks

    In our "auto-movable" memory onlining policy, we want to make decisions
    across memory blocks of a single memory device.  Examples of memory
    devices include ACPI memory devices (in the simplest case a single DIMM)
    and virtio-mem.  For now, we don't have a connection between a single
    memory block device and the real memory device.  Each memory device
    consists of 1..X memory block devices.

    Let's logically group memory blocks belonging to the same memory device in
    "memory groups".  Memory groups can span multiple physical ranges and a
    memory group itself does not contain any information regarding physical
    ranges, only properties (e.g., "max_pages") necessary for improved memory
    onlining.

    Introduce two memory group types:

    1) Static memory group: E.g., a single ACPI memory device, consisting
       of 1..X memory resources.  A memory group consists of 1..Y memory
       blocks.  The whole group is added/removed in one go.  If any part
       cannot get offlined, the whole group cannot be removed.

    2) Dynamic memory group: E.g., a single virtio-mem device.  Memory is
       dynamically added/removed in a fixed granularity, called a "unit",
       consisting of 1..X memory blocks.  A unit is added/removed in one go.
       If any part of a unit cannot get offlined, the whole unit cannot be
       removed.

    In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
    none.  In case of 2) we usually want to have as many units as possible
    managed by ZONE_MOVABLE.  We want a single unit to be of the same type.

    For now, memory groups are an internal concept that is not exposed to user
    space; we might want to change that in the future, though.

    add_memory() users can specify a mgid instead of a nid when passing the
    MHP_NID_IS_MGID flag.

    Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hui Zhu <teawater@gmail.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Marek Kedzierski <mkedzier@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:08 -05:00
Rafael Aquini 726fecee67 mm: track present early pages per zone
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 4b0970024408afb17886e0c76e9761c4264db2a8
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:55:19 2021 -0700

    mm: track present early pages per zone

    Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.

    I. Goal

    The goal of this series is improving in-kernel auto-online support.  It
    tackles the fundamental problems that:

     1) We can create zone imbalances when onlining all memory blindly to
        ZONE_MOVABLE, in the worst case crashing the system. We have to know
        upfront how much memory we are going to hotplug such that we can
        safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
        via "online_movable". This is far from practical and only applicable in
        limited setups -- like inside VMs under the RHV/oVirt hypervisor which
        will never hotplug more than 3 times the boot memory (and the
        limitation is only in place due to the Linux limitation).

     2) We see more setups that implement dynamic VM resizing, hot(un)plugging
        memory to resize VM memory. In these setups, we might hotplug a lot of
        memory, but it might happen in various small steps in both directions
        (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
        primary driver of this upstream right now, performing such dynamic
        resizing NUMA-aware via multiple virtio-mem devices.

        Onlining all hotplugged memory to ZONE_NORMAL means we basically have
        no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
        easily run into zone imbalances when growing a VM. We want a mixture,
        and we want as much memory as reasonable/configured in ZONE_MOVABLE.
        Details regarding zone imbalances can be found at [1].

     3) Memory devices consist of 1..X memory block devices, however, the
        kernel doesn't really track the relationship. Consequently, also user
        space has no idea. We want to make per-device decisions.

        As one example, for memory hotunplug it doesn't make sense to use a
        mixture of zones within a single DIMM: we want all MOVABLE if
        possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
        block the whole DIMM from getting hotunplugged.

        As another example, virtio-mem operates on individual units that span
        1..X memory blocks. Similar to a DIMM, we want a unit to either be all
        MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
        all units of a virtio-mem device logically belong together and are
        managed (added/removed) by a single driver. We want as much memory of
        a virtio-mem device to be MOVABLE as possible.

     4) We want memory onlining to be done right from the kernel while adding
        memory, not triggered by user space via udev rules; for example, this
        is required for fast memory hotplug for drivers that add individual
        memory blocks, like virtio-mem. We want a way to configure a policy in
        the kernel and avoid implementing advanced policies in user space.

    The auto-onlining support we have in the kernel is not sufficient.  All we
    have is a) online everything MOVABLE (online_movable) b) online everything
    !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
    allows configuring c) to mean instead "online movable if possible
    according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
    -- a new onlining policy.

    II. Approach

    This series does 3 things:

     1) Introduces the "auto-movable" online policy that initially operates on
        individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
        to make a decision whether a memory block will be onlined to
        ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
        memory does not allow for more MOVABLE memory (details in the
        patches). CMA memory is treated like MOVABLE memory.

     2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
        groups and uses group information to make decisions in the
        "auto-movable" online policy across memory blocks of a single memory
        device (modeled as memory group). More details can be found in patch
        #3 or in the DIMM example below.

     3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
        allowing ZONE_NORMAL memory within a dynamic memory group to allow for
        more ZONE_MOVABLE memory within the same memory group. The target use
        case is dynamic VM resizing using virtio-mem. See the virtio-mem
        example below.

    I remember that the basic idea of using a ratio to implement a policy in
    the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
    lost the pointer to that discussion).

    For me, the main use case is using it along with virtio-mem (and DIMMs /
    ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
    amount of memory we can hotunplug reliably again if we might eventually
    hotplug a lot of memory to a VM.

    III. Target Usage

    The target usage will be:

     1) Linux boots with "mhp_default_online_type=offline"

     2) User space (e.g., systemd unit) configures memory onlining (according
        to a config file and system properties), for example:
        * Setting memory_hotplug.online_policy=auto-movable
        * Setting memory_hotplug.auto_movable_ratio=301
        * Setting memory_hotplug.auto_movable_numa_aware=true

     3) User space enables auto onlining via "echo online >
        /sys/devices/system/memory/auto_online_blocks"

     4) User space triggers manual onlining of all already-offline memory
        blocks (go over offline memory blocks and set them to "online")

    IV. Example

    For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
    301% results in the following layout:
            Memory block 0-15:    DMA32   (early)
            Memory block 32-47:   Normal  (early)
            Memory block 48-79:   Movable (DIMM 0)
            Memory block 80-111:  Movable (DIMM 1)
            Memory block 112-143: Movable (DIMM 2)
            Memory block 144-175: Normal  (DIMM 3)
            Memory block 176-207: Normal  (DIMM 4)
            ... all Normal
            (-> hotplugged Normal memory does not allow for more Movable memory)
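
The DIMM layout above can be reproduced with a small simulation. This is
only a sketch under stated assumptions, not the kernel implementation: it
assumes 128 MiB memory blocks, treats each DIMM as a static memory group
that is onlined all-or-nothing, and counts only early kernel memory toward
the ratio. The function name online_dimms() is hypothetical.

```python
def online_dimms(early_kernel_blocks, dimm_blocks, num_dimms, ratio_percent):
    """Return the zone chosen for each hotplugged DIMM, in hotplug order."""
    movable_blocks = 0
    zones = []
    for _ in range(num_dimms):
        # A DIMM goes to ZONE_MOVABLE only if the whole DIMM still fits
        # under the MOVABLE:KERNEL ratio. Hotplugged Normal memory is not
        # added to the kernel side, so it never allows for more Movable.
        wanted = movable_blocks + dimm_blocks
        if wanted * 100 <= early_kernel_blocks * ratio_percent:
            movable_blocks = wanted
            zones.append("Movable")
        else:
            zones.append("Normal")
    return zones


# 4 GiB VM (32 early blocks of 128 MiB), 4 GiB DIMMs (32 blocks), ratio 301%.
for dimm, zone in enumerate(online_dimms(32, 32, 5, 301)):
    print(f"DIMM {dimm}: {zone}")
```

With these inputs the first three DIMMs end up Movable and all later ones
Normal, matching the table.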

    For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
    will result in the following layout:
            Memory block 0-15:    DMA32   (early)
            Memory block 32-47:   Normal  (early)
            Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
            Memory block 144:     Normal  (virtio-mem, next 128 MiB)
            Memory block 145-147: Movable (virtio-mem, next 384 MiB)
            Memory block 148:     Normal  (virtio-mem, next 128 MiB)
            Memory block 149-151: Movable (virtio-mem, next 384 MiB)
            ... Normal/Movable mixture as above
            (-> hotplugged Normal memory allows for more Movable memory within
                the same device)

    Which gives us maximum flexibility when dynamically growing/shrinking a
    VM in smaller steps.
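
The virtio-mem layout above can be reproduced the same way. Again this is
a sketch under assumptions, not the kernel code: it assumes a single
dynamic memory group whose unit is one 128 MiB memory block, and it lets
Normal memory hotplugged within the group count toward the kernel side of
the ratio. The function name online_units() is hypothetical.

```python
def online_units(early_kernel_blocks, num_units, ratio_percent):
    """Return a string with one 'M' or 'N' per hotplugged unit."""
    kernel_blocks = early_kernel_blocks
    movable_blocks = 0
    layout = []
    for _ in range(num_units):
        if (movable_blocks + 1) * 100 <= kernel_blocks * ratio_percent:
            movable_blocks += 1
            layout.append("M")
        else:
            # Normal memory inside the same dynamic group allows for more
            # Movable memory later on.
            kernel_blocks += 1
            layout.append("N")
    return "".join(layout)


# 4 GiB VM (32 early blocks), ratio 301%; the first hotplugged block is 48.
layout = online_units(32, 104, 301)
print(layout[:96].count("M"))  # blocks 48-143
print(layout[96:])             # blocks 144-151
```

The first 96 units (12 GiB) are onlined Movable, after which every Normal
unit is followed by three Movable ones.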

    V. Doc Update

    I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
    upstream. Until then, details can be found in patch #2.

    VI. Future Work

     1) Use memory groups for ppc64 dlpar
     2) Being able to specify a portion of (early) kernel memory that will be
        excluded from the ratio. Like "128 MiB globally/per node" are excluded.

        This might be helpful when starting VMs with extremely small memory
        footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
        the first hotplugged units getting onlined to ZONE_MOVABLE. One
        alternative would be a trigger to not consider ZONE_DMA memory
        in the ratio. We'll have to see if this is really required.
     3) Indicate to user space that MOVABLE might be a bad idea -- especially
        relevant when memory ballooning without support for balloon compaction
        is active.

    This patch (of 9):

    For implementing a new memory onlining policy, which determines when to
    online memory blocks to ZONE_MOVABLE semi-automatically, we need the
    number of present early (boot) pages -- present pages excluding hotplugged
    pages.  Let's track these pages per zone.

    Pass a page instead of the zone to adjust_present_page_count(), similar to
    adjust_managed_page_count(), and derive the zone from the page.

    It's worth noting that a memory block to be offlined/onlined is either
    completely "early" or "not early".  add_memory() and friends can only add
    complete memory blocks and we only online/offline complete (individual)
    memory blocks.

    Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Marek Kedzierski <mkedzier@redhat.com>
    Cc: Hui Zhu <teawater@gmail.com>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:07 -05:00
Rafael Aquini e33698dd33 mm: sparse: pass section_nr to find_memory_block
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit fc1f5e980a463325cf41d39ac6a69aa3cca73995
Author: Ohhoon Kwon <ohoono.kwon@samsung.com>
Date:   Thu Sep 2 14:57:01 2021 -0700

    mm: sparse: pass section_nr to find_memory_block

    With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts
    mem_section to section_nr could be costly since it iterates all section
    roots to check if the given mem_section is in its range.

    On the other hand, __nr_to_section() which converts section_nr to
    mem_section can be done in O(1).

    Let's pass section_nr instead of mem_section ptr to find_memory_block() in
    order to reduce needless iterations.
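
The cost asymmetry can be illustrated with a toy two-level section table.
This is a simplified model, not the kernel data structure:
SECTIONS_PER_ROOT is an arbitrary small value, the helper names are
hypothetical, and the reverse lookup uses an identity scan where the
kernel checks a pointer range per root.

```python
SECTIONS_PER_ROOT = 4  # illustrative only


class MemSection:
    """Placeholder for struct mem_section."""


def build_roots(nr_sections):
    nr_roots = -(-nr_sections // SECTIONS_PER_ROOT)  # ceiling division
    return [[MemSection() for _ in range(SECTIONS_PER_ROOT)]
            for _ in range(nr_roots)]


def nr_to_section(roots, section_nr):
    """section_nr -> mem_section: two index operations, O(1)."""
    return roots[section_nr // SECTIONS_PER_ROOT][section_nr % SECTIONS_PER_ROOT]


def section_to_nr(roots, ms):
    """mem_section -> section_nr: has to walk the roots until one matches."""
    for root_nr, root in enumerate(roots):
        for idx, candidate in enumerate(root):
            if candidate is ms:
                return root_nr * SECTIONS_PER_ROOT + idx
    raise ValueError("mem_section not found in any root")


roots = build_roots(10)
ms = nr_to_section(roots, 9)
print(section_to_nr(roots, ms))
```

A caller that already knows the section number avoids the walk entirely,
which is the point of passing section_nr to find_memory_block().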

    Link: https://lkml.kernel.org/r/20210707150212.855-3-ohoono.kwon@samsung.com
    Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Baoquan He <bhe@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:38 -05:00
Greg Kroah-Hartman 68afbd8459 Linux 5.13-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmDGe+4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/IUH/iyHVulAtAhL9bnR
 qL4M1kWfcG1sKS2TzGRZzo6YiUABf89vFP90r4sKxG3AKrb8YkTwmJr8B/sWwcsv
 PpKkXXTobbDfpSrsXGEapBkQOE7h2w739XeXyBLRPkoCR4UrEFn68TV2rLjMLBPS
 /EIZkonXLWzzWalgKDP4wSJ7GaQxi3LMx3dGAvbFArEGZ1mPHNlgWy2VokFY/yBf
 qh1EZ5rugysc78JCpTqfTf3fUPK2idQW5gtHSMbyESrWwJ/3XXL9o1ET3JWURYf1
 b0FgVztzddwgULoIGWLxDH5WWts3l54sjBLj0yrLUlnGKA5FjrZb12g9PdhdywuY
 /8KfjeE=
 =JfJm
 -----END PGP SIGNATURE-----

Merge tag 'v5.13-rc6' into driver-core-next

We need the driver core fix in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-14 09:07:45 +02:00
David Hildenbrand 928130532e drivers/base/memory: fix trying offlining memory blocks with memory holes on aarch64
offline_pages() properly checks for memory holes and bails out.
However, we do a page_zone(pfn_to_page(start_pfn)) before calling
offline_pages() when offlining a memory block.

We should not unconditionally call page_zone(pfn_to_page(start_pfn)) on
aarch64 in offlining code, otherwise we can trigger a BUG when hitting a
memory hole:

   kernel BUG at include/linux/mm.h:1383!
   Internal error: Oops - BUG: 0 [#1] SMP
   Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb nvme i2c_algo_bit mlx5_core i2c_core nvme_core firmware_class
   CPU: 13 PID: 1694 Comm: ranbug Not tainted 5.12.0-next-20210524+ #4
   Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
   pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
   pc : memory_subsys_offline+0x1f8/0x250
   lr : memory_subsys_offline+0x1f8/0x250
   Call trace:
     memory_subsys_offline+0x1f8/0x250
     device_offline+0x154/0x1d8
     online_store+0xa4/0x118
     dev_attr_store+0x44/0x78
     sysfs_kf_write+0xe8/0x138
     kernfs_fop_write_iter+0x26c/0x3d0
     new_sync_write+0x2bc/0x4f8
     vfs_write+0x718/0xc88
     ksys_write+0xf8/0x1e0
     __arm64_sys_write+0x74/0xa8
     invoke_syscall.constprop.0+0x78/0x1e8
     do_el0_svc+0xe4/0x298
     el0_svc+0x20/0x30
     el0_sync_handler+0xb0/0xb8
     el0_sync+0x178/0x180
   Kernel panic - not syncing: Oops - BUG: Fatal exception
   SMP: stopping secondary CPUs
   Kernel Offset: disabled
   CPU features: 0x00000251,20000846
   Memory Limit: none

If nr_vmemmap_pages is set, we know that we are dealing with hotplugged
memory that doesn't have any holes.  So call
page_zone(pfn_to_page(start_pfn)) only when really necessary -- when
nr_vmemmap_pages is set and we actually adjust the present pages.

Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com
Fixes: a08a2ae346 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05 08:58:11 -07:00
Rikard Falkeborn 5a576764e4 drivers/base: Constify static attribute_group structs
These are only used by putting their address in an array of pointers to
const struct attribute_group (either directly or via the
__ATTRIBUTE_GROUP macro). Make them const to allow the compiler to place
them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Link: https://lore.kernel.org/r/20210528213408.20067-1-rikard.falkeborn@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-04 15:06:28 +02:00
Oscar Salvador a08a2ae346 mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section.  Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) an existing memory is consumed for that purpose
    (eg: ~2MB per 128MB memory section on x86_64)
    This can even lead to extreme cases where system goes OOM because
    the physically hotplugged memory depletes the available memory before
    it is onlined.
 b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.
 c) It might be there are no PMD_ALIGNED chunks so memmap array gets
    populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.  That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables.  This implementation uses the beginning of the hotplugged memory
for that purpose.

There are some non-obvious things to consider though.

Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed.  This means that the reserved physical range is not
online although it is used.  The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns.  The current design
expects that this should be OK as the hotplugged memory is considered
garbage until it is onlined.  For example, hibernation wouldn't save the
content of those vmemmaps into the image, so it wouldn't be restored on
resume, but this should be OK as there is no real content to recover anyway
while metadata is reachable from other data structures (e.g.  vmemmap
page tables).

The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.

As per above, the functions that are introduced are:

 - mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
   kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
   fully span.

 - mhp_deinit_memmap_on_memory:
   Offlines as many sections as vmemmap pages fully span, removes the
   range from the zone by remove_pfn_range_from_zone(), and calls
   kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages().  Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
present_pages is done at the end once we know that online_pages()
succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages before calling offline_pages().  This is necessary because
offline_pages() tears down some structures depending on whether the
node or the zone becomes empty.  If offline_pages() fails, we account back
vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().
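
The online/offline ordering described above can be sketched as follows.
This is a control-flow model only, under stated assumptions: the zone is a
plain dict, online_pages/offline_pages are passed in as callables, a
hypothetical HotplugError exception stands in for a non-zero return value,
and the init/deinit helpers are represented by log entries.

```python
class HotplugError(Exception):
    """Stand-in for a non-zero return from online_pages()/offline_pages()."""


def memory_block_online(zone, nr_vmemmap_pages, online_pages, log):
    # Prepare the vmemmap pages before the actual onlining.
    if nr_vmemmap_pages:
        log.append("mhp_init_memmap_on_memory")
    try:
        online_pages()
    except HotplugError:
        # Onlining failed: undo the vmemmap initialization.
        if nr_vmemmap_pages:
            log.append("mhp_deinit_memmap_on_memory")
        raise
    # Account the vmemmap pages only once onlining is known to have succeeded.
    zone["present_pages"] += nr_vmemmap_pages


def memory_block_offline(zone, nr_vmemmap_pages, offline_pages, log):
    # Unaccount first, so offline_pages() sees whether the zone becomes empty.
    zone["present_pages"] -= nr_vmemmap_pages
    try:
        offline_pages()
    except HotplugError:
        # Offlining failed: account the vmemmap pages back.
        zone["present_pages"] += nr_vmemmap_pages
        raise
    if nr_vmemmap_pages:
        log.append("mhp_deinit_memmap_on_memory")


def always_fails():
    raise HotplugError()


zone = {"present_pages": 0}
log = []
memory_block_online(zone, 512, lambda: None, log)
print(zone["present_pages"], log)
```

In both directions a failure leaves present_pages exactly as it was before
the call.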

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.
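
The granularity check in the hot-remove path can be sketched like this.
It is an illustration under assumptions, not the kernel code: memory
blocks are identified by their start address, and the hypothetical
vmemmap_pages_by_block dict maps a block start to its nr_vmemmap_pages.

```python
def can_remove_range(start, size, block_size, vmemmap_pages_by_block):
    """Refuse removal if a memmap-on-memory range spans several blocks."""
    block_starts = range(start, start + size, block_size)
    uses_vmemmap = any(vmemmap_pages_by_block.get(b, 0) for b in block_starts)
    # a) some block has vmemmap pages and b) the range is not exactly one
    # memory block: adding and removing used different granularities.
    if uses_vmemmap and size != block_size:
        return False
    return True


BLOCK = 128 << 20  # assumed 128 MiB memory block size
print(can_remove_range(0, BLOCK, BLOCK, {0: 512}))
print(can_remove_range(0, 2 * BLOCK, BLOCK, {0: 512, BLOCK: 512}))
```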

Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
Oscar Salvador 8736cc2d00 drivers/base/memory: introduce memory_block_{online,offline}
Patch series "Allocate memmap from hotadded memory (per device)", v10.

The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).  The
way we currently populate memmap (struct page array) has three main
drawbacks:

a) it consumes additional memory until the hotadded memory itself is
   onlined,

b) memmap might end up on a different numa node, which is especially
   true for movable_node configuration, and

c) due to fragmentation we might end up populating memmap with base
   pages.

One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself.  SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array.  See patch 4 for more details.  This feature is
only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

[Overall design]:

Implementation wise we reuse vmem_altmap infrastructure to override the
default allocator used by vmemmap_populate.  memory_block structure gains a
new field called nr_vmemmap_pages, which accounts for the number of
vmemmap pages used by that memory_block.  E.g.: on x86_64, that is 512
vmemmap pages on small memory blocks and 4096 on large memory blocks (1GB).
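
The two x86_64 figures can be reproduced with a little arithmetic.  The
sketch below assumes 4 KiB base pages, a 64-byte struct page and a small
memory block of 128 MiB; none of these three values is stated in this
changelog, and the names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE		4096ULL	/* assumed base page size */
#define MODEL_STRUCT_PAGE_SIZE	64ULL	/* assumed sizeof(struct page) */

/* Number of base pages needed to hold the memmap of one memory block. */
static uint64_t model_nr_vmemmap_pages(uint64_t block_bytes)
{
	uint64_t pages_in_block = block_bytes / MODEL_PAGE_SIZE;
	uint64_t memmap_bytes = pages_in_block * MODEL_STRUCT_PAGE_SIZE;

	/* Round up to whole pages. */
	return (memmap_bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

Under these assumptions a 128 MiB block has 32768 pages and needs 2 MiB
of memmap, i.e. 512 pages; a 1 GiB block needs 16 MiB, i.e. 4096 pages.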

We also introduce two new functions: memory_block_{online,offline}.  These
functions take care of initializing/uninitializing vmemmap pages prior to
calling {online,offline}_pages, so the latter functions can remain totally
untouched.

More details can be found in the respective changelogs.

This patch (of 8):

This is a preparatory patch that introduces two new functions:
memory_block_online() and memory_block_offline().

For now, these functions will only call online_pages() and offline_pages()
respectively, but they will later be in charge of preparing the vmemmap
pages, carrying out the initialization and proper accounting of such
pages.

Since memory_block struct contains all the information, pass this struct
down the chain till the end functions.

Link: https://lkml.kernel.org/r/20210421102701.25051-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20210421102701.25051-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
David Hildenbrand e9a2e48e87 drivers/base/memory: don't store phys_device in memory blocks
No need to store the value for each and every memory block, as we can
easily query the value at runtime.  Reshuffle the members to optimize the
memory layout.  Also, let's clarify what the interface once was used for
and why it's legacy nowadays.

"phys_device" was used on s390x in older versions of lsmem[2]/chmem[3],
back when they were still part of s390-tools.  They were later replaced
by the variants in util-linux.  For example, RHEL6 and RHEL7 contain
lsmem/chmem from s390-utils.  RHEL8 switched to versions from util-linux
on s390x [4].

"phys_device" was added with sysfs support for memory hotplug in commit
3947be1969 ("[PATCH] memory hotplug: sysfs and add/remove functions") in
2005.  It always returned 0.

s390x started returning something != 0 on some setups (if sclp.rzm is set
by HW) in 2010 via commit 57b552ba0b ("memory hotplug/s390: set
phys_device").

For s390x, it allowed for identifying which memory block devices belong to
the same storage increment (RZM).  Only if all memory block devices
comprising a single storage increment were offline, the memory could
actually be removed in the hypervisor.

Since commit e5d709bb5f ("s390/memory hotplug: provide
memory_block_size_bytes() function") in 2013 a memory block device spans
at least one storage increment - which is why the interface isn't really
helpful/used anymore (except by old lsmem/chmem tools).

There were once RFC patches to make use of "phys_device" in ACPI context;
however, the underlying problem could be solved using different interfaces
[1].

[1] https://patchwork.kernel.org/patch/2163871/
[2] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/lsmem
[3] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/chmem
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1504134

Link: https://lkml.kernel.org/r/20210201181347.13262-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Tom Rix <trix@redhat.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:00 -08:00
Anshuman Khandual 1adf8b468f mm/memory_hotplug: rename all existing 'memhp' into 'mhp'
This renames all 'memhp' instances to 'mhp' except for memhp_default_state,
which is a kernel command line option.  This is just a clean up and
should not cause a functional change.  Let's make it consistent rather than
mixing the two prefixes, in preparation for more users of the 'mhp'
terminology.

Link: https://lkml.kernel.org/r/1611554093-27316-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:00 -08:00
David Hildenbrand b611719978 mm/memory_hotplug: prepare passing flags to add_memory() and friends
We soon want to pass flags, e.g., to mark added System RAM resources
mergeable.  Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julien Grall <julien@xen.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:18 -07:00
Joe Perches 948b3edba8 drivers core: Miscellaneous changes for sysfs_emit
Change additional instances that could use sysfs_emit and sysfs_emit_at
that the coccinelle script could not convert.

o macros creating show functions with ## concatenation
o unbound sprintf uses with buf+len for start of output to sysfs_emit_at
o returns with ?: tests and sprintf to sysfs_emit
o sysfs output with struct class * not struct device * arguments

Miscellanea:

o remove unnecessary initializations around these changes
o consistently use int len for return length of show functions
o use octal permissions and not S_<FOO>
o rename a few show function names so DEVICE_ATTR_<FOO> can be used
o use DEVICE_ATTR_ADMIN_RO where appropriate
o consistently use const char *output for strings
o checkpatch/style neatening

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/8bc24444fe2049a9b2de6127389b57edfdfe324d.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:12:07 +02:00
Joe Perches 973c39115c drivers core: Remove strcat uses around sysfs_emit and neaten
strcat is no longer necessary for sysfs_emit and sysfs_emit_at uses.

Convert the strcat uses to sysfs_emit calls and neaten other block
uses of direct returns to use an intermediate const char *.

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/5d606519698ce4c8f1203a2b35797d8254c6050a.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:09:10 +02:00
Joe Perches aa838896d8 drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
Convert the various sprintf family calls in sysfs device show functions
to sysfs_emit and sysfs_emit_at for PAGE_SIZE buffer safety.

Done with:

$ spatch -sp-file sysfs_emit_dev.cocci --in-place --max-width=80 .

And cocci script:

$ cat sysfs_emit_dev.cocci
@@
identifier d_show;
identifier dev, attr, buf;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	sprintf(buf,
+	sysfs_emit(buf,
	...);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	snprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	scnprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
expression chr;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	strcpy(buf, chr);
+	sysfs_emit(buf, chr);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	len =
-	sprintf(buf,
+	sysfs_emit(buf,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	len =
-	snprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	len =
-	scnprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
-	len += scnprintf(buf + len, PAGE_SIZE - len,
+	len += sysfs_emit_at(buf, len,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
expression chr;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	...
-	strcpy(buf, chr);
-	return strlen(buf);
+	return sysfs_emit(buf, chr);
}
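
To illustrate what the conversion buys, here is a userspace model of the
bounded-append behaviour of sysfs_emit_at().  It is a sketch under
assumptions, not the kernel implementation: model_emit_at() and
MODEL_PAGE_SIZE are hypothetical names, the page size is assumed to be
4096, and the kernel helper's warning on bad arguments is reduced to
returning 0.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096	/* assumed size of the sysfs output buffer */

/*
 * Format into buf starting at offset `at`, never writing past
 * MODEL_PAGE_SIZE.  Returns the number of characters actually stored,
 * not counting the terminating NUL.
 */
static int model_emit_at(char *buf, int at, const char *fmt, ...)
{
	va_list args;
	size_t room;
	int len;

	if (!buf || at < 0 || at >= MODEL_PAGE_SIZE)
		return 0;

	room = (size_t)(MODEL_PAGE_SIZE - at);

	va_start(args, fmt);
	len = vsnprintf(buf + at, room, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	/* vsnprintf() reports the untruncated length; clamp to what fit. */
	if ((size_t)len >= room)
		len = (int)room - 1;

	return len;
}
```

A show function can then accumulate output with
"len += model_emit_at(buf, len, ...)" without doing the
"buf + len, PAGE_SIZE - len" arithmetic by hand, which is the pattern the
script above rewrites.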

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/3d033c33056d88bbe34d4ddb62afd05ee166ab9a.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:09:10 +02:00
Wei Yang 178bdbed3e drivers/base/memory: rename base_memory_block_id to memory_block_id
memory_block may have a larger granularity than a section, which is why
we have base_section_nr. But base_memory_block_id seems a little
misleading, since there is no larger granularity concept that groups
several memory_blocks.

What we need here is the exact memory_block_id for a section_nr. Let's
rename it to make it more precise.
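
The relationship between the two numbers can be sketched as below.  The
value of MODEL_SECTIONS_PER_BLOCK is a hypothetical example (the real
value depends on architecture and configuration), and the function names
are prefixed with model_ to mark them as illustrations.

```c
/* Hypothetical example value: eight sections per memory block. */
#define MODEL_SECTIONS_PER_BLOCK 8UL

/* The memory block that a given section belongs to. */
static unsigned long model_memory_block_id(unsigned long section_nr)
{
	return section_nr / MODEL_SECTIONS_PER_BLOCK;
}

/* The first section of a given memory block. */
static unsigned long model_base_section_nr(unsigned long block_id)
{
	return block_id * MODEL_SECTIONS_PER_BLOCK;
}
```

Several sections map to one block id, but each section has exactly one
block id, so there is nothing "base" about the id itself.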

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20200623025701.2016-2-richard.weiyang@linux.alibaba.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-10 14:38:44 +02:00
Wei Yang 40ba2cde77 drivers/base/memory: init_memory_block() first parameter is not necessary
The first parameter of init_memory_block() is intended to retrieve the
initialized memory_block. But now, we never use it.

Drop it for now.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20200623025701.2016-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-10 14:38:43 +02:00
Scott Cheloha 4fb6eabf10 drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
Searching for a particular memory block by id is an O(n) operation because
each memory block's underlying device is kept in an unsorted linked list
on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache each memory block
in an xarray.  This time complexity improvement is significant on
systems with many memory blocks.  For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
   change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
   completes ~12ms faster.

Before:
[    0.005042] memory_dev_init: adding memory blocks
[    0.021591] memory_dev_init: added memory blocks
[    0.022699] walk_memory_blocks: walking memory blocks
[    0.038730] walk_memory_blocks: walked memory blocks 0-511

After:
[    0.005057] memory_dev_init: adding memory blocks
[    0.009415] memory_dev_init: added memory blocks
[    0.010519] walk_memory_blocks: walking memory blocks
[    0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
   this change memory_dev_init() completes ~88ms faster and
   walk_memory_blocks() completes ~87ms faster.

Before:
[    0.252246] memory_dev_init: adding memory blocks
[    0.395469] memory_dev_init: added memory blocks
[    0.409413] walk_memory_blocks: walking memory blocks
[    0.433028] walk_memory_blocks: walked memory blocks 0-511
[    0.433094] walk_memory_blocks: walking memory blocks
[    0.500244] walk_memory_blocks: walked memory blocks 131072-131583

After:
[    0.245063] memory_dev_init: adding memory blocks
[    0.299539] memory_dev_init: added memory blocks
[    0.313609] walk_memory_blocks: walking memory blocks
[    0.315287] walk_memory_blocks: walked memory blocks 0-511
[    0.315349] walk_memory_blocks: walking memory blocks
[    0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
   this change we complete memory_dev_init() ~37 minutes faster and
   walk_memory_blocks() at least ~30 minutes faster.  The exact timing
   for walk_memory_blocks() is missing, though I observed that the
   soft lockups in walk_memory_blocks() disappeared with the change,
   suggesting that lower bound.

Before:
[   13.703907] memory_dev_init: adding blocks
[ 2287.406099] memory_dev_init: added all blocks
[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

After:
[   13.696898] memory_dev_init: adding blocks
[   15.660035] memory_dev_init: added all blocks
(the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few
memory blocks.  A sparse xarray has a small footprint and an O(log n)
lookup is negligibly slower than an O(n) lookup for only the smallest
number of memory blocks.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
   change memory_dev_init() completes ~300us faster and walk_memory_blocks()
   completes no faster or slower.  The improvement is pretty close to noise.

Before:
[    0.224752] memory_dev_init: adding memory blocks
[    0.227116] memory_dev_init: added memory blocks
[    0.227183] walk_memory_blocks: walking memory blocks
[    0.227183] walk_memory_blocks: walked memory blocks 0-131

After:
[    0.224911] memory_dev_init: adding memory blocks
[    0.226935] memory_dev_init: added memory blocks
[    0.227089] walk_memory_blocks: walking memory blocks
[    0.227089] walk_memory_blocks: walked memory blocks 0-131
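
Why an indexed lookup scales better than walking a list can be seen in
the following userspace sketch.  It is a plain fixed-depth radix tree
written for illustration, not the kernel xarray and not its API; the
64-slot fanout, the three levels and all names are assumptions chosen so
that the block ids seen in the logs above (up to 131583) fit.

```c
#include <stdlib.h>

#define FANOUT_BITS	6			/* 64 slots per node (assumed) */
#define FANOUT		(1u << FANOUT_BITS)
#define LEVELS		3			/* covers ids below 64^3 = 262144 */

struct node {
	void *slot[FANOUT];
};

/* Slot used at a given level; level 0 holds the stored items. */
static unsigned int slot_index(unsigned long id, int level)
{
	return (id >> (level * FANOUT_BITS)) & (FANOUT - 1);
}

/* Store item under id, allocating interior nodes on demand. */
static int tree_store(struct node *root, unsigned long id, void *item)
{
	struct node *n = root;
	int level;

	if (id >> (LEVELS * FANOUT_BITS))
		return -1;			/* id out of range */

	for (level = LEVELS - 1; level > 0; level--) {
		unsigned int i = slot_index(id, level);

		if (!n->slot[i]) {
			n->slot[i] = calloc(1, sizeof(struct node));
			if (!n->slot[i])
				return -1;
		}
		n = n->slot[i];
	}
	n->slot[slot_index(id, 0)] = item;
	return 0;
}

/* Look up id in at most LEVELS steps, however many items are stored. */
static void *tree_load(const struct node *root, unsigned long id)
{
	const struct node *n = root;
	int level;

	if (id >> (LEVELS * FANOUT_BITS))
		return NULL;

	for (level = LEVELS - 1; level > 0; level--) {
		n = n->slot[slot_index(id, level)];
		if (!n)
			return NULL;
	}
	return n->slot[slot_index(id, 0)];
}
```

A lookup touches three nodes whether 132 or 131072 blocks are stored,
whereas a list walk visits on average half of all entries.  Interior
nodes are only allocated for id ranges that are in use, which keeps the
footprint small for sparse ids.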

[david@redhat.com: document the locking]
  Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@redhat.com
Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rick Lindsley <ricklind@linux.vnet.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:49 -07:00
David Hildenbrand 5f47adf762 mm/memory_hotplug: allow to specify a default online_type
For now, distributions implement advanced udev rules to essentially
- Don't online any hotplugged memory (s390x)
- Online all memory to ZONE_NORMAL (e.g., most virt environments like
  hyperv)
- Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
  care of (e.g., bare metal, special virt environments)

In summary: All memory is usually onlined the same way, however, the
kernel always has to ask user space to come up with the same answer.
E.g., Hyper-V always waits for a memory block to get onlined before
continuing, otherwise it might end up adding memory faster than
onlining it, which can result in strange OOM situations.  This waiting
slows down adding of a bigger amount of memory.

Let's allow to specify a default online_type, not just "online" and
"offline".  This allows distributions to configure the default online_type
when booting up and be done with it.

We can now specify "offline", "online", "online_movable" and
"online_kernel" via
- "memhp_default_state=" on the kernel cmdline
- /sys/devices/system/memory/auto_online_blocks
just like we are able to specify for a single memory block via
/sys/devices/system/memory/memoryX/state

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:41 -07:00
David Hildenbrand 862919e568 mm/memory_hotplug: convert memhp_auto_online to store an online_type
...  and rename it to memhp_default_online_type.  This is a preparation
for more detailed default online behavior.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
David Hildenbrand 4dc8207bfd drivers/base/memory: store mapping between MMOP_* and string in an array
Let's use a simple array which we can reuse soon.  While at it, move the
string->mmop conversion out of the device hotplug lock.
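
A string table of this kind could look like the sketch below.  It is a
userspace illustration: the MODEL_ names are hypothetical, the order of
the enum values is assumed, and a plain comparison that tolerates one
trailing newline stands in for whatever string helper the kernel uses.

```c
#include <stddef.h>
#include <string.h>

/* Assumed ordering; the values double as array indices. */
enum model_mmop {
	MODEL_MMOP_OFFLINE = 0,
	MODEL_MMOP_ONLINE,
	MODEL_MMOP_ONLINE_KERNEL,
	MODEL_MMOP_ONLINE_MOVABLE,
};

static const char *const model_online_type_str[] = {
	[MODEL_MMOP_OFFLINE]		= "offline",
	[MODEL_MMOP_ONLINE]		= "online",
	[MODEL_MMOP_ONLINE_KERNEL]	= "online_kernel",
	[MODEL_MMOP_ONLINE_MOVABLE]	= "online_movable",
};

/* Map a user supplied string to its type, or return -1 if unknown. */
static int model_online_type_from_str(const char *str)
{
	size_t len = strlen(str);
	size_t i;

	/* Input written with "echo" carries a trailing newline. */
	if (len && str[len - 1] == '\n')
		len--;

	for (i = 0; i < sizeof(model_online_type_str) /
			sizeof(model_online_type_str[0]); i++) {
		const char *name = model_online_type_str[i];

		if (strlen(name) == len && memcmp(name, str, len) == 0)
			return (int)i;
	}
	return -1;
}
```

Because the same table serves both directions (type to string by
indexing, string to type by searching), a single definition can be reused
wherever an online type is parsed or printed.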

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
David Hildenbrand efc978ad0e drivers/base/memory: map MMOP_OFFLINE to 0
Historically, we used the value -1.  Just treat 0 as the special case now.
Clarify a comment (which was wrong, when we come via device_online() the
first time, the online_type would have been 0 / MEM_ONLINE).  The default
is now always MMOP_OFFLINE.  This removes the last user of the manual
"-1", which didn't use the enum value.

This is a preparation to use the online_type as an array index.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
David Hildenbrand 956f8b4450 drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE
Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.

Distributions nowadays use udev rules ([1] [2]) to specify if and how to
online hotplugged memory.  The rules seem to get more complex with many
special cases.  Due to the various special cases,
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
is handled via udev rules.

Every time we hotplug memory, the udev rule will come to the same
conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
memory in separate memory blocks and wait for memory to get onlined by
user space before continuing to add more memory blocks (to not add memory
faster than it is getting onlined).  This of course slows down the whole
memory hotplug process.

To make the job of distributions easier and to avoid udev rules that get
more and more complicated, let's extend the mechanism provided by
- /sys/devices/system/memory/auto_online_blocks
- "memhp_default_state=" on the kernel cmdline
to also be able to specify "online_movable" as well as "online_kernel".

=== Example /usr/libexec/config-memhotplug ===

#!/bin/bash

VIRT=`systemd-detect-virt --vm`
ARCH=`uname -p`

sense_virtio_mem() {
  if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
    DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
    if [ $DEVICES != "0" ]; then
        return 0
    fi
  fi
  return 1
}

if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
  echo "Memory hotplug configuration support missing in the kernel"
  exit 1
fi

if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
  echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
  exit 1
fi

if [ $VIRT == "microsoft" ]; then
  echo "Detected Hyper-V on $ARCH"
  # Hyper-V wants all memory in ZONE_NORMAL
  ONLINE_TYPE="online_kernel"
elif sense_virtio_mem; then
  echo "Detected virtio-mem on $ARCH"
  # virtio-mem wants all memory in ZONE_NORMAL
  ONLINE_TYPE="online_kernel"
elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
  echo "Detected $ARCH"
  # standby memory should not be onlined automatically
  ONLINE_TYPE="offline"
elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
  echo "Detected" $ARCH
  # PPC64 onlines all hotplugged memory right from the kernel
  ONLINE_TYPE="offline"
elif [ $VIRT == "none" ]; then
  echo "Detected bare-metal on $ARCH"
  # Bare metal users expect hotplugged memory to be unpluggable. We assume
  # that ZONE imbalances on such enterprise servers cannot happen and is
  # properly documented
  ONLINE_TYPE="online_movable"
else
  # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
  # imbalances won't happen
  echo "Detected $VIRT on $ARCH"
  # Usually, ballooning is used in virtual environments, so memory should go to
  # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
  ONLINE_TYPE="online"
fi

echo "Selected online_type:" $ONLINE_TYPE

# Configure what to do with memory that will be hotplugged in the future
echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
if [ $? != "0" ]; then
  echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
  # A backup udev rule should handle old kernels if necessary
  exit 1
fi

# Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
if [ $ONLINE_TYPE != "offline" ]; then
  for MEMORY in /sys/devices/system/memory/memory*; do
    STATE=`cat $MEMORY/state`
    if [ $STATE == "offline" ]; then
        echo $ONLINE_TYPE > $MEMORY/state
    fi
  done
fi

=== Example /usr/lib/systemd/system/config-memhotplug.service ===

[Unit]
Description=Configure memory hotplug behavior
DefaultDependencies=no
Conflicts=shutdown.target
Before=sysinit.target shutdown.target
After=systemd-modules-load.service
ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks

[Service]
ExecStart=/usr/libexec/config-memhotplug
Type=oneshot
TimeoutSec=0
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target

=== Example modification to the 40-redhat.rules [2] ===

: diff --git a/40-redhat.rules b/40-redhat.rules-new
: index 2c690e5..168fd03 100644
: --- a/40-redhat.rules
: +++ b/40-redhat.rules-new
: @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
:  # Memory hotadd request
:  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
:  ACTION!="add", GOTO="memory_hotplug_end"
: +# memory hotplug behavior configured
: +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
: +
:  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
:
:  ENV{.state}="online"

===

[1] https://github.com/lnykryn/systemd-rhel/pull/281
[2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules

This patch (of 8):

The name is misleading and it's not really clear what is "kept".  Let's
just name it like the online_type name we expose to user space ("online").

Add some documentation to the types.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Yumei Huang <yuhuang@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
David Hildenbrand fada9ae3ed drivers/base/memory.c: drop pages_correctly_probed()
pages_correctly_probed() is a leftover from ancient times.  It dates back
to commit 3947be1969 ("[PATCH] memory hotplug: sysfs and add/remove
functions"), where Pg_reserved checks were added as a safety net:

	/*
	 * The probe routines leave the pages reserved, just
	 * as the bootmem code does.  Make sure they're still
	 * that way.
	 */

The checks were refactored quite a bit over the years, especially in
commit b77eab7079 ("mm/memory_hotplug: optimize probe routine"), where
checks for present, valid, and online sections were added.

Hotplugged memory is added via add_memory(), which will create the full
memmap for the hotplugged memory, and mark all sections valid and present.

Only full memory blocks are onlined/offlined, so we also cannot have an
inconsistency in that regard (especially, memory blocks with some sections
being online and some being offline).

1. Boot memory always starts online.  Since commit c5e79ef561
   ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
   holes") we disallow to offline any memory with holes.  Therefore, we
   never online memory with holes.  Present and validity checks are
   superfluous.

2. Only complete memory blocks are onlined/offlined (and especially,
   the state - online or offline - is stored for whole memory blocks).
   Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
   manually calls offline_pages() and fiddles with memory block states.
   But it also only offlines complete memory blocks.

3. To make any of these conditions trigger, something would have to be
   terribly messed up in the core.  (e.g., online/offline only some
   sections of a memory block).

4. Memory unplug properly makes sure that all sysfs attributes were
   removed (and therefore, that all threads left the sysfs handlers).  We
   don't have to worry about zombie devices at this point.

5. The valid_section_nr(section_nr) check is actually dead code, as it
   would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).

No wonder we haven't seen any of these errors in a long time (or even
ever, according to my search).  Let's just get rid of them.  Now, all
checks that could hinder onlining and offlining are completely
contained in online_pages()/offline_pages().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
David Hildenbrand 68c3a6ac65 drivers/base/memory.c: drop section_count
Patch series "mm: drop superfluous section checks when onlining/offlining".

Let's drop some superfluous section checks on the onlining/offlining path.

This patch (of 3):

Since commit c5e79ef561 ("mm/memory_hotplug.c: don't allow to
online/offline memory blocks with holes") we have a generic check in
offline_pages() that disallows offlining memory blocks with holes.

Memory blocks with missing sections are just another variant of this type
of block.  We can stop checking (and especially storing) present
sections.  A proper error message is now printed why offlining failed.

section_count was initially introduced in commit 0768121597 ("Driver
core: Add section count to memory_block struct") in order to detect when
it is okay to remove a memory block.  It was used in commit 26bbe7ef6d
("drivers/base/memory.c: prohibit offlining of memory blocks with missing
sections") to disallow offlining memory blocks with missing sections.  As
we refactored creation/removal of memory devices and have a proper check
for holes in place, we can drop the section_count.

This also removes a leftover comment regarding the mem_sysfs_mutex, which
was removed in commit 848e19ad3c ("drivers/base/memory.c: drop the
mem_sysfs_mutex").

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
David Hildenbrand 53cdc1cb29 drivers/base/memory.c: indicate all memory blocks as removable
We see multiple issues with the implementation/interface to compute
whether a memory block can be offlined (exposed via
/sys/devices/system/memory/memoryX/removable) and would like to simplify
it (remove the implementation).

1. It runs basically lockless. While this might be good for performance,
   we see possible races with memory offlining that will require at
   least some sort of locking to fix.

2. Nowadays, more false positives are possible. No arch-specific checks
   are performed that validate if memory offlining will not be denied
   right away (and such check will require locking). For example, arm64
   won't allow to offline any memory block that was added during boot -
   which will imply a very high error rate. Other archs have other
   constraints.

3. The interface is inherently racy. E.g., if a memory block is detected
   to be removable (and was not a false positive at that time), there is
   still no guarantee that offlining will actually succeed. So any
   caller already has to deal with false positives.

4. It is unclear which performance benefit this interface actually
   provides. The introducing commit 5c755e9fd8 ("memory-hotplug: add
   sysfs removable attribute for hotplug memory remove") mentioned

	"A user-level agent must be able to identify which sections
	 of memory are likely to be removable before attempting the
	 potentially expensive operation."

   However, no actual performance comparison was included.

Known users:

 - lsmem: Will group memory blocks based on the "removable" property. [1]

 - chmem: Indirect user. It has a RANGE mode where one can specify
          removable ranges identified via lsmem to be offlined. However,
          it also has a "SIZE" mode, which allows a sysadmin to skip the
          manual "identify removable blocks" step. [2]

 - powerpc-utils: Uses the "removable" attribute to skip some memory
          blocks right away when trying to find some to offline+remove.
          However, with ballooning enabled, it already skips this
          information completely (because it once resulted in many false
          negatives). Therefore, the implementation can deal with false
          positives properly already. [3]

According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
driven from userspace via the drmgr command (powerpc-utils).  Nowadays
it's managed in the kernel - including onlining/offlining of memory
blocks - triggered by drmgr writing to /sys/kernel/dlpar.  So the
affected legacy userspace handling is only active on old kernels.  Only
very old versions of drmgr on a new kernel (unlikely) might execute
slower - totally acceptable.

With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
break any user space tool.  We implement a very bad heuristic now.
Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
"not removable" as before.
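
Since "removable" no longer filters anything, a caller can simply attempt to offline a block and handle the failure. A minimal sketch of that pattern, assuming the per-block `state` attribute accepts the string "offline"; the helper names and the `root` parameter are hypothetical, and every write error is treated the same way:

```python
from pathlib import Path

SYSFS_MEMORY = Path("/sys/devices/system/memory")


def try_offline_block(block_dir: Path) -> bool:
    """Attempt to offline one memory block; return True on success."""
    try:
        (block_dir / "state").write_text("offline")
    except OSError:
        # The kernel refused the request (or the attribute is not
        # writable). The caller just moves on to the next block.
        return False
    return True


def offline_first_possible(root: Path = SYSFS_MEMORY):
    """Try blocks in turn without consulting "removable" first.

    Returns the name of the first block that was offlined, or None.
    Blocks are visited in lexicographic order of their directory names.
    """
    for block_dir in sorted(root.glob("memory[0-9]*")):
        if try_offline_block(block_dir):
            return block_dir.name
    return None
```

This needs root privileges on a real system; the `root` parameter only exists so the logic can be exercised against a scratch directory.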

Original discussion can be found in [4] ("[PATCH RFC v1] mm:
is_mem_section_removable() overhaul").

Other users of is_mem_section_removable() will be removed next, so that
we can remove is_mem_section_removable() completely.

[1] http://man7.org/linux/man-pages/man1/lsmem.1.html
[2] http://man7.org/linux/man-pages/man8/chmem.8.html
[3] https://github.com/ibm-power-utilities/powerpc-utils
[4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com

Also, this patch probably fixes a crash reported by Steve.
http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com

Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Nathan Fontenot <ndfont@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Karel Zak <kzak@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:05 -07:00
David Hildenbrand 9291799884 mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone()
The callers are only interested in the actual zone, they don't care about
boundaries.  Return the zone instead to simplify.

Link: http://lkml.kernel.org/r/20200110183308.11849-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:23 +00:00
David Hildenbrand bd5c2344f9 mm/memory_hotplug: pass in nid to online_pages()
Patch series "mm/memory_hotplug: pass in nid to online_pages()".

Simplify onlining code and get rid of find_memory_block().  Pass in the
nid from the memory block we are trying to online directly, instead of
manually looking it up.

This patch (of 2):

No need to look up the memory block, we can directly pass in the nid.

Link: http://lkml.kernel.org/r/20200113113354.6341-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:39 -08:00
David Hildenbrand 3f9903b9ca mm: remove the memory isolate notifier
Luckily, we have no users left, so we can get rid of it.  Cleanup
set_migratetype_isolate() a little bit.

Link: http://lkml.kernel.org/r/20191114131911.11783-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:38 -08:00
David Hildenbrand 848e19ad3c drivers/base/memory.c: drop the mem_sysfs_mutex
The mem_sysfs_mutex isn't really helpful.  Also, it's not really clear
what the mutex protects at all.

The device lists of the memory subsystem are protected separately.  We
don't need that mutex when looking up, creating, or removing
independent devices.  find_memory_block_by_id() will perform locking on
its own and grab a reference of the returned device.

At the time memory_dev_init() is called, we cannot have concurrent
hot(un)plug operations yet - we're still fairly early during boot.  We
don't need any locking.

The creation/removal of memory block devices should be protected on a
higher level - especially using the device hotplug lock to avoid
documented issues (see Documentation/core-api/memory-hotplug.rst) - or
if that is reworked, using similar locking.

Protecting in the context of these functions only doesn't really make
sense.  Especially, if we would have a situation where the same memory
blocks are created/deleted at the same time, something is going horribly
wrong (imagine adding/removing a DIMM at the same time from two
call paths) - after the functions succeeded something else in the
callers would blow up (e.g., create_memory_block_devices() succeeded but
there are no memory block devices anymore).

All relevant call paths (except when adding memory early during boot via
ACPI, which is now documented) hold the device hotplug lock when adding
memory, and when removing memory.  Let's document that instead.

Add a simple safety net to create_memory_block_devices() in case we
would actually remove memory blocks while adding them, so we'll never
dereference a NULL pointer.  Simplify memory_dev_init() now that the
lock is gone.

Link: http://lkml.kernel.org/r/20190925082621.4927-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:04 -08:00
Naoya Horiguchi feec24a613 mm, soft-offline: convert parameter to pfn
Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn.  This discrepancy looks weird and makes
precheck on pfn validity tricky.  So let's align them.

Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:04 -08:00
David Hildenbrand 2c91f8fc6c mm/memory_hotplug: fix try_offline_node()
try_offline_node() is pretty much broken right now:

 - The node span is updated when onlining memory, not when adding it. We
   ignore memory that was never onlined. Bad.

 - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
   trigger a kernel panic. Bad for memory that is offline but also bad
   for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
   first PFN of a section might contain garbage.

 - Sections belonging to mixed nodes are not properly considered.

As memory blocks might belong to multiple nodes, we would have to walk
all pageblocks (or at least subsections) within present sections.
However, we don't have a way to identify whether a memmap that is not
online was initialized (relevant for ZONE_DEVICE).  This makes things
more complicated.

Luckily, we can piggyback on the node span and the nid stored in memory
blocks.  Currently, the node span is grown when calling
move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
removing memory, before calling try_offline_node().  Sysfs links are
created via link_mem_sections(), e.g., during boot or when adding
memory.

If the node still spans memory or if any memory block belongs to the
nid, we don't set the node offline.  As memory blocks that span multiple
nodes cannot get offlined, the nid stored in memory blocks is reliable
enough (for such online memory blocks, the node still spans the memory).
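
The decision described above can be modelled in a few lines. This is only an illustrative sketch of the rule, not the kernel code; the `Node` and `MemoryBlock` classes and the function name `may_offline_node` are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Node:
    nid: int
    spanned_pages: int  # size of the node span; 0 means it spans no memory


@dataclass
class MemoryBlock:
    block_id: int
    nid: int  # node id stored in the memory block


def may_offline_node(node: Node, memory_blocks: list) -> bool:
    """A node may go offline only if it spans no memory and no block uses its nid."""
    if node.spanned_pages:
        # The node still spans (online) memory.
        return False
    # Walk all memory blocks, including ones that were never onlined.
    return not any(block.nid == node.nid for block in memory_blocks)
```

The second check is what covers memory that was added but never onlined, which the node span alone does not account for.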

Introduce for_each_memory_block() to efficiently walk all memory blocks.

Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
when removing ZONE_DEVICE memory to fix similar issues (access of
garbage memmaps) - until we have a reliable way to identify whether
these memmaps were properly initialized.  This implies later, that once
a node had ZONE_DEVICE memory, we won't be able to set a node offline -
which should be acceptable.

Since commit f1dd2cd13c ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") memory that is added is not
associated with a zone/node (memmap not initialized).  The introducing
commit 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
already missed that we could have multiple nodes for a section and that
the zone/node span is updated when onlining pages, not when adding them.

I tested this by hotplugging two DIMMs to a memory-less and cpu-less
NUMA node.  The node is properly onlined when adding the DIMMs.  When
removing the DIMMs, the node is properly offlined.

Masayoshi Mizuma reported:

: Without this patch, memory hotplug fails as panic:
:
:  BUG: kernel NULL pointer dereference, address: 0000000000000000
:  ...
:  Call Trace:
:   remove_memory_block_devices+0x81/0xc0
:   try_remove_memory+0xb4/0x130
:   __remove_memory+0xa/0x20
:   acpi_memory_device_remove+0x84/0x100
:   acpi_bus_trim+0x57/0x90
:   acpi_bus_trim+0x2e/0x90
:   acpi_device_hotplug+0x2b2/0x4d0
:   acpi_hotplug_work_fn+0x1a/0x30
:   process_one_work+0x171/0x380
:   worker_thread+0x49/0x3f0
:   kthread+0xf8/0x130
:   ret_from_fork+0x35/0x40

[david@redhat.com: v3]
  Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
Fixes: 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15 18:34:00 -08:00
David Hildenbrand 641fe2e938 drivers/base/memory.c: don't access uninitialized memmaps in soft_offline_page_store()
Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING.  They should not get touched.

Right now, when trying to soft-offline a PFN that resides on a memory
block that was never onlined, one gets a misleading error with
CONFIG_PAGE_POISONING:

  :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
  [   23.097167] soft offline: 0x150000 page already poisoned

But the actual result depends on the garbage in the memmap.

soft_offline_page() can only work with online pages, it returns -EIO in
case of ZONE_DEVICE.  Make sure to only forward pages that are online
(iow, managed by the buddy) and, therefore, have an initialized memmap.

Add a check against pfn_to_online_page() and similarly return -EIO.
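
For reference, the value written to soft_offline_page in the example above is a physical address, while the log line reports a page frame number. A small worked conversion, assuming 4 KiB pages (`PAGE_SHIFT` of 12 is an assumption here and differs on some configurations):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def phys_to_pfn(addr: int) -> int:
    """Convert a physical byte address to a page frame number."""
    return addr >> PAGE_SHIFT


# Address from the echo command above.
print(hex(phys_to_pfn(5637144576)))  # 0x150000
```

The result matches the PFN 0x150000 in the "page already poisoned" message.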

Link: http://lkml.kernel.org/r/20191010141200.8985-1-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: <stable@vger.kernel.org>	[4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:31 -04:00
David Hildenbrand b6c88d3b9d drivers/base/memory.c: don't store end_section_nr in memory blocks
Each memory block spans the same amount of sections/pages/bytes.  The size
is determined before the first memory block is created.  No need to store
what we can easily calculate - and the calculations even look simpler now.
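
As a rough illustration of that calculation, with all memory blocks the same size the last section of a block follows from its first section. The function names and the sizes used below are illustrative assumptions, not values taken from this patch:

```python
def sections_per_block(block_size_bytes: int, section_size_bytes: int) -> int:
    """Number of sections in one memory block (the same for every block)."""
    if block_size_bytes % section_size_bytes:
        raise ValueError("block size must be a multiple of the section size")
    return block_size_bytes // section_size_bytes


def end_section_nr(start_section_nr: int, per_block: int) -> int:
    """Last section number covered by a block starting at start_section_nr."""
    return start_section_nr + per_block - 1


# Hypothetical sizes: 2 GiB memory blocks made of 128 MiB sections.
per_block = sections_per_block(2 << 30, 128 << 20)
print(per_block, end_section_nr(32, per_block))  # 16 47
```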

Michal brought up the idea of variable-sized memory blocks.  However, if
we ever implement something like this, we will need an API compatibility
switch and reworks at various places (most code assumes a fixed memory
block size).  So let's clean up what we have right now.

While at it, fix the variable naming in register_mem_sect_under_node() -
we no longer talk about a single section.

Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00