177 Commits

Author SHA1 Message Date
John W. Linville 6a09fa9f20 cxl/region: Deal with numa nodes not enumerated by SRAT
JIRA: https://issues.redhat.com/browse/RHEL-54609

For the numa nodes that are not created by SRAT, no memory_target is
allocated and is not managed by the HMAT_REPORTING code. Therefore
hmat_callback() memory hotplug notifier will exit early on those NUMA
nodes. The CXL memory hotplug notifier will need to call
node_set_perf_attrs() directly in order to set up the access sysfs
attributes.

In acpi_numa_init(), the last proximity domain (pxm) id created by SRAT is
stored. Add a helper function acpi_node_backed_by_real_pxm() in order to
check if a NUMA node id is defined by SRAT or created by CFMWS.

node_set_perf_attrs() symbol is exported to allow update of perf attribs
for a node. The sysfs path of
/sys/devices/system/node/nodeX/access0/initiators/* is created by
node_set_perf_attrs() for the various attributes where nodeX is matched
to the NUMA node of the CXL region.

Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20240308220055.2172956-13-dave.jiang@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit debdce20c4f28b7e5aa48512e7abf270a00e9051)
Signed-off-by: John W. Linville <linville@redhat.com>
2024-10-07 13:55:15 -04:00
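The check described in this commit can be pictured with a small model. This is only a sketch in Python, not the kernel code: it assumes the helper compares a node's proximity domain against the last proximity domain created by SRAT, and the `node_to_pxm` mapping, the function names, and the numbers are made up for illustration.

```python
# Toy model of the SRAT-vs-CFMWS distinction described in the commit message.
# Assumption: a node is "backed by a real pxm" when its proximity domain is
# not larger than the last proximity domain that SRAT created.

def node_backed_by_real_pxm(nid, node_to_pxm, last_real_pxm):
    """Return True if NUMA node `nid` maps to a proximity domain from SRAT."""
    pxm = node_to_pxm[nid]
    return pxm <= last_real_pxm


def needs_direct_perf_attrs(nid, node_to_pxm, last_real_pxm):
    """Nodes without an SRAT entry have no memory_target, so the HMAT
    notifier skips them and the CXL notifier has to set the attributes."""
    return not node_backed_by_real_pxm(nid, node_to_pxm, last_real_pxm)


# Hypothetical platform: SRAT describes pxm 0 and 1, CFMWS adds pxm 2.
node_to_pxm = {0: 0, 1: 1, 2: 2}
last_real_pxm = 1

for nid in sorted(node_to_pxm):
    print(nid, needs_direct_perf_attrs(nid, node_to_pxm, last_real_pxm))
```

In this model only node 2 would take the direct node_set_perf_attrs() path; nodes 0 and 1 would be left to the HMAT notifier.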
Radomir Vrbovsky cf2a246e31 Merge: Update ACPI to match v6.10
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5014

JIRA: https://issues.redhat.com/browse/RHEL-54149
Omitted-fix: 7cc06e729460a209b84d3db4db56c9f85f048cc2
Omitted-fix: e075c3b13a0a142dcd3151b25d29a24f31b7b640
    both of these fixes to drivers/platform/x86 required multiple additional patches before they would apply semi-cleanly, and I did not want to take the additional commits into an already long series
Omitted-fix: 4ce8f4c2027db46299b450b28e9e116aaf00a757
Omitted-fix: d80b83c911ca9b8d35213bf62e9cf336c78c5d24
        These are both the same commit under two different hashes. They apply to drivers/platform/x86/x86-android-tablets.c, which is not present in centos-9.
Omitted-fix: 0e6b6dedf16800df0ff73ffe2bb5066514db29c2
        This commit was reverted two months later by 779bac9994452f6a894524f70c00cfb0cd4b6364. Simplify the history by not including either of them.

Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-09-18 14:45:36 +02:00
David Arcari 13fdf39824 base/node.c: initialize the accessor list before registering
JIRA: https://issues.redhat.com/browse/RHEL-43147

commit 48b5928e18dc27e05cab3dc4c78cd8a15baaf1e5
Author: Gregory Price <gourry.memverge@gmail.com>
Date:   Mon Oct 30 00:42:39 2023 -0400

    base/node.c: initialize the accessor list before registering

    The current code registers the node as available in the node array
    before initializing the accessor list.  This makes it so that
    anything which might access the accessor list as a result of
    allocations will cause an undefined memory access.

    In one example, an extension to access hmat data during interleave
    caused this undefined access as a result of a bulk allocation
    that occurs during node initialization but before the accessor
    list is initialized.

    Initialize the accessor list before making the node generally
    available to the global system.

    Fixes: 08d9dbe72b ("node: Link memory nodes to their compute nodes")
    Signed-off-by: Gregory Price <gregory.price@memverge.com>
    Link: https://lore.kernel.org/r/20231030044239.971756-1-gregory.price@memverge.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: David Arcari <darcari@redhat.com>
2024-08-29 08:19:45 -04:00
Mark Langsdorf a96f446081 base/node / ACPI: Enumerate node access class for 'struct access_coordinate'
JIRA: https://issues.redhat.com/browse/RHEL-54149

commit 11270e526276ffad4c4237acb393da82a3287487
Author: Dave Jiang <dave.jiang@intel.com>
Date: Tue, 12 Mar 2024 12:34:11 +0000

Both generic node and HMAT handling code have been using magic numbers to
indicate access classes for 'struct access_coordinate'. Introduce enums to
enumerate the access0 and access1 classes shared by the two subsystems.
Update the function parameters and callers as appropriate to utilize the
new enum.

Access0 is named to ACCESS_COORDINATE_LOCAL in order to indicate that the
access class is for 'struct access_coordinate' between a target node and
the nearest initiator node.

Access1 is named to ACCESS_COORDINATE_CPU in order to indicate that the
access class is for 'struct access_coordinate' between a target node and
the nearest CPU node.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20240308220055.2172956-3-dave.jiang@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-08-22 11:22:11 -04:00
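A userspace-side sketch of the same idea, replacing the magic numbers 0 and 1 with named classes. The enum names follow the commit message; the numeric values are assumed from the access0/access1 naming, and `initiators_dir` and its `root` parameter are hypothetical helpers, not part of any kernel or library API.

```python
from enum import IntEnum


class AccessCoordinateClass(IntEnum):
    # Names taken from the commit message; values assumed to match the
    # access0 / access1 sysfs directory suffixes.
    ACCESS_COORDINATE_LOCAL = 0  # target node <-> nearest initiator node
    ACCESS_COORDINATE_CPU = 1    # target node <-> nearest CPU node


def initiators_dir(node, access_class, root="/sys/devices/system/node"):
    """Build the sysfs directory that holds the attributes for one class."""
    return f"{root}/node{node}/access{int(access_class)}/initiators"


print(initiators_dir(0, AccessCoordinateClass.ACCESS_COORDINATE_LOCAL))
```

Callers then pass a named class instead of a bare integer, which is the readability gain the commit is after.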
Mark Langsdorf 38fbd9c69d base/node / acpi: Change 'node_hmem_attrs' to 'access_coordinates'
JIRA: https://issues.redhat.com/browse/RHEL-54149

commit 6a954e94d038f41d79c4e04348c95774d1c9337d
Author: Dave Jiang <dave.jiang@intel.com>
Date: Fri, 22 Dec 2023 14:23:13 +0000

Dan Williams suggested changing the struct 'node_hmem_attrs' to
'access_coordinates' [1]. The struct is a container of r/w-latency and
r/w-bandwidth numbers. Moving forward, this container will also be used by
CXL to store the performance characteristics of each link hop in
the PCIE/CXL topology. So, where node_hmem_attrs is just the access
parameters of a memory-node, access_coordinates applies more broadly
to hardware topology characteristics. The observation is that seemed like
an exercise in having the application identify "where" it falls on a
spectrum of bandwidth and latency needs. For the tuple of
read/write-latency and read/write-bandwidth, "coordinates" is not a perfect
fit. Sometimes it is just conveying values in isolation and not a
"location" relative to other performance points, but in the end this data
is used to identify the performance operation point of a given memory-node.
[2]

Link: http://lore.kernel.org/r/64471313421f7_1b66294d5@dwillia2-xfh.jf.intel.com.notmuch/
Link: https://lore.kernel.org/linux-cxl/645e6215ee0de_1e6f2945e@dwillia2-xfh.jf.intel.com.notmuch/
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/170319615734.2212653.15319394025985499185.stgit@djiang5-mobl3
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-08-22 11:21:54 -04:00
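The "container of r/w-latency and r/w-bandwidth numbers" can be mirrored in a few lines of Python. This is a sketch under assumptions: the four field names are assumed to match the attribute files found in an `initiators` sysfs directory, and `coordinate_from_attrs` and `load_coordinate` are hypothetical helpers written for this illustration.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class AccessCoordinate:
    # Assumed field names: one value per read/write bandwidth and latency.
    read_bandwidth: int = 0
    write_bandwidth: int = 0
    read_latency: int = 0
    write_latency: int = 0


def coordinate_from_attrs(attrs):
    """Build a coordinate from {file name: file contents}; missing names stay 0."""
    values = {
        f.name: int(attrs[f.name].strip())
        for f in fields(AccessCoordinate)
        if f.name in attrs
    }
    return AccessCoordinate(**values)


def load_coordinate(initiators_dir):
    """Read whichever of the four attribute files exist in the directory."""
    base = Path(initiators_dir)
    attrs = {}
    for f in fields(AccessCoordinate):
        path = base / f.name
        if path.is_file():
            attrs[f.name] = path.read_text()
    return coordinate_from_attrs(attrs)
```

Keeping the parsing in `coordinate_from_attrs` makes it possible to exercise the logic on plain dictionaries, without a machine that exposes the attributes.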
Mark Langsdorf db140b75f8 mm,thp: fix nodeN/meminfo output alignment
JIRA: https://issues.redhat.com/browse/RHEL-26183

commit 4b5b7850c9282f9c7e646ec140b84b2d2f0aeeb8
Author: Hugh Dickins <hughd@google.com>
Date: Mon, 21 Aug 2023 13:38:01 +0000

Add one more space to FileHugePages and FilePmdMapped, so the output is
aligned with other rows in /sys/devices/system/node/nodeN/meminfo.

Link: https://lkml.kernel.org/r/be861b50-a790-e041-bcb0-2a987dcfd1a@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-05-31 15:37:43 -04:00
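Because this commit only changes whitespace, a reader of nodeN/meminfo that matches on fields rather than on column positions should be unaffected. A minimal sketch of such a parser follows; the sample text is made up, and the `Node <id> <Key>: <value> [kB]` line shape is an assumption about the file format.

```python
import re

# "Node 0 FileHugePages:     2048 kB" -> key "FileHugePages", value 2048.
# The trailing "kB" is optional because some counters have no unit.
_LINE = re.compile(r"^Node\s+(\d+)\s+(\S+):\s+(\d+)(?:\s+kB)?\s*$")


def parse_node_meminfo(text):
    """Return {key: integer value} for every line that matches the pattern."""
    result = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            result[match.group(2)] = int(match.group(3))
    return result


sample = (
    "Node 0 MemTotal:       16384 kB\n"
    "Node 0 FileHugePages:      2048 kB\n"
    "Node 0 FilePmdMapped:        0 kB\n"
    "Node 0 HugePages_Total:     4\n"
)
print(parse_node_meminfo(sample))
```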
Mark Langsdorf 8820b56ce0 base/node: Remove duplicated include
JIRA: https://issues.redhat.com/browse/RHEL-26183

commit 7f0718eda1b3c85ba7874f32ce90cfb156f5967a
Author: GUO Zihua <guozihua@huawei.com>
Date: Sat, 12 Aug 2023 13:00:56 +0000

Remove duplicated include of linux/hugetlb.h. Resolves checkincludes
message.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Link: https://lore.kernel.org/r/20230810120008.25297-1-guozihua@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-05-31 15:37:43 -04:00
Mark Langsdorf 573f26235a base/node: Use 'property' to identify an access parameter
JIRA: https://issues.redhat.com/browse/RHEL-26183

commit 7810f4dc879500b413bafab18ff870a68f38329a
Author: Dave Jiang <dave.jiang@intel.com>
Date: Wed, 31 May 2023 20:26:00 +0000

Usage of 'attr' and 'name' in the context of a sysfs attribute
definition are confusing because those read as being related to:

	struct attribute .name

Rename 'name' to 'property' in preparation for renaming 'struct
node_hmem_attr' to a more generic name that can be used in more contexts
('struct access_coordinate'), and not be confused with 'struct
attribute'.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/168332213518.2189163.18377767521423011290.stgit@djiang5-mobl3
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-05-31 15:37:40 -04:00
Aristeu Rozanski 274a3f2005 mm: memory-failure: add memory failure stats to sysfs
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 44b8f8bf2438bfee3aceae4d647a7460213ff340
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Fri Jan 20 03:46:20 2023 +0000

    mm: memory-failure: add memory failure stats to sysfs

    Patch series "Introduce per NUMA node memory error statistics", v2.

    Background
    ==========

    In the RFC for Kernel Support of Memory Error Detection [1], one advantage
    of software-based scanning over hardware patrol scrubber is the ability to
    make statistics visible to system administrators.  The statistics include
    2 categories:

    * Memory error statistics, for example, how many memory errors are
      encountered, how many of them are recovered by the kernel.  Note these
      memory errors are non-fatal to kernel: during the machine check
      exception (MCE) handling kernel already classified MCE's severity to be
      unnecessary to panic (but either action required or optional).

    * Scanner statistics, for example how many times the scanner has fully
      scanned a NUMA node, how many errors are first detected by the scanner.

    The memory error statistics are useful to userspace and actually not
    specific to scanner detected memory errors, and are the focus of this
    patchset.

    Motivation
    ==========

    Memory error stats are important to userspace but insufficient in kernel
    today.  Datacenter administrators can better monitor a machine's memory
    health with the visible stats.  For example, while memory errors are
    inevitable on servers with 10+ TB memory, starting server maintenance when
    there are only 1~2 recovered memory errors could be overreacting; in cloud
    production environment maintenance usually means live migrate all the
    workload running on the server and this usually causes nontrivial
    disruption to the customer.  Providing insight into the scope of memory
    errors on a system helps to determine the appropriate follow-up action.
    In addition, the kernel's existing memory error stats need to be
    standardized so that userspace can reliably count on their usefulness.

    Today kernel provides following memory error info to userspace, but they
    are not sufficient or have disadvantages:
    * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
      not per NUMA node stats though
    * ras:memory_failure_event: only available after explicitly enabled
    * /dev/mcelog provides many useful info about the MCEs, but doesn't
      capture how memory_failure recovered memory MCEs
    * kernel logs: userspace needs to process log text

    Exposing memory error stats is also a good start for the in-kernel memory
    error detector.  Today the data source of memory error stats are either
    direct memory error consumption, or hardware patrol scrubber detection
    (either signaled as UCNA or SRAO).  Once in-kernel memory scanner is
    implemented, it will be the main source as it is usually configured to
    scan memory DIMMs constantly and faster than hardware patrol scrubber.

    How Implemented
    ===============

    As Naoya pointed out [2], exposing memory error statistics to userspace is
    useful independent of software or hardware scanner.  Therefore we
    implement the memory error statistics independent of the in-kernel memory
    error detector.  It exposes the following per NUMA node memory error
    counters:

      /sys/devices/system/node/node${X}/memory_failure/total
      /sys/devices/system/node/node${X}/memory_failure/recovered
      /sys/devices/system/node/node${X}/memory_failure/ignored
      /sys/devices/system/node/node${X}/memory_failure/failed
      /sys/devices/system/node/node${X}/memory_failure/delayed

    These counters describe how many raw pages are poisoned and after the
    attempted recoveries by the kernel, their resolutions: how many are
    recovered, ignored, failed, or delayed respectively.  This approach can be
    easier to extend for future use cases than /proc/meminfo, trace event, and
    log.  The following math holds for the statistics:

    * total = recovered + ignored + failed + delayed

    These memory error stats are reset during machine boot.

    The 1st commit introduces these sysfs entries.  The 2nd commit populates
    memory error stats every time memory_failure attempts memory error
    recovery.  The 3rd commit adds documentation for the introduced stats.

    [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
    [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6

    This patch (of 3):

    Today kernel provides following memory error info to userspace, but each
    has its own disadvantage

    * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
      not per NUMA node stats though

    * ras:memory_failure_event: only available after explicitly enabled

    * /dev/mcelog provides many useful info about the MCEs, but
      doesn't capture how memory_failure recovered memory MCEs

    * kernel logs: userspace needs to process log text

    Exposes per NUMA node memory error stats as sysfs entries:

      /sys/devices/system/node/node${X}/memory_failure/total
      /sys/devices/system/node/node${X}/memory_failure/recovered
      /sys/devices/system/node/node${X}/memory_failure/ignored
      /sys/devices/system/node/node${X}/memory_failure/failed
      /sys/devices/system/node/node${X}/memory_failure/delayed

    These counters describe how many raw pages are poisoned and after the
    attempted recoveries by the kernel, their resolutions: how many are
    recovered, ignored, failed, or delayed respectively.  The following math
    holds for the statistics:

    * total = recovered + ignored + failed + delayed

    Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
    Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:11 -04:00
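A monitoring script could read the five counters named in this commit and check the stated relation `total = recovered + ignored + failed + delayed`. The sketch below uses the sysfs paths from the commit message; the function names and the `root` parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path

# Counter file names as listed in the commit message.
COUNTERS = ("total", "recovered", "ignored", "failed", "delayed")


def read_memory_failure(node, root="/sys/devices/system/node"):
    """Read all five per-node counters into a dict of integers."""
    base = Path(root) / f"node{node}" / "memory_failure"
    return {name: int((base / name).read_text()) for name in COUNTERS}


def is_consistent(stats):
    """Check total == recovered + ignored + failed + delayed."""
    return stats["total"] == sum(stats[name] for name in COUNTERS[1:])


# Made-up snapshot, so the check can run on a machine without these files.
snapshot = {"total": 3, "recovered": 1, "ignored": 0, "failed": 1, "delayed": 1}
print(is_consistent(snapshot))
```

Since each file is read separately, a snapshot taken while an error is being handled may not add up for a moment, so a failed check is a reason to re-read rather than to alarm.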
Paolo Bonzini 13262962e2 mm: Add support for unaccepted memory
JIRA: https://issues.redhat.com/browse/RHEL-10059

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond the point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time kernel steps
    onto a new memory block. The spikes will go away once workload data
    set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    low boot time.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
(cherry picked from commit dcdfdd40fa82b6704d2841938e5c8ec3051eb0d6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[RHEL: upstream has mm/mm_init.c split out of mm/page_alloc.c]
2023-10-30 09:14:17 +01:00
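The on-demand option (#2) and the two helpers can be illustrated with a toy model. This sketch is not the EFI or page allocator code: the class, the flat list of flags, and the 2 MiB unit size are assumptions chosen for the example; only the two method names come from the commit message.

```python
class UnacceptedMemory:
    """Toy tracker: one flag per fixed-size unit of physical address space."""

    def __init__(self, size, unit=2 * 1024 * 1024):
        self.unit = unit  # hypothetical acceptance granularity
        nunits = -(-size // unit)  # ceiling division
        self.unaccepted = [True] * nunits

    def _units(self, start, end):
        # Units touched by the half-open byte range [start, end).
        last = min(-(-end // self.unit), len(self.unaccepted))
        return range(start // self.unit, last)

    def range_contains_unaccepted_memory(self, start, end):
        return any(self.unaccepted[i] for i in self._units(start, end))

    def accept_memory(self, start, end):
        """Accept every unit in the range; return how many were newly accepted."""
        accepted = 0
        for i in self._units(start, end):
            if self.unaccepted[i]:
                self.unaccepted[i] = False
                accepted += 1
        return accepted


def allocate(mem, start, end):
    """On-demand policy: accept only when the range is first used."""
    if mem.range_contains_unaccepted_memory(start, end):
        mem.accept_memory(start, end)
    return (start, end)
```

The latency spike mentioned under option #2 corresponds to the `accept_memory()` call inside `allocate()`: it happens once per unit and disappears once the working set has been touched.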
Nico Pache b319f1136b mm: hugetlb: eliminate memory-less nodes handling
Conflicts:
       mm/hugetlb.c: add only the hugetlb_sysfs_init in conflict

commit a4a00b451ef5e1deb959088e25e248f4ee399792
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Wed Sep 14 15:26:03 2022 +0800

    mm: hugetlb: eliminate memory-less nodes handling

    The memory-notify-based approach aims to handle memory-less nodes, however,
    it just adds the complexity of code as pointed out by David in thread [1].
    The handling of memory-less nodes is introduced by commit 4faf8d950e
    ("hugetlb: handle memory hot-plug events").  From its commit message, we
    cannot find any necessity of handling this case.  So, we can simply
    register/unregister sysfs entries in register_node/unregister_node to
    simplify the code.

    BTW, the hotplug callback was added because in
    hugetlb_register_all_nodes() we register sysfs nodes only for N_MEMORY
    nodes; see commit 9b5e5d0fdc, which said it was a preparation for
    handling memory-less nodes via memory hotplug.  Since we want to remove
    the memory hotplug handling, make sure we only register per-node sysfs
    for online (N_ONLINE) nodes in hugetlb_register_all_nodes().

    https://lore.kernel.org/linux-mm/60933ffc-b850-976c-78a0-0ee6e0ea9ef0@redhat.com/ [1]
    Link: https://lkml.kernel.org/r/20220914072603.60293-3-songmuchun@bytedance.com
    Suggested-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
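From userspace, the effect of registering per-node hugetlb sysfs entries for all online nodes can be checked by expanding the list of online nodes and looking for a hugepages directory under each. The sketch below assumes the usual kernel list format ("0-3,8") and the `nodeN/hugepages` directory name; neither is spelled out in the commit message, and both helper names are hypothetical.

```python
def parse_node_list(text):
    """Expand a kernel list string such as "0-3,8" into [0, 1, 2, 3, 8]."""
    nodes = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes


def hugepages_dirs(online_text, root="/sys/devices/system/node"):
    """Directories expected to exist, one per online node."""
    return [f"{root}/node{n}/hugepages" for n in parse_node_list(online_text)]


print(hugepages_dirs("0-1,4\n"))
```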
Nico Pache e73467a7bd mm: hugetlb: simplify per-node sysfs creation and removal
commit b958d4d08fbfe938af24ea06ebbf839b48fa18a9
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Wed Sep 14 15:26:02 2022 +0800

    mm: hugetlb: simplify per-node sysfs creation and removal

    Patch series "simplify handling of per-node sysfs creation and removal",
    v4.

    This patch (of 2):

    The following commit offloads per-node sysfs creation and removal to a
    kworker and did not say why it is needed.  It also said "I don't know
    that this is absolutely required", so it seems the author was not sure
    either.  Since it only complicates the code, this patch reverts the
    changes to simplify the code.

      39da08cb07 ("hugetlb: offload per node attribute registrations")

    We could use memory hotplug notifier to do per-node sysfs creation and
    removal instead of inserting those operations to node registration and
    unregistration.  Then, it can reduce the code coupling between node.c and
    hugetlb.c.  Also, it can simplify the code.

    Link: https://lkml.kernel.org/r/20220914072603.60293-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20220914072603.60293-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Eric Auger 431fb0900c mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175143

We keep track of several kernel memory stats (total kernel memory, page
tables, stack, vmalloc, etc) on multiple levels (global, per-node,
per-memcg, etc). These stats give insights to users to how much memory
is used by the kernel and for what purposes.

Currently, memory used by KVM mmu is not accounted in any of those
kernel memory stats. This patch series accounts the memory pages
used by KVM for page tables in those stats in a new
NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account
for other types of secondary pages tables (e.g. iommu page tables).

KVM has a decent number of large allocations that aren't for page
tables, but for most of them, the number/size of those allocations
scales linearly with either the number of vCPUs or the amount of memory
assigned to the VM. KVM's secondary page table allocations do not scale
linearly, especially when nested virtualization is in use.

From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's
per-VM pages_{4k,2m,1g} stats unless the guest is doing something
bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is
forced to allocate a large number of page tables even though the guest
isn't accessing that much memory). However, someone would need to either
understand how KVM works to make that connection, or know (or be told) to
go look at KVM's stats if they're running VMs to better decipher the stats.

Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE
is informative. For example, when backing a VM with THP vs. HugeTLB,
NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order
of magnitude higher with THP. So having this stat will at the very least
prove to be useful for understanding tradeoffs between VM backing types,
and likely even steer folks towards potential optimizations.

The original discussion with more details about the rationale:
https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org

This stat will be used by subsequent patches to count KVM mmu
memory usage.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit ebc97a52b5d6cd5fb0c15a3fc9cdd6eb924646a1)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-05-04 18:25:10 +02:00
Nico Pache a3c9522fba drivers/base/node.c: fix compaction sysfs file leak
commit da63dc84befaa9e6079a0bc363ff0eaa975f9073
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:06 2022 -0700

    drivers/base/node.c: fix compaction sysfs file leak

    Compaction sysfs file is created via compaction_register_node in
    register_node.  But we forgot to remove it in unregister_node.  Thus
    compaction sysfs file is leaked.  Using compaction_unregister_node to fix
    this issue.

    Link: https://lkml.kernel.org/r/20220401070905.43679-1-linmiaohe@huawei.com
    Fixes: ed4a6d7f06 ("mm: compaction: add /sys trigger for per-node memory compaction")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:37 -07:00
Phil Auld 487bc5be95 drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist
Bugzilla: https://bugzilla.redhat.com/2107713
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Tested: Ran simple reproducer program. Also ls -l is sufficient to show the reported
size

commit 7ee951acd31a88f941fd6535fbdee3a1567f1d63
Author: Phil Auld <pauld@redhat.com>
Date:   Fri Jul 15 09:49:24 2022 -0400

    drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist

    Using bin_attributes with a 0 size causes fstat and friends to return that
    0 size. This breaks userspace code that retrieves the size before reading
    the file. Rather than reverting 75bd50fa841 ("drivers/base/node.c: use
    bin_attribute to break the size limitation of cpumap ABI") let's put in a
    size value at compile time.

    For cpulist the maximum size is on the order of
            NR_CPUS * (ceil(log10(NR_CPUS)) + 1)/2

    which for 8192 is 20480 (8192 * 5)/2. In order to get near that you'd need
    a system with every other CPU on one node. For example: (0,2,4,6, ... ).
    To simplify the math and support larger NR_CPUS in the future we are using
    (NR_CPUS * 7)/2. We also set it to a min of PAGE_SIZE to retain the older
    behavior for smaller NR_CPUS.

    For the cpumap file the size works out to be NR_CPUS/4 + NR_CPUS/32 - 1
    (or NR_CPUS * 9/32 - 1) including the ","s.

    Add a set of macros for these values to cpumask.h so they can be used in
    multiple places. Apply these to the handful of such files in
    drivers/base/topology.c as well as node.c.

    As an example, on an 80 cpu 4-node system (NR_CPUS == 8192):

    before:

    -r--r--r--. 1 root root 0 Jul 12 14:08 system/node/node0/cpulist
    -r--r--r--. 1 root root 0 Jul 11 17:25 system/node/node0/cpumap

    after:

    -r--r--r--. 1 root root 28672 Jul 13 11:32 system/node/node0/cpulist
    -r--r--r--. 1 root root  4096 Jul 13 11:31 system/node/node0/cpumap

    CONFIG_NR_CPUS = 16384
    -r--r--r--. 1 root root 57344 Jul 13 14:03 system/node/node0/cpulist
    -r--r--r--. 1 root root  4607 Jul 13 14:02 system/node/node0/cpumap

    The actual number of cpus doesn't matter for the reported size since they
    are based on NR_CPUS.

    Fixes: 75bd50fa841d ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI")
    Fixes: bb9ec13d156e ("topology: use bin_attribute to break the size limitation of cpumap ABI")
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Yury Norov <yury.norov@gmail.com>
    Cc: stable@vger.kernel.org
    Acked-by: Yury Norov <yury.norov@gmail.com> (for include/linux/cpumask.h)
    Signed-off-by: Phil Auld <pauld@redhat.com>
    Link: https://lore.kernel.org/r/20220715134924.3466194-1-pauld@redhat.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-07-19 10:43:10 -04:00
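The size arithmetic in this commit can be reproduced with a few lines of Python. This is a sketch, not the cpumask.h macros: the function names are hypothetical, a 4096-byte PAGE_SIZE is assumed, and the PAGE_SIZE floor for cpumap is inferred from the 4096-byte size in the "after" listing rather than stated in the text.

```python
import math

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def cpulist_worst_case(nr_cpus):
    """Estimate from the message: NR_CPUS * (ceil(log10(NR_CPUS)) + 1) / 2."""
    return nr_cpus * (math.ceil(math.log10(nr_cpus)) + 1) // 2


def cpulist_file_max_bytes(nr_cpus, page_size=PAGE_SIZE):
    """Simplified bound (NR_CPUS * 7) / 2, never below PAGE_SIZE."""
    return max(nr_cpus * 7 // 2, page_size)


def cpumap_file_max_bytes(nr_cpus, page_size=PAGE_SIZE):
    """NR_CPUS * 9/32 - 1, with an assumed PAGE_SIZE floor for small NR_CPUS."""
    size = nr_cpus * 9 // 32
    return size - 1 if size > page_size else page_size


for nr_cpus in (8192, 16384):
    print(nr_cpus, cpulist_file_max_bytes(nr_cpus), cpumap_file_max_bytes(nr_cpus))
```

Under these assumptions the results match the sizes in the listing: 28672 and 4096 bytes for NR_CPUS = 8192, and 57344 and 4607 bytes for NR_CPUS = 16384.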
Patrick Talbert ba60709985 Merge: Update drivers/base to v5.18
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/944

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067252
Omitted-fix: 68af28426b3ca1bf9ba21c7d8bdd0ff639e5134c
Requires substantial amd-pmc backport commits that are not in this MR or the centos-9 codebase yet.
Omitted-fix: b1f66033cd4e9ce8cbe2c74c98d4e04c0b2d7b40
Requires substantial amd-pmc backport commits that are not in this MR or the centos-9 codebase yet.
Omitted-fix: 753ee989f7cf0c0a76a7f56956827a8863a60f97
Fix for the previous omitted fix.
Omitted-fix: 32370191c0851da069d242f581cbe2fdb80040cb
Same fix as "platform/x86: amd-pmc: Set QOS during suspend on CZN w/ timer wakeup" but with a different commit hash

Update drivers/base subsystem to match Linux v5.18, including v5.17, v5.16, and v5.15. In general, all the commits applied cleanly, as would be expected with applying those commits to a v5.14 base.

A bunch of memory hotplug and memblock commits tangentially touched drivers/base in v5.17 or v5.18, but David Hildenbrand wrote them and can backport them better than I can, so I don't consider them to really be drivers/base commits. I pulled in a few commits that don't strictly touch drivers/base to finish out Luis Chamberlain's series on reworking the firmware loader.

A half-dozen MSI related commits that touched drivers/base in v5.17 were not backported, as they were part of a roughly 30 commit series that would be better handled by the PCI subsystem team.

Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>

Conflicts:
- drivers/base/node.c: context differs due to MR !833.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-19 11:43:50 +02:00
Patrick Talbert 7ac8f4344e Merge: SGX updates from v5.17
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/833

```
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081354
Upstream Status: merged into the linux.git

Bring in important SGX commits from v5.17. All the commits except
one apply cleanly.

Three commits are touching arch/x86/mm and NUMA code. They were
checked carefully, they add new functionality related to SGX, but
do not change anything previously existing:

drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI
x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node

The third commit has conflicts, but only contextual ones. An include,
a helper function and a new case in switch() are added. Nothing existing
previously is touched. The details are in the commit message:

x86/sgx: Remove .fixup usage

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
```

Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Dean Nelson <dnelson@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-23 09:54:44 +02:00
Mark Langsdorf 33329016f0 drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067252

commit 75bd50fa841db5434728d238b8b5659498ccf0ab
Author: Tian Tao <tiantao6@hisilicon.com>
Date: Fri, 6 Aug 2021 23:02:50 +1200

Reading /sys/devices/system/cpu/cpuX/nodeX/ returns cpumap and cpulist.
However, the size of this file is limited to PAGE_SIZE because of the
limitation for sysfs attribute.

This patch moves to use bin_attribute to extend the ABI to be more
than one page so that cpumap bitmask and list won't be potentially
trimmed.
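
To illustrate why one page can be too small, here is a minimal userspace sketch (not part of the patch). It assumes the cpumap text is a comma-separated list of 32-bit hex words, most significant word first, and a 4096-byte page; `parse_cpumap` and `cpumap_text_length` are hypothetical helper names.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_cpumap(text):
    """Return the sorted CPU ids set in a cpumap string such as
    "00000001,0000000f" (32-bit hex words, most significant first)."""
    mask = 0
    for word in text.strip().split(","):
        mask = (mask << 32) | int(word, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def cpumap_text_length(nr_cpus):
    """Upper bound on the characters needed to print nr_cpus bits:
    8 hex digits per 32-bit word, commas between words, one newline."""
    words = (nr_cpus + 31) // 32
    return words * 8 + (words - 1) + 1


if __name__ == "__main__":
    print(parse_cpumap("00000001,0000000f"))
    for n in (4096, 16384):
        print(n, cpumap_text_length(n), cpumap_text_length(n) > PAGE_SIZE)
```

Under these assumptions a 16384-CPU mask needs 4608 characters, which no longer fits in a single 4096-byte page.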

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lore.kernel.org/r/20210806110251.560-5-song.bao.hua@hisilicon.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-06-14 16:01:47 -05:00
Mark Langsdorf 0ddad73cfb driver: base: Prefer unsigned int to bare use of unsigned
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067252

commit e7deeb9d79d8691f1e6c4c6707471ec3d7b9886b
Author: Jinchao Wang <wjc@cdjrlc.com>
Date: Tue, 29 Jun 2021 01:19:07 +0800

Fix checkpatch warnings:
    WARNING: Prefer 'unsigned int' to bare use of 'unsigned'

Signed-off-by: Jinchao Wang <wjc@cdjrlc.com>
Link: https://lore.kernel.org/r/20210628171907.63646-2-wjc@cdjrlc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-06-14 16:01:47 -05:00
Vladis Dronov 185a794c93 x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081354
Upstream Status: merged into the linux.git

commit 50468e4313355b161cac8a5155a45832995b7f25
Author: Jarkko Sakkinen <jarkko@kernel.org>
Date:   Tue Nov 16 18:21:16 2021 +0200

    x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node

    == Problem ==

    The amount of SGX memory on a system is determined by the BIOS and it
    varies wildly between systems.  It can be as small as dozens of MB's
    and as large as many GB's on servers.  Just like how applications need
    to know how much regular RAM is available, enclave builders need to
    know how much SGX memory an enclave can consume.

    == Solution ==

    Introduce a new sysfs file:

            /sys/devices/system/node/nodeX/x86/sgx_total_bytes

    to enumerate the amount of SGX memory available in each NUMA node.
    This serves the same function for SGX as /proc/meminfo or
    /sys/devices/system/node/nodeX/meminfo does for normal RAM.

    'sgx_total_bytes' is needed today to help drive the SGX selftests.
    SGX-specific swap code is exercised by creating overcommitted enclaves
    which are larger than the physical SGX memory on the system.  They
    currently use a CPUID-based approach which can diverge from the actual
    amount of SGX memory available.  'sgx_total_bytes' ensures that the
    selftests can work efficiently and do not attempt stupid things like
    creating a 100,000 MB enclave on a system with 128 MB of SGX memory.

    == Implementation Details ==

    Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
    arch specific attribute group, and add an attribute for the amount of
    SGX memory in bytes to each NUMA node:

    == ABI Design Discussion ==

    As opposed to the per-node ABI, a single, global ABI was considered.
    However, this would prevent enclaves from being able to size
    themselves so that they fit on a single NUMA node.  Essentially, a
    single value would rule out NUMA optimizations for enclaves.
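
A userspace consumer of the per-node file might look like the following sketch. Only the path /sys/devices/system/node/nodeX/x86/sgx_total_bytes comes from the text above; the function names and the idea of sizing an enclave to the largest node are illustrative assumptions.

```python
import glob
import os


def sgx_bytes_per_node(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> SGX memory in bytes, for nodes exposing the file."""
    totals = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "x86", "sgx_total_bytes")
    for path in glob.glob(pattern):
        # path is <root>/nodeN/x86/sgx_total_bytes; recover N from "nodeN".
        node_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as f:
            totals[int(node_dir[len("node"):])] = int(f.read().strip())
    return totals


def largest_single_node_enclave(totals):
    """Largest enclave size (bytes) that still fits on a single node."""
    return max(totals.values(), default=0)


if __name__ == "__main__":
    per_node = sgx_bytes_per_node()
    print(per_node, largest_single_node_enclave(per_node))
```

On a system without SGX, or with a kernel that lacks this attribute, the dictionary is simply empty. A single global value could not answer the per-node question at all.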

    Create a new "x86/" directory inside each "nodeX/" sysfs directory.
    'sgx_total_bytes' is expected to be the first of at least a few
    sgx-specific files to be placed in the new directory.  Just scanning
    /proc/meminfo, these are the no-brainers that we have for RAM, but we
    need for SGX:

            MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
            MemFree:        yyyy kB // sgx_free_bytes
            SwapTotal:      zzzz kB // sgx_swapped_bytes

    So, at *least* three.  I think we will eventually end up needing
    something more along the lines of a dozen.  A new directory (as
    opposed to being in the nodeX/ "root") directory avoids cluttering the
    root with several "sgx_*" files.

    Place the new file in a new "nodeX/x86/" directory because SGX is
    highly x86-specific.  It is very unlikely that any other architecture
    (or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
    as opposed to "x86/" was also considered.  But, there is a real chance
    this can get used for other arch-specific purposes.

    [ dhansen: rewrite changelog ]

    Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Borislav Petkov <bp@suse.de>
    Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
2022-05-05 12:26:38 +02:00
Vladis Dronov d44a348f6d drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081354
Upstream Status: merged into the linux.git

commit 75bd50fa841db5434728d238b8b5659498ccf0ab
Author: Tian Tao <tiantao6@hisilicon.com>
Date:   Fri Aug 6 23:02:50 2021 +1200

    drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI

    Reading /sys/devices/system/cpu/cpuX/nodeX/ returns cpumap and cpulist.
    However, the size of this file is limited to PAGE_SIZE because of the
    limitation for sysfs attribute.

    This patch moves to use bin_attribute to extend the ABI to be more
    than one page so that cpumap bitmask and list won't be potentially
    trimmed.

    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Link: https://lore.kernel.org/r/20210806110251.560-5-song.bao.hua@hisilicon.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
2022-05-05 12:26:38 +02:00
David Hildenbrand be1c559fd2 drivers/base/memory: determine and store zone for single-zone memory blocks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2077436

commit 395f6081bad49f9c54abafebab49ee23aa985bbd
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:47:31 2022 -0700

    drivers/base/memory: determine and store zone for single-zone memory blocks

    test_pages_in_a_zone() is just another nasty PFN walker that can easily
    stumble over ZONE_DEVICE memory ranges falling into the same memory block
    as ordinary system RAM: the memmap of parts of these ranges might possibly
    be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:

      UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
      index 7 is out of range for type 'zone [5]'
      CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
      Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
      Call Trace:
       dump_stack+0x9a/0xf0
       ubsan_epilogue+0x9/0x7a
       __ubsan_handle_out_of_bounds+0x13a/0x181
       test_pages_in_a_zone+0x3c4/0x500
       show_valid_zones+0x1fa/0x380
       dev_attr_show+0x43/0xb0
       sysfs_kf_seq_show+0x1c5/0x440
       seq_read+0x49d/0x1190
       vfs_read+0xff/0x300
       ksys_read+0xb8/0x170
       do_syscall_64+0xa5/0x4b0
       entry_SYSCALL_64_after_hwframe+0x6a/0xdf
      RIP: 0033:0x7f01f4439b52

    We seem to stumble over a memmap that contains a garbage zone id.  While
    we could try inserting pfn_to_online_page() calls, it will just make
    memory offlining slower, because we use test_pages_in_a_zone() to make
    sure we're offlining pages that all belong to the same zone.

    Let's just get rid of this PFN walker and determine the single zone of a
    memory block -- if any -- for early memory blocks during boot.  For memory
    onlining, we know the single zone already.  Let's avoid any additional
    memmap scanning and just rely on the zone information available during
    boot.

    For memory hot(un)plug, we only really care about memory blocks that:
    * span a single zone (and, thereby, a single node)
    * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
    If one of these conditions is not met, we reject memory offlining.
    Hotplugged memory blocks (starting out offline), always meet both
    conditions.

    There are three scenarios to handle:

    (1) Memory hot(un)plug

    A memory block with zone == NULL cannot be offlined, corresponding to
    our previous test_pages_in_a_zone() check.

    After successful memory onlining/offlining, we simply set the zone
    accordingly.
    * Memory onlining: set the zone we just used for onlining
    * Memory offlining: set zone = NULL

    So a hotplugged memory block starts with zone = NULL. Once memory
    onlining is done, we set the proper zone.

    (2) Boot memory with !CONFIG_NUMA

    We know that there is just a single pgdat, so we simply scan all zones
    of that pgdat for an intersection with our memory block PFN range when
    adding the memory block. If more than one zone intersects (e.g., DMA and
    DMA32 on x86 for the first memory block) we set zone = NULL and
    consequently mimic what test_pages_in_a_zone() used to do.

    (3) Boot memory with CONFIG_NUMA

    At the point in time we create the memory block devices during boot, we
    don't know yet which nodes *actually* span a memory block. While we could
    scan all zones of all nodes for intersections, overlapping nodes complicate
    the situation and scanning all nodes is possibly expensive. But that
    problem has already been solved by the code that sets the node of a memory
    block and creates the link in the sysfs --
    do_register_memory_block_under_node().

    So, we hook into the code that sets the node id for a memory block. If
    we already have a different node id set for the memory block, we know
    that multiple nodes *actually* have PFNs falling into our memory block:
    we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
    to do. If there is no node id set, we do the same as (2) for the given
    node.

    Note that the call order in driver_init() is:
    -> memory_dev_init(): create memory block devices
    -> node_dev_init(): link memory block devices to the node and set the
                        node id

    So in summary, we detect if there is a single zone responsible for this
    memory block and we consequently store the zone in that case in the
    memory block, updating it during memory onlining/offlining.
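
The single-zone rule from scenario (2) can be modelled in a few lines. This is a simplified userspace sketch, not the kernel implementation: `Zone`, `single_zone_for_block` and `can_offline` are hypothetical names, PFN ranges are half-open, and the example zone boundaries assume x86-64 with 4 KiB pages and 128 MiB memory blocks.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Zone:
    name: str
    start_pfn: int
    end_pfn: int  # exclusive


def single_zone_for_block(start_pfn: int, nr_pages: int,
                          zones: Sequence[Zone]) -> Optional[Zone]:
    """Return the only zone intersecting the block, or None if zero or
    several zones intersect (mirrors storing zone = NULL)."""
    end_pfn = start_pfn + nr_pages
    hits = [z for z in zones if z.start_pfn < end_pfn and start_pfn < z.end_pfn]
    return hits[0] if len(hits) == 1 else None


def can_offline(zone: Optional[Zone]) -> bool:
    """A block without a stored zone is rejected for offlining."""
    return zone is not None


if __name__ == "__main__":
    zones = [Zone("DMA", 1, 4096), Zone("DMA32", 4096, 1048576)]
    block_pages = 32768  # 128 MiB of 4 KiB pages (assumption)
    first = single_zone_for_block(0, block_pages, zones)
    second = single_zone_for_block(block_pages, block_pages, zones)
    print(first, can_offline(first))    # spans DMA and DMA32
    print(second, can_offline(second))  # DMA32 only
```

Onlining would then store the zone that was used, and offlining would reset it to None, as described in scenario (1).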

    Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: Rafael Parra <rparrazo@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rafael Parra <rparrazo@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
2022-04-21 17:04:04 +02:00
David Hildenbrand 322c4f0efc drivers/base/node: rename link_mem_sections() to register_memory_block_under_node()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2077436

commit cc6515591b25f08ce199e9379844a964f52a27f2
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:47:28 2022 -0700

    drivers/base/node: rename link_mem_sections() to register_memory_block_under_node()

    Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.

    I remember talking to Michal in the past about removing
    test_pages_in_a_zone(), which we use for:
    * verifying that a memory block we intend to offline is really only managed
      by a single zone. We don't support offlining of memory blocks that are
      managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
    * exposing that zone to user space via
      /sys/devices/system/memory/memory*/valid_zones

    Now that I identified some more cases where test_pages_in_a_zone() might
    go wrong, and we received an UBSAN report (see patch #3), let's get rid of
    this PFN walker.

    So instead of detecting the zone at runtime with test_pages_in_a_zone() by
    scanning the memmap, let's determine and remember for each memory block if
    it's managed by a single zone.  The stored zone can then be used for the
    above two cases, avoiding a manual lookup using test_pages_in_a_zone().

    This avoids eventually stumbling over uninitialized memmaps in corner
    cases, especially when ZONE_DEVICE ranges partly fall into memory blocks
    (that are responsible for managing System RAM).

    Handling memory onlining is easy, because we online to exactly one zone.
    Handling boot memory is more tricky, because we want to avoid scanning all
    zones of all nodes to detect possible zones that overlap with the physical
    memory region of interest.  Fortunately, we already have code that
    determines the applicable nodes for a memory block, to create sysfs links
    -- we'll hook into that.

    Patch #1 is a simple cleanup I had lying around for a long time.
    Patch #2 contains the main logic to remove test_pages_in_a_zone() and
    further details.

    [1] https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
    [2] https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com

    This patch (of 2):

    Let's adjust the stale terminology, making it match
    unregister_memory_block_under_nodes() and
    do_register_memory_block_under_node().  We're dealing with memory block
    devices, which span 1..X memory sections.

    Link: https://lkml.kernel.org/r/20220210184359.235565-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20220210184359.235565-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Oscar Salvador <osalvador@suse.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Rafael Parra <rparrazo@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
2022-04-21 17:04:04 +02:00
David Hildenbrand 66a4a4cbf7 drivers/base/node: consolidate node device subsystem initialization in node_dev_init()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2077436

commit 2848a28b0a6052a4c8450397d2647d7d8e3f6f06
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:47:13 2022 -0700

    drivers/base/node: consolidate node device subsystem initialization in node_dev_init()

    ...  and call node_dev_init() after memory_dev_init() from driver_init(),
    so before any of the existing arch/subsys calls.  All online nodes should
    be known at that point: early during boot, arch code determines node and
    zone ranges and sets the relevant nodes online; usually this happens in
    setup_arch().

    This is in line with memory_dev_init(), which initializes the memory
    device subsystem and creates all memory block devices.

    Similar to memory_dev_init(), panic() if anything goes wrong: we don't
    want to continue with such basic initialization errors.

    The important part is that node_dev_init() gets called after
    memory_dev_init() and after cpu_dev_init(), but before any of the relevant
    archs call register_cpu() to register the new cpu device under the node
    device.  The latter should be the case for the current users of
    topology_init().

    Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64)
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: Rich Felker <dalias@libc.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
2022-04-21 17:04:04 +02:00
Rafael Aquini c35f5b11f5 mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 50f9481ed9fb8a2d2a06a155634c7f9eeff9fa61
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Nov 5 13:44:24 2021 -0700

    mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE

    CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
    CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
    CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.

    Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Shuah Khan <skhan@linuxfoundation.org>        [kselftest]
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Oscar Salvador <osalvador@suse.de>
    Cc: Alex Shi <alexs@kernel.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:27 -05:00
Rafael Aquini af51d4fce9 mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 859a85ddf90e714092dea71a0e54c7b9896621be
Author: Mike Rapoport <rppt@kernel.org>
Date:   Tue Sep 7 19:54:52 2021 -0700

    mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE

    Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

    After recent updates to freeing unused parts of the memory map, no
    architecture can have holes in the memory map within a pageblock.  This
    makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
    option redundant.

    The first patch removes them both in a mechanical way and the second patch
    simplifies memory_hotplug::test_pages_in_a_zone() that had
    pfn_valid_within() surrounded by more logic than a simple if.

    This patch (of 2):

    After recent changes in freeing of the unused parts of the memory map and
    rework of pfn_valid() in arm and arm64 there are no architectures that can
    have holes in the memory map within a pageblock and so nothing can enable
    CONFIG_HOLES_IN_ZONE which guards the non-trivial implementation of
    pfn_valid_within().

    With that, pfn_valid_within() is always hardwired to 1 and can be
    completely removed.

    Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

    Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:02 -05:00
Linus Torvalds f5c13f1fde Driver core changes for 5.14-rc1
Here is the small set of driver core and debugfs updates for 5.14-rc1.
 
 Included in here are:
 	- debugfs api cleanups (touched some drivers)
 	- devres updates
 	- tiny driver core updates and tweaks
 
 Nothing major in here at all, and all have been in linux-next for a
 while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core changes from Greg KH:
 "Here is the small set of driver core and debugfs updates for 5.14-rc1.

  Included in here are:

   - debugfs api cleanups (touched some drivers)

   - devres updates

   - tiny driver core updates and tweaks

  Nothing major in here at all, and all have been in linux-next for a
  while with no reported issues"

* tag 'driver-core-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
  docs: ABI: testing: sysfs-firmware-memmap: add some memmap types.
  devres: Enable trace events
  devres: No need to call remove_nodes() when there none present
  devres: Use list_for_each_safe_from() in remove_nodes()
  devres: Make locking straight forward in release_nodes()
  kernfs: move revalidate to be near lookup
  drivers/base: Constify static attribute_group structs
  firmware_loader: remove unneeded 'comma' macro
  devcoredump: remove contact information
  driver core: Drop helper devm_platform_ioremap_resource_wc()
  component: Rename 'dev' to 'parent'
  component: Drop 'dev' argument to component_match_realloc()
  device property: Don't check for NULL twice in the loops
  driver core: auxiliary bus: Fix typo in the docs
  drivers/base/node.c: make CACHE_ATTR define static DEVICE_ATTR_RO
  debugfs: remove return value of debugfs_create_ulong()
  debugfs: remove return value of debugfs_create_bool()
  scsi: snic: debugfs: remove local storage of debugfs files
  b43: don't save dentries for debugfs
  b43legacy: don't save dentries for debugfs
  ...
2021-07-05 13:51:41 -07:00
Mel Gorman f19298b951 mm/vmstat: convert NUMA statistics to basic NUMA counters
NUMA statistics are maintained on the zone level for hits, misses, foreign
etc but nothing relies on them being perfectly accurate for functional
correctness.  The counters are used by userspace to get a general overview
of a workload's NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES.  There even is a sysctl vm.numa_stat to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.

This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events.  There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
The counters are no longer updated from vmstat_refresh context as it is
unnecessary overhead for counters that may never be read by userspace.
Note that counters could be maintained at the node level to save space but
it would have a user-visible impact due to /proc/zoneinfo.

[lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]

Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Rikard Falkeborn 5a576764e4 drivers/base: Constify static attribute_group structs
These are only used by putting their address in an array of pointers to
const struct attribute_group (either directly or via the
__ATTRIBUTE_GROUP macro). Make them const to allow the compiler to place
them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Link: https://lore.kernel.org/r/20210528213408.20067-1-rikard.falkeborn@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-04 15:06:28 +02:00
Ruiqi Gong fd03c075e3 drivers/base/node.c: make CACHE_ATTR define static DEVICE_ATTR_RO
Mark DEVICE_ATTR_RO(name) in CACHE_ATTR(name, fmt)'s definition as static to fix
the following Sparse tool reports:

drivers/base/node.c:239:1: warning:
 symbol 'dev_attr_line_size' was not declared. Should it be static?
drivers/base/node.c:240:1: warning:
 symbol 'dev_attr_indexing' was not declared. Should it be static?

Where dev_attr_{line_size,indexing} are generated by CACHE_ATTR's expansion.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ruiqi Gong <gongruiqi1@huawei.com>
Link: https://lore.kernel.org/r/20210514020548.32483-1-gongruiqi1@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-21 22:04:58 +02:00
Dan Carpenter 4ce535ec00 node: fix device cleanups in error handling code
We can't use kfree() to free device managed resources so the kfree(dev)
is against the rules.

It's easier to write this code if we open code the device_register() as
a device_initialize() and device_add().  That way, if dev_set_name()
fails, we can call put_device() and it will clean up correctly.

Fixes: acc02a109b ("node: Add memory-side caching attributes")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YHA0JUra+F64+NpB@mwanda
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-10 11:10:21 +02:00
Shakeel Butt b603894248 mm: memcg: add swapcache stat for memcg v2
This patch adds swapcache stat for the cgroup v2.  The swapcache
represents the memory that is accounted against both the memory and the
swap limit of the cgroup.  The main motivation behind exposing the
swapcache stat is for enabling users to gracefully migrate from cgroup
v1's memsw counter to cgroup v2's memory and swap counters.

Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
workload but without control on the exact proportion of memory and swap.
Cgroup v2 provides separate limits for memory and swap which enables more
control on the exact usage of memory and swap individually for the
workload.

With some little subtleties, the v1's memsw limit can be switched with the
sum of the v2's memory and swap limits.  However the alternative for memsw
usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
stat enables that alternative.  Adding the memory usage and swap usage and
subtracting the swapcache will approximate the memsw usage.  This will
help in the transparent migration of the workloads depending on memsw
usage and limit to v2's memory and swap counters.
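
The approximation described above is a one-line formula. The sketch below assumes the three inputs are read from the cgroup v2 files memory.current, memory.swap.current and memory.stat, and that the new stat appears in memory.stat under the key `swapcached`; the helper names are hypothetical.

```python
def parse_memory_stat(text):
    """Parse "key value" lines from a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def approx_memsw_usage(memory_current, swap_current, swapcached):
    """memory usage + swap usage - swapcache, all in bytes.

    Swapcache pages are charged to both counters, so they are
    subtracted once to avoid counting them twice."""
    return memory_current + swap_current - swapcached


if __name__ == "__main__":
    sample = "anon 1048576\nfile 2097152\nswapcached 4096\n"
    stats = parse_memory_stat(sample)
    print(approx_memsw_usage(8192000, 1024000, stats["swapcached"]))
```

The result is only an approximation: the three values are read at slightly different times, so they need not be mutually consistent.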

The reasons these applications are still interested in this approximate
memsw usage are: (1) these applications are not really interested in two
separate memory and swap usage metrics.  A single usage metric is
simpler for them to use and reason about.

(2) The memsw usage metric hides the underlying system's swap setup from
the applications.  Applications with multiple instances running in a
datacenter with heterogeneous systems (some have swap and some don't) will
keep seeing a consistent view of their usage.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]

Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song 380780e718 mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, and the per-cpu counters can cache 23.4375 GB in total.
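
The 23.4375 GB figure can be reproduced with a short calculation.  The
sketch below is only illustrative: it assumes a 4 KiB base page and that
each cached unit is one THP of 512 base pages, and the function name is
hypothetical.

```python
PAGE_SIZE = 4096      # assumed base page size in bytes
PAGES_PER_THP = 512   # base pages added by one THP update


def max_cached_bytes(nr_cpus, threshold, pages_per_unit=PAGES_PER_THP,
                     page_size=PAGE_SIZE):
    """Worst-case memory hidden in per-cpu deltas below the threshold."""
    return nr_cpus * threshold * pages_per_unit * page_size


if __name__ == "__main__":
    # 96 CPUs, each allowed to hold back up to 125 units (THPs here).
    print(max_cached_bytes(96, 125) / 2**30, "GiB")
```

With the same threshold but a counter kept in base pages
(pages_per_unit=1), the same formula gives under 50 MB, which shows why
the unit of the counter matters so much for the error bound.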

A THP update is already a form of batched addition (it adds 512 pages'
worth of memory in one go), so skipping the batching seems sensible.
Every THP stats update then overflows the per-cpu counter and falls back
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_FILE_PMDMAPPED accounting to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters use three units: pages, kB and
bytes.  A B or KB suffix indicates bytes or kB; counters without a suffix
are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song a1528e21f8 mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, and the per-cpu counters can cache 23.4375 GB in total.

A THP update is already a form of batched addition (it adds 512 pages'
worth of memory in one go), so skipping the batching seems sensible.
Every THP stats update then overflows the per-cpu counter and falls back
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_SHMEM_PMDMAPPED accounting to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters use three units: pages, kB and
bytes.  A B or KB suffix indicates bytes or kB; counters without a suffix
are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song 57b2847d3c mm: memcontrol: convert NR_SHMEM_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, and the per-cpu counters can cache 23.4375 GB in total.

A THP update is already a form of batched addition (it adds 512 pages'
worth of memory in one go), so skipping the batching seems sensible.
Every THP stats update then overflows the per-cpu counter and falls back
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_SHMEM_THPS accounting to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters use three units: pages, kB and
bytes.  A B or KB suffix indicates bytes or kB; counters without a suffix
are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song bf9ecead53 mm: memcontrol: convert NR_FILE_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, and the per-cpu counters can cache 23.4375 GB in total.

A THP update is already a form of batched addition (it adds 512 pages'
worth of memory in one go), so skipping the batching seems sensible.
Every THP stats update then overflows the per-cpu counter and falls back
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_FILE_THPS accounting to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters use three units: pages, kB and
bytes.  A B or KB suffix indicates bytes or kB; counters without a suffix
are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song 69473e5de8 mm: memcontrol: convert NR_ANON_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, and the per-cpu counters can cache 23.4375 GB in total.

A THP update is already a form of batched addition (it adds 512 pages'
worth of memory in one go), so skipping the batching seems sensible.
Every THP stats update then overflows the per-cpu counter and falls back
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_ANON_THPS accounting to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters use three units: pages, kB and
bytes.  A B or KB suffix indicates bytes or kB; counters without a suffix
are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Shakeel Butt f0c0c115fb mm: memcontrol: account pagetables per node
For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups.  However at
the moment, the pagetables are accounted per-zone.  Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.

[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]

Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:40 -08:00
Laurent Dufour 90c7eaeb14 mm: don't panic when links can't be created in sysfs
At boot time, or when doing memory hot-add operations, if the links in
sysfs can't be created, the system is still able to run, so just report
the error in the kernel log rather than calling BUG_ON() and potentially
making the system unusable, because the call path can be invoked with
locks held.

Since the number of memory blocks managed could be high, the messages are
rate limited.

As a consequence, link_mem_sections() has no status to report anymore.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:18 -07:00
Linus Torvalds fe151462bd Driver Core patches for 5.10-rc1
Here is the "big" set of driver core patches for 5.10-rc1
 
 They include a lot of different things, all related to the driver core
 and/or some driver logic:
 	- sysfs common write functions to make it easier to audit sysfs
 	  attributes
 	- device connection cleanups and fixes
 	- devm helpers for a few functions
 	- NOIO allocations for when devices are being removed
 	- minor cleanups and fixes
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.10-rc1

  They include a lot of different things, all related to the driver core
  and/or some driver logic:

   - sysfs common write functions to make it easier to audit sysfs
     attributes

   - device connection cleanups and fixes

   - devm helpers for a few functions

   - NOIO allocations for when devices are being removed

   - minor cleanups and fixes

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
  regmap: debugfs: use semicolons rather than commas to separate statements
  platform/x86: intel_pmc_core: do not create a static struct device
  drivers core: node: Use a more typical macro definition style for ACCESS_ATTR
  drivers core: Use sysfs_emit for shared_cpu_map_show and shared_cpu_list_show
  mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit
  drivers core: Miscellaneous changes for sysfs_emit
  drivers core: Reindent a couple uses around sysfs_emit
  drivers core: Remove strcat uses around sysfs_emit and neaten
  drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
  sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
  dyndbg: use keyword, arg varnames for query term pairs
  driver core: force NOIO allocations during unplug
  platform_device: switch to simpler IDA interface
  driver core: platform: Document return type of more functions
  Revert "driver core: Annotate dev_err_probe() with __must_check"
  Revert "test_firmware: Test platform fw loading on non-EFI systems"
  iio: adc: xilinx-xadc: use devm_krealloc()
  hwmon: pmbus: use more devres helpers
  devres: provide devm_krealloc()
  syscore: Use pm_pr_dbg() for syscore_{suspend,resume}()
  ...
2020-10-14 16:09:32 -07:00
Rafael J. Wysocki e4174ff78b Merge branch 'acpi-numa'
* acpi-numa:
  docs: mm: numaperf.rst Add brief description for access class 1.
  node: Add access1 class to represent CPU to memory characteristics
  ACPI: HMAT: Fix handling of changes from ACPI 6.2 to ACPI 6.3
  ACPI: Let ACPI know we support Generic Initiator Affinity Structures
  x86: Support Generic Initiator only proximity domains
  ACPI: Support Generic Initiator only domains
  ACPI / NUMA: Add stub function for pxm_to_node()
  irq-chip/gic-v3-its: Fix crash if ITS is in a proximity domain without processor or memory
  ACPI: Remove side effect of partly creating a node in acpi_get_node()
  ACPI: Rename acpi_map_pxm_to_online_node() to pxm_to_online_node()
  ACPI: Remove side effect of partly creating a node in acpi_map_pxm_to_online_node()
  ACPI: Do not create new NUMA domains from ACPI static tables that are not SRAT
  ACPI: Add out of bounds and numa_off protections to pxm_to_node()
2020-10-13 14:44:50 +02:00
Jonathan Cameron 894c26a1c2 ACPI: Support Generic Initiator only domains
Generic Initiators are a new ACPI concept that allows for the
description of proximity domains that contain a device which
performs memory access (such as a network card) but neither
host CPU nor Memory.

This patch has the parsing code and provides the infrastructure
for an architecture to associate these new domains with their
nearest memory processing node.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-02 18:51:57 +02:00
Joe Perches 6284a6e894 drivers core: node: Use a more typical macro definition style for ACCESS_ATTR
Remove the trailing semicolon from the macro and add it to its uses.

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/faf51a671160cf884efa68fb458d3e8a44b1a7a7.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:24:40 +02:00
Joe Perches 7981593bf0 mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit
Convert the unbounded sprintf in hugetlb_report_node_meminfo to use
sysfs_emit_at so that no overrun of a PAGE_SIZE buf can occur.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Link: https://lore.kernel.org/r/894b351b82da6013cde7f36ff4b5493cd0ec30d0.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:16:33 +02:00
Joe Perches 948b3edba8 drivers core: Miscellaneous changes for sysfs_emit
Change additional instances that could use sysfs_emit and sysfs_emit_at
that the coccinelle script could not convert.

o macros creating show functions with ## concatenation
o unbound sprintf uses with buf+len for start of output to sysfs_emit_at
o returns with ?: tests and sprintf to sysfs_emit
o sysfs output with struct class * not struct device * arguments

Miscellanea:

o remove unnecessary initializations around these changes
o consistently use int len for return length of show functions
o use octal permissions and not S_<FOO>
o rename a few show function names so DEVICE_ATTR_<FOO> can be used
o use DEVICE_ATTR_ADMIN_RO where appropriate
o consistently use const char *output for strings
o checkpatch/style neatening

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/8bc24444fe2049a9b2de6127389b57edfdfe324d.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:12:07 +02:00
Joe Perches 27275d3018 drivers core: Reindent a couple uses around sysfs_emit
Just a couple of whitespace realignments to the open parenthesis for
multi-line statements.

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/33224191421dbb56015eded428edfddcba997d63.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:09:10 +02:00
Joe Perches aa838896d8 drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
Convert the various sprintf family calls in sysfs device show functions
to sysfs_emit and sysfs_emit_at for PAGE_SIZE buffer safety.

Done with:

$ spatch -sp-file sysfs_emit_dev.cocci --in-place --max-width=80 .

And cocci script:

$ cat sysfs_emit_dev.cocci
@@
identifier d_show;
identifier dev, attr, buf;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	sprintf(buf,
+	sysfs_emit(buf,
	...);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	snprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	scnprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
expression chr;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	return
-	strcpy(buf, chr);
+	sysfs_emit(buf, chr);
	...>
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	len =
-	sprintf(buf,
+	sysfs_emit(buf,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	len =
-	snprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
	len =
-	scnprintf(buf, PAGE_SIZE,
+	sysfs_emit(buf,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
identifier len;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	<...
-	len += scnprintf(buf + len, PAGE_SIZE - len,
+	len += sysfs_emit_at(buf, len,
	...);
	...>
	return len;
}

@@
identifier d_show;
identifier dev, attr, buf;
expression chr;
@@

ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	...
-	strcpy(buf, chr);
-	return strlen(buf);
+	return sysfs_emit(buf, chr);
}

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/3d033c33056d88bbe34d4ddb62afd05ee166ab9a.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 13:09:10 +02:00
Laurent Dufour f85086f95f mm: don't rely on system state to detect hot-plug operations
In register_mem_sect_under_node() the system_state value is checked to
detect whether the call is made during boot time or during a hot-plug
operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state.  In
addition, a memory hot-plug operation can be triggered at this system
state by ACPI [1].  So checking against the system state is not
enough.

The consequence shows up on systems with interleaved node ranges like this:

 Early memory node ranges
   node   1: [mem 0x0000000000000000-0x000000011fffffff]
   node   2: [mem 0x0000000120000000-0x000000014fffffff]
   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
   node   0: [mem 0x0000000200000000-0x000000048fffffff]
   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on a PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations.  At the next reboot the nodes' memory ranges can
be interleaved, and since the call to link_mem_sections() is made in
topology_init() while the system is in the SYSTEM_SCHEDULING state, the
node id is not checked and the sections are registered to multiple
nodes:

  $ ls -l /sys/devices/system/memory/memory21/node*
  total 0
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot, but if one of these memory
blocks is later hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and triggers a BUG_ON():

  kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
  CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
  Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation.  An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.
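
The double registration of memory21 can be reproduced with a small
model.  The Python sketch below is not the kernel code: it assumes a
256 MiB memory block size, treats each node's span as running from its
lowest to its highest address, and uses a hypothetical early_boot flag to
stand in for the extra parameter described above.  The ranges are copied
from the boot log in this message.

```python
BLOCK_SIZE = 0x10000000  # assumed memory block size (256 MiB)

# (node id, first byte, last byte), copied from the boot log above.
EARLY_RANGES = [
    (1, 0x0000000000000000, 0x000000011FFFFFFF),
    (2, 0x0000000120000000, 0x000000014FFFFFFF),
    (1, 0x0000000150000000, 0x00000001FFFFFFFF),
    (0, 0x0000000200000000, 0x000000048FFFFFFF),
    (2, 0x0000000490000000, 0x00000007FFFFFFFF),
]


def node_spans(ranges):
    """Return {node id: (lowest byte, highest byte)} over all its ranges."""
    spans = {}
    for nid, start, end in ranges:
        lo, hi = spans.get(nid, (start, end))
        spans[nid] = (min(lo, start), max(hi, end))
    return spans


def owners(block, ranges):
    """Nodes that really own some memory inside the given block."""
    start = block * BLOCK_SIZE
    end = start + BLOCK_SIZE - 1
    return {nid for nid, lo, hi in ranges if lo <= end and start <= hi}


def linked_nodes(block, ranges, early_boot):
    """Nodes the block gets linked to when every node span is walked.

    With early_boot=True the owning node id is checked for each block;
    with early_boot=False (the misdetected case) every block inside a
    node's span is linked to that node.
    """
    start = block * BLOCK_SIZE
    end = start + BLOCK_SIZE - 1
    owning = owners(block, ranges)
    linked = set()
    for nid, (lo, hi) in node_spans(ranges).items():
        if not (lo <= end and start <= hi):
            continue  # block lies outside this node's span
        if early_boot and nid not in owning:
            continue  # node id check rejects a foreign block
        linked.add(nid)
    return sorted(linked)


if __name__ == "__main__":
    print("without node id check:", linked_nodes(21, EARLY_RANGES, False))
    print("with node id check:   ", linked_nodes(21, EARLY_RANGES, True))
```

Under these assumptions block 21 starts at 0x150000000, which belongs to
node 1 only, yet it also falls inside the span of node 2.  Skipping the
node id check therefore links it to both nodes, matching the sysfs
listing above.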

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:

  $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
        -m size=$MEM,slots=255,maxmem=4294967296k  \
        -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
        -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
        -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
        -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
        -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
        -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
        -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
        -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce63391 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-26 10:33:57 -07:00
Shakeel Butt 991e767385 mm: memcontrol: account kernel stack per node
Currently the kernel stack is being accounted per-zone.  There is no need
to do that.  In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB.  Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item.  In addition localize the kernel stack stats updates to
account_kernel_stack().

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:25 -07:00
Roman Gushchin d42f3245c7 mm: memcg: convert vmstat slab counters to bytes
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally, global and per-node counters are stored in pages, whereas
memcg and lruvec counters are stored in bytes.  This scheme may look
weird, but only for now.  As soon as slab pages are shared between
multiple cgroups, global and node counters will reflect the total number
of slab pages, while memcg and lruvec counters will be used for
per-memcg slab memory tracking, which will take separate kernel objects
into account.  Keeping global and node counters in pages helps to avoid
additional overhead.

The size of slab memory shouldn't exceed 4GB on 32-bit machines, so it
will fit into the atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:24 -07:00