JIRA: https://issues.redhat.com/browse/RHEL-75923
commit b79276dcac9124a79c8cf7cc8fbdd3d4c3c9a7c7
Author: Mario Limonciello <mario.limonciello@amd.com>
Date: Mon Nov 4 16:28:55 2024 -0600
ACPI: processor: Move arch_init_invariance_cppc() call later
arch_init_invariance_cppc() is called at the end of
acpi_cppc_processor_probe() in order to configure frequency invariance
based upon the values from _CPC.
This however doesn't work on AMD CPPC shared memory designs that have
AMD preferred cores enabled because _CPC needs to be analyzed from all
cores to judge if preferred cores are enabled.
This issue manifests to users as a warning since commit 21fb59ab4b97
("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"):
```
Could not retrieve highest performance (-19)
```
However the warning isn't the cause of this, it was actually
commit 279f838a61f9 ("x86/amd: Detect preferred cores in
amd_get_boost_ratio_numerator()") which exposed the issue.
To fix this problem, change arch_init_invariance_cppc() into a new weak
symbol that is called at the end of acpi_processor_driver_init().
Each architecture that supports it can declare the symbol to override
the weak one.
Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the
architectures using the generic arch_topology.c code.
Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()")
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org
[ rjw: Changelog edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: David Arcari <darcari@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56494
commit e5bc44e47c531860be96ac615314b1ab23d5aa2b
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Apr 25 09:37:09 2024 +0200
arch/topology: Fix variable naming to avoid shadowing
Using 'hw_pressure' for local variable name is confusing in regard to the
per-CPU 'hw_pressure' variable that uses the same name:
include/linux/arch_topology.h:DECLARE_PER_CPU(unsigned long, hw_pressure);
... which puts it into a global scope for all code that includes
<linux/topology.h>, shadowing the local variable.
Rename it to avoid compiler confusion & Sparse warnings.
[ mingo: Expanded the changelog. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240425073709.379016-1-vincent.guittot@linaro.org
Closes: https://lore.kernel.org/oe-kbuild-all/202404250740.VhQQoD7N-lkp@intel.com/
Fixes: d4dbc991714e ("sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()")
Tested-by: Konrad Dybcio <konrad.dybcio@linaro.org> # QC SM8550 QRD
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Minor differences since we already have ddae0ca2a8f
("sched: Move psi_account_irqtime() out of update_rq_clock_task()
hotpath") which changes some nearby code.
commit d4dbc991714eefcbd8d54a3204bd77a0a52bd32d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Tue Mar 26 10:16:15 2024 +0100
sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
Now that cpufreq provides a pressure value to the scheduler, rename
arch_update_thermal_pressure into HW pressure to reflect that it returns
a pressure applied by HW (i.e. with a high frequency change) and not
always related to thermal mitigation but also generated by max current
limitation as an example. Such high frequency signal needs filtering to be
smoothed and provide an value that reflects the average available capacity
into the scheduler time scale.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 98323e9d70172f1b46d1cadb20d6c54abf62870d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Jan 17 20:05:45 2024 +0100
topology: Set capacity_freq_ref in all cases
If "capacity-dmips-mhz" is not set, raw_capacity is null and we skip the
normalization step which includes setting per_cpu capacity_freq_ref.
Always register the notifier but skip the capacity normalization if
raw_capacity is null.
Fixes: 9942cb22ea45 ("sched/topology: Add a new arch_scale_freq_ref() method")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20240117190545.596057-1-vincent.guittot@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 1f023007f5e782bda19ad9104830c404fd622c5d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Dec 11 11:48:55 2023 +0100
arm64/amu: Use capacity_ref_freq() to set AMU ratio
Use the new capacity_ref_freq() method to set the ratio that is used by AMU for
computing the arch_scale_freq_capacity().
This helps to keep everything aligned using the same reference for
computing CPUs capacity.
The default value of the ratio (stored in per_cpu(arch_max_freq_scale))
ensures that arch_scale_freq_capacity() returns max capacity until it is
set to its correct value with the cpu capacity and capacity_ref_freq().
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20231211104855.558096-8-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 5477fa249b56c59c3baa1b237bf083cffa64c84a
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Dec 11 11:48:54 2023 +0100
cpufreq/cppc: Set the frequency used for computing the capacity
Save the frequency associated to the performance that has been used when
initializing the capacity of CPUs.
Also, cppc cpufreq driver can register an artificial energy model. In such
case, it needs the frequency for this compute capacity.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-7-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29020
Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file
does not exist.
commit 9942cb22ea458c34fa17b73d143ea32d4df1caca
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Dec 11 11:48:49 2023 +0100
sched/topology: Add a new arch_scale_freq_ref() method
Create a new method to get a unique and fixed max frequency. Currently
cpuinfo.max_freq or the highest (or last) state of performance domain are
used as the max frequency when computing the frequency for a level of
utilization, but:
- cpuinfo_max_freq can change at runtime. boost is one example of
such change.
- cpuinfo.max_freq and last item of the PD can be different leading to
different results between cpufreq and energy model.
We need to save the reference frequency that has been used when computing
the CPUs capacity and use this fixed and coherent value to convert between
frequency and CPU's capacity.
In fact, we already save the frequency that has been used when computing
the capacity of each CPU. We extend the precision to save kHz instead of
MHz currently and we modify the type to be aligned with other variables
used when converting frequency to capacity and the other way.
[ mingo: Minor edits. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2180619
commit 3522340199cc060b70f0094e3039bdb43c3f6ee1
Author: Pierre Gondois <pierre.gondois@arm.com>
Date: Fri Apr 14 10:14:51 2023 +0200
arch_topology: Remove early cacheinfo error message if -ENOENT
fetch_cache_info() tries to get the number of cache leaves/levels
for each CPU in order to pre-allocate memory for cacheinfo struct.
Allocating this memory later triggers a:
'BUG: sleeping function called from invalid context'
in PREEMPT_RT kernels.
If there is no cache related information available in DT or ACPI,
fetch_cache_info() fails and an error message is printed:
'Early cacheinfo failed, ret = ...'
Not having cache information should be a valid configuration.
Remove the error message if fetch_cache_info() fails with -ENOENT.
Suggested-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/all/20230404-hatred-swimmer-6fecdf33b57a@spud/
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20230414081453.244787-4-pierre.gondois@arm.com
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2180619
commit e103d55465db06c5344201fd5fa11bb19bc479c5
Author: Radu Rendec <rrendec@redhat.com>
Date: Wed Apr 12 14:57:59 2023 -0400
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
Recent work enables cacheinfo memory for secondary CPUs to be allocated
early, while still running on the primary CPU. That allows cacheinfo
memory to be allocated safely on RT kernels. To make that work, the
number of cache levels/leaves must be defined in the device tree or ACPI
tables. Further work adds a path for early detection of the number of
cache levels/leaves, which makes it possible to allocate the cacheinfo
memory early without requiring extra DT/ACPI information.
This patch addresses a specific issue with ACPI systems with no PPTT. In
that case, parse_acpi_topology() returns an error code, which in turn
makes init_cpu_topology() return early, before fetch_cache_info() is
called. In that case, the early cache level detection doesn't run.
The solution is to simply remove the "return" statement and let the code
flow fall through to calling fetch_cache_info().
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://lore.kernel.org/all/dea94484-797f-3034-7b86-6d88801c0d91@arm.com/
Reviewed-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://lore.kernel.org/r/20230412185759.755408-4-rrendec@redhat.com
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2180619
commit 5944ce092b97caed5d86d961e963b883b5c44ee2
Author: Pierre Gondois <pierre.gondois@arm.com>
Date: Wed Jan 4 19:30:29 2023 +0100
arch_topology: Build cacheinfo from primary CPU
commit 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection
in the CPU hotplug path")
adds a call to detect_cache_attributes() to populate the cacheinfo
before updating the siblings mask. detect_cache_attributes() allocates
memory and can take the PPTT mutex (on ACPI platforms). On PREEMPT_RT
kernels, on secondary CPUs, this triggers a:
'BUG: sleeping function called from invalid context' [1]
as the code is executed with preemption and interrupts disabled.
The primary CPU was previously storing the cache information using
the now removed (struct cpu_topology).llc_id:
commit 5b8dc787ce4a ("arch_topology: Drop LLC identifier stash from
the CPU topology")
allocate_cache_info() tries to build the cacheinfo from the primary
CPU prior secondary CPUs boot, if the DT/ACPI description
contains cache information.
If allocate_cache_info() fails, then fallback to the current state
for the cacheinfo allocation. [1] will be triggered in such case.
When unplugging a CPU, the cacheinfo memory cannot be freed. If it
was, then the memory would be allocated early by the re-plugged
CPU and would trigger [1].
Note that populate_cache_leaves() might be called multiple times
due to populate_leaves being moved up. This is required since
detect_cache_attributes() might be called with per_cpu_cacheinfo(cpu)
being allocated but not populated.
[1]:
| BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
| in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/111
| preempt_count: 1, expected: 0
| RCU nest depth: 1, expected: 1
| 3 locks held by swapper/111/0:
| #0: (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x218/0x12c8
| #1: (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x48/0xf0
| #2: (&zone->lock){+.+.}-{3:3}, at: rmqueue_bulk+0x64/0xa80
| irq event stamp: 0
| hardirqs last enabled at (0): 0x0
| hardirqs last disabled at (0): copy_process+0x5dc/0x1ab8
| softirqs last enabled at (0): copy_process+0x5dc/0x1ab8
| softirqs last disabled at (0): 0x0
| Preemption disabled at:
| migrate_enable+0x30/0x130
| CPU: 111 PID: 0 Comm: swapper/111 Tainted: G W 6.0.0-rc4-rt6-[...]
| Call trace:
| __kmalloc+0xbc/0x1e8
| detect_cache_attributes+0x2d4/0x5f0
| update_siblings_masks+0x30/0x368
| store_cpu_topology+0x78/0xb8
| secondary_start_kernel+0xd0/0x198
| __secondary_switched+0xb0/0xb4
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230104183033.755668-7-pierre.gondois@arm.com
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2180619
commit 456797da792fa7cbf6698febf275fe9b36691f78
Author: Conor Dooley <conor.dooley@microchip.com>
Date: Fri Jul 15 18:51:55 2022 +0100
arm64: topology: move store_cpu_topology() to shared code
arm64's method of defining a default cpu topology requires only minimal
changes to apply to RISC-V also. The current arm64 implementation exits
early in a uniprocessor configuration by reading MPIDR & claiming that
uniprocessor can rely on the default values.
This is appears to be a hangover from prior to '3102bc0e6ac7 ("arm64:
topology: Stop using MPIDR for topology information")', because the
current code just assigns default values for multiprocessor systems.
With the MPIDR references removed, store_cpu_topolgy() can be moved to
the common arch_topology code.
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122311
Conflicts:
drivers/base/arch_topology.c - the only part of this
commit that I needed was the global export of cpu_scale,
so I dropped the second stanza that exports
topology_set_thermal_pressure as it would require backporting
several extra patches.
drivers/cpufreq/qcom-cpufreq-hw.c - this driver is
unsupported so I dropped all the changes silently.
Omitted-fix: 93d9e6f93e1586fcc97498c764be2e8c8401f4bd
Omitted-fix: e0e27c3d4e20dab861566f1c348ae44e4b498630
Omitted-fix: 5e4f009da6be563984ba4db4ef4f32529e9aeb90
Omitted-fix: 6240aaad75e1a623872a830d13393d7aabf1052c
Omitted-fix: f84ccad5f5660f86a642a3d7e2bfdc4e7a8a2d49
Omitted-fix: e4e6448638a01905faeda9bf96aa9df7c8ef463c
Omitted-fix: 91dc90fdb8b8199519a3aac9c46a433b02223c5b
all of these commits apply to qcom-cpufreq-hw.c,
which wasn't modified, so these can be ignored.
commit 275157b367f479334f3e2df7be93a3dd772f359c
Author: Thara Gopinath <thara.gopinath@linaro.org>
Date: Mon Aug 9 15:16:01 2021 -0400
Add interrupt support to notify the kernel of h/w initiated frequency
throttling by LMh. Convey this to scheduler via thermal presssure
interface.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
[Viresh: Added changes for arch_topology.c to fix build errors ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1332
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
Bring drivers/base up to date with Linux v6.0.
Ugly dependencies in the CXL and nvdimm code caused a commit that removed
lockdep to fail to compile, and that commit was dropped from the series.
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Conflicts:
MAINTAINERS
There were conflicts in MAINTAINERS file because of previous merge of
IOMMU/DMA API update that did several updates there. Resolution kept
changes in place, so there is no change with this merge.
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2150425
commit a2a9d1850060e5d995136561d76e81d61414f076
Author: Perry Yuan <Perry.Yuan@amd.com>
Date: Mon Aug 15 00:35:48 2022 +0800
ACPI: CPPC: Add ACPI disabled check to acpi_cpc_valid()
Make acpi_cpc_valid() check if ACPI is disabled, so that its callers
don't need to check that separately. This will also cause the AMD
pstate driver to refuse to load right away when ACPI is disabled.
Also update the warning message in amd_pstate_init() to mention the
ACPI disabled case for completeness.
Signed-off-by: Perry Yuan <Perry.Yuan@amd.com>
[ rjw: Subject edits, new changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: David Arcari <darcari@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 5ac251c8a05ce074e5efac779debf82a15d870a3
Author: Yicong Yang <yangyicong@hisilicon.com>
Date: Mon, 5 Sep 2022 20:26:15 +0800
Currently cpu_clustergroup_mask() will return CPU mask if cluster span more
or the same CPUs as cpu_coregroup_mask(). This will result topology borken
on non-Cluster SMT machines when building with CONFIG_SCHED_CLUSTER=y.
Test with:
qemu-system-aarch64 -enable-kvm -machine virt \
-net none \
-cpu host \
-bios ./QEMU_EFI.fd \
-m 2G \
-smp 48,sockets=2,cores=12,threads=2 \
-kernel $Image \
-initrd $Rootfs \
-nographic
-append "rdinit=init console=ttyAMA0 sched_verbose loglevel=8"
We'll get below error:
[ 3.084568] BUG: arch topology borken
[ 3.084570] the SMT domain not a subset of the CLS domain
Since cluster is a level higher than SMT, fix this by making cluster
spans at least SMT CPUs.
Fixes: bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20220905122615.12946-1-yangyicong@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 9b03e79300100bcd36e77c8ce94ee7f47cd2f528
Author: Florian Fainelli <f.fainelli@gmail.com>
Date: Fri, 5 Aug 2022 16:07:36 -0700
Architectures which do not have cacheinfo such as ARM 32-bit would spit
out the following during boot:
Early cacheinfo failed, ret = -2
Treat -ENOENT specifically to silence this error since it means that the
platform does not support reporting its cache information.
Fixes: 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection in the CPU hotplug path")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Michael Walle <michael@walle.cc>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220805230736.1562801-1-f.fainelli@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 3fcbf1c77d089fcf0331fd8f3cbbe6c436a3edbd
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Wed, 20 Jul 2022 13:55:40 +0100
init_cpu_topology() is called only once at the boot and all the cache
attributes are detected early for all the possible CPUs. However when
the CPUs are hotplugged out, the cacheinfo gets removed. While the
attributes are added back when the CPUs are hotplugged back in as part
of CPU hotplug state machine, it ends up called quite late after the
update_siblings_masks() are called in the secondary_start_kernel()
resulting in wrong llc_sibling_masks.
Move the call to detect_cache_attributes() inside update_siblings_masks()
to ensure the cacheinfo is updated before the LLC sibling masks are
updated. This will fix the incorrect LLC sibling masks generated when
the CPUs are hotplugged out and hotplugged back in again.
Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20220720-arch_topo_fixes-v3-3-43d696288e84@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 00e66e37af0090f9ed95ca4bc3d8f5c6171daaf0
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:16:04 +0100
We don't support the topology for clusters of CPU clusters while the
DT and ACPI bindings theoritcally support the same. Just warn about the
same so that it is clear to the users of arch_topology that the nested
clusters are not yet supported.
Link: https://lore.kernel.org/r/20220704101605.1318280-21-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit dea8c0b40fb500be29f4649cf01202e42a8a54f8
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:16:03 +0100
Finally let us add support for socket nodes in /cpu-map in the device
tree. Since this may not be present in all the old platforms and even
most of the existing platforms, we need to assume absence of the socket
node indicates that it is a single socket system and handle appropriately.
Also it is likely that most single socket systems skip to as the node
since it is optional.
Link: https://lore.kernel.org/r/20220704101605.1318280-20-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 556c9678a7d4456a677588ce308200a673b7eb1f
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:16:02 +0100
Let us set the cluster identifier as parsed from the device tree
cluster nodes within /cpu-map.
We don't support nesting of clusters yet as there are no real hardware
to support clusters of clusters.
Link: https://lore.kernel.org/r/20220704101605.1318280-19-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
Omitted-fix: 6b66ca0bac1b9cee7608d7c4dc59b699458b4cb8
Omitted-fix: c749b275056d4d1023af125b320c91a24d6856b8
These two commits are a fix and a reversion of the fix,
so I'm dropping both of them. The final fix is "arch_topology:
Make cluster topology span at least SMT CPUs", included later
in this series.
commit bfcc4397435dc0407099b9a805391abc05c2313b
Author: Ionela Voinescu <ionela.voinescu@arm.com>
Date: Mon, 4 Jul 2022 11:16:01 +0100
Currently the cluster identifier is not set on DT based platforms.
The reset or default value is -1 for all the CPUs. Once we assign the
cluster identifier values correctly, the cluster_sibling mask will be
populated and returned by cpu_clustergroup_mask() to contribute in the
creation of the CLS scheduling domain level, if SCHED_CLUSTER is
enabled.
To avoid topologies that will result in questionable or incorrect
scheduling domains, impose restrictions regarding the span of clusters,
as presented to scheduling domains building code: cluster_sibling should
not span more or the same CPUs as cpu_coregroup_mask().
This is needed in order to obtain a strict separation between the MC and
CLS levels, and maintain the same domains for existing platforms in
the presence of CONFIG_SCHED_CLUSTER, where the new cluster information
is redundant and irrelevant for the scheduler.
While previously the scheduling domain builder code would have removed MC
as redundant and kept CLS if SCHED_CLUSTER was enabled and the
cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
now CLS will be removed and MC kept.
Link: https://lore.kernel.org/r/20220704101605.1318280-18-sudeep.holla@arm.com
Cc: Darren Hart <darren@os.amperecomputing.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 26a2b73a7b15a51ec1409648c9b43882f66fbacf
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:16:00 +0100
Currently as we parse the CPU topology from /cpu-map node from the
device tree, we assign generated cluster count as the physical package
identifier for each CPU which is wrong.
The device tree bindings for CPU topology supports sockets to infer
the socket or physical package identifier for a given CPU. Since it is
fairly new and not supported on most of the old and existing systems, we
can assume all such systems have single socket/physical package.
Fix the physical package identifier to 0 by removing the assignment of
cluster identifier to the same.
Link: https://lore.kernel.org/r/20220704101605.1318280-17-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 5a01bb8efb5177236498fc57b147cabd2b792613
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:15:59 +0100
There is no point in looping through all the CPU's physical package
identifier to check if it is valid or not once a CPU which is outside
the topology(i.e. outlier CPU) is found.
Let us just break out of the loop early in such case.
Link: https://lore.kernel.org/r/20220704101605.1318280-16-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 9eb5e54f876dda1aae0aef10bdd61da4331509ba
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:15:58 +0100
Instead of just comparing the cpu topology IDs with -1 to check their
validity, improve that by checking for a valid non-negative value.
Link: https://lore.kernel.org/r/20220704101605.1318280-15-sudeep.holla@arm.com
Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 3f8283296b16c4e43fd79a5ac364ae8171fe1567
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:15:57 +0100
Currently the cluster identifier is not set on the DT based platforms.
The reset or default value is -1 for all the CPUs. Once we assign the
cluster identifier values correctly that may result in getting the thread
siblings wrong as the core identifiers can be same for 2 different CPUs
belonging to 2 different cluster.
So, in order to get the thread sibling cpumasks correct, we need to
update them only if the cores they belong are in the same cluster within
the socket. Let us skip updation of the thread sibling cpumaks if the
cluster identifier doesn't match.
This change won't affect even if the cluster identifiers are not set
currently but will avoid any breakage once we set the same correctly.
Link: https://lore.kernel.org/r/20220704101605.1318280-14-sudeep.holla@arm.com
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 5b8dc787ce4a45c87254d1b0b22f161347ab7f81
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:15:56 +0100
Since the cacheinfo LLC information is used directly in arch_topology,
there is no need to parse and store the LLC ID information only for
ACPI systems in the CPU topology.
Remove the redundant LLC ID from the generic CPU arch_topology
information.
Link: https://lore.kernel.org/r/20220704101605.1318280-13-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit f027db2f9a09e76858d06828b9ff817272d64ccc
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:15:54 +0100
The cacheinfo is now initialised early along with the CPU topology
initialisation. Instead of relying on the LLC ID information parsed
separately only with ACPI PPTT elsewhere, migrate to use the similar
information from the cacheinfo.
This is generic for both DT and ACPI systems. The ACPI LLC ID information
parsed separately can now be removed from arch specific code.
Link: https://lore.kernel.org/r/20220704101605.1318280-11-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit 38db9b95464f82fed28794afe0214d9439d86f7c
Author: Sudeep Holla <sudeep.holla@arm.com>
Date: Mon, 4 Jul 2022 11:15:53 +0100
Currently ACPI populates just the minimum information about the last
level cache from PPTT in order to feed the same to build sched_domains.
Similar support for DT platforms is not present.
In order to enable the same, the entire cache hierarchy information can
be built as part of CPU topoplogy parsing both on ACPI and DT platforms.
Note that this change builds the cacheinfo early even on ACPI systems,
but the current mechanism of building llc_sibling mask remains unchanged.
Link: https://lore.kernel.org/r/20220704101605.1318280-10-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318
commit c3d438eeb5413111889edef10cb5fcc2c0fb8bc9
Author: Lukasz Luba <lukasz.luba@arm.com>
Date: Wed, 27 Apr 2022 09:08:06 +0100
Add trace event to capture the moment of the call for updating the thermal
pressure value. It's helpful to investigate how often those events occur
in a system dealing with throttling. This trace event is needed since the
old 'cdev_update' might not be used by some drivers.
The old 'cdev_update' trace event only provides a cooling state
value: [0, n]. That state value then needs additional tools to translate
it: state -> freq -> capacity -> thermal pressure. This new trace event
just stores proper thermal pressure value in the trace buffer, no need
for additional logic. This is helpful for cooperation when someone can
simply sends to the list the trace buffer output from the platform (no
need from additional information from other subsystems).
There are also platforms which due to some design reasons don't use
cooling devices and thus don't trigger old 'cdev_update' trace event.
They are also important and measuring latency for the thermal signal
raising/decaying characteristics is in scope. This new trace event
would cover them as well.
We already have a trace point 'pelt_thermal_tp' which after a change to
trace event can be paired with this new 'thermal_pressure_update' and
derive more insight what is going on in the system under thermal pressure
(and why).
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20220427080806.1906-1-lukasz.luba@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284
commit 1dc9f1a66e1718479e1c4f95514e1750602a3cb9
Author: Wang Qing <wangqing@vivo.com>
Date: Sun, 10 Apr 2022 19:36:19 -0700
When ACPI is not enabled, cpuid_topo->llc_id = cpu_topo->llc_id = -1, which
will set llc_sibling 0xff(...), this is misleading.
Don't set llc_sibling(default 0) if we don't know the cache topology.
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Wang Qing <wangqing@vivo.com>
Fixes: 37c3ec2d81 ("arm64: topology: divorce MC scheduling domain from core_siblings")
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1649644580-54626-1-git-send-email-wangqing@vivo.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284
commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3
Author: Ionela Voinescu <ionela.voinescu@arm.com>
Date: Thu, 10 Mar 2022 14:54:50 +0000
Define topology_init_cpu_capacity_cppc() to use highest performance
values from _CPC objects to obtain and set maximum capacity information
for each CPU. acpi_cppc_processor_probe() is a good point at which to
trigger the initialization of CPU (u-arch) capacity values, as at this
point the highest performance values can be obtained from each CPU's
_CPC objects. Architectures can therefore use this functionality
through arch_init_invariance_cppc().
The performance scale used by CPPC is a unified scale for all CPUs in
the system. Therefore, by obtaining the raw highest performance values
from the _CPC objects, and normalizing them on the [0, 1024] capacity
scale, used by the task scheduler, we obtain the CPU capacity of each
CPU.
While an ACPI Notify(0x85) could alert about a change in the highest
performance value, which should in turn retrigger the CPU capacity
computations, this notification is not currently handled by the ACPI
processor driver. When supported, a call to arch_init_invariance_cppc()
would perform the update.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284
Conflicts:
drivers/base/arch_topology.c - the first stanza removes
topology_set_thermal_pressure() in its entirety. Upstream also
removed an -EXPORT_SYMBOL_GPL() that centos-9 did not have so
there was technically a conflict but it was resolved by not
removing a line of code that wasn't present anyway.
commit 7e97b3dc2556743dd02612c92a8de7026e8d7dc9
Author: Lukasz Luba <lukasz.luba@arm.com>
Date: Tue, 9 Nov 2021 19:57:14 +0000
There is no need of this function (and related) since code has been
converted to use the new arch_update_thermal_pressure() API. The old
code can be removed.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284
commit c214f124161d446b340597e7c968e0a2dc149142
Author: Lukasz Luba <lukasz.luba@arm.com>
Date: Tue, 9 Nov 2021 19:57:10 +0000
The thermal pressure is a mechanism which is used for providing
information about reduced CPU performance to the scheduler. Usually code
has to convert the value from frequency units into capacity units,
which are understandable by the scheduler. Create a common conversion code
which can be just used via a handy API.
Internally, the topology_update_thermal_pressure() operates on frequency
in MHz and max CPU frequency is taken from 'freq_factor' (per-cpu).
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067252
commit b39214911a548746e7375985c87cdaa8e1a0dfde
Author: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Date: Wed, 29 Sep 2021 03:31:38 +0800
arch_topology.c hasn't use any macro or function declared in linux/percpu.h,
linux/smp.h and linux/string.h.
Thus, these files can be removed from arch_topology.c safely without
affecting the compilation of the drivers/base/ module
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Link: https://lore.kernel.org/r/20210928193138.24192-1-liumh1@shanghaitech.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2047951
commit db1e59483dfd8d4e956575302520bb8f7e20c79b
Author: Darren Hart <darren@os.amperecomputing.com>
Date: Mon, 11 Apr 2022 13:53:34 -0700
Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.
cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.
As a result, build_sched_domain() will BUG() once per CPU with:
BUG: arch topology borken
the CLS domain not a subset of the MC domain
The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max system.
Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER
is enabled to avoid also changing the topology (MC) when
CONFIG_SCHED_CLUSTER is disabled.
This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the CLUSTER topology is above a populated MC topology, but these are not
considered today and can be addressed separately if and when they
appear.
The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
CONFIG_SCHED_CLUSTER=y
CLS [0-1]
DIE [0-79]
NUMA [0-159]
CONFIG_SCHED_CLUSTER is not set
DIE [0-79]
NUMA [0-159]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
Link: https://lore.kernel.org/r/c8fe9fce7c86ed56b4c455b8c902982dc2303868.1649696956.git.darren@os.amperecomputing.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Salter <msalter@redhat.com>
Bugzilla: http://bugzilla.redhat.com/2020279
commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4
Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Fri Sep 24 20:51:02 2021 +1200
topology: Represent clusters of CPUs within a die
Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Nodes Structure representing
a higher level of topology.
For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus.
+-----------------------------------+ +---------+
| +------+ +------+ +--------------------------+ |
| | CPU0 | | cpu1 | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ cluster | | tag | | |
| | CPU2 | | CPU3 | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +----+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | L3 |
| data |
+-----------------------------------+ | |
| +------+ +------+ | +-----------+ | |
| | | | | | | | | |
| +------+ +------+ +----+ L3 | | |
| | | tag | | |
| +------+ +------+ | | | | |
| | | | | | +-----------+ | |
| +------+ +------+ +--------------------------+ |
+-----------------------------------| | |
+-----------------------------------| | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ | | tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +---+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +--+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | +---------+
+-----------------------------------+
That means spreading tasks among clusters will bring more bandwidth
while packing tasks within one cluster will lead to smaller cache
synchronization latency. So both kernel and userspace will have
a chance to leverage this topology to deploy tasks accordingly to
achieve either smaller cache latency within one cluster or an even
distribution of load among clusters for higher throughput.
This patch exposes cluster topology to both kernel and userspace.
Libraried like hwloc will know cluster by cluster_cpus and related
sysfs attributes. PoC of HWLOC support at [2].
Note this patch only handle the ACPI case.
Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).
Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology but that may be unrelate to the cluster
level.
[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
Currently topology_scale_freq_tick() (which gets called from
scheduler_tick()) may end up using a pointer to "struct
scale_freq_data", which was previously cleared by
topology_clear_scale_freq_source(), as there is no protection in place
here. The users of topology_clear_scale_freq_source() though needs a
guarantee that the previously cleared scale_freq_data isn't used
anymore, so they can free the related resources.
Since topology_scale_freq_tick() is called from scheduler tick, we don't
want to add locking in there. Use the RCU update mechanism instead
(which is already used by the scheduler's utilization update path) to
guarantee race free updates here.
synchronize_rcu() makes sure that all RCU critical sections that started
before it is called, will finish before it returns. And so the callers
of topology_clear_scale_freq_source() don't need to worry about their
callback getting called anymore.
Cc: Paul E. McKenney <paulmck@kernel.org>
Fixes: 01e055c120 ("arch_topology: Allow multiple entities to provide sched_freq_tick() callback")
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
It is possible now for other parts of the kernel to provide their own
implementation of sched_freq_tick() and they can very well be modules
themselves (like CPPC cpufreq driver, which is going to use these in a
later commit).
Export arch_freq_scale and topology_{set|clear}_scale_freq_source().
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
This patch attempts to make it generic enough so other parts of the
kernel can also provide their own implementation of scale_freq_tick()
callback, which is called by the scheduler periodically to update the
per-cpu arch_freq_scale variable.
The implementations now need to provide 'struct scale_freq_data' for the
CPUs for which they have hardware counters available, and a callback
gets registered for each possible CPU in a per-cpu variable.
The arch specific (or ARM AMU) counters are updated to adapt to this and
they take the highest priority if they are available, i.e. they will be
used instead of CPPC based counters for example.
The special code to rebuild the sched domains, in case invariance status
change for the system, is moved out of arm64 specific code and is added
to arch_topology.c.
Note that this also defines SCALE_FREQ_SOURCE_CPUFREQ but doesn't use it
and it is added to show that cpufreq is also acts as source of
information for FIE and will be used by default if no other counters are
supported for a platform.
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Will Deacon <will@kernel.org> # for arm64
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Rename freq_scale to a less generic name, as it will get exported soon
for modules. Since x86 already names its own implementation of this as
arch_freq_scale, lets stick to that.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Here is the "big" set of driver core patches for 5.10-rc1
They include a lot of different things, all related to the driver core
and/or some driver logic:
- sysfs common write functions to make it easier to audit sysfs
attributes
- device connection cleanups and fixes
- devm helpers for a few functions
- NOIO allocations for when devices are being removed
- minor cleanups and fixes
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX4c4yA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylS7gCfcS+7/PE42eXxMY0z8rBX8aDMadIAn2DVEghA
Eoh9UoMEW4g1uMKORA0c
=CVAW
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core patches for 5.10-rc1
They include a lot of different things, all related to the driver core
and/or some driver logic:
- sysfs common write functions to make it easier to audit sysfs
attributes
- device connection cleanups and fixes
- devm helpers for a few functions
- NOIO allocations for when devices are being removed
- minor cleanups and fixes
All have been in linux-next for a while with no reported issues"
* tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
regmap: debugfs: use semicolons rather than commas to separate statements
platform/x86: intel_pmc_core: do not create a static struct device
drivers core: node: Use a more typical macro definition style for ACCESS_ATTR
drivers core: Use sysfs_emit for shared_cpu_map_show and shared_cpu_list_show
mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit
drivers core: Miscellaneous changes for sysfs_emit
drivers core: Reindent a couple uses around sysfs_emit
drivers core: Remove strcat uses around sysfs_emit and neaten
drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
dyndbg: use keyword, arg varnames for query term pairs
driver core: force NOIO allocations during unplug
platform_device: switch to simpler IDA interface
driver core: platform: Document return type of more functions
Revert "driver core: Annotate dev_err_probe() with __must_check"
Revert "test_firmware: Test platform fw loading on non-EFI systems"
iio: adc: xilinx-xadc: use devm_krealloc()
hwmon: pmbus: use more devres helpers
devres: provide devm_krealloc()
syscore: Use pm_pr_dbg() for syscore_{suspend,resume}()
...
Compared to other arch_* functions, arch_set_freq_scale() has an atypical
weak definition that can be replaced by a strong architecture specific
implementation.
The more typical support for architectural functions involves defining
an empty stub in a header file if the symbol is not already defined in
architecture code. Some examples involve:
- #define arch_scale_freq_capacity topology_get_freq_scale
- #define arch_scale_freq_invariant topology_scale_freq_invariant
- #define arch_scale_cpu_capacity topology_get_cpu_scale
- #define arch_update_cpu_topology topology_update_cpu_topology
- #define arch_scale_thermal_pressure topology_get_thermal_pressure
- #define arch_set_thermal_pressure topology_set_thermal_pressure
Bring arch_set_freq_scale() in line with these functions by renaming it to
topology_set_freq_scale() in the arch topology driver, and by defining the
arch_set_freq_scale symbol to point to the new function for arm and arm64.
While there are other users of the arch_topology driver, this patch defines
arch_set_freq_scale for arm and arm64 only, due to their existing
definitions of arch_scale_freq_capacity. This is the getter function of the
frequency invariance scale factor and without a getter function, the
setter function - arch_set_freq_scale() has not purpose.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com> (BL_SWITCHER and topology parts)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
arch_scale_freq_invariant() is used by schedutil to determine whether
the scheduler's load-tracking signals are frequency invariant. Its
definition is overridable, though by default it is hardcoded to 'true'
if arch_scale_freq_capacity() is defined ('false' otherwise).
This behaviour is not overridden on arm, arm64 and other users of the
generic arch topology driver, which is somewhat precarious:
arch_scale_freq_capacity() will always be defined, yet not all cpufreq
drivers are guaranteed to drive the frequency invariance scale factor
setting. In other words, the load-tracking signals may very well *not*
be frequency invariant.
Now that cpufreq can be queried on whether the current driver is driving
the Frequency Invariance (FI) scale setting, the current situation can
be improved. This combines the query of whether cpufreq supports the
setting of the frequency scale factor, with whether all online CPUs are
counter-based FI enabled.
While cpufreq FI enablement applies at system level, for all CPUs,
counter-based FI support could also be used for only a subset of CPUs to
set the invariance scale factor. Therefore, if cpufreq-based FI support
is present, we consider the system to be invariant. If missing, we
require all online CPUs to be counter-based FI enabled in order for the
full system to be considered invariant.
If the system ends up not being invariant, a new condition is needed in
the counter initialization code that disables all scale factor setting
based on counters.
Precedence of counters over cpufreq use is not important here. The
invariant status is only given to the system if all CPUs have at least
one method of setting the frequency scale factor.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The passed cpumask arguments to arch_set_freq_scale() and
arch_freq_counters_available() are only iterated over, so reflect this
in the prototype. This also allows to pass system cpumasks like
cpu_online_mask without getting a warning.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current frequency passed to arch_set_freq_scale() could end up
being 0, signaling an error in setting a new frequency. Also, if the
maximum frequency in 0, this will result in a division by 0 error.
Therefore, validate these input values before using them for the
setting of the frequency scale factor.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The following commit:
14533a16c4 ("thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code")
moved the definition of arch_set_thermal_pressure() to sched/core.c, but
kept its declaration in linux/arch_topology.h. When building e.g. an x86
kernel with CONFIG_SCHED_THERMAL_PRESSURE=y, cpufreq_cooling.c ends up
getting the declaration of arch_set_thermal_pressure() from
include/linux/arch_topology.h, which is somewhat awkward.
On top of this, sched/core.c unconditionally defines
o The thermal_pressure percpu variable
o arch_set_thermal_pressure()
while arch_scale_thermal_pressure() does nothing unless redefined by the
architecture.
arch_*() functions are meant to be defined by architectures, so revert the
aforementioned commit and re-implement it in a way that keeps
arch_set_thermal_pressure() architecture-definable, and doesn't define the
thermal pressure percpu variable for kernels that don't need
it (CONFIG_SCHED_THERMAL_PRESSURE=n).
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200712165917.9168-2-valentin.schneider@arm.com
- In-kernel Pointer Authentication support (previously only offered to
user space).
- ARM Activity Monitors (AMU) extension support allowing better CPU
utilisation numbers for the scheduler (frequency invariance).
- Memory hot-remove support for arm64.
- Lots of asm annotations (SYM_*) in preparation for the in-kernel
Branch Target Identification (BTI) support.
- arm64 perf updates: ARMv8.5-PMU 64-bit counters, refactoring the PMU
init callbacks, support for new DT compatibles.
- IPv6 header checksum optimisation.
- Fixes: SDEI (software delegated exception interface) double-lock on
hibernate with shared events.
- Minor clean-ups and refactoring: cpu_ops accessor, cpu_do_switch_mm()
converted to C, cpufeature finalisation helper.
- sys_mremap() comment explaining the asymmetric address untagging
behaviour.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl6DVyIACgkQa9axLQDI
XvHkqRAAiZA2EYKiQL4M1DJ1cNTADjT7xKX9+UtYBXj7GMVhgVWdunpHVE6qtfgk
cT6avmKrS/6PDqizJgr+Z1yX8x3Kvs57G4BvmIUKIw97mkdewvFQ9JKv6VA1vb86
7Qrl1WzqsGg5Kj9uUfI4h+ZoT1H4C/9PQeFxJwgZRtF9DxRh8O7VeZI+JCu8Aub2
lIkjI8rh+EpTsGT9h/PMGWUcawnKQloZ1/F+GfMAuYBvIv2RNN2xVreJtTmm4NyJ
VcpL0KCNyAI2lGdaJg5nBLRDyGuXDm5i+PLsCSXMquI4fie00txXeD8sjbeuO0ks
YTJ0EhmUUhbSE17go+SxYiEFE0v09i+lD5ud+B4Vmojp0KTczTta9VSgURlbb2/9
n9biq5G3PPDNIrZqiTT2Tf4AMz1350nkbzL2gzKecM5aIzR/u3y5yII5CgfZtFnj
7bGbyFpFpcqI7UaISPsNCxmknbTt/7ff0WM3+7SbecxI3AD2mnxsOdN9JTLyhDp+
owjyiaWxl5zMWF9DhplLG/9BKpNWSxh3skazdOdELd8GTq2MbJlXrVG2XgXTAOh3
y1s6RQrfw8zXh8TSqdmmzauComXIRWTum/sbVB3U8Z3AUsIeq/NTSbN5X9JyIbOP
HOabhlVhhkI6omN1grqPX4jwUiZLZoNfn7Ez4q71549KVK/uBtA=
=LJVX
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The bulk is in-kernel pointer authentication, activity monitors and
lots of asm symbol annotations. I also queued the sys_mremap() patch
commenting the asymmetry in the address untagging.
Summary:
- In-kernel Pointer Authentication support (previously only offered
to user space).
- ARM Activity Monitors (AMU) extension support allowing better CPU
utilisation numbers for the scheduler (frequency invariance).
- Memory hot-remove support for arm64.
- Lots of asm annotations (SYM_*) in preparation for the in-kernel
Branch Target Identification (BTI) support.
- arm64 perf updates: ARMv8.5-PMU 64-bit counters, refactoring the
PMU init callbacks, support for new DT compatibles.
- IPv6 header checksum optimisation.
- Fixes: SDEI (software delegated exception interface) double-lock on
hibernate with shared events.
- Minor clean-ups and refactoring: cpu_ops accessor,
cpu_do_switch_mm() converted to C, cpufeature finalisation helper.
- sys_mremap() comment explaining the asymmetric address untagging
behaviour"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
mm/mremap: Add comment explaining the untagging behaviour of mremap()
arm64: head: Convert install_el2_stub to SYM_INNER_LABEL
arm64: Introduce get_cpu_ops() helper function
arm64: Rename cpu_read_ops() to init_cpu_ops()
arm64: Declare ACPI parking protocol CPU operation if needed
arm64: move kimage_vaddr to .rodata
arm64: use mov_q instead of literal ldr
arm64: Kconfig: verify binutils support for ARM64_PTR_AUTH
lkdtm: arm64: test kernel pointer authentication
arm64: compile the kernel with ptrauth return address signing
kconfig: Add support for 'as-option'
arm64: suspend: restore the kernel ptrauth keys
arm64: __show_regs: strip PAC from lr in printk
arm64: unwind: strip PAC from kernel addresses
arm64: mask PAC bits of __builtin_return_address
arm64: initialize ptrauth keys for kernel booting task
arm64: initialize and switch ptrauth kernel keys
arm64: enable ptrauth earlier
arm64: cpufeature: handle conflicts based on capability
arm64: cpufeature: Move cpu capability helpers inside C file
...