Centos-kernel-stream-9/drivers/base/arch_topology.c

898 lines
22 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0
/*
* Arch specific cpu topology information
*
* Copyright (C) 2016, ARM Ltd.
* Written by: Juri Lelli, ARM Ltd.
*/
#include <linux/acpi.h>
#include <linux/cacheinfo.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
#include <linux/device.h>
#include <linux/of.h>
#include <linux/slab.h>
#include <linux/sched/topology.h>
#include <linux/cpuset.h>
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
#include <linux/units.h>
arch_topology: Trace the update thermal pressure Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318 commit c3d438eeb5413111889edef10cb5fcc2c0fb8bc9 Author: Lukasz Luba <lukasz.luba@arm.com> Date: Wed, 27 Apr 2022 09:08:06 +0100 Add trace event to capture the moment of the call for updating the thermal pressure value. It's helpful to investigate how often those events occur in a system dealing with throttling. This trace event is needed since the old 'cdev_update' might not be used by some drivers. The old 'cdev_update' trace event only provides a cooling state value: [0, n]. That state value then needs additional tools to translate it: state -> freq -> capacity -> thermal pressure. This new trace event just stores proper thermal pressure value in the trace buffer, no need for additional logic. This is helpful for cooperation when someone can simply sends to the list the trace buffer output from the platform (no need from additional information from other subsystems). There are also platforms which due to some design reasons don't use cooling devices and thus don't trigger old 'cdev_update' trace event. They are also important and measuring latency for the thermal signal raising/decaying characteristics is in scope. This new trace event would cover them as well. We already have a trace point 'pelt_thermal_tp' which after a change to trace event can be paired with this new 'thermal_pressure_update' and derive more insight what is going on in the system under thermal pressure (and why). Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20220427080806.1906-1-lukasz.luba@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-08-22 18:19:43 +00:00
#define CREATE_TRACE_POINTS
#include <trace/events/hw_pressure.h>
arch_topology: Trace the update thermal pressure Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318 commit c3d438eeb5413111889edef10cb5fcc2c0fb8bc9 Author: Lukasz Luba <lukasz.luba@arm.com> Date: Wed, 27 Apr 2022 09:08:06 +0100 Add trace event to capture the moment of the call for updating the thermal pressure value. It's helpful to investigate how often those events occur in a system dealing with throttling. This trace event is needed since the old 'cdev_update' might not be used by some drivers. The old 'cdev_update' trace event only provides a cooling state value: [0, n]. That state value then needs additional tools to translate it: state -> freq -> capacity -> thermal pressure. This new trace event just stores proper thermal pressure value in the trace buffer, no need for additional logic. This is helpful for cooperation when someone can simply sends to the list the trace buffer output from the platform (no need from additional information from other subsystems). There are also platforms which due to some design reasons don't use cooling devices and thus don't trigger old 'cdev_update' trace event. They are also important and measuring latency for the thermal signal raising/decaying characteristics is in scope. This new trace event would cover them as well. We already have a trace point 'pelt_thermal_tp' which after a change to trace event can be paired with this new 'thermal_pressure_update' and derive more insight what is going on in the system under thermal pressure (and why). Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20220427080806.1906-1-lukasz.luba@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-08-22 18:19:43 +00:00
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);
static bool supports_scale_freq_counters(const struct cpumask *cpus)
{
return cpumask_subset(cpus, &scale_freq_counters_mask);
}
arch_topology, arm, arm64: define arch_scale_freq_invariant() arch_scale_freq_invariant() is used by schedutil to determine whether the scheduler's load-tracking signals are frequency invariant. Its definition is overridable, though by default it is hardcoded to 'true' if arch_scale_freq_capacity() is defined ('false' otherwise). This behaviour is not overridden on arm, arm64 and other users of the generic arch topology driver, which is somewhat precarious: arch_scale_freq_capacity() will always be defined, yet not all cpufreq drivers are guaranteed to drive the frequency invariance scale factor setting. In other words, the load-tracking signals may very well *not* be frequency invariant. Now that cpufreq can be queried on whether the current driver is driving the Frequency Invariance (FI) scale setting, the current situation can be improved. This combines the query of whether cpufreq supports the setting of the frequency scale factor, with whether all online CPUs are counter-based FI enabled. While cpufreq FI enablement applies at system level, for all CPUs, counter-based FI support could also be used for only a subset of CPUs to set the invariance scale factor. Therefore, if cpufreq-based FI support is present, we consider the system to be invariant. If missing, we require all online CPUs to be counter-based FI enabled in order for the full system to be considered invariant. If the system ends up not being invariant, a new condition is needed in the counter initialization code that disables all scale factor setting based on counters. Precedence of counters over cpufreq use is not important here. The invariant status is only given to the system if all CPUs have at least one method of setting the frequency scale factor. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-01 20:55:49 +00:00
bool topology_scale_freq_invariant(void)
{
return cpufreq_supports_freq_invariance() ||
supports_scale_freq_counters(cpu_online_mask);
}
static void update_scale_freq_invariant(bool status)
{
if (scale_freq_invariant == status)
return;
/*
* Task scheduler behavior depends on frequency invariance support,
* either cpufreq or counter driven. If the support status changes as
* a result of counter initialisation and use, retrigger the build of
* scheduling domains to ensure the information is propagated properly.
*/
if (topology_scale_freq_invariant() == status) {
scale_freq_invariant = status;
rebuild_sched_domains_energy();
}
}
void topology_set_scale_freq_source(struct scale_freq_data *data,
const struct cpumask *cpus)
{
struct scale_freq_data *sfd;
int cpu;
/*
* Avoid calling rebuild_sched_domains() unnecessarily if FIE is
* supported by cpufreq.
*/
if (cpumask_empty(&scale_freq_counters_mask))
scale_freq_invariant = topology_scale_freq_invariant();
rcu_read_lock();
for_each_cpu(cpu, cpus) {
sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
/* Use ARCH provided counters whenever possible */
if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
rcu_assign_pointer(per_cpu(sft_data, cpu), data);
cpumask_set_cpu(cpu, &scale_freq_counters_mask);
}
}
rcu_read_unlock();
update_scale_freq_invariant(true);
arch_topology, arm, arm64: define arch_scale_freq_invariant() arch_scale_freq_invariant() is used by schedutil to determine whether the scheduler's load-tracking signals are frequency invariant. Its definition is overridable, though by default it is hardcoded to 'true' if arch_scale_freq_capacity() is defined ('false' otherwise). This behaviour is not overridden on arm, arm64 and other users of the generic arch topology driver, which is somewhat precarious: arch_scale_freq_capacity() will always be defined, yet not all cpufreq drivers are guaranteed to drive the frequency invariance scale factor setting. In other words, the load-tracking signals may very well *not* be frequency invariant. Now that cpufreq can be queried on whether the current driver is driving the Frequency Invariance (FI) scale setting, the current situation can be improved. This combines the query of whether cpufreq supports the setting of the frequency scale factor, with whether all online CPUs are counter-based FI enabled. While cpufreq FI enablement applies at system level, for all CPUs, counter-based FI support could also be used for only a subset of CPUs to set the invariance scale factor. Therefore, if cpufreq-based FI support is present, we consider the system to be invariant. If missing, we require all online CPUs to be counter-based FI enabled in order for the full system to be considered invariant. If the system ends up not being invariant, a new condition is needed in the counter initialization code that disables all scale factor setting based on counters. Precedence of counters over cpufreq use is not important here. The invariant status is only given to the system if all CPUs have at least one method of setting the frequency scale factor. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-01 20:55:49 +00:00
}
EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
arch_topology, arm, arm64: define arch_scale_freq_invariant() arch_scale_freq_invariant() is used by schedutil to determine whether the scheduler's load-tracking signals are frequency invariant. Its definition is overridable, though by default it is hardcoded to 'true' if arch_scale_freq_capacity() is defined ('false' otherwise). This behaviour is not overridden on arm, arm64 and other users of the generic arch topology driver, which is somewhat precarious: arch_scale_freq_capacity() will always be defined, yet not all cpufreq drivers are guaranteed to drive the frequency invariance scale factor setting. In other words, the load-tracking signals may very well *not* be frequency invariant. Now that cpufreq can be queried on whether the current driver is driving the Frequency Invariance (FI) scale setting, the current situation can be improved. This combines the query of whether cpufreq supports the setting of the frequency scale factor, with whether all online CPUs are counter-based FI enabled. While cpufreq FI enablement applies at system level, for all CPUs, counter-based FI support could also be used for only a subset of CPUs to set the invariance scale factor. Therefore, if cpufreq-based FI support is present, we consider the system to be invariant. If missing, we require all online CPUs to be counter-based FI enabled in order for the full system to be considered invariant. If the system ends up not being invariant, a new condition is needed in the counter initialization code that disables all scale factor setting based on counters. Precedence of counters over cpufreq use is not important here. The invariant status is only given to the system if all CPUs have at least one method of setting the frequency scale factor. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-01 20:55:49 +00:00
void topology_clear_scale_freq_source(enum scale_freq_source source,
const struct cpumask *cpus)
arm64: use activity monitors for frequency invariance The Frequency Invariance Engine (FIE) is providing a frequency scaling correction factor that helps achieve more accurate load-tracking. So far, for arm and arm64 platforms, this scale factor has been obtained based on the ratio between the current frequency and the maximum supported frequency recorded by the cpufreq policy. The setting of this scale factor is triggered from cpufreq drivers by calling arch_set_freq_scale. The current frequency used in computation is the frequency requested by a governor, but it may not be the frequency that was implemented by the platform. This correction factor can also be obtained using a core counter and a constant counter to get information on the performance (frequency based only) obtained in a period of time. This will more accurately reflect the actual current frequency of the CPU, compared with the alternative implementation that reflects the request of a performance level from the OS. Therefore, implement arch_scale_freq_tick to use activity monitors, if present, for the computation of the frequency scale factor. The use of AMU counters depends on: - CONFIG_ARM64_AMU_EXTN - depents on the AMU extension being present - CONFIG_CPU_FREQ - the current frequency obtained using counter information is divided by the maximum frequency obtained from the cpufreq policy. While it is possible to have a combination of CPUs in the system with and without support for activity monitors, the use of counters for frequency invariance is only enabled for a CPU if all related CPUs (CPUs in the same frequency domain) support and have enabled the core and constant activity monitor counters. In this way, there is a clear separation between the policies for which arch_set_freq_scale (cpufreq based FIE) is used, and the policies for which arch_scale_freq_tick (counter based FIE) is used to set the frequency scale factor. For this purpose, a late_initcall_sync is registered to trigger validation work for policies that will enable or disable the use of AMU counters for frequency invariance. If CONFIG_CPU_FREQ is not defined, the use of counters is enabled on all CPUs only if all possible CPUs correctly support the necessary counters. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-05 09:06:26 +00:00
{
struct scale_freq_data *sfd;
int cpu;
rcu_read_lock();
for_each_cpu(cpu, cpus) {
sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
if (sfd && sfd->source == source) {
rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
}
}
rcu_read_unlock();
/*
* Make sure all references to previous sft_data are dropped to avoid
* use-after-free races.
*/
synchronize_rcu();
update_scale_freq_invariant(false);
arm64: use activity monitors for frequency invariance The Frequency Invariance Engine (FIE) is providing a frequency scaling correction factor that helps achieve more accurate load-tracking. So far, for arm and arm64 platforms, this scale factor has been obtained based on the ratio between the current frequency and the maximum supported frequency recorded by the cpufreq policy. The setting of this scale factor is triggered from cpufreq drivers by calling arch_set_freq_scale. The current frequency used in computation is the frequency requested by a governor, but it may not be the frequency that was implemented by the platform. This correction factor can also be obtained using a core counter and a constant counter to get information on the performance (frequency based only) obtained in a period of time. This will more accurately reflect the actual current frequency of the CPU, compared with the alternative implementation that reflects the request of a performance level from the OS. Therefore, implement arch_scale_freq_tick to use activity monitors, if present, for the computation of the frequency scale factor. The use of AMU counters depends on: - CONFIG_ARM64_AMU_EXTN - depents on the AMU extension being present - CONFIG_CPU_FREQ - the current frequency obtained using counter information is divided by the maximum frequency obtained from the cpufreq policy. While it is possible to have a combination of CPUs in the system with and without support for activity monitors, the use of counters for frequency invariance is only enabled for a CPU if all related CPUs (CPUs in the same frequency domain) support and have enabled the core and constant activity monitor counters. In this way, there is a clear separation between the policies for which arch_set_freq_scale (cpufreq based FIE) is used, and the policies for which arch_scale_freq_tick (counter based FIE) is used to set the frequency scale factor. For this purpose, a late_initcall_sync is registered to trigger validation work for policies that will enable or disable the use of AMU counters for frequency invariance. If CONFIG_CPU_FREQ is not defined, the use of counters is enabled on all CPUs only if all possible CPUs correctly support the necessary counters. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-05 09:06:26 +00:00
}
EXPORT_SYMBOL_GPL(topology_clear_scale_freq_source);
void topology_scale_freq_tick(void)
{
struct scale_freq_data *sfd = rcu_dereference_sched(*this_cpu_ptr(&sft_data));
if (sfd)
sfd->set_freq_scale();
}
DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale() Compared to other arch_* functions, arch_set_freq_scale() has an atypical weak definition that can be replaced by a strong architecture specific implementation. The more typical support for architectural functions involves defining an empty stub in a header file if the symbol is not already defined in architecture code. Some examples involve: - #define arch_scale_freq_capacity topology_get_freq_scale - #define arch_scale_freq_invariant topology_scale_freq_invariant - #define arch_scale_cpu_capacity topology_get_cpu_scale - #define arch_update_cpu_topology topology_update_cpu_topology - #define arch_scale_thermal_pressure topology_get_thermal_pressure - #define arch_set_thermal_pressure topology_set_thermal_pressure Bring arch_set_freq_scale() in line with these functions by renaming it to topology_set_freq_scale() in the arch topology driver, and by defining the arch_set_freq_scale symbol to point to the new function for arm and arm64. While there are other users of the arch_topology driver, this patch defines arch_set_freq_scale for arm and arm64 only, due to their existing definitions of arch_scale_freq_capacity. This is the getter function of the frequency invariance scale factor and without a getter function, the setter function - arch_set_freq_scale() has not purpose. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> (BL_SWITCHER and topology parts) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-24 12:30:15 +00:00
void topology_set_freq_scale(const struct cpumask *cpus, unsigned long cur_freq,
unsigned long max_freq)
{
unsigned long scale;
int i;
if (WARN_ON_ONCE(!cur_freq || !max_freq))
return;
arm64: use activity monitors for frequency invariance The Frequency Invariance Engine (FIE) is providing a frequency scaling correction factor that helps achieve more accurate load-tracking. So far, for arm and arm64 platforms, this scale factor has been obtained based on the ratio between the current frequency and the maximum supported frequency recorded by the cpufreq policy. The setting of this scale factor is triggered from cpufreq drivers by calling arch_set_freq_scale. The current frequency used in computation is the frequency requested by a governor, but it may not be the frequency that was implemented by the platform. This correction factor can also be obtained using a core counter and a constant counter to get information on the performance (frequency based only) obtained in a period of time. This will more accurately reflect the actual current frequency of the CPU, compared with the alternative implementation that reflects the request of a performance level from the OS. Therefore, implement arch_scale_freq_tick to use activity monitors, if present, for the computation of the frequency scale factor. The use of AMU counters depends on: - CONFIG_ARM64_AMU_EXTN - depents on the AMU extension being present - CONFIG_CPU_FREQ - the current frequency obtained using counter information is divided by the maximum frequency obtained from the cpufreq policy. While it is possible to have a combination of CPUs in the system with and without support for activity monitors, the use of counters for frequency invariance is only enabled for a CPU if all related CPUs (CPUs in the same frequency domain) support and have enabled the core and constant activity monitor counters. In this way, there is a clear separation between the policies for which arch_set_freq_scale (cpufreq based FIE) is used, and the policies for which arch_scale_freq_tick (counter based FIE) is used to set the frequency scale factor. For this purpose, a late_initcall_sync is registered to trigger validation work for policies that will enable or disable the use of AMU counters for frequency invariance. If CONFIG_CPU_FREQ is not defined, the use of counters is enabled on all CPUs only if all possible CPUs correctly support the necessary counters. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-05 09:06:26 +00:00
/*
* If the use of counters for FIE is enabled, just return as we don't
* want to update the scale factor with information from CPUFREQ.
* Instead the scale factor will be updated from arch_scale_freq_tick.
*/
if (supports_scale_freq_counters(cpus))
arm64: use activity monitors for frequency invariance The Frequency Invariance Engine (FIE) is providing a frequency scaling correction factor that helps achieve more accurate load-tracking. So far, for arm and arm64 platforms, this scale factor has been obtained based on the ratio between the current frequency and the maximum supported frequency recorded by the cpufreq policy. The setting of this scale factor is triggered from cpufreq drivers by calling arch_set_freq_scale. The current frequency used in computation is the frequency requested by a governor, but it may not be the frequency that was implemented by the platform. This correction factor can also be obtained using a core counter and a constant counter to get information on the performance (frequency based only) obtained in a period of time. This will more accurately reflect the actual current frequency of the CPU, compared with the alternative implementation that reflects the request of a performance level from the OS. Therefore, implement arch_scale_freq_tick to use activity monitors, if present, for the computation of the frequency scale factor. The use of AMU counters depends on: - CONFIG_ARM64_AMU_EXTN - depents on the AMU extension being present - CONFIG_CPU_FREQ - the current frequency obtained using counter information is divided by the maximum frequency obtained from the cpufreq policy. While it is possible to have a combination of CPUs in the system with and without support for activity monitors, the use of counters for frequency invariance is only enabled for a CPU if all related CPUs (CPUs in the same frequency domain) support and have enabled the core and constant activity monitor counters. In this way, there is a clear separation between the policies for which arch_set_freq_scale (cpufreq based FIE) is used, and the policies for which arch_scale_freq_tick (counter based FIE) is used to set the frequency scale factor. For this purpose, a late_initcall_sync is registered to trigger validation work for policies that will enable or disable the use of AMU counters for frequency invariance. If CONFIG_CPU_FREQ is not defined, the use of counters is enabled on all CPUs only if all possible CPUs correctly support the necessary counters. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-05 09:06:26 +00:00
return;
scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
for_each_cpu(i, cpus)
per_cpu(arch_freq_scale, i) = scale;
}
DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
{
per_cpu(cpu_scale, cpu) = capacity;
}
DEFINE_PER_CPU(unsigned long, hw_pressure);
/**
* topology_update_hw_pressure() - Update HW pressure for CPUs
* @cpus : The related CPUs for which capacity has been reduced
* @capped_freq : The maximum allowed frequency that CPUs can run at
*
* Update the value of HW pressure for all @cpus in the mask. The
* cpumask should include all (online+offline) affected CPUs, to avoid
* operating on stale data when hot-plug is used for some CPUs. The
* @capped_freq reflects the currently allowed max CPUs frequency due to
* HW capping. It might be also a boost frequency value, which is bigger
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
* than the internal 'capacity_freq_ref' max frequency. In such case the
* pressure value should simply be removed, since this is an indication that
* there is no HW throttling. The @capped_freq must be provided in kHz.
*/
void topology_update_hw_pressure(const struct cpumask *cpus,
unsigned long capped_freq)
{
unsigned long max_capacity, capacity, pressure;
u32 max_freq;
int cpu;
cpu = cpumask_first(cpus);
max_capacity = arch_scale_cpu_capacity(cpu);
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
max_freq = arch_scale_freq_ref(cpu);
/*
* Handle properly the boost frequencies, which should simply clean
* the HW pressure value.
*/
if (max_freq <= capped_freq)
capacity = max_capacity;
else
capacity = mult_frac(max_capacity, capped_freq, max_freq);
pressure = max_capacity - capacity;
trace_hw_pressure_update(cpu, pressure);
arch_topology: Trace the update thermal pressure Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318 commit c3d438eeb5413111889edef10cb5fcc2c0fb8bc9 Author: Lukasz Luba <lukasz.luba@arm.com> Date: Wed, 27 Apr 2022 09:08:06 +0100 Add trace event to capture the moment of the call for updating the thermal pressure value. It's helpful to investigate how often those events occur in a system dealing with throttling. This trace event is needed since the old 'cdev_update' might not be used by some drivers. The old 'cdev_update' trace event only provides a cooling state value: [0, n]. That state value then needs additional tools to translate it: state -> freq -> capacity -> thermal pressure. This new trace event just stores proper thermal pressure value in the trace buffer, no need for additional logic. This is helpful for cooperation when someone can simply sends to the list the trace buffer output from the platform (no need from additional information from other subsystems). There are also platforms which due to some design reasons don't use cooling devices and thus don't trigger old 'cdev_update' trace event. They are also important and measuring latency for the thermal signal raising/decaying characteristics is in scope. This new trace event would cover them as well. We already have a trace point 'pelt_thermal_tp' which after a change to trace event can be paired with this new 'thermal_pressure_update' and derive more insight what is going on in the system under thermal pressure (and why). Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20220427080806.1906-1-lukasz.luba@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-08-22 18:19:43 +00:00
for_each_cpu(cpu, cpus)
WRITE_ONCE(per_cpu(hw_pressure, cpu), pressure);
}
EXPORT_SYMBOL_GPL(topology_update_hw_pressure);
static ssize_t cpu_capacity_show(struct device *dev,
struct device_attribute *attr,
char *buf)
{
struct cpu *cpu = container_of(dev, struct cpu, dev);
drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions Convert the various sprintf fmaily calls in sysfs device show functions to sysfs_emit and sysfs_emit_at for PAGE_SIZE buffer safety. Done with: $ spatch -sp-file sysfs_emit_dev.cocci --in-place --max-width=80 . And cocci script: $ cat sysfs_emit_dev.cocci @@ identifier d_show; identifier dev, attr, buf; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - sprintf(buf, + sysfs_emit(buf, ...); ...> } @@ identifier d_show; identifier dev, attr, buf; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - snprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> } @@ identifier d_show; identifier dev, attr, buf; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - scnprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> } @@ identifier d_show; identifier dev, attr, buf; expression chr; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - strcpy(buf, chr); + sysfs_emit(buf, chr); ...> } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... len = - sprintf(buf, + sysfs_emit(buf, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... len = - snprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... len = - scnprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... - len += scnprintf(buf + len, PAGE_SIZE - len, + len += sysfs_emit_at(buf, len, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; expression chr; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { ... - strcpy(buf, chr); - return strlen(buf); + return sysfs_emit(buf, chr); } Signed-off-by: Joe Perches <joe@perches.com> Link: https://lore.kernel.org/r/3d033c33056d88bbe34d4ddb62afd05ee166ab9a.1600285923.git.joe@perches.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-16 20:40:39 +00:00
return sysfs_emit(buf, "%lu\n", topology_get_cpu_scale(cpu->dev.id));
}
static void update_topology_flags_workfn(struct work_struct *work);
static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
arch_topology: Make cpu_capacity sysfs node as read-only If user updates any cpu's cpu_capacity, then the new value is going to be applied to all its online sibling cpus. But this need not to be correct always, as sibling cpus (in ARM, same micro architecture cpus) would have different cpu_capacity with different performance characteristics. So, updating the user supplied cpu_capacity to all cpu siblings is not correct. And another problem is, current code assumes that 'all cpus in a cluster or with same package_id (core_siblings), would have same cpu_capacity'. But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove cpu topology sibling masks")', when a cpu hotplugged out, the cpu information gets cleared in its sibling cpus. So, user supplied cpu_capacity would be applied to only online sibling cpus at the time. After that, if any cpu hotplugged in, it would have different cpu_capacity than its siblings, which breaks the above assumption. So, instead of mucking around the core sibling mask for user supplied value, use device-tree to set cpu capacity. And make the cpu_capacity node as read-only to know the asymmetry between cpus in the system. While at it, remove cpu_scale_mutex usage, which used for sysfs write protection. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Quentin Perret <quentin.perret@arm.com> Reviewed-by: Quentin Perret <quentin.perret@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-01 04:24:41 +00:00
static DEVICE_ATTR_RO(cpu_capacity);
static int register_cpu_capacity_sysctl(void)
{
int i;
struct device *cpu;
for_each_possible_cpu(i) {
cpu = get_cpu_device(i);
if (!cpu) {
pr_err("%s: too early to get CPU%d device!\n",
__func__, i);
continue;
}
device_create_file(cpu, &dev_attr_cpu_capacity);
}
return 0;
}
subsys_initcall(register_cpu_capacity_sysctl);
static int update_topology;
int topology_update_cpu_topology(void)
{
return update_topology;
}
/*
* Updating the sched_domains can't be done directly from cpufreq callbacks
* due to locking, so queue the work for later.
*/
static void update_topology_flags_workfn(struct work_struct *work)
{
update_topology = 1;
rebuild_sched_domains();
pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
update_topology = 0;
}
static u32 *raw_capacity;
static int free_raw_capacity(void)
{
kfree(raw_capacity);
raw_capacity = NULL;
return 0;
}
void topology_normalize_cpu_scale(void)
{
u64 capacity;
u64 capacity_scale;
int cpu;
if (!raw_capacity)
return;
capacity_scale = 1;
for_each_possible_cpu(cpu) {
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity_scale = max(capacity, capacity_scale);
}
pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(cpu));
}
}
bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
{
struct clk *cpu_clk;
static bool cap_parsing_failed;
int ret;
u32 cpu_capacity;
if (cap_parsing_failed)
return false;
ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",
&cpu_capacity);
if (!ret) {
if (!raw_capacity) {
raw_capacity = kcalloc(num_possible_cpus(),
sizeof(*raw_capacity),
GFP_KERNEL);
if (!raw_capacity) {
cap_parsing_failed = true;
return false;
}
}
raw_capacity[cpu] = cpu_capacity;
pr_debug("cpu_capacity: %pOF cpu_capacity=%u (raw)\n",
cpu_node, raw_capacity[cpu]);
/*
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
* Update capacity_freq_ref for calculating early boot CPU capacities.
* For non-clk CPU DVFS mechanism, there's no way to get the
* frequency value now, assuming they are running at the same
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
* frequency (by keeping the initial capacity_freq_ref value).
*/
cpu_clk = of_clk_get(cpu_node, 0);
if (!PTR_ERR_OR_ZERO(cpu_clk)) {
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
per_cpu(capacity_freq_ref, cpu) =
clk_get_rate(cpu_clk) / HZ_PER_KHZ;
clk_put(cpu_clk);
}
} else {
if (raw_capacity) {
pr_err("cpu_capacity: missing %pOF raw capacity\n",
cpu_node);
pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
}
cap_parsing_failed = true;
free_raw_capacity();
}
return !ret;
}
void __weak freq_inv_set_max_ratio(int cpu, u64 max_rate)
{
}
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
#ifdef CONFIG_ACPI_CPPC_LIB
#include <acpi/cppc_acpi.h>
ACPI: processor: Move arch_init_invariance_cppc() call later JIRA: https://issues.redhat.com/browse/RHEL-75923 commit b79276dcac9124a79c8cf7cc8fbdd3d4c3c9a7c7 Author: Mario Limonciello <mario.limonciello@amd.com> Date: Mon Nov 4 16:28:55 2024 -0600 ACPI: processor: Move arch_init_invariance_cppc() call later arch_init_invariance_cppc() is called at the end of acpi_cppc_processor_probe() in order to configure frequency invariance based upon the values from _CPC. This however doesn't work on AMD CPPC shared memory designs that have AMD preferred cores enabled because _CPC needs to be analyzed from all cores to judge if preferred cores are enabled. This issue manifests to users as a warning since commit 21fb59ab4b97 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"): ``` Could not retrieve highest performance (-19) ``` However the warning isn't the cause of this, it was actually commit 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") which exposed the issue. To fix this problem, change arch_init_invariance_cppc() into a new weak symbol that is called at the end of acpi_processor_driver_init(). Each architecture that supports it can declare the symbol to override the weak one. Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the architectures using the generic arch_topology.c code. Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org [ rjw: Changelog edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: David Arcari <darcari@redhat.com>
2025-02-18 12:30:14 +00:00
static inline void topology_init_cpu_capacity_cppc(void)
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
{
u64 capacity, capacity_scale = 0;
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
struct cppc_perf_caps perf_caps;
int cpu;
if (likely(!acpi_cpc_valid()))
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
return;
raw_capacity = kcalloc(num_possible_cpus(), sizeof(*raw_capacity),
GFP_KERNEL);
if (!raw_capacity)
return;
for_each_possible_cpu(cpu) {
if (!cppc_get_perf_caps(cpu, &perf_caps) &&
(perf_caps.highest_perf >= perf_caps.nominal_perf) &&
(perf_caps.highest_perf >= perf_caps.lowest_perf)) {
raw_capacity[cpu] = perf_caps.highest_perf;
capacity_scale = max_t(u64, capacity_scale, raw_capacity[cpu]);
per_cpu(capacity_freq_ref, cpu) = cppc_perf_to_khz(&perf_caps, raw_capacity[cpu]);
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
pr_debug("cpu_capacity: CPU%d cpu_capacity=%u (raw).\n",
cpu, raw_capacity[cpu]);
continue;
}
pr_err("cpu_capacity: CPU%d missing/invalid highest performance.\n", cpu);
pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
goto exit;
}
for_each_possible_cpu(cpu) {
freq_inv_set_max_ratio(cpu,
per_cpu(capacity_freq_ref, cpu) * HZ_PER_KHZ);
capacity = raw_capacity[cpu];
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(cpu));
}
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
schedule_work(&update_topology_flags_work);
pr_debug("cpu_capacity: cpu_capacity initialization done\n");
exit:
free_raw_capacity();
}
ACPI: processor: Move arch_init_invariance_cppc() call later JIRA: https://issues.redhat.com/browse/RHEL-75923 commit b79276dcac9124a79c8cf7cc8fbdd3d4c3c9a7c7 Author: Mario Limonciello <mario.limonciello@amd.com> Date: Mon Nov 4 16:28:55 2024 -0600 ACPI: processor: Move arch_init_invariance_cppc() call later arch_init_invariance_cppc() is called at the end of acpi_cppc_processor_probe() in order to configure frequency invariance based upon the values from _CPC. This however doesn't work on AMD CPPC shared memory designs that have AMD preferred cores enabled because _CPC needs to be analyzed from all cores to judge if preferred cores are enabled. This issue manifests to users as a warning since commit 21fb59ab4b97 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"): ``` Could not retrieve highest performance (-19) ``` However the warning isn't the cause of this, it was actually commit 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") which exposed the issue. To fix this problem, change arch_init_invariance_cppc() into a new weak symbol that is called at the end of acpi_processor_driver_init(). Each architecture that supports it can declare the symbol to override the weak one. Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the architectures using the generic arch_topology.c code. Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org [ rjw: Changelog edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: David Arcari <darcari@redhat.com>
2025-02-18 12:30:14 +00:00
void acpi_processor_init_invariance_cppc(void)
{
topology_init_cpu_capacity_cppc();
}
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
#endif
#ifdef CONFIG_CPU_FREQ
Revert "base: arch_topology: fix section mismatch build warnings" This reverts commit 452562abb5b7 ("base: arch_topology: fix section mismatch build warnings"). It causes the notifier call hangs in some use-cases. In some cases with using maxcpus, some of cpus are booted first and then the remaining cpus are booted. As an example, some users who want to realize fast boot up often use the following procedure. 1) Define all CPUs on device tree (CA57x4 + CA53x4) 2) Add "maxcpus=4" in bootargs 3) Kernel boot up with CA57x4 4) After kernel boot up, CA53x4 is booted from user When kernel init was finished, CPUFREQ_POLICY_NOTIFIER was not still unregisterd. This means that "__init init_cpu_capacity_callback()" will be called after kernel init sequence. To avoid this problem, it needs to remove __init{,data} annotations by reverting this commit. Also, this commit was needed to fix kernel compile issue below. However, this issue was also fixed by another patch: commit 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") in v4.15 as well. Whereas commit 452562abb5b7 added all the missing __init annotations, commit 82d8ba717ccb removed it from free_raw_capacity(). WARNING: vmlinux.o(.text+0x548f24): Section mismatch in reference from the function init_cpu_capacity_callback() to the variable .init.text:$x The function init_cpu_capacity_callback() references the variable __init $x. This is often because init_cpu_capacity_callback lacks a __init annotation or the annotation of $x is wrong. Fixes: 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") Cc: stable <stable@vger.kernel.org> Signed-off-by: Gaku Inami <gaku.inami.xh@renesas.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13 02:06:40 +00:00
static cpumask_var_t cpus_to_visit;
static void parsing_done_workfn(struct work_struct *work);
static DECLARE_WORK(parsing_done_work, parsing_done_workfn);
Revert "base: arch_topology: fix section mismatch build warnings" This reverts commit 452562abb5b7 ("base: arch_topology: fix section mismatch build warnings"). It causes the notifier call hangs in some use-cases. In some cases with using maxcpus, some of cpus are booted first and then the remaining cpus are booted. As an example, some users who want to realize fast boot up often use the following procedure. 1) Define all CPUs on device tree (CA57x4 + CA53x4) 2) Add "maxcpus=4" in bootargs 3) Kernel boot up with CA57x4 4) After kernel boot up, CA53x4 is booted from user When kernel init was finished, CPUFREQ_POLICY_NOTIFIER was not still unregisterd. This means that "__init init_cpu_capacity_callback()" will be called after kernel init sequence. To avoid this problem, it needs to remove __init{,data} annotations by reverting this commit. Also, this commit was needed to fix kernel compile issue below. However, this issue was also fixed by another patch: commit 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") in v4.15 as well. Whereas commit 452562abb5b7 added all the missing __init annotations, commit 82d8ba717ccb removed it from free_raw_capacity(). WARNING: vmlinux.o(.text+0x548f24): Section mismatch in reference from the function init_cpu_capacity_callback() to the variable .init.text:$x The function init_cpu_capacity_callback() references the variable __init $x. This is often because init_cpu_capacity_callback lacks a __init annotation or the annotation of $x is wrong. Fixes: 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") Cc: stable <stable@vger.kernel.org> Signed-off-by: Gaku Inami <gaku.inami.xh@renesas.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13 02:06:40 +00:00
static int
init_cpu_capacity_callback(struct notifier_block *nb,
unsigned long val,
void *data)
{
struct cpufreq_policy *policy = data;
int cpu;
if (val != CPUFREQ_CREATE_POLICY)
return 0;
pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] (to_visit=%*pbl)\n",
cpumask_pr_args(policy->related_cpus),
cpumask_pr_args(cpus_to_visit));
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
for_each_cpu(cpu, policy->related_cpus) {
sched/topology: Add a new arch_scale_freq_ref() method JIRA: https://issues.redhat.com/browse/RHEL-29020 Conflicts: Dropped arch/riscv/include/asm/topology.h hunk since the file does not exist. commit 9942cb22ea458c34fa17b73d143ea32d4df1caca Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Mon Dec 11 11:48:49 2023 +0100 sched/topology: Add a new arch_scale_freq_ref() method Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-27 17:55:25 +00:00
per_cpu(capacity_freq_ref, cpu) = policy->cpuinfo.max_freq;
freq_inv_set_max_ratio(cpu,
per_cpu(capacity_freq_ref, cpu) * HZ_PER_KHZ);
}
if (cpumask_empty(cpus_to_visit)) {
if (raw_capacity) {
topology_normalize_cpu_scale();
schedule_work(&update_topology_flags_work);
free_raw_capacity();
}
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
}
return 0;
}
Revert "base: arch_topology: fix section mismatch build warnings" This reverts commit 452562abb5b7 ("base: arch_topology: fix section mismatch build warnings"). It causes the notifier call hangs in some use-cases. In some cases with using maxcpus, some of cpus are booted first and then the remaining cpus are booted. As an example, some users who want to realize fast boot up often use the following procedure. 1) Define all CPUs on device tree (CA57x4 + CA53x4) 2) Add "maxcpus=4" in bootargs 3) Kernel boot up with CA57x4 4) After kernel boot up, CA53x4 is booted from user When kernel init was finished, CPUFREQ_POLICY_NOTIFIER was not still unregisterd. This means that "__init init_cpu_capacity_callback()" will be called after kernel init sequence. To avoid this problem, it needs to remove __init{,data} annotations by reverting this commit. Also, this commit was needed to fix kernel compile issue below. However, this issue was also fixed by another patch: commit 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") in v4.15 as well. Whereas commit 452562abb5b7 added all the missing __init annotations, commit 82d8ba717ccb removed it from free_raw_capacity(). WARNING: vmlinux.o(.text+0x548f24): Section mismatch in reference from the function init_cpu_capacity_callback() to the variable .init.text:$x The function init_cpu_capacity_callback() references the variable __init $x. This is often because init_cpu_capacity_callback lacks a __init annotation or the annotation of $x is wrong. Fixes: 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") Cc: stable <stable@vger.kernel.org> Signed-off-by: Gaku Inami <gaku.inami.xh@renesas.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13 02:06:40 +00:00
static struct notifier_block init_cpu_capacity_notifier = {
.notifier_call = init_cpu_capacity_callback,
};
static int __init register_cpufreq_notifier(void)
{
int ret;
/*
arch_topology: obtain cpu capacity using information from CPPC Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2067284 commit 9924fbb51e0ae30b8d7eec7c1c839d74da9678b3 Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Thu, 10 Mar 2022 14:54:50 +0000 Define topology_init_cpu_capacity_cppc() to use highest performance values from _CPC objects to obtain and set maximum capacity information for each CPU. acpi_cppc_processor_probe() is a good point at which to trigger the initialization of CPU (u-arch) capacity values, as at this point the highest performance values can be obtained from each CPU's _CPC objects. Architectures can therefore use this functionality through arch_init_invariance_cppc(). The performance scale used by CPPC is a unified scale for all CPUs in the system. Therefore, by obtaining the raw highest performance values from the _CPC objects, and normalizing them on the [0, 1024] capacity scale, used by the task scheduler, we obtain the CPU capacity of each CPU. While an ACPI Notify(0x85) could alert about a change in the highest performance value, which should in turn retrigger the CPU capacity computations, this notification is not currently handled by the ACPI processor driver. When supported, a call to arch_init_invariance_cppc() would perform the update. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-05-31 20:07:06 +00:00
* On ACPI-based systems skip registering cpufreq notifier as cpufreq
* information is not needed for cpu capacity initialization.
*/
if (!acpi_disabled)
return -EINVAL;
if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
return -ENOMEM;
cpumask_copy(cpus_to_visit, cpu_possible_mask);
ret = cpufreq_register_notifier(&init_cpu_capacity_notifier,
CPUFREQ_POLICY_NOTIFIER);
if (ret)
free_cpumask_var(cpus_to_visit);
return ret;
}
core_initcall(register_cpufreq_notifier);
Revert "base: arch_topology: fix section mismatch build warnings" This reverts commit 452562abb5b7 ("base: arch_topology: fix section mismatch build warnings"). It causes the notifier call hangs in some use-cases. In some cases with using maxcpus, some of cpus are booted first and then the remaining cpus are booted. As an example, some users who want to realize fast boot up often use the following procedure. 1) Define all CPUs on device tree (CA57x4 + CA53x4) 2) Add "maxcpus=4" in bootargs 3) Kernel boot up with CA57x4 4) After kernel boot up, CA53x4 is booted from user When kernel init was finished, CPUFREQ_POLICY_NOTIFIER was not still unregisterd. This means that "__init init_cpu_capacity_callback()" will be called after kernel init sequence. To avoid this problem, it needs to remove __init{,data} annotations by reverting this commit. Also, this commit was needed to fix kernel compile issue below. However, this issue was also fixed by another patch: commit 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") in v4.15 as well. Whereas commit 452562abb5b7 added all the missing __init annotations, commit 82d8ba717ccb removed it from free_raw_capacity(). WARNING: vmlinux.o(.text+0x548f24): Section mismatch in reference from the function init_cpu_capacity_callback() to the variable .init.text:$x The function init_cpu_capacity_callback() references the variable __init $x. This is often because init_cpu_capacity_callback lacks a __init annotation or the annotation of $x is wrong. Fixes: 82d8ba717ccb ("arch_topology: Fix section miss match warning due to free_raw_capacity()") Cc: stable <stable@vger.kernel.org> Signed-off-by: Gaku Inami <gaku.inami.xh@renesas.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13 02:06:40 +00:00
static void parsing_done_workfn(struct work_struct *work)
{
cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
CPUFREQ_POLICY_NOTIFIER);
free_cpumask_var(cpus_to_visit);
}
#else
core_initcall(free_raw_capacity);
#endif
#if defined(CONFIG_ARM64) || defined(CONFIG_RISCV)
/*
* This function returns the logic cpu number of the node.
* There are basically three kinds of return values:
* (1) logic cpu number which is > 0.
* (2) -ENODEV when the device tree(DT) node is valid and found in the DT but
* there is no possible logical CPU in the kernel to match. This happens
* when CONFIG_NR_CPUS is configure to be smaller than the number of
* CPU nodes in DT. We need to just ignore this case.
* (3) -1 if the node does not exist in the device tree
*/
static int __init get_cpu_for_node(struct device_node *node)
{
struct device_node *cpu_node;
int cpu;
cpu_node = of_parse_phandle(node, "cpu", 0);
if (!cpu_node)
return -1;
cpu = of_cpu_node_to_id(cpu_node);
if (cpu >= 0)
topology_parse_cpu_capacity(cpu_node, cpu);
else
pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
cpu_node, cpumask_pr_args(cpu_possible_mask));
of_node_put(cpu_node);
return cpu;
}
static int __init parse_core(struct device_node *core, int package_id,
int cluster_id, int core_id)
{
char name[20];
bool leaf = true;
int i = 0;
int cpu;
struct device_node *t;
do {
snprintf(name, sizeof(name), "thread%d", i);
t = of_get_child_by_name(core, name);
if (t) {
leaf = false;
cpu = get_cpu_for_node(t);
if (cpu >= 0) {
cpu_topology[cpu].package_id = package_id;
cpu_topology[cpu].cluster_id = cluster_id;
cpu_topology[cpu].core_id = core_id;
cpu_topology[cpu].thread_id = i;
} else if (cpu != -ENODEV) {
pr_err("%pOF: Can't get CPU for thread\n", t);
of_node_put(t);
return -EINVAL;
}
of_node_put(t);
}
i++;
} while (t);
cpu = get_cpu_for_node(core);
if (cpu >= 0) {
if (!leaf) {
pr_err("%pOF: Core has both threads and CPU\n",
core);
return -EINVAL;
}
cpu_topology[cpu].package_id = package_id;
cpu_topology[cpu].cluster_id = cluster_id;
cpu_topology[cpu].core_id = core_id;
} else if (leaf && cpu != -ENODEV) {
pr_err("%pOF: Can't get CPU for leaf core\n", core);
return -EINVAL;
}
return 0;
}
static int __init parse_cluster(struct device_node *cluster, int package_id,
int cluster_id, int depth)
{
char name[20];
bool leaf = true;
bool has_cores = false;
struct device_node *c;
int core_id = 0;
int i, ret;
/*
* First check for child clusters; we currently ignore any
* information about the nesting of clusters and present the
* scheduler with a flat list of them.
*/
i = 0;
do {
snprintf(name, sizeof(name), "cluster%d", i);
c = of_get_child_by_name(cluster, name);
if (c) {
leaf = false;
ret = parse_cluster(c, package_id, i, depth + 1);
if (depth > 0)
pr_warn("Topology for clusters of clusters not yet supported\n");
of_node_put(c);
if (ret != 0)
return ret;
}
i++;
} while (c);
/* Now check for cores */
i = 0;
do {
snprintf(name, sizeof(name), "core%d", i);
c = of_get_child_by_name(cluster, name);
if (c) {
has_cores = true;
if (depth == 0) {
pr_err("%pOF: cpu-map children should be clusters\n",
c);
of_node_put(c);
return -EINVAL;
}
if (leaf) {
ret = parse_core(c, package_id, cluster_id,
core_id++);
} else {
pr_err("%pOF: Non-leaf cluster with core %s\n",
cluster, name);
ret = -EINVAL;
}
of_node_put(c);
if (ret != 0)
return ret;
}
i++;
} while (c);
if (leaf && !has_cores)
pr_warn("%pOF: empty cluster\n", cluster);
return 0;
}
static int __init parse_socket(struct device_node *socket)
{
char name[20];
struct device_node *c;
bool has_socket = false;
int package_id = 0, ret;
do {
snprintf(name, sizeof(name), "socket%d", package_id);
c = of_get_child_by_name(socket, name);
if (c) {
has_socket = true;
ret = parse_cluster(c, package_id, -1, 0);
of_node_put(c);
if (ret != 0)
return ret;
}
package_id++;
} while (c);
if (!has_socket)
ret = parse_cluster(socket, 0, -1, 0);
return ret;
}
static int __init parse_dt_topology(void)
{
struct device_node *cn, *map;
int ret = 0;
int cpu;
cn = of_find_node_by_path("/cpus");
if (!cn) {
pr_err("No CPU information found in DT\n");
return 0;
}
/*
* When topology is provided cpu-map is essentially a root
* cluster with restricted subnodes.
*/
map = of_get_child_by_name(cn, "cpu-map");
if (!map)
goto out;
ret = parse_socket(map);
if (ret != 0)
goto out_map;
topology_normalize_cpu_scale();
/*
* Check that all cores are in the topology; the SMP code will
* only mark cores described in the DT as possible.
*/
for_each_possible_cpu(cpu)
if (cpu_topology[cpu].package_id < 0) {
ret = -EINVAL;
break;
}
out_map:
of_node_put(map);
out:
of_node_put(cn);
return ret;
}
#endif
/*
* cpu topology table
*/
struct cpu_topology cpu_topology[NR_CPUS];
EXPORT_SYMBOL_GPL(cpu_topology);
const struct cpumask *cpu_coregroup_mask(int cpu)
{
const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
/* Find the smaller of NUMA, core or LLC siblings */
if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
/* not numa in package, lets use the package siblings */
core_mask = &cpu_topology[cpu].core_sibling;
}
if (last_level_cache_is_valid(cpu)) {
if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
core_mask = &cpu_topology[cpu].llc_sibling;
}
topology: make core_mask include at least cluster_siblings Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2047951 commit db1e59483dfd8d4e956575302520bb8f7e20c79b Author: Darren Hart <darren@os.amperecomputing.com> Date: Mon, 11 Apr 2022 13:53:34 -0700 Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop Control Unit, but have no shared CPU-side last level cache. cpu_coregroup_mask() will return a cpumask with weight 1, while cpu_clustergroup_mask() will return a cpumask with weight 2. As a result, build_sched_domain() will BUG() once per CPU with: BUG: arch topology borken the CLS domain not a subset of the MC domain The MC level cpumask is then extended to that of the CLS child, and is later removed entirely as redundant. This sched domain topology is an improvement over previous topologies, or those built without SCHED_CLUSTER, particularly for certain latency sensitive workloads. With the current scheduler model and heuristics, this is a desirable default topology for Ampere Altra and Altra Max system. Rather than create a custom sched domains topology structure and introduce new logic in arch/arm64 to detect these systems, update the core_mask so coregroup is never a subset of clustergroup, extending it to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER is enabled to avoid also changing the topology (MC) when CONFIG_SCHED_CLUSTER is disabled. This has the added benefit over a custom topology of working for both symmetric and asymmetric topologies. It does not address systems where the CLUSTER topology is above a populated MC topology, but these are not considered today and can be addressed separately if and when they appear. The final sched domain topology for a 2 socket Ampere Altra system is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided: For CPU0: CONFIG_SCHED_CLUSTER=y CLS [0-1] DIE [0-79] NUMA [0-159] CONFIG_SCHED_CLUSTER is not set DIE [0-79] NUMA [0-159] Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: D. Scott Phillips <scott@os.amperecomputing.com> Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com> Cc: <stable@vger.kernel.org> # 5.16.x Suggested-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Darren Hart <darren@os.amperecomputing.com> Link: https://lore.kernel.org/r/c8fe9fce7c86ed56b4c455b8c902982dc2303868.1649696956.git.darren@os.amperecomputing.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mark Salter <msalter@redhat.com>
2022-05-17 14:55:58 +00:00
/*
* For systems with no shared cpu-side LLC but with clusters defined,
* extend core_mask to cluster_siblings. The sched domain builder will
* then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
*/
if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
core_mask = &cpu_topology[cpu].cluster_sibling;
return core_mask;
}
topology: Represent clusters of CPUs within a die Bugzilla: http://bugzilla.redhat.com/2020279 commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4 Author: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Fri Sep 24 20:51:02 2021 +1200 topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-17 16:58:26 +00:00
const struct cpumask *cpu_clustergroup_mask(int cpu)
{
arch_topology: Limit span of cpu_clustergroup_mask() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318 Omitted-fix: 6b66ca0bac1b9cee7608d7c4dc59b699458b4cb8 Omitted-fix: c749b275056d4d1023af125b320c91a24d6856b8 These two commits are a fix and a reversion of the fix, so I'm dropping both of them. The final fix is "arch_topology: Make cluster topology span at least SMT CPUs", included later in this series. commit bfcc4397435dc0407099b9a805391abc05c2313b Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Mon, 4 Jul 2022 11:16:01 +0100 Currently the cluster identifier is not set on DT based platforms. The reset or default value is -1 for all the CPUs. Once we assign the cluster identifier values correctly, the cluster_sibling mask will be populated and returned by cpu_clustergroup_mask() to contribute in the creation of the CLS scheduling domain level, if SCHED_CLUSTER is enabled. To avoid topologies that will result in questionable or incorrect scheduling domains, impose restrictions regarding the span of clusters, as presented to scheduling domains building code: cluster_sibling should not span more or the same CPUs as cpu_coregroup_mask(). This is needed in order to obtain a strict separation between the MC and CLS levels, and maintain the same domains for existing platforms in the presence of CONFIG_SCHED_CLUSTER, where the new cluster information is redundant and irrelevant for the scheduler. While previously the scheduling domain builder code would have removed MC as redundant and kept CLS if SCHED_CLUSTER was enabled and the cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs, now CLS will be removed and MC kept. Link: https://lore.kernel.org/r/20220704101605.1318280-18-sudeep.holla@arm.com Cc: Darren Hart <darren@os.amperecomputing.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-08-22 20:35:09 +00:00
/*
* Forbid cpu_clustergroup_mask() to span more or the same CPUs as
* cpu_coregroup_mask().
*/
if (cpumask_subset(cpu_coregroup_mask(cpu),
&cpu_topology[cpu].cluster_sibling))
2022-09-19 18:14:40 +00:00
return topology_sibling_cpumask(cpu);
arch_topology: Limit span of cpu_clustergroup_mask() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318 Omitted-fix: 6b66ca0bac1b9cee7608d7c4dc59b699458b4cb8 Omitted-fix: c749b275056d4d1023af125b320c91a24d6856b8 These two commits are a fix and a reversion of the fix, so I'm dropping both of them. The final fix is "arch_topology: Make cluster topology span at least SMT CPUs", included later in this series. commit bfcc4397435dc0407099b9a805391abc05c2313b Author: Ionela Voinescu <ionela.voinescu@arm.com> Date: Mon, 4 Jul 2022 11:16:01 +0100 Currently the cluster identifier is not set on DT based platforms. The reset or default value is -1 for all the CPUs. Once we assign the cluster identifier values correctly, the cluster_sibling mask will be populated and returned by cpu_clustergroup_mask() to contribute in the creation of the CLS scheduling domain level, if SCHED_CLUSTER is enabled. To avoid topologies that will result in questionable or incorrect scheduling domains, impose restrictions regarding the span of clusters, as presented to scheduling domains building code: cluster_sibling should not span more or the same CPUs as cpu_coregroup_mask(). This is needed in order to obtain a strict separation between the MC and CLS levels, and maintain the same domains for existing platforms in the presence of CONFIG_SCHED_CLUSTER, where the new cluster information is redundant and irrelevant for the scheduler. While previously the scheduling domain builder code would have removed MC as redundant and kept CLS if SCHED_CLUSTER was enabled and the cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs, now CLS will be removed and MC kept. Link: https://lore.kernel.org/r/20220704101605.1318280-18-sudeep.holla@arm.com Cc: Darren Hart <darren@os.amperecomputing.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-08-22 20:35:09 +00:00
topology: Represent clusters of CPUs within a die Bugzilla: http://bugzilla.redhat.com/2020279 commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4 Author: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Fri Sep 24 20:51:02 2021 +1200 topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-17 16:58:26 +00:00
return &cpu_topology[cpu].cluster_sibling;
}
void update_siblings_masks(unsigned int cpuid)
{
struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
arch_topology: Fix cache attributes detection in the CPU hotplug path Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122318 commit 3fcbf1c77d089fcf0331fd8f3cbbe6c436a3edbd Author: Sudeep Holla <sudeep.holla@arm.com> Date: Wed, 20 Jul 2022 13:55:40 +0100 init_cpu_topology() is called only once at the boot and all the cache attributes are detected early for all the possible CPUs. However when the CPUs are hotplugged out, the cacheinfo gets removed. While the attributes are added back when the CPUs are hotplugged back in as part of CPU hotplug state machine, it ends up called quite late after the update_siblings_masks() are called in the secondary_start_kernel() resulting in wrong llc_sibling_masks. Move the call to detect_cache_attributes() inside update_siblings_masks() to ensure the cacheinfo is updated before the LLC sibling masks are updated. This will fix the incorrect LLC sibling masks generated when the CPUs are hotplugged out and hotplugged back in again. Reported-by: Ionela Voinescu <ionela.voinescu@arm.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20220720-arch_topo_fixes-v3-3-43d696288e84@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2022-08-22 20:56:54 +00:00
int cpu, ret;
ret = detect_cache_attributes(cpuid);
if (ret && ret != -ENOENT)
arch_topology: Build cacheinfo from primary CPU Bugzilla: https://bugzilla.redhat.com/2180619 commit 5944ce092b97caed5d86d961e963b883b5c44ee2 Author: Pierre Gondois <pierre.gondois@arm.com> Date: Wed Jan 4 19:30:29 2023 +0100 arch_topology: Build cacheinfo from primary CPU commit 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection in the CPU hotplug path") adds a call to detect_cache_attributes() to populate the cacheinfo before updating the siblings mask. detect_cache_attributes() allocates memory and can take the PPTT mutex (on ACPI platforms). On PREEMPT_RT kernels, on secondary CPUs, this triggers a: 'BUG: sleeping function called from invalid context' [1] as the code is executed with preemption and interrupts disabled. The primary CPU was previously storing the cache information using the now removed (struct cpu_topology).llc_id: commit 5b8dc787ce4a ("arch_topology: Drop LLC identifier stash from the CPU topology") allocate_cache_info() tries to build the cacheinfo from the primary CPU prior secondary CPUs boot, if the DT/ACPI description contains cache information. If allocate_cache_info() fails, then fallback to the current state for the cacheinfo allocation. [1] will be triggered in such case. When unplugging a CPU, the cacheinfo memory cannot be freed. If it was, then the memory would be allocated early by the re-plugged CPU and would trigger [1]. Note that populate_cache_leaves() might be called multiple times due to populate_leaves being moved up. This is required since detect_cache_attributes() might be called with per_cpu_cacheinfo(cpu) being allocated but not populated. [1]: | BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 | in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/111 | preempt_count: 1, expected: 0 | RCU nest depth: 1, expected: 1 | 3 locks held by swapper/111/0: | #0: (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x218/0x12c8 | #1: (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x48/0xf0 | #2: (&zone->lock){+.+.}-{3:3}, at: rmqueue_bulk+0x64/0xa80 | irq event stamp: 0 | hardirqs last enabled at (0): 0x0 | hardirqs last disabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last enabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last disabled at (0): 0x0 | Preemption disabled at: | migrate_enable+0x30/0x130 | CPU: 111 PID: 0 Comm: swapper/111 Tainted: G W 6.0.0-rc4-rt6-[...] | Call trace: | __kmalloc+0xbc/0x1e8 | detect_cache_attributes+0x2d4/0x5f0 | update_siblings_masks+0x30/0x368 | store_cpu_topology+0x78/0xb8 | secondary_start_kernel+0xd0/0x198 | __secondary_switched+0xb0/0xb4 Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230104183033.755668-7-pierre.gondois@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Radu Rendec <rrendec@redhat.com>
2023-05-02 20:57:28 +00:00
pr_info("Early cacheinfo allocation failed, ret = %d\n", ret);
/* update core and thread sibling masks */
for_each_online_cpu(cpu) {
cpu_topo = &cpu_topology[cpu];
if (last_level_cache_is_shared(cpu, cpuid)) {
cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
}
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
continue;
if (cpuid_topo->cluster_id >= 0) {
topology: Represent clusters of CPUs within a die Bugzilla: http://bugzilla.redhat.com/2020279 commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4 Author: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Fri Sep 24 20:51:02 2021 +1200 topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-17 16:58:26 +00:00
cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
}
if (cpuid_topo->core_id != cpu_topo->core_id)
continue;
cpumask_set_cpu(cpuid, &cpu_topo->thread_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->thread_sibling);
}
}
static void clear_cpu_topology(int cpu)
{
struct cpu_topology *cpu_topo = &cpu_topology[cpu];
cpumask_clear(&cpu_topo->llc_sibling);
cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
topology: Represent clusters of CPUs within a die Bugzilla: http://bugzilla.redhat.com/2020279 commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4 Author: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Fri Sep 24 20:51:02 2021 +1200 topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-17 16:58:26 +00:00
cpumask_clear(&cpu_topo->cluster_sibling);
cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
cpumask_clear(&cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
cpumask_clear(&cpu_topo->thread_sibling);
cpumask_set_cpu(cpu, &cpu_topo->thread_sibling);
}
void __init reset_cpu_topology(void)
{
unsigned int cpu;
for_each_possible_cpu(cpu) {
struct cpu_topology *cpu_topo = &cpu_topology[cpu];
cpu_topo->thread_id = -1;
cpu_topo->core_id = -1;
topology: Represent clusters of CPUs within a die Bugzilla: http://bugzilla.redhat.com/2020279 commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4 Author: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Fri Sep 24 20:51:02 2021 +1200 topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-17 16:58:26 +00:00
cpu_topo->cluster_id = -1;
cpu_topo->package_id = -1;
clear_cpu_topology(cpu);
}
}
void remove_cpu_topology(unsigned int cpu)
{
int sibling;
for_each_cpu(sibling, topology_core_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
for_each_cpu(sibling, topology_sibling_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology() Bugzilla: http://bugzilla.redhat.com/2020279 commit 4cc4cc28ec4154c4f1395648ab67ac9fd3e71fdc Author: Wang ShaoBo <bobo.shaobowang@huawei.com> Date: Wed Nov 10 17:58:56 2021 +0800 arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology() When testing cpu online and offline, warning happened like this: [ 146.746743] WARNING: CPU: 92 PID: 974 at kernel/sched/topology.c:2215 build_sched_domains+0x81c/0x11b0 [ 146.749988] CPU: 92 PID: 974 Comm: kworker/92:2 Not tainted 5.15.0 #9 [ 146.750402] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.79 08/21/2021 [ 146.751213] Workqueue: events cpuset_hotplug_workfn [ 146.751629] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 146.752048] pc : build_sched_domains+0x81c/0x11b0 [ 146.752461] lr : build_sched_domains+0x414/0x11b0 [ 146.752860] sp : ffff800040a83a80 [ 146.753247] x29: ffff800040a83a80 x28: ffff20801f13a980 x27: ffff20800448ae00 [ 146.753644] x26: ffff800012a858e8 x25: ffff800012ea48c0 x24: 0000000000000000 [ 146.754039] x23: ffff800010ab7d60 x22: ffff800012f03758 x21: 000000000000005f [ 146.754427] x20: 000000000000005c x19: ffff004080012840 x18: ffffffffffffffff [ 146.754814] x17: 3661613030303230 x16: 30303078303a3239 x15: ffff800011f92b48 [ 146.755197] x14: ffff20be3f95cef6 x13: 2e6e69616d6f642d x12: 6465686373204c4c [ 146.755578] x11: ffff20bf7fc83a00 x10: 0000000000000040 x9 : 0000000000000000 [ 146.755957] x8 : 0000000000000002 x7 : ffffffffe0000000 x6 : 0000000000000002 [ 146.756334] x5 : 0000000090000000 x4 : 00000000f0000000 x3 : 0000000000000001 [ 146.756705] x2 : 0000000000000080 x1 : ffff800012f03860 x0 : 0000000000000001 [ 146.757070] Call trace: [ 146.757421] build_sched_domains+0x81c/0x11b0 [ 146.757771] partition_sched_domains_locked+0x57c/0x978 [ 146.758118] rebuild_sched_domains_locked+0x44c/0x7f0 [ 146.758460] rebuild_sched_domains+0x2c/0x48 [ 146.758791] cpuset_hotplug_workfn+0x3fc/0x888 [ 146.759114] process_one_work+0x1f4/0x480 [ 146.759429] worker_thread+0x48/0x460 [ 146.759734] kthread+0x158/0x168 [ 146.760030] ret_from_fork+0x10/0x20 [ 146.760318] ---[ end trace 82c44aad6900e81a ]--- For some architectures like risc-v and arm64 which use common code clear_cpu_topology() in shutting down CPUx, When CONFIG_SCHED_CLUSTER is set, cluster_sibling in cpu_topology of each sibling adjacent to CPUx is missed clearing, this causes checking failed in topology_span_sane() and rebuilding topology failure at end when CPU online. Different sibling's cluster_sibling in cpu_topology[] when CPU92 offline (CPU 92, 93, 94, 95 are in one cluster): Before revision: CPU [92] [93] [94] [95] cluster_sibling [92] [92-95] [92-95] [92-95] After revision: CPU [92] [93] [94] [95] cluster_sibling [92] [93-95] [93-95] [93-95] Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Barry Song <song.bao.hua@hisilicon.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20211110095856.469360-1-bobo.shaobowang@huawei.com Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-17 16:59:25 +00:00
for_each_cpu(sibling, topology_cluster_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_cluster_cpumask(sibling));
for_each_cpu(sibling, topology_llc_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_llc_cpumask(sibling));
clear_cpu_topology(cpu);
}
__weak int __init parse_acpi_topology(void)
{
return 0;
}
#if defined(CONFIG_ARM64) || defined(CONFIG_RISCV)
void __init init_cpu_topology(void)
{
arch_topology: Build cacheinfo from primary CPU Bugzilla: https://bugzilla.redhat.com/2180619 commit 5944ce092b97caed5d86d961e963b883b5c44ee2 Author: Pierre Gondois <pierre.gondois@arm.com> Date: Wed Jan 4 19:30:29 2023 +0100 arch_topology: Build cacheinfo from primary CPU commit 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection in the CPU hotplug path") adds a call to detect_cache_attributes() to populate the cacheinfo before updating the siblings mask. detect_cache_attributes() allocates memory and can take the PPTT mutex (on ACPI platforms). On PREEMPT_RT kernels, on secondary CPUs, this triggers a: 'BUG: sleeping function called from invalid context' [1] as the code is executed with preemption and interrupts disabled. The primary CPU was previously storing the cache information using the now removed (struct cpu_topology).llc_id: commit 5b8dc787ce4a ("arch_topology: Drop LLC identifier stash from the CPU topology") allocate_cache_info() tries to build the cacheinfo from the primary CPU prior secondary CPUs boot, if the DT/ACPI description contains cache information. If allocate_cache_info() fails, then fallback to the current state for the cacheinfo allocation. [1] will be triggered in such case. When unplugging a CPU, the cacheinfo memory cannot be freed. If it was, then the memory would be allocated early by the re-plugged CPU and would trigger [1]. Note that populate_cache_leaves() might be called multiple times due to populate_leaves being moved up. This is required since detect_cache_attributes() might be called with per_cpu_cacheinfo(cpu) being allocated but not populated. [1]: | BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 | in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/111 | preempt_count: 1, expected: 0 | RCU nest depth: 1, expected: 1 | 3 locks held by swapper/111/0: | #0: (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x218/0x12c8 | #1: (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x48/0xf0 | #2: (&zone->lock){+.+.}-{3:3}, at: rmqueue_bulk+0x64/0xa80 | irq event stamp: 0 | hardirqs last enabled at (0): 0x0 | hardirqs last disabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last enabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last disabled at (0): 0x0 | Preemption disabled at: | migrate_enable+0x30/0x130 | CPU: 111 PID: 0 Comm: swapper/111 Tainted: G W 6.0.0-rc4-rt6-[...] | Call trace: | __kmalloc+0xbc/0x1e8 | detect_cache_attributes+0x2d4/0x5f0 | update_siblings_masks+0x30/0x368 | store_cpu_topology+0x78/0xb8 | secondary_start_kernel+0xd0/0x198 | __secondary_switched+0xb0/0xb4 Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230104183033.755668-7-pierre.gondois@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Radu Rendec <rrendec@redhat.com>
2023-05-02 20:57:28 +00:00
int cpu, ret;
reset_cpu_topology();
ret = parse_acpi_topology();
if (!ret)
ret = of_have_populated_dt() && parse_dt_topology();
if (ret) {
/*
* Discard anything that was parsed if we hit an error so we
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken Bugzilla: https://bugzilla.redhat.com/2180619 commit e103d55465db06c5344201fd5fa11bb19bc479c5 Author: Radu Rendec <rrendec@redhat.com> Date: Wed Apr 12 14:57:59 2023 -0400 cacheinfo: Allow early level detection when DT/ACPI info is missing/broken Recent work enables cacheinfo memory for secondary CPUs to be allocated early, while still running on the primary CPU. That allows cacheinfo memory to be allocated safely on RT kernels. To make that work, the number of cache levels/leaves must be defined in the device tree or ACPI tables. Further work adds a path for early detection of the number of cache levels/leaves, which makes it possible to allocate the cacheinfo memory early without requiring extra DT/ACPI information. This patch addresses a specific issue with ACPI systems with no PPTT. In that case, parse_acpi_topology() returns an error code, which in turn makes init_cpu_topology() return early, before fetch_cache_info() is called. In that case, the early cache level detection doesn't run. The solution is to simply remove the "return" statement and let the code flow fall through to calling fetch_cache_info(). Signed-off-by: Radu Rendec <rrendec@redhat.com> Reported-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://lore.kernel.org/all/dea94484-797f-3034-7b86-6d88801c0d91@arm.com/ Reviewed-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://lore.kernel.org/r/20230412185759.755408-4-rrendec@redhat.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Radu Rendec <rrendec@redhat.com>
2023-05-02 20:57:28 +00:00
* don't use partial information. But do not return yet to give
* arch-specific early cache level detection a chance to run.
*/
reset_cpu_topology();
}
arch_topology: Build cacheinfo from primary CPU Bugzilla: https://bugzilla.redhat.com/2180619 commit 5944ce092b97caed5d86d961e963b883b5c44ee2 Author: Pierre Gondois <pierre.gondois@arm.com> Date: Wed Jan 4 19:30:29 2023 +0100 arch_topology: Build cacheinfo from primary CPU commit 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection in the CPU hotplug path") adds a call to detect_cache_attributes() to populate the cacheinfo before updating the siblings mask. detect_cache_attributes() allocates memory and can take the PPTT mutex (on ACPI platforms). On PREEMPT_RT kernels, on secondary CPUs, this triggers a: 'BUG: sleeping function called from invalid context' [1] as the code is executed with preemption and interrupts disabled. The primary CPU was previously storing the cache information using the now removed (struct cpu_topology).llc_id: commit 5b8dc787ce4a ("arch_topology: Drop LLC identifier stash from the CPU topology") allocate_cache_info() tries to build the cacheinfo from the primary CPU prior secondary CPUs boot, if the DT/ACPI description contains cache information. If allocate_cache_info() fails, then fallback to the current state for the cacheinfo allocation. [1] will be triggered in such case. When unplugging a CPU, the cacheinfo memory cannot be freed. If it was, then the memory would be allocated early by the re-plugged CPU and would trigger [1]. Note that populate_cache_leaves() might be called multiple times due to populate_leaves being moved up. This is required since detect_cache_attributes() might be called with per_cpu_cacheinfo(cpu) being allocated but not populated. [1]: | BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 | in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/111 | preempt_count: 1, expected: 0 | RCU nest depth: 1, expected: 1 | 3 locks held by swapper/111/0: | #0: (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x218/0x12c8 | #1: (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x48/0xf0 | #2: (&zone->lock){+.+.}-{3:3}, at: rmqueue_bulk+0x64/0xa80 | irq event stamp: 0 | hardirqs last enabled at (0): 0x0 | hardirqs last disabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last enabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last disabled at (0): 0x0 | Preemption disabled at: | migrate_enable+0x30/0x130 | CPU: 111 PID: 0 Comm: swapper/111 Tainted: G W 6.0.0-rc4-rt6-[...] | Call trace: | __kmalloc+0xbc/0x1e8 | detect_cache_attributes+0x2d4/0x5f0 | update_siblings_masks+0x30/0x368 | store_cpu_topology+0x78/0xb8 | secondary_start_kernel+0xd0/0x198 | __secondary_switched+0xb0/0xb4 Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230104183033.755668-7-pierre.gondois@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Radu Rendec <rrendec@redhat.com>
2023-05-02 20:57:28 +00:00
for_each_possible_cpu(cpu) {
ret = fetch_cache_info(cpu);
if (!ret)
continue;
else if (ret != -ENOENT)
arch_topology: Build cacheinfo from primary CPU Bugzilla: https://bugzilla.redhat.com/2180619 commit 5944ce092b97caed5d86d961e963b883b5c44ee2 Author: Pierre Gondois <pierre.gondois@arm.com> Date: Wed Jan 4 19:30:29 2023 +0100 arch_topology: Build cacheinfo from primary CPU commit 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection in the CPU hotplug path") adds a call to detect_cache_attributes() to populate the cacheinfo before updating the siblings mask. detect_cache_attributes() allocates memory and can take the PPTT mutex (on ACPI platforms). On PREEMPT_RT kernels, on secondary CPUs, this triggers a: 'BUG: sleeping function called from invalid context' [1] as the code is executed with preemption and interrupts disabled. The primary CPU was previously storing the cache information using the now removed (struct cpu_topology).llc_id: commit 5b8dc787ce4a ("arch_topology: Drop LLC identifier stash from the CPU topology") allocate_cache_info() tries to build the cacheinfo from the primary CPU prior secondary CPUs boot, if the DT/ACPI description contains cache information. If allocate_cache_info() fails, then fallback to the current state for the cacheinfo allocation. [1] will be triggered in such case. When unplugging a CPU, the cacheinfo memory cannot be freed. If it was, then the memory would be allocated early by the re-plugged CPU and would trigger [1]. Note that populate_cache_leaves() might be called multiple times due to populate_leaves being moved up. This is required since detect_cache_attributes() might be called with per_cpu_cacheinfo(cpu) being allocated but not populated. [1]: | BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 | in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/111 | preempt_count: 1, expected: 0 | RCU nest depth: 1, expected: 1 | 3 locks held by swapper/111/0: | #0: (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x218/0x12c8 | #1: (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x48/0xf0 | #2: (&zone->lock){+.+.}-{3:3}, at: rmqueue_bulk+0x64/0xa80 | irq event stamp: 0 | hardirqs last enabled at (0): 0x0 | hardirqs last disabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last enabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last disabled at (0): 0x0 | Preemption disabled at: | migrate_enable+0x30/0x130 | CPU: 111 PID: 0 Comm: swapper/111 Tainted: G W 6.0.0-rc4-rt6-[...] | Call trace: | __kmalloc+0xbc/0x1e8 | detect_cache_attributes+0x2d4/0x5f0 | update_siblings_masks+0x30/0x368 | store_cpu_topology+0x78/0xb8 | secondary_start_kernel+0xd0/0x198 | __secondary_switched+0xb0/0xb4 Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230104183033.755668-7-pierre.gondois@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Radu Rendec <rrendec@redhat.com>
2023-05-02 20:57:28 +00:00
pr_err("Early cacheinfo failed, ret = %d\n", ret);
return;
arch_topology: Build cacheinfo from primary CPU Bugzilla: https://bugzilla.redhat.com/2180619 commit 5944ce092b97caed5d86d961e963b883b5c44ee2 Author: Pierre Gondois <pierre.gondois@arm.com> Date: Wed Jan 4 19:30:29 2023 +0100 arch_topology: Build cacheinfo from primary CPU commit 3fcbf1c77d08 ("arch_topology: Fix cache attributes detection in the CPU hotplug path") adds a call to detect_cache_attributes() to populate the cacheinfo before updating the siblings mask. detect_cache_attributes() allocates memory and can take the PPTT mutex (on ACPI platforms). On PREEMPT_RT kernels, on secondary CPUs, this triggers a: 'BUG: sleeping function called from invalid context' [1] as the code is executed with preemption and interrupts disabled. The primary CPU was previously storing the cache information using the now removed (struct cpu_topology).llc_id: commit 5b8dc787ce4a ("arch_topology: Drop LLC identifier stash from the CPU topology") allocate_cache_info() tries to build the cacheinfo from the primary CPU prior secondary CPUs boot, if the DT/ACPI description contains cache information. If allocate_cache_info() fails, then fallback to the current state for the cacheinfo allocation. [1] will be triggered in such case. When unplugging a CPU, the cacheinfo memory cannot be freed. If it was, then the memory would be allocated early by the re-plugged CPU and would trigger [1]. Note that populate_cache_leaves() might be called multiple times due to populate_leaves being moved up. This is required since detect_cache_attributes() might be called with per_cpu_cacheinfo(cpu) being allocated but not populated. [1]: | BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 | in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/111 | preempt_count: 1, expected: 0 | RCU nest depth: 1, expected: 1 | 3 locks held by swapper/111/0: | #0: (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x218/0x12c8 | #1: (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x48/0xf0 | #2: (&zone->lock){+.+.}-{3:3}, at: rmqueue_bulk+0x64/0xa80 | irq event stamp: 0 | hardirqs last enabled at (0): 0x0 | hardirqs last disabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last enabled at (0): copy_process+0x5dc/0x1ab8 | softirqs last disabled at (0): 0x0 | Preemption disabled at: | migrate_enable+0x30/0x130 | CPU: 111 PID: 0 Comm: swapper/111 Tainted: G W 6.0.0-rc4-rt6-[...] | Call trace: | __kmalloc+0xbc/0x1e8 | detect_cache_attributes+0x2d4/0x5f0 | update_siblings_masks+0x30/0x368 | store_cpu_topology+0x78/0xb8 | secondary_start_kernel+0xd0/0x198 | __secondary_switched+0xb0/0xb4 Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230104183033.755668-7-pierre.gondois@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Radu Rendec <rrendec@redhat.com>
2023-05-02 20:57:28 +00:00
}
}
void store_cpu_topology(unsigned int cpuid)
{
struct cpu_topology *cpuid_topo = &cpu_topology[cpuid];
if (cpuid_topo->package_id != -1)
goto topology_populated;
cpuid_topo->thread_id = -1;
cpuid_topo->core_id = cpuid;
cpuid_topo->package_id = cpu_to_node(cpuid);
pr_debug("CPU%u: package %d core %d thread %d\n",
cpuid, cpuid_topo->package_id, cpuid_topo->core_id,
cpuid_topo->thread_id);
topology_populated:
update_siblings_masks(cpuid);
}
#endif