Commit Graph

156 Commits

Author SHA1 Message Date
Phil Auld 20a9e21c07 sched: Move sched domain name out of CONFIG_SCHED_DEBUG
JIRA: https://issues.redhat.com/browse/RHEL-23495

commit 1c055a0f5d3bafaca5d218bbb3e4e63d6307be45
Author: Swapnil Sapkal <swapnil.sapkal@amd.com>
Date:   Fri Dec 20 06:32:22 2024 +0000

    sched: Move sched domain name out of CONFIG_SCHED_DEBUG

    The /proc/schedstat file shows CPU and sched-domain-level scheduler
    statistics. It does not show the domain name, only the domain level.
    It would be very useful for tools like `perf sched stats`[1] to
    aggregate domain-level stats if domain names were shown in
    /proc/schedstat. But the sched domain name is guarded by
    CONFIG_SCHED_DEBUG. As per the discussion[2], move the sched domain
    name out of CONFIG_SCHED_DEBUG.

    [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
    [2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/
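
    A minimal sketch of the change being described (illustrative, not
    the exact diff): the 'name' field of struct sched_domain loses its
    CONFIG_SCHED_DEBUG guard and becomes unconditional:

        struct sched_domain {
                /* ... */
                char *name;     /* e.g. "SMT", "MC", "PKG", "NUMA" */
                /* ... */
        };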

    Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
    Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-19 13:37:35 +00:00
Phil Auld e8bf69e6e0 sched: Fix spelling in comments
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Dropped hunks in mm_cid code which we don't have. Minor
context diffs due to still having IA64 in tree and previous Kabi
workarounds.

commit 402de7fc880fef055bc984957454b532987e9ad0
Author: Ingo Molnar <mingo@kernel.org>
Date:   Mon May 27 16:54:52 2024 +0200

    sched: Fix spelling in comments

    Do a spell-checking pass.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 096e01219f sched/topology: Remove root_domain::max_cpu_capacity
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit fa427e8e53d8db15090af7e952a55870dc2a453f
Author: Qais Yousef <qyousef@layalina.io>
Date:   Sun Mar 24 00:45:51 2024 +0000

    sched/topology: Remove root_domain::max_cpu_capacity

    The value is no longer used as we now keep track of max_allowed_capacity
    for each task instead.

    Signed-off-by: Qais Yousef <qyousef@layalina.io>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:49 -04:00
Phil Auld b921e17c63 sched/topology: Export asym_cap_list
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 77222b0d12e8ae6f082261842174cc2e981bf99c
Author: Qais Yousef <qyousef@layalina.io>
Date:   Sun Mar 24 00:45:49 2024 +0000

    sched/topology: Export asym_cap_list

    So that we can use it to iterate through the available capacities in
    the system. Sort asym_cap_list in descending order, as the expected
    users are likely to be interested in the highest capacity first.

    Make the list RCU protected to allow for cheap access in hot paths.
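
    A hedged usage sketch of the exported, RCU-protected list (the
    element type and its 'link' member are assumptions based on the
    description above):

        struct asym_cap_data *entry;

        rcu_read_lock();
        list_for_each_entry_rcu(entry, &asym_cap_list, link)
                /* capacities arrive in descending order */;
        rcu_read_unlock();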

    Signed-off-by: Qais Yousef <qyousef@layalina.io>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:49 -04:00
Phil Auld 608b243988 sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Context diff in powerpc code due to not having aa80c6343fcf
("powerpc/smp: Enable Asym packing for cores on shared processor").

commit 54de442747037485da1fc4eca9636287a61e97e3
Author: Alex Shi <alexs@kernel.org>
Date:   Sat Feb 10 19:39:23 2024 +0800

    sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC

    SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that
    it's sharing all 'package resources' - while in reality it's specifically
    for sharing the LLC only.

    Rename it to SD_SHARE_LLC to reduce confusion.

    [ mingo: Rewrote the confusing changelog as well. ]

    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Alex Shi <alexs@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:47 -04:00
Phil Auld 3dfbbee9e0 sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit d654c8ddde84b9d1a30a40917e588b5a1e53dada
Author: Alex Shi <alexs@kernel.org>
Date:   Sat Feb 10 19:39:19 2024 +0800

    sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS

    These flags are already documented in include/linux/sched/sd_flags.h.

    Also, add missing SD_CLUSTER and keep the comment on SD_ASYM_PACKING
    as it is a special case.

    Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
    Signed-off-by: Alex Shi <alexs@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240210113924.1130448-1-alexs@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:46 -04:00
Phil Auld fbd1479673 sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
JIRA: https://issues.redhat.com/browse/RHEL-48226

commit a1fd0b9d751f840df23ef0e75b691fc00cfd4743
Author: Vitalii Bursov <vitaly@bursov.com>
Date:   Tue Apr 30 18:05:23 2024 +0300

    sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level

    Change the relax_domain_level checks so that it is possible to
    include or exclude all domains from newidle balancing.

    This matches the behavior described in the documentation:

      -1   no request. use system default or follow request of others.
       0   no search.
       1   search siblings (hyperthreads in a core).

    "2" enables levels 0 and 1, level_max excludes the last (level_max)
    level, and level_max+1 includes all levels.
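
    A hypothetical helper illustrating the documented semantics (the
    name and shape are illustrative, not the actual kernel code):

        /* A domain at 'level' participates in newidle balancing iff
         * level < relax_level; a relax_level of -1 defers to the
         * system default instead. */
        static bool newidle_balance_allowed(int level, int relax_level)
        {
                return level < relax_level;
        }

    With this reading, 0 excludes every level, 2 keeps levels 0 and 1,
    and level_max+1 keeps all levels.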

    Fixes: 1d3504fcf5 ("sched, cpuset: customize sched domains, core")
    Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-07-15 11:12:25 -04:00
Phil Auld b8e14fc25c sched/topology: Optimize topology_span_sane()
JIRA: https://issues.redhat.com/browse/RHEL-39277

commit 05037e5f0f17935a86861f9610f941ebf346a95e
Author: Kyle Meyer <kyle.meyer@hpe.com>
Date:   Wed Apr 10 16:33:11 2024 -0500

    sched/topology: Optimize topology_span_sane()

    Optimize topology_span_sane() by removing duplicate comparisons.

    Since topology_span_sane() is called inside of for_each_cpu(), each
    previous CPU has already been compared against every other CPU. The
    current CPU only needs to be compared against higher-numbered CPUs.

    The total number of comparisons is reduced from N * (N - 1) to
    N * (N - 1) / 2 on each non-NUMA scheduling domain level.
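
    A rough sketch of the resulting comparison pattern (illustrative,
    not the exact kernel loop): each CPU is checked only against
    higher-numbered CPUs, halving the number of pairwise checks:

        int i;

        for (i = cpu + 1; i < nr_cpu_ids; i++) {
                if (!cpumask_test_cpu(i, cpu_map))
                        continue;
                /* compare the domain spans of 'cpu' and 'i' here */
        }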

    Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Yury Norov <yury.norov@gmail.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-05-29 10:46:37 -04:00
Phil Auld 219d789d21 sched/fair: Scan cluster before scanning LLC in wake-up path
JIRA: https://issues.redhat.com/browse/RHEL-15622

commit 8881e1639f1f899b64e9bccf6cc14d51c1d3c822
Author: Barry Song <song.bao.hua@hisilicon.com>
Date:   Thu Oct 19 11:33:22 2023 +0800

    sched/fair: Scan cluster before scanning LLC in wake-up path

    On platforms with clusters, like Kunpeng920, CPUs within the same
    cluster have lower latency when synchronizing and accessing shared
    resources like cache. Thus, this patch tries to find an idle CPU
    within the cluster of the target CPU before scanning the whole LLC,
    to gain lower latency. This is implemented in 2 steps in
    select_idle_sibling():
    1. When prev_cpu/recent_used_cpu are good wakeup candidates, use them
       if they share a cluster with the target CPU. Otherwise, try to
       scan for an idle CPU in the target's cluster.
    2. Scan the cluster before the LLC of the target CPU for an idle CPU
       to wake up.

    Testing has been done on Kunpeng920 by pinning tasks to one NUMA node
    and to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters,
    and each cluster has 4 CPUs.

    With this patch, we noticed improvements in tbench and netperf within
    one NUMA node and across two NUMA nodes, on top of tip/sched/core
    commit 9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()")

    tbench results (node 0):
                baseline                     patched
      1:        327.2833        372.4623 (   13.80%)
      4:       1320.5933       1479.8833 (   12.06%)
      8:       2638.4867       2921.5267 (   10.73%)
     16:       5282.7133       5891.5633 (   11.53%)
     32:       9810.6733       9877.3400 (    0.68%)
     64:       7408.9367       7447.9900 (    0.53%)
    128:       6203.2600       6191.6500 (   -0.19%)
    tbench results (node 0-1):
                baseline                     patched
      1:        332.0433        372.7223 (   12.25%)
      4:       1325.4667       1477.6733 (   11.48%)
      8:       2622.9433       2897.9967 (   10.49%)
     16:       5218.6100       5878.2967 (   12.64%)
     32:      10211.7000      11494.4000 (   12.56%)
     64:      13313.7333      16740.0333 (   25.74%)
    128:      13959.1000      14533.9000 (    4.12%)

    netperf results TCP_RR (node 0):
                baseline                     patched
      1:      76546.5033      90649.9867 (   18.42%)
      4:      77292.4450      90932.7175 (   17.65%)
      8:      77367.7254      90882.3467 (   17.47%)
     16:      78519.9048      90938.8344 (   15.82%)
     32:      72169.5035      72851.6730 (    0.95%)
     64:      25911.2457      25882.2315 (   -0.11%)
    128:      10752.6572      10768.6038 (    0.15%)

    netperf results TCP_RR (node 0-1):
                baseline                     patched
      1:      76857.6667      90892.2767 (   18.26%)
      4:      78236.6475      90767.3017 (   16.02%)
      8:      77929.6096      90684.1633 (   16.37%)
     16:      77438.5873      90502.5787 (   16.87%)
     32:      74205.6635      88301.5612 (   19.00%)
     64:      69827.8535      71787.6706 (    2.81%)
    128:      25281.4366      25771.3023 (    1.94%)

    netperf results UDP_RR (node 0):
                baseline                     patched
      1:      96869.8400     110800.8467 (   14.38%)
      4:      97744.9750     109680.5425 (   12.21%)
      8:      98783.9863     110409.9637 (   11.77%)
     16:      99575.0235     110636.2435 (   11.11%)
     32:      95044.7250      97622.8887 (    2.71%)
     64:      32925.2146      32644.4991 (   -0.85%)
    128:      12859.2343      12824.0051 (   -0.27%)

    netperf results UDP_RR (node 0-1):
                baseline                     patched
      1:      97202.4733     110190.1200 (   13.36%)
      4:      95954.0558     106245.7258 (   10.73%)
      8:      96277.1958     105206.5304 (    9.27%)
     16:      97692.7810     107927.2125 (   10.48%)
     32:      79999.6702     103550.2999 (   29.44%)
     64:      80592.7413      87284.0856 (    8.30%)
    128:      27701.5770      29914.5820 (    7.99%)

    Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the
    SMT branch in the code has not been tested, but it is supposed to
    work.

    Chen Yu also noticed that this improves the performance of tbench and
    netperf on a 24-CPU Jacobsville machine, where the 4 CPUs in one
    cluster share the L2 cache.

    [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Tested-by: Yicong Yang <yangyicong@hisilicon.com>
    Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:04 -04:00
Phil Auld 26c251b772 sched: Add cpus_share_resources API
JIRA: https://issues.redhat.com/browse/RHEL-15622

commit b95303e0aeaf446b65169dd4142cacdaeb7d4c8b
Author: Barry Song <song.bao.hua@hisilicon.com>
Date:   Thu Oct 19 11:33:21 2023 +0800

    sched: Add cpus_share_resources API

    Add the cpus_share_resources() API. This is preparation for the
    optimization of select_idle_cpu() on platforms with a cluster
    scheduler level.

    On a machine with clusters, cpus_share_resources() will test whether
    two CPUs are within the same cluster. On a non-cluster machine it
    behaves the same as cpus_share_cache(). So we use "resources" here
    for cache resources.
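
    A short usage sketch (the surrounding wakeup logic is illustrative):

        /* Prefer the previous CPU if it shares a cluster (or, on
         * non-cluster machines, the LLC) with the target. */
        if (cpus_share_resources(prev_cpu, target))
                return prev_cpu;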

    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:46:43 -04:00
Phil Auld 38176213a7 sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit f2273f4e19e29f7d0be6a2393f18369cd1b496c8
Author: Ingo Molnar <mingo@kernel.org>
Date:   Mon Oct 9 17:31:26 2023 +0200

    sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h

    Move it out of the .c file into the shared scheduler-internal header file,
    to gain type-checking.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:57 -04:00
Phil Auld 137e304223 sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based on the platform
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 8f833c82cdab7b4049bcfe88311d35fa5f24e422
Author: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Date:   Mon Oct 9 11:30:37 2023 +0530

    sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based on the platform

    The 'sched_energy_aware' sysctl is available for the admin to
    disable/enable energy aware scheduling (EAS). EAS is enabled only if
    a few conditions are met by the platform: asymmetric CPU capacity,
    no SMT, the schedutil CPUfreq governor, frequency-invariant load
    tracking, etc. A platform may boot without EAS capability but gain
    such capability at runtime, for example by changing/registering the
    cpufreq governor to schedutil.

    At present, even when the platform doesn't support EAS, this sysctl
    returns 1; writing 1 ends up calling build_perf_domains() and
    writing 0 is a NOP. That is confusing and unnecessary.

    The desired behavior is for this sysctl to enable/disable EAS on
    supported platforms. On unsupported platforms a write to the sysctl
    returns a "not supported" error, and a read returns empty:

     - sched_energy_aware reads empty: EAS is not possible at this
       moment. This includes EAS-capable platforms which have at least
       one EAS condition false during startup, e.g. not using the
       schedutil cpufreq governor.
     - sched_energy_aware reads 0: EAS is supported but disabled by the
       admin.
     - sched_energy_aware reads 1: EAS is supported and enabled.

    Users can find out why EAS is not possible by checking the info
    messages. sched_is_eas_possible() returns true if the platform can
    do EAS at this moment.
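
    A rough sketch of the behaviour described above (hypothetical
    proc-handler logic, not the actual patch):

        if (!sched_is_eas_possible(cpu_active_mask)) {
                if (write)
                        return -EOPNOTSUPP;     /* not supported */
                *lenp = 0;                      /* read returns empty */
                return 0;
        }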

    Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Pierre Gondois <pierre.gondois@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:57 -04:00
Phil Auld 4d6985f12b sched/topology: Remove the EM_MAX_COMPLEXITY limit
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 5b77261c5510f1e6f4d359e97dd3e39ee7259c3d
Author: Pierre Gondois <Pierre.Gondois@arm.com>
Date:   Mon Oct 9 11:30:36 2023 +0530

    sched/topology: Remove the EM_MAX_COMPLEXITY limit

    The Energy Aware Scheduler (EAS) estimates the energy consumption
    of placing a task on different CPUs. The goal is to minimize this
    energy consumption. Estimating the energy of different task placements
    is increasingly complex with the size of the platform.

    To avoid having a slow wake-up path, EAS is only enabled if this
    complexity is low enough.

    The current complexity limit was set in:

      b68a4c0dba ("sched/topology: Disable EAS on inappropriate platforms")

    ... based on the first implementation of EAS, which was re-computing
    the power of the whole platform for each task placement scenario, see:

      390031e4c3 ("sched/fair: Introduce an energy estimation helper function")

    ... but the complexity of EAS was reduced in:

      eb92692b25 ("sched/fair: Speed-up energy-aware wake-ups")

    ... and find_energy_efficient_cpu() (feec) algorithm was updated in:

      3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")

    find_energy_efficient_cpu() (feec) is now doing:

            feec()
            \_ for_each_pd(pd) [0]
              // get max_spare_cap_cpu and compute_prev_delta
              \_ for_each_cpu(pd) [1]

              \_ eenv_pd_busy_time(pd) [2]
                    \_ for_each_cpu(pd)

              // compute_energy(pd) without the task
              \_ eenv_pd_max_util(pd, -1) [3.0]
                \_ for_each_cpu(pd)
              \_ em_cpu_energy(pd, -1)
                \_ for_each_ps(pd)

              // compute_energy(pd) with the task on prev_cpu
              \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
                \_ for_each_cpu(pd)
              \_ em_cpu_energy(pd, prev_cpu)
                \_ for_each_ps(pd)

              // compute_energy(pd) with the task on max_spare_cap_cpu
              \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
                \_ for_each_cpu(pd)
              \_ em_cpu_energy(pd, max_spare_cap_cpu)
                \_ for_each_ps(pd)

            [3.1] happens only once since prev_cpu is unique. With the same
                  definitions for nr_pd, nr_cpus and nr_ps, the complexity is:

                    nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
                    + ([nr_cpus in pd] + [nr_ps in pd])

                     [0]  * (     [1] + [2]      +       [3.0] + [3.2]                  )
                    + [3.1]

                    = nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
                    + [nr_cpus in prev pd] + nr_ps

    The complexity limit was set to 2048 in:

      b68a4c0dba ("sched/topology: Disable EAS on inappropriate platforms")

    ... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
    performance states each". For the same platform, the complexity would
    actually be:

      16 * (4 + 2 * 7) + 1 + 7 = 296

    Since the EAS complexity was greatly reduced since the limit was
    introduced, bigger platforms can handle EAS.

    For instance, a platform with 112 CPUs with 7 performance states
    each would not reach it:

      112 * (4 + 2 * 7) + 1 + 7 = 2024

    To reflect this improvement in the underlying EAS code, remove
    the EAS complexity check.

    Note that a limit on the number of CPUs still holds against
    EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.

    [ mingo: Updates to the changelog. ]

    Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld 49d1b3f5c9 sched/topology: Consolidate and clean up access to a CPU's max compute capacity
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Oct 9 12:36:16 2023 +0200

    sched/topology: Consolidate and clean up access to a CPU's max compute capacity

    Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
    instead.

    The scheduler uses 3 methods to get access to a CPU's max compute capacity:

     - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.

     - cpu_capacity_orig field which is periodically updated with
       arch_scale_cpu_capacity().

     - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.

    There is no real need to save the value returned by arch_scale_cpu_capacity()
    in struct rq. arch_scale_cpu_capacity() returns:

     - either a per_cpu variable.

     - or a const value for systems which have only one capacity.

    Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
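
    A minimal before/after sketch of the consolidation (illustrative
    fragments):

        /* before: read the copy cached in struct rq */
        unsigned long cap = capacity_orig_of(cpu);

        /* after: query the architecture hook directly */
        unsigned long cap = arch_scale_cpu_capacity(cpu);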

    No functional changes.

    Some performance tests on Arm64:

      - small SMP device (hikey): no noticeable changes
      - HMP device (RB5):         hackbench shows minor improvement (1-2%)
      - large smp (thx2):         hackbench and tbench shows minor improvement (1%)

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:50 -04:00
Prarit Bhargava 209997995a sched/topology: Rename 'DIE' domain to 'PKG'
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit f577cd57bfaa889cf0718e30e92c08c7f78c9d85
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 12 16:10:56 2023 +0200

    sched/topology: Rename 'DIE' domain to 'PKG'

    While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
    for the package mask. :-)

    Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make
    them less ambiguous.

    [ Shrikanth Hegde: rename on s390 as well. ]
    [ Valentin Schneider: also rename it in the comments. ]
    [ mingo: port to recent kernels & find all remaining occurrences. ]

    Reported-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Valentin Schneider <vschneid@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Heiko Carstens <hca@linux.ibm.com>
    Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:30 -04:00
Scott Weaver 1cd6d5c501 Merge: Sched: Handle Asymmetric scheduler domains better
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3466

JIRA: https://issues.redhat.com/browse/RHEL-17497
Tested: Tested topology changes on several x86 systems and scheduler
performance and sanity tests. IBM will test the regression on the z
systems which exhibited the issue.

Fix performance issues in handling asymmetric scheduler
domains. This includes some fixes for load balancing across
SMT domains and some underlying code for x86 topology detection.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Dean Nelson <dnelson@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-08 12:05:44 -05:00
Phil Auld 559b34b006 sched/topology: Record number of cores in sched group
JIRA: https://issues.redhat.com/browse/RHEL-17497

commit d24cb0d9113f5932b8832533ce82351b5911ed50
Author: Tim C Chen <tim.c.chen@linux.intel.com>
Date:   Fri Jul 7 15:57:01 2023 -0700

    sched/topology: Record number of cores in sched group

    When balancing sibling domains that have different numbers of cores,
    the number of tasks in each sibling domain should be proportional to
    its number of cores. In preparation for implementing such a policy,
    record the number of cores in a scheduling group.
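
    A hedged sketch of the recorded field (its placement inside the
    struct is illustrative):

        struct sched_group {
                /* ... */
                unsigned int cores;     /* number of cores in the group */
                /* ... */
        };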

    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-12-06 09:31:10 -05:00
Phil Auld 18c8a22fb9 sched/topology: Fix sched_numa_find_nth_cpu() comment
JIRA: https://issues.redhat.com/browse/RHEL-17580

commit 6d08ad2166f7770341ea56afad45fa41cd16ae62
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 19 07:12:38 2023 -0700

    sched/topology: Fix sched_numa_find_nth_cpu() comment

    Reword sched_numa_find_nth_cpu() comment and make it kernel-doc compatible.

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20230819141239.287290-7-yury.norov@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-11-29 10:41:41 -05:00
Phil Auld b3d49aa01c sched/topology: Handle NUMA_NO_NODE in sched_numa_find_nth_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-17580

commit 9ecea9ae4d3127a09fb5dfcea87f248937a39ff5
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 19 07:12:37 2023 -0700

    sched/topology: Handle NUMA_NO_NODE in sched_numa_find_nth_cpu()

    sched_numa_find_nth_cpu() doesn't handle NUMA_NO_NODE properly and
    may crash the kernel if passed it. On the other hand, the only user
    of sched_numa_find_nth_cpu() has to check for the NUMA_NO_NODE case
    explicitly.

    It would be easier for users if this logic were moved into
    sched_numa_find_nth_cpu().

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20230819141239.287290-6-yury.norov@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-11-29 10:41:32 -05:00
Phil Auld 02f212ad1e sched/topology: Fix sched_numa_find_nth_cpu() in CPU-less case
JIRA: https://issues.redhat.com/browse/RHEL-17580

commit 617f2c38cb5ce60226042081c09e2ee3a90d03f8
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 19 07:12:35 2023 -0700

    sched/topology: Fix sched_numa_find_nth_cpu() in CPU-less case

    When the node provided by the user is CPU-less, the corresponding
    record in sched_domains_numa_masks is not set. Trying to dereference
    it in the following code leads to a kernel crash.

    To avoid this, start searching from the nearest node with CPUs.

    Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
    Reported-by: Yicong Yang <yangyicong@hisilicon.com>
    Reported-by: Guenter Roeck <linux@roeck-us.net>
    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20230819141239.287290-4-yury.norov@gmail.com

    Closes: https://lore.kernel.org/lkml/CAAH8bW8C5humYnfpW3y5ypwx0E-09A3QxFE1JFzR66v+mO4XfA@mail.gmail.com/T/
    Closes: https://lore.kernel.org/lkml/ZMHSNQfv39HN068m@yury-ThinkPad/T/#mf6431cb0b7f6f05193c41adeee444bc95bf2b1c4

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-11-29 10:41:25 -05:00
Phil Auld 3d7687bbd0 sched/topology: Align group flags when removing degenerate domain
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 4efcc8bc7e08c09c58a2f5cbc2096fbda5b7cf5e
Author: Chen Yu <yu.c.chen@intel.com>
Date:   Thu Jul 13 09:31:33 2023 +0800

    sched/topology: Align group flags when removing degenerate domain

    The flags of the child of a given scheduling domain are used to initialize
    the flags of its scheduling groups. When the child of a scheduling domain
    is degenerated, the flags of its local scheduling group need to be updated
    to align with the flags of its new child domain.

    The SD_SHARE_CPUCAPACITY flag was aligned in commit bf2dc42d6beb
    ("sched/topology: Propagate SMT flags when removing degenerate domain").
    Further generalize this alignment so that other flags can be used
    later, such as in cluster-based task wakeup. [1]

    Reported-by: Yicong Yang <yangyicong@huawei.com>
    Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
    Signed-off-by: Chen Yu <yu.c.chen@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
    Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 57aa0597d5 sched/core: Fixed missing rq clock update before calling set_rq_offline()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit cab3ecaed5cdcc9c36a96874b4c45056a46ece45
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Tue Jun 13 16:20:09 2023 +0800

    sched/core: Fixed missing rq clock update before calling set_rq_offline()

    When using a cpufreq governor that uses
    cpufreq_add_update_util_hook(), it is possible to trigger a missing
    update_rq_clock() warning for the CPU hotplug path:

      rq_attach_root()
        set_rq_offline()
          rq_offline_rt()
            __disable_runtime()
              sched_rt_rq_enqueue()
                enqueue_top_rt_rq()
                  cpufreq_update_util()
                    data->func(data, rq_clock(rq), flags)

    Move update_rq_clock() from sched_cpu_deactivate() (one of its
    callers) into set_rq_offline() such that it covers all
    set_rq_offline() usage.

    Additionally change rq_attach_root() to use rq_lock_irqsave() so that
    it will properly manage the runqueue clock flags.
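
    A hedged sketch of the resulting shape (illustrative, with the
    existing offline work elided):

        static void set_rq_offline(struct rq *rq)
        {
                update_rq_clock(rq);    /* now covers every caller */
                /* ... existing offline handling ... */
        }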

    Suggested-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 97c7797259 sched/topology: Mark set_sched_topology() __init
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 0cce0fde499a92c726cd2e24f7763644f7c9f971
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jun 3 15:36:45 2023 +0800

    sched/topology: Mark set_sched_topology() __init

    All callers of set_sched_topology() are within the __init section,
    so mark it __init too.
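
    The annotation in question, sketched on the declaration:

        void __init set_sched_topology(struct sched_domain_topology_level *tl);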

    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 191e2fbf4b sched/topology: Propagate SMT flags when removing degenerate domain
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit bf2dc42d6beb890c995b8b09f881ef1b37259107
Author: Tim C Chen <tim.c.chen@linux.intel.com>
Date:   Thu May 4 09:09:51 2023 -0700

    sched/topology: Propagate SMT flags when removing degenerate domain

    When a degenerate cluster domain for a core with SMT CPUs is removed,
    the SD_SHARE_CPUCAPACITY flag in the local child sched group is not
    propagated to the new parent.  We need this flag to properly determine
    whether the local sched group is SMT.  Set the flag in the local
    child sched group of the new parent sched domain.

    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
    Link: https://lkml.kernel.org/r/73cf0959eafa53c02e7ef6bf805d751d9190e55d.1683156492.git.tim.c.chen@linux.intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 0bf13a5d2a sched/topology: Make sched_energy_mutex,update static
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit d91e15a21d4b3823ce93a42b05f0d171689f4e6a
Author: Tom Rix <trix@redhat.com>
Date:   Tue Mar 14 10:48:18 2023 -0400

    sched/topology: Make sched_energy_mutex,update static

    smatch reports:

    kernel/sched/topology.c:212:1: warning:
      symbol 'sched_energy_mutex' was not declared. Should it be static?
    kernel/sched/topology.c:213:6: warning:
      symbol 'sched_energy_update' was not declared. Should it be static?

    These variables are only used in topology.c, so they should be
    static.
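
    The fix amounts to giving both symbols internal linkage (sketch):

        static DEFINE_MUTEX(sched_energy_mutex);
        static bool sched_energy_update;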

    Signed-off-by: Tom Rix <trix@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230314144818.1453523-1-trix@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 46f1f2ceaf sched/topology: Add __init for sched_init_domains()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit ef90cf2281a013d359d24d51732af990badf6e03
Author: Bing Huang <huangbing@kylinos.cn>
Date:   Thu Jan 5 09:49:43 2023 +0800

    sched/topology: Add __init for sched_init_domains()

    sched_init_domains() is only used during initialization.

    Signed-off-by: Bing Huang <huangbing@kylinos.cn>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230105014943.9857-1-huangbing775@126.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld af9be13f8c sched/topology: Add __init for init_defrootdomain
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 9a5322db46332a4ce42369e86f031b5e963d841c
Author: Bing Huang <huangbing@kylinos.cn>
Date:   Fri Nov 18 11:42:08 2022 +0800

    sched/topology: Add __init for init_defrootdomain

    init_defrootdomain() is only used during initialization.

    Signed-off-by: Bing Huang <huangbing@kylinos.cn>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lkml.kernel.org/r/20221118034208.267330-1-huangbing775@126.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld b89cdaef13 sched/topology: fix KASAN warning in hop_cmp()
JIRA: https://issues.redhat.com/browse/RHEL-318

commit 01bb11ad828b320749764fa93ad078db20d08a9e
Author: Yury Norov <yury.norov@gmail.com>
Date:   Thu Feb 16 17:39:08 2023 -0800

    sched/topology: fix KASAN warning in hop_cmp()

    Although prev_hop is used only when cur_hop is not the first hop,
    it is initialized unconditionally.

    Because initialization implies dereferencing, the code may
    dereference uninitialized memory, which has been spotted by KASAN.
    Fix it by reorganizing the hop_cmp() logic.

    Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
    Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Link: https://lore.kernel.org/r/Y+7avK6V9SyAWsXi@yury-laptop/
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-03-30 10:24:35 -04:00
Phil Auld 87253b2abc sched/topology: Introduce sched_numa_hop_mask()
JIRA: https://issues.redhat.com/browse/RHEL-318

commit 9feae65845f7b16376716fe70b7d4b9bf8721848
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri Jan 20 20:24:33 2023 -0800

    sched/topology: Introduce sched_numa_hop_mask()

    Tariq has pointed out that drivers allocating IRQ vectors would
    benefit from smarter NUMA-awareness: cpumask_local_spread() only
    knows about the local node, and everything outside it lands in the
    same bucket.

    sched_domains_numa_masks is pretty much what we want to hand out (a
    cpumask of CPUs reachable within a given distance budget), so
    introduce sched_numa_hop_mask() to export those cpumasks.
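
    A hedged usage sketch (assuming the helper returns an ERR_PTR once
    the hop count is exhausted and requires RCU read-side protection):

        const struct cpumask *mask;
        unsigned int hop, cpu;

        rcu_read_lock();
        for (hop = 0; ; hop++) {
                mask = sched_numa_hop_mask(node, hop);
                if (IS_ERR(mask))
                        break;  /* past the last hop */
                for_each_cpu(cpu, mask)
                        ;       /* e.g. hand out an IRQ vector */
        }
        rcu_read_unlock();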

    Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-03-30 10:24:35 -04:00
Phil Auld bf668d6ec9 sched: add sched_numa_find_nth_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-318

commit cd7f55359c90a4108e6528e326b8623fce1ad72a
Author: Yury Norov <yury.norov@gmail.com>
Date:   Fri Jan 20 20:24:30 2023 -0800

    sched: add sched_numa_find_nth_cpu()

    The function finds the Nth set CPU in a given cpumask, starting from
    a given node.

    Leveraging the fact that each hop in sched_domains_numa_masks
    includes the same or a greater number of CPUs than the previous one,
    we can use a binary search over hops instead of a linear walk, which
    makes the overall complexity O(log n) in terms of the number of
    cpumask_weight() calls.
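
    A hedged usage sketch (the parameter order is an assumption based on
    the description above):

        /* Nth set CPU in 'cpus', counted from those nearest to 'node'. */
        int cpu = sched_numa_find_nth_cpu(cpu_online_mask, n, node);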

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Acked-by: Tariq Toukan <tariqt@nvidia.com>
    Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
    Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-03-30 10:24:34 -04:00
Phil Auld 5cc3f6ed92 sched: Move energy_aware sysctls to topology.c
Bugzilla: https://bugzilla.redhat.com/2115520

commit 8a0441415b3f9b9a920a6a5086580ea3daa7b884
Author: Zhen Ni <nizhen@uniontech.com>
Date:   Tue Feb 15 19:46:04 2022 +0800

    sched: Move energy_aware sysctls to topology.c

    Move the energy_aware sysctls to topology.c and use the new
    register_sysctl_init() to register the sysctl interface.
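
    A hedged sketch of the registration pattern being described (the
    handler and variable names are illustrative):

        static struct ctl_table sched_energy_aware_sysctls[] = {
                {
                        .procname     = "sched_energy_aware",
                        .data         = &sysctl_sched_energy_aware,
                        .maxlen       = sizeof(unsigned int),
                        .mode         = 0644,
                        .proc_handler = sched_energy_aware_handler,
                },
                {}
        };

        static int __init sched_energy_aware_sysctl_init(void)
        {
                register_sysctl_init("kernel", sched_energy_aware_sysctls);
                return 0;
        }
        late_initcall(sched_energy_aware_sysctl_init);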

    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld cd40c28edf sched/numa: Adjust imb_numa_nr to a better approximation of memory channels
Bugzilla: https://bugzilla.redhat.com/2110021

commit 026b98a93bbdbefb37ab8008df84e38e2fedaf92
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri May 20 11:35:19 2022 +0100

    sched/numa: Adjust imb_numa_nr to a better approximation of memory channels

    For a single LLC per node, a NUMA imbalance is allowed until up to
    25% of the CPUs sharing a node could be active. One intent of the
    cut-off is to avoid an imbalance of memory channels, but there is no
    topological information based on active memory channels. Furthermore,
    there can be differences between nodes depending on the number of
    populated DIMMs.

    A cut-off of 25% was arbitrary but generally worked. It does have
    severe corner cases though, such as when a parallel workload using
    25% of all available CPUs over-saturates the memory channels. This
    can happen due to the initial forking of tasks that get pulled more
    to one node after early wakeups (e.g. a barrier synchronisation) that
    are not quickly corrected by the load balancer. The LB may fail to
    act quickly as the parallel tasks are considered to be poor migrate
    candidates due to locality or cache hotness.

    On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
    assuming all memory channels are populated, and is used as the new
    cut-off point. A minimum of 1 is specified to allow a communicating
    pair to remain local even for CPUs with low numbers of cores. Modern
    AMD systems have multiple LLCs per node and are not affected.
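
    A one-line sketch of the new cut-off (the variable names are
    illustrative):

        /* Allow up to 1/8th (12.5%) of a node's CPUs to be imbalanced,
         * with a floor of 1 so a communicating pair can stay local. */
        imb_numa_nr = max(1U, cpus_per_node / 8);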

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-09-01 09:11:17 -04:00
Patrick Talbert 229eaccaf0 Merge: Scheduler header clean up
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/710

Scheduler header clean up

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2069275
Upstream Status: Linux
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2062831
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2065222
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2065226
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2065994

This is the first chunk of the fast headers rework. It covers the
scheduler header files and will both speed up builds and help
with future backports.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-12 09:28:09 +02:00
Patrick Talbert d46e36b09c Merge: sched/isolation: Split housekeeping cpumask per isolation features
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/671

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065222
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2065994
Tested: Setup isolation and ran scheduler tests, checked that housekeeping
looked right (tasks offloaded from isolated cpus to HK ones etc).

Split the housekeeping flags into finer granularity in preparation
for allowing them to be configured dynamically. There should not be
much functional change.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Paolo Bonzini <bonzini@gnu.org>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-11 08:42:56 +02:00
Patrick Talbert dbecc7b791 Merge: Scheduler updates and fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/627

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2062831
Upstream Status: Linux
Tested: By me with a series of scheduler stress and performance tests

Scheduler updates and fixes from 5.17 and 5.18-rc1. This series
keeps the scheduler up to date and addresses a handful of potential
issues.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-04-28 10:01:36 +02:00
Phil Auld deee3a961c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there
Bugzilla: http://bugzilla.redhat.com/2069275

commit 801c141955108fb7cf1244dda76e6de8b16fd3ae
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Feb 22 13:23:24 2022 +0100

    sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there

    Collect all utility functionality source code files into a single kernel/sched/build_utility.c file,
    via #include-ing the .c files:

        kernel/sched/clock.c
        kernel/sched/completion.c
        kernel/sched/loadavg.c
        kernel/sched/swait.c
        kernel/sched/wait_bit.c
        kernel/sched/wait.c

    CONFIG_CPU_FREQ:
        kernel/sched/cpufreq.c

    CONFIG_CPU_FREQ_GOV_SCHEDUTIL:
        kernel/sched/cpufreq_schedutil.c

    CONFIG_CGROUP_CPUACCT:
        kernel/sched/cpuacct.c

    CONFIG_SCHED_DEBUG:
        kernel/sched/debug.c

    CONFIG_SCHEDSTATS:
        kernel/sched/stats.c

    CONFIG_SMP:
       kernel/sched/cpupri.c
       kernel/sched/stop_task.c
       kernel/sched/topology.c

    CONFIG_SCHED_CORE:
       kernel/sched/core_sched.c

    CONFIG_PSI:
       kernel/sched/psi.c

    CONFIG_MEMBARRIER:
       kernel/sched/membarrier.c

    CONFIG_CPU_ISOLATION:
       kernel/sched/isolation.c

    CONFIG_SCHED_AUTOGROUP:
       kernel/sched/autogroup.c

    The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into
    a single build unit.

    The build time of build_utility.c also roughly matches the build time of core.c and
    fair.c - allowing better load-balancing of scheduler-only rebuilds.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 17:38:21 -04:00
Phil Auld 233aa69d39 Merge remote-tracking branch 'origin/merge-requests/671' into bz2069275
Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 12:37:07 -04:00
Phil Auld 79db208833 sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
Bugzilla: http://bugzilla.redhat.com/2065198

commit 7f434dff76215af00c26ba6449eaa4738fe9e2ab
Author: K Prateek Nayak <kprateek.nayak@amd.com>
Date:   Fri Feb 18 21:57:43 2022 +0530

    sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains

    While investigating the sparse warning reported by the LKP bot [1],
    I observed that we have a redundant variable "top" in the function
    build_sched_domains() that was introduced in the recent commit
    e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when
    SD_NUMA spans multiple LLCs").

    The existing variable "sd" suffices, which allows us to remove the
    redundant variable "top" while annotating the other variable "top_p"
    with the "__rcu" annotation to silence the sparse warning.

    [1] https://lore.kernel.org/lkml/202202170853.9vofgC3O-lkp@intel.com/

    Fixes: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lore.kernel.org/r/20220218162743.1134-1-kprateek.nayak@amd.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-05 08:04:24 -04:00
Phil Auld 024ae775f1 sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Bugzilla: http://bugzilla.redhat.com/2065198
Conflicts: Minor fuzz due to still having cpu_util().

commit e496132ebedd870b67f1f6d2428f9bb9d7ae27fd
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Feb 8 09:43:34 2022 +0000

    sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

    Commit 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA
    nodes") allowed an imbalance between NUMA nodes such that communicating
    tasks would not be pulled apart by the load balancer. This works fine when
    there is a 1:1 relationship between LLC and node but can be suboptimal
    for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

    Zen* has multiple LLCs per node with local memory channels and due to
    the allowed imbalance, it's far harder to tune some workloads to run
    optimally than it is on hardware that has 1 LLC per node. This patch
    allows an imbalance to exist up to the point where LLCs should be balanced
    between nodes.

    On a Zen3 machine running STREAM parallelised with OMP to have one
    instance per LLC, and without binding, the results are:

                                5.17.0-rc0             5.17.0-rc0
                                   vanilla       sched-numaimb-v6
    MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
    MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
    MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
    MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

    STREAM can use directives to force the spread if the OpenMP runtime
    is new enough, but that doesn't help if an application uses threads
    and it's not known in advance how many threads will be created.

    Coremark is a CPU- and cache-intensive benchmark parallelised with
    threads. When running with 1 thread per core, the vanilla kernel
    allows threads to contend on cache. With the patch:

                                   5.17.0-rc0             5.17.0-rc0
                                      vanilla       sched-numaimb-v5
    Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
    Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
    Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
    Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
    CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)

    It can also make a big difference for semi-realistic workloads
    like specjbb which can execute arbitrary numbers of threads without
    advance knowledge of how they should be placed. Even in cases where
    the average performance is neutral, the results are more stable.

                                   5.17.0-rc0             5.17.0-rc0
                                      vanilla       sched-numaimb-v6
    Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
    Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
    Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
    Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
    Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
    Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
    Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
    Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
    Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
    Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
    Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
    Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
    Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
    Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
    Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
    Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
    Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
    Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
    Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
    Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
    Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
    Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
    Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
    Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
    Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
    Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
    Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
    Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
    Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
    Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
    Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
    Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
    Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
    Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)

    Similarly, for embarrassingly parallel problems like NPB-ep, there
    are improvements due to better spreading across LLCs when the machine
    is not fully utilised.

                                  vanilla       sched-numaimb-v6
    Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
    Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
    Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
    CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
    Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-05 08:04:24 -04:00
Phil Auld 1cf795c344 sched/isolation: Use single feature type while referring to housekeeping cpumask
Bugzilla: http://bugzilla.redhat.com/2065222

commit 04d4e665a60902cf36e7ad39af1179cb5df542ad
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon Feb 7 16:59:06 2022 +0100

    sched/isolation: Use single feature type while referring to housekeeping cpumask

    Refer to housekeeping APIs using single feature types instead of
    flags. This prevents passing multiple isolation features at once to
    the housekeeping interfaces, which soon won't be possible anymore as
    each isolation feature will have its own cpumask.
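
    A short sketch of the calling convention after the change (a single
    feature type per call rather than OR-ed flags):

        const struct cpumask *mask = housekeeping_cpumask(HK_TYPE_RCU);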

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-31 10:40:39 -04:00
Phil Auld a5be4d79e1 sched/numa: Fix NUMA topology for systems with CPU-less nodes
Bugzilla: http://bugzilla.redhat.com/2062831

commit 0fb3978b0aac3a5c08637aed03cc2d65f793508f
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 14 20:15:52 2022 +0800

    sched/numa: Fix NUMA topology for systems with CPU-less nodes

    The NUMA topology parameters (sched_numa_topology_type,
    sched_domains_numa_levels, sched_max_numa_distance, etc.)
    identified by the scheduler may be wrong for systems with CPU-less
    nodes.

    For example, the ACPI SLIT of a system with CPU-less persistent
    memory (Intel Optane DCPMM) nodes is as follows,

    [000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
    [004h 0004   4]                 Table Length : 0000042C
    [008h 0008   1]                     Revision : 01
    [009h 0009   1]                     Checksum : 59
    [00Ah 0010   6]                       Oem ID : "XXXX"
    [010h 0016   8]                 Oem Table ID : "XXXXXXX"
    [018h 0024   4]                 Oem Revision : 00000001
    [01Ch 0028   4]              Asl Compiler ID : "INTL"
    [020h 0032   4]        Asl Compiler Revision : 20091013

    [024h 0036   8]                   Localities : 0000000000000004
    [02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
    [030h 0048   4]                 Locality   1 : 15 0A 1C 11
    [034h 0052   4]                 Locality   2 : 11 1C 0A 1C
    [038h 0056   4]                 Locality   3 : 1C 11 1C 0A

    While the `numactl -H` output is as follows,

    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
    node 0 size: 64136 MB
    node 0 free: 5981 MB
    node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 1 size: 64466 MB
    node 1 free: 10415 MB
    node 2 cpus:
    node 2 size: 253952 MB
    node 2 free: 253920 MB
    node 3 cpus:
    node 3 size: 253952 MB
    node 3 free: 253951 MB
    node distances:
    node   0   1   2   3
      0:  10  21  17  28
      1:  21  10  28  17
      2:  17  28  10  28
      3:  28  17  28  10

    In this system, there are only 2 sockets.  In each memory controller,
    both DRAM and PMEM DIMMs are installed.  Although the physical NUMA
    topology is simple, the logical NUMA topology becomes a little
    complex.  Because both distance(0, 1) and distance(1, 3) are less
    than distance(0, 3), it appears that node 1 sits between node 0 and
    node 3, making the whole system appear to be of the glueless mesh
    NUMA topology type.  But it definitely is not: there is not even a
    CPU in node 3.

    This isn't a practical problem yet, because the PMEM nodes (node 2
    and node 3 in the example system) are offlined by default during
    system boot, so init_numa_topology_type() called during system boot
    ignores them and sets sched_numa_topology_type to NUMA_DIRECT.  At
    runtime, init_numa_topology_type() is only called when a CPU of a
    never-onlined-before node gets plugged in, and there is no CPU in
    the PMEM nodes.  But it appears better to fix this to make the code
    more robust.

    To test the potential problem, we used a debug patch that calls
    init_numa_topology_type() when a PMEM node is onlined (in
    __set_migration_target_nodes()).  With that, the NUMA parameters
    identified by the scheduler are as follows:

    sched_numa_topology_type:       NUMA_GLUELESS_MESH
    sched_domains_numa_levels:      4
    sched_max_numa_distance:        28

    To fix the issue, CPU-less nodes are ignored when the NUMA topology
    parameters are identified.  Because a node may become CPU-less or
    gain CPUs at run time via CPU hotplug, the NUMA topology parameters
    also need to be re-initialized at runtime on CPU hotplug.
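
    A minimal sketch of the idea (not the exact upstream code): only
    walk nodes that actually have CPUs when deriving the topology
    parameters, e.g. by relying on the N_CPU node state:

        int node;

        /* nodes with CPUs are tracked in the N_CPU node state */
        for_each_node_state(node, N_CPU) {
                /* ... record distances/levels for this node ... */
        }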

    With the patch, the NUMA parameters identified for the example system
    above are as follows:

    sched_numa_topology_type:       NUMA_DIRECT
    sched_domains_numa_levels:      2
    sched_max_numa_distance:        21

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
Phil Auld 6c8a46d512 sched: replace cpumask_weight with cpumask_empty where appropriate
Bugzilla: http://bugzilla.redhat.com/2062831

commit 1087ad4e3f88c474b8134a482720782922bf3fdf
Author: Yury Norov <yury.norov@gmail.com>
Date:   Thu Feb 10 14:49:06 2022 -0800

    sched: replace cpumask_weight with cpumask_empty where appropriate

    In some places, kernel/sched code calls cpumask_weight() to check if
    any bit of a given cpumask is set. We can do it more efficiently with
    cpumask_empty() because cpumask_empty() stops traversing the cpumask as
    soon as it finds the first set bit, while cpumask_weight() counts all
    bits unconditionally.
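
    For example (illustrative only), a check such as:

        if (!cpumask_weight(sched_group_span(group)))
                return;

    becomes:

        if (cpumask_empty(sched_group_span(group)))
                return;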

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
Phil Auld 0b20e27c69 mm: move node_reclaim_distance to fix NUMA without SMP
Bugzilla: http://bugzilla.redhat.com/2020279

commit 61bb6cd2f765b90cfc5f0f91696889c366a6a13d
Author: Geert Uytterhoeven <geert+renesas@glider.be>
Date:   Fri Nov 5 13:40:24 2021 -0700

    mm: move node_reclaim_distance to fix NUMA without SMP

    Patch series "Fix NUMA without SMP".

    SuperH is the only architecture which still supports NUMA without SMP,
    for good reasons (various memories scattered around the address space,
    each with varying latencies).

    This series fixes two build errors due to variables and functions used
    by the NUMA code being provided by SMP-only source files or sections.

    This patch (of 2):

    If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):

        sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
        page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'

    Fix this by moving the declaration of node_reclaim_distance from an
    SMP-only to a generic file.
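
    Concretely (a sketch of the move, not the verbatim diff), the
    definition

        int node_reclaim_distance __read_mostly = RECLAIM_DISTANCE;

    moves out of the SMP-only kernel/sched/topology.c into a file that is
    always built when CONFIG_NUMA=y (e.g. mm/page_alloc.c), so the
    reference in get_page_from_freelist() links even with CONFIG_SMP=n.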

    Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
    Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
    Fixes: a55c7454a8 ("sched/topology: Improve load balancing on AMD EPYC systems")
    Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yoshinori Sato <ysato@users.osdn.me>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Gon Solo <gonsolo@gmail.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:51 -05:00
Phil Auld 9c3545b7ce sched/fair: Wait before decaying max_newidle_lb_cost
Bugzilla: http://bugzilla.redhat.com/2020279

commit e60b56e46b384cee1ad34e6adc164d883049c6c3
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Tue Oct 19 14:35:35 2021 +0200

    sched/fair: Wait before decaying max_newidle_lb_cost

    Decay max_newidle_lb_cost only when it has not been updated for a
    while, ensuring that a recently changed value is not decayed.
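
    A sketch of the intended behaviour (close to the helper this patch
    introduces, though details here are illustrative):

        /* Raise the cost immediately; decay only after ~1s without updates. */
        if (cost > sd->max_newidle_lb_cost) {
                sd->max_newidle_lb_cost = cost;
                sd->last_decay_max_lb_cost = jiffies;
        } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
                /* decay by roughly 1% */
                sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
                sd->last_decay_max_lb_cost = jiffies;
        }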

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:51 -05:00
Phil Auld 55e6a8ed1f sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
Bugzilla: http://bugzilla.redhat.com/2020279

commit da6ff09943491819e077b94c284bf0a6b751c9b8
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Oct 6 13:18:49 2021 +0200

    sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ

    The push-IPI logic for RT tasks expects to be invoked from hardirq
    context. One reason is that an RT task on the remote CPU would block
    softirq processing on PREEMPT_RT and thereby prevent the RT tasks from
    being pulled / balanced as intended.

    Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.
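
    Roughly, the irqwork initialization becomes (a sketch):

        /* run rto_push_irq_work_func() from hard interrupt context,
         * even on PREEMPT_RT
         */
        rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);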

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:50 -05:00
Phil Auld 4864422a46 sched: Add cluster scheduler level in core and related Kconfig for ARM64
Bugzilla: http://bugzilla.redhat.com/2020279

commit 778c558f49a2cb3dc7b18a80ff515e82aa813627
Author: Barry Song <song.bao.hua@hisilicon.com>
Date:   Fri Sep 24 20:51:03 2021 +1200

    sched: Add cluster scheduler level in core and related Kconfig for ARM64

    This patch adds a scheduler level for clusters and automatically
    enables load balancing among clusters. It directly benefits many
    workloads that are hungry for resources such as memory bandwidth
    and caches.
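
    In the core scheduler this roughly amounts to a new entry in the
    default topology table in kernel/sched/topology.c (a sketch of the
    hunk):

        #ifdef CONFIG_SCHED_CLUSTER
                { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
        #endif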

    Testing has been done widely on two different hardware configurations
    of Kunpeng920:

     24 cores in one NUMA node (6 clusters per NUMA node);
     32 cores in one NUMA node (8 clusters per NUMA node)

    The workload runs on either one NUMA node or four NUMA nodes, so this
    can estimate the effect of cluster spreading w/ and w/o NUMA load
    balance.

    * Stream benchmark:

    4threads stream (on 1NUMA * 24cores = 24cores)
                    stream                 stream
                    w/o patch              w/ patch
    MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
    MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
    MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
    MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

    6threads stream (on 1NUMA * 24cores = 24cores)
                    stream                 stream
                    w/o patch              w/ patch
    MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
    MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
    MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
    MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

    12threads stream (on 1NUMA * 24cores = 24cores)
                    stream                 stream
                    w/o patch              w/ patch
    MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
    MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
    MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
    MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)

    Thus, it could help memory-bound workloads, especially under medium load.
    Similar improvement is also seen in lkp-pbzip2:

    * lkp-pbzip2 benchmark

    2-96 threads (on 4NUMA * 24cores = 96cores)
                      lkp-pbzip2              lkp-pbzip2
                      w/o patch               w/ patch
    Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
    Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
    Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
    Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
    Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
    Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
    Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
    Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
    Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*

    2-24 threads (on 1NUMA * 24cores = 24cores)
                     lkp-pbzip2               lkp-pbzip2
                     w/o patch                w/ patch
    Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
    Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
    Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
    Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
    Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
    Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
    Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*

    In the case of 6 threads and 8 threads, we see the greatest performance
    improvement.

    A similar improvement can be seen on lkp-pixz, though it is smaller:

    * lkp-pixz benchmark

    2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                      lkp-pixz               lkp-pixz
                      w/o patch              w/ patch
    Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
    Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
    Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
    Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
    Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
    Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
    Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*

    * SPECrate benchmark

    4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
                    Base                    Base
                    Run Time                Rate
                    -------                 ---------
    4 Copies        w/o 580 (w/ 570)        w/o 11.1 (w/ 11.3)
    8 Copies        w/o 647 (w/ 605)        w/o 20.0 (w/ 21.4, +7%)
    16 Copies       w/o 844 (w/ 844)        w/o 30.6 (w/ 30.6)

    32 Copies(on 4NUMA * 32 cores = 128cores)
    [w/o patch]
                     Base     Base        Base
    Benchmarks       Copies  Run Time     Rate
    --------------- -------  ---------  ---------
    500.perlbench_r      32        584       87.2  *
    502.gcc_r            32        503       90.2  *
    505.mcf_r            32        745       69.4  *
    520.omnetpp_r        32       1031       40.7  *
    523.xalancbmk_r      32        597       56.6  *
    525.x264_r            1         --            CE
    531.deepsjeng_r      32        336      109    *
    541.leela_r          32        556       95.4  *
    548.exchange2_r      32        513      163    *
    557.xz_r             32        530       65.2  *
     Est. SPECrate2017_int_base              80.3

    [w/ patch]
                      Base     Base        Base
    Benchmarks       Copies  Run Time     Rate
    --------------- -------  ---------  ---------
    500.perlbench_r      32        580      87.8 (+0.688%)  *
    502.gcc_r            32        477      95.1 (+5.432%)  *
    505.mcf_r            32        644      80.3 (+13.574%) *
    520.omnetpp_r        32        942      44.6 (+9.58%)   *
    523.xalancbmk_r      32        560      60.4 (+6.714%)  *
    525.x264_r            1         --           CE
    531.deepsjeng_r      32        337      109  (+0.000%) *
    541.leela_r          32        554      95.6 (+0.210%) *
    548.exchange2_r      32        515      163  (+0.000%) *
    557.xz_r             32        524      66.0 (+1.227%) *
     Est. SPECrate2017_int_base              83.7 (+4.062%)

    On the other hand, it is slightly helpful to CPU-bound tasks like
    kernbench:

    * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                         kernbench              kernbench
                         w/o cluster            w/ cluster
    Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
    Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
    Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
    Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
    Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
    Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
    Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
    Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
    Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
    Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
    Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
    Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
    Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
    Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
    Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
    Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
    Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
    Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)

    Note this patch isn't a universal win; it might hurt workloads that
    can benefit from packing. While the kernel is not aware of clusters,
    tasks that want to take advantage of the lower communication latency
    within one cluster won't necessarily be packed into one cluster, but
    they have some chance of being packed there at random. This patch
    will make them more likely to be spread out.

    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Tested-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:48 -05:00
Phil Auld 0ac59a41d7 sched/topology: Remove unused numa_distance in cpu_attach_domain()
Bugzilla: http://bugzilla.redhat.com/2020279

commit f9ec6fea201429b5a3f76319e943989f1a1e25ef
Author: Yicong Yang <yangyicong@hisilicon.com>
Date:   Wed Sep 15 14:31:58 2021 +0800

    sched/topology: Remove unused numa_distance in cpu_attach_domain()

    numa_distance in cpu_attach_domain() was introduced in
    commit b5b217346d ("sched/topology: Warn when NUMA diameter > 2")
    to warn the user when the NUMA diameter is > 2, since the scheduler
    topology structures were misrepresented in that case. This was fixed
    by Barry in commit 585b6d2723 ("sched/topology: fix the issue
    groups don't span domain->span for NUMA diameter > 2"), leaving
    numa_distance unused. So remove it.

    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lore.kernel.org/r/20210915063158.80639-1-yangyicong@hisilicon.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:47 -05:00
Phil Auld d3a1930bef sched/topology: Introduce sched_group::flags
Bugzilla: http://bugzilla.redhat.com/2020279

commit 16d364ba6ef2aa59b409df70682770f3ed23f7c0
Author: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Date:   Fri Sep 10 18:18:15 2021 -0700

    sched/topology: Introduce sched_group::flags

    There exist situations in which the load balancer needs to know the
    properties of the CPUs in a scheduling group. When using asymmetric
    packing, for instance, it needs to know not only the state of dst_cpu
    but also of its SMT siblings, if any.

    Use the flags of the child scheduling domains to initialize scheduling
    group flags. This will reflect the properties of the CPUs in the
    group.
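
    A simplified sketch of the inheritance (treat as illustrative, not
    the exact hunk):

        struct sched_group {
                /* ... existing fields ... */
                unsigned int flags;     /* properties of the CPUs in the group */
        };

        /* while building a group, assuming a non-NULL child domain: */
        sg->flags = child->flags;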

    A subsequent changeset will make use of these new flags. No functional
    changes are introduced.

    Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Reviewed-by: Len Brown <len.brown@intel.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:47 -05:00
Phil Auld 1f65ff196a sched/topology: Skip updating masks for non-online nodes
Bugzilla: http://bugzilla.redhat.com/1992256

commit 0083242c93759dde353a963a90cb351c5c283379
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Wed Aug 18 13:13:33 2021 +0530

    sched/topology: Skip updating masks for non-online nodes

    The scheduler currently expects NUMA node distances to be stable from
    init onwards, and as a consequence builds the related data structures
    once-and-for-all at init (see sched_init_numa()).

    Unfortunately, on some architectures node distance is unreliable for
    offline nodes and may very well change upon onlining.

    Skip over offline nodes during sched_init_numa(). Track nodes that have
    been onlined at least once, and trigger a build of a node's NUMA masks
    when it is first onlined post-init.
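
    A rough sketch of the tracking (identifier names are close to the
    patch but should be treated as illustrative):

        static nodemask_t sched_numa_onlined_nodes;

        void sched_domains_numa_masks_set(unsigned int cpu)
        {
                int node = cpu_to_node(cpu);

                /* Build this node's masks the first time it comes online. */
                if (!node_isset(node, sched_numa_onlined_nodes)) {
                        node_set(node, sched_numa_onlined_nodes);
                        __sched_domains_numa_masks_set(node);
                }
        }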

    Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
    Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-01 18:14:47 -04:00
Beata Michalska c744dc4ab5 sched/topology: Rework CPU capacity asymmetry detection
Currently the CPU capacity asymmetry detection, performed through
asym_cpu_capacity_level, tries to identify the lowest topology level
at which the highest CPU capacity is observed, not necessarily
finding the level at which all possible capacity values are visible
to all CPUs, which can be a bit problematic for some possible/valid
asymmetric topologies, e.g.:

DIE      [                                ]
MC       [                       ][       ]

CPU       [0] [1] [2] [3] [4] [5]  [6] [7]
Capacity  |.....| |.....| |.....|  |.....|
             L       M       B        B

Where:
 arch_scale_cpu_capacity(L) = 512
 arch_scale_cpu_capacity(M) = 871
 arch_scale_cpu_capacity(B) = 1024

In this particular case, the asymmetric topology level will point
at MC, as all possible CPU masks for that level do cover the CPU
with the highest capacity. It will work just fine for the first
cluster, not so much for the second one though (consider
find_energy_efficient_cpu(), which might end up attempting an
energy-aware wake-up for a domain that does not see any asymmetry
at all).

Rework the way the capacity asymmetry levels are detected, so that
they point to the lowest topology level (for a given CPU) where the
full set of available CPU capacities is visible to all CPUs within
the given domain. As a result, the per-CPU sd_asym_cpucapacity might
differ across domains. This will have an impact on EAS wake-up
placement in that it might see a different range of CPUs to be
considered, depending on the given current and target CPUs.

Additionally, levels where any range of asymmetry (not necessarily
the full one) is detected get identified as well. The selected
asymmetric topology level is denoted by the SD_ASYM_CPUCAPACITY_FULL
sched domain flag, whereas the 'sub-levels' receive the already used
SD_ASYM_CPUCAPACITY flag. This allows maintaining the current
behaviour for asymmetric topologies, with misfit migration operating
correctly on lower levels, if applicable, as any asymmetry is enough
to trigger misfit migration. That logic relies on the
SD_ASYM_CPUCAPACITY flag and does not relate to the full asymmetry
level denoted by the sd_asym_cpucapacity pointer.
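
A simplified sketch of the classification (illustrative: the actual
helper also has to distinguish CPUs outside the domain span from CPUs
not available in the system at all):

    int flags = 0, seen = 0, missed = 0;
    struct asym_cap_data *entry;

    /* Compare capacity values visible in this domain's span against
     * the full set of capacity values present in the system.
     */
    list_for_each_entry(entry, &asym_cap_list, link) {
        if (cpumask_intersects(sd_span, cpu_capacity_span(entry)))
            seen++;
        else
            missed++;
    }

    if (seen > 1)
        flags = missed ? SD_ASYM_CPUCAPACITY :
                         (SD_ASYM_CPUCAPACITY | SD_ASYM_CPUCAPACITY_FULL);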

Detecting the CPU capacity asymmetry is based on the set of
available CPU capacities for all possible CPUs. This data is
generated upon init and updated once CPU topology changes are
detected (through arch_update_cpu_topology). As such, any changes
to the identified CPU capacities (like initializing cpufreq) need to
be explicitly advertised by the corresponding architectures to
trigger rebuilding the data.

The additional 'dflags' parameter, used when building sched domains,
has been removed as well, as the asymmetry flags are now set directly
in sd_init().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
2021-06-24 09:07:51 +02:00