Commit Graph

654 Commits

Author SHA1 Message Date
Phil Auld 04db3263fa sched: Don't define sched_clock_irqtime as static key
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit b9f2b29b94943b08157e3dfc970baabc7944dbc3
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Feb 5 11:24:38 2025 +0800

    sched: Don't define sched_clock_irqtime as static key

    The sched_clock_irqtime was defined as a static key in commit 8722903cbb8f
    ('sched: Define sched_clock_irqtime as static key'). However, this change
    introduces a 'sleeping in atomic context' warning, as shown below:

            arch/x86/kernel/tsc.c:1214 mark_tsc_unstable()
            warn: sleeping in atomic context

    As analyzed by Dan, the affected code path is as follows:

    vcpu_load() <- disables preempt
    -> kvm_arch_vcpu_load()
       -> mark_tsc_unstable() <- sleeps

    virt/kvm/kvm_main.c
       166  void vcpu_load(struct kvm_vcpu *vcpu)
       167  {
       168          int cpu = get_cpu();
                              ^^^^^^^^^^
    This get_cpu() disables preemption.

       169
       170          __this_cpu_write(kvm_running_vcpu, vcpu);
       171          preempt_notifier_register(&vcpu->preempt_notifier);
       172          kvm_arch_vcpu_load(vcpu, cpu);
       173          put_cpu();
       174  }

    arch/x86/kvm/x86.c
      4979          if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
      4980                  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
      4981                                  rdtsc() - vcpu->arch.last_host_tsc;
      4982                  if (tsc_delta < 0)
      4983                          mark_tsc_unstable("KVM discovered backwards TSC");

    arch/x86/kernel/tsc.c
        1206 void mark_tsc_unstable(char *reason)
        1207 {
        1208         if (tsc_unstable)
        1209                 return;
        1210
        1211         tsc_unstable = 1;
        1212         if (using_native_sched_clock())
        1213                 clear_sched_clock_stable();
    --> 1214         disable_sched_clock_irqtime();
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    kernel/jump_label.c
       245  void static_key_disable(struct static_key *key)
       246  {
       247          cpus_read_lock();
                    ^^^^^^^^^^^^^^^^
    This lock has a might_sleep() in it which triggers the static checker
    warning.

       248          static_key_disable_cpuslocked(key);
       249          cpus_read_unlock();
       250  }

    Let's revert this change for now as {disable,enable}_sched_clock_irqtime
    are used in many places, as pointed out by Sean, including the following:

    The code path in clocksource_watchdog():

      clocksource_watchdog()
      |
      -> spin_lock(&watchdog_lock);
         |
         -> __clocksource_unstable()
            |
            -> clocksource.mark_unstable() == tsc_cs_mark_unstable()
               |
               -> disable_sched_clock_irqtime()

    And the code path in sched_clock_register():

            /* Cannot register a sched_clock with interrupts on */
            local_irq_save(flags);

            ...

            /* Enable IRQ time accounting if we have a fast enough sched_clock() */
            if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
                    enable_sched_clock_irqtime();

            local_irq_restore(flags);

    [lkp@intel.com: reported a build error in the prev version]

    Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/
    Fixes: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key")
    Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
    Debugged-by: Dan Carpenter <dan.carpenter@linaro.org>
    Debugged-by: Sean Christopherson <seanjc@google.com>
    Debugged-by: Michal Koutný <mkoutny@suse.com>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:12 +00:00
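
For context, the revert above goes back to a plain flag because flipping a static key via static_key_disable()/static_key_enable() takes cpus_read_lock() and may sleep, whereas a plain store is safe from the atomic paths quoted above. A minimal sketch of that pattern (illustrative only, not the exact kernel hunk; the _example names are added here):

    static int sched_clock_irqtime_example;

    /* Safe to call with preemption or IRQs disabled: just a plain store. */
    static inline void enable_sched_clock_irqtime_example(void)
    {
            sched_clock_irqtime_example = 1;
    }

    static inline void disable_sched_clock_irqtime_example(void)
    {
            sched_clock_irqtime_example = 0;
    }

    /* Readers check the flag directly in the IRQ-time accounting path. */
    static inline bool irqtime_enabled_example(void)
    {
            return sched_clock_irqtime_example;
    }
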
Phil Auld 8657a68d8f sched: Define sched_clock_irqtime as static key
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 8722903cbb8f0d51057fbf9ef1c680756b74119e
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Jan 3 10:24:06 2025 +0800

    sched: Define sched_clock_irqtime as static key

    Since CPU time accounting is a performance-critical path, let's define
    sched_clock_irqtime as a static key to minimize potential overhead.

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:11 +00:00
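
The static-key form introduced by this commit (and undone by the revert above) makes the read side nearly free in the hot path, at the cost of the sleeping enable/disable calls. A rough sketch using the generic static-branch API, not the exact scheduler hunk:

    #include <linux/jump_label.h>

    static DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime_key_example);

    static void enable_irqtime_example(void)
    {
            static_branch_enable(&sched_clock_irqtime_key_example);  /* may sleep */
    }

    /* Hot path: compiles to a patched nop/jump instead of a memory load. */
    static inline bool irqtime_enabled_example(void)
    {
            return static_branch_likely(&sched_clock_irqtime_key_example);
    }
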
Phil Auld 423e539eb2 sched: add READ_ONCE to task_on_rq_queued
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 59297e2093ceced86393a059a4bd36802311f7bb
Author: Harshit Agarwal <harshit@nutanix.com>
Date:   Thu Nov 14 14:08:11 2024 -0700

    sched: add READ_ONCE to task_on_rq_queued

    task_on_rq_queued() reads p->on_rq without READ_ONCE(), though p->on_rq is
    set with WRITE_ONCE in {activate|deactivate}_task and smp_store_release
    in __block_task, and also read with READ_ONCE in task_on_rq_migrating.

    Make all of these accesses pair together by adding READ_ONCE() in
    task_on_rq_queued().

    Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lkml.kernel.org/r/20241114210812.1836587-1-jon@nutanix.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:09 +00:00
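
The change itself is small; roughly the following (TASK_ON_RQ_QUEUED and p->on_rq exist in the kernel, and mark_task_queued_example() is a made-up stand-in for the real writer in activate_task()):

    /* Reader side after the change: pair the plain read with READ_ONCE(). */
    static inline int task_on_rq_queued(struct task_struct *p)
    {
            return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
    }

    /* Writer side it pairs with, as done in activate_task(): */
    static inline void mark_task_queued_example(struct task_struct *p)
    {
            WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
    }
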
Phil Auld 91707bbfc4 sched: Consolidate pick_*_task to task_is_pushable helper
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 18adad1dac3334ed34f60ad4de2960df03058142
Author: Connor O'Brien <connoro@google.com>
Date:   Wed Oct 9 16:53:38 2024 -0700

    sched: Consolidate pick_*_task to task_is_pushable helper

    This patch consolidates the rt and deadline pick_*_task() functions into
    a single task_is_pushable() helper.

    This patch was broken out from a larger chain migration
    patch originally by Connor O'Brien.

    [jstultz: split out from larger chain migration patch,
     renamed helper function]

    Signed-off-by: Connor O'Brien <connoro@google.com>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Metin Kaya <metin.kaya@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Christian Loehle <christian.loehle@arm.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Tested-by: Metin Kaya <metin.kaya@arm.com>
    Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:08 +00:00
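
The consolidated helper boils down to the common check both pick_rt_task() and pick_dl_task() used to open-code: the task is not currently running, is allowed on the candidate CPU, and can run on more than one CPU. A sketch based on the commit description, not necessarily the exact upstream body:

    static inline bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
    {
            if (!task_on_cpu(rq, p) &&
                cpumask_test_cpu(cpu, &p->cpus_mask) &&
                p->nr_cpus_allowed > 1)
                    return true;

            return false;
    }
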
Phil Auld 8851a9b9ae sched: Add move_queued_task_locked helper
JIRA: https://issues.redhat.com/browse/RHEL-78821
Conflicts: Context diffs in sched.h due to not having eevdf code.

commit 2b05a0b4c08ffd6dedfbd27af8708742cde39b95
Author: Connor O'Brien <connoro@google.com>
Date:   Wed Oct 9 16:53:37 2024 -0700

    sched: Add move_queued_task_locked helper

    Switch logic that deactivates, sets the task cpu,
    and reactivates a task on a different rq to use a
    helper that will be later extended to push entire
    blocked task chains.

    This patch was broken out from a larger chain migration
    patch originally by Connor O'Brien.

    [jstultz: split out from larger chain migration patch]
    Signed-off-by: Connor O'Brien <connoro@google.com>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Metin Kaya <metin.kaya@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Tested-by: Metin Kaya <metin.kaya@arm.com>
    Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:08 +00:00
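
The helper simply wraps the existing deactivate / set CPU / activate sequence with both runqueue locks held; approximately (a sketch of the pattern described above, not the verbatim upstream hunk):

    static inline void
    move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_struct *task)
    {
            lockdep_assert_rq_held(src_rq);
            lockdep_assert_rq_held(dst_rq);

            deactivate_task(src_rq, task, 0);
            set_task_cpu(task, cpu_of(dst_rq));
            activate_task(dst_rq, task, 0);
    }
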
Phil Auld b550c6bebd sched/headers: Move struct pre-declarations to the beginning of the header
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 3cd7271987ffd89c2d5eaeea85d3e9a16aec6894
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed Jun 5 13:44:28 2024 +0200

    sched/headers: Move struct pre-declarations to the beginning of the header

    There's a random number of structure pre-declaration lines in
    kernel/sched/sched.h, some of which are unnecessary duplicates.

    Move them to the head & order them a bit for readability.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:03 -04:00
Phil Auld 3917c1b34b sched/core: Clean up kernel/sched/sched.h a bit
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Left out hunks in mm_cid code which we don't have
in RHEL9.

commit 127f6bf1618868920c1f77e0a427d1f4570e450b
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed Jun 5 13:39:31 2024 +0200

    sched/core: Clean up kernel/sched/sched.h a bit

     - Fix whitespace noise
     - Fix col80 linebreak damage where possible
     - Apply CodingStyle consistently
     - Use consistent #else and #endif comments
     - Use consistent vertical alignment
     - Use 'extern' consistently

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:03 -04:00
Phil Auld e8bf69e6e0 sched: Fix spelling in comments
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Dropped hunks in mm_cid code which we don't have. Minor
context diffs due to still having IA64 in tree and previous Kabi
workarounds.

commit 402de7fc880fef055bc984957454b532987e9ad0
Author: Ingo Molnar <mingo@kernel.org>
Date:   Mon May 27 16:54:52 2024 +0200

    sched: Fix spelling in comments

    Do a spell-checking pass.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 10626dfce1 sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Worked around RHEL-only commits 9b35f92491 ("sched/core: Make
sched_setaffinity() always return -EINVAL on empty cpumask"),90f7bb0c1823 ("sched/core:
Don't return -ENODEV from sched_setaffinity()") and 05fddaaaac ("sched/core: Use empty
mask to reset cpumasks in sched_setaffinity()") by removing the changes and re-applying
them to the new syscalls.c file. Reverting and re-applying was not possible since there
have been other changes on top of these as well.

commit 04746ed80bcf3130951ed4d5c1bc5b0bcabdde22
Author: Ingo Molnar <mingo@kernel.org>
Date:   Sun Apr 7 10:43:15 2024 +0200

    sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c

    core.c has become rather large, move most scheduler syscall
    related functionality into a separate file, syscalls.c.

    This is about ~15% of core.c's raw linecount.

    Move the alloc_user_cpus_ptr(), __rt_effective_prio(),
    rt_effective_prio(), uclamp_none(), uclamp_se_set()
    and uclamp_bucket_id() inlines to kernel/sched/sched.h.

    Internally export the __sched_setscheduler(), __sched_setaffinity(),
    __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(),
    check_class_changed(), splice_balance_callbacks() and balance_callbacks()
    methods to better facilitate this.

    Move the new file's build to sched_policy.c, because it fits there
    semantically, but also because it's the smallest of the 4 build units
    under an allmodconfig build:

      -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
      -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
      -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
      -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i

    This better balances build time for scheduler subsystem rebuilds.

    I build-tested this new file as a standalone syscalls.o file for a bit,
    to make sure all the encapsulations & abstractions are robust.

    Also update/add my copyright notices to these files.

    Build time measurements:

     # -Before/+After:

     kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null

     Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):

     -    71,938,508,607      cycles                                                                  ( +-  0.17% )
     +    71,992,916,493      cycles                                                                  ( +-  0.22% )
     -   106,214,780,964      instructions                     #    1.48  insn per cycle              ( +-  0.01% )
     +   105,450,231,154      instructions                     #    1.46  insn per cycle              ( +-  0.01% )
     -     5,878,232,620 ns   duration_time                                                           ( +-  0.38% )
     +     5,290,085,069 ns   duration_time                                                           ( +-  0.21% )

     -            5.8782 +- 0.0221 seconds time elapsed  ( +-  0.38% )
     +            5.2901 +- 0.0111 seconds time elapsed  ( +-  0.21% )

    Build time improvement of -11.1% (duration_time) is expected: the
    parallel build time of the scheduler subsystem is determined by the
    largest, slowest to build object file, which is kernel/sched/core.o.
    By moving ~15% of its complexity into another build unit, we reduced
    build time by -11%.

    Measured cycles spent on building is within its ~0.2% stddev noise envelope.

    The -0.7% reduction in instructions spent on building the scheduler is
    statistically reliable and somewhat surprising - I can only speculate:
    maybe compilers aren't that efficient at building & optimizing 10+ KLOC files
    (core.c), and it's an overall win to balance the linecount a bit.

    Anyway, this might be a data point that suggests that reducing the linecount
    of our largest files will improve not just code readability and maintainability,
    but might also improve build times a bit.

    Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:

      # -Before/+After:

      kepler:~/tip> size vmlinux
         text          data     bss     dec     hex filename
      -26475475     10439178        1740804 38655457        24dd5e1 vmlinux
      +26476003     10439178        1740804 38655985        24dd7f1 vmlinux

      kepler:~/tip> size kernel/sched/built-in.a
         text          data     bss     dec     hex filename
      - 76056         30025     489  106570   1a04a kernel/sched/core.o (ex kernel/sched/built-in.a)
      + 63452         29453     489   93394   16cd2 kernel/sched/core.o (ex kernel/sched/built-in.a)
        44299          2181     104   46584    b5f8 kernel/sched/fair.o (ex kernel/sched/built-in.a)
      - 42764          3424     120   46308    b4e4 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
      + 55651          4044     120   59815    e9a7 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
        44866         12655    2192   59713    e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
        44866         12655    2192   59713    e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)

    This is primarily due to the extra functions exported, and the size
    gets exaggerated somewhat by __pfx CFI function padding:

            ffffffff810cc710 <__pfx_enqueue_task>:
            ffffffff810cc710:       90                      nop
            ffffffff810cc711:       90                      nop
            ffffffff810cc712:       90                      nop
            ffffffff810cc713:       90                      nop
            ffffffff810cc714:       90                      nop
            ffffffff810cc715:       90                      nop
            ffffffff810cc716:       90                      nop
            ffffffff810cc717:       90                      nop
            ffffffff810cc718:       90                      nop
            ffffffff810cc719:       90                      nop
            ffffffff810cc71a:       90                      nop
            ffffffff810cc71b:       90                      nop
            ffffffff810cc71c:       90                      nop
            ffffffff810cc71d:       90                      nop
            ffffffff810cc71e:       90                      nop
            ffffffff810cc71f:       90                      nop

    AFAICS the cost is primarily not to core.o and fair.o though (which contain
    most performance sensitive scheduler functions), only to syscalls.o
    that get called with much lower frequency - so I think this is an acceptable
    trade-off for better code separation.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 14a470e760 sched/pelt: Remove shift of thermal clock
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 97450eb909658573dcacc1063b06d3d08642c0c1
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Tue Mar 26 10:16:16 2024 +0100

    sched/pelt: Remove shift of thermal clock

    The optional shift of the clock used by thermal/hw load avg has been
    introduced to handle the case where the signal was not always a high frequency
    hw signal. Now that cpufreq provides a signal for firmware and
    SW pressure, we can remove this exception and always keep this PELT signal
    aligned with other signals.
    Mark the sched_thermal_decay_shift boot parameter as deprecated.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 51c743b331 sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Minor differences since we already have ddae0ca2a8f
("sched: Move psi_account_irqtime() out of update_rq_clock_task()
hotpath") which changes some nearby code.

commit d4dbc991714eefcbd8d54a3204bd77a0a52bd32d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Tue Mar 26 10:16:15 2024 +0100

    sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()

    Now that cpufreq provides a pressure value to the scheduler, rename
    arch_update_thermal_pressure into HW pressure to reflect that it returns
    a pressure applied by HW (i.e. with a high frequency change) and not
    always related to thermal mitigation but also generated by max current
    limitation as an example. Such a high-frequency signal needs filtering to be
    smoothed and to provide a value that reflects the average available capacity
    on the scheduler's time scale.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 3aaab109b4 sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 4475cd8bfd9bcb898953fcadb2f51b3432eb68a1
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Mar 28 12:07:48 2024 +0100

    sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags

    SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an
    unnecessary complication that only makes the code harder to read and slower.

    We only ever set them separately:

     thule:~/tip> git grep SG_OVER kernel/sched/
     kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
     kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
     kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
     kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
     kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
     kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
     kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
     kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
     kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
     kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
     kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);

    And use them separately, which results in suboptimal code:

                    /* update overload indicator if we are at root domain */
                    set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);

                    /* Update over-utilization (tipping point, U >= 0) indicator */
                    set_rd_overutilized_status(env->dst_rq->rd,

    Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
    and its lower level functions, and change all of them to 'bool'.

    Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Cc: Qais Yousef <qyousef@layalina.io>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:50 -04:00
Phil Auld 72be6d9b90 sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 7bda10ba7f453729f210264dd07d38989fb858d9
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Mar 28 11:44:16 2024 +0100

    sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED

    Follow the rename of the root_domain::overloaded flag.

    Note that this also matches the SG_OVERUTILIZED flag better.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Qais Yousef <qyousef@layalina.io>
    Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:50 -04:00
Phil Auld 11e709c903 sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 76cc4f91488af0a808bec97794bfe434dece7d67
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Mar 28 11:41:31 2024 +0100

    sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()

    Follow the rename of the root_domain::overloaded flag.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Qais Yousef <qyousef@layalina.io>
    Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:50 -04:00
Phil Auld 96b4819653 sched/fair: Rename root_domain::overload to ::overloaded
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit dfb83ef7b8b064c15be19cf7fcbde0996712de8f
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Mar 28 11:33:20 2024 +0100

    sched/fair: Rename root_domain::overload to ::overloaded

    It is silly to use an ambiguous noun instead of a clear adjective when naming
    such a flag ...

    Note how root_domain::overutilized already used a proper adjective.

    rd->overloaded is now set to 1 when the root domain is overloaded.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Qais Yousef <qyousef@layalina.io>
    Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:50 -04:00
Phil Auld b4c29ee118 sched/fair: Use helper functions to access root_domain::overload
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit caac6291728ed5493d8a53f4b086c270849ce0c4
Author: Shrikanth Hegde <sshegde@linux.ibm.com>
Date:   Mon Mar 25 11:15:05 2024 +0530

    sched/fair: Use helper functions to access root_domain::overload

    Introduce two helper functions to access & set the root_domain::overload flag:

      get_rd_overload()
      set_rd_overload()

    To make sure code is always following READ_ONCE()/WRITE_ONCE() access methods.

    No change in functionality intended.

    [ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ]

    Suggested-by: Qais Yousef <qyousef@layalina.io>
    Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:50 -04:00
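
Per the changelog, the accessors are thin wrappers that keep the READ_ONCE()/WRITE_ONCE() discipline in one place; roughly (a sketch, the upstream bodies may differ slightly):

    static inline bool get_rd_overload(struct root_domain *rd)
    {
            return READ_ONCE(rd->overload);
    }

    static inline void set_rd_overload(struct root_domain *rd, bool status)
    {
            /* Avoid dirtying the cacheline when nothing changes. */
            if (get_rd_overload(rd) != status)
                    WRITE_ONCE(rd->overload, status);
    }
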
Phil Auld 096e01219f sched/topology: Remove root_domain::max_cpu_capacity
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit fa427e8e53d8db15090af7e952a55870dc2a453f
Author: Qais Yousef <qyousef@layalina.io>
Date:   Sun Mar 24 00:45:51 2024 +0000

    sched/topology: Remove root_domain::max_cpu_capacity

    The value is no longer used as we now keep track of max_allowed_capacity
    for each task instead.

    Signed-off-by: Qais Yousef <qyousef@layalina.io>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:49 -04:00
Phil Auld b921e17c63 sched/topology: Export asym_cap_list
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 77222b0d12e8ae6f082261842174cc2e981bf99c
Author: Qais Yousef <qyousef@layalina.io>
Date:   Sun Mar 24 00:45:49 2024 +0000

    sched/topology: Export asym_cap_list

    So that we can use it to iterate through available capacities in the
    system. Sort asym_cap_list in descending order as expected users are
    likely to be interested in the highest capacity first.

    Make the list RCU protected to allow for cheap access in hot paths.

    Signed-off-by: Qais Yousef <qyousef@layalina.io>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:49 -04:00
Phil Auld 32938a738c sched/balancing: Rename trigger_load_balance() => sched_balance_trigger()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts:  Dropped CN documentation since not in RHEL. Minor fuzz in
sched-domains.rst.

commit 983be0628c061989b6cc175d2f5e429b40699fbb
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Mar 8 12:18:09 2024 +0100

    sched/balancing: Rename trigger_load_balance() => sched_balance_trigger()

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-4-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:48 -04:00
Phil Auld df588c9291 sched/balancing: Rename rebalance_domains() => sched_balance_domains()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts:  Dropped CN documentation since not in RHEL.

commit 14ff4dbd34f46cc6b6105f549983321241ccbba9
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Mar 8 12:18:10 2024 +0100

    sched/balancing: Rename rebalance_domains() => sched_balance_domains()

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-5-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:48 -04:00
Phil Auld acd3db6848 sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit a6965b31888501f889261a6783f0de6afff84f8d
Author: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Date:   Mon Jan 1 21:16:24 2024 +0530

    sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq

    Use existing helper function cpu_util_irq() instead of open-coding
    access to ->avg_irq.

    During review it was noted that ->avg_irq could be updated by a
    different CPU than the one which is trying to access it.

    ->avg_irq is updated with WRITE_ONCE(), use READ_ONCE to access it
    in order to avoid any compiler optimizations.

    Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240101154624.100981-3-sshegde@linux.vnet.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:46 -04:00
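
After this change the accessor is essentially just the following (sketch; the avg_irq field lives in struct rq when IRQ-time accounting is configured):

    static inline unsigned long cpu_util_irq(struct rq *rq)
    {
            /* Pairs with the WRITE_ONCE() done by the PELT update on another CPU. */
            return READ_ONCE(rq->avg_irq.util_avg);
    }
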
Phil Auld d414c1e069 sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
JIRA: https://issues.redhat.com/browse/RHEL-48226
Conflicts: Minor context differences in sched/core.c due to
not having scheduler_tick() renamed sched_tick and d4dbc991714e
("sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()").

commit ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8
Author: John Stultz <jstultz@google.com>
Date:   Tue Jun 18 14:58:55 2024 -0700

    sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath

    It was reported that in moving to 6.1, a larger than 10%
    regression was seen in the performance of
    clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).

    Using a simple reproducer, I found:
    5.10:
    100000000 calls in 24345994193 ns => 243.460 ns per call
    100000000 calls in 24288172050 ns => 242.882 ns per call
    100000000 calls in 24289135225 ns => 242.891 ns per call

    6.1:
    100000000 calls in 28248646742 ns => 282.486 ns per call
    100000000 calls in 28227055067 ns => 282.271 ns per call
    100000000 calls in 28177471287 ns => 281.775 ns per call

    The cause of this was finally narrowed down to the addition of
    psi_account_irqtime() in update_rq_clock_task(), in commit
    52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
    pressure").

    In my initial attempt to resolve this, I leaned towards moving
    all accounting work out of the clock_gettime() call path, but it
    wasn't very pretty, so it will have to wait for a later deeper
    rework. Instead, Peter shared this approach:

    Rework psi_account_irqtime() to use its own psi_irq_time base
    for accounting, and move it out of the hotpath, calling it
    instead from sched_tick() and __schedule().

    In testing this, we found the importance of ensuring
    psi_account_irqtime() is run under the rq_lock, which Johannes
    Weiner helpfully explained, so also add some lockdep annotations
    to make that requirement clear.

    With this change the performance is back in-line with 5.10:
    6.1+fix:
    100000000 calls in 24297324597 ns => 242.973 ns per call
    100000000 calls in 24318869234 ns => 243.189 ns per call
    100000000 calls in 24291564588 ns => 242.916 ns per call

    Reported-by: Jimmy Shiu <jimmyshiu@google.com>
    Originally-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-07-15 11:13:20 -04:00
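
The reworked accounting keeps a per-CPU baseline and only charges the IRQ-time delta accrued since the last call, from sched_tick()/__schedule() with the rq lock held. A simplified sketch of that idea (the psi_irq_time baseline is placed on the rq here for illustration, following the changelog; the real hunk also feeds the delta into the task's PSI groups):

    static void psi_account_irqtime_example(struct rq *rq)
    {
            u64 irq, delta;

            lockdep_assert_rq_held(rq);          /* requirement called out above */

            irq = irq_time_read(cpu_of(rq));     /* total IRQ/softirq time on this CPU */
            delta = irq - rq->psi_irq_time;      /* accrued since we last accounted */
            if (!delta)
                    return;

            rq->psi_irq_time = irq;
            /* ... record 'delta' as IRQ pressure for this CPU's PSI groups ... */
    }
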
Lucas Zampieri f67ab7550c Merge: Scheduler: rhel9.5 updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3975

JIRA: https://issues.redhat.com/browse/RHEL-25535 

JIRA: https://issues.redhat.com/browse/RHEL-20158  

JIRA: https://issues.redhat.com/browse/RHEL-15622

Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935

Tested: Scheduler stress tests. Perf Qe will do a  
performance regression test.  
  
A collection of fixes and updates that brings the  
core scheduler code up to v6.8. EEVDF related commits  
are skipped since we are not planning to take the new  
task scheduler in rhel9.
  
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-08 20:13:47 +00:00
Lucas Zampieri d23522d08a Merge: Sched: schedutil/cpufreq updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935

JIRA: https://issues.redhat.com/browse/RHEL-29020  
  
Bring schedutil code up to about v6.8. This includes some fixes for  
code in rhel9 from the 5.14 rebase.  There are a few pieces in cpufreq
driver code and the arm architectures needed to make it complete.  
Tested: Ran stress tests with schedutil governor. Ran general scheduler  
stress and performance tests.  
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-26 12:34:20 +00:00
Lucas Zampieri 79eb65d175 Merge: sched: apply class and guard cleanups
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3865

JIRA: https://issues.redhat.com/browse/RHEL-29017  
  
Apply the changes using the macros in include/linux/cleanup.h providing  
scoped guards. There is no real functional change. We rely on the compiler  
to clean up rather than having explicit unwinding with gotos.
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-22 12:41:20 +00:00
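
For reference, the include/linux/cleanup.h guards mentioned above release a lock automatically when the enclosing scope ends, so early returns no longer need goto/unlock pairs. A minimal usage sketch (the lock and function names below are made up for illustration):

    #include <linux/cleanup.h>
    #include <linux/errno.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(example_lock);

    static int example_update(int val)
    {
            guard(mutex)(&example_lock);    /* mutex_unlock() runs on every return path */

            if (val < 0)
                    return -EINVAL;         /* no "goto out_unlock" needed */

            /* ... do the update under the lock ... */
            return 0;
    }
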
Phil Auld 39ff726e7b sched: fair: move unused stub functions to header
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit b1c3efe07987592c16d5f59ce235e6ddbea65a73
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Thu Nov 23 12:05:03 2023 +0100

    sched: fair: move unused stub functions to header

    These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
    and an empty one that is only referenced when FAIR_GROUP_SCHED is disabled
    but CGROUP_SCHED is still enabled.  If both are turned off, the functions
    are still defined but the missing prototype causes a W=1 warning:

    kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
    kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
    kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
    kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'

    Move the alternatives into the header as static inline functions with the
    correct combination of #ifdef checks to avoid the warning without adding
    even more complexity.

    [A different patch with the same description got applied by accident
     and was later reverted, but the original patch is still missing]

    Link: https://lkml.kernel.org/r/20231123110506.707903-4-arnd@kernel.org
    Fixes: 7aa55f2a5902 ("sched/fair: Move unused stub functions to header")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: David Woodhouse <dwmw2@infradead.org>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Masahiro Yamada <masahiroy@kernel.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Cc: Nicolas Schier <nicolas@fjasle.eu>
    Cc: Palmer Dabbelt <palmer@rivosinc.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Richard Henderson <richard.henderson@linaro.org>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: Zhihao Cheng <chengzhihao1@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
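
The fix follows the usual pattern of giving the disabled configuration static inline stubs in the shared header, so no out-of-line definition is left without a prototype. Schematically (not the exact hunk; return values for the stubs are illustrative):

    /* kernel/sched/sched.h */
    #ifdef CONFIG_FAIR_GROUP_SCHED
    extern void free_fair_sched_group(struct task_group *tg);
    extern int  alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
    extern void online_fair_sched_group(struct task_group *tg);
    extern void unregister_fair_sched_group(struct task_group *tg);
    #else
    static inline void free_fair_sched_group(struct task_group *tg) { }
    static inline int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
    {
            return 1;   /* nothing to allocate, report success */
    }
    static inline void online_fair_sched_group(struct task_group *tg) { }
    static inline void unregister_fair_sched_group(struct task_group *tg) { }
    #endif
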
Phil Auld 8e7f4729fa sched/deadline: Introduce deadline servers
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Context diff in include/linux/sched.h mostly due to not
 having fd593511cdfc ("tracing/user_events: Track fork/exec/exit for
 mm lifetime").

commit 63ba8422f876e32ee564ea95da9a7313b13ff0a1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:21 2023 +0100

    sched/deadline: Introduce deadline servers

    Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
    with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

    RT Throttling has been introduced a while ago as a (mostly debug)
    countermeasure one can utilize to reserve some CPU time for low priority
    tasks (usually background type of work, e.g. workqueues, timers, etc.).
    It however has its own problems (see documentation) and the undesired
    effect of unconditionally throttling FIFO tasks even when no lower
    priority activity needs to run (there are mechanisms to fix this issue
    as well, but, again, with their own problems).

    Introduce deadline servers to service low priority tasks needs under
    starvation conditions. Deadline servers are built extending SCHED_DEADLINE
    implementation to allow 2-level scheduling (a sched_deadline entity
    becomes a container for lower priority scheduling entities).

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
Phil Auld 32ae9572f2 sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: One hunk applied by hand in sched.h due to not having
 eevdf commit d07f09a1f99c ("sched/fair: Propagate enqueue flags
 into place_entity()").

commit 2f7a0f58948d8231236e2facecc500f1930fb996
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:20 2023 +0100

    sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity

    In preparation for introducing !task sched_dl_entity, move the
    bandwidth accounting into {en,de}queue_dl_entity().

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
Phil Auld f0cdbfa9cb sched/deadline: Collect sched_dl_entity initialization
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz due to unrelated whitespace difference from
 upstream.

commit 9e07d45c5210f5dd6701c00d55791983db7320fa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:19 2023 +0100

    sched/deadline: Collect sched_dl_entity initialization

    Create a single function that initializes a sched_dl_entity.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
Phil Auld 7fc27e6f01 sched: Unify runtime accounting across classes
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Whitespace context difference in removed code in sched.h.
 Minor context diff in fair.c due to not having the eevdf scheduler
 patches in rhel.

commit 5d69eca542ee17c618f9a55da52191d5e28b435f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:18 2023 +0100

    sched: Unify runtime accounting across classes

    All classes use sched_entity::exec_start to track runtime and have
    copies of the exact same code around to compute runtime.

    Collapse all that.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
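
The piece now shared across classes computes the elapsed runtime from sched_entity::exec_start and advances it; a simplified sketch of that pattern (the _example name is made up here, and the upstream helper additionally updates the exec-runtime statistics):

    static s64 update_curr_se_example(struct rq *rq, struct sched_entity *curr)
    {
            u64 now = rq_clock_task(rq);
            s64 delta_exec = now - curr->exec_start;

            if (unlikely(delta_exec <= 0))
                    return delta_exec;

            curr->exec_start = now;
            curr->sum_exec_runtime += delta_exec;

            return delta_exec;
    }
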
Phil Auld fea5f42a4c sched/fair: Remove SIS_PROP
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 984ffb6a4366752c949f7b39640aecdce222607f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Oct 20 12:35:33 2023 +0200

    sched/fair: Remove SIS_PROP

    SIS_UTIL seems to work well, let's remove the old thing.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:15 -04:00
Phil Auld 219d789d21 sched/fair: Scan cluster before scanning LLC in wake-up path
JIRA: https://issues.redhat.com/browse/RHEL-15622

commit 8881e1639f1f899b64e9bccf6cc14d51c1d3c822
Author: Barry Song <song.bao.hua@hisilicon.com>
Date:   Thu Oct 19 11:33:22 2023 +0800

    sched/fair: Scan cluster before scanning LLC in wake-up path

    For platforms having clusters like Kunpeng920, CPUs within the same cluster
    have lower latency when synchronizing and accessing shared resources like
    cache. Thus, this patch tries to find an idle cpu within the cluster of the
    target CPU before scanning the whole LLC to gain lower latency. This
    will be implemented in 2 steps in select_idle_sibling():
    1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
       if they share a cluster with the target CPU. Otherwise, try to
       scan for an idle CPU in the target's cluster.
    2. Scan the cluster prior to the LLC of the target CPU for an
       idle CPU to wake up.

    Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and to
    two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each cluster
    has 4 CPUs.

    With this patch, We noticed enhancement on tbench and netperf within one
    numa or cross two numa on top of tip-sched-core commit
    9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()")

    tbench results (node 0):
                baseline                     patched
      1:        327.2833        372.4623 (   13.80%)
      4:       1320.5933       1479.8833 (   12.06%)
      8:       2638.4867       2921.5267 (   10.73%)
     16:       5282.7133       5891.5633 (   11.53%)
     32:       9810.6733       9877.3400 (    0.68%)
     64:       7408.9367       7447.9900 (    0.53%)
    128:       6203.2600       6191.6500 (   -0.19%)
    tbench results (node 0-1):
                baseline                     patched
      1:        332.0433        372.7223 (   12.25%)
      4:       1325.4667       1477.6733 (   11.48%)
      8:       2622.9433       2897.9967 (   10.49%)
     16:       5218.6100       5878.2967 (   12.64%)
     32:      10211.7000      11494.4000 (   12.56%)
     64:      13313.7333      16740.0333 (   25.74%)
    128:      13959.1000      14533.9000 (    4.12%)

    netperf results TCP_RR (node 0):
                baseline                     patched
      1:      76546.5033      90649.9867 (   18.42%)
      4:      77292.4450      90932.7175 (   17.65%)
      8:      77367.7254      90882.3467 (   17.47%)
     16:      78519.9048      90938.8344 (   15.82%)
     32:      72169.5035      72851.6730 (    0.95%)
     64:      25911.2457      25882.2315 (   -0.11%)
    128:      10752.6572      10768.6038 (    0.15%)

    netperf results TCP_RR (node 0-1):
                baseline                     patched
      1:      76857.6667      90892.2767 (   18.26%)
      4:      78236.6475      90767.3017 (   16.02%)
      8:      77929.6096      90684.1633 (   16.37%)
     16:      77438.5873      90502.5787 (   16.87%)
     32:      74205.6635      88301.5612 (   19.00%)
     64:      69827.8535      71787.6706 (    2.81%)
    128:      25281.4366      25771.3023 (    1.94%)

    netperf results UDP_RR (node 0):
                baseline                     patched
      1:      96869.8400     110800.8467 (   14.38%)
      4:      97744.9750     109680.5425 (   12.21%)
      8:      98783.9863     110409.9637 (   11.77%)
     16:      99575.0235     110636.2435 (   11.11%)
     32:      95044.7250      97622.8887 (    2.71%)
     64:      32925.2146      32644.4991 (   -0.85%)
    128:      12859.2343      12824.0051 (   -0.27%)

    netperf results UDP_RR (node 0-1):
                baseline                     patched
      1:      97202.4733     110190.1200 (   13.36%)
      4:      95954.0558     106245.7258 (   10.73%)
      8:      96277.1958     105206.5304 (    9.27%)
     16:      97692.7810     107927.2125 (   10.48%)
     32:      79999.6702     103550.2999 (   29.44%)
     64:      80592.7413      87284.0856 (    8.30%)
    128:      27701.5770      29914.5820 (    7.99%)

    Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT
    branch in the code has not been tested, but it is supposed to work.

    Chen Yu also noticed this will improve the performance of tbench and
    netperf on a 24-CPU Jacobsville machine, where there are 4 CPUs in one
    cluster sharing the L2 cache.

    [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Tested-by: Yicong Yang <yangyicong@hisilicon.com>
    Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:04 -04:00
Phil Auld 26c251b772 sched: Add cpus_share_resources API
JIRA: https://issues.redhat.com/browse/RHEL-15622

commit b95303e0aeaf446b65169dd4142cacdaeb7d4c8b
Author: Barry Song <song.bao.hua@hisilicon.com>
Date:   Thu Oct 19 11:33:21 2023 +0800

    sched: Add cpus_share_resources API

    Add cpus_share_resources() API. This is the preparation for the
    optimization of select_idle_cpu() on platforms with cluster scheduler
    level.

    On a machine with clusters, cpus_share_resources() will test whether
    two CPUs are within the same cluster. On a non-cluster machine it
    behaves the same as cpus_share_cache(). So we use "resources"
    here for cache resources.

    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:46:43 -04:00
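
A small usage sketch of the new API alongside the existing cpus_share_cache() check (the wrapper function below is hypothetical, for illustration only):

    static bool wakeup_target_is_cheap(int candidate, int target)
    {
            /* Same cluster: shares cluster-level resources, cheapest option. */
            if (cpus_share_resources(candidate, target))
                    return true;

            /* Otherwise fall back to the wider LLC-sharing check. */
            return cpus_share_cache(candidate, target);
    }
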
Phil Auld de762ac709 sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit b19fdb16fb2167c6bc9ee8fbc0c1d2d4fd3e2eb8
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Tue Oct 10 16:57:44 2023 +0100

    sched/headers: Remove comment referring to rq::cpu_load, since this has been removed

    There is a comment that refers to cpu_load, however, this cpu_load was
    removed with:

      55627e3cd2 ("sched/core: Remove rq->cpu_load[]")

    ... back in 2019. The comment does not make sense with respect to this
    removed array, so remove the comment.

    Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20231010155744.1381065-1-colin.i.king@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:57 -04:00
Phil Auld 38176213a7 sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit f2273f4e19e29f7d0be6a2393f18369cd1b496c8
Author: Ingo Molnar <mingo@kernel.org>
Date:   Mon Oct 9 17:31:26 2023 +0200

    sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h

    Move it out of the .c file into the shared scheduler-internal header file,
    to gain type-checking.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:57 -04:00
Phil Auld 660107a034 sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 5fe7765997b139e2d922b58359dea181efe618f9
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Sep 28 17:02:51 2023 +0200

    sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded

    dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has
    nr_cpus_allowed > 1. Unlike the pushable_dl_tasks tree, dl_rq->dl_nr_migratory
    includes a dl_rq's current task. This means a dl_rq can have a migratable
    current, N non-migratable queued tasks, and be flagged as overloaded and have
    its CPU set in the dlo_mask, despite having an empty pushable_tasks tree.

    Make a dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task(),
    in other words make DL RQs only be flagged as overloaded if they have at
    least one runnable-but-not-current migratable task.

     o push_dl_task() is unaffected, as it is a no-op if there are no pushable
       tasks.

     o pull_dl_task() now no longer scans runqueues whose sole migratable task is
       their current one, which it can't do anything about anyway.
       It may also now pull tasks to a DL RQ with dl_nr_running > 1 if only its
       current task is migratable.

    Since dl_rq->dl_nr_migratory becomes unused, remove it.

    RT had the exact same mechanism (rt_rq->rt_nr_migratory) which was dropped
    in favour of relying on rt_rq->pushable_tasks, see:

      612f769edd06 ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask")

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20230928150251.463109-1-vschneid@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld 8883ff7c00 sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 612f769edd06a6e42f7cd72425488e68ddaeef0a
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri Aug 11 12:20:44 2023 +0100

    sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

    Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
    that has an empty pushable_tasks list, which means nothing useful will be
    done in the IPI other than queue the work for the next CPU on the rto_mask.

    rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
    but the conditions for that irq_work to be queued (and for a CPU to be
    added to the rto_mask) rely on rq_rt->nr_migratory instead.

    nr_migratory is increased whenever an RT task entity is enqueued and it has
    nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes a
    rt_rq's current task. This means a rt_rq can have a migratible current, N
    non-migratible queued tasks, and be flagged as overloaded / have its CPU
    set in the rto_mask, despite having an empty pushable_tasks list.

    Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
    Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.

    Note that the case where the current task is pushed away to make way for a
    migration-disabled task remains unchanged: the migration-disabled task has
    to be in the pushable_tasks list in the first place, which means it has
    nr_cpus_allowed > 1.

    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld 8738be5be2 sched/fair: Make cfs_rq->throttled_csd_list available on !SMP
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz in sched.h due to context from kABI additions.

commit 30797bce8ef0c73f0c388148ffac92458533b10e
Author: Josh Don <joshdon@google.com>
Date:   Fri Sep 22 16:05:34 2023 -0700

    sched/fair: Make cfs_rq->throttled_csd_list available on !SMP

    This makes the following patch cleaner by avoiding extra CONFIG_SMP
    conditionals on the availability of cfs_rq->throttled_csd_list.

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230922230535.296350-1-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld 7e2b960e90 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz in fair.c due to having RT merged,
  specifically: ea622076b76f ("sched: Add support for lazy preemption")

commit e23edc86b09df655bf8963bbcb16647adc787395
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Sep 19 10:38:21 2023 +0200

    sched/fair: Rename check_preempt_curr() to wakeup_preempt()

    The name is a bit opaque - make it clear that this is about wakeup
    preemption.

    Also rename the ->check_preempt_curr() methods similarly.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld 380b737290 sched/headers: Remove duplicated includes in kernel/sched/sched.h
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 7ad0354d18ae05e9c8885251e234cbcf141f8972
Author: GUO Zihua <guozihua@huawei.com>
Date:   Fri Aug 18 09:56:33 2023 +0800

    sched/headers: Remove duplicated includes in kernel/sched/sched.h

    Remove the duplicated includes of linux/cgroup.h and linux/psi.h. Both
    headers are already included unconditionally, regardless of config, and
    both carry ifndef include guards, so there is no point in including them
    again.

    Signed-off-by: GUO Zihua <guozihua@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230818015633.18370-1-guozihua@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld 1c89ce0800 sched/fair: Ratelimit update to tg->load_avg
JIRA: https://issues.redhat.com/browse/RHEL-25535
JIRA: https://issues.redhat.com/browse/RHEL-20158

commit 1528c661c24b407e92194426b0adbb43de859ce0
Author: Aaron Lu <aaron.lu@intel.com>
Date:   Tue Sep 12 14:58:08 2023 +0800

    sched/fair: Ratelimit update to tg->load_avg

    When using sysbench to benchmark Postgres in a single docker instance
    with sysbench's nr_threads set to nr_cpu, it is observed that at times
    update_cfs_group() and update_load_avg() show noticeable overhead on a
    2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR) machine:

        13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
        10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

    Annotation shows the cycles are mostly spent on accessing tg->load_avg,
    with update_load_avg() being the write side and update_cfs_group() being
    the read side. tg->load_avg is per task group, and when different tasks
    of the same task group, running on different CPUs, frequently access
    tg->load_avg, it can be heavily contended.

    E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
    Sapphire Rapids machine, during a 5s window the wakeup count is 14 million
    and the migration count is 11 million; with each migration, the task's
    load is transferred from the src cfs_rq to the target cfs_rq, and each
    such change involves an update to tg->load_avg. Since the workload
    triggers that many wakeups and migrations, the accesses (both read and
    write) to tg->load_avg are effectively unbounded. As a result, the two
    mentioned functions show noticeable overhead. With
    netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: during a 5s window
    the wakeup count is 21 million and the migration count is 14 million;
    update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

    Reduce the overhead by limiting updates to tg->load_avg to at most once
    per ms. The update frequency is a tradeoff between tracking accuracy and
    overhead. 1ms is chosen because the PELT window is roughly 1ms and it
    delivered good results for the tests that I've done. After this change,
    the cost of accessing tg->load_avg is greatly reduced and performance
    improved. Detailed test results below, followed by a sketch of the idea.

      ==============================
      postgres_sysbench on SPR:
      25%
      base:   42382±19.8%
      patch:  50174±9.5%  (noise)

      50%
      base:   67626±1.3%
      patch:  67365±3.1%  (noise)

      75%
      base:   100216±1.2%
      patch:  112470±0.1% +12.2%

      100%
      base:    93671±0.4%
      patch:  113563±0.2% +21.2%

      ==============================
      hackbench on ICL:
      group=1
      base:    114912±5.2%
      patch:   117857±2.5%  (noise)

      group=4
      base:    359902±1.6%
      patch:   361685±2.7%  (noise)

      group=8
      base:    461070±0.8%
      patch:   491713±0.3% +6.6%

      group=16
      base:    309032±5.0%
      patch:   378337±1.3% +22.4%

      =============================
      hackbench on SPR:
      group=1
      base:    100768±2.9%
      patch:   103134±2.9%  (noise)

      group=4
      base:    413830±12.5%
      patch:   378660±16.6% (noise)

      group=8
      base:    436124±0.6%
      patch:   490787±3.2% +12.5%

      group=16
      base:    457730±3.2%
      patch:   680452±1.3% +48.8%

      ============================
      netperf/udp_rr on ICL
      25%
      base:    114413±0.1%
      patch:   115111±0.0% +0.6%

      50%
      base:    86803±0.5%
      patch:   86611±0.0%  (noise)

      75%
      base:    35959±5.3%
      patch:   49801±0.6% +38.5%

      100%
      base:    61951±6.4%
      patch:   70224±0.8% +13.4%

      ===========================
      netperf/udp_rr on SPR
      25%
      base:   104954±1.3%
      patch:  107312±2.8%  (noise)

      50%
      base:    55394±4.6%
      patch:   54940±7.4%  (noise)

      75%
      base:    13779±3.1%
      patch:   36105±1.1% +162%

      100%
      base:     9703±3.7%
      patch:   28011±0.2% +189%

      ==============================================
      netperf/tcp_stream on ICL (all in noise range)
      25%
      base:    43092±0.1%
      patch:   42891±0.5%

      50%
      base:    19278±14.9%
      patch:   22369±7.2%

      75%
      base:    16822±3.0%
      patch:   17086±2.3%

      100%
      base:    18216±0.6%
      patch:   18078±2.9%

      ===============================================
      netperf/tcp_stream on SPR (all in noise range)
      25%
      base:    34491±0.3%
      patch:   34886±0.5%

      50%
      base:    19278±14.9%
      patch:   22369±7.2%

      75%
      base:    16822±3.0%
      patch:   17086±2.3%

      100%
      base:    18216±0.6%
      patch:   18078±2.9%
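
    A minimal sketch of the ratelimiting idea described above (the timestamp
    field and the clock used here are assumptions, not necessarily the exact
    patch):

      static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
      {
              u64 now;
              long delta;

              /* The root group has no parent to propagate load into. */
              if (cfs_rq->tg == &root_task_group)
                      return;

              /* Rate limit updates to the global tg->load_avg to once per ms. */
              now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
              if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
                      return;

              delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
              if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                      atomic_long_add(delta, &cfs_rq->tg->load_avg);
                      cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
                      cfs_rq->last_update_tg_load_avg = now;
              }
      }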

    Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
    Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Reviewed-by: David Vernet <void@manifault.com>
    Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
    Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:26 -04:00
Phil Auld b3d7247782 sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to 'int'
JIRA: https://issues.redhat.com/browse/RHEL-29436

commit 089768dfeb3ab294f9ab6a1f2462001f0f879fbb
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Sun Oct 8 10:15:38 2023 +0800

    sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to 'int'

    Doing this matches the natural 'int'-based arithmetic in
    sched_rt_handler(), and also enables adding a correct upper-bounds
    check on the sysctl interface.

    [ mingo: Rewrote the changelog. ]
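
    A minimal sketch of what the sysctl entry can then look like (the exact
    bounds chosen are an assumption here):

      int sysctl_sched_rt_period = 1000000;          /* now a plain 'int' */

      static struct ctl_table sched_rt_sysctls[] = {
              {
                      .procname       = "sched_rt_period_us",
                      .data           = &sysctl_sched_rt_period,
                      .maxlen         = sizeof(int),
                      .mode           = 0644,
                      .proc_handler   = sched_rt_handler,
                      .extra1         = SYSCTL_ONE,       /* lower bound */
                      .extra2         = SYSCTL_INT_MAX,   /* upper bound */
              },
              {}
      };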

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:55:45 -04:00
Phil Auld 9f16bf1bd9 sched: add WF_CURRENT_CPU and externise ttwu
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit ab83f455f04df5b2f7c6d4de03b6d2eaeaa27b8a
Author: Peter Oskolkov <posk@google.com>
Date:   Tue Mar 7 23:31:57 2023 -0800

    sched: add WF_CURRENT_CPU and externise ttwu

    Add a WF_CURRENT_CPU wake flag that advises the scheduler to
    move the wakee to the current CPU. This is useful for fast on-CPU
    context switching use cases.

    In addition, make ttwu external rather than static so that
    the flag can be passed to it from outside of sched/core.c.
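
    A minimal sketch of how such a flag can be honoured (the flag value and
    the exact placement in the CPU-selection path are assumptions):

      /* New wake flag; the value just has to be a free bit. */
      #define WF_CURRENT_CPU  0x40

      /* In the wakeup CPU-selection path: */
      if ((wake_flags & WF_CURRENT_CPU) &&
          cpumask_test_cpu(smp_processor_id(), p->cpus_ptr))
              return smp_processor_id();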

    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: Andrei Vagin <avagin@google.com>
    Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
    Signed-off-by: Kees Cook <keescook@chromium.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:49:13 -04:00
Phil Auld dd1a6e3897 sched: add throttled time stat for throttled children
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 677ea015f231aa38b3972aa7be54ecd2637e99fd
Author: Josh Don <joshdon@google.com>
Date:   Tue Jun 20 11:32:47 2023 -0700

    sched: add throttled time stat for throttled children

    We currently export the total throttled time for cgroups that are given
    a bandwidth limit. This patch extends that accounting to also include
    the total time that each child cgroup has been throttled.

    This is useful to understand the degree to which children have been
    affected by the throttling control. Children which are not runnable
    during the entire throttled period, for example, will not show any
    self-throttling time during this period.

    Expose this in a new interface, 'cpu.stat.local', which is similar to
    how non-hierarchical events are accounted in 'memory.events.local'.
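
    A minimal sketch of the accounting idea (field names are assumptions):
    stamp the time when a cfs_rq itself gets throttled, and accumulate the
    delta when it is unthrottled, separately from the hierarchical total:

      /* When this cfs_rq itself runs out of quota: */
      cfs_rq->throttled_clock_self = rq_clock(rq);

      /* When quota is replenished and the cfs_rq is unthrottled: */
      if (cfs_rq->throttled_clock_self) {
              u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;

              cfs_rq->throttled_clock_self = 0;
              cfs_rq->throttled_clock_self_time += delta; /* shown in cpu.stat.local */
      }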

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:49:13 -04:00
Phil Auld 2adae1808f sched/cpufreq: Rework iowait boost
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit f12560779f9d734446508f3df17f5632e9aaa2c8
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Nov 22 14:39:04 2023 +0100

    sched/cpufreq: Rework iowait boost

    Use the max value that has already been computed inside sugov_get_util()
    to cap the iowait boost, and remove the dependency on
    uclamp_rq_util_with(), which is not used anymore.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael@kernel.org>
    Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:59 -04:00
Phil Auld 74c4d90dda sched/cpufreq: Rework schedutil governor performance estimation
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Nov 22 14:39:03 2023 +0100

    sched/cpufreq: Rework schedutil governor performance estimation

    The current method of taking uclamp hints into account when estimating
    the target frequency can end up in a situation where the selected target
    frequency is higher than the uclamp hints, even though there is no real
    need for it. Such cases mainly happen because we are currently mixing the
    traditional scheduler utilization signal with the uclamp performance
    hints. By adding these 2 metrics together, we lose important information
    when it comes to selecting the target frequency, and we have to make
    assumptions which can't fit all cases.

    Rework the interface between the scheduler and schedutil governor in order
    to propagate all information down to the cpufreq governor.

    The effective_cpu_util() interface changes and now returns the actual
    utilization of the CPU together with 2 optional inputs:

    - The minimum performance for this CPU; typically the capacity needed to
      handle the deadline tasks and the interrupt pressure, but also the
      uclamp_min request when available.

    - The maximum targeted performance for this CPU, which reflects the
      maximum level that we would like not to exceed. By default it will be
      the CPU capacity, but it can be reduced because of some performance
      hints set with uclamp. The value can be lower than the actual
      utilization and/or the min performance level.

    A new sugov_effective_cpu_perf() interface is also available to compute
    the final performance level that is targeted for the CPU, after applying
    some cpufreq headroom and taking into account all inputs.

    With these 2 functions, schedutil is now able to decide when it must go
    above uclamp hints. It now also has a generic way to get the min
    performance level.

    The dependency between the energy model and the cpufreq governor's
    headroom policy doesn't exist anymore.

    eenv_pd_max_util() asks schedutil for the targeted performance after
    applying the impact of the waking task.

    [ mingo: Refined the changelog & C comments. ]
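
    A minimal sketch of how the final performance level can be composed from
    those inputs (the headroom helper is the existing map_util_perf(); the
    body is a sketch, not the literal patch):

      unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
                                             unsigned long min, unsigned long max)
      {
              /* Add the usual cpufreq headroom on top of the actual utilization. */
              actual = map_util_perf(actual);

              /* No need to target more than the requested maximum... */
              if (actual < max)
                      max = actual;

              /* ...but always provide at least the minimum performance. */
              return max(min, max);
      }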

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael@kernel.org>
    Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:59 -04:00
Phil Auld 49d1b3f5c9 sched/topology: Consolidate and clean up access to a CPU's max compute capacity
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Oct 9 12:36:16 2023 +0200

    sched/topology: Consolidate and clean up access to a CPU's max compute capacity

    Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
    instead.

    The scheduler uses 3 methods to get access to a CPU's max compute capacity:

     - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.

     - cpu_capacity_orig field which is periodically updated with
       arch_scale_cpu_capacity().

     - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.

    There is no real need to save the value returned by arch_scale_cpu_capacity()
    in struct rq. arch_scale_cpu_capacity() returns:

     - either a per_cpu variable.

     - or a const value for systems which have only one capacity.

    Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.

    No functional changes.

    Some performance tests on Arm64:

      - small SMP device (hikey): no noticeable changes
      - HMP device (RB5):         hackbench shows minor improvement (1-2%)
      - large SMP (thx2):         hackbench and tbench show minor improvement (1%)
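
    The conversion itself is mechanical; a representative before/after
    (the call site is illustrative):

      /* Before: a copy cached in struct rq, refreshed periodically. */
      capacity = capacity_orig_of(cpu);        /* rq->cpu_capacity_orig */

      /* After: read the source of truth directly. */
      capacity = arch_scale_cpu_capacity(cpu); /* per-CPU variable or constant */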

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:50 -04:00
Phil Auld 7d8b86de57 sched: Simplify set_user_nice()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 94b548a15e8ec47dfbf6925bdfb64bb5657dce0c
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 20:52:55 2023 +0200

    sched: Simplify set_user_nice()

    Use guards to reduce gotos and simplify control flow.
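
    The guards come from <linux/cleanup.h>; a minimal sketch of the shape
    (not the literal set_user_nice() diff, which takes the task's rq lock):

      #include <linux/cleanup.h>

      void set_user_nice_sketch(struct task_struct *p, long nice)
      {
              if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
                      return;

              /*
               * The lock is released automatically on every return path,
               * so error paths no longer need a 'goto out_unlock'.
               */
              scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
                      /* ... update static_prio, reweight, requeue ... */
              }
      }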

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:36:13 -04:00
Phil Auld 2d0ed06667 sched: Simplify wake_up_if_idle()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 4eb054f92b066ec0a0cba6896ee8eff4c91dfc9e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:25 2023 +0200

    sched: Simplify wake_up_if_idle()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:47 -04:00
Phil Auld f4b0880d3d sched: Simplify: migrate_swap_stop()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 5bb76f1ddf2a7dd98f5a89d7755600ed1b4a7fcd
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:24 2023 +0200

    sched: Simplify: migrate_swap_stop()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:43 -04:00