Commit Graph

216 Commits

Luis Claudio R. Goncalves 0857fd208a sched: Fix stop_one_cpu_nowait() vs hotplug
JIRA: https://issues.redhat.com/browse/RHEL-84526

commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 10 20:57:39 2023 +0200

    sched: Fix stop_one_cpu_nowait() vs hotplug

    Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
    hotplug stress-test -- notably affine_move_task() remains stuck in
    wait_for_completion(), leading to a hung-task detector warning.

    Specifically, it was reported that stop_one_cpu_nowait(.fn =
    migration_cpu_stop) returns false -- this stopper is responsible for
    the matching complete().

    The race scenario is:

            CPU0                                    CPU1

                                            // doing _cpu_down()

      __set_cpus_allowed_ptr()
        task_rq_lock();
                                            takedown_cpu()
                                              stop_machine_cpuslocked(take_cpu_down..)

                                            <PREEMPT: cpu_stopper_thread()
                                              MULTI_STOP_PREPARE
                                              ...
        __set_cpus_allowed_ptr_locked()
          affine_move_task()
            task_rq_unlock();

      <PREEMPT: cpu_stopper_thread()\>
        ack_state()
                                              MULTI_STOP_RUN
                                                take_cpu_down()
                                                  __cpu_disable();
                                                  stop_machine_park();
                                                    stopper->enabled = false;
                                             />
       />
            stop_one_cpu_nowait(.fn = migration_cpu_stop);
              if (stopper->enabled) // false!!!

    That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
    stopper thread gets a chance to preempt and allows the cpu-down for
    the target CPU to complete.

    OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
    issue a wakeup, it must not be run under the scheduler locks.

    Solve this apparent contradiction by keeping preemption disabled over
    the unlock + queue_stopper combination:

            preempt_disable();
            task_rq_unlock(...);
            if (!stop_pending)
              stop_one_cpu_nowait(...)
            preempt_enable();

    This respects the lock ordering constraints while still avoiding the
    above race. That is, if we find the CPU is online under rq-lock, the
    targeted stop_one_cpu_nowait() must succeed.

    Apply this pattern to all similar stop_one_cpu_nowait() invocations.

    Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
    Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net

Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
2025-03-21 18:50:20 -03:00
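
To illustrate the pattern from the commit above, here is a minimal C sketch. The helper name queue_migration_stop() and the stop_pending flag are illustrative, not the literal patch; the point is that preemption stays disabled from under the rq lock until the stopper work has been queued, so the stopper thread cannot run and get parked by a concurrent cpu-down in between.

    /*
     * Illustrative sketch: keep preemption disabled across the unlock +
     * queue_stopper combination described in the commit above.
     */
    static void queue_migration_stop(struct rq *rq, struct task_struct *p,
                                     struct rq_flags *rf,
                                     struct cpu_stop_work *work, bool stop_pending)
    {
            preempt_disable();
            task_rq_unlock(rq, p, rf);      /* drops rq->lock and p->pi_lock */
            if (!stop_pending)
                    stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop, p, work);
            preempt_enable();
    }
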
Phil Auld 91707bbfc4 sched: Consolidate pick_*_task to task_is_pushable helper
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 18adad1dac3334ed34f60ad4de2960df03058142
Author: Connor O'Brien <connoro@google.com>
Date:   Wed Oct 9 16:53:38 2024 -0700

    sched: Consolidate pick_*_task to task_is_pushable helper

    This patch consolidates rt and deadline pick_*_task functions to
    a task_is_pushable() helper

    This patch was broken out from a larger chain migration
    patch originally by Connor O'Brien.

    [jstultz: split out from larger chain migration patch,
     renamed helper function]

    Signed-off-by: Connor O'Brien <connoro@google.com>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Metin Kaya <metin.kaya@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Christian Loehle <christian.loehle@arm.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Tested-by: Metin Kaya <metin.kaya@arm.com>
    Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:08 +00:00
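
As a rough sketch of what the consolidated helper from the commit above looks like (assuming the rt/dl pick_*_task checks reduce to an on-cpu test plus a cpumask test; details may differ from the actual patch):

    static inline bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
    {
            /* pushable == not currently running and allowed on the target CPU */
            if (!task_on_cpu(rq, p) && cpumask_test_cpu(cpu, &p->cpus_mask))
                    return true;

            return false;
    }
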
Phil Auld 8851a9b9ae sched: Add move_queued_task_locked helper
JIRA: https://issues.redhat.com/browse/RHEL-78821
Conflicts: Context diffs in sched.h due to not having eevdf code.

commit 2b05a0b4c08ffd6dedfbd27af8708742cde39b95
Author: Connor O'Brien <connoro@google.com>
Date:   Wed Oct 9 16:53:37 2024 -0700

    sched: Add move_queued_task_locked helper

    Switch logic that deactivates, sets the task cpu,
    and reactivates a task on a different rq to use a
    helper that will be later extended to push entire
    blocked task chains.

    This patch was broken out from a larger chain migration
    patch originally by Connor O'Brien.

    [jstultz: split out from larger chain migration patch]
    Signed-off-by: Connor O'Brien <connoro@google.com>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Metin Kaya <metin.kaya@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Tested-by: Metin Kaya <metin.kaya@arm.com>
    Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:08 +00:00
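
A minimal sketch of such a helper, following the deactivate / set cpu / reactivate sequence the commit above describes (assumed shape, with lockdep asserts on both rq locks):

    static inline void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq,
                                               struct task_struct *task)
    {
            lockdep_assert_rq_held(src_rq);
            lockdep_assert_rq_held(dst_rq);

            deactivate_task(src_rq, task, 0);
            set_task_cpu(task, dst_rq->cpu);
            activate_task(dst_rq, task, 0);
    }
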
Phil Auld e8bf69e6e0 sched: Fix spelling in comments
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Dropped hunks in mm_cid code which we don't have. Minor
context diffs due to still having IA64 in tree and previous Kabi
workarounds.

commit 402de7fc880fef055bc984957454b532987e9ad0
Author: Ingo Molnar <mingo@kernel.org>
Date:   Mon May 27 16:54:52 2024 +0200

    sched: Fix spelling in comments

    Do a spell-checking pass.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Lucas Zampieri f67ab7550c Merge: Scheduler: rhel9.5 updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3975

JIRA: https://issues.redhat.com/browse/RHEL-25535 

JIRA: https://issues.redhat.com/browse/RHEL-20158  

JIRA: https://issues.redhat.com/browse/RHEL-15622

Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935

Tested: Scheduler stress tests. Perf QE will do a
performance regression test.

A collection of fixes and updates that brings the
core scheduler code up to v6.8. EEVDF related commits
are skipped since we are not planning to take the new
task scheduler in rhel9.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-08 20:13:47 +00:00
Lucas Zampieri d23522d08a Merge: Sched: schedutil/cpufreq updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935

JIRA: https://issues.redhat.com/browse/RHEL-29020  
  
Bring schedutil code up to about v6.8. This includes some fixes for
code in rhel9 from the 5.14 rebase. There are a few pieces in cpufreq
driver code and the arm architectures needed to make it complete.
Tested: Ran stress tests with schedutil governor. Ran general scheduler  
stress and performance tests.  
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-26 12:34:20 +00:00
Phil Auld 7fc27e6f01 sched: Unify runtime accounting across classes
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Whitespace context difference in removed code in sched.h.
 Minor context diff in fair.c due to not having the eevdf scheduler
 patches in rhel.

commit 5d69eca542ee17c618f9a55da52191d5e28b435f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:18 2023 +0100

    sched: Unify runtime accounting across classes

    All classes use sched_entity::exec_start to track runtime and have
    copies of the exact same code around to compute runtime.

    Collapse all that.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
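
The duplicated accounting that gets collapsed is essentially the following; a sketch of the shared helper (assumed shape): compute the delta since se->exec_start, advance exec_start and accumulate it into sum_exec_runtime.

    static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
    {
            u64 now = rq_clock_task(rq);
            s64 delta_exec;

            delta_exec = now - curr->exec_start;
            if (unlikely(delta_exec <= 0))
                    return delta_exec;

            curr->exec_start = now;
            curr->sum_exec_runtime += delta_exec;

            return delta_exec;
    }
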
Phil Auld 8883ff7c00 sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 612f769edd06a6e42f7cd72425488e68ddaeef0a
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri Aug 11 12:20:44 2023 +0100

    sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

    Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
    that has an empty pushable_tasks list, which means nothing useful will be
    done in the IPI other than queue the work for the next CPU on the rto_mask.

    rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
    but the conditions for that irq_work to be queued (and for a CPU to be
    added to the rto_mask) rely on rt_rq->nr_migratory instead.

    nr_migratory is increased whenever an RT task entity is enqueued and it has
    nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes an
    rt_rq's current task. This means an rt_rq can have a migratable current, N
    non-migratable queued tasks, and be flagged as overloaded / have its CPU
    set in the rto_mask, despite having an empty pushable_tasks list.

    Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
    Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.

    Note that the case where the current task is pushed away to make way for a
    migration-disabled task remains unchanged: the migration-disabled task has
    to be in the pushable_tasks list in the first place, which means it has
    nr_cpus_allowed > 1.

    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
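
A sketch of the resulting shape (assumed, not the literal patch): the overload flag and rto_mask membership follow the pushable_tasks list rather than rt_nr_migratory.

    static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
    {
            plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
            plist_node_init(&p->pushable_tasks, p->prio);
            plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);

            if (!rq->rt.overloaded) {
                    rt_set_overload(rq);    /* sets this CPU in rto_mask */
                    rq->rt.overloaded = 1;
            }
    }

    static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
    {
            plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);

            if (!has_pushable_tasks(rq) && rq->rt.overloaded) {
                    rt_clear_overload(rq);  /* clears this CPU from rto_mask */
                    rq->rt.overloaded = 0;
            }
    }
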
Phil Auld 7e2b960e90 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz in fair.c due to having RT merged,
  specifically: ea622076b76f ("sched: Add support for lazy preemption")

commit e23edc86b09df655bf8963bbcb16647adc787395
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Sep 19 10:38:21 2023 +0200

    sched/fair: Rename check_preempt_curr() to wakeup_preempt()

    The name is a bit opaque - make it clear that this is about wakeup
    preemption.

    Also rename the ->check_preempt_curr() methods similarly.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld b3d7247782 sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to 'int'
JIRA: https://issues.redhat.com/browse/RHEL-29436

commit 089768dfeb3ab294f9ab6a1f2462001f0f879fbb
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Sun Oct 8 10:15:38 2023 +0800

    sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to 'int'

    Doing this matches the natural 'int'-based calculus
    in sched_rt_handler(), and also enables adding a
    correct upper-bounds check on the sysctl interface.

    [ mingo: Rewrote the changelog. ]

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:55:45 -04:00
Phil Auld 49d1b3f5c9 sched/topology: Consolidate and clean up access to a CPU's max compute capacity
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Oct 9 12:36:16 2023 +0200

    sched/topology: Consolidate and clean up access to a CPU's max compute capacity

    Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
    instead.

    The scheduler uses 3 methods to get access to a CPU's max compute capacity:

     - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.

     - cpu_capacity_orig field which is periodically updated with
       arch_scale_cpu_capacity().

     - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.

    There is no real need to save the value returned by arch_scale_cpu_capacity()
    in struct rq. arch_scale_cpu_capacity() returns:

     - either a per_cpu variable.

     - or a const value for systems which have only one capacity.

    Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.

    No functional changes.

    Some performance tests on Arm64:

      - small SMP device (hikey): no noticeable changes
      - HMP device (RB5):         hackbench shows minor improvement (1-2%)
      - large smp (thx2):         hackbench and tbench shows minor improvement (1%)

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:50 -04:00
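
The change above amounts to dropping the cached field and reading the arch-provided value directly; roughly (cpu_max_capacity() is just an illustrative wrapper name):

    /* Sketch: rq->cpu_capacity_orig / capacity_orig_of(cpu) go away ... */
    static inline unsigned long cpu_max_capacity(int cpu)
    {
            /* ... callers read the per-CPU (or constant) arch value directly. */
            return arch_scale_cpu_capacity(cpu);
    }
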
Phil Auld 499c230970 sched/rt: Disallow writing invalid values to sched_rt_period_us
JIRA: https://issues.redhat.com/browse/RHEL-29436

commit 079be8fc630943d9fc70a97807feb73d169ee3fc
Author: Cyril Hrubis <chrubis@suse.cz>
Date:   Mon Oct 2 13:55:51 2023 +0200

    sched/rt: Disallow writing invalid values to sched_rt_period_us

    The validation of the value written to sched_rt_period_us was broken
    because:

      - sysctl_sched_rt_period is declared as unsigned int
      - it is parsed by proc_dointvec()
      - the range is asserted after the value has been parsed by proc_dointvec()

    Because of this, negative values written to the file were stored in an
    unsigned integer and later interpreted as large positive integers, which
    passed the check:

      if (sysctl_sched_rt_period <= 0)
            return EINVAL;

    This commit fixes the parsing by setting an explicit range for both
    period_us and runtime_us in the sched_rt_sysctls table and processing
    the values with proc_dointvec_minmax() instead.

    Alternatively, if we wanted to use the full range of unsigned int for the
    period value, we would have to split the proc_handler and use
    proc_douintvec() for it; however, even
    Documentation/scheduler/sched-rt-group.rst describes the range as 1 to
    INT_MAX.

    As far as I can tell the only problem this causes is that the sysctl
    file allows writing negative values which when read back may confuse
    userspace.

    There is also a LTP test being submitted for these sysctl files at:

      http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

    Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 08:46:01 -04:00
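
A sketch of what the fixed sysctl entries look like (assumed shape; the handler parses with proc_dointvec_minmax() and the bounds live in the sched_rt_sysctls table):

    static struct ctl_table sched_rt_sysctls[] = {
            {
                    .procname       = "sched_rt_period_us",
                    .data           = &sysctl_sched_rt_period,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = sched_rt_handler,
                    .extra1         = SYSCTL_ONE,           /* reject <= 0 */
                    .extra2         = SYSCTL_INT_MAX,
            },
            {
                    .procname       = "sched_rt_runtime_us",
                    .data           = &sysctl_sched_rt_runtime,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = sched_rt_handler,
                    .extra1         = SYSCTL_NEG_ONE,       /* -1 means RUNTIME_INF */
                    .extra2         = (void *)&sysctl_sched_rt_period,
            },
            {}
    };
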
Phil Auld e76f2e0d03 sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
JIRA: https://issues.redhat.com/browse/RHEL-29436

commit c1fc6484e1fb7cc2481d169bfef129a1b0676abe
Author: Cyril Hrubis <chrubis@suse.cz>
Date:   Wed Aug 2 17:19:06 2023 +0200

    sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset

    The sched_rr_timeslice can be reset to its default by writing a value that
    is <= 0. However, after reading from this file we always got the last value
    written, which is not useful at all.

    $ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
    $ cat /proc/sys/kernel/sched_rr_timeslice_ms
    -1

    Fix this by setting the variable that holds the sysctl file value to
    jiffies_to_msecs(RR_TIMESLICE) whenever a value <= 0 is written.

    Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Petr Vorel <pvorel@suse.cz>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Tested-by: Petr Vorel <pvorel@suse.cz>
    Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 08:46:01 -04:00
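
A sketch of the fix (assumed shape): after a reset, the default is written back into the sysctl variable so a later read reports the effective timeslice rather than the raw value last written.

    static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
                                size_t *lenp, loff_t *ppos)
    {
            static DEFINE_MUTEX(mutex);
            int ret;

            mutex_lock(&mutex);
            ret = proc_dointvec(table, write, buffer, lenp, ppos);
            if (!ret && write) {
                    sched_rr_timeslice = sysctl_sched_rr_timeslice <= 0 ?
                            RR_TIMESLICE : msecs_to_jiffies(sysctl_sched_rr_timeslice);

                    /* reflect the default back so reads make sense */
                    if (sysctl_sched_rr_timeslice <= 0)
                            sysctl_sched_rr_timeslice = jiffies_to_msecs(RR_TIMESLICE);
            }
            mutex_unlock(&mutex);

            return ret;
    }
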
Phil Auld 7b01f17632 sched/rt: Fix sysctl_sched_rr_timeslice intial value
JIRA: https://issues.redhat.com/browse/RHEL-29436

commit c7fcb99877f9f542c918509b2801065adcaf46fa
Author: Cyril Hrubis <chrubis@suse.cz>
Date:   Wed Aug 2 17:19:05 2023 +0200

    sched/rt: Fix sysctl_sched_rr_timeslice intial value

    There is a 10% rounding error in the initial value of the
    sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.

    This was found with LTP test sched_rr_get_interval01:

    sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
    sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
    sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
    sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
    sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
    sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90

    The test compares the return value from sched_rr_get_interval() with the
    sched_rr_timeslice_ms sysctl file and fails if they do not match.

    The problem it found is the initial sysctl file value, which was computed as:

    static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

    which works fine as long as MSEC_PER_SEC is a multiple of HZ; however, it
    introduces a 10% rounding error for CONFIG_HZ_300:

    (MSEC_PER_SEC / HZ) * (100 * HZ / 1000)

    (1000 / 300) * (100 * 300 / 1000)

    3 * 30 = 90

    This can be easily fixed by reversing the order of the multiplication
    and division. After this fix we get:

    (MSEC_PER_SEC * (100 * HZ / 1000)) / HZ

    (1000 * (100 * 300 / 1000)) / 300

    (1000 * 30) / 300 = 100

    Fixes: 975e155ed8 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
    Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Petr Vorel <pvorel@suse.cz>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Tested-by: Petr Vorel <pvorel@suse.cz>
    Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 08:46:01 -04:00
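
The fix itself is a one-liner; multiplying before dividing removes the truncation:

    /*
     * Before: (MSEC_PER_SEC / HZ) * RR_TIMESLICE  ->  3 * 30 = 90 with HZ=300.
     * After: multiply first, divide last, so the integer truncation disappears:
     *        (1000 * (100 * 300 / 1000)) / 300 = 100.
     */
    static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC * RR_TIMESLICE) / HZ;
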
Eder Zulian 2a63005194 sched/rt: Don't try push tasks if there are none.
JIRA: https://issues.redhat.com/browse/RHEL-3988
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

commit baf33250e0b248eb68a0fa5572861416625d8121
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Aug 1 17:26:48 2023 +0200

    sched/rt: Don't try push tasks if there are none.

    I have an RT task X at a high priority and cyclictest on each CPU with
    lower priority than X's. If X is active and each CPU wakes its own
    cyclictest thread then it ends in a long rto_push storm.
    A random CPU determines via balance_rt() that the CPU on which X is
    running needs to push tasks. X has the highest priority, cyclictest is
    next in line so there is nothing that can be done since the task with
    the higher priority is not touched.

    tell_cpu_to_push() increments rto_loop_next and schedules
    rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
    loop counter and do the same. Once rto_push_irq_work_func() is active it
    does nothing because it has _no_ pushable tasks on its runqueue. It then
    checks rto_next_cpu() and decides to queue irq_work on the local CPU
    because another CPU requested a push by incrementing the counter.

    I have traces where ~30 CPUs request this ~3 times each before it
    finally ends. This greatly increases X's runtime while X isn't making
    much progress.

    Teach rto_next_cpu() to only return CPUs which also have tasks on their
    runqueue which can be pushed away. This does not reduce the
    tell_cpu_to_push() invocations (rto_loop_next counter increments) but
    reduces the amount of issued rto_push_irq_work_func() if nothing can be
    done. As the result the overloaded CPU is blocked less often.

    There are still cases where the "same job" is repeated several times
    (for instance the current CPU needs to resched but didn't yet because
    the irq-work is repeated a few times and so the old task remains on the
    CPU) but the majority of requests end in tell_cpu_to_push() before an IPI
    is issued.

    Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Eder Zulian <ezulian@redhat.com>
2023-11-06 12:29:40 +01:00
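
The relevant part of the change, sketched (assumed placement inside rto_next_cpu()'s scan of the rto_mask): CPUs with an empty pushable_tasks list are simply skipped instead of being handed an irq_work.

    /* inside rto_next_cpu()'s loop over rd->rto_mask (sketch) */
    cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
    rd->rto_cpu = cpu;

    if (cpu < nr_cpu_ids) {
            /* new: don't bother CPUs that have nothing to push */
            if (!has_pushable_tasks(cpu_rq(cpu)))
                    continue;
            return cpu;
    }
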
Phil Auld 112765493a sched/core: Avoid selecting the task that is throttled to run when core-sched enable
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 530bfad1d53d103f98cec66a3e491a36d397884d
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Thu Mar 16 16:18:06 2023 +0800

    sched/core: Avoid selecting the task that is throttled to run when core-sched enable

    When an {rt,cfs}_rq or dl task is throttled, cookied tasks
    are not dequeued from the core tree, so sched_core_find() and
    sched_core_next() may return a throttled task, which may
    cause a throttled task to run on the CPU.

    So we add checks in sched_core_find() and sched_core_next()
    to make sure that the returned task is runnable and
    not throttled.

    Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
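
A rough sketch of the idea (the helper names here are assumed, not necessarily the ones added by the patch): keep walking cookie matches in the core tree until one is found that is not throttled, and fall back to idle otherwise.

    static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
    {
            struct task_struct *p = sched_core_first(rq, cookie);   /* assumed helper */

            while (p) {
                    if (!sched_task_is_throttled(p, cpu_of(rq)))     /* assumed helper */
                            return p;
                    p = sched_core_next(p, cookie);
            }

            /* every cookie match is throttled: let this SMT sibling go idle */
            return idle_sched_class.pick_task(rq);
    }
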
Jan Stancek 3b12a1f1fc Merge: Scheduler updates for 9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2392

JIRA: https://issues.redhat.com/browse/RHEL-282
Tested: With scheduler stress tests. Perf QE is running performance regression tests.

Update the kernel's core scheduler and related code with fixes and minor changes from
the upstream kernel. This will sync up to roughly linux v6.3-rc6.  Added a couple of
cpumask things which fit better here.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:47 +02:00
Jan Stancek 7d7534d569 Merge: sched/rt: Fix bad task migration for rt tasks
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2372

sched/rt: Fix bad task migration for rt tasks

Bugzilla: https://bugzilla.redhat.com/2182900
Upstream-status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

commit feffe5bb274dd3442080ef0e4053746091878799
Author: Schspa Shi <schspa@gmail.com>
Date:   Mon Aug 29 01:03:02 2022 +0800

    sched/rt: Fix bad task migration for rt tasks

    Commit 95158a89dd ("sched,rt: Use the full cpumask for balancing")
    allows find_lock_lowest_rq() to pick a task with migration disabled.
    The purpose of the commit is to push the current running task on the
    CPU that has the migrate_disable() task away.

    However, there is a race which allows a migrate_disable() task to be
    migrated. Consider:

      CPU0                                    CPU1
      push_rt_task
	check is_migration_disabled(next_task)

					      task not running and
					      migration_disabled == 0

	find_lock_lowest_rq(next_task, rq);
	  _double_lock_balance(this_rq, busiest);
	    raw_spin_rq_unlock(this_rq);
	    double_rq_lock(this_rq, busiest);
	      <<wait for busiest rq>>
						  <wakeup>
					      task become running
					      migrate_disable();
						<context out>
	deactivate_task(rq, next_task, 0);
	set_task_cpu(next_task, lowest_rq->cpu);
	  WARN_ON_ONCE(is_migration_disabled(p));

    Fixes: 95158a89dd ("sched,rt: Use the full cpumask for balancing")
    Signed-off-by: Schspa Shi <schspa@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Tested-by: Dwaine Gonyier <dgonyier@redhat.com>

Signed-off-by: Valentin Schneider <vschneid@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-15 09:35:56 +02:00
Valentin Schneider 625cb69602 sched/rt: Fix bad task migration for rt tasks
Bugzilla: https://bugzilla.redhat.com/2182900
Upstream-status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

commit feffe5bb274dd3442080ef0e4053746091878799
Author: Schspa Shi <schspa@gmail.com>
Date:   Mon Aug 29 01:03:02 2022 +0800

    sched/rt: Fix bad task migration for rt tasks

    Commit 95158a89dd ("sched,rt: Use the full cpumask for balancing")
    allows find_lock_lowest_rq() to pick a task with migration disabled.
    The purpose of the commit is to push the current running task on the
    CPU that has the migrate_disable() task away.

    However, there is a race which allows a migrate_disable() task to be
    migrated. Consider:

      CPU0                                    CPU1
      push_rt_task
	check is_migration_disabled(next_task)

					      task not running and
					      migration_disabled == 0

	find_lock_lowest_rq(next_task, rq);
	  _double_lock_balance(this_rq, busiest);
	    raw_spin_rq_unlock(this_rq);
	    double_rq_lock(this_rq, busiest);
	      <<wait for busiest rq>>
						  <wakeup>
					      task become running
					      migrate_disable();
						<context out>
	deactivate_task(rq, next_task, 0);
	set_task_cpu(next_task, lowest_rq->cpu);
	  WARN_ON_ONCE(is_migration_disabled(p));

    Fixes: 95158a89dd ("sched,rt: Use the full cpumask for balancing")
    Signed-off-by: Schspa Shi <schspa@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Tested-by: Dwaine Gonyier <dgonyier@redhat.com>

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
2023-04-24 09:53:54 +01:00
Phil Auld 7ab9d04d74 sched: Rename task_running() to task_on_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts:  Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").

commit 0b9d46fc5ef7a457cc635b30b010081228cb81ac
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 6 12:33:04 2022 +0200

    sched: Rename task_running() to task_on_cpu()

    There is some ambiguity about task_running() in that it is unrelated
    to TASK_RUNNING but instead tests ->on_cpu. As such, rename it to
    task_on_cpu().

    Suggested-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld 085256b8a7 sched: Add update_current_exec_runtime helper
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 5531ecffa4b923bc7739e9ea73c552d80af602dc
Author: Shang XiaoJing <shangxiaojing@huawei.com>
Date:   Wed Aug 24 16:28:56 2022 +0800

    sched: Add update_current_exec_runtime helper

    Wrap the repeated code in a helper function, update_current_exec_runtime(),
    to update the exec time of the current task.

    Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220824082856.15674-1-shangxiaojing@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
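
A sketch of the helper (assumed shape): the bookkeeping that the stop, idle, deadline and rt paths used to repeat is folded into one place.

    static inline void update_current_exec_runtime(struct task_struct *curr,
                                                   u64 now, u64 delta_exec)
    {
            curr->se.sum_exec_runtime += delta_exec;
            account_group_exec_runtime(curr, delta_exec);

            curr->se.exec_start = now;
            cgroup_account_cputime(curr, delta_exec);
    }
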
Phil Auld 9b10d97986 sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 09348d75a6ce60eec85c86dd0ab7babc4db3caf6
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Aug 11 08:54:52 2022 +0200

    sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()

    There's no good reason to crash a user's system with a BUG_ON(),
    chances are high that they'll never even see the crash message on
    Xorg, and it won't make it into the syslog either.

    By using a WARN_ON_ONCE() we at least give the user a chance to report
    any bugs triggered here - instead of getting silent hangs.

    None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
    cases where a NULL check is done via a BUG_ON() and we let a NULL
    pointer through after a WARN_ON_ONCE().

    There's one exception: WARN_ON_ONCE() arguments with side-effects,
    such as locking - in this case we use the return value of the
    WARN_ON_ONCE(), such as in:

     -       BUG_ON(!lock_task_sighand(p, &flags));
     +       if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
     +               return;

    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld 11d3f0cf26 sched: Introduce struct balance_callback to avoid CFI mismatches
JIRA: https://issues.redhat.com/browse/RHEL-310

commit 8e5bad7dccec2014f24497b57d8a8ee0b752c290
Author: Kees Cook <keescook@chromium.org>
Date:   Fri Oct 7 17:07:58 2022 -0700

    sched: Introduce struct balance_callback to avoid CFI mismatches

    Introduce distinct struct balance_callback instead of performing function
    pointer casting which will trip CFI. Avoids warnings as found by Clang's
    future -Wcast-function-type-strict option:

    In file included from kernel/sched/core.c:84:
    kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict]
            head->func = (void (*)(struct callback_head *))func;
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    No binary differences result from this change.

    This patch is a cleanup based on Brad Spengler/PaX Team's modifications
    to sched code in their last public patch of grsecurity/PaX based on my
    understanding of the code. Changes or omissions from the original code
    are mine and don't reflect the original grsecurity/PaX code.

    Reported-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Nathan Chancellor <nathan@kernel.org>
    Link: https://github.com/ClangBuiltLinux/linux/issues/1724
    Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-10 11:35:02 -04:00
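
The new type is tiny; a sketch (assumed shape): the function pointer now has the prototype the call sites actually use, so no cast through struct callback_head is needed and CFI sees consistent types.

    struct balance_callback {
            struct balance_callback *next;
            void (*func)(struct rq *rq);    /* was cast to void (*)(struct callback_head *) */
    };
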
Phil Auld 194fcdc10c sched/rt: pick_next_rt_entity(): check list_entry
JIRA: https://issues.redhat.com/browse/RHEL-303

commit 7c4a5b89a0b5a57a64b601775b296abf77a9fe97
Author: Pietro Borrello <borrello@diag.uniroma1.it>
Date:   Mon Feb 6 22:33:54 2023 +0000

    sched/rt: pick_next_rt_entity(): check list_entry

    Commit 326587b840 ("sched: fix goto retry in pick_next_task_rt()")
    removed any path which could make pick_next_rt_entity() return NULL.
    However, BUG_ON(!rt_se) in _pick_next_task_rt() (the only caller of
    pick_next_rt_entity()) still checks the error condition, which can
    never happen, since list_entry() never returns NULL.
    Remove the BUG_ON check, and instead emit a warning in the only
    possible error condition here: the queue being empty which should
    never happen.

    Fixes: 326587b840 ("sched: fix goto retry in pick_next_task_rt()")
    Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20230128-list-entry-null-check-sched-v3-1-b1a71bd1ac6b@diag.uniroma1.it

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-03-20 09:05:29 -04:00
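
A sketch of the fixed picker (assumed shape): the caller's impossible BUG_ON(!rt_se) goes away, and the only real error condition, an unexpectedly empty queue, warns and returns NULL.

    static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
    {
            struct rt_prio_array *array = &rt_rq->active;
            struct sched_rt_entity *next;
            struct list_head *queue;
            int idx;

            idx = sched_find_first_bit(array->bitmap);
            BUG_ON(idx >= MAX_RT_PRIO);

            queue = array->queue + idx;
            if (SCHED_WARN_ON(list_empty(queue)))
                    return NULL;
            next = list_entry(queue->next, struct sched_rt_entity, run_list);

            return next;
    }
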
Phil Auld 68b2fea8cd sched/core: Introduce sched_asym_cpucap_active()
JIRA: https://issues.redhat.com/browse/RHEL-303

commit 740cf8a760b73e8375bfb4bedcbe9746183350f9
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Fri Jul 29 13:13:03 2022 +0200

    sched/core: Introduce sched_asym_cpucap_active()

    Create an inline helper for conditional code to be only executed on
    asymmetric CPU capacity systems. This makes these (currently ~10 and
    future) conditions a lot more readable.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-03-20 09:02:24 -04:00
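
The helper itself is a one-liner around the existing static key; a sketch:

    static inline bool sched_asym_cpucap_active(void)
    {
            return static_branch_unlikely(&sched_asym_cpucapacity);
    }
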
Phil Auld 216aaa830b sched/core: Avoid obvious double update_rq_clock warning
Bugzilla: https://bugzilla.redhat.com/2115520

commit 2679a83731d51a744657f718fc02c3b077e47562
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Sat Apr 30 16:58:42 2022 +0800

    sched/core: Avoid obvious double update_rq_clock warning

    When we use raw_spin_rq_lock() to acquire the rq lock and have to
    update the rq clock while holding the lock, the kernel may issue
    a WARN_DOUBLE_CLOCK warning.

    Since we directly use raw_spin_rq_lock() to acquire rq lock instead of
    rq_lock(), there is no corresponding change to rq->clock_update_flags.
    In particular, once we have obtained the rq lock of another CPU, that
    rq's clock_update_flags may already be RQCF_UPDATED, and then calling
    update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning.

    So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid
    the WARN_DOUBLE_CLOCK warning.

    For the sched_rt_period_timer() and migrate_task_rq_dl() cases
    we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with
    rq_lock()/rq_unlock().

    For the {pull,push}_{rt,dl}_task() cases, we add the
    double_rq_clock_clear_update() function to clear RQCF_UPDATED of
    rq->clock_update_flags, and call double_rq_clock_clear_update()
    before double_lock_balance()/double_rq_lock() returns to avoid the
    WARN_DOUBLE_CLOCK warning.

    Some call trace reports:
    Call Trace 1:
     <IRQ>
     sched_rt_period_timer+0x10f/0x3a0
     ? enqueue_top_rt_rq+0x110/0x110
     __hrtimer_run_queues+0x1a9/0x490
     hrtimer_interrupt+0x10b/0x240
     __sysvec_apic_timer_interrupt+0x8a/0x250
     sysvec_apic_timer_interrupt+0x9a/0xd0
     </IRQ>
     <TASK>
     asm_sysvec_apic_timer_interrupt+0x12/0x20

    Call Trace 2:
     <TASK>
     activate_task+0x8b/0x110
     push_rt_task.part.108+0x241/0x2c0
     push_rt_tasks+0x15/0x30
     finish_task_switch+0xaa/0x2e0
     ? __switch_to+0x134/0x420
     __schedule+0x343/0x8e0
     ? hrtimer_start_range_ns+0x101/0x340
     schedule+0x4e/0xb0
     do_nanosleep+0x8e/0x160
     hrtimer_nanosleep+0x89/0x120
     ? hrtimer_init_sleeper+0x90/0x90
     __x64_sys_nanosleep+0x96/0xd0
     do_syscall_64+0x34/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Call Trace 3:
     <TASK>
     deactivate_task+0x93/0xe0
     pull_rt_task+0x33e/0x400
     balance_rt+0x7e/0x90
     __schedule+0x62f/0x8e0
     do_task_dead+0x3f/0x50
     do_exit+0x7b8/0xbb0
     do_group_exit+0x2d/0x90
     get_signal+0x9df/0x9e0
     ? preempt_count_add+0x56/0xa0
     ? __remove_hrtimer+0x35/0x70
     arch_do_signal_or_restart+0x36/0x720
     ? nanosleep_copyout+0x39/0x50
     ? do_nanosleep+0x131/0x160
     ? audit_filter_inodes+0xf5/0x120
     exit_to_user_mode_prepare+0x10f/0x1e0
     syscall_exit_to_user_mode+0x17/0x30
     do_syscall_64+0x40/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Call Trace 4:
     update_rq_clock+0x128/0x1a0
     migrate_task_rq_dl+0xec/0x310
     set_task_cpu+0x84/0x1e4
     try_to_wake_up+0x1d8/0x5c0
     wake_up_process+0x1c/0x30
     hrtimer_wakeup+0x24/0x3c
     __hrtimer_run_queues+0x114/0x270
     hrtimer_interrupt+0xe8/0x244
     arch_timer_handler_phys+0x30/0x50
     handle_percpu_devid_irq+0x88/0x140
     generic_handle_domain_irq+0x40/0x60
     gic_handle_irq+0x48/0xe0
     call_on_irq_stack+0x2c/0x60
     do_interrupt_handler+0x80/0x84

    Steps to reproduce:
    1. Enable CONFIG_SCHED_DEBUG when compiling the kernel
    2. echo 1 > /sys/kernel/debug/clear_warn_once
       echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features
       echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features
    3. Run some rt/dl tasks that periodically work and sleep, e.g.
    Create 2*n rt or dl (90% running) tasks via rt-app (on a system
    with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running
    on PREEMPT_RT kernel.

    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:38 -04:00
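
A sketch of the added helper (assumed shape): clear RQCF_UPDATED on both runqueues right after the double lock is taken, so the subsequent update_rq_clock() does not trip WARN_DOUBLE_CLOCK.

    static inline void double_rq_clock_clear_update(struct rq *rq1, struct rq *rq2)
    {
    #ifdef CONFIG_SCHED_DEBUG
            rq1->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
            if (rq2 != rq1)
                    rq2->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
    #endif
    }
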
Phil Auld b16117e4db sched/rt: fix build error when CONFIG_SYSCTL is disable
Bugzilla: https://bugzilla.redhat.com/2115520

commit 28f152cd0926596e69d412467b11b6fe6fe4e864
Author: Baisong Zhong <zhongbaisong@huawei.com>
Date:   Fri Mar 18 10:54:17 2022 +0800

    sched/rt: fix build error when CONFIG_SYSCTL is disable

    Avoid random build errors in configurations which do not select
    CONFIG_SYSCTL by depending on it in Kconfig.

    This fixes the following warning:

    In file included from kernel/sched/build_policy.c:43:
    At top level:
    kernel/sched/rt.c:3017:12: error: ‘sched_rr_handler’ defined but not used [-Werror=unused-function]
     3017 | static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
          |            ^~~~~~~~~~~~~~~~
    kernel/sched/rt.c:2978:12: error: ‘sched_rt_handler’ defined but not used [-Werror=unused-function]
     2978 | static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
          |            ^~~~~~~~~~~~~~~~
    cc1: all warnings being treated as errors
    make[2]: *** [scripts/Makefile.build:310: kernel/sched/build_policy.o] Error 1
    make[1]: *** [scripts/Makefile.build:638: kernel/sched] Error 2
    make[1]: *** Waiting for unfinished jobs....

    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
    [mcgrof: small build fix, we need sched_rt_can_attach() even
     when CONFIG_SYSCTL is disabled]
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld 32b391b9c5 sched: Move rr_timeslice sysctls to rt.c
Bugzilla: https://bugzilla.redhat.com/2115520

commit dafd7a9dad22fadcb290b24dff54e2eae3b89776
Author: Zhen Ni <nizhen@uniontech.com>
Date:   Tue Feb 15 19:46:01 2022 +0800

    sched: Move rr_timeslice sysctls to rt.c

    move rr_timeslice sysctls to rt.c and use the new
    register_sysctl_init() to register the sysctl interface.

    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld a3912174ad sched: Move rt_period/runtime sysctls to rt.c
Bugzilla: https://bugzilla.redhat.com/2115520

commit d9ab0e63fa7f8405fbb19e28c5191e0880a7f2db
Author: Zhen Ni <nizhen@uniontech.com>
Date:   Tue Feb 15 19:45:59 2022 +0800

    sched: Move rt_period/runtime sysctls to rt.c

    move rt_period/runtime sysctls to rt.c and use the new
    register_sysctl_init() to register the sysctl interface.

    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld 72aa38a322 nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
Bugzilla: https://bugzilla.redhat.com/2107236

commit 5c66d1b9b30f737fcef85a0b75bfe0590e16b62a
Author: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Date:   Tue Jun 28 11:22:59 2022 +0200

    nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()

    dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
    called sched_update_tick_dependency() preventing it from re-enabling the
    tick on systems that no longer have pending SCHED_RT tasks but have
    multiple runnable SCHED_OTHER tasks:

      dequeue_task_rt()
        dequeue_rt_entity()
          dequeue_rt_stack()
            dequeue_top_rt_rq()
              sub_nr_running()      // decrements rq->nr_running
                sched_update_tick_dependency()
                  sched_can_stop_tick()     // checks rq->rt.rt_nr_running,
                  ...
            __dequeue_rt_entity()
              dec_rt_tasks()        // decrements rq->rt.rt_nr_running
              ...

    Every other scheduler class performs the operation in the opposite
    order, and sched_update_tick_dependency() expects the values to be
    updated as such. So avoid the misbehaviour by inverting the order in
    which the above operations are performed in the RT scheduler.

    Fixes: 76d92ac305 ("sched: Migrate sched to use new tick dependency mask model")
    Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-08-26 10:40:33 -04:00
Patrick Talbert d92575ea9d Merge: sched/deadline: code cleanup
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/729

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065219
Upstream Status: Linux
Tested: by me with scheduler stress tests using deadline class, admission
control failures and general stress tests.

A series of fixes and cleanup for the deadline scheduler
class.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Fernando Pacheco <fpacheco@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-12 09:28:24 +02:00
Phil Auld 38a1a140ed sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
Bugzilla: http://bugzilla.redhat.com/2065219

commit 821aecd09e5ad2f8d4c3d8195333d272b392f7d3
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Mar 2 19:34:33 2022 +0100

    sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()

    The `struct rq *rq` parameter isn't used. Remove it.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20220302183433.333029-7-dietmar.eggemann@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-13 12:56:37 -04:00
Phil Auld 03eadd7365 sched/deadline,rt: Remove unused functions for !CONFIG_SMP
Bugzilla: http://bugzilla.redhat.com/2065219

commit 71d29747b0e26f36a50e6a65dc0191ca742b9222
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Mar 2 19:34:32 2022 +0100

    sched/deadline,rt: Remove unused functions for !CONFIG_SMP

    The need_pull_[rt|dl]_task() and pull_[rt|dl]_task() functions are not
    used on a !CONFIG_SMP system. Remove them.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20220302183433.333029-6-dietmar.eggemann@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-13 12:56:37 -04:00
Phil Auld cdb6eb9a38 sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there
Bugzilla: http://bugzilla.redhat.com/2069275

commit f96eca432015ddc1b621632488ebc345bca06791
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Feb 22 13:46:03 2022 +0100

    sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there

    Similarly to kernel/sched/build_utility.c, collect all 'scheduling policy' related
    source code files into kernel/sched/build_policy.c:

        kernel/sched/idle.c

        kernel/sched/rt.c

        kernel/sched/cpudeadline.c
        kernel/sched/pelt.c

        kernel/sched/cputime.c
        kernel/sched/deadline.c

    With the exception of fair.c, which we continue to build as a separate file
    for build efficiency and parallelism reasons.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 17:38:21 -04:00
Phil Auld 320a94d95e sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
Bugzilla: http://bugzilla.redhat.com/2062831

commit 49bef33e4b87b743495627a529029156c6e09530
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Thu Jan 27 15:40:59 2022 +0000

    sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race

    John reported that push_rt_task() can end up invoking
    find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS
    one), which causes mayhem down convert_prio().

    This can happen when current gets demoted to e.g. CFS when releasing an
    rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before
    getting the chance to reschedule. Exactly who triggers this work isn't
    entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task()
    if there are no RT tasks on the local RQ, which means the local CPU can't
    be in the rto_mask.

    My current suspected sequence is something along the lines of the below,
    with the demoted task being current.

      mark_wakeup_next_waiter()
        rt_mutex_adjust_prio()
          rt_mutex_setprio() // deboost originally-CFS task
            check_class_changed()
              switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running
              switched_to_fair() // Sets need_resched
          __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above
          raw_spin_rq_unlock(rq)

           // need_resched is set, so task_woken_rt() can't
           // invoke push_rt_tasks(). Best I can come up with is
           // local CPU has rt_nr_migratory >= 2 after the demotion, so stays
           // in the rto_mask, and then:

           <some other CPU running rto_push_irq_work_func() queues rto_push_work on this CPU>
             push_rt_task()
               // breakage follows here as rq->curr is CFS

    Move an existing check to check rq->curr vs the next pushable task's
    priority before getting anywhere near find_lowest_rq(). While at it, add an
    explicit sched_class of rq->curr check prior to invoking
    find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless
    of next_task's migratability.

    Fixes: a7c81556ec ("sched: Fix migrate_disable() vs rt/dl balancing")
    Reported-by: John Keeping <john@metanate.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Tested-by: John Keeping <john@metanate.com>
    Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
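
A rough sketch of the reordered checks in push_rt_task() (assumed shape, not the literal diff): the priority comparison against rq->curr happens first, and find_lowest_rq(rq->curr) is only ever called for an RT curr.

    /* sketch: early in push_rt_task(), before any find_lowest_rq() call */
    if (unlikely(next_task->prio < rq->curr->prio)) {
            resched_curr(rq);
            return 0;
    }

    if (is_migration_disabled(next_task)) {
            /* curr may have been demoted (e.g. rt_mutex deboost) to CFS */
            if (rq->curr->sched_class != &rt_sched_class)
                    return 0;

            cpu = find_lowest_rq(rq->curr);
            /* ... then queue a stopper to push rq->curr to that CPU ... */
    }
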
Phil Auld cfa6672bb4 sched/rt: Try to restart rt period timer when rt runtime exceeded
Bugzilla: http://bugzilla.redhat.com/2062831

commit 9b58e976b3b391c0cf02e038d53dd0478ed3013c
Author: Li Hua <hucool.lihua@huawei.com>
Date:   Fri Dec 3 03:36:18 2021 +0000

    sched/rt: Try to restart rt period timer when rt runtime exceeded

    When rt_runtime is modified from -1 to a valid control value, it may
    cause the task to be throttled all the time. Operations like the following
    will trigger the bug. E.g:

      1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us
      2. Run a FIFO task named A that executes while(1)
      3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

    When rt_runtime is -1, the rt period timer will not be activated when task
    A is enqueued. The task will then be throttled after setting rt_runtime to
    950,000, and it will stay throttled because the rt period timer is
    never activated.

    Fixes: d0b27fa778 ("sched: rt-group: synchonised bandwidth period")
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Li Hua <hucool.lihua@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:36 -04:00
Phil Auld 70d81a5df7 sched/fair: Prevent dead task groups from regaining cfs_rq's
Bugzilla: http://bugzilla.redhat.com/2020279

commit b027789e5e50494c2325cc70c8642e7fd6059479
Author: Mathias Krause <minipli@grsecurity.net>
Date:   Wed Nov 3 20:06:13 2021 +0100

    sched/fair: Prevent dead task groups from regaining cfs_rq's

    Kevin is reporting crashes which point to a use-after-free of a cfs_rq
    in update_blocked_averages(). Initial debugging revealed that we have
    live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
    free_fair_sched_group(). However, it was unclear how that can happen.

    His kernel config happened to lead to a layout of struct sched_entity
    that put the 'my_q' member directly into the middle of the object
    which makes it incidentally overlap with SLUB's freelist pointer.
    That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
    mangling, leads to a reliable access violation in form of a #GP which
    made the UAF fail fast.

    Michal seems to have run into the same issue[1]. He already correctly
    diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
    cfs_rq's to list on unthrottle") is causing the preconditions for the
    UAF to happen by re-adding cfs_rq's also to task groups that have no
    more running tasks, i.e. also to dead ones. His analysis, however,
    misses the real root cause and it cannot be seen from the crash
    backtrace only, as the real offender is tg_unthrottle_up() getting
    called via sched_cfs_period_timer() via the timer interrupt at an
    inconvenient time.

    When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
    task group, it doesn't protect itself from getting interrupted. If the
    timer interrupt triggers while we iterate over all CPUs or after
    unregister_fair_sched_group() has finished but prior to unlinking the
    task group, sched_cfs_period_timer() will execute and walk the list of
    task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
    dying task group. These will later -- in free_fair_sched_group() -- be
    kfree()'ed while still being linked, leading to the fireworks Kevin
    and Michal are seeing.

    To fix this race, ensure the dying task group gets unlinked first.
    However, simply switching the order of unregistering and unlinking the
    task group isn't sufficient, as concurrent RCU walkers might still see
    it, as can be seen below:

        CPU1:                                      CPU2:
          :                                        timer IRQ:
          :                                          do_sched_cfs_period_timer():
          :                                            :
          :                                            distribute_cfs_runtime():
          :                                              rcu_read_lock();
          :                                              :
          :                                              unthrottle_cfs_rq():
        sched_offline_group():                             :
          :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
          list_del_rcu(&tg->list);                           :
     (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
          :                                                    :
     (2)  list_del_rcu(&tg->siblings);                         :
          :                                                    tg_unthrottle_up():
          unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
            :                                                    :
            list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
            :                                                    :
            :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
     (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
          :                                                      :
          :                                                    :
          :                                                  :
          :                                                :
          :                                              :
     (4)  :                                              rcu_read_unlock();

    CPU 2 walks the task group list in parallel to sched_offline_group(),
    specifically, it'll read the soon to be unlinked task group entry at
    (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
    still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
    all cfs_rq's via list_del_leaf_cfs_rq() in
    unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
    these at (3), which is the cause of the UAF later on.

    To prevent this additional race from happening, we need to wait until
    walk_tg_tree_from() has finished traversing the task groups, i.e.
    after the RCU read critical section ends in (4). Afterwards we're safe
    to call unregister_fair_sched_group(), as each new walk won't see the
    dying task group any more.

    On top of that, we need to wait yet another RCU grace period after
    unregister_fair_sched_group() to ensure print_cfs_stats(), which might
    run concurrently, always sees valid objects, i.e. not already free'd
    ones.

    This patch survives Michal's reproducer[2] for 8h+ now, which used to
    trigger within minutes before.

      [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
      [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

    Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
    [peterz: shuffle code around a bit]
    Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
    Signed-off-by: Mathias Krause <minipli@grsecurity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:51 -05:00
Phil Auld 07198c5567 sched/rt: Support schedstats for RT sched class
Bugzilla: http://bugzilla.redhat.com/2020279

commit 57a5c2dafca8e3ce4f70e975a9c7727b66b5071f
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Sep 5 14:35:45 2021 +0000

    sched/rt: Support schedstats for RT sched class

    We want to measure the latency of RT tasks in our production
    environment with the schedstats facility, but currently schedstats is
    only supported for the fair sched class. This patch enables it for the
    RT sched class as well.

    After making struct sched_statistics and its helpers independent of
    the fair sched class, we can easily use the schedstats facility for
    the RT sched class.

    The schedstat usage in the RT sched class is similar to that in the
    fair sched class, for example,
                    fair                        RT
    enqueue         update_stats_enqueue_fair   update_stats_enqueue_rt
    dequeue         update_stats_dequeue_fair   update_stats_dequeue_rt
    put_prev_task   update_stats_wait_start     update_stats_wait_start_rt
    set_next_task   update_stats_wait_end       update_stats_wait_end_rt
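
    A minimal sketch of what one of these RT helpers can look like, assuming
    the usual schedstat_enabled() gate and a sleeper-stats helper from the
    same series (simplified, not the exact patch):

        static inline void
        update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
                                int flags)
        {
                if (!schedstat_enabled())
                        return;

                /* sleep/block statistics only matter when waking up */
                if (flags & ENQUEUE_WAKEUP)
                        update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
        }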

    The user can get the schedstats information in the same way as for the
    fair sched class. For example,
           fair                            RT
           /proc/[pid]/sched               /proc/[pid]/sched

    Schedstats is not supported for RT group scheduling.

    The output of an RT task's schedstats is as follows:
    $ cat /proc/10349/sched
    ...
    sum_sleep_runtime                            :           972.434535
    sum_block_runtime                            :           960.433522
    wait_start                                   :        188510.871584
    sleep_start                                  :             0.000000
    block_start                                  :             0.000000
    sleep_max                                    :            12.001013
    block_max                                    :           952.660622
    exec_max                                     :             0.049629
    slice_max                                    :             0.000000
    wait_max                                     :             0.018538
    wait_sum                                     :             0.424340
    wait_count                                   :                   49
    iowait_sum                                   :           956.495640
    iowait_count                                 :                   24
    nr_migrations_cold                           :                    0
    nr_failed_migrations_affine                  :                    0
    nr_failed_migrations_running                 :                    0
    nr_failed_migrations_hot                     :                    0
    nr_forced_migrations                         :                    0
    nr_wakeups                                   :                   49
    nr_wakeups_sync                              :                    0
    nr_wakeups_migrate                           :                    0
    nr_wakeups_local                             :                   49
    nr_wakeups_remote                            :                    0
    nr_wakeups_affine                            :                    0
    nr_wakeups_affine_attempts                   :                    0
    nr_wakeups_passive                           :                    0
    nr_wakeups_idle                              :                    0
    ...

    The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
    be used to trace RT tasks as well. The output of these tracepoints for
    an RT task is as follows:

    - runtime
              stress-10352   [004] d.h.  1035.382286: sched_stat_runtime: comm=stress pid=10352 runtime=995769 [ns] vruntime=0 [ns]
              [vruntime=0 means it is an RT task]

    - wait
              <idle>-0       [004] dN..  1227.688544: sched_stat_wait: comm=stress pid=10352 delay=46849882 [ns]

    - blocked
         kworker/4:1-465     [004] dN..  1585.676371: sched_stat_blocked: comm=stress pid=17194 delay=189963 [ns]

    - iowait
         kworker/4:1-465     [004] dN..  1585.675330: sched_stat_iowait: comm=stress pid=17189 delay=182848 [ns]

    - sleep
               sleep-18194   [023] dN..  1780.891840: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001160770 [ns]
               sleep-18196   [023] dN..  1781.893208: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001161970 [ns]
               sleep-18197   [023] dN..  1782.894544: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001128840 [ns]
               [ In sleep.sh, it sleeps 1 sec each time. ]

    [lkp@intel.com: reported build failure in earlier version]

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20210905143547.4668-7-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Phil Auld 124ef2d25a sched/rt: Support sched_stat_runtime tracepoint for RT sched class
Bugzilla: http://bugzilla.redhat.com/2020279

commit ed7b564cfdd0668efbd739d0b4e2d67797293f32
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Sep 5 14:35:44 2021 +0000

    sched/rt: Support sched_stat_runtime tracepoint for RT sched class

    The runtime of an RT task is already accounted, so we only need to
    add a tracepoint.

    One difference between fair tasks and RT tasks is that an RT task has
    no vruntime. To reuse the sched_stat_runtime tracepoint, '0' is passed
    as the vruntime for an RT task.
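
    A minimal sketch of where the tracepoint call lands, assuming it is
    placed in update_curr_rt() where the elapsed runtime is already computed
    (simplified):

        static void update_curr_rt(struct rq *rq)
        {
                struct task_struct *curr = rq->curr;
                u64 delta_exec = 0;

                /* ... existing RT runtime accounting fills delta_exec ... */

                /* reuse the fair tracepoint; RT has no vruntime, so pass 0 */
                trace_sched_stat_runtime(curr, delta_exec, 0);
        }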

    The output of this tracepoint for an RT task is as follows:
              stress-9748    [039] d.h.   113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
              stress-9748    [039] d.h.   113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
              stress-9748    [039] d.h.   113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20210905143547.4668-6-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Phil Auld fbc84644bc sched: Make struct sched_statistics independent of fair sched class
Bugzilla: http://bugzilla.redhat.com/2020279

commit ceeadb83aea28372e54857bf88ab7e17af48ab7b
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Sep 5 14:35:41 2021 +0000

    sched: Make struct sched_statistics independent of fair sched class

    If we want to use the schedstats facility to trace other sched classes, we
    should make it independent of the fair sched class. The struct
    sched_statistics holds the scheduler statistics of a task_struct or a
    task_group, so we can move it into struct task_struct and struct
    task_group to achieve the goal.

    After the patch, schedstats are organized as follows:

        struct task_struct {
           ...
           struct sched_entity se;
           struct sched_rt_entity rt;
           struct sched_dl_entity dl;
           ...
           struct sched_statistics stats;
           ...
       };

    Regarding task groups, schedstats is only supported for fair group
    scheduling, and a new struct sched_entity_stats is introduced, as
    suggested by Peter:

        struct sched_entity_stats {
            struct sched_entity     se;
            struct sched_statistics stats;
        } __no_randomize_layout;

    Then with the se in a task_group, we can easily get the stats.
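
    A minimal sketch of the accessor this layout enables (close to the
    helper added by the series, simplified here):

        static inline struct sched_statistics *
        __schedstats_from_se(struct sched_entity *se)
        {
        #ifdef CONFIG_FAIR_GROUP_SCHED
                /* a group se is embedded in sched_entity_stats, so stats follow it */
                if (!entity_is_task(se))
                        return &container_of(se, struct sched_entity_stats, se)->stats;
        #endif
                return &task_of(se)->stats;
        }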

    The sched_statistics members may be modified frequently when schedstats
    is enabled. To avoid disturbing unrelated data that may share a
    cacheline with them, struct sched_statistics is defined as cacheline
    aligned.

    As this patch changes a core struct of the scheduler, I verified the
    performance impact it may have with 'perf bench sched pipe', as
    suggested by Mel. Below is the result, with all values in usecs/op.
                                      Before               After
          kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
          kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
    [These numbers differ slightly from the earlier version because my old
     test machine was destroyed, so I had to use a different test
     machine.]

    Almost no impact on scheduler performance.

    No functional change.

    [lkp@intel.com: reported build failure in earlier version]

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Vincent Donnefort fecfcbc288 sched/rt: Fix RT utilization tracking during policy change
RT keeps track of the utilization on a per-rq basis with the structure
avg_rt. This utilization is updated during task_tick_rt(),
put_prev_task_rt() and set_next_task_rt(). However, when the currently
running task changes its policy, set_next_task_rt(), which would usually
take care of updating the utilization when the rq starts running RT tasks,
will not see such a change, leaving the avg_rt structure outdated. When
that very same task is dequeued later, put_prev_task_rt() will then
update the utilization based on a wrong last_update_time, leading to a
huge spike in the RT utilization signal.

The signal would eventually recover from this issue after a few ms, as
avg_rt is also updated in __update_blocked_others() even if no RT tasks
are run. But as the CPU capacity depends partly on avg_rt, this issue
nonetheless has a significant impact on the scheduler.

Fix this issue by ensuring a load update when a running task changes
its policy to RT.
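
A minimal sketch of the fix, assuming the load update is done from
switched_to_rt() when the task is currently running (simplified):

    static void switched_to_rt(struct rq *rq, struct task_struct *p)
    {
            /*
             * If the task is currently running, refresh avg_rt now so the
             * RT running time is accounted from a correct last_update_time.
             */
            if (task_current(rq, p)) {
                    update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
                    return;
            }

            /* ... existing handling for non-running tasks (push/reschedule) ... */
    }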

Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
2021-06-22 16:41:59 +02:00
Peter Zijlstra 21f56ffe44 sched: Introduce sched_class::pick_task()
Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance), it is not state invariant. This makes it unsuitable
for remote task selection.
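
A minimal sketch of what pick_task() looks like for a class: it only
selects a candidate and commits no state (names follow the RT class;
simplified):

    static struct task_struct *pick_task_rt(struct rq *rq)
    {
            if (!sched_rt_runnable(rq))
                    return NULL;

            /* highest-priority runnable RT task; no set_next_task() side effects */
            return _pick_next_task_rt(rq);
    }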

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Vineeth: folded fixes]
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org
2021-05-12 11:43:28 +02:00
Peter Zijlstra 5cb9eaa3d2 sched: Wrap rq::lock access
In preparation for playing games with rq->lock, abstract it behind an
accessor.
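
A minimal sketch of such an accessor, assuming the field is renamed so
direct users are forced through it (illustrative):

    static inline raw_spinlock_t *rq_lockp(struct rq *rq)
    {
            return &rq->__lock;
    }

    static inline void raw_spin_rq_lock(struct rq *rq)
    {
            raw_spin_lock(rq_lockp(rq));
    }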

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
2021-05-12 11:43:26 +02:00
Ingo Molnar 3b03706fa6 sched: Fix various typos
Fix ~42 single-word typos in scheduler code comments.

We have accumulated a few fun ones over the years. :-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org
2021-03-22 00:11:52 +01:00
Hui Su 65bcf072e2 sched: Use task_current() instead of 'rq->curr == p'
Use the task_current() function where appropriate.

No functional change.
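
For reference, task_current() is just a wrapper around that comparison,
roughly:

    static inline int task_current(struct rq *rq, struct task_struct *p)
    {
            return rq->curr == p;
    }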

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
2021-01-14 11:20:11 +01:00
Valentin Schneider 3aef1551e9 sched: Remove select_task_rq()'s sd_flag parameter
Only select_task_rq_fair() uses that parameter to do an actual domain
search; other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.

WF_TTWU and WF_EXEC have just been added; use these along with WF_FORK to
encode the wakeup types we care about. For select_task_rq_fair(), we can
simply use the shiny new WF_flag : SD_flag mapping.
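
A minimal sketch of how a non-fair class can consume these flags, assuming
it only distinguishes wakeup kinds (illustrative, not the exact diff):

    static int select_task_rq_rt(struct task_struct *p, int cpu, int flags)
    {
            /* anything that is not a real wakeup or fork stays put */
            if (!(flags & (WF_TTWU | WF_FORK)))
                    return cpu;

            /* ... usual lowest-priority-CPU search ... */
            return cpu;
    }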

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102184514.2733-3-valentin.schneider@arm.com
2020-11-10 18:39:06 +01:00
Peter Zijlstra 12fa97c64d Merge branch 'sched/migrate-disable' 2020-11-10 18:39:04 +01:00
Peter Zijlstra a7c81556ec sched: Fix migrate_disable() vs rt/dl balancing
In order to minimize the interference of migrate_disable() on lower
priority tasks, which can be deprived of runtime due to being stuck
below a higher priority task, teach the RT/DL balancers to push away
these higher priority tasks when a lower priority task gets selected
to run on a freshly demoted CPU (pull).

This adds migration interference to the higher priority task, but
restores bandwidth to the system that would otherwise be irrevocably lost.
Without this it would be possible to have all tasks on the system
stuck on a single CPU, each task preempted in a migrate_disable()
section with a single high priority task running.

This way we can still approximate running the M highest priority tasks
on the system.

Migrating the top task away is (of course) still subject to
migrate_disable() too, which means the lower task is subject to an
interference equivalent to the worst-case migrate_disable() section.
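
A minimal sketch of the helper the RT/DL balancers can use to grab the
currently running higher priority task for pushing, roughly following the
get_push_task() idea introduced here:

    static inline struct task_struct *get_push_task(struct rq *rq)
    {
            struct task_struct *p = rq->curr;

            lockdep_assert_held(&rq->lock);

            if (rq->push_busy)
                    return NULL;

            if (p->nr_cpus_allowed == 1)
                    return NULL;

            if (p->migration_disabled)
                    return NULL;

            rq->push_busy = true;
            return get_task_struct(p);
    }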

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
2020-11-10 18:39:01 +01:00
Peter Zijlstra 95158a89dd sched,rt: Use the full cpumask for balancing
We want migrate_disable() tasks to get PULLs in order for them to PUSH
away the higher priority task.
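
Concretely, the pick helpers test the full affinity mask rather than the
effective (possibly migrate_disable()-pinned) one, so such tasks still
look pullable; roughly:

    static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
    {
            /* check cpus_mask (full affinity), not cpus_ptr (may be pinned) */
            if (!task_running(rq, p) &&
                cpumask_test_cpu(cpu, &p->cpus_mask))
                    return 1;

            return 0;
    }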

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.310519774@infradead.org
2020-11-10 18:39:00 +01:00
Peter Zijlstra 14e292f8d4 sched,rt: Use cpumask_any*_distribute()
Replace a bunch of cpumask_any*() instances with
cpumask_any*_distribute(). By injecting this little bit of randomness
into CPU selection, we reduce the chance that two competing balance
operations working off the same lowest_mask pick the same CPU.
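
A minimal sketch of the distributing variant, assuming a per-CPU cursor
that rotates the starting point (close to the lib/cpumask.c helper this
change adds):

    static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);

    int cpumask_any_and_distribute(const struct cpumask *src1p,
                                   const struct cpumask *src2p)
    {
            int next, prev;

            /* start searching after wherever this CPU last left off */
            prev = __this_cpu_read(distribute_cpu_mask_prev);

            next = cpumask_next_and(prev, src1p, src2p);
            if (next >= nr_cpu_ids)
                    next = cpumask_first_and(src1p, src2p);

            if (next < nr_cpu_ids)
                    __this_cpu_write(distribute_cpu_mask_prev, next);

            return next;
    }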

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.190759694@infradead.org
2020-11-10 18:39:00 +01:00