Commit Graph

1596 Commits

Author SHA1 Message Date
Luis Claudio R. Goncalves 0857fd208a sched: Fix stop_one_cpu_nowait() vs hotplug
JIRA: https://issues.redhat.com/browse/RHEL-84526

commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 10 20:57:39 2023 +0200

    sched: Fix stop_one_cpu_nowait() vs hotplug

    Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
    hotplug stress-test -- notably affine_move_task() remains stuck in
    wait_for_completion(), leading to a hung-task detector warning.

    Specifically, it was reported that stop_one_cpu_nowait(.fn =
    migration_cpu_stop) returns false -- this stopper is responsible for
    the matching complete().

    The race scenario is:

            CPU0                                    CPU1

                                            // doing _cpu_down()

      __set_cpus_allowed_ptr()
        task_rq_lock();
                                            takedown_cpu()
                                              stop_machine_cpuslocked(take_cpu_down..)

                                            <PREEMPT: cpu_stopper_thread()
                                              MULTI_STOP_PREPARE
                                              ...
        __set_cpus_allowed_ptr_locked()
          affine_move_task()
            task_rq_unlock();

  <PREEMPT: cpu_stopper_thread()
        ack_state()
                                              MULTI_STOP_RUN
                                                take_cpu_down()
                                                  __cpu_disable();
                                                  stop_machine_park();
                                                    stopper->enabled = false;
                                             />
       />
            stop_one_cpu_nowait(.fn = migration_cpu_stop);
              if (stopper->enabled) // false!!!

    That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
    stopper thread gets a chance to preempt and allows the cpu-down for
    the target CPU to complete.

    OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
    issue a wakeup, it must not be run under the scheduler locks.

    Solve this apparent contradiction by keeping preemption disabled over
    the unlock + queue_stopper combination:

            preempt_disable();
            task_rq_unlock(...);
            if (!stop_pending)
              stop_one_cpu_nowait(...)
            preempt_enable();

    This respects the lock ordering constraints while still avoiding the
    above race. That is, if we find the CPU is online under rq-lock, the
    targeted stop_one_cpu_nowait() must succeed.

    Apply this pattern to all similar stop_one_cpu_nowait() invocations.

    Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
    Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
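
The fix boils down to a few lines; a minimal kernel-style C sketch of the pattern, condensed from the description above (not the exact upstream diff; arg, pending and rf are illustrative locals):

    preempt_disable();                      /* keep the stopper thread from preempting us */
    task_rq_unlock(rq, p, &rf);             /* drop the rq/pi locks before issuing a wakeup */
    if (!stop_pending)
            stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                                &arg, &pending->stop_work);
    preempt_enable();                       /* the stopper (and the CPU down path) may run now */

With preemption disabled across the unlock and the queueing, a CPU found online under the rq lock cannot be parked by take_cpu_down() before the stopper work is queued.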

Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
2025-03-21 18:50:20 -03:00
Phil Auld cb633355af sched: Don't account irq time if sched_clock_irqtime is disabled
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 763a744e24a8cfbcc13f699dcdae13a627b8588e
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Jan 3 10:24:07 2025 +0800

    sched: Don't account irq time if sched_clock_irqtime is disabled

    sched_clock_irqtime may be disabled due to the clock source, in which case
    IRQ time should not be accounted. Let's add a conditional check to avoid
    unnecessary logic.

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com
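
The shape of the change is a simple early-out guard; a hedged sketch (the enclosing accounting function is condensed and hypothetical, only the sched_clock_irqtime flag name comes from the commit text):

    static void account_irq_delta(struct task_struct *p, u64 delta)
    {
            /* The clocksource cannot supply IRQ time: skip the accounting. */
            if (!sched_clock_irqtime)
                    return;

            /* ... existing IRQ/SOFTIRQ time accounting ... */
    }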

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:11 +00:00
Phil Auld 9228c0fc48 sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit c907cd44a108eff7005a2b5689bb91f50637df8b
Author: Waiman Long <longman@redhat.com>
Date:   Wed Oct 30 13:52:53 2024 -0400

    sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

    As all the non-domain and non-managed_irq housekeeping types have been
    unified to HK_TYPE_KERNEL_NOISE, replace all these references in the
    scheduler to use HK_TYPE_KERNEL_NOISE.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:10 +00:00
Phil Auld 699009fa9c sched: Don't try to catch up excess steal time.
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 108ad0999085df2366dd9ef437573955cb3f5586
Author: Suleiman Souhlal <suleiman@google.com>
Date:   Mon Nov 18 13:37:45 2024 +0900

    sched: Don't try to catch up excess steal time.

    When steal time exceeds the measured delta when updating clock_task, we
    currently try to catch up the excess in future updates.
    However, in some situations this results in inaccurate run times for things
    that later use clock_task, as they end up being charged additional steal
    time that did not actually happen.
    This is because there is a window between reading the elapsed time in
    update_rq_clock() and sampling the steal time in update_rq_clock_task().
    If the VCPU gets preempted between those two points, any additional
    steal time is accounted to the outgoing task even though the calculated
    delta did not actually contain any of that "stolen" time.
    When this race happens, we can end up with steal time that exceeds the
    calculated delta, and the previous code would try to catch up that excess
    steal time in future clock updates, which is given to the next,
    incoming task, even though it did not actually have any time stolen.

    This behavior is particularly bad when steal time can be very long,
    which we've seen when trying to extend steal time to contain the duration
    that the host was suspended [0]. When this happens, clock_task stays
    frozen, during which the running task stays running for the whole
    duration, since its run time doesn't increase.
    However the race can happen even under normal operation.

    Ideally we would read the elapsed cpu time and the steal time atomically,
    to prevent this race from happening in the first place, but doing so
    is non-trivial.

    Since the time between those two points isn't otherwise accounted anywhere,
    neither to the outgoing task nor the incoming task (because the "end of
    outgoing task" and "start of incoming task" timestamps are the same),
    I would argue that the right thing to do is to simply drop any excess steal
    time, in order to prevent these issues.

    [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/

    Signed-off-by: Suleiman Souhlal <suleiman@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
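
A hedged sketch of the resulting steal handling in update_rq_clock_task() (paraphrased, not the exact upstream diff): clamp the steal applied to this delta, but advance the saved baseline by the full reading so the excess is dropped instead of carried forward:

    u64 steal, prev_steal;

    steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
    steal -= rq->prev_steal_time_rq;

    /* Never apply more steal than the measured delta ... */
    if (unlikely(steal > delta))
            steal = delta;

    /* ... but consume the full reading, so the excess is not "caught up" later. */
    rq->prev_steal_time_rq = prev_steal;

    delta -= steal;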

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:09 +00:00
Phil Auld bcbc94b5b8 sched: Initialize idle tasks only once
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit b23decf8ac9102fc52c4de5196f4dc0a5f3eb80b
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Oct 28 11:43:42 2024 +0100

    sched: Initialize idle tasks only once

    Idle tasks are initialized via __sched_fork() twice:

         fork_idle()
            copy_process()
              sched_fork()
                 __sched_fork()
            init_idle()
              __sched_fork()

    Instead of cleaning this up, sched_ext hacked around it. Even when an analysis
    and a solution were provided in a discussion, nobody cared to clean this up.

    init_idle() is also invoked from sched_init() to initialize the boot CPU's
    idle task, which requires the __sched_fork() invocation. But this can be
    trivially solved by invoking __sched_fork() before init_idle() in
    sched_init() and removing the __sched_fork() invocation from init_idle().

    Do so and clean up the comments explaining this historical leftover.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:09 +00:00
Phil Auld aa347a1976 sched/uclamp: Fix unnused variable warning
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 23f1178ad706a1aa69ac3dfaa6559f1fb876c14e
Author: Christian Loehle <christian.loehle@arm.com>
Date:   Fri Oct 25 11:53:17 2024 +0100

    sched/uclamp: Fix unnused variable warning

    uclamp_mutex is only used for CONFIG_SYSCTL or
    CONFIG_UCLAMP_TASK_GROUP so declare it __maybe_unused.

    Closes: https://lore.kernel.org/oe-kbuild-all/202410060258.bPl2ZoUo-lkp@intel.com/
    Closes: https://lore.kernel.org/oe-kbuild-all/202410250459.EJe6PJI5-lkp@intel.com/
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Christian Loehle <christian.loehle@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/a1e9c342-01c9-44f0-a789-2c908e57942b@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:08 +00:00
Phil Auld 8851a9b9ae sched: Add move_queued_task_locked helper
JIRA: https://issues.redhat.com/browse/RHEL-78821
Conflicts: Context diffs in sched.h due to not having eevdf code.

commit 2b05a0b4c08ffd6dedfbd27af8708742cde39b95
Author: Connor O'Brien <connoro@google.com>
Date:   Wed Oct 9 16:53:37 2024 -0700

    sched: Add move_queued_task_locked helper

    Switch logic that deactivates, sets the task cpu,
    and reactivates a task on a different rq to use a
    helper that will be later extended to push entire
    blocked task chains.

    This patch was broken out from a larger chain migration
    patch originally by Connor O'Brien.

    [jstultz: split out from larger chain migration patch]
    Signed-off-by: Connor O'Brien <connoro@google.com>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Metin Kaya <metin.kaya@arm.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Tested-by: Metin Kaya <metin.kaya@arm.com>
    Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com
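
The helper is roughly the following (a sketch based on the description above, assuming the usual deactivate/set_task_cpu/activate triple; not necessarily the exact upstream code):

    static inline void
    move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq,
                            struct task_struct *task)
    {
            lockdep_assert_rq_held(src_rq);
            lockdep_assert_rq_held(dst_rq);

            deactivate_task(src_rq, task, 0);       /* dequeue from the source rq */
            set_task_cpu(task, cpu_of(dst_rq));     /* migrate the task's CPU */
            activate_task(dst_rq, task, 0);         /* enqueue on the destination rq */
    }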

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:08 +00:00
Phil Auld 9d954525ab sched/core: Add clearing of ->dl_server in put_prev_task_balance()
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit c245910049d04fbfa85bb2f5acd591c24e9907c7
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Mon May 27 14:06:48 2024 +0200

    sched/core: Add clearing of ->dl_server in put_prev_task_balance()

    Paths using put_prev_task_balance() need to do a pick shortly
    after. Make sure they also clear the ->dl_server on prev as a
    part of that.

    Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
    Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Juri Lelli <juri.lelli@redhat.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-12 15:42:08 +00:00
Phil Auld b26aeaeafa sched/core: Clear prev->dl_server in CFS pick fast path
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit a741b82423f41501e301eb6f9820b45ca202e877
Author: Youssef Esmat <youssefesmat@google.com>
Date:   Mon May 27 14:06:49 2024 +0200

    sched/core: Clear prev->dl_server in CFS pick fast path

    In case the previous pick was a DL server pick, ->dl_server might be
    set. Clear it in the fast path as well.

    Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
    Signed-off-by: Youssef Esmat <youssefesmat@google.com>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Juri Lelli <juri.lelli@redhat.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-12 15:41:45 +00:00
Phil Auld 5b8b5283a9 sched: remove HZ_BW feature hedge
JIRA: https://issues.redhat.com/browse/RHEL-78821
Conflicts: Minor context difference in features.h.

commit a58501fb8320d6232507f722b4c9dcd4e03362ee
Author: Phil Auld <pauld@redhat.com>
Date:   Wed May 15 09:37:05 2024 -0400

    sched: remove HZ_BW feature hedge

    As a hedge against unexpected user issues commit 88c56cfeaec4
    ("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
    included a scheduler feature to disable the new functionality.
    It's been a few releases (v6.6) and no screams, so remove it.

    Signed-off-by: Phil Auld <pauld@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-12 15:24:41 +00:00
Phil Auld 5dc87bb1b4 sched/core: Prevent wakeup of ksoftirqd during idle load balance
JIRA: https://issues.redhat.com/browse/RHEL-70904

commit e932c4ab38f072ce5894b2851fea8bc5754bb8e5
Author: K Prateek Nayak <kprateek.nayak@amd.com>
Date:   Tue Nov 19 05:44:32 2024 +0000

    sched/core: Prevent wakeup of ksoftirqd during idle load balance

    The scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event
    from the IPI handler on the idle CPU. If the SMP function is invoked
    from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
    flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd,
    because soft interrupts are handled before ksoftirqd gets on the CPU.

    Adding a trace_printk() in nohz_csd_func() at the spot of raising
    SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
    and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
    current behavior:

           <idle>-0   [000] dN.1.:  nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
           <idle>-0   [000] dN.4.:  sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
           <idle>-0   [000] .Ns1.:  softirq_entry: vec=7 [action=SCHED]
           <idle>-0   [000] .Ns1.:  softirq_exit: vec=7  [action=SCHED]
           <idle>-0   [000] d..2.:  sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
      ksoftirqd/0-16  [000] d..2.:  sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
           ...

    Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
    is always invoked on the requested CPU in an interrupt handler. It is
    guaranteed that soft interrupts are handled at the end.

    Following are the observations with the changes when enabling the same
    set of events:

           <idle>-0       [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
           <idle>-0       [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
           <idle>-0       [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]

    No unnecessary ksoftirqd wakeups are seen from idle task's context to
    service the softirq.

    Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
    Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
    Reported-by: Julia Lawall <julia.lawall@inria.fr>
    Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
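
A condensed sketch of the relevant part of nohz_csd_func() after the change (illustrative, not the exact upstream diff):

    if (rq->idle_balance) {
            rq->nohz_idle_balance = flags;
            /*
             * The SMP function call runs on the target CPU in (or as if in)
             * hard-IRQ context and pending softirqs are processed on exit,
             * so marking SCHED_SOFTIRQ pending is enough; going through
             * raise_softirq_irqoff() here would needlessly wake ksoftirqd.
             */
            __raise_softirq_irqoff(SCHED_SOFTIRQ);
    }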

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-12-11 19:38:11 +00:00
Phil Auld 04d352ce17 sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
JIRA: https://issues.redhat.com/browse/RHEL-70904

commit ea9cffc0a154124821531991d5afdd7e8b20d7aa
Author: K Prateek Nayak <kprateek.nayak@amd.com>
Date:   Tue Nov 19 05:44:30 2024 +0000

    sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()

    The need_resched() check currently in nohz_csd_func() can be tracked
    to have been added in scheduler_ipi() back in 2011 via commit
    ca38062e57 ("sched: Use resched IPI to kick off the nohz idle balance")

    Since then, it has travelled quite a bit, but it seems that an idle_cpu()
    check is now sufficient to detect the need to bail out of idle load
    balancing. To justify this removal, consider all of the following
    cases where an idle load balancing could race with a task wakeup:

    o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
      on wakelist if wakee cpu is idle") a target perceived to be idle
      (target_rq->nr_running == 0) will return true for
      ttwu_queue_cond(target) which will offload the task wakeup to the idle
      target via an IPI.

      In all such cases target_rq->ttwu_pending will be set to 1 before
      queuing the wake function.

      If an idle load balance races here, following scenarios are possible:

      - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
        IPI is sent to the CPU to wake it out of idle. If the
        nohz_csd_func() queues before sched_ttwu_pending(), the idle load
        balance will bail out since idle_cpu(target) returns 0 since
        target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
        sched_ttwu_pending() it should see rq->nr_running to be non-zero and
        bail out of idle load balancing.

      - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
        the sender will simply set TIF_NEED_RESCHED for the target to put it
        out of idle and flush_smp_call_function_queue() in do_idle() will
        execute the call function. Depending on the ordering of the queuing
        of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
        nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
        target_rq->nr_running to be non-zero if there is a genuine task
        wakeup racing with the idle load balance kick.

    o The waker CPU perceives the target CPU to be busy
      (target_rq->nr_running != 0) but the CPU is in fact going idle and due
      to a series of unfortunate events, the system reaches a case where the
      waker CPU decides to perform the wakeup by itself in ttwu_queue() on
      the target CPU but target is concurrently selected for idle load
      balance (XXX: Can this happen? I'm not sure, but we'll consider the
      mother of all coincidences to estimate the worst case scenario).

      ttwu_do_activate() calls enqueue_task() which would increment
      "rq->nr_running" post which it calls wakeup_preempt() which is
      responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
      setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
      thing to note in this case is that rq->nr_running is already non-zero
      in case of a wakeup before TIF_NEED_RESCHED is set which would
      lead to idle_cpu() check returning false.

    In all cases, it seems that the need_resched() check is unnecessary when
    checking idle_cpu() first, since an impending wakeup racing with the idle
    load balancer will either set "rq->ttwu_pending" or indicate a newly
    woken task via "rq->nr_running".

    Chasing the reason why this check might have existed in the first place,
    I came across Peter's suggestion on the first iteration of Suresh's
    patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:

            sched_ttwu_do_pending(list);

            if (unlikely((rq->idle == current) &&
                rq->nohz_balance_kick &&
                !need_resched()))
                    raise_softirq_irqoff(SCHED_SOFTIRQ);

    Since the condition to raise the SCHED_SOFTIRQ was preceded by
    sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
    the current upstream kernel, the need_resched() check was necessary to
    catch a newly queued task. Peter suggested modifying it to:

            if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
                    raise_softirq_irqoff(SCHED_SOFTIRQ);

    where idle_cpu() seems to have replaced "rq->idle == current" check.

    Even back then, the idle_cpu() check would have been sufficient to catch
    a new task being enqueued. Since commit b2a02fc43a ("smp: Optimize
    send_call_function_single_ipi()") overloads the interpretation of
    TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
    need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
    on Peter's suggestion.

    Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
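
A before/after sketch of the condition in nohz_csd_func() (condensed; illustrative rather than the exact upstream diff):

    /* Before: */
    rq->idle_balance = idle_cpu(cpu);
    if (rq->idle_balance && !need_resched()) {
            rq->nohz_idle_balance = flags;
            raise_softirq_irqoff(SCHED_SOFTIRQ);
    }

    /* After: idle_cpu() alone suffices, since a racing wakeup will already
     * have set rq->ttwu_pending or made rq->nr_running non-zero. */
    rq->idle_balance = idle_cpu(cpu);
    if (rq->idle_balance) {
            rq->nohz_idle_balance = flags;
            raise_softirq_irqoff(SCHED_SOFTIRQ);
    }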

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-12-11 19:37:47 +00:00
Audra Mitchell 8947f5b14c lazy tlb: introduce lazy tlb mm refcount helper functions
JIRA: https://issues.redhat.com/browse/RHEL-55462

This patch is a backport of the following upstream commit:
commit aa464ba9a1e444d5ef95bb63ee3b2ef26fc96ed7
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Feb 3 17:18:34 2023 +1000

    lazy tlb: introduce lazy tlb mm refcount helper functions

    Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
    This makes the lazy tlb mm references more obvious, and allows the
    refcounting scheme to be modified in later changes.  There is no
    functional change with this patch.

    Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:17 -05:00
Rado Vrbovsky d2bd7080ef Merge: Sched: Updates and fixes for 9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5250

JIRA: https://issues.redhat.com/browse/RHEL-56494
  
JIRA: https://issues.redhat.com/browse/RHEL-57142

CVE: CVE-2024-44958

Tested: Ran scheduler tests and general stress testing. Have asked
perf QE for sanity tests.

Omitted-fix: c049acee3c71 ("selftests/ftrace: Fix test to handle both old and new kernels"): Somewhat out of scope for this MR, and there should be no need to run the test against old kernels in RHEL.

Series of scheduler related fixes and updates, up to v6.11. A large
number of these are refactoring (making naming consistent, breaking out
code into new files, etc.) with no functional changes. Otherwise, primarily
bug fixes and cleanups, with no real feature additions.
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-25 16:52:35 +00:00
Rado Vrbovsky d30d477e21 Merge: rcu: Backport upstream RCU commits up to v6.10
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5074

JIRA: https://issues.redhat.com/browse/RHEL-55557

This MR backports upstream RCU commits up to v6.10 with relevant bug
fixes, if applicable.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-25 16:11:27 +00:00
Ming Lei 23a842d0ff sched: Add a new function to compare if two cpus have the same capacity
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit b361c9027b4e4159e7bcca4eb64fd26507c19994
Author: Qais Yousef <qyousef@layalina.io>
Date:   Fri Feb 23 15:57:48 2024 +0000

    sched: Add a new function to compare if two cpus have the same capacity

    The new helper function is needed to help blk-mq check if it needs to
    dispatch the softirq on another CPU to match the performance level the
    IO requester is running at. This is important on HMP systems where not
    all CPUs have the same compute capacity.

    Signed-off-by: Qais Yousef <qyousef@layalina.io>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
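
The new helper is roughly the following (a sketch; the body assumes the obvious arch_scale_cpu_capacity() comparison, and the name cpus_equal_capacity() is how the helper is recalled from upstream):

    bool cpus_equal_capacity(int this_cpu, int that_cpu)
    {
            /* On symmetric systems every CPU has the same capacity. */
            if (!sched_asym_cpucap_active())
                    return true;

            if (this_cpu == that_cpu)
                    return true;

            return arch_scale_cpu_capacity(this_cpu) ==
                   arch_scale_cpu_capacity(that_cpu);
    }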

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:18:38 +08:00
Ming Lei 3a1c398968 block: update cached timestamp post schedule/preemption
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit 06b23f92af87a84d70881b2ecaa72e00f7838264
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Jan 16 09:18:39 2024 -0700

    block: update cached timestamp post schedule/preemption

    Mark the task as having a cached timestamp when assigning it, so we
    can efficiently check whether it needs updating after being scheduled back in.
    This covers both the actual schedule out case, which would've flushed
    the plug, and the preemption case which doesn't touch the plugged
    requests (for many reasons, one of them being that we'd need to have
    preemption disabled around plug state manipulation).

    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:18:33 +08:00
Phil Auld f2e299d329 sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit fe7a11c78d2a9bdb8b50afc278a31ac177000948
Author: Yang Yingliang <yangyingliang@huawei.com>
Date:   Wed Jul 3 11:16:10 2024 +0800

    sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()

    If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back.

    Fixes: 120455c514 ("sched: Fix hotplug vs CPU bandwidth control")
    Cc: stable@kernel.org
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-26 08:11:31 -04:00
Phil Auld bcf83b5d1b sched/smt: Fix unbalance sched_smt_present dec/inc
JIRA: https://issues.redhat.com/browse/RHEL-57142
CVE: CVE-2024-44958

commit e22f910a26cc2a3ac9c66b8e935ef2a7dd881117
Author: Yang Yingliang <yangyingliang@huawei.com>
Date:   Wed Jul 3 11:16:08 2024 +0800

    sched/smt: Fix unbalance sched_smt_present dec/inc

    I got the following warning report while doing a stress test:

    jump label: negative count!
    WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
    Call Trace:
     <TASK>
     __static_key_slow_dec_cpuslocked+0x16/0x70
     sched_cpu_deactivate+0x26e/0x2a0
     cpuhp_invoke_callback+0x3ad/0x10d0
     cpuhp_thread_fun+0x3f5/0x680
     smpboot_thread_fn+0x56d/0x8d0
     kthread+0x309/0x400
     ret_from_fork+0x41/0x70
     ret_from_fork_asm+0x1b/0x30
     </TASK>

    When cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU
    offline fails, but sched_smt_present has already been decremented
    earlier in sched_cpu_deactivate(), which leads to an unbalanced dec/inc.
    Fix it by incrementing sched_smt_present in the error path.

    Fixes: c5511d03ec ("sched/smt: Make sched_smt_present track topology")
    Cc: stable@kernel.org
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
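
A condensed sketch of the error path in sched_cpu_deactivate() after the fix (illustrative; it relies on the sched_smt_present_inc/dec() helpers introduced by the companion commit further down this list):

    ret = cpuset_cpu_inactive(cpu);
    if (ret) {
            /*
             * The offline attempt failed: undo the earlier decrement so the
             * sched_smt_present static key stays balanced, and mark the CPU
             * active again.
             */
            sched_smt_present_inc(cpu);
            set_cpu_active(cpu, true);
            return ret;
    }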

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-26 08:11:14 -04:00
Phil Auld 9d3f3b5053 sched/core: Introduce sched_set_rq_on/offline() helper
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 2f027354122f58ee846468a6f6b48672fff92e9b
Author: Yang Yingliang <yangyingliang@huawei.com>
Date:   Wed Jul 3 11:16:09 2024 +0800

    sched/core: Introduce sched_set_rq_on/offline() helper

    Introduce the sched_set_rq_on/offline() helpers, so they can be called
    simply from both the normal and error paths. No functional change.

    Cc: stable@kernel.org
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-26 08:11:14 -04:00
Phil Auld 0fc32c5847 sched/smt: Introduce sched_smt_present_inc/dec() helper
JIRA: https://issues.redhat.com/browse/RHEL-57142
CVE: CVE-2024-44958

commit 31b164e2e4af84d08d2498083676e7eeaa102493
Author: Yang Yingliang <yangyingliang@huawei.com>
Date:   Wed Jul 3 11:16:07 2024 +0800

    sched/smt: Introduce sched_smt_present_inc/dec() helper

    Introduce the sched_smt_present_inc/dec() helpers, so they can be called
    simply from both the normal and error paths. No functional change.

    Cc: stable@kernel.org
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-26 08:10:44 -04:00
Phil Auld a1d464759c rcu/tasks: Fix stale task snaphot for Tasks Trace
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 399ced9594dfab51b782798efe60a2376cd5b724
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri May 17 17:23:02 2024 +0200

    rcu/tasks: Fix stale task snaphot for Tasks Trace

    When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running
    on all online CPUs, no explicit ordering synchronizes properly with a
    context switch.  This lack of ordering can permit the new task to miss
    pre-grace-period update-side accesses.  The following diagram, courtesy
    of Paul, shows the possible bad scenario:

            CPU 0                                           CPU 1
            -----                                           -----

            // Pre-GP update side access
            WRITE_ONCE(*X, 1);
            smp_mb();
            r0 = rq->curr;
                                                            RCU_INIT_POINTER(rq->curr, TASK_B)
                                                            spin_unlock(rq)
                                                            rcu_read_lock_trace()
                                                            r1 = X;
            /* ignore TASK_B */

    Either r0==TASK_B or r1==1 is needed but neither is guaranteed.

    One possible solution to solve this is to wait for an RCU grace period
    at the beginning of the RCU-tasks-trace grace period before taking the
    current tasks snapshot. However this would introduce large additional
    latencies to RCU-tasks-trace grace periods.

    Another solution is to lock the target runqueue while taking the current
    task snapshot. This ensures that the update side sees the latest context
    switch and subsequent context switches will see the pre-grace-period
    update side accesses.

    This commit therefore adds runqueue locking to cpu_curr_snapshot().

    Fixes: e386b6725798 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
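
A sketch of cpu_curr_snapshot() with the runqueue locking described above (approximate; the exact memory barriers in the upstream version may differ):

    struct task_struct *cpu_curr_snapshot(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            struct task_struct *t;
            struct rq_flags rf;

            rq_lock_irqsave(rq, &rf);       /* order against this rq's context switches */
            smp_mb__after_spinlock();
            t = rcu_dereference(rq->curr);
            rq_unlock_irqrestore(rq, &rf);

            return t;
    }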

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:03 -04:00
Phil Auld fb5d3f735e sched/core: Simplify prefetch_curr_exec_start()
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 85c9a8f4531c6c0862ecda50cac662b0b78d1974
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed Jun 5 13:01:44 2024 +0200

    sched/core: Simplify prefetch_curr_exec_start()

    Remove unnecessary use of the address operator.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:03 -04:00
Phil Auld e8bf69e6e0 sched: Fix spelling in comments
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Dropped hunks in mm_cid code which we don't have. Minor
context diffs due to still having IA64 in tree and previous Kabi
workarounds.

commit 402de7fc880fef055bc984957454b532987e9ad0
Author: Ingo Molnar <mingo@kernel.org>
Date:   Mon May 27 16:54:52 2024 +0200

    sched: Fix spelling in comments

    Do a spell-checking pass.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 10626dfce1 sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Worked around RHEL-only commits 9b35f92491 ("sched/core: Make
sched_setaffinity() always return -EINVAL on empty cpumask"),90f7bb0c1823 ("sched/core:
Don't return -ENODEV from sched_setaffinity()") and 05fddaaaac ("sched/core: Use empty
mask to reset cpumasks in sched_setaffinity()") by removing the changes and re-applying
them to the new syscalls.c file. Reverting and re-applying was not possible since there
have been other changes on top of these as well.

commit 04746ed80bcf3130951ed4d5c1bc5b0bcabdde22
Author: Ingo Molnar <mingo@kernel.org>
Date:   Sun Apr 7 10:43:15 2024 +0200

    sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c

    core.c has become rather large, move most scheduler syscall
    related functionality into a separate file, syscalls.c.

    This is about ~15% of core.c's raw linecount.

    Move the alloc_user_cpus_ptr(), __rt_effective_prio(),
    rt_effective_prio(), uclamp_none(), uclamp_se_set()
    and uclamp_bucket_id() inlines to kernel/sched/sched.h.

    Internally export the __sched_setscheduler(), __sched_setaffinity(),
    __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(),
    check_class_changed(), splice_balance_callbacks() and balance_callbacks()
    methods to better facilitate this.

    Move the new file's build to build_policy.c, because it fits there
    semantically, but also because it's the smallest of the 4 build units
    under an allmodconfig build:

      -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
      -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
      -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
      -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i

    This better balances build time for scheduler subsystem rebuilds.

    I build-tested this new file as a standalone syscalls.o file for a bit,
    to make sure all the encapsulations & abstractions are robust.

    Also update/add my copyright notices to these files.

    Build time measurements:

     # -Before/+After:

     kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null

     Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):

     -    71,938,508,607      cycles                                                                  ( +-  0.17% )
     +    71,992,916,493      cycles                                                                  ( +-  0.22% )
     -   106,214,780,964      instructions                     #    1.48  insn per cycle              ( +-  0.01% )
     +   105,450,231,154      instructions                     #    1.46  insn per cycle              ( +-  0.01% )
     -     5,878,232,620 ns   duration_time                                                           ( +-  0.38% )
     +     5,290,085,069 ns   duration_time                                                           ( +-  0.21% )

     -            5.8782 +- 0.0221 seconds time elapsed  ( +-  0.38% )
     +            5.2901 +- 0.0111 seconds time elapsed  ( +-  0.21% )

    Build time improvement of -11.1% (duration_time) is expected: the
    parallel build time of the scheduler subsystem is determined by the
    largest, slowest to build object file, which is kernel/sched/core.o.
    By moving ~15% of its complexity into another build unit, we reduced
    build time by -11%.

    Measured cycles spent on building is within its ~0.2% stddev noise envelope.

    The -0.7% reduction in instructions spent on building the scheduler is
    statistically reliable and somewhat surprising - I can only speculate:
    maybe compilers aren't that efficient at building & optimizing 10+ KLOC files
    (core.c), and it's an overall win to balance the linecount a bit.

    Anyway, this might be a data point that suggests that reducing the linecount
    of our largest files will improve not just code readability and maintainability,
    but might also improve build times a bit.

    Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:

      # -Before/+After:

      kepler:~/tip> size vmlinux
         text          data     bss     dec     hex filename
      -26475475     10439178        1740804 38655457        24dd5e1 vmlinux
      +26476003     10439178        1740804 38655985        24dd7f1 vmlinux

      kepler:~/tip> size kernel/sched/built-in.a
         text          data     bss     dec     hex filename
      - 76056         30025     489  106570   1a04a kernel/sched/core.o (ex kernel/sched/built-in.a)
      + 63452         29453     489   93394   16cd2 kernel/sched/core.o (ex kernel/sched/built-in.a)
        44299          2181     104   46584    b5f8 kernel/sched/fair.o (ex kernel/sched/built-in.a)
      - 42764          3424     120   46308    b4e4 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
      + 55651          4044     120   59815    e9a7 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
        44866         12655    2192   59713    e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
        44866         12655    2192   59713    e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)

    This is primarily due to the extra functions exported, and the size
    gets exaggerated somewhat by __pfx CFI function padding:

            ffffffff810cc710 <__pfx_enqueue_task>:
            ffffffff810cc710:       90                      nop
            ffffffff810cc711:       90                      nop
            ffffffff810cc712:       90                      nop
            ffffffff810cc713:       90                      nop
            ffffffff810cc714:       90                      nop
            ffffffff810cc715:       90                      nop
            ffffffff810cc716:       90                      nop
            ffffffff810cc717:       90                      nop
            ffffffff810cc718:       90                      nop
            ffffffff810cc719:       90                      nop
            ffffffff810cc71a:       90                      nop
            ffffffff810cc71b:       90                      nop
            ffffffff810cc71c:       90                      nop
            ffffffff810cc71d:       90                      nop
            ffffffff810cc71e:       90                      nop
            ffffffff810cc71f:       90                      nop

    AFAICS the cost is primarily not to core.o and fair.o though (which contain
    most performance sensitive scheduler functions), only to syscalls.o
    that get called with much lower frequency - so I think this is an acceptable
    trade-off for better code separation.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 14a470e760 sched/pelt: Remove shift of thermal clock
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 97450eb909658573dcacc1063b06d3d08642c0c1
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Tue Mar 26 10:16:16 2024 +0100

    sched/pelt: Remove shift of thermal clock

    The optional shift of the clock used by thermal/hw load avg has been
    introduced to handle cases where the signal was not always a high frequency
    hw signal. Now that cpufreq provides a signal for firmware and
    SW pressure, we can remove this exception and always keep this PELT signal
    aligned with other signals.
    Mark sysctl_sched_migration_cost boot parameter as deprecated

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 51c743b331 sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Minor differences since we already have ddae0ca2a8f
("sched: Move psi_account_irqtime() out of update_rq_clock_task()
hotpath") which changes some nearby code.

commit d4dbc991714eefcbd8d54a3204bd77a0a52bd32d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Tue Mar 26 10:16:15 2024 +0100

    sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()

    Now that cpufreq provides a pressure value to the scheduler, rename
    arch_update_thermal_pressure() into HW pressure to reflect that it returns
    a pressure applied by HW (i.e. with a high frequency change) and is not
    always related to thermal mitigation, but may also be generated by max
    current limitation, for example. Such a high frequency signal needs
    filtering to be smoothed and to provide a value that reflects the average
    available capacity on the scheduler's time scale.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-23 13:33:02 -04:00
Phil Auld 32938a738c sched/balancing: Rename trigger_load_balance() => sched_balance_trigger()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts:  Dropped CN documentation since not in RHEL. Minor fuzz in
sched-domains.rst.

commit 983be0628c061989b6cc175d2f5e429b40699fbb
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Mar 8 12:18:09 2024 +0100

    sched/balancing: Rename trigger_load_balance() => sched_balance_trigger()

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-4-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:48 -04:00
Phil Auld 88d1c5d2ed sched/balancing: Rename scheduler_tick() => sched_tick()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts:  Dropped CN documentation since not in RHEL, context
diffs in sched-domains.rst. Skipped hunk in func_set_ftrace_file.tc
due to not having 6fec1ab67f8 ("selftests/ftrace: Do not
trace do_softirq because of PREEMPT_RT") in tree.

commit 86dd6c04ef9f213e14d60c9f64bce1cc019f816e
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Mar 8 12:18:08 2024 +0100

    sched/balancing: Rename scheduler_tick() => sched_tick()

    - Standardize on prefixing scheduler-internal functions defined
      in <linux/sched.h> with sched_*() prefix. scheduler_tick() was
      the only function using the scheduler_ prefix. Harmonize it.

    - The other reason to rename it is the NOHZ scheduler tick
      handling functions are already named sched_tick_*().
      Make the 'git grep sched_tick' more meaningful.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:48 -04:00
Phil Auld 79fb2c348b sched/core: Simplify code by removing duplicate #ifdefs
JIRA: https://issues.redhat.com/browse/RHEL-56494

commit 8cec3dd9e5930c82c6bd0af3fdb3a36bcd428310
Author: Shrikanth Hegde <sshegde@linux.ibm.com>
Date:   Fri Feb 16 11:44:33 2024 +0530

    sched/core: Simplify code by removing duplicate #ifdefs

    There's a few cases of nested #ifdefs in the scheduler code
    that can be simplified:

      #ifdef DEFINE_A
      ...code block...
        #ifdef DEFINE_A       <-- This is a duplicate.
        ...code block...
        #endif
      #else
        #ifndef DEFINE_A     <-- This is also duplicate.
        ...code block...
        #endif
      #endif

    More details about the script and methods used to find these code
    patterns can be found at:

      https://lore.kernel.org/all/20240118080326.13137-1-sshegde@linux.ibm.com/

    No change in functionality intended.

    [ mingo: Clarified the changelog. ]

    Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240216061433.535522-1-sshegde@linux.ibm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:46 -04:00
Waiman Long dcf25e4ab2 rcu/tasks: Fix stale task snaphot for Tasks Trace
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 399ced9594dfab51b782798efe60a2376cd5b724
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 17 May 2024 17:23:02 +0200

    rcu/tasks: Fix stale task snaphot for Tasks Trace

    When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running
    on all online CPUs, no explicit ordering synchronizes properly with a
    context switch.  This lack of ordering can permit the new task to miss
    pre-grace-period update-side accesses.  The following diagram, courtesy
    of Paul, shows the possible bad scenario:

            CPU 0                                           CPU 1
            -----                                           -----

            // Pre-GP update side access
            WRITE_ONCE(*X, 1);
            smp_mb();
            r0 = rq->curr;
                                                            RCU_INIT_POINTER(rq->curr, TASK_B)
                                                            spin_unlock(rq)
                                                            rcu_read_lock_trace()
                                                            r1 = X;
            /* ignore TASK_B */

    Either r0==TASK_B or r1==1 is needed but neither is guaranteed.

    One possible solution to solve this is to wait for an RCU grace period
    at the beginning of the RCU-tasks-trace grace period before taking the
    current tasks snapshot. However this would introduce large additional
    latencies to RCU-tasks-trace grace periods.

    Another solution is to lock the target runqueue while taking the current
    task snapshot. This ensures that the update side sees the latest context
    switch and subsequent context switches will see the pre-grace-period
    update side accesses.

    This commit therefore adds runqueue locking to cpu_curr_snapshot().

    Fixes: e386b6725798 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:51 -04:00
Phil Auld d414c1e069 sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
JIRA: https://issues.redhat.com/browse/RHEL-48226
Conflicts: Minor context differences in sched/core.c due to
not having scheduler_tick() renamed sched_tick and d4dbc991714e
("sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()").

commit ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8
Author: John Stultz <jstultz@google.com>
Date:   Tue Jun 18 14:58:55 2024 -0700

    sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath

    It was reported that in moving to 6.1, a larger than 10%
    regression was seen in the performance of
    clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).

    Using a simple reproducer, I found:
    5.10:
    100000000 calls in 24345994193 ns => 243.460 ns per call
    100000000 calls in 24288172050 ns => 242.882 ns per call
    100000000 calls in 24289135225 ns => 242.891 ns per call

    6.1:
    100000000 calls in 28248646742 ns => 282.486 ns per call
    100000000 calls in 28227055067 ns => 282.271 ns per call
    100000000 calls in 28177471287 ns => 281.775 ns per call

    The cause of this was finally narrowed down to the addition of
    psi_account_irqtime() in update_rq_clock_task(), in commit
    52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
    pressure").

    In my initial attempt to resolve this, I leaned towards moving
    all accounting work out of the clock_gettime() call path, but it
    wasn't very pretty, so it will have to wait for a later deeper
    rework. Instead, Peter shared this approach:

    Rework psi_account_irqtime() to use its own psi_irq_time base
    for accounting, and move it out of the hotpath, calling it
    instead from sched_tick() and __schedule().

    In testing this, we found the importance of ensuring
    psi_account_irqtime() is run under the rq_lock, which Johannes
    Weiner helpfully explained, so also add some lockdep annotations
    to make that requirement clear.

    With this change the performance is back in-line with 5.10:
    6.1+fix:
    100000000 calls in 24297324597 ns => 242.973 ns per call
    100000000 calls in 24318869234 ns => 243.189 ns per call
    100000000 calls in 24291564588 ns => 242.916 ns per call

    Reported-by: Jimmy Shiu <jimmyshiu@google.com>
    Originally-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
    Reviewed-by: Qais Yousef <qyousef@layalina.io>
    Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-07-15 11:13:20 -04:00
Phil Auld f2ab2fc5c2 sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()
JIRA: https://issues.redhat.com/browse/RHEL-48226

commit 49217ea147df7647cb89161b805c797487783fc0
Author: Cheng Yu <serein.chengyu@huawei.com>
Date:   Wed Apr 24 21:24:38 2024 +0800

    sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()

    In the cgroup v2 CPU subsystem, assuming we have a
    cgroup named 'test', and we set cpu.max and cpu.max.burst:

        # echo 1000000 > /sys/fs/cgroup/test/cpu.max
        # echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst

    then we check cpu.max and cpu.max.burst:

        # cat /sys/fs/cgroup/test/cpu.max
        1000000 100000
        # cat /sys/fs/cgroup/test/cpu.max.burst
        1000000

    Next we set cpu.max again and check cpu.max and
    cpu.max.burst:

        # echo 2000000 > /sys/fs/cgroup/test/cpu.max
        # cat /sys/fs/cgroup/test/cpu.max
        2000000 100000

        # cat /sys/fs/cgroup/test/cpu.max.burst
        1000

    ... we find that the cpu.max.burst value changed unexpectedly.

    The unit of the burst value returned by tg_get_cfs_burst() is
    microseconds, while in cpu_max_write() the burst unit used for the
    calculation should be nanoseconds, which leads to the bug.

    To fix it, get the burst value directly from tg->cfs_bandwidth.burst.

    Fixes: f4183717b3 ("sched/fair: Introduce the burstable CFS controller")
    Reported-by: Qixin Liao <liaoqixin@huawei.com>
    Signed-off-by: Cheng Yu <serein.chengyu@huawei.com>
    Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com
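
A sketch of the unit mismatch in cpu_max_write() (helper and field names are taken from the commit text; the surrounding code is condensed):

    /* Before: tg_get_cfs_burst() returns microseconds, but the value is later
     * used where nanoseconds are expected, so a 1000000us burst gets
     * reinterpreted as 1000000ns == 1000us -- exactly what the reproducer
     * above shows. */
    u64 burst = tg_get_cfs_burst(tg);

    /* After: read the raw value, which is already kept in nanoseconds. */
    u64 burst = tg->cfs_bandwidth.burst;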

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-07-15 11:13:08 -04:00
Phil Auld e997ca8943 delayacct: track delays from IRQ/SOFTIRQ
JIRA: https://issues.redhat.com/browse/RHEL-48226
Conflicts: Context difference in delayacct.h due to different location of
delayaact_tsk_init in rhel codebase.

commit a3b2aeac9d154e5e15ddbf19de934c0c606b6acd
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Sat Apr 8 17:28:35 2023 +0800

    delayacct: track delays from IRQ/SOFTIRQ

    Delay accounting does not track the delay of IRQ/SOFTIRQ.  IRQ/SOFTIRQ
    can have an obvious impact on the productivity of some workloads, such
    as when workloads are running on a system which is busy handling network
    IRQ/SOFTIRQ.

    Getting the delay of IRQ/SOFTIRQ could help users reduce such delays, for
    example by setting interrupt affinity or task affinity, or by using a
    kernel thread for NAPI.  This is inspired by "sched/psi: Add PSI_IRQ to track
    IRQ/SOFTIRQ pressure"[1].  Also fix some code indent problems of older
    code.

    And update tools/accounting/getdelays.c:
        / # ./getdelays -p 156 -di
        print delayacct stats ON
        printing IO accounting
        PID     156

        CPU             count     real total  virtual total    delay total  delay average
                           15       15836008       16218149      275700790         18.380ms
        IO              count    delay total  delay average
                            0              0          0.000ms
        SWAP            count    delay total  delay average
                            0              0          0.000ms
        RECLAIM         count    delay total  delay average
                            0              0          0.000ms
        THRASHING       count    delay total  delay average
                            0              0          0.000ms
        COMPACT         count    delay total  delay average
                            0              0          0.000ms
        WPCOPY          count    delay total  delay average
                           36        7586118          0.211ms
        IRQ             count    delay total  delay average
                           42         929161          0.022ms
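
    For reference, each "delay average" above is simply delay total divided
    by count: for the IRQ row, 929161 ns / 42 ≈ 22123 ns ≈ 0.022ms, matching
    the printed value.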

    [1] commit 52b1364ba0b1("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")

    Link: https://lkml.kernel.org/r/202304081728353557233@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Cc: wangyong <wang.yong12@zte.com.cn>
    Cc: junhua huang <huang.junhua@zte.com.cn>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-07-15 11:12:08 -04:00
Lucas Zampieri f6029bf351 Merge: workqueue: Backport workqueue commits to v6.9
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3910

JIRA: https://issues.redhat.com/browse/RHEL-25103    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3910    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3847    

The primary purpose of this MR is to backport the upstream workqueue
commits which enable ordered workqueues and rescuers to follow changes
in the workqueue unbound cpumask. This is necessary to make sure that
isolated CPUs won't be disturbed by unbound work items being handled
on those CPUs.

These upstream commits were merged into the v6.9 kernel which also
contains some major changes in workqueue code. This makes the required
commits dependent on some of the v6.9 workqueue commits. It is less risky
to sync the workqueue code up to v6.9 than to selectively backport
the dependent commits. This MR also includes some miscellaneous
commits in other subsystems due to changes in the underlying workqueue
implementations.

A follow-up proactive workqueue fixes MR will be created later on,
if necessary.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Radu Rendec <rrendec@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-13 13:07:43 +00:00
Waiman Long 6d0328a7cf Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8""
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only

This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-18 21:38:20 -04:00
Lucas Zampieri 08637d76a2 Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
This reverts merge request !4128
2024-05-16 15:26:41 +00:00
Lucas Zampieri 1ce55b7cbb Merge: cgroup: Backport upstream cgroup commits up to v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

JIRA: https://issues.redhat.com/browse/RHEL-34600    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

This MR backports upstream cgroup commits up to v6.8 with related fixes,
if applicable. It also pulls in a number of scheduler and PSI related
commits due to their interaction with cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:28:22 +00:00
Lucas Zampieri f67ab7550c Merge: Scheduler: rhel9.5 updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3975

JIRA: https://issues.redhat.com/browse/RHEL-25535 

JIRA: https://issues.redhat.com/browse/RHEL-20158  

JIRA: https://issues.redhat.com/browse/RHEL-15622

Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935

Tested: Scheduler stress tests. Perf QE will do a
performance regression test.
  
A collection of fixes and updates that brings the  
core scheduler code up to v6.8. EEVDF related commits  
are skipped since we are not planning to take the new  
task scheduler in rhel9.
  
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-08 20:13:47 +00:00
Waiman Long 1665f6ac9c workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 616db8779b1e3f93075df691432cccc5ef3c3ba0
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

    If a per-cpu work item hogs the CPU, it can prevent other work items from
    starting through concurrency management. A per-cpu workqueue which intends
    to host such CPU-hogging work items can choose to not participate in
    concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
    error-prone and difficult to debug when missed.

    This patch adds automatic CPU-usage-based detection. If a
    concurrency-managed work item consumes more CPU time than the threshold
    (10ms by default) continuously without intervening sleeps, wq_worker_tick(),
    which is called from scheduler_tick(), will detect the condition and
    automatically mark it CPU_INTENSIVE.
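
    As a usage note (hedged; exact syntax per the kernel-parameters.txt
    entry added below), the threshold can be tuned on the kernel command
    line, e.g. raising it to 50ms:

        workqueue.cpu_intensive_thresh_us=50000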

    The mechanism isn't foolproof:

    * Detection depends on tick hitting the work item. Getting preempted at the
      right timings may allow a violating work item to evade detection at least
      temporarily.

    * nohz_full CPUs may not be running ticks and thus can fail detection.

    * Even when detection is working, the 10ms detection delays can add up if
      many CPU-hogging work items are queued at the same time.

    However, in the vast majority of cases, this should be able to detect violations
    reliably and provide reasonable protection with a small increase in code
    complexity.

    If some work items trigger this condition repeatedly, the bigger problem
    likely is the CPU being saturated with such per-cpu work items and the
    solution would be making them UNBOUND. The next patch will add a debug
    mechanism to help spot such cases.

    v4: Documentation for workqueue.cpu_intensive_thresh_us added to
        kernel-parameters.txt.

    v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
        suggested by Peter.

    v2: Lai pointed out that wq_worker_stopping() also needs to be called from
        preemption and rtlock paths and an earlier patch was updated
        accordingly. This patch adds a comment describing the risk of infinite
        recursions and how they're avoided.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Lucas Zampieri d83249ff08 Merge: futex: Rebase futex code to v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3827

JIRA: https://issues.redhat.com/browse/RHEL-28616    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3827

This MR rebases the RHEL9 futex code base to align to the v6.8 upstream
kernel to gain access to new futex syscalls and functionality that are
likely needed by userspace applications and other kernel subsystems.

It also reverts some linux-rt-devel specific rt-mutex and scheduler
patches, replacing them with their upstream Linux equivalents, and pulls
in some unrelated syscall patches. These are all done to ease the
current and future backporting effort.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Crystal Wood <crwood@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-29 14:06:58 +00:00
Waiman Long fe6fb41529 sched/core: Update stale comment in try_to_wake_up()
JIRA: https://issues.redhat.com/browse/RHEL-34600

commit ea41bb514fe286bf50498b3c6d7f7a5dc2b6c5e0
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed, 4 Oct 2023 11:33:36 +0200

    sched/core: Update stale comment in try_to_wake_up()

    The following commit:

      9b3c4ab3045e ("sched,rcu: Rework try_invoke_on_locked_down_task()")

    ... renamed try_invoke_on_locked_down_task() to task_call_func(),
    but forgot to update the comment in try_to_wake_up().

    But it turns out that the smp_rmb() doesn't live in task_call_func()
    either, it was moved to __task_needs_rq_lock() in:

      91dabf33ae5d ("sched: Fix race in task_call_func()")

    Fix that now.

    Also fix the s/smb/smp typo while at it.

    Reported-by: Zhang Qiao <zhangqiao22@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230731085759.11443-1-zhangqiao22@huawei.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:23 -04:00
Waiman Long 5246c3da80 sched: add throttled time stat for throttled children
JIRA: https://issues.redhat.com/browse/RHEL-34600

commit 677ea015f231aa38b3972aa7be54ecd2637e99fd
Author: Josh Don <joshdon@google.com>
Date:   Tue, 20 Jun 2023 11:32:47 -0700

    sched: add throttled time stat for throttled children

    We currently export the total throttled time for cgroups that are given
    a bandwidth limit. This patch extends this accounting to also account
    the total time that each child cgroup has been throttled.

    This is useful to understand the degree to which children have been
    affected by the throttling control. Children which are not runnable
    during the entire throttled period, for example, will not show any
    self-throttling time during this period.

    Expose this in a new interface, 'cpu.stat.local', which is similar to
    how non-hierarchical events are accounted in 'memory.events.local'.
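
    A hedged usage sketch (the field name is assumed to mirror the
    throttled_usec entry in cpu.stat):

        # cat /sys/fs/cgroup/test/cpu.stat.local
        throttled_usec 0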

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:11 -04:00
Waiman Long 9dedff9054 sched: Fix race in task_call_func()
JIRA: https://issues.redhat.com/browse/RHEL-34600

commit 91dabf33ae5df271da63e87ad7833e5fdb4a44b9
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed, 26 Oct 2022 13:43:00 +0200

    sched: Fix race in task_call_func()

    There is a very narrow race between schedule() and task_call_func().

      CPU0                                          CPU1

      __schedule()
        rq_lock();
        prev_state = READ_ONCE(prev->__state);
        if (... && prev_state) {
          deactivate_task(rq, prev, ...)
            prev->on_rq = 0;

                                                    task_call_func()
                                                      raw_spin_lock_irqsave(p->pi_lock);
                                                      state = READ_ONCE(p->__state);
                                                      smp_rmb();
                                                      if (... || p->on_rq) // false!!!
                                                        rq = __task_rq_lock()

                                                      ret = func();

        next = pick_next_task();
        rq = context_switch(prev, next)
          prepare_lock_switch()
            spin_release(&__rq_lockp(rq)->dep_map...)

    So while the task is on its way out, it still holds rq->lock for a
    little while, and right then task_call_func() comes in and figures it
    doesn't need rq->lock anymore (because the task is already dequeued --
    but still running there) and then the __set_task_frozen() thing observes
    it's holding rq->lock and yells murder.

    Avoid this by waiting for p->on_cpu to get cleared, which guarantees
    the task is fully finished on the old CPU.
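
    A minimal sketch of the shape of that wait (assuming, per the stale-comment
    fix above, that it sits in __task_needs_rq_lock() next to the existing
    smp_rmb()):

        /* wait until the task has fully finished __schedule() on its old CPU */
        smp_cond_load_acquire(&p->on_cpu, !VAL);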

    ( While arguably the fixes tag is 'wrong' -- none of the previous
      task_call_func() users appears to care for this case. )

    Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
    Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://lkml.kernel.org/r/Y1kdRNNfUeAU+FNl@hirez.programming.kicks-ass.net

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:08 -04:00
Waiman Long 812de711d8 sched: Fix TASK_state comparisons
JIRA: https://issues.redhat.com/browse/RHEL-34600

commit 5aec788aeb8eb74282b75ac1b317beb0fbb69a42
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue, 27 Sep 2022 21:02:34 +0200

    sched: Fix TASK_state comparisons

    Task state is fundamentally a bitmask; direct comparisons are probably
    not working as intended. Specifically, the normal wait-states have
    a number of possible modifiers:

      TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE
      TASK_INTERRUPTIBLE:   TASK_FREEZABLE

    Specifically, the addition of TASK_FREEZABLE wrecked
    __wait_is_interruptible(). This however led to an audit of direct
    comparisons yielding the rest of the changes.
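
    A hedged, self-contained illustration of why equality tests go wrong
    once modifier bits are present (the numeric values below are made up,
    not the kernel's actual encoding):

        #include <stdio.h>

        #define TASK_INTERRUPTIBLE   0x0001
        #define TASK_FREEZABLE       0x0100   /* illustrative modifier bit */

        int main(void)
        {
                unsigned int state = TASK_INTERRUPTIBLE | TASK_FREEZABLE;

                /* direct comparison misses the modifier: prints 0 */
                printf("%d\n", state == TASK_INTERRUPTIBLE);

                /* mask test is what callers actually mean: prints 1 */
                printf("%d\n", !!(state & TASK_INTERRUPTIBLE));
                return 0;
        }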

    Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
    Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Debugged-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:07 -04:00
Waiman Long 724656e7cf freezer,sched: Rewrite core freezer logic
JIRA: https://issues.redhat.com/browse/RHEL-34600
Conflicts:
 1) A merge conflict in the kernel/signal.c hunk due to the presence
    of RHEL-only commit 975d318867 ("signal: Don't disable preemption
    in ptrace_stop() on PREEMPT_RT.").
 2) A merge conflict in the kernel/time/hrtimer.c hunk due to the
    presence of RHEL-only commit 5f76194136 ("time/hrtimer: Embed
    hrtimer mode into hrtimer_sleeper").
 3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due
    to the presence of upstream commit 38c8a9a52082 ("smb: move client
    and server files to common directory fs/smb").
 4) Similarly, the fs/cifs/transport.c hunk was applied to
    fs/smb/client/transport.c manually due to the presence of
    a later upstream commit d527f51331ca ("cifs: Fix UAF in
    cifs_demultiplex_thread()").

Note that all the prerequisite patches in the same patch series
(https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/)
had already been merged into RHEL9.

commit f5d39b020809146cc28e6e73369bf8065e0310aa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon, 22 Aug 2022 13:18:22 +0200

    freezer,sched: Rewrite core freezer logic

    Rewrite the core freezer to behave better wrt thawing and be simpler
    in general.

    By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
    ensured frozen tasks stay frozen until thawed and don't randomly wake
    up early, as is currently possible.

    As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
    two PF_flags (yay!).

    Specifically; the current scheme works a little like:

            freezer_do_not_count();
            schedule();
            freezer_count();

    And either the task is blocked, or it lands in try_to_freeze()
    through freezer_count(). Now, when it is blocked, the freezer
    considers it frozen and continues.

    However, on thawing, once pm_freezing is cleared, freezer_count()
    stops working, and any random/spurious wakeup will let a task run
    before its time.

    That is, thawing tries to thaw things in an explicit order; kernel
    threads and workqueues before bringing SMP back, before userspace,
    etc.  However, due to the above mentioned races it is entirely possible
    for userspace tasks to thaw (by accident) before SMP is back.

    This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
    where the userspace task requires a special CPU to run.

    As said; replace this with a special task state TASK_FROZEN and add
    the following state transitions:

            TASK_FREEZABLE  -> TASK_FROZEN
            __TASK_STOPPED  -> TASK_FROZEN
            __TASK_TRACED   -> TASK_FROZEN

    The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
    (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
    is already required to deal with spurious wakeups and the freezer
    causes one such when thawing the task (since the original state is
    lost).
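
    A hedged before/after sketch of how a freezable sleep looks under the
    new scheme (caller-side pattern only, not a specific upstream hunk):

            /* old scheme, as above: hide from the freezer around the sleep */
            freezer_do_not_count();
            schedule();
            freezer_count();

            /* new scheme: freezability is part of the sleep state itself, so
             * the thaw-time wakeup is just another spurious wakeup to retry on */
            set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
            schedule();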

    The special __TASK_{STOPPED,TRACED} states *can* be restored since
    their canonical state is in ->jobctl.

    With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
    free of undue (early / spurious) wakeups.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:06 -04:00
Lucas Zampieri d23522d08a Merge: Sched: schedutil/cpufreq updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935

JIRA: https://issues.redhat.com/browse/RHEL-29020  
  
Bring schedutil code up to about v6.8. This includes some fixes for  
code in rhel9 from the 5.14 rebase. There are a few pieces in cpufreq
driver code and the arm architectures needed to make it complete.
Tested: Ran stress tests with the schedutil governor. Ran general scheduler
stress and performance tests.  
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-26 12:34:20 +00:00
Lucas Zampieri 79eb65d175 Merge: sched: apply class and guard cleanups
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3865

JIRA: https://issues.redhat.com/browse/RHEL-29017  
  
Apply the changes using the macros in include/linux/cleanup.h that provide
scoped guards. There is no real functional change. We rely on the compiler
to clean up rather than having explicit unwinding with gotos.
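
A hedged sketch of the before/after pattern (using the mutex guard defined
via include/linux/cleanup.h; the names m, valid, do_thing and ret are
illustrative, not a specific hunk from this MR):

    /* before: explicit unwinding with a goto */
    mutex_lock(&m);
    if (!valid) {
            ret = -EINVAL;
            goto unlock;
    }
    ret = do_thing();
unlock:
    mutex_unlock(&m);
    return ret;

    /* after: the compiler releases the lock when the scope ends */
    guard(mutex)(&m);
    if (!valid)
            return -EINVAL;
    return do_thing();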
  
Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-22 12:41:20 +00:00
Audra Mitchell 0f85367c44 panic: Consolidate open-coded panic_on_warn checks
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 79cc1ba7badf9e7a12af99695a557e9ce27ee967
Author: Kees Cook <keescook@chromium.org>
Date:   Thu Nov 17 15:43:24 2022 -0800

    panic: Consolidate open-coded panic_on_warn checks

    Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll
    their own warnings, and each check "panic_on_warn". Consolidate this
    into a single function so that future instrumentation can be added in
    a single location.
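
    A hedged sketch of the consolidation (helper name as added by this
    patch; the caller string is illustrative):

        /* before, open-coded in each run-time checker: */
        if (panic_on_warn)
                panic("panic_on_warn set ...\n");

        /* after, one shared helper: */
        check_panic_on_warn("KASAN");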

    Cc: Marco Elver <elver@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: David Gow <davidgow@google.com>
    Cc: tangmeng <tangmeng@uniontech.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Shuah Khan <skhan@linuxfoundation.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
    Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
    Cc: kasan-dev@googlegroups.com
    Cc: linux-mm@kvack.org
    Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:00 -04:00
Phil Auld 8e7f4729fa sched/deadline: Introduce deadline servers
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Context diff in include/linux/sched.h mostly due to not
 having fd593511cdfc ("tracing/user_events: Track fork/exec/exit for
 mm lifetime").

commit 63ba8422f876e32ee564ea95da9a7313b13ff0a1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:21 2023 +0100

    sched/deadline: Introduce deadline servers

    Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
    with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

    RT Throttling was introduced a while ago as a (mostly debug)
    countermeasure one can utilize to reserve some CPU time for low priority
    tasks (usually background type of work, e.g. workqueues, timers, etc.).
    It however has its own problems (see documentation) and the undesired
    effect of unconditionally throttling FIFO tasks even when no lower
    priority activity needs to run (there are mechanisms to fix this issue
    as well, but, again, with their own problems).

    Introduce deadline servers to service the needs of low priority tasks under
    starvation conditions. Deadline servers are built extending SCHED_DEADLINE
    implementation to allow 2-level scheduling (a sched_deadline entity
    becomes a container for lower priority scheduling entities).

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00