Commit Graph

1596 Commits

Author SHA1 Message Date
Jerome Marchand 04a26afde2 trace: Add trace_ipi_send_cpu()
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")

commit 68e2d17c9eb311ab59aeb6d0c38aad8985fa2596
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Mar 22 11:28:36 2023 +0100

    trace: Add trace_ipi_send_cpu()

    Because copying cpumasks around when targeting a single CPU is a bit
    daft...

    Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
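
A minimal user-space sketch (not part of the backport) of how the new event can be enabled and observed through tracefs. It assumes tracefs is mounted at /sys/kernel/tracing and that the event is exposed as events/ipi/ipi_send_cpu on a kernel carrying this series; adjust the paths if the layout differs. Build with cc and run as root; remote wakeups and irq_work should then show up as ipi_send_cpu records.

    /* enable the ipi_send_cpu tracepoint and print a few records */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);

        if (fd < 0)
            return -1;
        if (write(fd, val, strlen(val)) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int main(void)
    {
        char line[512];
        FILE *pipe;
        int i;

        if (write_str("/sys/kernel/tracing/events/ipi/ipi_send_cpu/enable", "1")) {
            perror("enable ipi_send_cpu");
            return 1;
        }

        pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
        if (!pipe) {
            perror("trace_pipe");
            return 1;
        }

        /* print a handful of events, then disable the tracepoint again */
        for (i = 0; i < 10 && fgets(line, sizeof(line), pipe); i++)
            fputs(line, stdout);

        fclose(pipe);
        write_str("/sys/kernel/tracing/events/ipi/ipi_send_cpu/enable", "0");
        return 0;
    }
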
Jerome Marchand aa5786b04d sched, smp: Trace smp callback causing an IPI
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts: Need to modify __smp_call_single_queue_debug() too. It was
removed upstream by commit 1771257cb447 ("locking/csd_lock: Remove
added data from CSD lock debugging")

commit 68f4ff04dbada18dad79659c266a8e5e29e458cd
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Tue Mar 7 14:35:58 2023 +0000

    sched, smp: Trace smp callback causing an IPI

    Context
    =======

    The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
    which so far has only been fed with NULL.

    While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
    struct layout (meaning their callback func can be accessed without caring
    about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
    attached to its struct. This means we need to check the type of a CSD
    before eventually dereferencing its associated callback.

    This isn't as trivial as it sounds: the CSD type is stored in
    __call_single_node.u_flags, which gets cleared right before the callback is
    executed via csd_unlock(). This implies checking the CSD type before it is
    enqueued on the call_single_queue, as the target CPU's queue can be flushed
    before we get to sending an IPI.

    Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
    would need to have an additional argument to trickle down the invoked
    function. This is somewhat silly, as the extra argument will always be
    pushed down to the function even when nothing is being traced, which is
    unnecessary overhead.

    Changes
    =======

    send_call_function_single_ipi() is only used by smp.c, and is defined in
    sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of
    a CPU's idle task).

    Split it into two parts: the scheduler bits remain in sched/core.c, and the
    actual IPI emission is moved into smp.c. This lets us define an
    __always_inline helper function that can take the related callback as
    parameter without creating useless register pressure in the non-traced path
    which only gains a (disabled) static branch.

    Do the same thing for the multi IPI case.

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
Jerome Marchand 160dc2ad5b sched, smp: Trace IPIs sent via send_call_function_single_ipi()
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")

commit cc9cb0a71725aa8dd8d8f534a9b562bbf7981f75
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Tue Mar 7 14:35:53 2023 +0000

    sched, smp: Trace IPIs sent via send_call_function_single_ipi()

    send_call_function_single_ipi() is the thing that sends IPIs at the bottom
    of smp_call_function*() via either generic_exec_single() or
    smp_call_function_many_cond(). Give it an IPI-related tracepoint.

    Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
    which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Acked-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
Phil Auld c11309550b sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 96500560f0c73c71bca1b27536c6254fa0e8ce37
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Tue Jun 13 16:20:10 2023 +0800

    sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()

    There is a double update_rq_clock() invocation:

      __balance_push_cpu_stop()
        update_rq_clock()
        __migrate_task()
          update_rq_clock()

    Sadly, select_fallback_rq() also needs update_rq_clock() for
    __do_set_cpus_allowed(), so it is not possible to remove the update from
    __balance_push_cpu_stop(). So remove it from __migrate_task() and
    ensure all callers of this function call update_rq_clock() prior to
    calling it.

    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 57aa0597d5 sched/core: Fixed missing rq clock update before calling set_rq_offline()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit cab3ecaed5cdcc9c36a96874b4c45056a46ece45
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Tue Jun 13 16:20:09 2023 +0800

    sched/core: Fixed missing rq clock update before calling set_rq_offline()

    When using a cpufreq governor that uses
    cpufreq_add_update_util_hook(), it is possible to trigger a missing
    update_rq_clock() warning for the CPU hotplug path:

      rq_attach_root()
        set_rq_offline()
          rq_offline_rt()
            __disable_runtime()
              sched_rt_rq_enqueue()
                enqueue_top_rt_rq()
                  cpufreq_update_util()
                    data->func(data, rq_clock(rq), flags)

    Move update_rq_clock() from sched_cpu_deactivate() (one of its
    callers) into set_rq_offline() such that it covers all
    set_rq_offline() usage.

    Additionally change rq_attach_root() to use rq_lock_irqsave() so that
    it will properly manage the runqueue clock flags.

    Suggested-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 51c8946826 sched: Consider task_struct::saved_state in wait_task_inactive()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 1c06918788e8ae6e69e4381a2806617312922524
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 16:39:07 2023 +0200

    sched: Consider task_struct::saved_state in wait_task_inactive()

    With the introduction of task_struct::saved_state in commit
    5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
    matching the task state has gotten more complicated. That same commit
    changed try_to_wake_up() to consider both states, but
    wait_task_inactive() has been neglected.

    Sebastian noted that the wait_task_inactive() usage in
    ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
    the tasklist_lock after it sets TASK_TRACED.

    Therefore extract a common helper from ttwu_state_match() and use that
    to teach wait_task_inactive() about the PREEMPT_RT locks.

    Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
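
A rough stand-alone C sketch of the matching problem described above (the state bits and helper are illustrative, not the kernel's): once a PREEMPT_RT sleeping lock parks the original sleep state in saved_state, a state match has to consider both fields.

    #include <stdbool.h>
    #include <stdio.h>

    /* illustrative task-state bits, loosely modelled on the kernel's */
    #define TASK_INTERRUPTIBLE   0x0001
    #define TASK_UNINTERRUPTIBLE 0x0002
    #define TASK_RTLOCK_WAIT     0x1000 /* blocked on a PREEMPT_RT sleeping lock */

    struct demo_task {
        unsigned int state;       /* stands in for task_struct::__state */
        unsigned int saved_state; /* stands in for task_struct::saved_state */
    };

    /* hypothetical shared helper: match the visible state or, failing that,
     * the state saved when the task blocked on an RT sleeping lock */
    static bool demo_state_match(const struct demo_task *p, unsigned int match_state)
    {
        if (p->state & match_state)
            return true;
        if (p->state == TASK_RTLOCK_WAIT)
            return p->saved_state & match_state;
        return false;
    }

    int main(void)
    {
        /* a task that slept in TASK_UNINTERRUPTIBLE and then blocked on an
         * rtmutex-based lock: its visible state now shows the lock wait */
        struct demo_task p = {
            .state = TASK_RTLOCK_WAIT,
            .saved_state = TASK_UNINTERRUPTIBLE,
        };

        printf("matches TASK_UNINTERRUPTIBLE: %d\n",
               demo_state_match(&p, TASK_UNINTERRUPTIBLE));
        return 0;
    }
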
Phil Auld 9e3d063131 sched: Unconditionally use full-fat wait_task_inactive()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit d5e1586617be7093ea3419e3fa9387ed833cdbb1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 2 10:42:53 2023 +0200

    sched: Unconditionally use full-fat wait_task_inactive()

    While modifying wait_task_inactive() for PREEMPT_RT, the build robot
    noted that UP got broken. This led to audit and consideration of the
    UP implementation of wait_task_inactive().

    It looks like the UP implementation is also broken for PREEMPT;
    consider task_current_syscall() getting preempted between the two
    calls to wait_task_inactive().

    Therefore move the wait_task_inactive() implementation out of
    CONFIG_SMP and unconditionally use it.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld ec56b1904a sched: Change wait_task_inactive()s match_state
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: This was applied out of order with f9fc8cad9728 ("sched:
    Add TASK_ANY for wait_task_inactive()"), so the code was adjusted to
    match what the result should have been.

commit 9204a97f7ae862fc8a3330ec8335917534c3fb63
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Aug 22 13:18:19 2022 +0200

    sched: Change wait_task_inactive()s match_state

    Make wait_task_inactive()'s @match_state work like ttwu()'s @state.

    That is, instead of an equal comparison, use it as a mask. This allows
    matching multiple block conditions.

    (removes the unlikely; it doesn't make sense how it's only part of the
    condition)

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:58 -04:00
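
A tiny stand-alone C illustration of the change (the state values are illustrative, not the kernel's): the old code required exact equality with @match_state, while treating it as a mask lets one call match any of several blocked states.

    #include <stdio.h>

    #define TASK_INTERRUPTIBLE   0x0001 /* illustrative values */
    #define TASK_UNINTERRUPTIBLE 0x0002

    int main(void)
    {
        unsigned int state = TASK_UNINTERRUPTIBLE;   /* the task's block state */
        unsigned int match = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE;

        /* old behaviour: exact equality, so a combined value never matches */
        printf("equality match: %d\n", state == match);

        /* new behaviour: @match_state is a mask, any listed state matches */
        printf("mask match:     %d\n", !!(state & match));
        return 0;
    }
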
Phil Auld 418216578b Revert "sched: Consider task_struct::saved_state in wait_task_inactive()."
JIRA: https://issues.redhat.com/browse/RHEL-1536
Upstream status: RHEL only
Conflicts: A later patch renamed task_running() to task_on_cpu(), so this
did not revert cleanly. In addition, match_state does not need to be checked
for 0 due to f9fc8cad9728 ("sched: Add TASK_ANY for wait_task_inactive()").

This reverts commit 3673cc2e61.

This is commit a015745ca41f from the RT tree merge. It will be re-applied in
the form it was in when merged to Linus' tree as 1c06918788.

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:58 -04:00
Phil Auld c59c893622 sched/core: Make sched_dynamic_mutex static
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Tue Apr 11 22:26:41 2023 -0700

    sched/core: Make sched_dynamic_mutex static

    The sched_dynamic_mutex is only used within the file.  Make it static.

    Fixes: e3ff7c609f39 ("livepatch,sched: Add livepatch task switching to cond_resched()")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/oe-kbuild-all/202304062335.tNuUjgsl-lkp@intel.com/

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 63797cb734 sched/core: Reduce cost of sched_move_task when config autogroup
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit eff6c8ce8d4d7faef75f66614dd20bb50595d261
Author: wuchi <wuchi.zero@gmail.com>
Date:   Tue Mar 21 14:44:59 2023 +0800

    sched/core: Reduce cost of sched_move_task when config autogroup

    Some sched_move_task() calls are useless because
    task_struct->sched_task_group may not have changed (it equals the
    task_group of cpu_cgroup) when the system has autogroup enabled. So do
    some checks in sched_move_task().

    sched_move_task() example:
    task A belongs to cpu_cgroup0 and autogroup0; it will still belong
    to cpu_cgroup0 at do_exit(), so there is no need to {de|en}queue.
    The call graph is as follows.

      do_exit
        sched_autogroup_exit_task
          sched_move_task
            dequeue_task
              sched_change_group
                A.sched_task_group = sched_get_task_group (=cpu_cgroup0)
            enqueue_task

    Performance results:
    ===========================
    1. env
            cpu: bogomips=4600.00
         kernel: 6.3.0-rc3
     cpu_cgroup: 6:cpu,cpuacct:/user.slice

    2. cmds
    do_exit script:

      for i in {0..10000}; do
          sleep 0 &
      done
      wait

    Run the above script, then use the following bpftrace cmd to get
    the cost of sched_move_task:

      bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; }
                   kr:sched_move_task /@ts[tid]/
                      { @ns += nsecs - @ts[tid]; delete(@ts[tid]); }'

    3. cost time(ns):
      without patch: 43528033
      with    patch: 18541416
               diff:-24986617  -57.4%

    As the results show, the patch saves 57.4% in this scenario.

    Signed-off-by: wuchi <wuchi.zero@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 112765493a sched/core: Avoid selecting the task that is throttled to run when core-sched enable
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 530bfad1d53d103f98cec66a3e491a36d397884d
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Thu Mar 16 16:18:06 2023 +0800

    sched/core: Avoid selecting the task that is throttled to run when core-sched enable

    When an {rt, cfs}_rq or dl task is throttled, cookied tasks
    are not dequeued from the core tree, so sched_core_find() and
    sched_core_next() may return a throttled task, which may
    cause a throttled task to run on the CPU.

    So we add checks in sched_core_find() and sched_core_next()
    to make sure that the returned task is a runnable task that is
    not throttled.

    Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 2e4b079146 sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 6015b1aca1a233379625385feb01dd014aca60b5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Mar 14 19:32:38 2023 -0700

    sched_getaffinity: don't assume 'cpumask_size()' is fully initialized

    The getaffinity() system call uses 'cpumask_size()' to decide how big
    the CPU mask is - so far so good.  It is indeed the allocation size of a
    cpumask.

    But the code also assumes that the whole allocation is initialized
    without actually doing so itself.  That's wrong, because we might have
    fixed-size allocations (making copying and clearing more efficient), but
    not all of it is then necessarily used if 'nr_cpu_ids' is smaller.

    Having checked other users of 'cpumask_size()', they all seem to be ok,
    either using it purely for the allocation size, or explicitly zeroing
    the cpumask before using the size in bytes to copy it.

    See for example the ublk_ctrl_get_queue_affinity() function that uses
    the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
    cleared, whether the storage is on the stack or if it was an external
    allocation.

    Fix this by just zeroing the allocation before using it.  Do the same
    for the compat version of sched_getaffinity(), which had the same logic.

    Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
    access the bits.  For a cpumask_var_t, it ends up being a pointer to the
    same data either way, but it's just a good idea to treat it like you
    would a 'cpumask_t'.  The compat case already did that.

    Reported-by: Ryan Roberts <ryan.roberts@arm.com>
    Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
    Cc: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
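
On the user-space side, the same defensive habit (clear the whole mask, then trust only what the kernel fills in) looks like this with the glibc sched_getaffinity() wrapper; a sketch for illustration, not part of the patch.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set); /* start from a fully cleared mask */
        if (sched_getaffinity(0, sizeof(set), &set)) {
            perror("sched_getaffinity");
            return 1;
        }

        printf("%d CPUs allowed:", CPU_COUNT(&set));
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf(" %d", cpu);
        printf("\n");
        return 0;
    }
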
Phil Auld 8142c03a19 livepatch,sched: Add livepatch task switching to cond_resched()
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: Minor fixup due to already having 8df1947c71 ("livepatch:
    Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure")

commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Fri Feb 24 08:50:00 2023 -0800

    livepatch,sched: Add livepatch task switching to cond_resched()

    There have been reports [1][2] of live patches failing to complete
    within a reasonable amount of time due to CPU-bound kthreads.

    Fix it by patching tasks in cond_resched().

    There are four different flavors of cond_resched(), depending on the
    kernel configuration.  Hook into all of them.

    A more elegant solution might be to use a preempt notifier.  However,
    non-ORC unwinders can't unwind a preempted task reliably.

    [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
    [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org

    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 4e3b05f4b0 sched/fair: Block nohz tick_stop when cfs bandwidth in use
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
Conflicts: minor fuzz due to context.

commit 88c56cfeaec4642aee8aac58b38d5708c6aae0d3
Author: Phil Auld <pauld@redhat.com>
Date:   Wed Jul 12 09:33:57 2023 -0400

    sched/fair: Block nohz tick_stop when cfs bandwidth in use

    CFS bandwidth limits and NOHZ full don't play well together.  Tasks
    can easily run well past their quotas before a remote tick does
    accounting.  This leads to long, multi-period stalls before such
    tasks can run again. Currently, when presented with these conflicting
    requirements the scheduler is favoring nohz_full and letting the tick
    be stopped. However, nohz tick stopping is already best-effort, there
    are a number of conditions that can prevent it, whereas cfs runtime
    bandwidth is expected to be enforced.

    Make the scheduler favor bandwidth over stopping the tick by setting
    TICK_DEP_BIT_SCHED when the only running task is a cfs task with
    runtime limit enabled. We use cfs_b->hierarchical_quota to
    determine if the task requires the tick.

    Add check in pick_next_task_fair() as well since that is where
    we have a handle on the task that is actually going to be running.

    Add check in sched_can_stop_tick() to cover some edge cases such
    as nr_running going from 2->1 and the 1 remains the running task.

    Reviewed-By: Ben Segall <bsegall@google.com>
    Signed-off-by: Phil Auld <pauld@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:25:42 -04:00
Phil Auld 3f3cb409d3 sched, cgroup: Restore meaning to hierarchical_quota
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

commit c98c18270be115678f4295b10a5af5dcc9c4efa0
Author: Phil Auld <pauld@redhat.com>
Date:   Fri Jul 14 08:57:46 2023 -0400

    sched, cgroup: Restore meaning to hierarchical_quota

    In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
    groups due to the previous fix simply taking the min.  It should
    reflect a limit imposed at that level or by an ancestor. Even
    though cgroupv2 does not require child quota to be less than or
    equal to that of its ancestors the task group will still be
    constrained by such a quota so this should be shown here. Cgroupv1
    continues to set this correctly.

    In both cases, add initialization when a new task group is created
    based on the current parent's value (or RUNTIME_INF in the case of
    root_task_group). Otherwise, the field is wrong until a quota is
    changed after creation and __cfs_schedulable() is called.

    Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
    Signed-off-by: Phil Auld <pauld@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:25:41 -04:00
Jan Stancek 8d19d78fab Merge: sched/core: Use empty mask to reset cpumasks in sched_setaffinity()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2962

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only

Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.

One way to solve this problem is to provide a way to back off from
that user provided CPU affinity.  This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.

Before this patch, passing in an empty cpumask to sched_setaffinity(2)
will always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2)
has been called before to set up user_cpus_ptr. In this case, the
user_cpus_ptr that stores the user provided affinity will be cleared and
the task's CPU affinity will be reset to that of the current cpuset. This
alternative error code of -ENODEV signals that no CPU is specified
and, at the same time, has the side effect of resetting the CPU affinity
to the cpuset default.

If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before.  Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-09-01 21:26:13 +02:00
Jan Stancek f2a2d5da21 Merge: cgroup/cpuset: Provide better cpuset API to enable creation of isolated partition
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

OpenShift requires support to disable CPU load balancing for the Telco
use cases and this is a gating factor in determining if it can switch
to use cgroup v2 as the default.

The current RHEL9 kernel is able to create an isolated cpuset partition
of exclusive CPUs with load balancing disabled for cgroup v2. However,
it currently has the limitation that isolated cpuset partitions can
only be formed clustered around the cgroup root. That doesn't fit the
current OpenShift use case where systemd is primarily responsible for
managing the cgroup filesystem and OpenShift can only manage child
cgroups further away from the cgroup root.

To address the need of OpenShift, a patch series [1] has been proposed
upstream to extend the v2 cpuset partition semantics to allow the
creation of isolated partitions further away from cgroup root by adding a
new cpuset control file "cpuset.cpus.exclusive" to distribute potential
exclusive CPUs down the cgroup hierarchy for the creation of isolated
cpuset partition.

This MR incorporates the proposed upstream patches with its dependency
patches to provide a way for OpenShift to move forward with switching
the default cgroup from v1 to v2 for the 4.14 release.

The last 6 patches are the proposed upstream patches and the rest have
been merged upstream either in the mainline or the cgroup maintainer's
tree.

[1] https://lore.kernel.org/lkml/20230817132454.755459-1-longman@redhat.com/

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-09-01 21:26:08 +02:00
Waiman Long 132876f2ff cgroup/cpuset: Free DL BW in case can_attach() fails
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

commit 2ef269ef1ac006acf974793d975539244d77b28f
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Mon, 8 May 2023 09:58:54 +0200

    cgroup/cpuset: Free DL BW in case can_attach() fails

    cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
    have been checked. DL BW is not allocated per-task but as a sum over
    all DL tasks migrating.

    If multiple controllers are attached to the cgroup next to the cpuset
    controller a non-cpuset can_attach() can fail. In this case free DL BW
    in cpuset_cancel_attach().

    Finally, update cpuset DL task count (nr_deadline_tasks) only in
    cpuset_attach().

    Suggested-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-28 11:07:05 -04:00
Waiman Long 5503327426 sched/deadline: Create DL BW alloc, free & check overflow interface
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

commit 85989106feb734437e2d598b639991b9185a43a6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Mon, 8 May 2023 09:58:53 +0200

    sched/deadline: Create DL BW alloc, free & check overflow interface

    While moving a set of tasks between exclusive cpusets,
    cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for
    DL BW overflow checking and per-task DL BW allocation on the destination
    root_domain for the DL tasks in this set.

    This approach has the issue of not freeing already allocated DL BW in
    the following error cases:

    (1) The set of tasks includes multiple DL tasks and DL BW overflow
        checking fails for one of the subsequent DL tasks.

    (2) Another controller next to the cpuset controller which is attached
        to the same cgroup fails in its can_attach().

    To address this problem rework dl_cpu_busy():

    (1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a
        dedicated dl_bw_free().

    (2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of
        a `struct task_struct *p` used in dl_cpu_busy(). This allows to
        allocate DL BW for a set of tasks too rather than only for a single
        task.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-28 11:07:05 -04:00
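
The resulting attach-path shape can be sketched in plain C. The helpers below are stand-ins, not the kernel's dl_bw_alloc()/dl_bw_free(); they only model the pattern the commit describes: reserve bandwidth as a sum over all migrating DL tasks, and give it back if a later can_attach() step fails.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    static unsigned long long allocated_bw;              /* reserved so far */
    static const unsigned long long capacity = 1000000;  /* illustrative limit */

    static bool demo_dl_bw_alloc(unsigned long long bw)
    {
        if (allocated_bw + bw > capacity)
            return false;        /* overflow check failed */
        allocated_bw += bw;
        return true;
    }

    static void demo_dl_bw_free(unsigned long long bw)
    {
        allocated_bw -= bw;
    }

    int main(void)
    {
        const unsigned long long task_bw[] = { 300000, 300000, 600000 };
        const size_t ntasks = sizeof(task_bw) / sizeof(task_bw[0]);
        unsigned long long reserved = 0;
        size_t i;

        /* can_attach(): reserve bandwidth for every DL task in the set */
        for (i = 0; i < ntasks; i++) {
            if (!demo_dl_bw_alloc(task_bw[i]))
                break;
            reserved += task_bw[i];
        }

        if (i < ntasks) {
            /* cancel_attach(): roll back what was already reserved */
            demo_dl_bw_free(reserved);
            printf("attach failed, rolled back %llu\n", reserved);
        }
        return 0;
    }
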
Waiman Long 3493ed9e35 sched/cpuset: Bring back cpuset_mutex
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

commit 111cd11bbc54850f24191c52ff217da88a5e639b
Author: Juri Lelli <juri.lelli@redhat.com>
Date:   Mon, 8 May 2023 09:58:50 +0200

    sched/cpuset: Bring back cpuset_mutex

    Turns out percpu_cpuset_rwsem - commit 1243dc518c ("cgroup/cpuset:
    Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
    as it has been reported to cause slowdowns in workloads that need to
    change cpuset configuration frequently and it is also not implementing
    priority inheritance (which causes troubles with realtime workloads).

    Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
    only for SCHED_DEADLINE tasks (other policies don't care about stable
    cpusets anyway).

    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-28 11:07:04 -04:00
Crystal Wood ec180d083a sched/core: Add __always_inline to schedule_loop()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232098

Upstream Status: RHEL only

Without __always_inline, this function breaks wchan.

schedule_loop() was added by patches from the upstream RT tree; a respin
of the patches for upstream has __always_inline.

Signed-off-by: Crystal Wood <swood@redhat.com>
2023-08-21 09:57:26 -05:00
Waiman Long 05fddaaaac sched/core: Use empty mask to reset cpumasks in sched_setaffinity()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only

Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.

One way to solve this problem is to provide a way to back off from
that user provided CPU affinity.  This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.

Before this patch, passing in an empty cpumask to sched_setaffinity(2)
will always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2)
has been called before to set up user_cpus_ptr. In this case, the
user_cpus_ptr that stores the user provided affinity will be cleared and
the task's CPU affinity will be reset to that of the current cpuset. This
alternative error code of -ENODEV signals that no CPU is specified
and, at the same time, has the side effect of resetting the CPU affinity
to the cpuset default.

If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before.  Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.

We will have to update the sched_setaffinity(2) manpage to document
this possible side effect of passing in an empty cpumask.

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-19 14:53:37 -04:00
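
A small user-space sketch of the behaviour described above. It assumes a kernel carrying this RHEL-only patch; without it (or without a previously set user mask) the empty-mask call simply fails with EINVAL.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        cpu_set_t set;

        /* 1. pin ourselves to CPU 0 so user_cpus_ptr gets populated */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
            perror("pin to CPU 0");
            return 1;
        }

        /* 2. pass an empty mask: with the patch this returns -ENODEV and
         *    resets the affinity to the cpuset default */
        CPU_ZERO(&set);
        if (sched_setaffinity(0, sizeof(set), &set))
            printf("empty mask: %s\n",
                   errno == ENODEV ? "reset to cpuset default (ENODEV)"
                                   : strerror(errno));
        return 0;
    }
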
Jan Stancek b7217f6931 Merge: sched/core: Provide sched_rtmutex() and expose sched work helpers
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2829

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

Avoid corrupting lock state due to blocking on a lock in sched_submit_work() while in the process of blocking on another lock.

Signed-off-by: Crystal Wood <swood@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-07-31 16:05:41 +02:00
Crystal Wood 09e4f82619 sched/core: Provide sched_rtmutex() and expose sched work helpers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

commit ca66ec3b9994e5f82b433697e37512f7d28b6d22
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Apr 27 13:19:34 2023 +0200

    sched/core: Provide sched_rtmutex() and expose sched work helpers

    schedule() invokes sched_submit_work() before scheduling and
    sched_update_worker() afterwards to ensure that queued block requests are
    flushed and the (IO)worker machineries can instantiate new workers if
    required. This avoids deadlocks and starvation.

    With rt_mutexes this can lead to a subtle problem:

      When an rtmutex blocks, current::pi_blocked_on points to the rtmutex it
      blocks on. When one of the functions in sched_submit/resume_work()
      contends on a rtmutex based lock then that would corrupt
      current::pi_blocked_on.

    Make it possible to let rtmutex issue the calls outside of the slowpath,
    i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:

      - Exposing sched_submit_work() and moving the task_running() condition
        into schedule()

      - Renaming sched_update_worker() to sched_resume_work() and exposing it
        too.

      - Providing sched_rtmutex() which just does the inner loop of scheduling
        until need_resched() is no longer set. Split out the loop so this does
        not create yet another copy.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Crystal Wood <swood@redhat.com>
2023-07-18 17:22:36 -05:00
Oleg Nesterov b85b393abb ptrace: Don't change __state
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 2500ad1c7fa42ad734677853961a3a8bec0772c5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 29 08:43:34 2022 -0500

    ptrace: Don't change __state

    Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
    command is executing.

    Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
    implement a new jobctl flag TASK_PTRACE_FROZEN.  This new flag is set
    in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
    jobctl_unfreeze_task (when ptrace_stop remains asleep).

    In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
    when the wake up is for a fatal signal.  Skip adding __TASK_TRACED
    when TASK_PTRACE_FROZEN is not set.  This has the same effect as
    changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
    TASK_KILLABLE go through signal_wake_up.

    Handle a ptrace_stop being called with a pending fatal signal.
    Previously it would have been handled by schedule simply failing to
    sleep.  As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
    will sleep with a fatal_signal_pending.   The code in signal_wake_up
    guarantees that the code will be awakened by any fatal signal that
    comes after TASK_TRACED is set.

    Previously the __state value of __TASK_TRACED was changed to
    TASK_RUNNING when woken up or back to TASK_TRACED when the code was
    left in ptrace_stop.  Now, when woken up, ptrace_stop clears
    JOBCTL_PTRACE_FROZEN, and when left sleeping, ptrace_unfreeze_traced
    clears JOBCTL_PTRACE_FROZEN.

    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:31 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve.  Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jan Stancek 3b12a1f1fc Merge: Scheduler updates for 9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2392

JIRA: https://issues.redhat.com/browse/RHEL-282
Tested: With scheduler stress tests. Perf QE is running performance regression tests.

Update the kernel's core scheduler and related code with fixes and minor changes from
the upstream kernel. This will sync up to roughly linux v6.3-rc6.  Added a couple of
cpumask things which fit better here.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:47 +02:00
Jan Stancek eeab15fa15 Merge: Scheduler uclamp and asym updates to v6.3-rc1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2337

JIRA: https://issues.redhat.com/browse/RHEL-310
Tested: scheduler stress tests.

This is a collection of commits that update (mostly)
the uclamp code in the scheduler. We don't have
CONFIG_UCLAMP_TASK enabled right now but we might in
the future. We do though have EAS enabled and this helps
keep the code in sync to reduce issues with other patches.

It's broken out of the main scheduler update for 9.3 to
keep it contained and make the other MR smaller.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-15 09:35:54 +02:00
Jan Stancek f58fc750ef Merge: Sched/psi: updates to v6.3-rc1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2325

JIRA: https://issues.redhat.com/browse/RHEL-311
Tested: Enabled PSI and ran various stress tests.

Updates and bug fixes for the PSI subsystem. This brings
the code up to about v6.3-rc1. It does not include the
runtime enablement interface (34f26a15611 "sched/psi: Per-cgroup
PSI accounting disable/re-enable interface") that required a larger
set of cgroup and kernfs patches. That may be taken later if the
prerequisites are provided.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-11 12:12:13 +02:00
Jeff Moyer 2b8780eae3 io_uring: move to separate directory
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit ed29b0b4fd835b058ddd151c49d021e28d631ee6
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon May 23 17:05:03 2022 -0600

    io_uring: move to separate directory
    
    In preparation for splitting io_uring up a bit, move it into its own
    top level directory. It didn't really belong in fs/ anyway, as it's
    not a file system only API.
    
    This adds io_uring/ and moves the core files in there, and updates the
    MAINTAINERS file for the new location.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 04:49:02 -04:00
Jan Stancek 567f50bcff Merge: sched/core: Fix arch_scale_freq_tick() on tickless systems
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2276

Bugzilla: https://bugzilla.redhat.com/1996625

commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f
Author: Yair Podemsky <ypodemsk@redhat.com>
Date:   Wed Nov 30 14:51:21 2022 +0200

    sched/core: Fix arch_scale_freq_tick() on tickless systems

    In order for the scheduler to be frequency invariant we measure the
    ratio between the maximum CPU frequency and the actual CPU frequency.

    During long tickless periods of time the calculations that keep track
    of that might overflow, in the function scale_freq_tick():

      if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
              goto error;

    eventually forcing the kernel to disable the feature for all CPUs,
    and show the warning message:

       "Scheduler frequency invariance went wobbly, disabling!".

    Let's avoid that by limiting the frequency invariant calculations
    to CPUs with regular tick.

    Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
    Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
    Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
    Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-25 06:58:25 +02:00
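
The quoted check is the kernel's check_shl_overflow(). A stand-alone C analogue of the same test shows why a counter that keeps accumulating over a long tickless period eventually trips it; the shift constant below is an illustrative stand-in for SCHED_CAPACITY_SHIFT.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define DEMO_CAPACITY_SHIFT 10

    /* true if (value << shift) would lose bits, otherwise store the result */
    static bool shl_overflows(uint64_t value, unsigned int shift, uint64_t *result)
    {
        if (shift >= 64)
            return value != 0;
        if (shift && (value >> (64 - shift)))
            return true;
        *result = value << shift;
        return false;
    }

    int main(void)
    {
        /* an APERF-style delta that grew huge during a long tickless stretch */
        uint64_t acnt = 1ULL << 45;
        uint64_t shifted;

        if (shl_overflows(acnt, 2 * DEMO_CAPACITY_SHIFT, &shifted))
            printf("overflow: frequency invariance would be disabled\n");
        else
            printf("ok: %llu\n", (unsigned long long)shifted);
        return 0;
    }
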
Phil Auld 02c4ba58b7 sched/fair: Sanitize vruntime of entity being migrated
JIRA: https://issues.redhat.com/browse/RHEL-282

commit a53ce18cacb477dd0513c607f187d16f0fa96f71
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Mar 17 17:08:10 2023 +0100

    sched/fair: Sanitize vruntime of entity being migrated

    Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
    fixes an overflow bug, but ignores a case where se->exec_start is reset
    after a migration.

    To fix this case, we delay the reset of se->exec_start until after
    placing the entity, which uses se->exec_start to detect a long-sleeping task.

    In order to take into account a possible divergence between the clock_task
    of 2 rqs, we increase the threshold to around 104 days.

    Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
    Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
    Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 14:16:00 -04:00
Phil Auld d3f2df660a sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 585463f0d58aa4d29b744c7c53b222b8028de87f
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Mon Oct 3 16:34:20 2022 +0100

    sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()

    This removes the second use of the sched_core_mask temporary mask.

    Suggested-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 10:04:09 -04:00
Phil Auld f089b6b716 sched/core: Fix a missed update of user_cpus_ptr
JIRA: https://issues.redhat.com/browse/RHEL-282

commit df14b7f9efcda35e59bb6f50351aac25c50f6e24
Author: Waiman Long <longman@redhat.com>
Date:   Fri Feb 3 13:18:49 2023 -0500

    sched/core: Fix a missed update of user_cpus_ptr

    Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
    cpumask"), a successful call to sched_setaffinity() should always save
    the user requested cpu affinity mask in a task's user_cpus_ptr. However,
    when the given cpu mask is the same as the current one, user_cpus_ptr
    is not updated. Fix this by saving the user mask in this case too.

    Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:20 -04:00
Phil Auld bf73c54d24 sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 5657c116783545fb49cd7004994c187128552b12
Author: Waiman Long <longman@redhat.com>
Date:   Sun Jan 15 14:31:22 2023 -0500

    sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs

    The kernel commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in
    do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP
    configs. Calling sched_setaffinity() on such a uniprocessor kernel will
    cause cpumask_copy() to be called with a NULL pointer leading to general
    protection fault. This is not really a problem in real use cases as
    there aren't that many uniprocessor kernel configs in use and calling
    sched_setaffinity() on such a uniprocessor system doesn't make sense.

    Fix this problem by making sure cpumask_copy() will not be called in
    such a case.

    Fixes: 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()")
    Reported-by: kernel test robot <yujie.liu@intel.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:20 -04:00
Phil Auld ffd9ddbf5a sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 160fb0d83f206b3429fc495864a022110f9e4978
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Fri Dec 23 18:32:57 2022 +0800

    sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()

    ttwu_do_activate() is used for a complete wakeup, in which we will
    activate_task() and use ttwu_do_wakeup() to mark the task runnable
    and perform wakeup-preemption, also call class->task_woken() callback
    and update the rq->idle_stamp.

    Since ttwu_runnable() is not a complete wakeup, it doesn't need all of
    that done in ttwu_do_wakeup(), so we can move it to ttwu_do_activate()
    to simplify ttwu_do_wakeup(), making it only mark the task runnable
    so it can be reused in ttwu_runnable() and try_to_wake_up().

    This patch should not have any functional changes.

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:20 -04:00
Phil Auld a09a99cf93 sched/core: Micro-optimize ttwu_runnable()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit efe09385864f3441c71711f91e621992f9423c01
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Fri Dec 23 18:32:56 2022 +0800

    sched/core: Micro-optimize ttwu_runnable()

    ttwu_runnable() is used as a fast wakeup path when the wakee task
    is running on CPU or runnable on RQ, in both cases we can just
    set its state to TASK_RUNNING to prevent a sleep.

    If the wakee task is on_cpu running, we don't need to update_rq_clock()
    or check_preempt_curr().

    But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before
    the task got to schedule() and the task been preempted), we should
    check_preempt_curr() to see if it can preempt the current running.

    This also removes the class->task_woken() callback from ttwu_runnable(),
    which wasn't required per the RT/DL implementations: any required push
    operation would have been queued during class->set_next_task() when p
    got preempted.

    ttwu_runnable() also loses the update to rq->idle_stamp, as by definition
    the rq cannot be idle in this scenario.

    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld 4881a62e1d sched: Make const-safe
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 904cbab71dda1689d41a240541179f21ff433c40
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Dec 12 14:49:46 2022 +0000

    sched: Make const-safe

    With a modified container_of() that preserves constness, the compiler
    finds some pointers which should have been marked as const.  task_of()
    also needs to become const-preserving for the !FAIR_GROUP_SCHED case so
    that cfs_rq_of() can take a const argument.  No change to generated code.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
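
For reference, a const-preserving container_of() can be sketched in stand-alone C in the spirit of the kernel's container_of_const(); this is an illustration rather than the exact kernel macro, and it relies on the GCC/Clang typeof extension.

    #include <stddef.h>
    #include <stdio.h>

    struct sched_entity_demo { int weight; };
    struct task_demo { int pid; struct sched_entity_demo se; };

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    /* const member pointer in, const containing-struct pointer out */
    #define container_of_const(ptr, type, member)                                    \
        _Generic(ptr,                                                                \
            const typeof(*(ptr)) *: ((const type *)container_of(ptr, type, member)), \
            default: ((type *)container_of(ptr, type, member)))

    /* task_of() in the const-preserving style the commit describes */
    static const struct task_demo *task_of(const struct sched_entity_demo *se)
    {
        return container_of_const(se, struct task_demo, se);
    }

    int main(void)
    {
        struct task_demo t = { .pid = 42 };

        printf("pid via const-safe task_of(): %d\n", task_of(&t.se)->pid);
        return 0;
    }
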
Phil Auld 6a7d52383a sched: Clear ttwu_pending after enqueue_task()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date:   Fri Nov 4 10:36:01 2022 +0800

    sched: Clear ttwu_pending after enqueue_task()

    We found a long tail latency in schbench when m*t is close to nr_cpus.
    (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

    This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
    too early, and idle_cpu() will return true until the wakee task is enqueued.
    This will mislead the waker when selecting idle cpu, and wake multiple
    worker threads on the same wakee cpu. This situation is enlarged by
    commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
    wakelist if wakee cpu is idle") because it tends to use wakelist.

    Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
    (Intel(R) Xeon(R) Platinum 8369B).

    Latency percentiles (usec):
                    base      base+revert_f3dd3f674555   base+this_patch
    50.0000th:         9                            13                 9
    75.0000th:        12                            19                12
    90.0000th:        15                            22                15
    95.0000th:        18                            24                17
    *99.0000th:       27                            31                24
    99.5000th:      3364                            33                27
    99.9000th:     12560                            36                30

    We also tested on unixbench and hackbench, and saw no performance
    change.

    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld a774282315 sched/fair: Cleanup loop_max and loop_break
JIRA: https://issues.redhat.com/browse/RHEL-282

commit c59862f8265f8060b6650ee1dc12159fe5c89779
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Aug 25 14:27:24 2022 +0200

    sched/fair: Cleanup loop_max and loop_break

    sched_nr_migrate_break is set to a fixed value and never changes so we can
    replace it by a define SCHED_NR_MIGRATE_BREAK.

    Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
    of sysctl_sched_nr_migrate which can be init to different values.

    Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.

    The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
    through debugfs.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld a9fcc51032 sched: Add TASK_ANY for wait_task_inactive()
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts:  Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").

commit f9fc8cad9728124cefe8844fb53d1814c92c6bfc
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 6 12:39:55 2022 +0200

    sched: Add TASK_ANY for wait_task_inactive()

    Now that wait_task_inactive()'s @match_state argument is a mask (like
    ttwu()) it is possible to replace the special !match_state case with
    an 'all-states' value such that any blocked state will match.

    Suggested-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld 7ab9d04d74 sched: Rename task_running() to task_on_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts:  Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").

commit 0b9d46fc5ef7a457cc635b30b010081228cb81ac
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 6 12:33:04 2022 +0200

    sched: Rename task_running() to task_on_cpu()

    There is some ambiguity about task_running() in that it is unrelated
    to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
    task_on_cpu().

    Suggested-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld 3f3a0eeee3 sched/fair: Allow changing cgroup of new forked task
JIRA: https://issues.redhat.com/browse/RHEL-282

commit df16b71c686cb096774e30153c9ce6756450796c
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Thu Aug 18 20:48:03 2022 +0800

    sched/fair: Allow changing cgroup of new forked task

    commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
    introduced a TASK_NEW state and an unnecessary limitation that would fail
    when changing the cgroup of a newly forked task.

    That was because, at that time, we couldn't handle task_change_group_fair()
    for a newly forked fair task which hadn't yet been woken up by
    wake_up_new_task(), which would cause a detach-on-an-unattached-task
    sched_avg problem.

    This patch deletes this unnecessary limitation by adding a check before
    doing detach or attach in task_change_group_fair().

    So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks,
    only define it in #ifdef CONFIG_RT_GROUP_SCHED.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com
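
    A sketch of the added guard (illustrative; SMP-only details are omitted):

        static void task_change_group_fair(struct task_struct *p)
        {
                /*
                 * A forked task that has not been woken up by
                 * wake_up_new_task() yet is not attached, so skip it.
                 */
                if (READ_ONCE(p->__state) == TASK_NEW)
                        return;

                detach_task_cfs_rq(p);
                set_task_rq(p, task_cpu(p));
                attach_task_cfs_rq(p);
        }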

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld fb17b0f886 sched/fair: Remove redundant cpu_cgrp_subsys->fork()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 39c4261191bf05e7eb310f852980a6d0afe5582a
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Thu Aug 18 20:47:58 2022 +0800

    sched/fair: Remove redundant cpu_cgrp_subsys->fork()

    We use cpu_cgrp_subsys->fork() to set task group for the new fair task
    in cgroup_post_fork().

    Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
    has already set_task_rq() for the new fair task in sched_cgroup_fork(),
    so cpu_cgrp_subsys->fork() can be removed.

      cgroup_can_fork()     --> pin parent's sched_task_group
      sched_cgroup_fork()
        __set_task_cpu()
          set_task_rq()
      cgroup_post_fork()
        ss->fork() := cpu_cgroup_fork()
          sched_change_group(..., TASK_SET_GROUP)
            task_set_group_fair()
              set_task_rq()  --> can be removed

    After this change, task_change_group_fair() only needs to care about
    task cgroup migration, making the code much simpler.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld 9b10d97986 sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 09348d75a6ce60eec85c86dd0ab7babc4db3caf6
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Aug 11 08:54:52 2022 +0200

    sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()

    There's no good reason to crash a user's system with a BUG_ON(),
    chances are high that they'll never even see the crash message on
    Xorg, and it won't make it into the syslog either.

    By using a WARN_ON_ONCE() we at least give the user a chance to report
    any bugs triggered here - instead of getting silent hangs.

    None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
    cases where a NULL check is done via a BUG_ON() and we let a NULL
    pointer through after a WARN_ON_ONCE().

    There's one exception: WARN_ON_ONCE() arguments with side-effects,
    such as locking - in this case we use the return value of the
    WARN_ON_ONCE(), such as in:

     -       BUG_ON(!lock_task_sighand(p, &flags));
     +       if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
     +               return;

    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld 30180b878d sched/fair: Make per-cpu cpumasks static
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 18c31c9711a90b48a77b78afb65012d9feec444c
Author: Bing Huang <huangbing@kylinos.cn>
Date:   Sat Jul 23 05:36:09 2022 +0800

    sched/fair: Make per-cpu cpumasks static

    The load_balance_mask and select_rq_mask percpu variables are only used in
    kernel/sched/fair.c.

    Make them static and move their allocation into init_sched_fair_class().

    Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the
    CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask
    allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL
    class (local_cpu_mask_dl in init_sched_dl_class()).

    [ mingo: Tidied up changelog & touched up the code. ]

    Signed-off-by: Bing Huang <huangbing@kylinos.cn>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com
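
    A sketch of the resulting allocation (illustrative, trimmed to the two
    masks named above):

        /* kernel/sched/fair.c */
        static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
        static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);

        __init void init_sched_fair_class(void)
        {
        #ifdef CONFIG_SMP
                int i;

                for_each_possible_cpu(i) {
                        zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i),
                                                GFP_KERNEL, cpu_to_node(i));
                        zalloc_cpumask_var_node(&per_cpu(select_rq_mask, i),
                                                GFP_KERNEL, cpu_to_node(i));
                }
        #endif
        }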

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:16 -04:00
Phil Auld 680e019203 sched/debug: Print each field value left-aligned in sched_show_task()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 0f03d6805bfc454279169a1460abb3f6b3db317f
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Wed Jul 27 14:08:19 2022 +0800

    sched/debug: Print each field value left-aligned in sched_show_task()

    Currently, the values of some fields are printed right-aligned, causing
    the field value to be next to the next field name rather than next to its
    own field name. So print each field value left-aligned, to make it more
    readable.

     Before:
            stack:    0 pid:  307 ppid:     2 flags:0x00000008
     After:
            stack:0     pid:308   ppid:2      flags:0x0000000a

    This also makes them print in the same style as the other two fields:

            task:demo0           state:R  running task

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com
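
    The change boils down to switching the printk field widths from
    right-aligned to left-aligned (the exact field list below is an
    assumption, shown for illustration only):

        /* right-aligned: the value ends up next to the following field name */
        printk(KERN_INFO "stack:%5lu pid:%5d ppid:%6d flags:0x%08lx\n", ...);

        /* left-aligned: each value stays next to its own field name */
        printk(KERN_INFO "stack:%-5lu pid:%-5d ppid:%-6d flags:0x%08lx\n", ...);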

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:16 -04:00
Phil Auld 78210daf7b sched: Snapshot thread flags
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 0569b245132c40015281610353935a50e282eb94
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Nov 29 13:06:45 2021 +0000

    sched: Snapshot thread flags

    Some thread flags can be set remotely, and so even when IRQs are disabled,
    the flags can change under our feet. Generally this is unlikely to cause a
    problem in practice, but it is somewhat unsound, and KCSAN will
    legitimately warn that there is a data race.

    To avoid such issues, a snapshot of the flags has to be taken prior to
    using them. Some places already use READ_ONCE() for that, others do not.

    Convert them all to the new flag accessor helpers.

    The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in
    set_nr_if_polling() is left as-is for clarity.

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com
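
    A sketch of the conversion pattern, assuming the read_thread_flags()
    accessor introduced by this series:

     -       if (current_thread_info()->flags & _TIF_NEED_RESCHED)
     +       if (read_thread_flags() & _TIF_NEED_RESCHED)
                     schedule();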

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:16 -04:00
Phil Auld cb73223615 sched/core: Adjusting the order of scanning CPU
JIRA: https://issues.redhat.com/browse/RHEL-310

commit 8589018acc65e5ddfd111f0a7ee85f9afde3a830
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Fri Dec 16 14:24:06 2022 +0800

    sched/core: Adjusting the order of scanning CPU

    When select_idle_capacity() starts scanning for an idle CPU, it starts
    with the target CPU that has already been checked in
    select_idle_sibling(). So start checking from the next CPU and try the
    target CPU at the end. Similarly for task_numa_assign(), we have just
    checked numa_migrate_on of dst_cpu, so start from the next CPU. The same
    applies to steal_cookie_task(): the first scan is guaranteed to fail, so
    start directly from the next CPU.

    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com
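
    A sketch of the scan shape (illustrative; for_each_cpu_wrap() is assumed
    to be the iterator used):

        /* start one past the already-checked CPU and visit it last */
        for_each_cpu_wrap(cpu, cpus, target + 1) {
                /* per-CPU idle / capacity checks go here */
        }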

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-17 16:14:35 -04:00
Phil Auld 11d3f0cf26 sched: Introduce struct balance_callback to avoid CFI mismatches
JIRA: https://issues.redhat.com/browse/RHEL-310

commit 8e5bad7dccec2014f24497b57d8a8ee0b752c290
Author: Kees Cook <keescook@chromium.org>
Date:   Fri Oct 7 17:07:58 2022 -0700

    sched: Introduce struct balance_callback to avoid CFI mismatches

    Introduce distinct struct balance_callback instead of performing function
    pointer casting which will trip CFI. Avoids warnings as found by Clang's
    future -Wcast-function-type-strict option:

    In file included from kernel/sched/core.c:84:
    kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict]
            head->func = (void (*)(struct callback_head *))func;
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    No binary differences result from this change.

    This patch is a cleanup based on Brad Spengler/PaX Team's modifications
    to sched code in their last public patch of grsecurity/PaX based on my
    understanding of the code. Changes or omissions from the original code
    are mine and don't reflect the original grsecurity/PaX code.

    Reported-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Nathan Chancellor <nathan@kernel.org>
    Link: https://github.com/ClangBuiltLinux/linux/issues/1724
    Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org
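
    A sketch of the dedicated type and its queueing helper (trimmed; the
    balance_push checks are elided):

        /* kernel/sched/sched.h */
        struct balance_callback {
                struct balance_callback *next;
                void (*func)(struct rq *rq);
        };

        static inline void
        queue_balance_callback(struct rq *rq, struct balance_callback *head,
                               void (*func)(struct rq *rq))
        {
                lockdep_assert_rq_held(rq);

                head->func = func;      /* types match, no cast needed */
                head->next = rq->balance_callback;
                rq->balance_callback = head;
        }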

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-10 11:35:02 -04:00
Phil Auld 830c2b71ea sched/uclamp: Fix fits_capacity() check in feec()
JIRA: https://issues.redhat.com/browse/RHEL-310

commit 244226035a1f9b2b6c326e55ae5188fab4f428cb
Author: Qais Yousef <qyousef@layalina.io>
Date:   Thu Aug 4 15:36:03 2022 +0100

    sched/uclamp: Fix fits_capacity() check in feec()

    As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
    it'll always pick the previous CPU because fits_capacity() will always
    return false in this case.

    The new util_fits_cpu() logic should handle this correctly for us, besides
    more corner cases where similar failures could occur, like when using
    UCLAMP_MAX.

    We open code uclamp_rq_util_with() except for the clamp() part, since
    util_fits_cpu() needs the 'raw' values to be passed to it.

    Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
    value for the rq. Makes the code more readable and ensures the right
    rules (use READ_ONCE/WRITE_ONCE) are respected transparently.

    [1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html

    Fixes: 1d42509e47 ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
    Reported-by: Yun Hsiang <hsiang023167@gmail.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com
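
    A sketch of the shorthand accessors described above:

        static inline unsigned long uclamp_rq_get(struct rq *rq,
                                                  enum uclamp_id clamp_id)
        {
                return READ_ONCE(rq->uclamp[clamp_id].value);
        }

        static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
                                         unsigned int value)
        {
                WRITE_ONCE(rq->uclamp[clamp_id].value, value);
        }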

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-10 11:35:02 -04:00
Phil Auld 20844af32a sched/psi: Use task->psi_flags to clear in CPU migration
JIRA: https://issues.redhat.com/browse/RHEL-311

commit 52b33d87b9197c51e8ffdc61873739d90dd0a16f
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Mon Sep 26 16:19:31 2022 +0800

    sched/psi: Use task->psi_flags to clear in CPU migration

    The commit d583d360a6 ("psi: Fix psi state corruption when schedule()
    races with cgroup move") fixed a race problem by making cgroup_move_task()
    use task->psi_flags instead of looking at the scheduler state.

    We can extend task->psi_flags usage to CPU migration, which should be
    a minor optimization for performance and code simplicity.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lore.kernel.org/r/20220926081931.45420-1-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-07 09:17:26 -04:00
Phil Auld c345ba1c0c sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
JIRA: https://issues.redhat.com/browse/RHEL-311

commit 52b1364ba0b105122d6de0e719b36db705011ac1
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Fri Aug 26 00:41:08 2022 +0800

    sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure

    PSI already tracks workload pressure stall information for CPU,
    memory and IO. Apart from these, IRQ/SOFTIRQ can have an obvious
    impact on the productivity of some workloads, such as web service
    workloads.

    With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta time
    from update_rq_clock_task(), where we can record that delta against the
    CPU's current task's cgroups as PSI_IRQ_FULL status.

    Note we don't use PSI_IRQ_SOME: since IRQ/SOFTIRQ always run in the
    context of the current task on the CPU, nothing productive can run even
    if it were runnable, so only PSI_IRQ_FULL is used.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
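
    A sketch of where the accounting hooks in; psi_account_irqtime() and its
    exact signature are assumptions based on the description above:

        /* kernel/sched/core.c:update_rq_clock_task() */
        #ifdef CONFIG_IRQ_TIME_ACCOUNTING
                if (irq_delta)
                        psi_account_irqtime(rq->curr, irq_delta);
        #endif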

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-07 09:17:25 -04:00
Phil Auld e59312ac89 sched/core: Fix arch_scale_freq_tick() on tickless systems
Bugzilla: https://bugzilla.redhat.com/1996625

commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f
Author: Yair Podemsky <ypodemsk@redhat.com>
Date:   Wed Nov 30 14:51:21 2022 +0200

    sched/core: Fix arch_scale_freq_tick() on tickless systems

    In order for the scheduler to be frequency invariant we measure the
    ratio between the maximum CPU frequency and the actual CPU frequency.

    During long tickless periods of time the calculations that keep track
    of that might overflow, in the function scale_freq_tick():

      if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
              goto error;

    eventually forcing the kernel to disable the feature for all CPUs,
    and show the warning message:

       "Scheduler frequency invariance went wobbly, disabling!".

    Let's avoid that by limiting the frequency invariant calculations
    to CPUs with regular tick.

    Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
    Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
    Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
    Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com
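
    A sketch of the guard described above; the housekeeping_cpu() check is an
    assumption about how "CPUs with regular tick" is expressed:

        /* kernel/sched/core.c:scheduler_tick() */
        if (housekeeping_cpu(cpu, HK_TYPE_TICK))
                arch_scale_freq_tick();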

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-03-30 11:52:21 -04:00
Waiman Long 415317267b sched/debug: Show the registers of 'current' in dump_cpu_task()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit bc1cca97e6da6c7c34db7c5b864bb354ca5305ac
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Thu, 4 Aug 2022 10:34:20 +0800

    sched/debug: Show the registers of 'current' in dump_cpu_task()

    The dump_cpu_task() function does not print registers on architectures
    that do not support NMIs.  However, registers can be useful for
    debugging.  Fortunately, in the case where dump_cpu_task() is invoked
    from an interrupt handler and is dumping the current CPU's stack, the
    get_irq_regs() function can be used to get the registers.

    Therefore, this commit makes dump_cpu_task() check to see if it is being
    asked to dump the current CPU's stack from within an interrupt handler,
    and, if so, it uses the get_irq_regs() function to obtain the registers.
    On systems that do support NMIs, this commit has the further advantage
    of avoiding a self-NMI in this case.

    This is an example of rcu self-detected stall on arm64, which does not
    support NMIs:
    [   27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU
    [   27.502238] rcu:     0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619
    [   27.502632]  (t=1251 jiffies g=2989 q=29 ncpus=4)
    [   27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46
    [   27.504732] Hardware name: linux,dummy-virt (DT)
    [   27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [   27.504998] pc : arch_counter_read+0x18/0x24
    [   27.505301] lr : arch_counter_read+0x18/0x24
    [   27.505328] sp : ffff80000b29bdf0
    [   27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000
    [   27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
    [   27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0
    [   27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff
    [   27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296
    [   27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426
    [   27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18
    [   27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013
    [   27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017
    [   27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653
    [   27.505937] Call trace:
    [   27.506002]  arch_counter_read+0x18/0x24
    [   27.506171]  ktime_get+0x48/0xa0
    [   27.506207]  test_task+0x70/0xf0
    [   27.506227]  kthread+0x10c/0x110
    [   27.506243]  ret_from_fork+0x10/0x20

    This is a marked improvement over the old output:
    [   27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU
    [   27.944980] rcu:     0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614
    [   27.945407]  (t=1251 jiffies g=2681 q=28 ncpus=4)
    [   27.945731] Task dump for CPU 0:
    [   27.945844] task:test0           state:R  running task     stack:    0 pid:  306 ppid:     2 flags:0x0000000a
    [   27.946073] Call trace:
    [   27.946151]  dump_backtrace.part.0+0xc8/0xd4
    [   27.946378]  show_stack+0x18/0x70
    [   27.946405]  sched_show_task+0x150/0x180
    [   27.946427]  dump_cpu_task+0x44/0x54
    [   27.947193]  rcu_dump_cpu_stacks+0xec/0x130
    [   27.947212]  rcu_sched_clock_irq+0xb18/0xef0
    [   27.947231]  update_process_times+0x68/0xac
    [   27.947248]  tick_sched_handle+0x34/0x60
    [   27.947266]  tick_sched_timer+0x4c/0xa4
    [   27.947281]  __hrtimer_run_queues+0x178/0x360
    [   27.947295]  hrtimer_interrupt+0xe8/0x244
    [   27.947309]  arch_timer_handler_virt+0x38/0x4c
    [   27.947326]  handle_percpu_devid_irq+0x88/0x230
    [   27.947342]  generic_handle_domain_irq+0x2c/0x44
    [   27.947357]  gic_handle_irq+0x44/0xc4
    [   27.947376]  call_on_irq_stack+0x2c/0x54
    [   27.947415]  do_interrupt_handler+0x80/0x94
    [   27.947431]  el1_interrupt+0x34/0x70
    [   27.947447]  el1h_64_irq_handler+0x18/0x24
    [   27.947462]  el1h_64_irq+0x64/0x68                       <--- the above backtrace is worthless
    [   27.947474]  arch_counter_read+0x18/0x24
    [   27.947487]  ktime_get+0x48/0xa0
    [   27.947501]  test_task+0x70/0xf0
    [   27.947520]  kthread+0x10c/0x110
    [   27.947538]  ret_from_fork+0x10/0x20

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
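
    A sketch of the resulting behaviour (illustrative, not the verbatim diff):

        void dump_cpu_task(int cpu)
        {
                if (cpu == smp_processor_id() && in_hardirq()) {
                        struct pt_regs *regs = get_irq_regs();

                        /* dumping our own CPU from IRQ context: use the regs */
                        if (regs) {
                                show_regs(regs);
                                return;
                        }
                }

                pr_info("Task dump for CPU %d:\n", cpu);
                sched_show_task(cpu_curr(cpu));
        }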

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:47:58 -04:00
Waiman Long c6babad818 sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit e73dfe30930b75c98746152e7a2f6a8ab6067b51
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Thu, 4 Aug 2022 10:34:19 +0800

    sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()

    The trigger_all_cpu_backtrace() function attempts to send an NMI to the
    target CPU, which usually provides much better stack traces than the
    dump_cpu_task() function's approach of dumping that stack from some other
    CPU.  So much so that most calls to dump_cpu_task() only happen after
    a call to trigger_all_cpu_backtrace() has failed.  And the exception to
    this rule really should attempt to use trigger_all_cpu_backtrace() first.

    Therefore, move the trigger_all_cpu_backtrace() invocation into
    dump_cpu_task().

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
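
    A sketch of the fallback order after the move (illustrative):

        void dump_cpu_task(int cpu)
        {
                /* prefer an NMI backtrace on the target CPU */
                if (trigger_single_cpu_backtrace(cpu))
                        return;

                /* otherwise dump the stack remotely, as before */
                pr_info("Task dump for CPU %d:\n", cpu);
                sched_show_task(cpu_curr(cpu));
        }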

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:47:57 -04:00
Waiman Long 6ddf329bf5 context_tracking: Split user tracking Kconfig
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516
Conflicts:
 1) All the hunks from arch/mips/Kconfig, arch/loongarch/Kconfig and
    arch/xtensa/* as they are unsupported arches and cannot be applied
    directly.
 2) Context diffs in arch/arm/Kconfig, arch/riscv/Kconfig,
    arch/riscv/kernel/entry.S and arch/x86/Kconfig.

commit 24a9c54182b3758801b8ca6c8c237cc2ff654732
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 8 Jun 2022 16:40:24 +0200

    context_tracking: Split user tracking Kconfig

    Context tracking is going to be used not only to track user transitions
    but also idle/IRQs/NMIs. The user tracking part will then become a
    separate feature. Prepare Kconfig for that.

    [ frederic: Apply Max Filippov feedback. ]

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
    Cc: Yu Liao <liaoyu15@huawei.com>
    Cc: Phil Auld <pauld@redhat.com>
    Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
    Cc: Alex Belits <abelits@marvell.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:16 -04:00
Waiman Long 0ceb37b5ca rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit e386b6725798eec07facedf4d4bb710c079fd25c
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu, 2 Jun 2022 17:30:01 -0700

    rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs

    Currently, the RCU Tasks Trace grace-period kthread IPIs each online CPU
    using smp_call_function_single() in order to track any tasks currently in
    RCU Tasks Trace read-side critical sections during which the corresponding
    task has neither blocked nor been preempted.  These IPIs are annoying
    and are also not strictly necessary because any task that blocks or is
    preempted within its current RCU Tasks Trace read-side critical section
    will be tracked on one of the per-CPU rcu_tasks_percpu structure's
    ->rtp_blkd_tasks list.  So the only time that this is a problem is if
    one of the CPUs runs through a long-duration RCU Tasks Trace read-side
    critical section without a context switch.

    Note that the task_call_func() function cannot help here because there is
    no safe way to identify the target task.  Of course, the task_call_func()
    function will be very useful later, when processing the list of tasks,
    but it needs to know the task.

    This commit therefore creates a cpu_curr_snapshot() function that returns
    a pointer to the task_struct structure of some task that happened to be
    running on the specified CPU more or less during the time that the
    cpu_curr_snapshot() function was executing.  If there was no context
    switch during this time, this function will return a pointer to the
    task_struct structure of the task that was running throughout.  If there
    was a context switch, then the outgoing task will be taken care of by
    RCU's context-switch hook, and the incoming task was either already taken
    care of during some previous context switch, or it is not currently within an
    RCU Tasks Trace read-side critical section.  And in this latter case, the
    grace period already started, so there is no need to wait on this task.

    This new cpu_curr_snapshot() function is invoked on each CPU early in
    the RCU Tasks Trace grace-period processing, and the resulting tasks
    are queued for later quiescent-state inspection.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Cc: Martin KaFai Lau <kafai@fb.com>
    Cc: KP Singh <kpsingh@kernel.org>
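
    A sketch of the snapshot helper described above:

        struct task_struct *cpu_curr_snapshot(int cpu)
        {
                struct task_struct *t;

                smp_mb();       /* pairing chosen by the caller's design */
                t = rcu_dereference(cpu_curr(cpu));
                smp_mb();       /* pairing chosen by the caller's design */
                return t;
        }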

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:12 -04:00
Juri Lelli 6bc27040eb sched: Add support for lazy preemption
Bugzilla: https://bugzilla.redhat.com/2171995
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

commit ea622076b76f25526278b448dc8326db01758c0a
Author:    Thomas Gleixner <tglx@linutronix.de>
Date:      Fri Oct 26 18:50:54 2012 +0100

    sched: Add support for lazy preemption

    It has become an obsession to mitigate the determinism vs. throughput
    loss of RT. Looking at the mainline semantics of preemption points
    gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER
    tasks. One major issue is the wakeup of tasks which are right away
    preempting the waking task while the waking task holds a lock on which
    the woken task will block right after having preempted the wakee. In
    mainline this is prevented due to the implicit preemption disable of
    spin/rw_lock held regions. On RT this is not possible due to the fully
    preemptible nature of sleeping spinlocks.

    Though for a SCHED_OTHER task preempting another SCHED_OTHER task this
    is really not a correctness issue. RT folks are concerned about
    SCHED_FIFO/RR tasks preemption and not about the purely fairness
    driven SCHED_OTHER preemption latencies.

    So I introduced a lazy preemption mechanism which only applies to
    SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the
    existing preempt_count each task now sports a preempt_lazy_count
    which is manipulated on lock acquisition and release. This is slightly
    incorrect as for laziness reasons I coupled this to
    migrate_disable/enable so some other mechanisms get the same treatment
    (e.g. get_cpu_light).

    Now on the scheduler side, instead of setting NEED_RESCHED this sets
    NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and
    therefore allows the waking task to exit the lock-held region before
    the woken task preempts. That also works better for cross-CPU wakeups
    as the other side can stay in the adaptive spinning loop.

    For RT class preemption there is no change. This simply sets
    NEED_RESCHED and forgoes the lazy preemption counter.

    Initial tests do not expose any observable latency increase, but
    history shows that I've been proven wrong before :)

    The lazy preemption mode is on by default, but with
    CONFIG_SCHED_DEBUG enabled it can be disabled via:

     # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

    and reenabled via

     # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

    The test results so far are very machine and workload dependent, but
    there is a clear trend that it enhances the non RT workload
    performance.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
2023-02-27 13:46:09 +01:00
Juri Lelli 3673cc2e61 sched: Consider task_struct::saved_state in wait_task_inactive().
Bugzilla: https://bugzilla.redhat.com/2171995
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

commit a015745ca41f057beb9650166271fc6188f33d9b
Author:    Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:      Wed Jun 22 12:27:05 2022 +0200

    sched: Consider task_struct::saved_state in wait_task_inactive().

    Ptrace is using wait_task_inactive() to wait for the tracee to reach a
    certain task state. On PREEMPT_RT that state may be stored in
    task_struct::saved_state while the tracee blocks on a sleeping lock and
    task_struct::__state is set to TASK_RTLOCK_WAIT.
    It is not possible to check only for TASK_RTLOCK_WAIT to be sure that the task
    is blocked on a sleeping lock, because during wake up (after the sleeping lock
    has been acquired) the task state is set to TASK_RUNNING. After the task is on
    the CPU and has acquired the pi_lock it will reset the state accordingly, but
    until then TASK_RUNNING will be observed (with the desired state saved in
    saved_state).

    Check also for task_struct::saved_state if the desired match was not found in
    task_struct::__state on PREEMPT_RT. If the state was found in saved_state, wait
    until the task is idle and state is visible in task_struct::__state.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lkml.kernel.org/r/Yt%2FpQAFQ1xKNK0RY@linutronix.de
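
    A conceptual sketch of the extra check (the real change also takes
    p->pi_lock around the saved_state read and waits until the state becomes
    visible in __state):

        unsigned int state = READ_ONCE(p->__state);
        bool match = !!(state & match_state);

        if (!match && IS_ENABLED(CONFIG_PREEMPT_RT))
                match = !!(READ_ONCE(p->saved_state) & match_state);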

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
2023-02-27 13:46:08 +01:00
Waiman Long 22c20d7c8a sched/core: Use kfree_rcu() in do_set_cpus_allowed()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143847
Upstream Status: tip commit 9a5418bc48babb313d2a62df29ebe21ce8c06c59

commit 9a5418bc48babb313d2a62df29ebe21ce8c06c59
Author: Waiman Long <longman@redhat.com>
Date:   Fri, 30 Dec 2022 23:11:20 -0500

    sched/core: Use kfree_rcu() in do_set_cpus_allowed()

    Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
    do_set_cpus_allowed()") may call kfree() if user_cpus_ptr was previously
    set. Unfortunately, some of the callers of do_set_cpus_allowed()
    may have pi_lock held when calling it. So the following splats may be
    printed especially when running with a PREEMPT_RT kernel:

       WARNING: possible circular locking dependency detected
       BUG: sleeping function called from invalid context

    To avoid these problems, kfree_rcu() is used instead. An internal
    cpumask_rcuhead union is created for the sole purpose of facilitating
    the use of kfree_rcu() to free the cpumask.

    Since user_cpus_ptr is not being used in non-SMP configs, the newly
    introduced alloc_user_cpus_ptr() helper will return NULL in this case
    and sched_setaffinity() is modified to handle this special case.

    Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20221231041120.440785-3-longman@redhat.com
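
    A sketch of the union and the call it enables (illustrative):

        union cpumask_rcuhead {
                cpumask_t       cpumask;
                struct rcu_head rcu;
        };

        /* in do_set_cpus_allowed(), instead of a plain kfree(): */
        kfree_rcu((union cpumask_rcuhead *)user_mask, rcu);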

Signed-off-by: Waiman Long <longman@redhat.com>
2023-01-09 14:59:01 -05:00
Waiman Long f3e0ad343d sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143847
Upstream Status: tip commit 87ca4f9efbd7cc649ff43b87970888f2812945b8

commit 87ca4f9efbd7cc649ff43b87970888f2812945b8
Author: Waiman Long <longman@redhat.com>
Date:   Fri, 30 Dec 2022 23:11:19 -0500

    sched/core: Fix use-after-free bug in dup_user_cpus_ptr()

    Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be
    restricted on asymmetric systems"), the setting and clearing of
    user_cpus_ptr are done under pi_lock for arm64 architecture. However,
    dup_user_cpus_ptr() accesses user_cpus_ptr without any lock
    protection. Since sched_setaffinity() can be invoked from another
    process, the process being modified may be undergoing fork() at
    the same time.  When racing with the clearing of user_cpus_ptr in
    __set_cpus_allowed_ptr_locked(), it can lead to use-after-free and
    possibly double-free in arm64 kernel.

    Commit 8f9ea86fdf99 ("sched: Always preserve the user requested
    cpumask") fixes this problem as user_cpus_ptr, once set, will never
    be cleared in a task's lifetime. However, this bug was re-introduced
    in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
    do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in
    do_set_cpus_allowed(). This time, it will affect all arches.

    Fix this bug by always clearing the user_cpus_ptr of the newly
    cloned/forked task before the copying process starts and check the
    user_cpus_ptr state of the source task under pi_lock.

    Note to stable, this patch won't be applicable to stable releases.
    Just copy the new dup_user_cpus_ptr() function over.

    Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems")
    Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
    Reported-by: David Wang 王标 <wangbiao3@xiaomi.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
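
    A sketch of the described fix; alloc_user_cpus_ptr() is the helper from
    the companion commit above:

        int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
                              int node)
        {
                cpumask_t *user_mask;
                unsigned long flags;

                /* always clear the destination first */
                dst->user_cpus_ptr = NULL;

                if (data_race(!src->user_cpus_ptr))
                        return 0;

                user_mask = alloc_user_cpus_ptr(node);
                if (!user_mask)
                        return -ENOMEM;

                /* copy the source mask only while holding its pi_lock */
                raw_spin_lock_irqsave(&src->pi_lock, flags);
                if (src->user_cpus_ptr) {
                        swap(dst->user_cpus_ptr, user_mask);
                        cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
                }
                raw_spin_unlock_irqrestore(&src->pi_lock, flags);

                kfree(user_mask);       /* NULL here if consumed by swap() */
                return 0;
        }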

Signed-off-by: Waiman Long <longman@redhat.com>
2023-01-09 14:58:58 -05:00
Frantisek Hrbata e73e910ed6 Merge: Scheduler updates for 9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1582

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2115520
Depends: !1372
Tested: Scheduler stress and regression tests. Ran nohz full and numa balancing
tests. Perf QE ran performance regression tests.

Omitted-fix: 62ebaf2f9261 ("ath6kl: avoid flush_scheduled_work() usage") Not enabled in rhel config.
Omitted-fix: 0538fa09bb10 ("gpu/drm/bridge/cadence: avoid flush_scheduled_work() usage")
Omitted-fix: a4345557527f ("scsi: qla2xxx: Always wait for qlt_sess_work_fn() from qlt_stop_phase1()")

These 3 just reference c4f135d64382 ("workqueue: Wrap flush_workqueue() using a macro") in
commit log since they depend on it.

Series to keep core scheduler code close to upstream linux and to apply potentially needed
fixes.  This brings the scheduler and some related code up to roughly 6.0.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-17 03:54:24 -05:00
Frantisek Hrbata 10ec0ed632 Merge: Update drivers/powercap to enable support for Arm SystemReady IR platforms
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1372

```
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2126952

This is one of a series of patch sets to enable Arm SystemReady IR
support in the kernel for compliant platforms.  This set cleans up
powercap and enables DTPM for edge systems to use in thermal and power
management; this is all in drivers/powercap.  This set has been tested
via simple boot tests, and of course the CI loop.  This may be difficult
to test on Arm due to DTPM being a very new feature.  However, this is
exactly the same powercap framework used by intel_rapl, which should
continue to function properly regardless.

Signed-off-by: Al Stone <ahs3@redhat.com>
```

Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-15 07:30:57 -05:00
Phil Auld a6fd92ef44 sched/core: Do not requeue task on CPU excluded from cpus_mask
Bugzilla: https://bugzilla.redhat.com/2115520

commit 751d4cbc43879229dbc124afefe240b70fd29a85
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Thu Aug 4 10:21:19 2022 +0100

    sched/core: Do not requeue task on CPU excluded from cpus_mask

    The following warning was triggered on a large machine early in boot on
    a distribution kernel but the same problem should also affect mainline.

       WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440
       Call Trace:
        <TASK>
        rescuer_thread+0x1f6/0x360
        kthread+0x156/0x180
        ret_from_fork+0x22/0x30
        </TASK>

    Commit c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    optimises ttwu by queueing a task that is descheduling on the wakelist,
    but does not check if the task descheduling is still allowed to run on that CPU.

    In this warning, the problematic task is a workqueue rescue thread which
    checks if the rescue is for a per-cpu workqueue and running on the wrong CPU.
    While this is early in boot and it should be possible to create workers,
    the rescue thread may still be used if MAYDAY_INITIAL_TIMEOUT is reached
    or at MAYDAY_INTERVAL, and on a sufficiently large machine the rescue
    thread is used frequently.

    Tracing confirmed that the task should have migrated properly using the
    stopper thread to handle the migration. However, a parallel wakeup from udev
    running on another CPU that does not share CPU cache observes p->on_cpu and
    uses task_cpu(p), queues the task on the old CPU and triggers the warning.

    Check that the wakee task that is descheduling is still allowed to run
    on its current CPU and if not, wait for the descheduling to complete
    and select an allowed CPU.

    Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net
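
    A sketch of the added affinity check in the wakelist path (illustrative):

        /* in ttwu_queue_cond(): the wakee must still be allowed on that CPU */
        if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                return false;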

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:41 -04:00
Phil Auld 303a0ad632 sched/core: Always flush pending blk_plug
Bugzilla: https://bugzilla.redhat.com/2115520

commit 401e4963bf45c800e3e9ea0d3a0289d738005fd4
Author: John Keeping <john@metanate.com>
Date:   Fri Jul 8 17:27:02 2022 +0100

    sched/core: Always flush pending blk_plug

    With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
    normal priority tasks (SCHED_OTHER, nice level zero):

            INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
                  Not tainted 5.15.49-rt46 #1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
            Workqueue: writeback wb_workfn (flush-7:0)
            [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
            [<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
            [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
            +(rt_mutex_slowlock.constprop.0+0xac/0x174)
            [<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
            [<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
            [<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
            [<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
            [<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
            [<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
            [<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
            [<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
            [<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
            [<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
            Exception stack(0xc10e3fb0 to 0xc10e3ff8)
            3fa0:                                     00000000 00000000 00000000 00000000
            3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
            3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

            INFO: task tar:2083 blocked for more than 491 seconds.
                  Not tainted 5.15.49-rt46 #1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
            [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
            [<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
            [<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
            [<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
            [<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
            [<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
            [<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
            [<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
            [<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
            [<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
            [<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
            [<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
            Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
            bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
            bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
            bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

    Here the kworker is waiting on msdos_sb_info::s_lock which is held by
    tar which is in turn waiting for a buffer which is locked waiting to be
    flushed, but this operation is plugged in the kworker.

    The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
    return false on !RT and thus the behaviour changes for RT.

    It seems that the intent here is to skip blk_flush_plug() in the case
    where a non-preemptible lock (such as a spinlock) has been converted to
    a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
    schedule flag.  But sched_submit_work() is only called from schedule()
    which is never called in this scenario, so the check can simply be
    deleted.

    Looking at the history of the -rt patchset, in fact this change was
    present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
    of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
    ("sched/core: Rework the __schedule() preempt argument").

    As described in [1]:

       The schedule process must distinguish between blocking on a regular
       sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
       and rwlock):
       - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
         even if blocked on a lock. This can not deadlock because this also
         happens for non-RT.
         There should be a warning if the scheduling point is within a RCU read
         section.

       - spinlock and rwlock must not flush block requests. This will deadlock
         if the callback attempts to acquire a lock which is already acquired.
         Similarly to being preempted, there should be no warning if the
         scheduling point is within a RCU read section.

    and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
    issue.

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

    Signed-off-by: John Keeping <john@metanate.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:40 -04:00
Phil Auld a1b1e51378 sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
Bugzilla: https://bugzilla.redhat.com/2115520

commit c02d5546ea34d589c83eda5055dbd727a396642b
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Wed Jun 29 17:15:52 2022 +0200

    sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling

    Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
    set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag,
    so this change saves a compare after cmpxchg.

    The definition of cmpxchg based fetch_or was changed in the
    same way as atomic_fetch_##op definitions were changed
    in e6790e4b5d.

    Also declare these two functions as inline to ensure inlining. In the
    case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart
    itself by constructing the boolean return value with logic operations
    on the fetched value, and these extra operations enlarge the function
    over the inlining threshold value.

    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
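
    A sketch of the resulting try_cmpxchg() loop shape:

        static inline bool set_nr_if_polling(struct task_struct *p)
        {
                struct thread_info *ti = task_thread_info(p);
                typeof(ti->flags) val = READ_ONCE(ti->flags);

                do {
                        if (!(val & _TIF_POLLING_NRFLAG))
                                return false;
                        if (val & _TIF_NEED_RESCHED)
                                return true;
                } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

                return true;
        }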

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:40 -04:00
Phil Auld 9798738b93 sched/fair: Rename select_idle_mask to select_rq_mask
Bugzilla: https://bugzilla.redhat.com/2115520

commit ec4fc801a02d96180c597238fe87141471b70971
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Thu Jun 23 11:11:02 2022 +0200

    sched/fair: Rename select_idle_mask to select_rq_mask

    On 21/06/2022 11:04, Vincent Donnefort wrote:
    > From: Dietmar Eggemann <dietmar.eggemann@arm.com>

    https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered
    that this patch doesn't build anymore (on tip sched/core or linux-next)
    because of commit f5b2eeb499910 ("sched/fair: Consider CPU affinity when
    allowing NUMA imbalance in find_idlest_group()").

    New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to
    select_rq_mask below.

    -- >8 --

    Decouple the name of the per-cpu cpumask select_idle_mask from its usage
    in select_idle_[cpu/capacity]() of the CFS run-queue selection
    (select_task_rq_fair()).

    This is to support the reuse of this cpumask in the Energy Aware
    Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
    selection.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:39 -04:00
Phil Auld 41e6e3a50f sched: only perform capability check on privileged operation
Bugzilla: https://bugzilla.redhat.com/2115520

commit 700a78335fc28a59c307f420857fd2d4521549f8
Author: Christian Göttsche <cgzones@googlemail.com>
Date:   Wed Jun 15 17:25:04 2022 +0200

    sched: only perform capability check on privileged operation

    sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler()
    a CAP_SYS_NICE audit event unconditionally, even when the requested
    operation does not require that capability / is unprivileged, i.e. for
    reducing niceness.
    This is relevant in connection with SELinux, where a capability check
    results in a policy decision and by default a denial message on
    insufficient permission is issued.
    It can lead to three undesired cases:
      1. A denial message is generated, even in case the operation was an
         unprivileged one and thus the syscall succeeded, creating noise.
      2. To avoid the noise from 1. the policy writer adds a rule to ignore
         those denial messages, hiding future syscalls, where the task
         performs an actual privileged operation, leading to hidden limited
         functionality of that task.
      3. To avoid the noise from 1. the policy writer adds a rule to allow
         the task the capability CAP_SYS_NICE, while it does not need it,
         violating the principle of least privilege.

    Conduct privileged/unprivileged categorization first and perform a
    capable test (and at most once) only if needed.

    Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:39 -04:00
Phil Auld f7aa98b454 sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
Bugzilla: https://bugzilla.redhat.com/2115520

commit f3dd3f674555bd9455c5ae7fafce0696bd9931b3
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date:   Thu Jun 9 07:34:12 2022 +0800

    sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle

    Wakelist can help avoid cache bouncing and offload the overhead of waker
    cpu. So far, using wakelist within the same llc only happens on
    WF_ON_CPU, and this limitation could be removed to further improve
    wakeup performance.

    The commit 518cd62341 ("sched: Only queue remote wakeups when
    crossing cache boundaries") disabled queuing tasks on wakelist when
    the cpus share llc. This is because, at that time, the scheduler must
    send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
    supports TIF_POLLING, so this is not a problem now when the wakee cpu is
    in idle polling.

    Benefits:
      Queuing the task on an idle cpu can help improve performance on the waker
      cpu and utilization on the wakee cpu, and further improve locality because
      the wakee cpu can handle its own rq. This patch helps improve rt on
      our real java workloads where wakeup happens frequently.

      Consider the normal condition (CPU0 and CPU1 share same llc)
      Before this patch:

             CPU0                                       CPU1

        select_task_rq()                                idle
        rq_lock(CPU1->rq)
        enqueue_task(CPU1->rq)
        notify CPU1 (by sending IPI or CPU1 polling)

                                                        resched()

      After this patch:

             CPU0                                       CPU1

        select_task_rq()                                idle
        add to wakelist of CPU1
        notify CPU1 (by sending IPI or CPU1 polling)

                                                        rq_lock(CPU1->rq)
                                                        enqueue_task(CPU1->rq)
                                                        resched()

      We see CPU0 can finish its work earlier. It only needs to put task to
      wakelist and return.
      While CPU1 is idle, so let itself handle its own runqueue data.

    This patch makes no difference in terms of IPIs.
      This patch only takes effect when the wakee cpu is:
      1) idle polling
      2) idle not polling

      For 1), there will be no IPI with or without this patch.

      For 2), there will always be an IPI before or after this patch.
      Before this patch: waker cpu will enqueue task and check preempt. Since
      "idle" will be sure to be preempted, waker cpu must send a resched IPI.
      After this patch: waker cpu will put the task to the wakelist of wakee
      cpu, and send an IPI.

    Benchmark:
    We've tested schbench, unixbench, and hackbench on both x86 and arm64.

    On x86 (Intel Xeon Platinum 8269CY):
      schbench -m 2 -t 8

        Latency percentiles (usec)              before        after
            50.0000th:                             8            6
            75.0000th:                            10            7
            90.0000th:                            11            8
            95.0000th:                            12            8
            *99.0000th:                           13           10
            99.5000th:                            15           11
            99.9000th:                            18           14

      Unixbench with full threads (104)
                                                before        after
        Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
        Double-Precision Whetstone              617119.3      617298.5   0.03%
        Execl Throughput                         27667.3       27627.3  -0.14%
        File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
        File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
        File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
        Pipe Throughput                      145535622.8   145323033.2  -0.15%
        Pipe-based Context Switching           3221686.4     3583975.4  11.25%
        Process Creation                        101347.1      103345.4   1.97%
        Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
        Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
        System Call Overhead                   5300604.8     5312213.6   0.22%

      hackbench -g 1 -l 100000
                                                before        after
        Time                                     3.246        2.251

    On arm64 (Ampere Altra):
      schbench -m 2 -t 8

        Latency percentiles (usec)              before        after
            50.0000th:                            14           10
            75.0000th:                            19           14
            90.0000th:                            22           16
            95.0000th:                            23           16
            *99.0000th:                           24           17
            99.5000th:                            24           17
            99.9000th:                            28           25

      Unixbench with full threads (80)
                                                before        after
        Dhrystone 2 using register variables  3536194249    3537019613   0.02%
        Double-Precision Whetstone              629383.6      629431.6   0.01%
        Execl Throughput                         65920.5       65846.2  -0.11%
        File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
        File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
        File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
        Pipe Throughput                      133542875.3   131619389.8  -1.44%
        Pipe-based Context Switching           3215356.1     3576945.1  11.25%
        Process Creation                        108520.5      120184.6  10.75%
        Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
        Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
        System Call Overhead                   4429998.9     4435006.7   0.11%

      hackbench -g 1 -l 100000
                                                before        after
        Time                                     4.217        2.916

    Our patch improves schbench, hackbench and the Pipe-based Context
    Switching test of unixbench when idle CPUs exist, with no obvious
    regression on the other unixbench tests. This can help improve
    responsiveness in scenarios where wakeups happen frequently.

    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:39 -04:00
Phil Auld eb73b8ff33 sched: Fix the check of nr_running at queue wakelist
Bugzilla: https://bugzilla.redhat.com/2115520

commit 28156108fecb1f808b21d216e8ea8f0d205a530c
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date:   Thu Jun 9 07:34:11 2022 +0800

    sched: Fix the check of nr_running at queue wakelist

    The commit 2ebb177175 ("sched/core: Offload wakee task activation if it
    the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
    stacking when WF_ON_CPU.

    Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
    (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
    the deactivate_task() in __schedule(), thus p has been accounted out of
    rq->nr_running. As such, the task being the only runnable task on the rq
    implies reading rq->nr_running == 0 at that point.

    The benchmark result is in [1].

    [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:39 -04:00
Phil Auld e208332528 sched: Reverse sched_class layout
Bugzilla: https://bugzilla.redhat.com/2115520

commit 546a3fee174969ff323d70ff27b1ef181f0d7ceb
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue May 17 13:46:54 2022 +0200

    sched: Reverse sched_class layout

    Because GCC-12 is fully stupid about array bounds and it's just really
    hard to get a solid array definition from a linker script, flip the
    array order to avoid needing negative offsets :-/

    This makes the whole relational pointer magic a little less obvious, but
    alas.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/YoOLLmLG7HRTXeEm@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:38 -04:00
Phil Auld 216aaa830b sched/core: Avoid obvious double update_rq_clock warning
Bugzilla: https://bugzilla.redhat.com/2115520

commit 2679a83731d51a744657f718fc02c3b077e47562
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Sat Apr 30 16:58:42 2022 +0800

    sched/core: Avoid obvious double update_rq_clock warning

    When we use raw_spin_rq_lock() to acquire the rq lock and have to
    update the rq clock while holding the lock, the kernel may issue
    a WARN_DOUBLE_CLOCK warning.

    Since we directly use raw_spin_rq_lock() to acquire the rq lock instead
    of rq_lock(), there is no corresponding change to rq->clock_update_flags.
    In particular, when we have obtained the rq lock of another CPU, that
    rq's clock_update_flags may still be RQCF_UPDATED, and calling
    update_rq_clock() on it will then trigger the WARN_DOUBLE_CLOCK warning.

    So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid
    the WARN_DOUBLE_CLOCK warning.

    For the sched_rt_period_timer() and migrate_task_rq_dl() cases
    we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with
    rq_lock()/rq_unlock().

    For the {pull,push}_{rt,dl}_task() cases, we add the
    double_rq_clock_clear_update() function to clear RQCF_UPDATED of
    rq->clock_update_flags, and call double_rq_clock_clear_update()
    before double_lock_balance()/double_rq_lock() returns to avoid the
    WARN_DOUBLE_CLOCK warning.
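
    An illustrative sketch of such a helper (simplified; the real version
    only exists under CONFIG_SCHED_DEBUG and has to cope with rq1 == rq2 on
    !SMP builds):

        static inline void double_rq_clock_clear_update(struct rq *rq1, struct rq *rq2)
        {
        #ifdef CONFIG_SCHED_DEBUG
                /* Drop RQCF_UPDATED so the next update_rq_clock() does not warn. */
                rq1->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
        #ifdef CONFIG_SMP
                rq2->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
        #endif
        #endif
        }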

    Some call trace reports:
    Call Trace 1:
     <IRQ>
     sched_rt_period_timer+0x10f/0x3a0
     ? enqueue_top_rt_rq+0x110/0x110
     __hrtimer_run_queues+0x1a9/0x490
     hrtimer_interrupt+0x10b/0x240
     __sysvec_apic_timer_interrupt+0x8a/0x250
     sysvec_apic_timer_interrupt+0x9a/0xd0
     </IRQ>
     <TASK>
     asm_sysvec_apic_timer_interrupt+0x12/0x20

    Call Trace 2:
     <TASK>
     activate_task+0x8b/0x110
     push_rt_task.part.108+0x241/0x2c0
     push_rt_tasks+0x15/0x30
     finish_task_switch+0xaa/0x2e0
     ? __switch_to+0x134/0x420
     __schedule+0x343/0x8e0
     ? hrtimer_start_range_ns+0x101/0x340
     schedule+0x4e/0xb0
     do_nanosleep+0x8e/0x160
     hrtimer_nanosleep+0x89/0x120
     ? hrtimer_init_sleeper+0x90/0x90
     __x64_sys_nanosleep+0x96/0xd0
     do_syscall_64+0x34/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Call Trace 3:
     <TASK>
     deactivate_task+0x93/0xe0
     pull_rt_task+0x33e/0x400
     balance_rt+0x7e/0x90
     __schedule+0x62f/0x8e0
     do_task_dead+0x3f/0x50
     do_exit+0x7b8/0xbb0
     do_group_exit+0x2d/0x90
     get_signal+0x9df/0x9e0
     ? preempt_count_add+0x56/0xa0
     ? __remove_hrtimer+0x35/0x70
     arch_do_signal_or_restart+0x36/0x720
     ? nanosleep_copyout+0x39/0x50
     ? do_nanosleep+0x131/0x160
     ? audit_filter_inodes+0xf5/0x120
     exit_to_user_mode_prepare+0x10f/0x1e0
     syscall_exit_to_user_mode+0x17/0x30
     do_syscall_64+0x40/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Call Trace 4:
     update_rq_clock+0x128/0x1a0
     migrate_task_rq_dl+0xec/0x310
     set_task_cpu+0x84/0x1e4
     try_to_wake_up+0x1d8/0x5c0
     wake_up_process+0x1c/0x30
     hrtimer_wakeup+0x24/0x3c
     __hrtimer_run_queues+0x114/0x270
     hrtimer_interrupt+0xe8/0x244
     arch_timer_handler_phys+0x30/0x50
     handle_percpu_devid_irq+0x88/0x140
     generic_handle_domain_irq+0x40/0x60
     gic_handle_irq+0x48/0xe0
     call_on_irq_stack+0x2c/0x60
     do_interrupt_handler+0x80/0x84

    Steps to reproduce:
    1. Enable CONFIG_SCHED_DEBUG when compiling the kernel
    2. echo 1 > /sys/kernel/debug/clear_warn_once
       echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features
       echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features
    3. Run some rt/dl tasks that periodically work and sleep, e.g.
       create 2*n rt or dl (90% running) tasks via rt-app on a system
       with n CPUs. Dietmar Eggemann reported Call Trace 4 when running
       on a PREEMPT_RT kernel.

    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:38 -04:00
Phil Auld 26b81fc091 sched: Fix build warning without CONFIG_SYSCTL
Bugzilla: https://bugzilla.redhat.com/2115520

commit 494dcdf46e5cdee926c9f441d37e3ea1db57d1da
Author: YueHaibing <yuehaibing@huawei.com>
Date:   Wed Apr 27 21:10:02 2022 +0800

    sched: Fix build warning without CONFIG_SYSCTL

    If CONFIG_SYSCTL is n, the build warns:

    kernel/sched/core.c:1782:12: warning: ‘sysctl_sched_uclamp_handler’ defined but not used [-Wunused-function]
     static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~

    sysctl_sched_uclamp_handler() is only used when CONFIG_SYSCTL is enabled,
    so wrap all related code with CONFIG_SYSCTL to fix this.
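
    The shape of the fix, sketched (handler signature taken from the warning
    above, body elided):

        #ifdef CONFIG_SYSCTL
        static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
                                               void *buffer, size_t *lenp, loff_t *ppos)
        {
                /* ... only built when the sysctl interface exists ... */
                return 0;
        }
        /* ... the uclamp sysctl table and its registration live here too ... */
        #endif /* CONFIG_SYSCTL */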

    Fixes: 3267e0156c33 ("sched: Move uclamp_util sysctls to core.c")
    Signed-off-by: YueHaibing <yuehaibing@huawei.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:37 -04:00
Phil Auld ad01f5836e sched: Move uclamp_util sysctls to core.c
Bugzilla: https://bugzilla.redhat.com/2115520

commit 3267e0156c3341ac25b37a0f60551cdae1634b60
Author: Zhen Ni <nizhen@uniontech.com>
Date:   Tue Feb 15 19:46:02 2022 +0800

    sched: Move uclamp_util sysctls to core.c

    Move uclamp_util sysctls to core.c and use the new
    register_sysctl_init() to register the sysctl interface.
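
    A sketch of the pattern (the table entry shown is illustrative, not the
    exact upstream contents):

        static struct ctl_table sched_uclamp_sysctls[] = {
                {
                        .procname     = "sched_util_clamp_min",
                        .data         = &sysctl_sched_uclamp_util_min,
                        .maxlen       = sizeof(unsigned int),
                        .mode         = 0644,
                        .proc_handler = sysctl_sched_uclamp_handler,
                },
                {}
        };

        static int __init sched_core_sysctl_init(void)
        {
                register_sysctl_init("kernel", sched_uclamp_sysctls);
                return 0;
        }
        late_initcall(sched_core_sysctl_init);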

    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld a3912174ad sched: Move rt_period/runtime sysctls to rt.c
Bugzilla: https://bugzilla.redhat.com/2115520

commit d9ab0e63fa7f8405fbb19e28c5191e0880a7f2db
Author: Zhen Ni <nizhen@uniontech.com>
Date:   Tue Feb 15 19:45:59 2022 +0800

    sched: Move rt_period/runtime sysctls to rt.c

    Move rt_period/runtime sysctls to rt.c and use the new
    register_sysctl_init() to register the sysctl interface.

    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld da09e6d005 sched: Move schedstats sysctls to core.c
Bugzilla: https://bugzilla.redhat.com/2115520

commit f5ef06d58be8311a9425e6a54a053ecb350952f3
Author: Zhen Ni <nizhen@uniontech.com>
Date:   Tue Feb 15 19:45:58 2022 +0800

    sched: Move schedstats sysctls to core.c

    Move schedstats sysctls to core.c and use the new
    register_sysctl_init() to register the sysctl interface.

    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:35 -04:00
Phil Auld 4241f8765f Merge remote-tracking branch 'origin/merge-requests/1372' into bz2115520
Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:13:30 -04:00
Waiman Long a8b188fafd sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354
Upstream Status: tip commit 851a723e45d1c4c8f6f7b0d2cfbc5f53690bb4e9

commit 851a723e45d1c4c8f6f7b0d2cfbc5f53690bb4e9
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 22 Sep 2022 14:00:41 -0400

    sched: Always clear user_cpus_ptr in do_set_cpus_allowed()

    The do_set_cpus_allowed() function is used by either kthread_bind() or
    select_fallback_rq(). In both cases the user affinity (if any) should be
    destroyed too.

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220922180041.1768141-6-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-10-27 19:48:17 -04:00
Waiman Long a2add19e1a sched: Enforce user requested affinity
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354
Upstream Status: tip commit da019032819a1f09943d3af676892ec8c627668e
Conflicts: A merge conflict in kernel/sched/sched.h due to the presence
           of RH_KABI code requiring manual merge.

commit da019032819a1f09943d3af676892ec8c627668e
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 22 Sep 2022 14:00:39 -0400

    sched: Enforce user requested affinity

    It was found that the user requested affinity via sched_setaffinity()
    can be easily overwritten by other kernel subsystems without an easy way
    to reset it back to what the user requested. For example, any change
    to the current cpuset hierarchy may reset the cpumask of the tasks in
    the affected cpusets to the default cpuset value even if those tasks
    have pre-existing user requested affinity. That is especially easy to
    trigger under a cgroup v2 environment where writing "+cpuset" to the
    root cgroup's cgroup.subtree_control file will reset the cpus affinity
    of all the processes in the system.

    That is problematic in a nohz_full environment where the tasks running
    in the nohz_full CPUs usually have their cpus affinity explicitly set
    and will behave incorrectly if cpus affinity changes.

    Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
    and using it to restrict the given cpumask unless there is no overlap. In
    that case, it will fall back to the given one. The SCA_USER flag is
    reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr
    masking should be skipped. In addition, masking should also be skipped
    if any of the SCA_MIGRATE_* flag is set.

    All callers of set_cpus_allowed_ptr() will be affected by this change.
    A scratch cpumask is added to percpu runqueues structure for doing
    additional masking when user_cpus_ptr is set.
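
    A simplified sketch of the masking step in __set_cpus_allowed_ptr()
    (flag handling abbreviated, field names may differ from the real code):

        /* Restrict the request by the user requested affinity, if any. */
        if (p->user_cpus_ptr && !(ctx->flags & SCA_USER) &&
            cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
                ctx->new_mask = rq->scratch_mask;
        /* else: no overlap (or SCA_USER/SCA_MIGRATE_*), keep the given mask */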

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220922180041.1768141-4-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-10-27 19:48:06 -04:00
Waiman Long b5b3deb05e sched: Always preserve the user requested cpumask
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354
Upstream Status: tip commit 8f9ea86fdf99b81458cc21fc1c591fcd4a0fa1f4

commit 8f9ea86fdf99b81458cc21fc1c591fcd4a0fa1f4
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 22 Sep 2022 14:00:38 -0400

    sched: Always preserve the user requested cpumask

    Unconditionally preserve the user requested cpumask on
    sched_setaffinity() calls. This allows using it outside of the fairly
    narrow restrict_cpus_allowed_ptr() use-case and fixes some cpuset issues
    that currently suffer destruction of cpumasks.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220922180041.1768141-3-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-10-27 19:44:44 -04:00
Waiman Long 8a370625e9 sched: Introduce affinity_context
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354
Upstream Status: tip commit 713a2e21a5137e96d2594f53d19784ffde3ddbd0

commit 713a2e21a5137e96d2594f53d19784ffde3ddbd0
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 22 Sep 2022 14:00:40 -0400

    sched: Introduce affinity_context

    In order to prepare for passing through additional data through the
    affinity call-chains, convert the mask and flags argument into a
    structure.
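
    The structure looks roughly like this (sketch):

        struct affinity_context {
                const struct cpumask *new_mask;  /* requested affinity */
                struct cpumask *user_mask;       /* user requested affinity, if any */
                unsigned int flags;              /* SCA_* flags */
        };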

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220922180041.1768141-5-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-10-27 19:44:41 -04:00
Waiman Long 31d9c33c0e sched: Add __releases annotations to affine_move_task()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354
Upstream Status: tip commit 5584e8ac2c68280e5ac31d231c23cdb7dfa225db

commit 5584e8ac2c68280e5ac31d231c23cdb7dfa225db
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 22 Sep 2022 14:00:37 -0400

    sched: Add __releases annotations to affine_move_task()

    affine_move_task() assumes task_rq_lock() has been called and it does
    an implicit task_rq_unlock() before returning. Add the appropriate
    __releases annotations to make this clear.

    A typo in a comment is also fixed.
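
    The annotation pattern, sketched as a prototype (argument list matches
    the description above, not necessarily the exact upstream one):

        static int affine_move_task(struct rq *rq, struct task_struct *p,
                                    struct rq_flags *rf, int dest_cpu, unsigned int flags)
                __releases(rq->lock)
                __releases(p->pi_lock);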

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220922180041.1768141-2-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-10-27 19:44:37 -04:00
Al Stone 0d2d511544 sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2126952
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for compliant platforms.  This set cleans up
 powercap and enables DTPM for edge systems to use in thermal and power
 management; this is all in drivers/powercap.  This set has been tested
 via simple boot tests, and of course the CI loop.  This may be difficult
 to test on Arm due to DTPM being a very new feature.  However, this is
 exactly the same powercap framework used by intel_rapl, which should
 continue to function properly regardless.

commit bb4479994945e9170534389a7762eb56149320ac
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Tue Jun 21 10:04:10 2022 +0100

    sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()

    effective_cpu_util() already has an `int cpu' parameter which allows
    retrieving the CPU capacity scale factor (or maximum CPU capacity) inside
    this function via arch_scale_cpu_capacity(cpu).

    A lot of code calling effective_cpu_util() (or the shim
    sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
    arch_scale_cpu_capacity() already.
    But not having to pass it into effective_cpu_util() will make the EAS
    wake-up code easier, especially when the maximum CPU capacity reduced
    by the thermal pressure is passed through the EAS wake-up functions.

    Due to the asymmetric CPU capacity support of arm/arm64 architectures,
    arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
    per_cpu(cpu_scale, cpu) on such a system.
    On all other architectures it is a compile-time constant
    (SCHED_CAPACITY_SCALE).
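
    In terms of the interface this is roughly (sketch):

        /* before */
        unsigned long sched_cpu_util(int cpu, unsigned long max);

        /* after: the capacity is derived internally */
        unsigned long sched_cpu_util(int cpu);
        /* ... via arch_scale_cpu_capacity(cpu) inside effective_cpu_util() */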

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
    (cherry picked from commit bb4479994945e9170534389a7762eb56149320ac)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-10-24 09:08:12 -06:00
Chris von Recklinghausen 63534db797 NUMA balancing: optimize page placement for memory tiering system
Bugzilla: https://bugzilla.redhat.com/2120352

commit c574bbe917036c8968b984c82c7b13194fe5ce98
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Mar 22 14:46:23 2022 -0700

    NUMA balancing: optimize page placement for memory tiering system

    With the advent of various new memory types, some machines will have
    multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
    memory subsystem of these machines can be called a memory tiering system,
    because the performance of the different types of memory is usually
    different.

    In such a system, because of memory access pattern changes etc.,
    some pages in the slow memory may become hot globally.  So in this
    patch, the NUMA balancing mechanism is enhanced to optimize the page
    placement among the different memory types according to hot/cold
    dynamically.

    In a typical memory tiering system, there are CPUs, fast memory and slow
    memory in each physical NUMA node.  The CPUs and the fast memory will be
    put in one logical node (called fast memory node), while the slow memory
    will be put in another (faked) logical node (called slow memory node).
    That is, the fast memory is regarded as local while the slow memory is
    regarded as remote.  So it's possible for the recently accessed pages in
    the slow memory node to be promoted to the fast memory node via the
    existing NUMA balancing mechanism.

    The original NUMA balancing mechanism will stop migrating pages if the
    free memory of the target node falls below the high watermark.  This
    is a reasonable policy if there's only one memory type.  But it makes
    the original NUMA balancing mechanism almost useless for optimizing
    page placement among different memory types.  Details are as follows.

    It's the common case that the working-set size of the workload is
    larger than the size of the fast memory nodes.  Otherwise, it's
    unnecessary to use the slow memory at all.  So, there are almost never
    enough free pages in the fast memory nodes, and the globally hot
    pages in the slow memory node cannot be promoted to the fast memory
    node.  To solve the issue, we have 2 choices as follows,

    a. Ignore the free pages watermark checking when promoting hot pages
       from the slow memory node to the fast memory node.  This will
       create some memory pressure in the fast memory node, thus trigger
       the memory reclaiming.  So that, the cold pages in the fast memory
       node will be demoted to the slow memory node.

    b. Define a new watermark called wmark_promo which is higher than
       wmark_high, and have kswapd reclaiming pages until free pages reach
       such watermark.  The scenario is as follows: when we want to promote
       hot-pages from a slow memory to a fast memory, but fast memory's free
       pages would go lower than high watermark with such promotion, we wake
       up kswapd with wmark_promo watermark in order to demote cold pages and
       free us up some space.  So, next time we want to promote hot-pages we
       might have a chance of doing so.

    The choice "a" may create high memory pressure in the fast memory node.
    If the memory pressure of the workload is high, the memory pressure
    may become so high that the memory allocation latency of the workload
    is influenced, e.g.  the direct reclaiming may be triggered.

    The choice "b" works much better at this aspect.  If the memory
    pressure of the workload is high, the hot pages promotion will stop
    earlier because its allocation watermark is higher than that of the
    normal memory allocation.  So in this patch, choice "b" is implemented.
    A new zone watermark (WMARK_PROMO) is added.  Which is larger than the
    high watermark and can be controlled via watermark_scale_factor.

    In addition to the original page placement optimization among sockets,
    the NUMA balancing mechanism is extended to be used to optimize page
    placement according to hot/cold among different memory types.  So the
    sysctl user space interface (numa_balancing) is extended in a backward
    compatible way as follow, so that the users can enable/disable these
    functionality individually.

    The sysctl is converted from a Boolean value to a bits field.  The
    definition of the flags is,

    - 0: NUMA_BALANCING_DISABLED
    - 1: NUMA_BALANCING_NORMAL
    - 2: NUMA_BALANCING_MEMORY_TIERING
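
    As flag definitions this corresponds to (sketch matching the list above):

        #define NUMA_BALANCING_DISABLED         0x0
        #define NUMA_BALANCING_NORMAL           0x1
        #define NUMA_BALANCING_MEMORY_TIERING   0x2

        /* e.g. enable both modes: echo 3 > /proc/sys/kernel/numa_balancing */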

    We have tested the patch with the pmbench memory accessing benchmark
    with the 80:20 read/write ratio and the Gauss access address
    distribution on a 2 socket Intel server with Optane DC Persistent
    Memory Model.  The test results shows that the pmbench score can
    improve up to 95.9%.

    Thanks to Andrew Morton for helping fix the documentation format error.

    Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
Chris von Recklinghausen 2d6179b3cd kthread: Generalize pf_io_worker so it can point to struct kthread
Bugzilla: https://bugzilla.redhat.com/2120352

commit e32cf5dfbe227b355776948b2c9b5691b84d1cbd
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Dec 22 22:10:09 2021 -0600

    kthread: Generalize pf_io_worker so it can point to struct kthread

    The point of using set_child_tid to hold the kthread pointer was that
    it already did what is necessary.  There are now restrictions on when
    set_child_tid can be initialized and when set_child_tid can be used in
    schedule_tail, which indicates that continuing to use set_child_tid
    to hold the kthread pointer is a bad idea.

    Instead of continuing to use the set_child_tid field of task_struct
    generalize the pf_io_worker field of task_struct and use it to hold
    the kthread pointer.

    Rename pf_io_worker (which is a void * pointer) to worker_private so
    it can be used to store a kthread's struct kthread pointer.  Update the
    kthread code to store the kthread pointer in the worker_private field.
    Remove the places where set_child_tid had to be dealt with carefully
    because kthreads also used it.
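
    After the change the lookup helper becomes roughly (sketch):

        static inline struct kthread *to_kthread(struct task_struct *k)
        {
                WARN_ON(!(k->flags & PF_KTHREAD));
                return k->worker_private;
        }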

    Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:35 -04:00
Chris von Recklinghausen 34dce2be8d kthread: Never put_user the set_child_tid address
Bugzilla: https://bugzilla.redhat.com/2120352

commit 00580f03af5eb2a527875b4a80a5effd95bda2fa
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Dec 22 16:57:50 2021 -0600

    kthread: Never put_user the set_child_tid address

    Kernel threads abuse set_child_tid.  Historically that has been fine
    as set_child_tid was initialized after the kernel thread had been
    forked.  Unfortunately, storing struct kthread in set_child_tid after
    the thread is running makes struct kthread unusable for storing
    the thread's result codes.

    When set_child_tid is set to struct kthread during fork, schedule_tail
    ends up writing the thread id to the beginning of struct kthread
    (if put_user does not realize it is a kernel address).

    Solve this by skipping the put_user for all kthreads.

    Reported-by: Nathan Chancellor <nathan@kernel.org>
    Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:34 -04:00
Chris von Recklinghausen 5b51e7ace6 kthread: Warn about failed allocations for the init kthread
Bugzilla: https://bugzilla.redhat.com/2120352

commit dd621ee0cf8eb32445c8f5f26d3b7555953071d8
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Dec 21 11:41:14 2021 -0600

    kthread: Warn about failed allocations for the init kthread

    Failed allocations are not expected when setting up the initial task and
    it is not really possible to handle them either.  So I added a warning
    to report if such an allocation failure ever happens.

    Correct the sense of the warning so it warns when an allocation failure
    happens, not when the allocation succeeded.  Oops.

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Link: https://lkml.kernel.org/r/20211221231611.785b74cf@canb.auug.org.au
    Link: https://lkml.kernel.org/r/CA+G9fYvLaR5CF777CKeWTO+qJFTN6vAvm95gtzN+7fw3Wi5hkA@mail.gmail.com
    Link: https://lkml.kernel.org/r/20211216102956.GC10708@xsang-OptiPlex-9020
    Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:34 -04:00
Chris von Recklinghausen e1e51160dc kthread: Ensure struct kthread is present for all kthreads
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40966e316f86b8cfd83abd31ccb4df729309d3e7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Dec 2 09:56:14 2021 -0600

    kthread: Ensure struct kthread is present for all kthreads

    Today the rules are a bit iffy and arbitrary about which kernel
    threads have struct kthread present.  Both idle threads and threads
    started with create_kthread want struct kthread present, so that is
    effectively all kernel threads.  Make the rule that if PF_KTHREAD
    and the task is running then struct kthread is present.

    This will allow the kernel thread code to use tsk->exit_code
    with different semantics from ordinary processes.

    To ensure that struct kthread is present for all
    kernel threads, move its allocation into copy_process.

    Add a deallocation of struct kthread in exec for processes
    that were kernel threads.

    Move the allocation of struct kthread for the initial thread
    earlier so that it is not repeated for each additional idle
    thread.

    Move the initialization of struct kthread into set_kthread_struct
    so that the structure is always and reliably initialized.

    Clear set_child_tid in free_kthread_struct to ensure the kthread
    struct is reliably freed during exec.  The function
    free_kthread_struct does not need to clear vfork_done during exec as
    exec_mm_release called from exec_mmap has already cleared vfork_done.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Frantisek Hrbata 37715a7ab5 Merge: Backport scheduler related v5.19 and earlier commits for kernel-rt
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1319

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120671
Tested: By me with scheduler stress tests.

Series of prerequisites for the RT patch set that touches scheduler code.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-09-27 08:47:30 -04:00
Phil Auld 035866b87a smp: Rename flush_smp_call_function_from_idle()
Bugzilla: https://bugzilla.redhat.com/2120671

commit 16bf5a5e1ec56474ed2a19d72f272ed09a5d3ea1
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Apr 13 15:31:03 2022 +0200

    smp: Rename flush_smp_call_function_from_idle()

    This is invoked from the stopper thread too, which is definitely not idle.
    Rename it to flush_smp_call_function_queue() and fixup the callers.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220413133024.305001096@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-09-08 11:25:07 -04:00
Phil Auld bb91a8ff6d sched: Fix missing prototype warnings
Bugzilla: https://bugzilla.redhat.com/2120671

commit d664e399128bd78b905ff480917e2c2d4949e101
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Apr 13 15:31:02 2022 +0200

    sched: Fix missing prototype warnings

    A W=1 build emits more than a dozen missing prototype warnings related to
    scheduler and scheduler specific includes.

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-09-08 11:25:07 -04:00
Waiman Long d42238049b preempt/dynamic: Introduce preemption model accessors
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491

commit cfe43f478b79ba45573ca22d52d0d8823be068fa
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri, 12 Nov 2021 18:52:01 +0000

    preempt/dynamic: Introduce preemption model accessors

    CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either:
    o The build-time preemption model when !PREEMPT_DYNAMIC
    o The default boot-time preemption model when PREEMPT_DYNAMIC

    IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual
    model could have been set to something else by the "preempt=foo" cmdline
    parameter. Same problem applies to CONFIG_PREEMPTION.

    Introduce a set of helpers to determine the actual preemption model used by
    the live kernel.
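
    Usage ends up looking like this (sketch; on PREEMPT_DYNAMIC kernels the
    helpers report the model selected at boot):

        if (preempt_model_full())
                pr_info("fully preemptible kernel\n");
        else if (preempt_model_voluntary())
                pr_info("voluntary preemption\n");
        else if (preempt_model_none())
                pr_info("no forced preemption\n");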

    Suggested-by: Marco Elver <elver@google.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Marco Elver <elver@google.com>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lore.kernel.org/r/20211112185203.280040-3-valentin.schneider@arm.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-30 17:21:52 -04:00
Waiman Long 1a0eb66558 sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2104946
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/urgent&id=b6e8d40d43ae4dec00c8fea2593eeea3114b8f44

commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44
Author: Waiman Long <longman@redhat.com>
Date:   Tue, 2 Aug 2022 21:54:51 -0400

    sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed

    With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
    that the cpuset will just use the effective CPUs of its parent. So
    cpuset_can_attach() can call task_can_attach() with an empty mask.
    This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
    to dl_bw_of() to crash due to percpu value access of an out of bound
    CPU value. For example:

            [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
              :
            [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
              :
            [80468.207946] Call Trace:
            [80468.208947]  cpuset_can_attach+0xa0/0x140
            [80468.209953]  cgroup_migrate_execute+0x8c/0x490
            [80468.210931]  cgroup_update_dfl_csses+0x254/0x270
            [80468.211898]  cgroup_subtree_control_write+0x322/0x400
            [80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
            [80468.213777]  new_sync_write+0x11f/0x1b0
            [80468.214689]  vfs_write+0x1eb/0x280
            [80468.215592]  ksys_write+0x5f/0xe0
            [80468.216463]  do_syscall_64+0x5c/0x80
            [80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
    is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
    to be used by tasks within the cpuset anyway.

    Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
    reflect the change. In addition, a check is added to task_can_attach()
    to guard against the possibility that cpumask_any_and() may return a
    value >= nr_cpu_ids.
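
    Sketch of the guarded check in task_can_attach() (simplified):

        int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

        if (unlikely(cpu >= nr_cpu_ids))
                return -EINVAL;     /* nothing usable to check against */

        ret = dl_cpu_busy(cpu, p);  /* per-CPU dl_bw access is now in bounds */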

    Fixes: 7f51412a41 ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-03 10:41:05 -04:00
Patrick Talbert 5cbac754a7 Merge: sched: Fix balance_push() vs __sched_setscheduler()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1059

Bugzilla: https://bugzilla.redhat.com/2100215
Tested: Ran cpu hot[un]plug for 24+ hours while stress tests were running.

commit 04193d590b390ec7a0592630f46d559ec6564ba1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Jun 7 22:41:55 2022 +0200

    sched: Fix balance_push() vs __sched_setscheduler()

    The purpose of balance_push() is to act as a filter on task selection
    in the case of CPU hotplug, specifically when taking the CPU out.

    It does this by (ab)using the balance callback infrastructure, with
    the express purpose of keeping all the unlikely/odd cases in a single
    place.

    In order to serve its purpose, the balance_push_callback needs to be
    (exclusively) on the callback list at all times (noting that the
    callback always places itself back on the list the moment it runs,
    also noting that when the CPU goes down, regular balancing concerns
    are moot, so ignoring them is fine).

    And here-in lies the problem, __sched_setscheduler()'s use of
    splice_balance_callbacks() takes the callbacks off the list across a
    lock-break, making it possible for an interleaving __schedule() to
    see an empty list and not get filtered.

    Fixes: ae79270232 ("sched: Optimize finish_lock_switch()")
    Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
    Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Valentin Schneider <vschneid@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-12 10:32:54 +02:00
Phil Auld d560b649e2 sched: Fix balance_push() vs __sched_setscheduler()
Bugzilla: https://bugzilla.redhat.com/2100215

commit 04193d590b390ec7a0592630f46d559ec6564ba1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Jun 7 22:41:55 2022 +0200

    sched: Fix balance_push() vs __sched_setscheduler()

    The purpose of balance_push() is to act as a filter on task selection
    in the case of CPU hotplug, specifically when taking the CPU out.

    It does this by (ab)using the balance callback infrastructure, with
    the express purpose of keeping all the unlikely/odd cases in a single
    place.

    In order to serve its purpose, the balance_push_callback needs to be
    (exclusively) on the callback list at all times (noting that the
    callback always places itself back on the list the moment it runs,
    also noting that when the CPU goes down, regular balancing concerns
    are moot, so ignoring them is fine).

    And here-in lies the problem, __sched_setscheduler()'s use of
    splice_balance_callbacks() takes the callbacks off the list across a
    lock-break, making it possible for an interleaving __schedule() to
    see an empty list and not get filtered.
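
    The core of the fix, sketched (simplified): when splicing across a lock
    break, balance_push_callback must stay on the list.

        static inline struct callback_head *
        __splice_balance_callbacks(struct rq *rq, bool split)
        {
                struct callback_head *head = rq->balance_callback;

                if (!head)
                        return NULL;

                /* Keep balance_push() armed across the lock break. */
                if (split && head == &balance_push_callback)
                        head = NULL;
                else
                        rq->balance_callback = NULL;

                return head;
        }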

    Fixes: ae79270232 ("sched: Optimize finish_lock_switch()")
    Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
    Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-22 16:22:26 -04:00
Ming Lei 4415be8560 block: check that there is a plug in blk_flush_plug
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit aa8dcccaf32bfdc09f2aff089d5d60c37da5b7b5
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jan 27 08:05:49 2022 +0100

    block: check that there is a plug in blk_flush_plug

    Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes
    the NULL check instead of open coding that check everywhere.
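
    The wrapper is essentially (sketch):

        static inline void blk_flush_plug(struct blk_plug *plug, bool async)
        {
                if (plug)
                        __blk_flush_plug(plug, async);
        }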

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:56:20 +08:00
Ming Lei d5d4963cf5 block: remove blk_needs_flush_plug
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit b1f866b013e6e5583f2f0bf4a61d13eddb9a1799
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jan 27 08:05:48 2022 +0100

    block: remove blk_needs_flush_plug

    blk_needs_flush_plug fails to account for the cb_list, which needs
    flushing as well.  Remove it and just check if there is a plug instead
    of poking into the internals of the plug structure.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:56:19 +08:00
Phil Auld 5087f87023 sched/tracing: Append prev_state to tp args instead
Bugzilla: https://bugzilla.redhat.com/2078906
Conflicts: Skipped one hunk, in samples, due to not having 3a73333fb370
("tracing: Add TRACE_CUSTOM_EVENT() macro").

commit 9c2136be0878c88c53dea26943ce40bb03ad8d8d
Author: Delyan Kratunov <delyank@fb.com>
Date:   Wed May 11 18:28:36 2022 +0000

    sched/tracing: Append prev_state to tp args instead

    Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
    sched_switch event, 2022-01-20) added a new prev_state argument to the
    sched_switch tracepoint, before the prev task_struct pointer.

    This reordering of arguments broke BPF programs that use the raw
    tracepoint (e.g. tp_btf programs). The type of the second argument has
    changed and existing programs that assume a task_struct* argument
    (e.g. for bpf_task_storage access) will now fail to verify.

    If we instead append the new argument to the end, all existing programs
    would continue to work and can conditionally extract the prev_state
    argument on supported kernel versions.
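
    The tracepoint prototype thus becomes (sketch):

        TP_PROTO(bool preempt,
                 struct task_struct *prev,
                 struct task_struct *next,
                 unsigned int prev_state),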

    Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
    Signed-off-by: Delyan Kratunov <delyank@fb.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lkml.kernel.org/r/c8a6930dfdd58a4a5755fc01732675472979732b.camel@fb.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-02 09:20:55 -04:00
Phil Auld 2dfe14261a sched: Teach the forced-newidle balancer about CPU affinity limitation.
Bugzilla: https://bugzilla.redhat.com/2078906

commit 386ef214c3c6ab111d05e1790e79475363abaa05
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Mar 17 15:51:32 2022 +0100

    sched: Teach the forced-newidle balancer about CPU affinity limitation.

    try_steal_cookie() looks at task_struct::cpus_mask to decide if the
    task could be moved to `this' CPU. It ignores that the task might be in
    a migration disabled section while not on the CPU. In this case the task
    must not be moved, otherwise per-CPU assumptions are broken.

    Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if a
    task can be moved.
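
    In try_steal_cookie() the check becomes roughly (sketch; the variable
    naming the destination CPU is illustrative):

        /* Respect cpus_mask, migrate_disable() and CPU hotplug state. */
        if (!is_cpu_allowed(p, this_cpu))
                goto next;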

    Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-01 13:54:12 -04:00
Phil Auld 35c29596b9 sched/core: Fix forceidle balancing
Bugzilla: https://bugzilla.redhat.com/2078906

commit 5b6547ed97f4f5dfc23f8e3970af6d11d7b7ed7e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Mar 16 22:03:41 2022 +0100

    sched/core: Fix forceidle balancing

    Steve reported that ChromeOS encounters the forceidle balancer being
    run from rt_mutex_setprio()'s balance_callback() invocation and
    explodes.

    Now, the forceidle balancer gets queued every time the idle task gets
    selected, set_next_task(), which is strictly too often.
    rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:

            queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
            running = task_current(rq, p); /* rq->curr == p */

            if (queued)
                    dequeue_task(...);
            if (running)
                    put_prev_task(...);

            /* change task properties */

            if (queued)
                    enqueue_task(...);
            if (running)
                    set_next_task(...);

    However, rt_mutex_setprio() will explicitly not run this pattern on
    the idle task (since priority boosting the idle task is quite insane).
    Most other 'change' pattern users are pidhash based and would also not
    apply to idle.

    Also, the change pattern doesn't contain a __balance_callback()
    invocation and hence we could have an out-of-band balance-callback,
    which *should* trigger the WARN in rq_pin_lock() (which guards against
    this exact anti-pattern).

    So while none of that explains how this happens, it does indicate that
    having it in set_next_task() might not be the most robust option.

    Instead, explicitly queue the forceidle balancer from pick_next_task()
    when it does indeed result in forceidle selection. Having it here,
    ensures it can only be triggered under the __schedule() rq->lock
    instance, and hence must be run from that context.

    This also happens to clean up the code a little, so win-win.

    Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
    Reported-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
    Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-01 13:54:12 -04:00
Patrick Talbert f9a5b7f4d0 Merge: Scheduler RT prerequisites
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/754

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076594
Tested:  Sanity tested with scheduler stress tests.

This is a handful of commits to help the RT merge. Keeping the differences
as small as possible reduces the maintenance.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Fernando Pacheco <fpacheco@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-12 09:28:27 +02:00
Patrick Talbert d92575ea9d Merge: sched/deadline: code cleanup
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/729

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065219
Upstream Status: Linux
Tested: by me with scheduler stress tests using deadline class, admission
control failures and general stress tests.

A series of fixes and cleanup for the deadline scheduler
class.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Fernando Pacheco <fpacheco@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-12 09:28:24 +02:00
Phil Auld 83eb03a64e sched: Make RCU nest depth distinct in __might_resched()
Bugzilla: https://bugzilla.redhat.com/2076594

commit 50e081b96e35e43b65591f40f7376204decd1cb5
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Sep 23 18:54:43 2021 +0200

    sched: Make RCU nest depth distinct in __might_resched()

    For !RT kernels RCU nest depth in __might_resched() is always expected to
    be 0, but on RT kernels it can be non zero while the preempt count is
    expected to be always 0.

    Instead of playing magic games in interpreting the 'preempt_offset'
    argument, rename it to 'offsets', use the lower 8 bits for the expected
    preempt count, allow handing in the expected RCU nest depth in the upper
    bits, and adapt the __might_resched() code and related checks and printks.
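
    The encoding is along these lines (sketch):

        #define MIGHT_RESCHED_RCU_SHIFT         8
        #define MIGHT_RESCHED_PREEMPT_MASK      ((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)

        /* offsets == expected preempt count | (expected RCU nest depth << 8) */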

    The affected call sites are updated in subsequent steps.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210923165358.243232823@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:33 -04:00
Phil Auld 627bfaffba sched: Make might_sleep() output less confusing
Bugzilla: https://bugzilla.redhat.com/2076594

commit 8d713b699e84aade6b64e241a35f22e166fc8174
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Sep 23 18:54:41 2021 +0200

    sched: Make might_sleep() output less confusing

    might_sleep() output is pretty informative, but can be confusing at times
    especially with PREEMPT_RCU when the check triggers due to a voluntary
    sleep inside a RCU read side critical section:

     BUG: sleeping function called from invalid context at kernel/test.c:110
     in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
     Preemption disabled at: migrate_disable+0x33/0xa0

    in_atomic() is 0, but it still tells that preemption was disabled at
    migrate_disable(), which is completely useless because preemption is not
    disabled. But the interesting information to decode the above, i.e. the RCU
    nesting depth, is not printed.

    That becomes even more confusing when might_sleep() is invoked from
    cond_resched_lock() within a RCU read side critical section. Here the
    expected preemption count is 1 and not 0.

     BUG: sleeping function called from invalid context at kernel/test.c:131
     in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
     Preemption disabled at: test_cond_lock+0xf3/0x1c0

    So in_atomic() is set, which is expected as the caller holds a spinlock,
    but it's unclear why this is broken and the preempt disable IP is just
    pointing at the correct place, i.e. spin_lock(), which is obviously not
    helpful either.

    Make that more useful in general:

     - Print preempt_count() and the expected value

    and for the CONFIG_PREEMPT_RCU case:

     - Print the RCU read side critical section nesting depth

     - Print the preempt disable IP only when preempt count
       does not have the expected value.

    So the might_sleep() dump from a within a preemptible RCU read side
    critical section becomes:

     BUG: sleeping function called from invalid context at kernel/test.c:110
     in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
     preempt_count: 0, expected: 0
     RCU nest depth: 1, expected: 0

    and the cond_resched_lock() case becomes:

     BUG: sleeping function called from invalid context at kernel/test.c:141
     in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
     preempt_count: 1, expected: 1
     RCU nest depth: 1, expected: 0

    which makes it pretty obvious what's going on. For all other cases the
    preempt disable IP is still printed as before:

     BUG: sleeping function called from invalid context at kernel/test.c: 156
     in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
     preempt_count: 1, expected: 0
     RCU nest depth: 0, expected: 0
     Preemption disabled at:
     [<ffffffff82b48326>] test_might_sleep+0xbe/0xf8

     BUG: sleeping function called from invalid context at kernel/test.c: 163
     in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
     preempt_count: 1, expected: 0
     RCU nest depth: 1, expected: 0
     Preemption disabled at:
     [<ffffffff82b48326>] test_might_sleep+0x1e4/0x280

    This also prepares to provide a better debugging output for RT enabled
    kernels and their spinlock substitutions.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210923165358.181022656@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:33 -04:00
Phil Auld ace9ad8221 sched: Cleanup might_sleep() printks
Bugzilla: https://bugzilla.redhat.com/2076594

commit a45ed302b6e6fe5b03166321c08b4f2ad4a92a35
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Sep 23 18:54:40 2021 +0200

    sched: Cleanup might_sleep() printks

    Convert them to pr_*(). No functional change.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210923165358.117496067@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:33 -04:00
Phil Auld 5d55e0afeb sched: Remove preempt_offset argument from __might_sleep()
Bugzilla: https://bugzilla.redhat.com/2076594

commit 42a387566c567603bafa1ec0c5b71c35cba83e86
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Sep 23 18:54:38 2021 +0200

    sched: Remove preempt_offset argument from __might_sleep()

    All callers hand in 0 and never will hand in anything else.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210923165358.054321586@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:33 -04:00
Phil Auld acc9612b04 sched: Clean up the might_sleep() underscore zoo
Bugzilla: https://bugzilla.redhat.com/2076594

commit 874f670e6088d3bff3972ecd44c1cb00610f9183
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Sep 23 18:54:35 2021 +0200

    sched: Clean up the might_sleep() underscore zoo

    __might_sleep() vs. ___might_sleep() is hard to distinguish. Aside of that
    the three underscore variant is exposed to provide a checkpoint for
    rescheduling points which are distinct from blocking points.

    They are semantically a preemption point which means that scheduling is
    state preserving. A real blocking operation, e.g. mutex_lock() or wait*(),
    cannot preserve a task state which is not equal to RUNNING.

    While technically blocking on a "sleeping" spinlock in RT enabled kernels
    falls into the voluntary scheduling category because it has to wait until
    the contended spin/rw lock becomes available, the RT lock substitution code
    can semantically be mapped to a voluntary preemption because the RT lock
    substitution code and the scheduler are providing mechanisms to preserve
    the task state and to take regular non-lock related wakeups into account.

    Rename ___might_sleep() to __might_resched() to make the distinction of
    these functions clear.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210923165357.928693482@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:33 -04:00
Phil Auld 293846bc7d sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
Bugzilla: http://bugzilla.redhat.com/2065219

commit 772b6539fdda31462cc08368e78df60b31a58bab
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Mar 2 19:34:30 2022 +0100

    sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()

    Both functions are doing almost the same, that is checking if admission
    control is still respected.

    With exclusive cpusets, dl_task_can_attach() checks if the destination
    cpuset (i.e. its root domain) has enough CPU capacity to accommodate the
    task.
    dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in
    case the CPU is hot-plugged out.

    dl_task_can_attach() is used to check if a task can be admitted while
    dl_cpu_busy() is used to check if a CPU can be hotplugged out.

    Make dl_cpu_busy() able to deal with a task and use it instead of
    dl_task_can_attach() in task_can_attach().
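
    The merged interface, sketched (a NULL task means a plain hotplug
    capacity check):

        static int dl_cpu_busy(int cpu, struct task_struct *p);

        /* hotplug:   dl_cpu_busy(cpu, NULL) */
        /* admission: dl_cpu_busy(cpu, p)    */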

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20220302183433.333029-4-dietmar.eggemann@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-13 12:56:36 -04:00
Phil Auld c812902eec sched/deadline: Remove unused def_dl_bandwidth
Bugzilla: http://bugzilla.redhat.com/2065219

commit eb77cf1c151c4a1c2147cbf24d84bcf0ba504e7c
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Mar 2 19:34:28 2022 +0100

    sched/deadline: Remove unused def_dl_bandwidth

    Since commit 1724813d9f ("sched/deadline: Remove the sysctl_sched_dl
    knobs") the default deadline bandwidth control structure has no purpose.
    Remove it.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20220302183433.333029-2-dietmar.eggemann@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-13 10:46:56 -04:00
Phil Auld 20c10cc17b sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
Bugzilla: http://bugzilla.redhat.com/2069275

commit a7b2553b5ece1aba4b5994eef150d0a1269b5805
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Mar 15 10:33:53 2022 +0100

    sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y

    This header is not (yet) standalone.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 17:38:23 -04:00
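
The change itself presumably amounts to guarding the include in the scheduler
headers, along the lines of:

    #ifdef CONFIG_GENERIC_ENTRY
    # include <linux/entry-common.h>
    #endif
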
Phil Auld 6f4ee2303a sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies
Bugzilla: http://bugzilla.redhat.com/2069275

commit e66f6481a8c748ce2d4b37a3d5e10c4dd0d65e80
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed Feb 23 08:17:15 2022 +0100

    sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies

    Use all generic headers from kernel/sched/sched.h that are required
    for it to build.

    Sort the sections alphabetically.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 17:38:22 -04:00
Phil Auld b0b1db90ca sched/headers: Standardize kernel/sched/sched.h header dependencies
Bugzilla: http://bugzilla.redhat.com/2069275

commit b9e9c6ca6e54b5d58a57663f76c5cb33c12ea98f
Author: Ingo Molnar <mingo@kernel.org>
Date:   Sun Feb 13 08:19:43 2022 +0100

    sched/headers: Standardize kernel/sched/sched.h header dependencies

    kernel/sched/sched.h is a weird mix of ad-hoc headers included
    in the middle of the header.

    Two of them rely on being included in the middle of kernel/sched/sched.h,
    due to definitions they require:

     - "stat.h" needs the rq definitions.
     - "autogroup.h" needs the task_group definition.

    Move the inclusion of these two files out of kernel/sched/sched.h, and
    include them in all files that require them.

    Move the rest of the header dependencies to the top of the
    kernel/sched/sched.h file.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 17:38:22 -04:00
Phil Auld c08b78797d Merge remote-tracking branch 'origin/merge-requests/673' into bz2069275
Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 12:37:32 -04:00
Phil Auld 233aa69d39 Merge remote-tracking branch 'origin/merge-requests/671' into bz2069275
Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 12:37:07 -04:00
Phil Auld 0e372dcf73 sched/preempt: Add PREEMPT_DYNAMIC using static keys
Bugzilla: http://bugzilla.redhat.com/2065226

commit 99cf983cc8bca4adb461b519664c939a565cfd4d
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Feb 14 16:52:14 2022 +0000

    sched/preempt: Add PREEMPT_DYNAMIC using static keys

    Where an architecture selects HAVE_STATIC_CALL but not
    HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
    which will either branch to a callee or return to the caller.

    On such architectures, a number of constraints can conspire to make
    those trampolines more complicated and potentially less useful than we'd
    like. For example:

    * Hardware and software control flow integrity schemes can require the
      addition of "landing pad" instructions (e.g. `BTI` for arm64), which
      will also be present at the "real" callee.

    * Limited branch ranges can require that trampolines generate or load an
      address into a register and perform an indirect branch (or at least
      have a slow path that does so). This loses some of the benefits of
      having a direct branch.

    * Interaction with SW CFI schemes can be complicated and fragile, e.g.
      requiring that we can recognise idiomatic codegen and remove
      indirections, at least until clang provides more helpful mechanisms
      for dealing with this.

    For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
    really only need to enable/disable specific preemption functions. We can
    achieve the same effect without a number of the pain points above by
    using static keys to fold early returns into the preemption functions
    themselves rather than in an out-of-line trampoline, effectively
    inlining the trampoline into the start of the function.

    For arm64, this results in good code generation. For example, the
    dynamic_cond_resched() wrapper looks as follows when enabled. When
    disabled, the first `B` is replaced with a `NOP`, resulting in an early
    return.

    | <dynamic_cond_resched>:
    |        bti     c
    |        b       <dynamic_cond_resched+0x10>     // or `nop`
    |        mov     w0, #0x0
    |        ret
    |        mrs     x0, sp_el0
    |        ldr     x0, [x0, #8]
    |        cbnz    x0, <dynamic_cond_resched+0x8>
    |        paciasp
    |        stp     x29, x30, [sp, #-16]!
    |        mov     x29, sp
    |        bl      <preempt_schedule_common>
    |        mov     w0, #0x1
    |        ldp     x29, x30, [sp], #16
    |        autiasp
    |        ret

    ... compared to the regular form of the function:

    | <__cond_resched>:
    |        bti     c
    |        mrs     x0, sp_el0
    |        ldr     x1, [x0, #8]
    |        cbz     x1, <__cond_resched+0x18>
    |        mov     w0, #0x0
    |        ret
    |        paciasp
    |        stp     x29, x30, [sp, #-16]!
    |        mov     x29, sp
    |        bl      <preempt_schedule_common>
    |        mov     w0, #0x1
    |        ldp     x29, x30, [sp], #16
    |        autiasp
    |        ret

    Any architecture which implements static keys should be able to use this
    to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
    calls. Since this is likely to have greater overhead than (inlined)
    static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
    HAVE_PREEMPT_DYNAMIC_CALL is selected.

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-07 09:35:08 -04:00
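
A minimal sketch of the static-key pattern described above, assuming the usual
static_branch API (the real wrappers are generated by macros, but the effect is
the same early return folded into the start of the function):

    static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);

    int dynamic_cond_resched(void)
    {
            /* When the key is disabled, this compiles down to the
             * `nop` + early return shown in the listing above. */
            if (!static_branch_unlikely(&sk_dynamic_cond_resched))
                    return 0;
            return __cond_resched();
    }
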
Phil Auld c14a9a1c67 sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY
Bugzilla: http://bugzilla.redhat.com/2065226

commit 33c64734be3461222a8aa27d3dadc477ebca62de
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Feb 14 16:52:13 2022 +0000

    sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY

    Now that the enabled/disabled states for the preemption functions are
    declared alongside their definitions, the core PREEMPT_DYNAMIC logic is
    no longer tied to GENERIC_ENTRY, and can safely be selected so long as
    an architecture provides enabled/disabled states for
    irqentry_exit_cond_resched().

    Make it possible to select HAVE_PREEMPT_DYNAMIC without GENERIC_ENTRY.

    For existing users of HAVE_PREEMPT_DYNAMIC there should be no functional
    change as a result of this patch.

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lore.kernel.org/r/20220214165216.2231574-5-mark.rutland@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-07 09:35:08 -04:00
Phil Auld 4d8e4a6697 sched/preempt: Refactor sched_dynamic_update()
Bugzilla: http://bugzilla.redhat.com/2065226

commit 8a69fe0be143b0a1af829f85f0e9a1ae7d6a04db
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Feb 14 16:52:11 2022 +0000

    sched/preempt: Refactor sched_dynamic_update()

    Currently sched_dynamic_update needs to open-code the enabled/disabled
    function names for each preemption model it supports, when in practice
    this is a boolean enabled/disabled state for each function.

    Make this clearer and avoid repetition by defining the enabled/disabled
    states at the function definition, and using helper macros to perform the
    static_call_update(). Where x86 currently overrides the enabled
    function, it is made to provide both the enabled and disabled states for
    consistency, with defaults provided by the core code otherwise.

    In subsequent patches this will allow us to support PREEMPT_DYNAMIC
    without static calls.

    There should be no functional change as a result of this patch.

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lore.kernel.org/r/20220214165216.2231574-3-mark.rutland@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-07 09:35:07 -04:00
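
A hedged sketch of the shape this takes; the macro names below follow the
pattern described in the patch but should be treated as illustrative:

    /* Each preemption function declares its enabled/disabled targets
     * next to its definition ... */
    #define cond_resched_dynamic_enabled    __cond_resched
    #define cond_resched_dynamic_disabled   ((void *)&__static_call_return0)

    /* ... and sched_dynamic_update() only has to flip between them: */
    #define preempt_dynamic_enable(f)   static_call_update(f, f##_dynamic_enabled)
    #define preempt_dynamic_disable(f)  static_call_update(f, f##_dynamic_disabled)
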
Phil Auld cd905af16b sched/preempt: Move PREEMPT_DYNAMIC logic later
Bugzilla: http://bugzilla.redhat.com/2065226

commit 4c7485584d48f60b1e742c7c6a3a1fa503d48d97
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Feb 14 16:52:10 2022 +0000

    sched/preempt: Move PREEMPT_DYNAMIC logic later

    The PREEMPT_DYNAMIC logic in kernel/sched/core.c patches static calls
    for a bunch of preemption functions. While most are defined prior to
    this, the definition of cond_resched() is later in the file, and so we
    only have its declarations from include/linux/sched.h.

    In subsequent patches we'd like to define some macros alongside the
    definition of each of the preemption functions, which we can use within
    sched_dynamic_update(). For this to be possible, the PREEMPT_DYNAMIC
    logic needs to be placed after the various preemption functions.

    As a preparatory step, this patch moves the PREEMPT_DYNAMIC logic after
    the various preemption functions, with no other changes -- this is
    purely a move.

    There should be no functional change as a result of this patch.

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Link: https://lore.kernel.org/r/20220214165216.2231574-2-mark.rutland@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-07 09:35:07 -04:00
Phil Auld 1cf795c344 sched/isolation: Use single feature type while referring to housekeeping cpumask
Bugzilla: http://bugzilla.redhat.com/2065222

commit 04d4e665a60902cf36e7ad39af1179cb5df542ad
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon Feb 7 16:59:06 2022 +0100

    sched/isolation: Use single feature type while referring to housekeeping cpumask

    Refer to housekeeping APIs using single feature types instead of flags.
    This prevents passing multiple isolation features at once to
    housekeeping interfaces, which soon won't be possible anymore as each
    isolation feature will have its own cpumask.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-31 10:40:39 -04:00
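
In practice this means a caller asks for exactly one isolation feature per
call; a hedged example of the resulting calling convention (the HK_TYPE_*
names follow the new single-type scheme and are meant as illustration):

    int cpu;
    const struct cpumask *mask;

    /* One feature type per query -- no more OR'ed flag masks. */
    mask = housekeeping_cpumask(HK_TYPE_DOMAIN);

    if (housekeeping_enabled(HK_TYPE_TICK))
            cpu = housekeeping_any_cpu(HK_TYPE_TICK);
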
Phil Auld a5be4d79e1 sched/numa: Fix NUMA topology for systems with CPU-less nodes
Bugzilla: http://bugzilla.redhat.com/2062831

commit 0fb3978b0aac3a5c08637aed03cc2d65f793508f
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 14 20:15:52 2022 +0800

    sched/numa: Fix NUMA topology for systems with CPU-less nodes

    The NUMA topology parameters (sched_numa_topology_type,
    sched_domains_numa_levels, and sched_max_numa_distance, etc.)
    identified by the scheduler may be wrong for systems with CPU-less nodes.

    For example, the ACPI SLIT of a system with CPU-less persistent
    memory (Intel Optane DCPMM) nodes is as follows,

    [000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
    [004h 0004   4]                 Table Length : 0000042C
    [008h 0008   1]                     Revision : 01
    [009h 0009   1]                     Checksum : 59
    [00Ah 0010   6]                       Oem ID : "XXXX"
    [010h 0016   8]                 Oem Table ID : "XXXXXXX"
    [018h 0024   4]                 Oem Revision : 00000001
    [01Ch 0028   4]              Asl Compiler ID : "INTL"
    [020h 0032   4]        Asl Compiler Revision : 20091013

    [024h 0036   8]                   Localities : 0000000000000004
    [02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
    [030h 0048   4]                 Locality   1 : 15 0A 1C 11
    [034h 0052   4]                 Locality   2 : 11 1C 0A 1C
    [038h 0056   4]                 Locality   3 : 1C 11 1C 0A

    While the `numactl -H` output is as follows,

    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
    node 0 size: 64136 MB
    node 0 free: 5981 MB
    node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 1 size: 64466 MB
    node 1 free: 10415 MB
    node 2 cpus:
    node 2 size: 253952 MB
    node 2 free: 253920 MB
    node 3 cpus:
    node 3 size: 253952 MB
    node 3 free: 253951 MB
    node distances:
    node   0   1   2   3
      0:  10  21  17  28
      1:  21  10  28  17
      2:  17  28  10  28
      3:  28  17  28  10

    In this system, there are only 2 sockets.  In each memory controller,
    both DRAM and PMEM DIMMs are installed.  Although the physical NUMA
    topology is simple, the logical NUMA topology becomes a little
    complex.  Because both the distance(0, 1) and distance (1, 3) are less
    than the distance (0, 3), it appears that node 1 sits between node 0
    and node 3.  And the whole system appears to be a glueless mesh NUMA
    topology type.  But it's definitely not; there isn't even a CPU in node 3.

    This isn't a practical problem yet, because the PMEM nodes (node 2 and
    node 3 in the example system) are offlined by default during system
    boot, so init_numa_topology_type() called during system boot will
    ignore them and set sched_numa_topology_type to NUMA_DIRECT.
    init_numa_topology_type() is only called at runtime when a CPU of a
    never-onlined-before node gets plugged in, and there's no CPU in the
    PMEM nodes.  Still, it appears better to fix this to make the code more
    robust.

    To test the potential problem, we used a debug patch to call
    init_numa_topology_type() when the PMEM node is onlined (in
    __set_migration_target_nodes()).  With that, the NUMA parameters
    identified by the scheduler are as follows,

    sched_numa_topology_type:       NUMA_GLUELESS_MESH
    sched_domains_numa_levels:      4
    sched_max_numa_distance:        28

    To fix the issue, the CPU-less nodes are ignored when the NUMA topology
    parameters are identified.  Because a node may gain or lose CPUs at
    run time due to CPU hotplug, the NUMA topology parameters also need
    to be re-initialized at runtime on CPU hotplug.

    With the patch, the NUMA parameters identified for the example system
    above are as follows,

    sched_numa_topology_type:       NUMA_DIRECT
    sched_domains_numa_levels:      2
    sched_max_numa_distance:        21

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
Phil Auld 6c8a46d512 sched: replace cpumask_weight with cpumask_empty where appropriate
Bugzilla: http://bugzilla.redhat.com/2062831

commit 1087ad4e3f88c474b8134a482720782922bf3fdf
Author: Yury Norov <yury.norov@gmail.com>
Date:   Thu Feb 10 14:49:06 2022 -0800

    sched: replace cpumask_weight with cpumask_empty where appropriate

    In some places, kernel/sched code calls cpumask_weight() to check if
    any bit of a given cpumask is set. We can do it more efficiently with
    cpumask_empty() because cpumask_empty() stops traversing the cpumask as
    soon as it finds first set bit, while cpumask_weight() counts all bits
    unconditionally.

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
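
The transformation is mechanical; a small sketch of the before/after:

    /* Before: counts every set bit just to test for emptiness. */
    if (cpumask_weight(mask) == 0)
            return;

    /* After: stops at the first set bit it finds. */
    if (cpumask_empty(mask))
            return;
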
Phil Auld 090df5874d sched/tracing: Don't re-read p->state when emitting sched_switch event
Bugzilla: http://bugzilla.redhat.com/2062831

commit fa2c3254d7cfff5f7a916ab928a562d1165f17bb
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Thu Jan 20 16:25:19 2022 +0000

    sched/tracing: Don't re-read p->state when emitting sched_switch event

    As of commit

      c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

    the following sequence becomes possible:

                          p->__state = TASK_INTERRUPTIBLE;
                          __schedule()
                            deactivate_task(p);
      ttwu()
        READ !p->on_rq
        p->__state=TASK_WAKING
                            trace_sched_switch()
                              __trace_sched_switch_state()
                                task_state_index()
                                  return 0;

    TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
    the trace event.

    Prevent this by pushing the value read from __schedule() down the trace
    event.

    Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
Phil Auld eb2eff33b3 sched/core: Export pelt_thermal_tp
Bugzilla: http://bugzilla.redhat.com/2062831

commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Oct 28 12:50:05 2021 +0100

    sched/core: Export pelt_thermal_tp

    We can't use this tracepoint in modules without having the symbol
    exported first, fix that.

    Fixes: 765047932f ("sched/pelt: Add support to track thermal pressure")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
Phil Auld e99d064416 sched/core: Accounting forceidle time for all tasks except idle task
Bugzilla: http://bugzilla.redhat.com/2062831

commit b171501f258063f5c56dd2c5fdf310802d8d7dc1
Author: Cruz Zhao <CruzZhao@linux.alibaba.com>
Date:   Tue Jan 11 17:55:59 2022 +0800

    sched/core: Accounting forceidle time for all tasks except idle task

    There are two types of forced idle time: forced idle time from a cookie'd
    task and forced idle time from an uncookie'd task. The forced idle time from
    an uncookie'd task is actually caused indirectly by the cookie'd task in the
    runqueue, and it's more accurate to measure the capacity loss with the
    sum of both.

    Assuming cpu x and cpu y are a pair of SMT siblings, consider the
    following scenarios:
      1.There's a cookie'd task running on cpu x, and there're 4 uncookie'd
        tasks running on cpu y. For cpu x, there will be 80% forced idle time
        (from uncookie'd task); for cpu y, there will be 20% forced idle time
        (from cookie'd task).
      2.There's an uncookie'd task running on cpu x, and there're 4 cookie'd
        tasks running on cpu y. For cpu x, there will be 80% forced idle time
        (from cookie'd task); for cpu y, there will be 20% forced idle time
        (from uncookie'd task).

    Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced similarly):
        (cookie'd)taskset -c x stress-ng -c 1 -l 100
        (uncookie'd)taskset -c y stress-ng -c 4 -l 100

    In the above two scenarios, the total capacity loss is 1 cpu, but in
    scenario1, the cookie'd forced idle time tells us 20% cpu capacity loss, in
    scenario2, the cookie'd forced idle time tells us 80% cpu capacity loss,
    which are not accurate. It'll be more accurate to measure with cookie'd
    forced idle time and uncookie'd forced idle time.

    Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Josh Don <joshdon@google.com>
    Link: https://lore.kernel.org/r/1641894961-9241-2-git-send-email-CruzZhao@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:37 -04:00
Phil Auld ad2f1aab97 sched: Avoid double preemption in __cond_resched_*lock*()
Bugzilla: http://bugzilla.redhat.com/2062831

commit 7e406d1ff39b8ee574036418a5043c86723170cf
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Dec 25 01:04:57 2021 +0100

    sched: Avoid double preemption in __cond_resched_*lock*()

    For PREEMPT/DYNAMIC_PREEMPT the *_unlock() will already trigger a
    preemption, no point in then calling preempt_schedule_common()
    *again*.

    Use _cond_resched() instead, since this is a NOP for the preemptible
    configs while it provides a preemption point for the others.

    Reported-by: xuhaifeng <xuhaifeng@oppo.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:36 -04:00
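
A sketch of the resulting pattern (simplified; lockdep annotations omitted):

    int __cond_resched_lock(spinlock_t *lock)
    {
            int resched = should_resched(PREEMPT_LOCK_OFFSET);
            int ret = 0;

            if (spin_needbreak(lock) || resched) {
                    spin_unlock(lock);
                    /*
                     * On PREEMPT/PREEMPT_DYNAMIC the unlock above already
                     * preempted if needed; _cond_resched() is a NOP there
                     * but still provides a preemption point elsewhere.
                     */
                    if (!_cond_resched())
                            cpu_relax();
                    ret = 1;
                    spin_lock(lock);
            }
            return ret;
    }
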
Phil Auld 2d595edebf sched: Trigger warning if ->migration_disabled counter underflows.
Bugzilla: http://bugzilla.redhat.com/2062831

commit 9d0df37797453f168afdb2e6fd0353c73718ae9a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Nov 29 18:46:44 2021 +0100

    sched: Trigger warning if ->migration_disabled counter underflows.

    If migrate_enable() is used more often than its counterpart then it
    remains undetected and rq::nr_pinned will underflow, too.

    Add a warning if migrate_enable() is attempted without a matching
    migrate_disable().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20211129174654.668506-2-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:36 -04:00
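
A hedged sketch of where the new check sits (the rest of migrate_enable() is
omitted here):

    void migrate_enable(void)
    {
            struct task_struct *p = current;

            if (p->migration_disabled > 1) {
                    p->migration_disabled--;
                    return;
            }

            /* Catch an unbalanced migrate_enable() before rq::nr_pinned
             * silently underflows as well. */
            if (WARN_ON_ONCE(!p->migration_disabled))
                    return;

            /* ... actual re-enable and nr_pinned accounting follow ... */
    }
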
Phil Auld fb7c2476b1 sched/fair: Replace CFS internal cpu_util() with cpu_util_cfs()
Bugzilla: http://bugzilla.redhat.com/2062831

commit 82762d2af31a60081162890983a83499c9c7dd74
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Thu Nov 18 17:42:40 2021 +0100

    sched/fair: Replace CFS internal cpu_util() with cpu_util_cfs()

    cpu_util_cfs() was created by commit d4edd662ac ("sched/cpufreq: Use
    the DEADLINE utilization signal") to enable the access to CPU
    utilization from the Schedutil CPUfreq governor.

    Commit a07630b8b2 ("sched/cpufreq/schedutil: Use util_est for OPP
    selection") added util_est support later.

    The only thing cpu_util() is doing on top of what cpu_util_cfs() already
    does is to clamp the return value to the [0..capacity_orig] capacity
    range of the CPU. Integrating this into cpu_util_cfs() is not harming
    the existing users (Schedutil and CPUfreq cooling (latter via
    sched_cpu_util() wrapper)).

    For straightforwardness, prefer to keep using `int cpu` as the function
    parameter over using `struct rq *rq` which might avoid some calls to
    cpu_rq(cpu) -> per_cpu(runqueues, cpu) -> RELOC_HIDE().
    Update cpu_util()'s documentation and reuse it for cpu_util_cfs().
    Remove cpu_util().

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20211118164240.623551-1-dietmar.eggemann@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:36 -04:00
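
A sketch of the merged helper, assuming the usual PELT fields; the clamp on
the last line is the part that used to live in cpu_util():

    static inline unsigned long cpu_util_cfs(int cpu)
    {
            struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
            unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);

            if (sched_feat(UTIL_EST))
                    util = max_t(unsigned long, util,
                                 READ_ONCE(cfs_rq->avg.util_est.enqueued));

            /* Clamp to the CPU's original capacity, as cpu_util() did. */
            return min(util, capacity_orig_of(cpu));
    }
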
Phil Auld b5be4938b7 sched/core: Forced idle accounting
Bugzilla: http://bugzilla.redhat.com/2062831
Conflicts: fuzz due to kabi padding in struct rq

commit 4feee7d12603deca8775f9f9ae5e121093837444
Author: Josh Don <joshdon@google.com>
Date:   Mon Oct 18 13:34:28 2021 -0700

    sched/core: Forced idle accounting

    Adds accounting for "forced idle" time, which is time where a cookie'd
    task forces its SMT sibling to idle, despite the presence of runnable
    tasks.

    Forced idle time is one means to measure the cost of enabling core
    scheduling (ie. the capacity lost due to the need to force idle).

    Forced idle time is attributed to the thread responsible for causing
    the forced idle.

    A few details:
     - Forced idle time is displayed via /proc/PID/sched. It also requires
       that schedstats is enabled.
     - Forced idle is only accounted when a sibling hyperthread is held
       idle despite the presence of runnable tasks. No time is charged if
       a sibling is idle but has no runnable tasks.
     - Tasks with 0 cookie are never charged forced idle.
     - For SMT > 2, we scale the amount of forced idle charged based on the
       number of forced idle siblings. Additionally, we split the time up and
       evenly charge it to all running tasks, as each is equally responsible
       for the forced idle.

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:35 -04:00
Phil Auld fc23011d32 sched: Fix yet more sched_fork() races
Bugzilla: http://bugzilla.redhat.com/2062836

commit b1e8206582f9d680cff7d04828708c8b6ab32957
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Feb 14 10:16:57 2022 +0100

    sched: Fix yet more sched_fork() races

    Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
    invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
    race vs syscalls by not placing the task on the runqueue before it
    gets exposed through the pidhash.

    Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
    trying to fix a single instance of this, instead fix the whole class
    of issues, effectively reverting this commit.

    Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
    Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-14 09:25:27 -04:00
Phil Auld 4f57951af2 sched/fair: Fix fault in reweight_entity
Bugzilla: http://bugzilla.redhat.com/2062836

commit 13765de8148f71fa795e0a6607de37c49ea5915a
Author: Tadeusz Struk <tadeusz.struk@linaro.org>
Date:   Thu Feb 3 08:18:46 2022 -0800

    sched/fair: Fix fault in reweight_entity

    Syzbot found a GPF in reweight_entity. This has been bisected to
    commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid
    sched_task_group")

    There is a race between sched_post_fork() and setpriority(PRIO_PGRP)
    within a thread group that causes a null-ptr-deref in
    reweight_entity() in CFS. The scenario is that the main process spawns
    number of new threads, which then call setpriority(PRIO_PGRP, 0, -20),
    wait, and exit.  For each of the new threads the copy_process() gets
    invoked, which adds the new task_struct and calls sched_post_fork()
    for it.

    In the above scenario there is a possibility that
    setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread
    in the group that is just being created by copy_process(), and for
    which the sched_post_fork() has not been executed yet. This will
    trigger a null pointer dereference in reweight_entity(), as it will
    try to access the run queue pointer, which hasn't been set.

    Before the mentioned change the cfs_rq pointer for the task was
    set in sched_fork(), which is called much earlier in copy_process(),
    before the new task is added to the thread_group.  Now it is done in
    sched_post_fork(), which is called after that.  To fix the issue,
    remove the update_load parameter from set_load_weight()
    and call reweight_task() only if the task's state doesn't have the
    TASK_NEW flag set.

    Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
    Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com
    Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-14 09:25:18 -04:00
Herton R. Krzesinski f13f32b81b Merge: sched: backports from 5.16 merge window
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/217
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2020279

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2029640

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1921343

Upstream Status: Linux
Tested: By me, with scheduler stress and sanity tests. Boot tested
    on Alderlake for topology changes.

5.16+ scheduler fixes. This includes some commits requested by
the Livepatch team and some AlderLake topology changes. A few
additional patches were pulled in to make the rest apply. With
those and the dependency all patches apply cleanly.

v2: added 3 more commits from sched/urgent.

Added one last (hopefully) fix from sched/urgent.

Signed-off-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Wander Lairson Costa <wander@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-22 10:22:13 -03:00
Herton R. Krzesinski bc4cd05211 Merge: block: update to v5.16
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/148

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2023396
Upstream Status: merged to linus tree already

Update block layer and related drivers(drivers/block) with v5.16.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Lyude Paul <lyude@redhat.com>
RH-Acked-by: Gopal Tiwari <gtiwari@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-17 09:07:53 -03:00
Phil Auld 8ad4a5d307 sched/uclamp: Fix rq->uclamp_max not set on first enqueue
Bugzilla: http://bugzilla.redhat.com/2020279

commit 315c4f884800c45cb6bd8c90422fad554a8b9588
Author: Qais Yousef <qais.yousef@arm.com>
Date:   Thu Dec 2 11:20:33 2021 +0000

    sched/uclamp: Fix rq->uclamp_max not set on first enqueue

    Commit d81ae8aac8 ("sched/uclamp: Fix initialization of struct
    uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
    match the woken up task's uclamp_max when the rq is idle.

    The code was relying on rq->uclamp_max being initialized to zero, so on first
    enqueue

            static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
                                                enum uclamp_id clamp_id)
            {
                    ...

                    if (uc_se->value > READ_ONCE(uc_rq->value))
                            WRITE_ONCE(uc_rq->value, uc_se->value);
            }

    was actually resetting it. But since commit d81ae8aac8 changed the
    default to 1024, this no longer works. And since rq->uclamp_flags is
    also initialized to 0, neither the above code path nor uclamp_idle_reset()
    updates rq->uclamp_max on the first wake-up from idle.

    This is only visible from first wake up(s) until the first dequeue to
    idle after enabling the static key. And it only matters if the
    uclamp_max of this task is < 1024 since only then its uclamp_max will be
    effectively ignored.

    Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
    ensure uclamp_idle_reset() is called which then will update the rq
    uclamp_max value as expected.

    Fixes: d81ae8aac8 ("sched/uclamp: Fix initialization of struct uclamp_rq")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:52 -05:00
Phil Auld 29674bf319 preempt/dynamic: Fix setup_preempt_mode() return value
Bugzilla: http://bugzilla.redhat.com/2020279

commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86
Author: Andrew Halaney <ahalaney@redhat.com>
Date:   Fri Dec 3 17:32:03 2021 -0600

    preempt/dynamic: Fix setup_preempt_mode() return value

    __setup() callbacks expect 1 for success and 0 for failure. Correct the
    usage here to reflect that.

    Fixes: 826bfeb37b ("preempt/dynamic: Support dynamic preempt with preempt= boot option")
    Reported-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:52 -05:00
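
A sketch of the corrected callback, assuming the existing sched_dynamic_mode()
and sched_dynamic_update() helpers:

    static int __init setup_preempt_mode(char *str)
    {
            int mode = sched_dynamic_mode(str);

            if (mode < 0) {
                    pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
                    return 0;       /* failure: report the option as unknown */
            }

            sched_dynamic_update(mode);
            return 1;               /* success */
    }
    __setup("preempt=", setup_preempt_mode);
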
Phil Auld 9ce1981ceb sched/scs: Reset task stack state in bringup_cpu()
Bugzilla: http://bugzilla.redhat.com/2020279

commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Tue Nov 23 11:40:47 2021 +0000

    sched/scs: Reset task stack state in bringup_cpu()

    To hot unplug a CPU, the idle task on that CPU calls a few layers of C
    code before finally leaving the kernel. When KASAN is in use, poisoned
    shadow is left around for each of the active stack frames, and when
    shadow call stacks (SCS) are in use the task's saved SCS SP is left
    pointing at an arbitrary point within the task's shadow call stack.

    When a CPU is offlined and then onlined back into the kernel, this stale
    state can adversely affect execution. Stale KASAN shadow can alias new
    stackframes and result in bogus KASAN warnings. A stale SCS SP is
    effectively a memory leak, and prevents a portion of the shadow call
    stack being used. Across a number of hotplug cycles the idle task's
    entire shadow call stack can become unusable.

    We previously fixed the KASAN issue in commit:

      e1b77c9298 ("sched/kasan: remove stale KASAN poison after hotplug")

    ... by removing any stale KASAN stack poison immediately prior to
    onlining a CPU.

    Subsequently in commit:

      f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")

    ... the refactoring left the KASAN and SCS cleanup in one-time idle
    thread initialization code rather than something invoked prior to each
    CPU being onlined, breaking both as above.

    We fixed SCS (but not KASAN) in commit:

      63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit")

    ... but as this runs in the context of the idle task being offlined it's
    potentially fragile.

    To fix these consistently and more robustly, reset the SCS SP and KASAN
    shadow of a CPU's idle task immediately before we online that CPU in
    bringup_cpu(). This ensures the idle task always has a consistent state
    when it is running, and removes the need to do so when exiting an idle
    task.

    Whenever any thread is created, dup_task_struct() will give the task a
    stack which is free of KASAN shadow, and initialize the task's SCS SP,
    so there's no need to specially initialize either for the idle thread within
    init_idle(), as this was only necessary to handle hotplug cycles.

    I've tested this on arm64 with:

    * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
    * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK

    ... offlining and onlining CPUS with:

    | while true; do
    |   for C in /sys/devices/system/cpu/cpu*/online; do
    |     echo 0 > $C;
    |     echo 1 > $C;
    |   done
    | done

    Fixes: f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
    Reported-by: Qian Cai <quic_qiancai@quicinc.com>
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Tested-by: Qian Cai <quic_qiancai@quicinc.com>
    Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:52 -05:00
Phil Auld 4402fa0cd3 sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
Bugzilla: http://bugzilla.redhat.com/2020279

commit 42dc938a590c96eeb429e1830123fef2366d9c80
Author: Vincent Donnefort <vincent.donnefort@arm.com>
Date:   Thu Nov 4 17:51:20 2021 +0000

    sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

    Nothing protects the access to the per_cpu variable sd_llc_id. When testing
    the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
    update_top_cache_domain(). One scenario being:

                  CPU1                            CPU2
      ==================================================================

      per_cpu(sd_llc_id, CPUX) => 0
                                        partition_sched_domains_locked()
                                          detach_destroy_domains()
      cpus_share_cache(CPUX, CPUX)          update_top_cache_domain(CPUX)
        per_cpu(sd_llc_id, CPUX) => 0
                                              per_cpu(sd_llc_id, CPUX) = CPUX
        per_cpu(sd_llc_id, CPUX) => CPUX
        return false

    ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
    is a warning triggered from ttwu_queue_wakelist().

    Avoid such a race in cpus_share_cache() by always returning true when
    this_cpu == that_cpu.

    Fixes: 518cd62341 ("sched: Only queue remote wakeups when crossing cache boundaries")
    Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:51 -05:00
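
A sketch of the fixed check; the self-comparison short-circuit is the new part:

    bool cpus_share_cache(int this_cpu, int that_cpu)
    {
            /* A CPU trivially shares its LLC with itself; answering early
             * avoids racing with update_top_cache_domain() on sd_llc_id. */
            if (this_cpu == that_cpu)
                    return true;

            return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
    }
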
Phil Auld 70d81a5df7 sched/fair: Prevent dead task groups from regaining cfs_rq's
Bugzilla: http://bugzilla.redhat.com/2020279

commit b027789e5e50494c2325cc70c8642e7fd6059479
Author: Mathias Krause <minipli@grsecurity.net>
Date:   Wed Nov 3 20:06:13 2021 +0100

    sched/fair: Prevent dead task groups from regaining cfs_rq's

    Kevin is reporting crashes which point to a use-after-free of a cfs_rq
    in update_blocked_averages(). Initial debugging revealed that we've
    live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
    free_fair_sched_group(). However, it was unclear how that can happen.

    His kernel config happened to lead to a layout of struct sched_entity
    that put the 'my_q' member directly into the middle of the object
    which makes it incidentally overlap with SLUB's freelist pointer.
    That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
    mangling, leads to a reliable access violation in form of a #GP which
    made the UAF fail fast.

    Michal seems to have run into the same issue[1]. He already correctly
    diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
    cfs_rq's to list on unthrottle") is causing the preconditions for the
    UAF to happen by re-adding cfs_rq's also to task groups that have no
    more running tasks, i.e. also to dead ones. His analysis, however,
    misses the real root cause and it cannot be seen from the crash
    backtrace only, as the real offender is tg_unthrottle_up() getting
    called via sched_cfs_period_timer() via the timer interrupt at an
    inconvenient time.

    When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
    task group, it doesn't protect itself from getting interrupted. If the
    timer interrupt triggers while we iterate over all CPUs or after
    unregister_fair_sched_group() has finished but prior to unlinking the
    task group, sched_cfs_period_timer() will execute and walk the list of
    task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
    dying task group. These will later -- in free_fair_sched_group() -- be
    kfree()'ed while still being linked, leading to the fireworks Kevin
    and Michal are seeing.

    To fix this race, ensure the dying task group gets unlinked first.
    However, simply switching the order of unregistering and unlinking the
    task group isn't sufficient, as concurrent RCU walkers might still see
    it, as can be seen below:

        CPU1:                                      CPU2:
          :                                        timer IRQ:
          :                                          do_sched_cfs_period_timer():
          :                                            :
          :                                            distribute_cfs_runtime():
          :                                              rcu_read_lock();
          :                                              :
          :                                              unthrottle_cfs_rq():
        sched_offline_group():                             :
          :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
          list_del_rcu(&tg->list);                           :
     (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
          :                                                    :
     (2)  list_del_rcu(&tg->siblings);                         :
          :                                                    tg_unthrottle_up():
          unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
            :                                                    :
            list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
            :                                                    :
            :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
     (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
          :                                                      :
          :                                                    :
          :                                                  :
          :                                                :
          :                                              :
     (4)  :                                              rcu_read_unlock();

    CPU 2 walks the task group list in parallel to sched_offline_group(),
    specifically, it'll read the soon to be unlinked task group entry at
    (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
    still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
    all cfs_rq's via list_del_leaf_cfs_rq() in
    unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
    these at (3), which is the cause of the UAF later on.

    To prevent this additional race from happening, we need to wait until
    walk_tg_tree_from() has finished traversing the task groups, i.e.
    after the RCU read critical section ends in (4). Afterwards we're safe
    to call unregister_fair_sched_group(), as each new walk won't see the
    dying task group any more.

    On top of that, we need to wait yet another RCU grace period after
    unregister_fair_sched_group() to ensure print_cfs_stats(), which might
    run concurrently, always sees valid objects, i.e. not already free'd
    ones.

    This patch survives Michal's reproducer[2] for 8h+ now, which used to
    trigger within minutes before.

      [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
      [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

    Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
    [peterz: shuffle code around a bit]
    Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
    Signed-off-by: Mathias Krause <minipli@grsecurity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:51 -05:00
Phil Auld 167140e68e sched: Remove pointless preemption disable in sched_submit_work()
Bugzilla: http://bugzilla.redhat.com/2020279

commit b945efcdd07d86cece1cce68503aae91f107eacb
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Sep 29 11:37:32 2021 +0200

    sched: Remove pointless preemption disable in sched_submit_work()

    Neither wq_worker_sleeping() nor io_wq_worker_sleeping() needs to be invoked
    with preemption disabled:

      - The worker flag check operations only need to be serialized against
        the worker thread itself.

      - The accounting and worker pool operations are serialized with locks.

    which means that disabling preemption has neither a reason nor a
    value. Remove it and update the stale comment.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Link: https://lkml.kernel.org/r/8735pnafj7.ffs@tglx

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:49 -05:00
Phil Auld e9a43d5999 sched: Move mmdrop to RCU on RT
Bugzilla: http://bugzilla.redhat.com/2020279

commit 8d491de6edc27138806cae6e8eca455beb325b62
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Sep 28 14:24:32 2021 +0200

    sched: Move mmdrop to RCU on RT

    mmdrop() is invoked from finish_task_switch() by the incoming task to drop
    the mm which was handed over by the previous task. mmdrop() can be quite
    expensive which prevents an incoming real-time task from getting useful
    work done.

    Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels
    it delegates the eventually required invocation of __mmdrop() to RCU.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:49 -05:00
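
A hedged sketch of the RT-only deferral (the rcu_head field name is assumed
here for illustration):

    #ifdef CONFIG_PREEMPT_RT
    static void __mmdrop_delayed(struct rcu_head *rhp)
    {
            /* 'delayed_drop' is the rcu_head assumed to live in mm_struct. */
            struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);

            __mmdrop(mm);
    }

    static void mmdrop_sched(struct mm_struct *mm)
    {
            /* Defer the potentially expensive __mmdrop() to RCU on RT. */
            if (atomic_dec_and_test(&mm->mm_count))
                    call_rcu(&mm->delayed_drop, __mmdrop_delayed);
    }
    #else
    static void mmdrop_sched(struct mm_struct *mm)
    {
            mmdrop(mm);
    }
    #endif
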
Phil Auld 395c062ef5 sched: Move kprobes cleanup out of finish_task_switch()
Bugzilla: http://bugzilla.redhat.com/2020279

commit 670721c7bd2a6e16e40db29b2707a27bdecd6928
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Sep 28 14:24:28 2021 +0200

    sched: Move kprobes cleanup out of finish_task_switch()

    Doing cleanups in the tail of schedule() is a latency punishment for the
    incoming task. The point of invoking kprobes_task_flush() for a dead task
    is that the instances are returned and cannot leak when __schedule() is
    kprobed.

    Move it into the delayed cleanup.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:49 -05:00
Phil Auld 85d252d9f1 sched: Limit the number of task migrations per batch on RT
Bugzilla: http://bugzilla.redhat.com/2020279

commit 691925f3ddccea832cf2d162dc277d2623a816e3
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Sep 28 14:24:25 2021 +0200

    sched: Limit the number of task migrations per batch on RT

    Batched task migrations are a source of large latencies as they keep the
    scheduler from running while processing the migrations.

    Limit the batch size to 8 instead of 32 when running on a RT enabled
    kernel.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:49 -05:00
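
The change presumably boils down to making the batch limit configuration
dependent, along these lines:

    /* Smaller migration batches keep RT scheduling latencies bounded. */
    #ifdef CONFIG_PREEMPT_RT
    const_debug unsigned int sysctl_sched_nr_migrate = 8;
    #else
    const_debug unsigned int sysctl_sched_nr_migrate = 32;
    #endif
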
Phil Auld 1028c3ee10 sched: Simplify wake_up_*idle*()
Bugzilla: http://bugzilla.redhat.com/2020279

commit 8850cb663b5cda04d33f9cfbc38889d73d3c8e24
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 21 22:16:02 2021 +0200

    sched: Simplify wake_up_*idle*()

    Simplify and make wake_up_if_idle() more robust; also, don't iterate
    the whole machine with preempt_disable() in its caller:
    wake_up_all_idle_cpus().

    This prepares for another wake_up_if_idle() user that needs a full
    do_idle() cycle.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vasily Gorbik <gor@linux.ibm.com>
    Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
    Link: https://lkml.kernel.org/r/20210929152428.769328779@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:48 -05:00
Phil Auld c693803b2b sched,rcu: Rework try_invoke_on_locked_down_task()
Bugzilla: http://bugzilla.redhat.com/2020279

commit 9b3c4ab3045e953670c7de9c1165fae5358a7237
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 21 21:54:32 2021 +0200

    sched,rcu: Rework try_invoke_on_locked_down_task()

    Give try_invoke_on_locked_down_task() a saner name and have it return
    an int so that the caller might distinguish between different reasons
    of failure.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Acked-by: Vasily Gorbik <gor@linux.ibm.com>
    Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
    Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:48 -05:00
Phil Auld a2ee57d025 sched: Improve try_invoke_on_locked_down_task()
Bugzilla: http://bugzilla.redhat.com/2020279

commit f6ac18fafcf6cc5e41c26766d12ad335ed81012e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Sep 22 10:14:15 2021 +0200

    sched: Improve try_invoke_on_locked_down_task()

    Clarify and tighten try_invoke_on_locked_down_task().

    Basically the function calls @func under task_rq_lock(), except it
    avoids taking rq->lock when possible.

    This makes calling @func unconditional (the function will get renamed
    in a later patch to remove the try).

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vasily Gorbik <gor@linux.ibm.com>
    Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
    Link: https://lkml.kernel.org/r/20210929152428.589323576@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:47 -05:00
Phil Auld eec2be9bfb kernel/sched: Fix sched_fork() access an invalid sched_task_group
Bugzilla: http://bugzilla.redhat.com/2020279

commit 4ef0c5c6b5ba1f38f0ea1cedad0cad722f00c14a
Author: Zhang Qiao <zhangqiao22@huawei.com>
Date:   Wed Sep 15 14:40:30 2021 +0800

    kernel/sched: Fix sched_fork() access an invalid sched_task_group

    There is a small race between copy_process() and sched_fork()
    where child->sched_task_group points to an already freed pointer.

            parent doing fork()      | someone moving the parent
                                     | to another cgroup
      -------------------------------+-------------------------------
      copy_process()
          + dup_task_struct()<1>
                                      parent move to another cgroup,
                                      and free the old cgroup. <2>
          + sched_fork()
            + __set_task_cpu()<3>
            + task_fork_fair()
              + sched_slice()<4>

    In the worst case, this bug can lead to "use-after-free" and
    cause a panic as shown below:

      (1) the parent copies its sched_task_group to the child at <1>;

      (2) someone moves the parent to another cgroup and frees the old
          cgroup at <2>;

      (3) the sched_task_group and cfs_rq that belong to the old cgroup
          will be accessed at <3> and <4>, which causes a panic:

      [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0
      [] Oops: 0000 [#1] SMP PTI
      [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0.x86_64+ #1
      [] RIP: 0010:sched_slice+0x84/0xc0

      [] Call Trace:
      []  task_fork_fair+0x81/0x120
      []  sched_fork+0x132/0x240
      []  copy_process.part.5+0x675/0x20e0
      []  ? __handle_mm_fault+0x63f/0x690
      []  _do_fork+0xcd/0x3b0
      []  do_syscall_64+0x5d/0x1d0
      []  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [] RIP: 0033:0x7f04418cd7e1

    Between cgroup_can_fork() and cgroup_post_fork(), the cgroup
    membership and thus sched_task_group can't change. So update child's
    sched_task_group at sched_post_fork() and move task_fork() and
    __set_task_cpu() (which access the sched_task_group) from sched_fork()
    to sched_post_fork().

    Fixes: 8323f26ce3 ("sched: Fix race in task_group")
    Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:47 -05:00
Phil Auld fbc84644bc sched: Make struct sched_statistics independent of fair sched class
Bugzilla: http://bugzilla.redhat.com/2020279

commit ceeadb83aea28372e54857bf88ab7e17af48ab7b
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Sep 5 14:35:41 2021 +0000

    sched: Make struct sched_statistics independent of fair sched class

    If we want to use the schedstats facility to trace other sched classes, we
    should make it independent of fair sched class. The struct sched_statistics
    is the scheduler statistics of a task_struct or a task_group. So we can
    move it into struct task_struct and struct task_group to achieve the goal.

    After the patch, schedstats are organized as follows,

        struct task_struct {
           ...
           struct sched_entity se;
           struct sched_rt_entity rt;
           struct sched_dl_entity dl;
           ...
           struct sched_statistics stats;
           ...
       };

    Regarding the task group, schedstats is only supported for fair group
    sched, and a new struct sched_entity_stats is introduced, suggested by
    Peter -

        struct sched_entity_stats {
            struct sched_entity     se;
            struct sched_statistics stats;
        } __no_randomize_layout;

    Then with the se in a task_group, we can easily get the stats.
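
    For illustration, a small helper along the lines below (a sketch based on
    the layout above; the exact name and config guards in the tree may
    differ) can return the right sched_statistics for either case:

        static inline struct sched_statistics *
        __schedstats_from_se(struct sched_entity *se)
        {
                /* group entity: the stats sit right behind the se */
                if (!entity_is_task(se))
                        return &container_of(se, struct sched_entity_stats, se)->stats;

                /* task entity: the stats live in the task_struct */
                return &task_of(se)->stats;
        }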

    The sched_statistics members may be modified frequently when schedstats is
    enabled. To avoid impacting unrelated data that may share a cacheline with
    them, struct sched_statistics is defined as cacheline aligned.
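
    Sketched with only a few representative fields (the real struct carries
    many more counters), that looks like:

        struct sched_statistics {
        #ifdef CONFIG_SCHEDSTATS
                u64     wait_start;
                u64     wait_max;
                u64     sleep_start;
                u64     block_start;
                /* ... more wait/sleep/block counters ... */
        #endif
        } ____cacheline_aligned;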

    As this patch changes a core scheduler struct, I verified its performance
    impact with 'perf bench sched pipe', as suggested by Mel. The results are
    below; all values are in usecs/op.
                                      Before               After
          kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
          kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
    [These numbers differ slightly from the earlier version because my old
     test machine was destroyed and I had to use a different one.]

    Almost no impact on scheduler performance.

    No functional change.

    [lkp@intel.com: reported build failure in earlier version]

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Phil Auld b3e5bde075 sched/fair: Add cfs bandwidth burst statistics
Bugzilla: http://bugzilla.redhat.com/2020279

commit bcb1704a1ed2de580a46f28922e223a65f16e0f5
Author: Huaixin Chang <changhuaixin@linux.alibaba.com>
Date:   Mon Aug 30 11:22:14 2021 +0800

    sched/fair: Add cfs bandwidth burst statistics

    Two new statistics are introduced to show the internals of the burst
    feature and explain why burst helps or not:

    nr_bursts:  number of periods in which a bandwidth burst occurred
    burst_time: cumulative wall-time (in nanoseconds) that any CPU has
                used above quota in the respective periods
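
    A minimal sketch of the accounting idea at each period refill (field and
    helper names here are illustrative assumptions, not the exact kernel
    code):

        /* Sketch: called once per period when the quota is refilled. */
        static void refill_period(struct cfs_bandwidth *cfs_b)
        {
                /* runtime only shrinks within a period, so this is the
                 * amount consumed since the previous refill */
                u64 used = cfs_b->runtime_at_refill - cfs_b->runtime;

                if (used > cfs_b->quota) {
                        /* ran above quota: that excess is a burst */
                        cfs_b->burst_time += used - cfs_b->quota;
                        cfs_b->nr_bursts++;
                }

                /* refill, letting unused quota accumulate up to 'burst' */
                cfs_b->runtime = min(cfs_b->runtime + cfs_b->quota,
                                     cfs_b->quota + cfs_b->burst);
                cfs_b->runtime_at_refill = cfs_b->runtime;
        }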

    Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Phil Auld dfabd0fc37 sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD
Bugzilla: http://bugzilla.redhat.com/2020279

commit c33627e9a1143afb988fb98d917c4a2faa16f9d9
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Aug 26 19:04:08 2021 +0200

    sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD

    With PREEMPT_RT enabled, all hrtimer callbacks are invoked in softirq
    context unless they are explicitly marked HRTIMER_MODE_HARD.
    During boot, kthread_bind() is used to create per-CPU threads and then
    hangs in wait_task_inactive() if ksoftirqd is not yet up and running.
    The hang went away with commit
       26c7295be0 ("kthread: Do not preempt current task if it is going to call schedule()")

    but enabling function tracing on boot reliably brings the boot freeze
    back.
    The timer in wait_task_inactive() cannot be abused directly from a user
    interface to create a mass wake-up of several tasks at the same time,
    which would lead to long sections with disabled interrupts.
    Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD.

    Switch the timer to HRTIMER_MODE_REL_HARD.
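
    Conceptually this is a one-flag change in the short sleep that
    wait_task_inactive() does between re-checks of the target task (sketched
    here with the surrounding loop abridged):

        ktime_t to = NSEC_PER_SEC / HZ;

        set_current_state(TASK_UNINTERRUPTIBLE);
        /* _HARD: expire in hard-irq context even on PREEMPT_RT, so the
         * timeout does not depend on ksoftirqd being runnable yet */
        schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);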

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:45 -05:00