Commit Graph

1596 Commits

Author SHA1 Message Date
Phil Auld f0cdbfa9cb sched/deadline: Collect sched_dl_entity initialization
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz due to unrelated whitespace difference from
 upstream.

commit 9e07d45c5210f5dd6701c00d55791983db7320fa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Nov 4 11:59:19 2023 +0100

    sched/deadline: Collect sched_dl_entity initialization

    Create a single function that initializes a sched_dl_entity.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
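The change above folds the scattered sched_dl_entity field setup into one helper. A minimal, self-contained C sketch of that shape, with made-up field names rather than the kernel's struct sched_dl_entity layout:

    /*
     * Simplified sketch (not the kernel code): gathering the scattered
     * field setup of a deadline entity into one init helper, as the
     * commit message describes. Field names are illustrative only.
     */
    #include <stdio.h>

    struct dl_entity {
        long long runtime;   /* remaining runtime */
        long long deadline;  /* absolute deadline */
        int       throttled; /* throttle state    */
    };

    /* single place that establishes a sane initial state */
    static void init_dl_entity(struct dl_entity *dl)
    {
        dl->runtime   = 0;
        dl->deadline  = 0;
        dl->throttled = 0;
    }

    int main(void)
    {
        struct dl_entity dl;

        init_dl_entity(&dl);    /* every user calls one helper */
        printf("throttled=%d\n", dl.throttled);
        return 0;
    }
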
Phil Auld e72c2d7c2f sched: Use WRITE_ONCE() for p->on_rq
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit d6111cf45c5787282b2e20d77bdb6b28881d516a
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue Oct 31 11:12:01 2023 -0700

    sched: Use WRITE_ONCE() for p->on_rq

    Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side
    matches with WRITE_ONCE().

    Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:16 -04:00
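The pairing described above (a READ_ONCE() on the reader side wants a matching WRITE_ONCE() on the writer side) can be approximated in userspace with C11 relaxed atomics; this is an analogue for illustration, not the kernel macros:

    /*
     * Userspace analogue of the READ_ONCE()/WRITE_ONCE() pairing: once
     * one side reads a flag with a marked access, the writer should use
     * a marked access too. C11 relaxed atomics play a similar role here.
     */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int on_rq;                 /* stands in for p->on_rq */

    static void writer_side(int value)
    {
        /* roughly WRITE_ONCE(p->on_rq, value) */
        atomic_store_explicit(&on_rq, value, memory_order_relaxed);
    }

    static int reader_side(void)
    {
        /* roughly READ_ONCE(p->on_rq) */
        return atomic_load_explicit(&on_rq, memory_order_relaxed);
    }

    int main(void)
    {
        writer_side(1);
        printf("on_rq=%d\n", reader_side());
        return 0;
    }
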
Phil Auld fea5f42a4c sched/fair: Remove SIS_PROP
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 984ffb6a4366752c949f7b39640aecdce222607f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Oct 20 12:35:33 2023 +0200

    sched/fair: Remove SIS_PROP

    SIS_UTIL seems to work well, so let's remove the old thing.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:15 -04:00
Phil Auld 26c251b772 sched: Add cpus_share_resources API
JIRA: https://issues.redhat.com/browse/RHEL-15622

commit b95303e0aeaf446b65169dd4142cacdaeb7d4c8b
Author: Barry Song <song.bao.hua@hisilicon.com>
Date:   Thu Oct 19 11:33:21 2023 +0800

    sched: Add cpus_share_resources API

    Add the cpus_share_resources() API. This is preparation for the
    optimization of select_idle_cpu() on platforms with a cluster
    scheduler level.

    On a machine with clusters, cpus_share_resources() will test whether
    two CPUs are within the same cluster. On a non-cluster machine it
    behaves the same as cpus_share_cache(). "Resources" is used here to
    refer to cache resources.

    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:46:43 -04:00
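A toy illustration of the semantics described in the commit message, using made-up per-CPU tables: with clusters, cpus_share_resources() compares cluster IDs; without them it falls back to the cache comparison. Not the kernel implementation:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 8

    static int llc_id[NR_CPUS]     = { 0, 0, 0, 0, 0, 0, 0, 0 }; /* one big LLC  */
    static int cluster_id[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 }; /* -1 if absent */

    static bool cpus_share_cache(int a, int b)
    {
        return llc_id[a] == llc_id[b];
    }

    static bool cpus_share_resources(int a, int b)
    {
        if (cluster_id[a] >= 0 && cluster_id[b] >= 0)
            return cluster_id[a] == cluster_id[b];
        return cpus_share_cache(a, b);      /* non-cluster machine */
    }

    int main(void)
    {
        printf("0,1 share resources: %d\n", cpus_share_resources(0, 1)); /* 1 */
        printf("0,2 share resources: %d\n", cpus_share_resources(0, 2)); /* 0 */
        return 0;
    }
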
Phil Auld e7e65f39f8 sched/debug: Print 'tgid' in sched_show_task()
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit bc87127a45928de5fdf0ec39d7a86e1edd0e179e
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Thu Jul 20 16:05:16 2023 +0800

    sched/debug: Print 'tgid' in sched_show_task()

    Multiple blocked tasks are printed when the system hangs. They may have
    the same parent pid, but belong to different task groups.

    Printing tgid lets users better know whether these tasks are from the same
    task group or not.

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld 5215e85db0 sched/headers: Remove duplicate header inclusions
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit d4d6596b43868a1e05fe5b047e73c3aff96444c6
Author: Yu Liao <liaoyu15@huawei.com>
Date:   Wed Aug 2 10:15:01 2023 +0800

    sched/headers: Remove duplicate header inclusions

    <linux/psi.h> and "autogroup.h" are included twice; remove the duplicate header
    inclusions.

    Signed-off-by: Yu Liao <liaoyu15@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230802021501.2511569-1-liaoyu15@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld 05b3b562da sched/debug: Add new tracepoint to track compute energy computation
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 15874a3d27e6405e9d17595f83bd3ca1b6cab16d
Author: Qais Yousef <qyousef@layalina.io>
Date:   Sun Sep 17 00:29:55 2023 +0100

    sched/debug: Add new tracepoint to track compute energy computation

    It was useful to track feec() placement decisions and to debug the spare
    capacity and optimization issues vs. uclamp_max.

    Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230916232955.2099394-4-qyousef@layalina.io

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld 82d55a3223 sched/core: Refactor the task_flags check for worker sleeping in sched_submit_work()
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 3eafe225995c67f8c179011ec2d6e4c12b32a53d
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date:   Sun Aug 20 20:53:17 2023 +0800

    sched/core: Refactor the task_flags check for worker sleeping in sched_submit_work()

    Simplify the conditional logic for checking worker flags
    by splitting the original compound `if` statement into
    separate `if` and `else if` clauses.

    This modification not only retains the previous functionality,
    but also eliminates one redundant `if` check, improving code clarity
    and potentially enhancing performance.

    Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/ZOIMvURE99ZRAYEj@fedora

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
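The refactor above is essentially a compound test split into an if/else-if chain. A plain C sketch with illustrative flag values (the real PF_* flag values differ):

    #include <stdio.h>

    #define PF_WQ_WORKER 0x1   /* illustrative values only */
    #define PF_IO_WORKER 0x2

    static void worker_sleeping_old(unsigned int flags)
    {
        if (flags & (PF_WQ_WORKER | PF_IO_WORKER)) {   /* compound check ... */
            if (flags & PF_WQ_WORKER)
                puts("wq worker sleeping");
            else
                puts("io worker sleeping");
        }
    }

    static void worker_sleeping_new(unsigned int flags)
    {
        if (flags & PF_WQ_WORKER)                      /* ... becomes if / else if */
            puts("wq worker sleeping");
        else if (flags & PF_IO_WORKER)
            puts("io worker sleeping");
    }

    int main(void)
    {
        worker_sleeping_old(PF_IO_WORKER);
        worker_sleeping_new(PF_IO_WORKER);
        return 0;
    }
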
Phil Auld 3952152618 sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug()
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit dc461c48deda8a2d243fbaf49e276d555eb833d8
Author: Liming Wu <liming.wu@jaguarmicro.com>
Date:   Fri Aug 25 10:35:00 2023 +0800

    sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug()

    in_atomic_preempt_off() already gets called in schedule_debug() once,
    which is the only caller of __schedule_bug().

    Skip the second call within __schedule_bug(); it should always be true
    at this point.

    [ mingo: Clarified the changelog. ]

    Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230825023501.1848-1-liming.wu@jaguarmicro.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld e4b674ea9d kernel/sched: Modify initial boot task idle setup
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit cff9b2332ab762b7e0586c793c431a8f2ea4db04
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Fri Sep 15 13:44:44 2023 -0400

    kernel/sched: Modify initial boot task idle setup

    Initial booting is setting the task flag to idle (PF_IDLE) by the call
    path sched_init() -> init_idle().  Having the task idle and calling
    call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
    set.  Subsequent calls to any cond_resched() will enable IRQs,
    potentially earlier than the IRQ setup has completed.  Recent changes
    have caused just this scenario and IRQs have been enabled early.

    This causes a warning later in start_kernel() as interrupts are enabled
    before they are fully set up.

    Fix this issue by setting the PF_IDLE flag later in the boot sequence.

    Although the boot task has been marked as idle since (at least) d80e4fda576d,
    I am not sure that it is wrong to do so.  The forced context-switch on the
    idle task was introduced in the tiny_rcu update, so I'm going to claim
    this fixes 5f6130fa52.

    Fixes: 5f6130fa52 ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld 7e2b960e90 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz in fair.c due to having RT merged,
  specifically: ea622076b76f ("sched: Add support for lazy preemption")

commit e23edc86b09df655bf8963bbcb16647adc787395
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Sep 19 10:38:21 2023 +0200

    sched/fair: Rename check_preempt_curr() to wakeup_preempt()

    The name is a bit opaque - make it clear that this is about wakeup
    preemption.

    Also rename the ->check_preempt_curr() methods similarly.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:55 -04:00
Phil Auld 23fec151eb sched/core: Use do-while instead of for loop in set_nr_if_polling()
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 4ff34ad3d39377d9f6953f3606ccf611ce636767
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Tue Feb 28 17:14:26 2023 +0100

    sched/core: Use do-while instead of for loop in set_nr_if_polling()

    Use an equivalent do-while loop instead of an infinite for loop.

    There are no asm code changes.

    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230228161426.4508-1-ubizjak@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:49:13 -04:00
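The loop-shape change above, an open-coded "for (;;)" retry turned into a do-while around a compare-and-swap, looks roughly like this userspace sketch using C11 atomics rather than the kernel's try_cmpxchg(); the function and its bail-out condition are made up:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool set_flag_if_even(atomic_int *v, int flag)
    {
        int old = atomic_load_explicit(v, memory_order_relaxed);

        do {
            if (old & 1)            /* bail-out condition, re-checked on retry */
                return false;
        } while (!atomic_compare_exchange_weak(v, &old, old | flag));

        return true;
    }

    int main(void)
    {
        atomic_int val = 4;
        bool ok = set_flag_if_even(&val, 2);

        printf("set: %d, val=%d\n", ok, (int)val);
        return 0;
    }
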
Phil Auld 9f16bf1bd9 sched: add WF_CURRENT_CPU and externise ttwu
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit ab83f455f04df5b2f7c6d4de03b6d2eaeaa27b8a
Author: Peter Oskolkov <posk@google.com>
Date:   Tue Mar 7 23:31:57 2023 -0800

    sched: add WF_CURRENT_CPU and externise ttwu

    Add the WF_CURRENT_CPU wake flag that advises the scheduler to
    move the wakee to the current CPU. This is useful for fast on-CPU
    context switching use cases.

    In addition, make ttwu external rather than static so that
    the flag could be passed to it from outside of sched/core.c.

    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: Andrei Vagin <avagin@google.com>
    Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
    Signed-off-by: Kees Cook <keescook@chromium.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:49:13 -04:00
Phil Auld 5664854fb4 sched/core: introduce sched_core_idle_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 548796e2e70b44b4661fd7feee6eb239245ff1f8
Author: Cruz Zhao <CruzZhao@linux.alibaba.com>
Date:   Thu Jun 29 12:02:04 2023 +0800

    sched/core: introduce sched_core_idle_cpu()

    With the introduction of core scheduling, a new idle state is defined:
    force idle, where a CPU runs the idle task while nr_running is greater
    than zero.

    If a CPU is in the force idle state, idle_cpu() will return zero. This
    result makes sense in some scenarios, e.g., load balancing, showacpu
    when dumping, and judging whether the RCU boost kthread is starving.

    But it causes errors in other scenarios, e.g., tick_irq_exit():
    when force idle, rq->curr == rq->idle but rq->nr_running > 0, so
    idle_cpu() returns 0. In tick_irq_exit(), if idle_cpu() is 0,
    tick_nohz_irq_exit() will not be called, and ts->idle_active will
    not become 1 again after being cleared in tick_nohz_irq_enter().
    update_ts_time_stats() then won't update ts->idle_sleeptime because
    ts->idle_active is 0 when it should be 1. As a result,
    ts->idle_sleeptime, and ultimately the idle time reported in
    /proc/stat, is less than the actual value.

    To solve this problem, introduce sched_core_idle_cpu(), which returns
    1 when a CPU is force idle. All users of idle_cpu() were audited, and
    idle_cpu() is replaced with sched_core_idle_cpu() in tick_irq_exit().

    v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
    tick_irq_exit(), and modify the corresponding commit log.

    Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
    Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:49:13 -04:00
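A conceptual sketch of the distinction described above, with made-up runqueue fields: a force-idle CPU runs the idle task while nr_running > 0, so a plain idle_cpu() check reports it as busy, while the new helper reports it as idle. Not the kernel code:

    #include <stdio.h>

    struct rq_view {
        int curr_is_idle;   /* rq->curr == rq->idle */
        int nr_running;
    };

    static int idle_cpu(const struct rq_view *rq)
    {
        return rq->curr_is_idle && rq->nr_running == 0;
    }

    static int sched_core_idle_cpu(const struct rq_view *rq)
    {
        if (rq->curr_is_idle)       /* force idle counts as idle here */
            return 1;
        return idle_cpu(rq);
    }

    int main(void)
    {
        struct rq_view force_idle = { .curr_is_idle = 1, .nr_running = 2 };

        printf("idle_cpu=%d sched_core_idle_cpu=%d\n",
               idle_cpu(&force_idle), sched_core_idle_cpu(&force_idle));
        return 0;
    }
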
Phil Auld dd1a6e3897 sched: add throttled time stat for throttled children
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 677ea015f231aa38b3972aa7be54ecd2637e99fd
Author: Josh Don <joshdon@google.com>
Date:   Tue Jun 20 11:32:47 2023 -0700

    sched: add throttled time stat for throttled children

    We currently export the total throttled time for cgroups that are given
    a bandwidth limit. This patch extends this accounting to also account
    the total time that each child cgroup has been throttled.

    This is useful to understand the degree to which children have been
    affected by the throttling control. Children which are not runnable
    during the entire throttled period, for example, will not show any
    self-throttling time during this period.

    Expose this in a new interface, 'cpu.stat.local', which is similar to
    how non-hierarchical events are accounted in 'memory.events.local'.

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:49:13 -04:00
Phil Auld 74c4d90dda sched/cpufreq: Rework schedutil governor performance estimation
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Wed Nov 22 14:39:03 2023 +0100

    sched/cpufreq: Rework schedutil governor performance estimation

    The current method of taking uclamp hints into account when estimating
    the target frequency can end up in a situation where the selected target
    frequency is higher than the uclamp hints even though there is no real
    need for it. Such cases mainly happen because we are currently mixing the
    traditional scheduler utilization signal with the uclamp performance
    hints. By adding these 2 metrics together, we lose important information
    when it comes to selecting the target frequency, and we have to make
    assumptions which can't fit all cases.

    Rework the interface between the scheduler and schedutil governor in order
    to propagate all information down to the cpufreq governor.

    effective_cpu_util() interface changes and now returns the actual
    utilization of the CPU with 2 optional inputs:

    - The minimum performance for this CPU; typically the capacity to handle
      the deadline task and the interrupt pressure. But also uclamp_min
      request when available.

    - The maximum targeted performance for this CPU, which reflects the
      maximum level that we would like not to exceed. By default it will be
      the CPU capacity but can be reduced because of some performance hints
      set with uclamp. The value can be lower than the actual utilization
      and/or the min performance level.

    A new sugov_effective_cpu_perf() interface is also available to compute
    the final performance level that is targeted for the CPU, after applying
    some cpufreq headroom and taking into account all inputs.

    With these 2 functions, schedutil is now able to decide when it must go
    above uclamp hints. It now also has a generic way to get the min
    performance level.

    The dependency between energy model and cpufreq governor and its headroom
    policy doesn't exist anymore.

    eenv_pd_max_util() asks schedutil for the targeted performance after
    applying the impact of the waking task.

    [ mingo: Refined the changelog & C comments. ]

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael@kernel.org>
    Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:59 -04:00
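A rough sketch of the interface shape described in the commit message: utilization plus optional minimum/maximum performance values, and a second helper that applies headroom and clamps between them. Names, types and the 25% headroom factor are illustrative, not the kernel's:

    #include <stdio.h>

    struct cpu_perf {
        unsigned long util;  /* actual utilization              */
        unsigned long min;   /* floor: DL + IRQ + uclamp_min    */
        unsigned long max;   /* ceiling: capacity or uclamp_max */
    };

    static unsigned long effective_cpu_util_sketch(struct cpu_perf in,
                                                   unsigned long *min,
                                                   unsigned long *max)
    {
        if (min)
            *min = in.min;
        if (max)
            *max = in.max;
        return in.util;
    }

    static unsigned long sugov_effective_cpu_perf_sketch(unsigned long util,
                                                         unsigned long min,
                                                         unsigned long max)
    {
        unsigned long target = util + util / 4;   /* illustrative headroom */

        if (target < min)
            target = min;
        if (target > max)
            target = max;
        return target;
    }

    int main(void)
    {
        struct cpu_perf cp = { .util = 300, .min = 200, .max = 800 };
        unsigned long min, max;
        unsigned long util = effective_cpu_util_sketch(cp, &min, &max);

        printf("target perf = %lu\n",
               sugov_effective_cpu_perf_sketch(util, min, max));
        return 0;
    }
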
Phil Auld 49d1b3f5c9 sched/topology: Consolidate and clean up access to a CPU's max compute capacity
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon Oct 9 12:36:16 2023 +0200

    sched/topology: Consolidate and clean up access to a CPU's max compute capacity

    Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
    instead.

    The scheduler uses 3 methods to get access to a CPU's max compute capacity:

     - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.

     - cpu_capacity_orig field which is periodically updated with
       arch_scale_cpu_capacity().

     - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.

    There is no real need to save the value returned by arch_scale_cpu_capacity()
    in struct rq. arch_scale_cpu_capacity() returns:

     - either a per_cpu variable.

     - or a const value for systems which have only one capacity.

    Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.

    No functional changes.

    Some performance tests on Arm64:

      - small SMP device (hikey): no noticeable changes
      - HMP device (RB5):         hackbench shows minor improvement (1-2%)
      - large smp (thx2):         hackbench and tbench shows minor improvement (1%)

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-05 09:43:50 -04:00
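A toy illustration of the cleanup above: rather than caching the max capacity in a per-runqueue field and refreshing it periodically, callers query an accessor directly (a per-CPU table here, a constant on uniform systems). Names and values are illustrative:

    #include <stdio.h>

    #define NR_CPUS 4

    static unsigned long cpu_capacity[NR_CPUS] = { 1024, 1024, 512, 512 };

    static unsigned long arch_scale_cpu_capacity_sketch(int cpu)
    {
        return cpu_capacity[cpu];   /* per-CPU value, or a constant on SMP */
    }

    int main(void)
    {
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            printf("cpu%d max capacity: %lu\n",
                   cpu, arch_scale_cpu_capacity_sketch(cpu));
        return 0;
    }
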
Phil Auld fe04f5c470 sched/timers: Explain why idle task schedules out on remote timer enqueue
JIRA: https://issues.redhat.com/browse/RHEL-29020

commit 194600008d5c43b5a4ba98c4b81633397e34ffad
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue Nov 14 14:38:40 2023 -0500

    sched/timers: Explain why idle task schedules out on remote timer enqueue

    Trying to avoid that didn't bring much value after testing; add a comment
    about it.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Rafael J. Wysocki <rafael@kernel.org>
    Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-04 12:46:56 -04:00
Waiman Long 707cdda6c1 sched: Provide rt_mutex specific scheduler helpers
JIRA: https://issues.redhat.com/browse/RHEL-28616

commit 6b596e62ed9f90c4a97e68ae1f7b1af5beeb3c05
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri, 8 Sep 2023 18:22:51 +0200

    sched: Provide rt_mutex specific scheduler helpers

    With PREEMPT_RT there is a rt_mutex recursion problem where
    sched_submit_work() can use an rtlock (aka spinlock_t). More
    specifically what happens is:

      mutex_lock() /* really rt_mutex */
        ...
          __rt_mutex_slowlock_locked()
            task_blocks_on_rt_mutex()
              // enqueue current task as waiter
              // do PI chain walk
            rt_mutex_slowlock_block()
              schedule()
                sched_submit_work()
                  ...
                  spin_lock() /* really rtlock */
                    ...
                      __rt_mutex_slowlock_locked()
                        task_blocks_on_rt_mutex()
                          // enqueue current task as waiter *AGAIN*
                          // *CONFUSION*

    Fix this by making rt_mutex do the sched_submit_work() early, before
    it enqueues itself as a waiter -- before it even knows *if* it will
    wait.

    [[ basically Thomas' patch but with different naming and a few asserts
       added ]]

    Originally-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230908162254.999499-5-bigeasy@linutronix.de

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 10:05:58 -04:00
Waiman Long d11684a859 sched: Extract __schedule_loop()
JIRA: https://issues.redhat.com/browse/RHEL-28616

commit de1474b46d889ee0367f6e71d9adfeb0711e4a8d
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri, 8 Sep 2023 18:22:50 +0200

    sched: Extract __schedule_loop()

    There are currently two implementations of this basic __schedule()
    loop, and there is soon to be a third.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230908162254.999499-4-bigeasy@linutronix.de

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 10:05:58 -04:00
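The extracted loop is essentially "call __schedule() until need_resched() is clear". A userspace mock of that shape with stubbed primitives, purely illustrative:

    #include <stdbool.h>
    #include <stdio.h>

    static int pending = 2;

    static void __schedule_mock(unsigned int mode)
    {
        printf("__schedule(mode=%u)\n", mode);
        pending--;
    }

    static bool need_resched_mock(void)
    {
        return pending > 0;
    }

    static void __schedule_loop(unsigned int mode)
    {
        do {
            /* preemption would be disabled around the real call */
            __schedule_mock(mode);
        } while (need_resched_mock());
    }

    int main(void)
    {
        __schedule_loop(0);     /* the schedule() variants now share this */
        return 0;
    }
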
Waiman Long f87f8018c0 sched: Constrain locks in sched_submit_work()
JIRA: https://issues.redhat.com/browse/RHEL-28616

commit 28bc55f654de49f6122c7475b01b5d5ef4bdf0d4
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri, 8 Sep 2023 18:22:48 +0200

    sched: Constrain locks in sched_submit_work()

    Even though sched_submit_work() is run from preemptible context,
    it is discouraged from using blocking locks due to the recursion
    potential.

    Enforce this.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230908162254.999499-2-bigeasy@linutronix.de

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 10:05:57 -04:00
Waiman Long 41316178dc Revert "sched/core: Provide sched_rtmutex() and expose sched work helpers"
JIRA: https://issues.redhat.com/browse/RHEL-28616
Upstream Status: RHEL only

Revert the linux-rt-devel specific commit ca66ec3b9994 ("sched/core:
Provide sched_rtmutex() and expose sched work helpers") to prepare for
the submission of the upstream equivalent.

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 09:56:36 -04:00
Waiman Long 07a160c823 Revert "sched/core: Add __always_inline to schedule_loop()"
JIRA: https://issues.redhat.com/browse/RHEL-28616
Upstream Status: RHEL only

Revert RHEL only commit ec180d083a ("sched/core: Add __always_inline
to schedule_loop()").

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 09:56:34 -04:00
Phil Auld b4dc1de6f4 sched: Misc cleanups
JIRA: https://issues.redhat.com/browse/RHEL-29017
Conflicts: Context differences due to having RT patchset 6bc27040eb
  ("sched: Add support for lazy preemption"). A number of hunks skipped due to not
  having af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") and related
  commits in that series.

commit 0e34600ac9317dbe5f0a7bfaa3d7187d757572ed
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 20:52:49 2023 +0200

    sched: Misc cleanups

    Random remaining guard use...

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-15 12:27:27 -04:00
Phil Auld a3ff38e571 sched: Simplify tg_set_cfs_bandwidth()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 6fb45460615358157a6d3c990e74f9c1395247e2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 20:45:16 2023 +0200

    sched: Simplify tg_set_cfs_bandwidth()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-15 12:27:27 -04:00
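The "use guards" cleanups in this and the following commits rely on scope-exit cleanup instead of goto-based unlock paths. A userspace approximation built on the GCC/Clang cleanup attribute; the kernel's guard()/cleanup.h machinery is more elaborate, and the names here are made up:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t cfs_lock = PTHREAD_MUTEX_INITIALIZER;

    static void unlock_cleanup(pthread_mutex_t **m)
    {
        pthread_mutex_unlock(*m);
    }

    /* one GUARD per scope in this toy version */
    #define GUARD(lock) \
        pthread_mutex_t *guard_ __attribute__((cleanup(unlock_cleanup))) = \
            (pthread_mutex_lock(lock), (lock))

    static int set_bandwidth(long quota)
    {
        GUARD(&cfs_lock);       /* unlocked automatically on every return */

        if (quota < 0)
            return -1;          /* no goto/unlock needed on the error path */

        printf("quota set to %ld\n", quota);
        return 0;
    }

    int main(void)
    {
        set_bandwidth(100000);
        set_bandwidth(-1);
        return 0;
    }
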
Phil Auld 7be2f9d32c sched: Simplify sched_move_task()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit fa614b4feb5a246474ac71b45e520a8ddefc809c
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 20:41:09 2023 +0200

    sched: Simplify sched_move_task()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-15 12:27:27 -04:00
Phil Auld 660ea6ce0b sched: Simplify sched_rr_get_interval()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit af7c5763f5e8bc1b3f827354a283ccaf6a8c8098
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 16:59:05 2023 +0200

    sched: Simplify sched_rr_get_interval()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-15 12:27:26 -04:00
Phil Auld a5f564e4a8 sched: Simplify yield_to()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 7a50f76674f8b6f4f30a1cec954179f10e20110c
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 16:58:23 2023 +0200

    sched: Simplify yield_to()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-15 12:27:26 -04:00
Phil Auld 09540ce68b sched: Simplify sched_{set,get}affinity()
JIRA: https://issues.redhat.com/browse/RHEL-29017
Conflicts: Minor manual fixup needed due to having RHEL-only
  patch 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
  in sched_setaffinity()").

commit 92c2ec5bc1081e6bbbe172bcfb1a566ad7b4f809
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 16:57:35 2023 +0200

    sched: Simplify sched_{set,get}affinity()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-15 12:27:15 -04:00
Phil Auld e3d25cdfd8 sched: Simplify syscalls
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit febe162d4d9158cf2b5d48fdd440db7bb55dd622
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 16:54:54 2023 +0200

    sched: Simplify syscalls

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:36:17 -04:00
Phil Auld 7d8b86de57 sched: Simplify set_user_nice()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 94b548a15e8ec47dfbf6925bdfb64bb5657dce0c
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 9 20:52:55 2023 +0200

    sched: Simplify set_user_nice()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:36:13 -04:00
Phil Auld 653b9d8a24 sched: Simplify sched_core_cpu_{starting,deactivate}()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 7170509cadbb76e5fa7d7b090d2cbdb93d56a2de
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:30 2023 +0200

    sched: Simplify sched_core_cpu_{starting,deactivate}()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.371787909@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:36:09 -04:00
Phil Auld f1dfda18ac sched: Simplify try_steal_cookie()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit b4e1fa1e14286f7a825b10d8ebb2e9c0f77c241b
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:29 2023 +0200

    sched: Simplify try_steal_cookie()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.304154828@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:36:05 -04:00
Phil Auld 035259148c sched: Simplify sched_tick_remote()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 6dafc713e3b0d8ffbd696d200d8c9dd212ddcdfc
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:28 2023 +0200

    sched: Simplify sched_tick_remote()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.236247952@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:36:01 -04:00
Phil Auld bf350e4481 sched: Simplify sched_exec()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 4bdada79f3464d85f6e187213c088e7c934e0554
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:27 2023 +0200

    sched: Simplify sched_exec()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.168490417@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:56 -04:00
Phil Auld 8774c3dfad sched: Simplify ttwu()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 857d315f1201cfcf60e5849c96d2b4dd20f90ebf
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:26 2023 +0200

    sched: Simplify ttwu()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.101069260@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:52 -04:00
Phil Auld 2d0ed06667 sched: Simplify wake_up_if_idle()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 4eb054f92b066ec0a0cba6896ee8eff4c91dfc9e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:25 2023 +0200

    sched: Simplify wake_up_if_idle()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:47 -04:00
Phil Auld f4b0880d3d sched: Simplify: migrate_swap_stop()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 5bb76f1ddf2a7dd98f5a89d7755600ed1b4a7fcd
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:24 2023 +0200

    sched: Simplify: migrate_swap_stop()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:43 -04:00
Phil Auld fe6b874eba sched: Simplify sysctl_sched_uclamp_handler()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 0f92cdf36f848f1c077924f857a49789e00331c0
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:23 2023 +0200

    sched: Simplify sysctl_sched_uclamp_handler()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211811.896559109@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:39 -04:00
Phil Auld ba129ce513 sched: Simplify get_nohz_timer_target()
JIRA: https://issues.redhat.com/browse/RHEL-29017

commit 7537b90c0036759e0b1b43dfbc6224dc5e900b13
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 1 22:41:22 2023 +0200

    sched: Simplify get_nohz_timer_target()

    Use guards to reduce gotos and simplify control flow.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-03-13 14:35:32 -04:00
Waiman Long 9b35f92491 sched/core: Make sched_setaffinity() always return -EINVAL on empty cpumask
JIRA: https://issues.redhat.com/browse/RHEL-21440
Upstream Status: RHEL only

Since RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset
cpumasks in sched_setaffinity()"), an empty cpumask is used for resetting
a user-requested CPU affinity set by a previous sched_setaffinity() call.
An error code of -ENODEV is returned for a successful reset. However,
this broke some test cases that only expect an error code of -EINVAL.
So another RHEL commit 90f7bb0c18 ("sched/core: Don't return -ENODEV
from sched_setaffinity()") was merged to return 0 in that case. Again,
this still broke some other test cases.

This patch restores the old behavior of always returning -EINVAL on an
empty cpumask, with the exception that 0 may still be returned if the
empty cpumask is caused by an input len parameter of 0, which is another
way of resetting the user-requested CPU affinity that I had proposed
upstream before.

Fixes: 90f7bb0c18 ("sched/core: Don't return -ENODEV from sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
2024-02-07 13:45:10 -05:00
Scott Weaver 02be8ace82 Merge: sched/core: Don't return -ENODEV from sched_setaffinity()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3388

JIRA: https://issues.redhat.com/browse/RHEL-16613
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3388/edit
Upstream Status: RHEL only
Tested: A unit test was run to verify that CPU affinity reset only
	happened when the full mask was empty, not when any of the
	out-of-bounds bits were set.

RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()") enables the use of an empty cpumask to reset
a user-provided CPU affinity via sched_setaffinity(2) syscall. It will
return a value of -ENODEV to indicate a success in resetting the user
provided CPU affinity.

However, the current way of checking for an empty cpumask using
cpumask_empty() is not robust enough to avoid many false positives,
leading to an erroneous return of -ENODEV which can confuse user
applications and cause incorrect behavior. For example, if the system
has 28 CPUs and only bit 28 is set, the cpumask will be treated as empty.

Instead of cpumask_empty(), bitmap_empty() should be used to check if
all the bits in the cpumask_size() buffer are zero. This should avoid
many false positives. However, a false positive can still happen if the
set bit is outside the range allowed by cpumask_size(). So we need to
check the full user_mask buffer to see if it is really empty and avoid
any false positive. By doing so, there should be no need to return a
-ENODEV error code, which is a workaround to handle the false positives.
A value of 0 will be returned if the reset is successful, or -EINVAL
will be returned if a user-provided CPU affinity hasn't been properly
set by sched_setaffinity(2).

Fixes: 05fddaaaac ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-09 10:42:38 -05:00
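A userspace illustration of the false positive described in the commits above: a check limited to the first nr_cpu_ids bits can call a mask empty even though the caller set a bit beyond that range, so the reset path has to scan the whole user-supplied buffer. Plain C, not the kernel helpers; sizes are made up:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NR_CPU_IDS   28          /* CPUs the system actually has  */
    #define USER_BITS    64          /* size of the user-supplied mask */

    static bool bits_empty(const unsigned char *buf, int nbits)
    {
        for (int i = 0; i < nbits; i++)
            if (buf[i / 8] & (1u << (i % 8)))
                return false;
        return true;
    }

    int main(void)
    {
        unsigned char user_mask[USER_BITS / 8];

        memset(user_mask, 0, sizeof(user_mask));
        user_mask[28 / 8] |= 1u << (28 % 8);     /* set bit 28, out of range */

        printf("first %d bits empty: %d\n", NR_CPU_IDS,
               bits_empty(user_mask, NR_CPU_IDS));      /* 1: false positive */
        printf("whole buffer empty:  %d\n",
               bits_empty(user_mask, USER_BITS));       /* 0: really set     */
        return 0;
    }
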
Phil Auld ea3a71ed56 sched/core: Fix RQCF_ACT_SKIP leak
JIRA: https://issues.redhat.com/browse/RHEL-15489
Conflicts: Applied with fuzz due to not having mm_cid changes,
specifically 223baf9d17f2 ("sched: Fix performance regression
introduced by mm_cid")

commit 5ebde09d91707a4a9bec1e3d213e3c12ffde348f
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Thu Oct 12 17:00:03 2023 +0800

    sched/core: Fix RQCF_ACT_SKIP leak

    Igor Raits and Bagas Sanjaya reported an RQCF_ACT_SKIP leak warning.

    This warning may be triggered in the following situations:

        CPU0                                      CPU1

    __schedule()
      *rq->clock_update_flags <<= 1;*   unregister_fair_sched_group()
      pick_next_task_fair+0x4a/0x410      destroy_cfs_bandwidth()
        newidle_balance+0x115/0x3e0       for_each_possible_cpu(i) *i=0*
          rq_unpin_lock(this_rq, rf)      __cfsb_csd_unthrottle()
          raw_spin_rq_unlock(this_rq)
                                          rq_lock(*CPU0_rq*, &rf)
                                          rq_clock_start_loop_update()
                                          rq->clock_update_flags & RQCF_ACT_SKIP <--
          raw_spin_rq_lock(this_rq)

    The purpose of RQCF_ACT_SKIP is to skip the rq clock update, but the
    update happens very early in __schedule() while RQCF_*_SKIP is cleared
    very late, causing the flag to span the gap above and trigger this
    warning.

    In __schedule() we can clear the RQCF_*_SKIP flag immediately after
    update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning. Also set
    rq->clock_update_flags to RQCF_UPDATED to avoid the
    rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered
    later.

    Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
    Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
    Reported-by: Igor Raits <igor.raits@gmail.com>
    Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-11-28 13:07:59 -05:00
Waiman Long 90f7bb0c18 sched/core: Don't return -ENODEV from sched_setaffinity()
JIRA: https://issues.redhat.com/browse/RHEL-16613
Upstream Status: RHEL only
Tested: A unit test was run to verify that CPU affinity reset only
	happened when the full mask was empty, not when any of the
	out-of-bounds bits were set.

RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()") enables the use of an empty cpumask to reset
a user-provided CPU affinity via sched_setaffinity(2) syscall. It will
return a value of -ENODEV to indicate a success in resetting the user
provided CPU affinity.

However, the current way of checking for an empty cpumask using
cpumask_empty() is not robust enough to avoid many false positives,
leading to an erroneous return of -ENODEV which can confuse user
applications and cause incorrect behavior. For example, if the system
has 28 CPUs and only bit 28 is set, the cpumask will be treated as empty.

Instead of cpumask_empty(), bitmap_empty() should be used to check if
all the bits in the cpumask_size() buffer are zero. This should avoid
many false positives. However, a false positive can still happen if the
set bit is outside the range allowed by cpumask_size(). So we need to
check the full user_mask buffer to see if it is really empty and avoid
any false positive. By doing so, there should be no need to return a
-ENODEV error code, which is a workaround to handle the false positives.
A value of 0 will be returned if the reset is successful, or -EINVAL
will be returned if a user-provided CPU affinity hasn't been properly
set by sched_setaffinity(2).

Fixes: 05fddaaaac ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
2023-11-20 09:33:18 -05:00
Chris von Recklinghausen b92cce1ea6 mm: multi-gen LRU: support page table walks
Conflicts:
	fs/exec.c - We already have
		33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"")
		so don't add call to timens_on_fork back in
	include/linux/mmzone.h - We already have
		e6ad640bc404 ("mm: deduplicate cacheline padding code")
		so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_)
	mm/vmscan.c - The backport of
		badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")
		added an #include <linux/debugfs.h>. Keep it.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd74fdaea146029e4fa12c6de89adbe0779348a9
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:05 2022 -0600

    mm: multi-gen LRU: support page table walks

    To further exploit spatial locality, the aging prefers to walk page tables
    to search for young PTEs and promote hot pages.  A kill switch will be
    added in the next patch to disable this behavior.  When disabled, the
    aging relies on the rmap only.

    NB: this behavior has nothing in common with the page table scanning in the
    2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
    to swapcache and unmaps them.

    To avoid confusion, the term "iteration" specifically means the traversal
    of an entire mm_struct list; the term "walk" will be applied to page
    tables and the rmap, as usual.

    An mm_struct list is maintained for each memcg, and an mm_struct follows
    its owner task to the new memcg when this task is migrated.  Given an
    lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot pages
    before it increments max_seq.

    When multiple page table walkers iterate the same list, each of them gets
    a unique mm_struct; therefore they can run concurrently.  Page table
    walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
    pages it left in the previous memcg will not be promoted when its current
    memcg is under reclaim.  Similarly, page table walkers will not promote
    pages from nodes other than the one under reclaim.

    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that
       page table walkers can skip processes that have been sleeping since
       the last iteration.
    2. It uses generational Bloom filters to record populated branches so
       that page table walkers can reduce their search space based on the
       query results, e.g., to skip page tables containing mostly holes or
       misplaced pages.
    3. It takes advantage of the accessed bit in non-leaf PMD entries when
       CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table. This
       improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[8, 10]%
                    Ops/sec      KB/sec
          patch1-7: 1147696.57   44640.29
          patch1-8: 1245274.91   48435.66

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

        patch1-8
          49.44%  lzo1x_1_do_compress (real work)
           6.19%  page_vma_mapped_walk (overhead)
           5.97%  _raw_spin_unlock_irq
           3.13%  get_pfn_folio
           2.85%  ptep_clear_flush
           2.42%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.92%  memmove
           1.44%  alloc_zspage
           1.36%  memset

      Configurations:
        no change

    Thanks to the following developers for their efforts [3].
      kernel test robot <lkp@intel.com>

    [1] https://lwn.net/Articles/23732/
    [2] https://llvm.org/docs/ScudoHardenedAllocator.html
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:46 -04:00
Chris von Recklinghausen 8e351dfac4 memory tiering: adjust hot threshold automatically
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c959924b0dc53bf6252793f41480bc01b9792570
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Jul 13 16:39:53 2022 +0800

    memory tiering: adjust hot threshold automatically

    The promotion hot threshold is workload and system configuration
    dependent.  So in this patch, a method to adjust the hot threshold
    automatically is implemented.  The basic idea is to control the number of
    the candidate promotion pages to match the promotion rate limit.  If the
    hint page fault latency of a page is less than the hot threshold, we will
    try to promote the page, and the page is called the candidate promotion
    page.

    If the number of the candidate promotion pages in the statistics interval
    is much more than the promotion rate limit, the hot threshold will be
    decreased to reduce the number of the candidate promotion pages.
    Otherwise, the hot threshold will be increased to increase the number of
    the candidate promotion pages.

    To make the above method work, in each statistics interval the total
    number of the pages to check (on which the hint page faults occur) and the
    hot/cold distribution need to be stable.  Because the page tables are
    scanned linearly in NUMA balancing, but the hot/cold distribution usually
    isn't uniform along the address space, the statistics interval should be
    larger than the NUMA balancing scan period.  So in the patch, the max scan
    period is used as the statistics interval and it works well in our tests.

    Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: osalvador <osalvador@suse.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:31 -04:00
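The threshold adjustment described above is a simple feedback loop: per statistics interval, compare the number of candidate promotion pages against the rate limit and nudge the hot threshold accordingly. A minimal sketch with made-up step sizes, margins and units:

    #include <stdio.h>

    static long hot_threshold_ms = 1000;   /* hint fault latency threshold */

    static void adjust_hot_threshold(long candidates, long rate_limit)
    {
        long step = hot_threshold_ms / 10;

        if (candidates > rate_limit * 11 / 10)       /* well over the limit  */
            hot_threshold_ms -= step;                /* fewer candidates     */
        else if (candidates < rate_limit * 9 / 10)   /* well under the limit */
            hot_threshold_ms += step;                /* more candidates      */

        if (hot_threshold_ms < 10)
            hot_threshold_ms = 10;
    }

    int main(void)
    {
        adjust_hot_threshold(5000, 2000);   /* far too many candidates */
        printf("threshold after interval 1: %ld ms\n", hot_threshold_ms);

        adjust_hot_threshold(1000, 2000);   /* now too few */
        printf("threshold after interval 2: %ld ms\n", hot_threshold_ms);
        return 0;
    }
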
Scott Weaver c007b2ed95 Merge: rcu: Backport upstream RCU commits up to v6.4
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3099

JIRA: https://issues.redhat.com/browse/RHEL-5228
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3099
Omitted-fix: 4d09caec2fab ("arm64: kcsan: Support detecting more missing memory barriers")

This MR backports upstream RCU commits up to v6.1 with related fixes,
if applicable. It also includes a number of KCSAN commits which provide
helpers and APIs that may be referenced by commits from other subsystems.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-17 09:01:45 -04:00
Scott Weaver 0f4bee1faf Merge: Scheduler updates for 9.4
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3029

JIRA: https://issues.redhat.com/browse/RHEL-1536
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2208016

Tested: With scheduler stress tests, cfs bandwidth tests and unit tests on
specific pieces (such as enabling WARN_DOUBLE_CLOCK, etc.), in addition to CKI
and perf QE testing.

Updates and fixes from up to v6.5 for scheduler related code. This includes
a revert of one of the RT merge patches which is then re-applied in the form
it took when added to Linus's tree (see the "wait_task_inactive()" commits).

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Joe Lawrence <joe.lawrence@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-17 09:01:44 -04:00
Scott Weaver 569bd0e035 Merge: trace: Add trace_ipi_send_cpumask()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3080

Linux has had tracepoints tied to IPI reception for a while, but none tied to IPI emission.

This series adds tracepoints to the actual codepath sending the IPIs, which makes it possible to trace and track sources of IPIs with Ftrace. This is very useful for setups where IPIs to certain CPUs are /mostly/ undesired and a source of unwanted interference (e.g. CPU isolation).

Bugzilla: https://bugzilla.redhat.com/2192613

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Valentin Schneider <vschneid@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-09-27 09:34:11 -04:00
Waiman Long 29653e5b7c sched: Add helper nr_context_switches_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-5228

commit 7c182722a0a9447e31f9645de4f311e5bc59b480
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Sat, 19 Nov 2022 17:25:05 +0800

    sched: Add helper nr_context_switches_cpu()

    Add a function nr_context_switches_cpu() that returns the number of context
    switches since boot on the specified CPU.  This information will be used
    to diagnose RCU CPU stalls.

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-09-22 13:21:37 -04:00
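A toy analogue of the new helper: per-CPU context-switch counters plus an accessor for a single CPU, alongside the existing sum-over-all-CPUs style helper. Data and names are made up:

    #include <stdio.h>

    #define NR_CPUS 4

    static unsigned long long nr_switches[NR_CPUS] = { 120, 45, 0, 900 };

    static unsigned long long nr_context_switches_cpu_sketch(int cpu)
    {
        return nr_switches[cpu];    /* what an RCU stall report would read */
    }

    static unsigned long long nr_context_switches_sketch(void)
    {
        unsigned long long sum = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            sum += nr_switches[cpu];
        return sum;
    }

    int main(void)
    {
        printf("cpu2 switches: %llu, total: %llu\n",
               nr_context_switches_cpu_sketch(2), nr_context_switches_sketch());
        return 0;
    }
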
Jerome Marchand 04a26afde2 trace: Add trace_ipi_send_cpu()
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")

commit 68e2d17c9eb311ab59aeb6d0c38aad8985fa2596
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Mar 22 11:28:36 2023 +0100

    trace: Add trace_ipi_send_cpu()

    Because copying cpumasks around when targeting a single CPU is a bit
    daft...

    Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
Jerome Marchand aa5786b04d sched, smp: Trace smp callback causing an IPI
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts: Need to modify __smp_call_single_queue_debug() too. It was
removed upstream by commit 1771257cb447 ("locking/csd_lock: Remove
added data from CSD lock debugging")

commit 68f4ff04dbada18dad79659c266a8e5e29e458cd
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Tue Mar 7 14:35:58 2023 +0000

    sched, smp: Trace smp callback causing an IPI

    Context
    =======

    The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
    which so far has only been fed with NULL.

    While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
    struct layout (meaning their callback func can be accessed without caring
    about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
    attached to its struct. This means we need to check the type of a CSD
    before eventually dereferencing its associated callback.

    This isn't as trivial as it sounds: the CSD type is stored in
    __call_single_node.u_flags, which get cleared right before the callback is
    executed via csd_unlock(). This implies checking the CSD type before it is
    enqueued on the call_single_queue, as the target CPU's queue can be flushed
    before we get to sending an IPI.

    Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
    would need to have an additional argument to trickle down the invoked
    function. This is somewhat silly, as the extra argument will always be
    pushed down to the function even when nothing is being traced, which is
    unnecessary overhead.

    Changes
    =======

    send_call_function_single_ipi() is only used by smp.c, and is defined in
    sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of
    a CPU's idle task).

    Split it into two parts: the scheduler bits remain in sched/core.c, and the
    actual IPI emission is moved into smp.c. This lets us define an
    __always_inline helper function that can take the related callback as
    parameter without creating useless register pressure in the non-traced path
    which only gains a (disabled) static branch.

    Do the same thing for the multi IPI case.
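
    A loose user-space model of the shape described above (the static
    branch and tracepoint are replaced with a plain flag and printf, and
    all names are illustrative): the callback is handed to an
    always-inlined helper, so the untraced path pays essentially nothing
    for the extra argument.

      #include <stdio.h>
      #include <stdbool.h>

      /* Stand-in for the (normally disabled) tracepoint static branch. */
      static bool trace_ipi_enabled;

      static void trace_ipi_send_cpu(int cpu, void *callsite,
                                     void (*cb)(void *))
      {
          printf("ipi_send_cpu: cpu=%d callsite=%p callback=%p\n",
                 cpu, callsite, (void *)cb);
      }

      /* Architecture IPI emission would happen here. */
      static void arch_send_ipi(int cpu) { (void)cpu; }

      /* Always inlined so the 'cb' argument creates no register
       * pressure in callers when tracing is disabled. */
      static inline __attribute__((always_inline))
      void send_ipi_single(int cpu, void (*cb)(void *))
      {
          if (trace_ipi_enabled)
              trace_ipi_send_cpu(cpu, __builtin_return_address(0), cb);
          arch_send_ipi(cpu);
      }

      static void some_smp_callback(void *info) { (void)info; }

      int main(void)
      {
          send_ipi_single(1, some_smp_callback);   /* untraced */
          trace_ipi_enabled = true;
          send_ipi_single(2, some_smp_callback);   /* traced */
          return 0;
      }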

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
Jerome Marchand 160dc2ad5b sched, smp: Trace IPIs sent via send_call_function_single_ipi()
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")

commit cc9cb0a71725aa8dd8d8f534a9b562bbf7981f75
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Tue Mar 7 14:35:53 2023 +0000

    sched, smp: Trace IPIs sent via send_call_function_single_ipi()

    send_call_function_single_ipi() is the thing that sends IPIs at the bottom
    of smp_call_function*() via either generic_exec_single() or
    smp_call_function_many_cond(). Give it an IPI-related tracepoint.

    Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
    which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Acked-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
Phil Auld c11309550b sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 96500560f0c73c71bca1b27536c6254fa0e8ce37
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Tue Jun 13 16:20:10 2023 +0800

    sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()

    There is a double update_rq_clock() invocation:

      __balance_push_cpu_stop()
        update_rq_clock()
        __migrate_task()
          update_rq_clock()

    Sadly select_fallback_rq() also needs update_rq_clock() for
    __do_set_cpus_allowed(), it is not possible to remove the update from
    __balance_push_cpu_stop(). So remove it from __migrate_task() and
    ensure all callers of this function call update_rq_clock() prior to
    calling it.
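
    A toy, stand-alone model of the resulting rule (names and fields are
    illustrative, not kernel code): __migrate_task() now assumes its
    caller has already updated the runqueue clock exactly once.

      #include <stdio.h>

      struct rq {
          unsigned long long clock;
          int clock_updated;    /* models the RQCF_UPDATED debug flag */
      };

      static void update_rq_clock(struct rq *rq)
      {
          rq->clock++;          /* real code reads the sched clock */
          rq->clock_updated = 1;
      }

      /* After the change: relies on the caller's clock update. */
      static void __migrate_task(struct rq *rq, int dest_cpu)
      {
          if (!rq->clock_updated)
              printf("WARNING: rq clock not updated before migration\n");
          printf("migrating to CPU %d at clock %llu\n",
                 dest_cpu, rq->clock);
      }

      static void __balance_push_cpu_stop(struct rq *rq)
      {
          update_rq_clock(rq);  /* single update for this whole path */
          __migrate_task(rq, 1);
          rq->clock_updated = 0;
      }

      int main(void)
      {
          struct rq rq = { 0, 0 };

          __balance_push_cpu_stop(&rq);
          return 0;
      }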

    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 57aa0597d5 sched/core: Fixed missing rq clock update before calling set_rq_offline()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit cab3ecaed5cdcc9c36a96874b4c45056a46ece45
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Tue Jun 13 16:20:09 2023 +0800

    sched/core: Fixed missing rq clock update before calling set_rq_offline()

    When using a cpufreq governor that uses
    cpufreq_add_update_util_hook(), it is possible to trigger a missing
    update_rq_clock() warning for the CPU hotplug path:

      rq_attach_root()
        set_rq_offline()
          rq_offline_rt()
            __disable_runtime()
              sched_rt_rq_enqueue()
                enqueue_top_rt_rq()
                  cpufreq_update_util()
                    data->func(data, rq_clock(rq), flags)

    Move update_rq_clock() from sched_cpu_deactivate() (one of its
    callers) into set_rq_offline() such that it covers all
    set_rq_offline() usage.

    Additionally change rq_attach_root() to use rq_lock_irqsave() so that
    it will properly manage the runqueue clock flags.

    Suggested-by: Ben Segall <bsegall@google.com>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 51c8946826 sched: Consider task_struct::saved_state in wait_task_inactive()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 1c06918788e8ae6e69e4381a2806617312922524
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed May 31 16:39:07 2023 +0200

    sched: Consider task_struct::saved_state in wait_task_inactive()

    With the introduction of task_struct::saved_state in commit
    5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
    matching the task state has gotten more complicated. That same commit
    changed try_to_wake_up() to consider both states, but
    wait_task_inactive() has been neglected.

    Sebastian noted that the wait_task_inactive() usage in
    ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
    the tasklist_lock after it sets TASK_TRACED.

    Therefore extract a common helper from ttwu_state_match() and use that
    to teach wait_task_inactive() about the PREEMPT_RT locks.

    Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld 9e3d063131 sched: Unconditionally use full-fat wait_task_inactive()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit d5e1586617be7093ea3419e3fa9387ed833cdbb1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 2 10:42:53 2023 +0200

    sched: Unconditionally use full-fat wait_task_inactive()

    While modifying wait_task_inactive() for PREEMPT_RT; the build robot
    noted that UP got broken. This led to audit and consideration of the
    UP implementation of wait_task_inactive().

    It looks like the UP implementation is also broken for PREEMPT;
    consider task_current_syscall() getting preempted between the two
    calls to wait_task_inactive().

    Therefore move the wait_task_inactive() implementation out of
    CONFIG_SMP and unconditionally use it.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Phil Auld ec56b1904a sched: Change wait_task_inactive()s match_state
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: This was applied out of order with f9fc8cad9728 ("sched:
    Add TASK_ANY for wait_task_inactive()") so adjusted code to match
    what the results should have been.

commit 9204a97f7ae862fc8a3330ec8335917534c3fb63
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Aug 22 13:18:19 2022 +0200

    sched: Change wait_task_inactive()s match_state

    Make wait_task_inactive()'s @match_state work like ttwu()'s @state.

    That is, instead of an equal comparison, use it as a mask. This allows
    matching multiple block conditions.
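
    A stand-alone sketch of the difference, using made-up state bit
    values rather than the kernel's: the old behaviour required an exact
    match, while treating @match_state as a mask lets several blocked
    states match at once.

      #include <stdio.h>

      /* Illustrative task-state bits, not the kernel's values. */
      #define TASK_INTERRUPTIBLE    0x01
      #define TASK_UNINTERRUPTIBLE  0x02

      /* Before: only an exact match counts. */
      static int match_exact(unsigned int state, unsigned int match_state)
      {
          return state == match_state;
      }

      /* After: @match_state is a mask of acceptable blocked states. */
      static int match_mask(unsigned int state, unsigned int match_state)
      {
          return (state & match_state) != 0;
      }

      int main(void)
      {
          unsigned int state = TASK_UNINTERRUPTIBLE;
          unsigned int wanted = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE;

          printf("exact: %d, mask: %d\n",
                 match_exact(state, wanted), match_mask(state, wanted));
          return 0;
      }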

    (removes the unlikely; it doesn't make sense how it's only part of the
    condition)

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:58 -04:00
Phil Auld 418216578b Revert "sched: Consider task_struct::saved_state in wait_task_inactive()."
JIRA: https://issues.redhat.com/browse/RHEL-1536
Upstream status: RHEL only
Conflicts: A later patch renamed task_running() to task_on_cpu() so this
did not revert cleanly. In addition match_state does not need to be checked
for 0 due to f9fc8cad9728 sched: Add TASK_ANY for wait_task_inactive().

This reverts commit 3673cc2e61.

This is commit a015745ca41f from the RT tree merge. It will be re-applied in
the form it was in when merged to Linus' tree as 1c06918788.

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:58 -04:00
Phil Auld c59c893622 sched/core: Make sched_dynamic_mutex static
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Tue Apr 11 22:26:41 2023 -0700

    sched/core: Make sched_dynamic_mutex static

    The sched_dynamic_mutex is only used within the file.  Make it static.

    Fixes: e3ff7c609f39 ("livepatch,sched: Add livepatch task switching to cond_resched()")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/oe-kbuild-all/202304062335.tNuUjgsl-lkp@intel.com/

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 63797cb734 sched/core: Reduce cost of sched_move_task when config autogroup
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit eff6c8ce8d4d7faef75f66614dd20bb50595d261
Author: wuchi <wuchi.zero@gmail.com>
Date:   Tue Mar 21 14:44:59 2023 +0800

    sched/core: Reduce cost of sched_move_task when config autogroup

    Some sched_move_task calls are useless because
    task_struct->sched_task_group may not have changed (it equals the
    task_group of cpu_cgroup) when the system enables autogroup. So do
    some checks in sched_move_task.

    sched_move_task example:
    task A belongs to cpu_cgroup0 and autogroup0; it will still belong
    to cpu_cgroup0 at do_exit, so there is no need to do the {de|en}queue.
    The call graph is as follow.

      do_exit
        sched_autogroup_exit_task
          sched_move_task
            dequeue_task
              sched_change_group
                A.sched_task_group = sched_get_task_group (=cpu_cgroup0)
            enqueue_task
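
    A stand-alone sketch of the check, with stubbed types and a
    hypothetical sched_get_task_group() lookup: bail out early when the
    destination group is the one the task already belongs to, so the
    dequeue/requeue work is skipped.

      #include <stdio.h>

      struct task_group { int id; };
      struct task { struct task_group *sched_task_group; };

      static struct task_group cpu_cgroup0 = { 0 };

      /* Stand-in for resolving the group the task would move to;
       * in the autogroup do_exit case it is the group it already has. */
      static struct task_group *sched_get_task_group(struct task *p)
      {
          (void)p;
          return &cpu_cgroup0;
      }

      static void sched_move_task(struct task *p)
      {
          struct task_group *group = sched_get_task_group(p);

          if (p->sched_task_group == group) {
              printf("group unchanged, nothing to do\n");
              return;
          }
          printf("dequeue, change group, enqueue\n");
          p->sched_task_group = group;
      }

      int main(void)
      {
          struct task a = { &cpu_cgroup0 };

          sched_move_task(&a);  /* exiting autogroup task: early exit */
          return 0;
      }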

    Performance results:
    ===========================
    1. env
            cpu: bogomips=4600.00
         kernel: 6.3.0-rc3
     cpu_cgroup: 6:cpu,cpuacct:/user.slice

    2. cmds
    do_exit script:

      for i in {0..10000}; do
          sleep 0 &
          done
      wait

    Run the above script, then use the following bpftrace cmd to get
    the cost of sched_move_task:

      bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; }
                   kr:sched_move_task /@ts[tid]/
                      { @ns += nsecs - @ts[tid]; delete(@ts[tid]); }'

    3. cost time(ns):
      without patch: 43528033
      with    patch: 18541416
               diff:-24986617  -57.4%

    As the results show, the patch saves 57.4% in this scenario.

    Signed-off-by: wuchi <wuchi.zero@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 112765493a sched/core: Avoid selecting the task that is throttled to run when core-sched enable
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 530bfad1d53d103f98cec66a3e491a36d397884d
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Thu Mar 16 16:18:06 2023 +0800

    sched/core: Avoid selecting the task that is throttled to run when core-sched enable

    When an {rt, cfs}_rq or dl task is throttled, cookied tasks are not
    dequeued from the core tree, so sched_core_find() and
    sched_core_next() may return a throttled task, which may cause a
    throttled task to run on the CPU.

    So add checks in sched_core_find() and sched_core_next() to make
    sure that the returned task is runnable and not throttled.

    Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 2e4b079146 sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit 6015b1aca1a233379625385feb01dd014aca60b5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Mar 14 19:32:38 2023 -0700

    sched_getaffinity: don't assume 'cpumask_size()' is fully initialized

    The getaffinity() system call uses 'cpumask_size()' to decide how big
    the CPU mask is - so far so good.  It is indeed the allocation size of a
    cpumask.

    But the code also assumes that the whole allocation is initialized
    without actually doing so itself.  That's wrong, because we might have
    fixed-size allocations (making copying and clearing more efficient), but
    not all of it is then necessarily used if 'nr_cpu_ids' is smaller.

    Having checked other users of 'cpumask_size()', they all seem to be ok,
    either using it purely for the allocation size, or explicitly zeroing
    the cpumask before using the size in bytes to copy it.

    See for example the ublk_ctrl_get_queue_affinity() function that uses
    the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
    cleared, whether the storage is on the stack or if it was an external
    allocation.

    Fix this by just zeroing the allocation before using it.  Do the same
    for the compat version of sched_getaffinity(), which had the same logic.

    Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
    access the bits.  For a cpumask_var_t, it ends up being a pointer to the
    same data either way, but it's just a good idea to treat it like you
    would a 'cpumask_t'.  The compat case already did that.
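
    A small user-space model of why the zeroing matters, with made-up
    sizes standing in for cpumask_size() and nr_cpu_ids: only part of the
    fixed-size buffer is filled in, so the rest must be cleared explicitly
    before it is copied out.

      #include <stdio.h>
      #include <string.h>

      #define CPUMASK_BYTES  16  /* full allocation, as cpumask_size() */
      #define NR_CPU_BYTES    2  /* bytes actually covered by nr_cpu_ids */

      static void get_affinity(unsigned char *dst,
                               const unsigned char *online)
      {
          /* The fix: clear the whole allocation first so the bytes
           * beyond nr_cpu_ids are well defined, not leftover garbage. */
          memset(dst, 0, CPUMASK_BYTES);
          memcpy(dst, online, NR_CPU_BYTES);
      }

      int main(void)
      {
          unsigned char mask[CPUMASK_BYTES];
          unsigned char online[NR_CPU_BYTES] = { 0xff, 0x0f }; /* 12 CPUs */

          get_affinity(mask, online);
          for (size_t i = 0; i < sizeof(mask); i++)
              printf("%02x ", mask[i]);
          printf("\n");
          return 0;
      }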

    Reported-by: Ryan Roberts <ryan.roberts@arm.com>
    Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
    Cc: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 8142c03a19 livepatch,sched: Add livepatch task switching to cond_resched()
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: Minor fixup due to already having 8df1947c71 ("livepatch:
    Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure")

commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Fri Feb 24 08:50:00 2023 -0800

    livepatch,sched: Add livepatch task switching to cond_resched()

    There have been reports [1][2] of live patches failing to complete
    within a reasonable amount of time due to CPU-bound kthreads.

    Fix it by patching tasks in cond_resched().

    There are four different flavors of cond_resched(), depending on the
    kernel configuration.  Hook into all of them.

    A more elegant solution might be to use a preempt notifier.  However,
    non-ORC unwinders can't unwind a preempted task reliably.

    [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
    [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org

    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:06 -04:00
Phil Auld 4e3b05f4b0 sched/fair: Block nohz tick_stop when cfs bandwidth in use
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
Conflicts: minor fuzz due to context.

commit 88c56cfeaec4642aee8aac58b38d5708c6aae0d3
Author: Phil Auld <pauld@redhat.com>
Date:   Wed Jul 12 09:33:57 2023 -0400

    sched/fair: Block nohz tick_stop when cfs bandwidth in use

    CFS bandwidth limits and NOHZ full don't play well together.  Tasks
    can easily run well past their quotas before a remote tick does
    accounting.  This leads to long, multi-period stalls before such
    tasks can run again. Currently, when presented with these conflicting
    requirements the scheduler is favoring nohz_full and letting the tick
    be stopped. However, nohz tick stopping is already best-effort; there
    are a number of conditions that can prevent it, whereas cfs runtime
    bandwidth is expected to be enforced.

    Make the scheduler favor bandwidth over stopping the tick by setting
    TICK_DEP_BIT_SCHED when the only running task is a cfs task with
    runtime limit enabled. We use cfs_b->hierarchical_quota to
    determine if the task requires the tick.

    Add check in pick_next_task_fair() as well since that is where
    we have a handle on the task that is actually going to be running.

    Add check in sched_can_stop_tick() to cover some edge cases such
    as nr_running going from 2->1 and the 1 remains the running task.
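
    A rough stand-alone model of the decision described above; the field
    and function shapes are illustrative, not the kernel's: the tick is
    kept when the lone runnable task is a CFS task with a bandwidth limit.

      #include <stdio.h>
      #include <stdbool.h>

      struct task {
          bool is_cfs;
          long long quota;  /* models cfs_b->hierarchical_quota; -1 = inf */
      };

      struct rq {
          int nr_running;
          struct task *curr;
      };

      static bool sched_can_stop_tick(const struct rq *rq)
      {
          if (rq->nr_running != 1)
              return rq->nr_running == 0;
          /* Bandwidth enforcement needs the tick for accounting. */
          if (rq->curr->is_cfs && rq->curr->quota >= 0)
              return false;
          return true;
      }

      int main(void)
      {
          struct task limited = { true, 50000 };
          struct task unlimited = { true, -1 };
          struct rq rq = { 1, &limited };

          printf("limited task: stop tick? %d\n",
                 sched_can_stop_tick(&rq));
          rq.curr = &unlimited;
          printf("unlimited task: stop tick? %d\n",
                 sched_can_stop_tick(&rq));
          return 0;
      }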

    Reviewed-By: Ben Segall <bsegall@google.com>
    Signed-off-by: Phil Auld <pauld@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:25:42 -04:00
Phil Auld 3f3cb409d3 sched, cgroup: Restore meaning to hierarchical_quota
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

commit c98c18270be115678f4295b10a5af5dcc9c4efa0
Author: Phil Auld <pauld@redhat.com>
Date:   Fri Jul 14 08:57:46 2023 -0400

    sched, cgroup: Restore meaning to hierarchical_quota

    In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
    groups due to the previous fix simply taking the min.  It should
    reflect a limit imposed at that level or by an ancestor. Even
    though cgroupv2 does not require child quota to be less than or
    equal to that of its ancestors the task group will still be
    constrained by such a quota so this should be shown here. Cgroupv1
    continues to set this correctly.

    In both cases, add initialization when a new task group is created
    based on the current parent's value (or RUNTIME_INF in the case of
    root_task_group). Otherwise, the field is wrong until a quota is
    changed after creation and __cfs_schedulable() is called.

    Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
    Signed-off-by: Phil Auld <pauld@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:25:41 -04:00
Jan Stancek 8d19d78fab Merge: sched/core: Use empty mask to reset cpumasks in sched_setaffinity()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2962

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only

Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.

One way to solve this problem is to provide a way to back off from
that user provided CPU affinity.  This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.

Before this patch, passing in an empty cpumask to sched_setaffinity(2)
will always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2) has been called
before to set up user_cpus_ptr. In this case, the user_cpus_ptr that stores
the user-provided affinity will be cleared and the task's CPU affinity will
be reset to that of the current cpuset. This alternative error code of
-ENODEV signals that no CPU is specified and, at the same time, has the
side effect of resetting the CPU affinity to the cpuset default.

If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before.  Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.
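
A user-space sketch of how the new behaviour could be exercised; the CPU number and error handling are illustrative, and on kernels without this change the second call simply fails with EINVAL, as described above.

  #define _GNU_SOURCE
  #include <errno.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      cpu_set_t set;

      /* Pin ourselves somewhere first so user_cpus_ptr gets set up. */
      CPU_ZERO(&set);
      CPU_SET(0, &set);
      if (sched_setaffinity(0, sizeof(set), &set))
          perror("sched_setaffinity(cpu 0)");

      /* Pass an empty mask to request a reset to the cpuset default.
       * With this change the call returns -ENODEV but, as a side
       * effect, clears user_cpus_ptr; otherwise it returns -EINVAL. */
      CPU_ZERO(&set);
      if (sched_setaffinity(0, sizeof(set), &set)) {
          if (errno == ENODEV)
              printf("affinity reset to cpuset default\n");
          else
              printf("reset not supported: %s\n", strerror(errno));
      }
      return 0;
  }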

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-09-01 21:26:13 +02:00
Jan Stancek f2a2d5da21 Merge: cgroup/cpuset: Provide better cpuset API to enable creation of isolated partition
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957

OpenShift requires support to disable CPU load balancing for the Telco
use cases and this is a gating factor in determining if it can switch
to use cgroup v2 as the default.

The current RHEL9 kernel is able to create an isolated cpuset partition
of exclusive CPUs with load balancing disabled for cgroup v2. However,
it currently has the limitation that isolated cpuset partitions can
only be formed clustered around the cgroup root. That doesn't fit the
current OpenShift use case where systemd is primarily responsible for
managing the cgroup filesystem and OpenShift can only manage child
cgroups further away from the cgroup root.

To address the need of OpenShift, a patch series [1] has been proposed
upstream to extend the v2 cpuset partition semantics to allow the
creation of isolated partitions further away from cgroup root by adding a
new cpuset control file "cpuset.cpus.exclusive" to distribute potential
exclusive CPUs down the cgroup hierarchy for the creation of isolated
cpuset partition.

This MR incorporates the proposed upstream patches with its dependency
patches to provide a way for OpenShift to move forward with switching
the default cgroup from v1 to v2 for the 4.14 release.

The last 6 patches are the proposed upstream patches and the rest have
been merged upstream either in the mainline or the cgroup maintainer's
tree.

[1] https://lore.kernel.org/lkml/20230817132454.755459-1-longman@redhat.com/

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-09-01 21:26:08 +02:00
Waiman Long 132876f2ff cgroup/cpuset: Free DL BW in case can_attach() fails
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

commit 2ef269ef1ac006acf974793d975539244d77b28f
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Mon, 8 May 2023 09:58:54 +0200

    cgroup/cpuset: Free DL BW in case can_attach() fails

    cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
    have been checked. DL BW is not allocated per-task but as a sum over
    all DL tasks migrating.

    If multiple controllers are attached to the cgroup next to the cpuset
    controller a non-cpuset can_attach() can fail. In this case free DL BW
    in cpuset_cancel_attach().

    Finally, update cpuset DL task count (nr_deadline_tasks) only in
    cpuset_attach().

    Suggested-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-28 11:07:05 -04:00
Waiman Long 5503327426 sched/deadline: Create DL BW alloc, free & check overflow interface
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

commit 85989106feb734437e2d598b639991b9185a43a6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Mon, 8 May 2023 09:58:53 +0200

    sched/deadline: Create DL BW alloc, free & check overflow interface

    While moving a set of tasks between exclusive cpusets,
    cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for
    DL BW overflow checking and per-task DL BW allocation on the destination
    root_domain for the DL tasks in this set.

    This approach has the issue of not freeing already allocated DL BW in
    the following error cases:

    (1) The set of tasks includes multiple DL tasks and DL BW overflow
        checking fails for one of the subsequent DL tasks.

    (2) Another controller next to the cpuset controller which is attached
        to the same cgroup fails in its can_attach().

    To address this problem rework dl_cpu_busy():

    (1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a
        dedicated dl_bw_free().

    (2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of
        a `struct task_struct *p` used in dl_cpu_busy(). This allows
        allocating DL BW for a set of tasks too, rather than only for a
        single task.
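
    A stand-alone sketch of the interface shape described above, with the
    per-CPU/root_domain bookkeeping collapsed into a single global; the
    function names follow the text above, but the bodies are illustrative.

      #include <stdio.h>
      #include <stdint.h>

      static uint64_t allocated_bw;
      static const uint64_t total_bw = 1000000;

      static int dl_bw_check_overflow(uint64_t dl_bw)
      {
          return allocated_bw + dl_bw > total_bw ? -1 : 0;
      }

      static int dl_bw_alloc(uint64_t dl_bw)
      {
          if (dl_bw_check_overflow(dl_bw))
              return -1;
          allocated_bw += dl_bw;
          return 0;
      }

      static void dl_bw_free(uint64_t dl_bw)
      {
          allocated_bw -= dl_bw;
      }

      int main(void)
      {
          uint64_t sum = 0, need[2] = { 300000, 400000 };

          /* Allocate for the whole set; if a later can_attach() fails,
           * the same sum is handed back to dl_bw_free(). */
          for (int i = 0; i < 2; i++)
              if (!dl_bw_alloc(need[i]))
                  sum += need[i];
          printf("allocated %llu\n", (unsigned long long)sum);
          dl_bw_free(sum);  /* cancel/rollback path */
          return 0;
      }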

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-28 11:07:05 -04:00
Waiman Long 3493ed9e35 sched/cpuset: Bring back cpuset_mutex
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568

commit 111cd11bbc54850f24191c52ff217da88a5e639b
Author: Juri Lelli <juri.lelli@redhat.com>
Date:   Mon, 8 May 2023 09:58:50 +0200

    sched/cpuset: Bring back cpuset_mutex

    Turns out percpu_cpuset_rwsem - commit 1243dc518c ("cgroup/cpuset:
    Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
    as it has been reported to cause slowdowns in workloads that need to
    change cpuset configuration frequently and it is also not implementing
    priority inheritance (which causes troubles with realtime workloads).

    Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
    only for SCHED_DEADLINE tasks (other policies don't care about stable
    cpusets anyway).

    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-28 11:07:04 -04:00
Crystal Wood ec180d083a sched/core: Add __always_inline to schedule_loop()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232098

Upstream Status: RHEL only

Without __always_inline, this function breaks wchan.

schedule_loop() was added by patches from the upstream RT tree; a respin
of the patches for upstream has __always_inline.

Signed-off-by: Crystal Wood <swood@redhat.com>
2023-08-21 09:57:26 -05:00
Waiman Long 05fddaaaac sched/core: Use empty mask to reset cpumasks in sched_setaffinity()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only

Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.

One way to solve this problem is to provide a way to back off from
that user provided CPU affinity.  This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.

Before this patch, passing in an empty cpumask to sched_setaffinity(2)
will always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2) has been called
before to set up user_cpus_ptr. In this case, the user_cpus_ptr that stores
the user-provided affinity will be cleared and the task's CPU affinity will
be reset to that of the current cpuset. This alternative error code of
-ENODEV signals that no CPU is specified and, at the same time, has the
side effect of resetting the CPU affinity to the cpuset default.

If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before.  Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.

We will have to update the sched_setaffinity(2) manpage to document
this possible side effect of passing in an empty cpumask.

Signed-off-by: Waiman Long <longman@redhat.com>
2023-08-19 14:53:37 -04:00
Jan Stancek b7217f6931 Merge: sched/core: Provide sched_rtmutex() and expose sched work helpers
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2829

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

Avoid corrupting lock state due to blocking on a lock in sched_submit_work() while in the process of blocking on another lock.

Signed-off-by: Crystal Wood <swood@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-07-31 16:05:41 +02:00
Crystal Wood 09e4f82619 sched/core: Provide sched_rtmutex() and expose sched work helpers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

commit ca66ec3b9994e5f82b433697e37512f7d28b6d22
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Apr 27 13:19:34 2023 +0200

    sched/core: Provide sched_rtmutex() and expose sched work helpers

    schedule() invokes sched_submit_work() before scheduling and
    sched_update_worker() afterwards to ensure that queued block requests are
    flushed and the (IO)worker machineries can instantiate new workers if
    required. This avoids deadlocks and starvation.

    With rt_mutexes this can lead to a subtle problem:

      When an rtmutex blocks, current::pi_blocked_on points to the rtmutex it
      blocks on. When one of the functions in sched_submit/resume_work()
      contends on a rtmutex based lock then that would corrupt
      current::pi_blocked_on.

    Make it possible to let rtmutex issue the calls outside of the slowpath,
    i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:

      - Exposing sched_submit_work() and moving the task_running() condition
        into schedule()

      - Renaming sched_update_worker() to sched_resume_work() and exposing it
        too.

      - Providing sched_rtmutex() which just does the inner loop of scheduling
        until need_resched() is no longer set. Split out the loop so this does
        not create yet another copy.
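
    A stand-alone model of the resulting structure, stubbed with printf;
    the function names follow the text above, the bodies are illustrative:
    schedule() wraps the shared inner loop with the submit/resume hooks,
    while sched_rtmutex() runs only the loop.

      #include <stdio.h>
      #include <stdbool.h>

      static bool resched_pending = true;

      static bool need_resched(void) { return resched_pending; }

      static void __schedule(void)
      {
          printf("__schedule()\n");
          resched_pending = false;
      }

      /* Exposed helpers: flush block requests / kick the (IO)worker
       * machinery around the actual scheduling. */
      static void sched_submit_work(void) { printf("submit work\n"); }
      static void sched_resume_work(void) { printf("resume work\n"); }

      /* Shared inner loop, split out so it is not duplicated. */
      static void schedule_loop(void)
      {
          do {
              __schedule();
          } while (need_resched());
      }

      static void schedule(void)
      {
          sched_submit_work();
          schedule_loop();
          sched_resume_work();
      }

      /* rtmutex slowpath variant: the submit/resume calls were already
       * issued by the rtmutex code while pi_blocked_on was still NULL. */
      static void sched_rtmutex(void)
      {
          schedule_loop();
      }

      int main(void)
      {
          schedule();
          resched_pending = true;
          sched_rtmutex();
          return 0;
      }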

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Crystal Wood <swood@redhat.com>
2023-07-18 17:22:36 -05:00
Oleg Nesterov b85b393abb ptrace: Don't change __state
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 2500ad1c7fa42ad734677853961a3a8bec0772c5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 29 08:43:34 2022 -0500

    ptrace: Don't change __state

    Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
    command is executing.

    Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
    implement a new jobctl flag TASK_PTRACE_FROZEN.  This new flag is set
    in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
    jobctl_unfreeze_task (when ptrace_stop remains asleep).

    In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
    when the wake up is for a fatal signal.  Skip adding __TASK_TRACED
    when TASK_PTRACE_FROZEN is not set.  This has the same effect as
    changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
    TASK_KILLABLE go through signal_wake_up.

    Handle a ptrace_stop being called with a pending fatal signal.
    Previously it would have been handled by schedule simply failing to
    sleep.  As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
    will sleep with a fatal_signal_pending.   The code in signal_wake_up
    guarantees that the code will be woken by any fatal signal that
    comes after TASK_TRACED is set.

    Previously the __state value of __TASK_TRACED was changed to
    TASK_RUNNING when woken up or back to TASK_TRACED when the code was
    left in ptrace_stop.  Now when woken up ptrace_stop clears
    JOBCTL_PTRACE_FROZEN, and when left sleeping ptrace_unfreeze_traced
    clears JOBCTL_PTRACE_FROZEN.

    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:31 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jan Stancek 3b12a1f1fc Merge: Scheduler updates for 9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2392

JIRA: https://issues.redhat.com/browse/RHEL-282
Tested: With scheduler stress tests. Perf QE is running performance regression tests.

Update the kernel's core scheduler and related code with fixes and minor changes from
the upstream kernel. This will sync up to roughly linux v6.3-rc6.  Added a couple of
cpumask things which fit better here.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:47 +02:00
Jan Stancek eeab15fa15 Merge: Scheduler uclamp and asym updates to v6.3-rc1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2337

JIRA: https://issues.redhat.com/browse/RHEL-310
Tested: scheduler stress tests.

This is a collection of commits that update (mostly)
the uclamp code in the scheduler. We don't have
CONFIG_UCLAMP_TASK enabled right now but we might in
the future. We do though have EAS enabled and this helps
keep the code in sync to reduce issues with other patches.

It's broken out of the main scheduler update for 9.3 to
keep it contained and make the other MR smaller.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-15 09:35:54 +02:00
Jan Stancek f58fc750ef Merge: Sched/psi: updates to v6.3-rc1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2325

JIRA: https://issues.redhat.com/browse/RHEL-311
Tested: Enabled PSI and ran various stress tests.

Updates and bug fixes for the PSI subsystem. This brings
the code up to about v6.3-rc1. It does not include the
runtime enablement interface (34f26a15611 "sched/psi: Per-cgroup
PSI accounting disable/re-enable interfaceas") that required a larger
set of cgroup and kernfs patches. That may be take later if the
prerequisites are provided.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-11 12:12:13 +02:00
Jeff Moyer 2b8780eae3 io_uring: move to separate directory
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit ed29b0b4fd835b058ddd151c49d021e28d631ee6
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon May 23 17:05:03 2022 -0600

    io_uring: move to separate directory
    
    In preparation for splitting io_uring up a bit, move it into its own
    top level directory. It didn't really belong in fs/ anyway, as it's
    not a file system only API.
    
    This adds io_uring/ and moves the core files in there, and updates the
    MAINTAINERS file for the new location.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 04:49:02 -04:00
Jan Stancek 567f50bcff Merge: sched/core: Fix arch_scale_freq_tick() on tickless systems
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2276

Bugzilla: https://bugzilla.redhat.com/1996625

commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f
Author: Yair Podemsky <ypodemsk@redhat.com>
Date:   Wed Nov 30 14:51:21 2022 +0200

    sched/core: Fix arch_scale_freq_tick() on tickless systems

    In order for the scheduler to be frequency invariant we measure the
    ratio between the maximum CPU frequency and the actual CPU frequency.

    During long tickless periods of time the calculations that keep track
    of that might overflow, in the function scale_freq_tick():

      if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
              goto error;

    eventually forcing the kernel to disable the feature for all CPUs,
    and show the warning message:

       "Scheduler frequency invariance went wobbly, disabling!".

    Let's avoid that by limiting the frequency invariant calculations
    to CPUs with regular tick.

    Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
    Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
    Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
    Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-25 06:58:25 +02:00
Phil Auld 02c4ba58b7 sched/fair: Sanitize vruntime of entity being migrated
JIRA: https://issues.redhat.com/browse/RHEL-282

commit a53ce18cacb477dd0513c607f187d16f0fa96f71
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Mar 17 17:08:10 2023 +0100

    sched/fair: Sanitize vruntime of entity being migrated

    Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
    fixes an overflow bug, but ignores a case where se->exec_start is reset
    after a migration.

    To fix this case, we delay the reset of se->exec_start until after
    placing the entity, which uses se->exec_start to detect a long sleeping task.

    In order to take into account a possible divergence between the clock_task
    of 2 rqs, we increase the threshold to around 104 days.

    Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
    Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
    Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 14:16:00 -04:00
Phil Auld d3f2df660a sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 585463f0d58aa4d29b744c7c53b222b8028de87f
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Mon Oct 3 16:34:20 2022 +0100

    sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()

    This removes the second use of the sched_core_mask temporary mask.
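
    A small user-space model of what the combined iterator does, with
    plain bitmask words standing in for struct cpumask; the macro name is
    illustrative: iterate the CPUs set in one mask but not the other,
    without materializing a temporary andnot mask.

      #include <stdio.h>

      #define NR_CPUS 8

      #define for_each_cpu_andnot_model(cpu, mask, exclude)        \
          for ((cpu) = 0; (cpu) < NR_CPUS; (cpu)++)                 \
              if (((mask) & ~(exclude)) & (1u << (cpu)))

      int main(void)
      {
          unsigned int cpu;
          unsigned int cpus = 0xff;   /* CPUs 0-7 */
          unsigned int skip = 0x0a;   /* exclude CPUs 1 and 3 */

          for_each_cpu_andnot_model(cpu, cpus, skip)
              printf("cpu %u\n", cpu);
          return 0;
      }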

    Suggested-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 10:04:09 -04:00
Phil Auld f089b6b716 sched/core: Fix a missed update of user_cpus_ptr
JIRA: https://issues.redhat.com/browse/RHEL-282

commit df14b7f9efcda35e59bb6f50351aac25c50f6e24
Author: Waiman Long <longman@redhat.com>
Date:   Fri Feb 3 13:18:49 2023 -0500

    sched/core: Fix a missed update of user_cpus_ptr

    Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
    cpumask"), a successful call to sched_setaffinity() should always save
    the user requested cpu affinity mask in a task's user_cpus_ptr. However,
    when the given cpu mask is the same as the current one, user_cpus_ptr
    is not updated. Fix this by saving the user mask in this case too.

    Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:20 -04:00
Phil Auld bf73c54d24 sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 5657c116783545fb49cd7004994c187128552b12
Author: Waiman Long <longman@redhat.com>
Date:   Sun Jan 15 14:31:22 2023 -0500

    sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs

    The kernel commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in
    do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP
    configs. Calling sched_setaffinity() on such a uniprocessor kernel will
    cause cpumask_copy() to be called with a NULL pointer leading to general
    protection fault. This is not really a problem in real use cases as
    there aren't that many uniprocessor kernel configs in use and calling
    sched_setaffinity() on such a uniprocessor system doesn't make sense.

    Fix this problem by making sure cpumask_copy() will not be called in
    such a case.

    Fixes: 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()")
    Reported-by: kernel test robot <yujie.liu@intel.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:20 -04:00
Phil Auld ffd9ddbf5a sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 160fb0d83f206b3429fc495864a022110f9e4978
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Fri Dec 23 18:32:57 2022 +0800

    sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()

    ttwu_do_activate() is used for a complete wakeup, in which we will
    activate_task() and use ttwu_do_wakeup() to mark the task runnable
    and perform wakeup-preemption, also call class->task_woken() callback
    and update the rq->idle_stamp.

    Since ttwu_runnable() is not a complete wakeup, it doesn't need all the
    work done in ttwu_do_wakeup(), so we can move that to ttwu_do_activate()
    to simplify ttwu_do_wakeup(), making it only mark the task runnable
    so it can be reused in ttwu_runnable() and try_to_wake_up().

    This patch should not have any functional changes.

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:20 -04:00
Phil Auld a09a99cf93 sched/core: Micro-optimize ttwu_runnable()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit efe09385864f3441c71711f91e621992f9423c01
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Fri Dec 23 18:32:56 2022 +0800

    sched/core: Micro-optimize ttwu_runnable()

    ttwu_runnable() is used as a fast wakeup path when the wakee task
    is running on CPU or runnable on RQ, in both cases we can just
    set its state to TASK_RUNNING to prevent a sleep.

    If the wakee task is on_cpu running, we don't need to update_rq_clock()
    or check_preempt_curr().

    But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before
    the task got to schedule() and the task has been preempted), we should
    check_preempt_curr() to see if it can preempt the currently running task.

    This also removes the class->task_woken() callback from ttwu_runnable(),
    which wasn't required per the RT/DL implementations: any required push
    operation would have been queued during class->set_next_task() when p
    got preempted.

    ttwu_runnable() also loses the update to rq->idle_stamp, as by definition
    the rq cannot be idle in this scenario.

    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld 4881a62e1d sched: Make const-safe
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 904cbab71dda1689d41a240541179f21ff433c40
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Dec 12 14:49:46 2022 +0000

    sched: Make const-safe

    With a modified container_of() that preserves constness, the compiler
    finds some pointers which should have been marked as const.  task_of()
    also needs to become const-preserving for the !FAIR_GROUP_SCHED case so
    that cfs_rq_of() can take a const argument.  No change to generated code.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld 6a7d52383a sched: Clear ttwu_pending after enqueue_task()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date:   Fri Nov 4 10:36:01 2022 +0800

    sched: Clear ttwu_pending after enqueue_task()

    We found a long tail latency in schbench when m*t is close to nr_cpus.
    (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

    This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
    too early, and idle_cpu() will return true until the wakee task enqueued.
    This will mislead the waker when selecting idle cpu, and wake multiple
    worker threads on the same wakee cpu. This situation is enlarged by
    commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
    wakelist if wakee cpu is idle") because it tends to use wakelist.

    Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
    (Intel(R) Xeon(R) Platinum 8369B).

    Latency percentiles (usec):
                    base      base+revert_f3dd3f674555   base+this_patch
    50.0000th:         9                            13                 9
    75.0000th:        12                            19                12
    90.0000th:        15                            22                15
    95.0000th:        18                            24                17
    *99.0000th:       27                            31                24
    99.5000th:      3364                            33                27
    99.9000th:     12560                            36                30

    We also tested on unixbench and hackbench, and saw no performance
    change.

    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:19 -04:00
Phil Auld a774282315 sched/fair: Cleanup loop_max and loop_break
JIRA: https://issues.redhat.com/browse/RHEL-282

commit c59862f8265f8060b6650ee1dc12159fe5c89779
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Thu Aug 25 14:27:24 2022 +0200

    sched/fair: Cleanup loop_max and loop_break

    sched_nr_migrate_break is set to a fixed value and never changes so we can
    replace it by a define SCHED_NR_MIGRATE_BREAK.

    Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
    of sysctl_sched_nr_migrate which can be init to different values.

    Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.

    The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
    through debugfs.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld a9fcc51032 sched: Add TASK_ANY for wait_task_inactive()
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts:  Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").

commit f9fc8cad9728124cefe8844fb53d1814c92c6bfc
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 6 12:39:55 2022 +0200

    sched: Add TASK_ANY for wait_task_inactive()

    Now that wait_task_inactive()'s @match_state argument is a mask (like
    ttwu()) it is possible to replace the special !match_state case with
    an 'all-states' value such that any blocked state will match.

    Suggested-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld 7ab9d04d74 sched: Rename task_running() to task_on_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts:  Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").

commit 0b9d46fc5ef7a457cc635b30b010081228cb81ac
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Sep 6 12:33:04 2022 +0200

    sched: Rename task_running() to task_on_cpu()

    There is some ambiguity about task_running() in that it is unrelated
    to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
    task_on_cpu().

    Suggested-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:18 -04:00
Phil Auld 3f3a0eeee3 sched/fair: Allow changing cgroup of new forked task
JIRA: https://issues.redhat.com/browse/RHEL-282

commit df16b71c686cb096774e30153c9ce6756450796c
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Thu Aug 18 20:48:03 2022 +0800

    sched/fair: Allow changing cgroup of new forked task

    commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
    introduce a TASK_NEW state and an unnessary limitation that would fail
    when changing cgroup of new forked task.

    Because at that time, we can't handle task_change_group_fair() for new
    forked fair task which hasn't been woken up by wake_up_new_task(),
    which will cause detach on an unattached task sched_avg problem.

    This patch delete this unnessary limitation by adding check before do
    detach or attach in task_change_group_fair().

    So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks,
    only define it in #ifdef CONFIG_RT_GROUP_SCHED.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld fb17b0f886 sched/fair: Remove redundant cpu_cgrp_subsys->fork()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 39c4261191bf05e7eb310f852980a6d0afe5582a
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date:   Thu Aug 18 20:47:58 2022 +0800

    sched/fair: Remove redundant cpu_cgrp_subsys->fork()

    We use cpu_cgrp_subsys->fork() to set task group for the new fair task
    in cgroup_post_fork().

    Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
    has already set_task_rq() for the new fair task in sched_cgroup_fork(),
    so cpu_cgrp_subsys->fork() can be removed.

      cgroup_can_fork()     --> pin parent's sched_task_group
      sched_cgroup_fork()
        __set_task_cpu()
          set_task_rq()
      cgroup_post_fork()
        ss->fork() := cpu_cgroup_fork()
          sched_change_group(..., TASK_SET_GROUP)
            task_set_group_fair()
              set_task_rq()  --> can be removed

    After this patch's change, task_change_group_fair() only needs to
    care about task cgroup migration, making the code much simpler.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld 9b10d97986 sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 09348d75a6ce60eec85c86dd0ab7babc4db3caf6
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Aug 11 08:54:52 2022 +0200

    sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()

    There's no good reason to crash a user's system with a BUG_ON():
    chances are high that they'll never even see the crash message on
    Xorg, and it won't make it into the syslog either.

    By using a WARN_ON_ONCE() we at least give the user a chance to report
    any bugs triggered here - instead of getting silent hangs.

    None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
    cases where a NULL check is done via a BUG_ON() and we let a NULL
    pointer through after a WARN_ON_ONCE().

    There's one exception: WARN_ON_ONCE() arguments with side-effects,
    such as locking - in this case we use the return value of the
    WARN_ON_ONCE(), such as in:

     -       BUG_ON(!lock_task_sighand(p, &flags));
     +       if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
     +               return;

    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:17 -04:00
Phil Auld 30180b878d sched/fair: Make per-cpu cpumasks static
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 18c31c9711a90b48a77b78afb65012d9feec444c
Author: Bing Huang <huangbing@kylinos.cn>
Date:   Sat Jul 23 05:36:09 2022 +0800

    sched/fair: Make per-cpu cpumasks static

    The load_balance_mask and select_rq_mask percpu variables are only used in
    kernel/sched/fair.c.

    Make them static and move their allocation into init_sched_fair_class().

    Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the
    CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask
    allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL
    class (local_cpu_mask_dl in init_sched_dl_class()).
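
    Roughly, the result has the following shape (a sketch of the change as
    described above, not the exact hunk):

      static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
      static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);

      __init void init_sched_fair_class(void)
      {
              int i;

              for_each_possible_cpu(i) {
                      zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i),
                                              GFP_KERNEL, cpu_to_node(i));
                      zalloc_cpumask_var_node(&per_cpu(select_rq_mask, i),
                                              GFP_KERNEL, cpu_to_node(i));
              }
              /* ... remainder of the existing init ... */
      }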

    [ mingo: Tidied up changelog & touched up the code. ]

    Signed-off-by: Bing Huang <huangbing@kylinos.cn>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:16 -04:00
Phil Auld 680e019203 sched/debug: Print each field value left-aligned in sched_show_task()
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 0f03d6805bfc454279169a1460abb3f6b3db317f
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Wed Jul 27 14:08:19 2022 +0800

    sched/debug: Print each field value left-aligned in sched_show_task()

    Currently, the values of some fields are printed right-aligned, which
    places a field's value next to the following field's name rather than
    next to its own. Print each field value left-aligned instead, to make
    the output more readable.

     Before:
            stack:    0 pid:  307 ppid:     2 flags:0x00000008
     After:
            stack:0     pid:308   ppid:2      flags:0x0000000a

    This also makes them print in the same style as the other two fields:

            task:demo0           state:R  running task
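
    In printf terms the change is just the left-justification flag; a
    sketch with placeholder variables (stack_free, pid, ppid, flags are
    illustrative, not the names used in the patch):

      /* right-aligned "%5d" pushes a value up against the next field's
       * name; left-aligned "%-5d" keeps it next to its own */
      pr_info("stack:%-5lu pid:%-5d ppid:%-6d flags:0x%08lx\n",
              stack_free, pid, ppid, flags);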

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:16 -04:00
Phil Auld 78210daf7b sched: Snapshot thread flags
JIRA: https://issues.redhat.com/browse/RHEL-282

commit 0569b245132c40015281610353935a50e282eb94
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Mon Nov 29 13:06:45 2021 +0000

    sched: Snapshot thread flags

    Some thread flags can be set remotely, and so even when IRQs are disabled,
    the flags can change under our feet. Generally this is unlikely to cause a
    problem in practice, but it is somewhat unsound, and KCSAN will
    legitimately warn that there is a data race.

    To avoid such issues, a snapshot of the flags has to be taken prior to
    using them. Some places already use READ_ONCE() for that, others do not.

    Convert them all to the new flag accessor helpers.

    The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in
    set_nr_if_polling() is left as-is for clarity.
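
    The accessor helpers referred to above are essentially READ_ONCE()
    wrappers; roughly (a sketch, not the exact definitions):

      static inline unsigned long read_thread_flags(void)
      {
              return READ_ONCE(current_thread_info()->flags);
      }

      static inline unsigned long read_task_thread_flags(struct task_struct *t)
      {
              return READ_ONCE(task_thread_info(t)->flags);
      }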

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-18 09:34:16 -04:00
Phil Auld cb73223615 sched/core: Adjusting the order of scanning CPU
JIRA: https://issues.redhat.com/browse/RHEL-310

commit 8589018acc65e5ddfd111f0a7ee85f9afde3a830
Author: Hao Jia <jiahao.os@bytedance.com>
Date:   Fri Dec 16 14:24:06 2022 +0800

    sched/core: Adjusting the order of scanning CPU

    When select_idle_capacity() starts scanning for an idle CPU, it starts
    with the target CPU, which has already been checked in
    select_idle_sibling(). So start checking from the next CPU and try the
    target CPU last. Similarly for task_numa_assign(), numa_migrate_on of
    dst_cpu has just been checked, so start from the next CPU. The same
    applies to steal_cookie_task(): the first scan is bound to fail, so
    start directly from the next CPU.
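
    Schematically, the idle scan then starts one CPU past the
    already-checked one and wraps around to it last; a sketch (cpu, cpus,
    target and the available_idle_cpu() check are illustrative here, not
    the exact hunk):

      /* @target was already tried by select_idle_sibling(), so start
       * the wrap-around scan at the CPU after it and visit it last */
      for_each_cpu_wrap(cpu, cpus, target + 1) {
              if (available_idle_cpu(cpu))
                      return cpu;
      }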

    Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-04-17 16:14:35 -04:00