JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz due to unrelated whitespace difference from
upstream.
commit 9e07d45c5210f5dd6701c00d55791983db7320fa
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat Nov 4 11:59:19 2023 +0100
sched/deadline: Collect sched_dl_entity initialization
Create a single function that initializes a sched_dl_entity.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit d6111cf45c5787282b2e20d77bdb6b28881d516a
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Tue Oct 31 11:12:01 2023 -0700
sched: Use WRITE_ONCE() for p->on_rq
Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side
matches with WRITE_ONCE().
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop
Signed-off-by: Phil Auld <pauld@redhat.com>
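A minimal sketch of the store/load pairing described above, assuming kernel context; the surrounding lines are illustrative, only the WRITE_ONCE()/READ_ONCE() discipline on p->on_rq is the point.
    /* writer side (scheduler), e.g. when the task is enqueued: */
    WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);

    /* lockless reader side (e.g. RCU-tasks): */
    if (READ_ONCE(p->on_rq))
            ;       /* task is, or is about to become, runnable */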
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 984ffb6a4366752c949f7b39640aecdce222607f
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Oct 20 12:35:33 2023 +0200
sched/fair: Remove SIS_PROP
SIS_UTIL seems to work well, let's remove the old thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-15622
commit b95303e0aeaf446b65169dd4142cacdaeb7d4c8b
Author: Barry Song <song.bao.hua@hisilicon.com>
Date: Thu Oct 19 11:33:21 2023 +0800
sched: Add cpus_share_resources API
Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.
On a machine with clusters, cpus_share_resources() will test whether
two CPUs are within the same cluster. On a non-cluster machine it
will behave the same as cpus_share_cache(). So we use "resources"
here for cache resources.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
Signed-off-by: Phil Auld <pauld@redhat.com>
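An illustrative usage sketch of the new API described above; pick_cpu(), the candidate mask and the two-pass loop are assumptions made for the example, only cpus_share_resources()/cpus_share_cache() are the interfaces the commit talks about.
    static int pick_cpu(int this_cpu, const struct cpumask *candidates)
    {
            int cpu;

            /* first pass: idle CPU sharing cluster-level resources */
            for_each_cpu(cpu, candidates) {
                    if (cpus_share_resources(this_cpu, cpu) && idle_cpu(cpu))
                            return cpu;
            }

            /* second pass: fall back to an idle CPU sharing the LLC */
            for_each_cpu(cpu, candidates) {
                    if (cpus_share_cache(this_cpu, cpu) && idle_cpu(cpu))
                            return cpu;
            }

            return this_cpu;
    }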
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit bc87127a45928de5fdf0ec39d7a86e1edd0e179e
Author: Yajun Deng <yajun.deng@linux.dev>
Date: Thu Jul 20 16:05:16 2023 +0800
sched/debug: Print 'tgid' in sched_show_task()
Multiple blocked tasks are printed when the system hangs. They may have
the same parent pid, but belong to different task groups.
Printing tgid lets users better know whether these tasks are from the same
task group or not.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 3eafe225995c67f8c179011ec2d6e4c12b32a53d
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date: Sun Aug 20 20:53:17 2023 +0800
sched/core: Refactor the task_flags check for worker sleeping in sched_submit_work()
Simplify the conditional logic for checking worker flags
by splitting the original compound `if` statement into
separate `if` and `else if` clauses.
This modification not only retains the previous functionality,
but also reduces a single `if` check, improving code clarity
and potentially enhancing performance.
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZOIMvURE99ZRAYEj@fedora
Signed-off-by: Phil Auld <pauld@redhat.com>
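A before/after sketch of the restructuring described above, in the shape of the sched_submit_work() flag check; written from the commit description, not the exact diff.
    /* before: compound test plus a nested branch */
    if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
            if (task_flags & PF_WQ_WORKER)
                    wq_worker_sleeping(tsk);
            else
                    io_wq_worker_sleeping(tsk);
    }

    /* after: flat if / else if, one check fewer */
    if (task_flags & PF_WQ_WORKER)
            wq_worker_sleeping(tsk);
    else if (task_flags & PF_IO_WORKER)
            io_wq_worker_sleeping(tsk);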
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit dc461c48deda8a2d243fbaf49e276d555eb833d8
Author: Liming Wu <liming.wu@jaguarmicro.com>
Date: Fri Aug 25 10:35:00 2023 +0800
sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug()
in_atomic_preempt_off() already gets called in schedule_debug() once,
which is the only caller of __schedule_bug().
Skip the second call within __schedule_bug(); it should always be true
at this point.
[ mingo: Clarified the changelog. ]
Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230825023501.1848-1-liming.wu@jaguarmicro.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit cff9b2332ab762b7e0586c793c431a8f2ea4db04
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Fri Sep 15 13:44:44 2023 -0400
kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52.
Fixes: 5f6130fa52 ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz in fair.c due to having RT merged,
specifically: ea622076b76f ("sched: Add support for lazy preemption")
commit e23edc86b09df655bf8963bbcb16647adc787395
Author: Ingo Molnar <mingo@kernel.org>
Date: Tue Sep 19 10:38:21 2023 +0200
sched/fair: Rename check_preempt_curr() to wakeup_preempt()
The name is a bit opaque - make it clear that this is about wakeup
preemption.
Also rename the ->check_preempt_curr() methods similarly.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 4ff34ad3d39377d9f6953f3606ccf611ce636767
Author: Uros Bizjak <ubizjak@gmail.com>
Date: Tue Feb 28 17:14:26 2023 +0100
sched/core: Use do-while instead of for loop in set_nr_if_polling()
Use equivalent do-while loop instead of infinite for loop.
There are no asm code changes.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230228161426.4508-1-ubizjak@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
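A sketch of the loop shape described above: a try_cmpxchg() retry loop expressed as do-while instead of an infinite for loop. The body of set_nr_if_polling() is reconstructed from the description and may differ slightly from the exact source.
    static bool set_nr_if_polling(struct task_struct *p)
    {
            struct thread_info *ti = task_thread_info(p);
            typeof(ti->flags) val = READ_ONCE(ti->flags);

            do {
                    if (!(val & _TIF_POLLING_NRFLAG))
                            return false;
                    if (val & _TIF_NEED_RESCHED)
                            return true;
            } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

            return true;
    }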
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit ab83f455f04df5b2f7c6d4de03b6d2eaeaa27b8a
Author: Peter Oskolkov <posk@google.com>
Date: Tue Mar 7 23:31:57 2023 -0800
sched: add WF_CURRENT_CPU and externise ttwu
Add WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.
In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 548796e2e70b44b4661fd7feee6eb239245ff1f8
Author: Cruz Zhao <CruzZhao@linux.alibaba.com>
Date: Thu Jun 29 12:02:04 2023 +0800
sched/core: introduce sched_core_idle_cpu()
With core scheduling, a new idle state is defined: force idle, in which
the CPU runs the idle task while nr_running is greater than zero.
If a cpu is in force idle state, idle_cpu() will return zero. This
result makes sense in some scenarios, e.g., load balance,
showacpu when dumping, and judging whether the RCU boost kthread is starving.
But it causes errors in other scenarios, e.g., tick_irq_exit():
When force idle, rq->curr == rq->idle but rq->nr_running > 0, so
idle_cpu() returns 0. In tick_irq_exit(), if idle_cpu() is 0,
tick_nohz_irq_exit() will not be called, and ts->idle_active will
not become 1 again after being cleared in tick_nohz_irq_enter().
With ts->idle_active stuck at 0 when it should be 1, ts->idle_sleeptime
is not updated in update_ts_time_stats(). As a result ts->idle_sleeptime
ends up less than the actual value, and finally the idle time reported
in /proc/stat is less than the actual value.
To solve this problem, we introduce sched_core_idle_cpu(), which
returns 1 when force idle. We audited all users of idle_cpu(), and
change idle_cpu() into sched_core_idle_cpu() in function
tick_irq_exit().
v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
function tick_irq_exit(). And modify the corresponding commit log.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com
Signed-off-by: Phil Auld <pauld@redhat.com>
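A sketch of the helper described above, written from the commit description rather than the exact diff; roughly what sched_core_idle_cpu() amounts to.
    int sched_core_idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            /* force idle: running the idle task while nr_running > 0 */
            if (sched_core_enabled(rq) && rq->curr == rq->idle)
                    return 1;

            return idle_cpu(cpu);
    }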
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 677ea015f231aa38b3972aa7be54ecd2637e99fd
Author: Josh Don <joshdon@google.com>
Date: Tue Jun 20 11:32:47 2023 -0700
sched: add throttled time stat for throttled children
We currently export the total throttled time for cgroups that are given
a bandwidth limit. This patch extends this accounting to also account
the total time that each child cgroup has been throttled.
This is useful to understand the degree to which children have been
affected by the throttling control. Children which are not runnable
during the entire throttled period, for example, will not show any
self-throttling time during this period.
Expose this in a new interface, 'cpu.stat.local', which is similar to
how non-hierarchical events are accounted in 'memory.events.local'.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Nov 22 14:39:03 2023 +0100
sched/cpufreq: Rework schedutil governor performance estimation
The current method of taking uclamp hints into account when estimating the
target frequency can end up with a selected target frequency that is
higher than the uclamp hints even though there is no real need for it.
Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we lose important information when
it comes to selecting the target frequency, and we have to make some
assumptions which can't fit all cases.
Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.
effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:
- The minimum performance for this CPU; typically the capacity to handle
the deadline task and the interrupt pressure. But also uclamp_min
request when available.
- The maximum targeting performance for this CPU which reflects the
maximum level that we would like to not exceed. By default it will be
the CPU capacity but can be reduced because of some performance hints
set with uclamp. The value can be lower than actual utilization and/or
min performance level.
A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.
With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.
The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.
eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.
[ mingo: Refined the changelog & C comments. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
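An illustrative sketch (not the upstream code) of the min/max clamping idea described above: apply a cpufreq headroom to the actual utilization, do not target more than the maximum performance hint, but never go below the minimum. effective_perf() and the 1.25x headroom are assumptions made for the example; the real headroom policy lives in schedutil.
    static unsigned long effective_perf(unsigned long actual,
                                        unsigned long min_perf,
                                        unsigned long max_perf)
    {
            actual += actual >> 2;          /* illustrative ~1.25x headroom */

            if (actual < max_perf)          /* no need to run faster than needed */
                    max_perf = actual;

            return max(min_perf, max_perf); /* but honour the minimum performance */
    }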
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Oct 9 12:36:16 2023 +0200
sched/topology: Consolidate and clean up access to a CPU's max compute capacity
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large smp (thx2): hackbench and tbench shows minor improvement (1%)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
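A before/after fragment of the access change described above (kernel context assumed, shown as two alternatives):
    /* before: max capacity cached in the runqueue */
    unsigned long cap = capacity_orig_of(cpu);      /* rq->cpu_capacity_orig */

    /* after: read it directly from the architecture hook */
    unsigned long cap = arch_scale_cpu_capacity(cpu);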
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 194600008d5c43b5a4ba98c4b81633397e34ffad
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Tue Nov 14 14:38:40 2023 -0500
sched/timers: Explain why idle task schedules out on remote timer enqueue
Trying to avoid that didn't bring much value after testing; add a comment
about this.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
commit 6b596e62ed9f90c4a97e68ae1f7b1af5beeb3c05
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 8 Sep 2023 18:22:51 +0200
sched: Provide rt_mutex specific scheduler helpers
With PREEMPT_RT there is a rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically what happens is:
    mutex_lock() /* really rt_mutex */
      ...
        __rt_mutex_slowlock_locked()
          task_blocks_on_rt_mutex()
            // enqueue current task as waiter
            // do PI chain walk
          rt_mutex_slowlock_block()
            schedule()
              sched_submit_work()
                ...
                  spin_lock() /* really rtlock */
                    ...
                      __rt_mutex_slowlock_locked()
                        task_blocks_on_rt_mutex()
                          // enqueue current task as waiter *AGAIN*
                          // *CONFUSION*
Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.
[[ basically Thomas' patch but with different naming and a few asserts
added ]]
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-5-bigeasy@linutronix.de
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
commit de1474b46d889ee0367f6e71d9adfeb0711e4a8d
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 8 Sep 2023 18:22:50 +0200
sched: Extract __schedule_loop()
There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-4-bigeasy@linutronix.de
Signed-off-by: Waiman Long <longman@redhat.com>
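A sketch of the extracted helper described above, written from the commit description; schedule() would call it with SM_NONE and the rtlock variant with SM_RTLOCK_WAIT.
    static __always_inline void __schedule_loop(unsigned int sched_mode)
    {
            do {
                    preempt_disable();
                    __schedule(sched_mode);
                    sched_preempt_enable_no_resched();
            } while (need_resched());
    }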
JIRA: https://issues.redhat.com/browse/RHEL-28616
commit 28bc55f654de49f6122c7475b01b5d5ef4bdf0d4
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 8 Sep 2023 18:22:48 +0200
sched: Constrain locks in sched_submit_work()
Even though sched_submit_work() is run from preemptible context,
it is discouraged to have it use blocking locks due to the recursion
potential.
Enforce this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-2-bigeasy@linutronix.de
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
Upstream Status: RHEL only
Revert linux-rt-devel specific commit ca66ec3b9994 ("sched/core:
Provide sched_rtmutex() and expose sched work helpers") to prepare for
the submission of upstream equivalent.
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
Upstream Status: RHEL only
Revert RHEL only commit ec180d083a ("sched/core: Add __always_inline
to schedule_loop()").
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
Conflicts: Context differences due to having RT patchset 6bc27040eb
("sched: Add support for lazy preemption"). A number of hunks skipped due to not
having af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") and related
commits in that series.
commit 0e34600ac9317dbe5f0a7bfaa3d7187d757572ed
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:52:49 2023 +0200
sched: Misc cleanups
Random remaining guard use...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit 6fb45460615358157a6d3c990e74f9c1395247e2
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:45:16 2023 +0200
sched: Simplify tg_set_cfs_bandwidth()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit fa614b4feb5a246474ac71b45e520a8ddefc809c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:41:09 2023 +0200
sched: Simplify sched_move_task()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit af7c5763f5e8bc1b3f827354a283ccaf6a8c8098
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:59:05 2023 +0200
sched: Simplify sched_rr_get_interval()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit 7a50f76674f8b6f4f30a1cec954179f10e20110c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:58:23 2023 +0200
sched: Simplify yield_to()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
Conflicts: Minor manual fixup needed due to having RHEL-only
patch 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()").
commit 92c2ec5bc1081e6bbbe172bcfb1a566ad7b4f809
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:57:35 2023 +0200
sched: Simplify sched_{set,get}affinity()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit febe162d4d9158cf2b5d48fdd440db7bb55dd622
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:54:54 2023 +0200
sched: Simplify syscalls
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit 94b548a15e8ec47dfbf6925bdfb64bb5657dce0c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:52:55 2023 +0200
sched: Simplify set_user_nice()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
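A sketch of the guard() pattern used throughout the series above; example_lock and example_set() are illustrative names, guard(mutex) comes from <linux/cleanup.h>.
    #include <linux/cleanup.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(example_lock);

    static int example_set(int val)
    {
            guard(mutex)(&example_lock);    /* released on every return path */

            if (val < 0)
                    return -EINVAL;         /* no goto-based unlock label needed */

            /* ... update state while holding example_lock ... */
            return 0;
    }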
JIRA: https://issues.redhat.com/browse/RHEL-21440
Upstream Status: RHEL only
Since RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset
cpumasks in sched_setaffinity()"), an empty cpumask is used for resetting
a user requested CPU affinity set by a previous sched_setaffinity() call.
An error code of -ENODEV is returned for a successful reset. However,
this broke some test cases that only expect an error code of -EINVAL.
So another RHEL commit 90f7bb0c18 ("sched/core: Don't return -ENODEV
from sched_setaffinity()") was merged to return 0 in that case. Again,
this still broke some other test cases.
This patch restores the old behavior of always returning -EINVAL on an
empty cpumask, with the exception that 0 may still be returned if the
empty cpumask is caused by an input len parameter of 0, which is another
way of resetting the user requested CPU affinity that I had proposed to
upstream before.
Fixes: 90f7bb0c18 ("sched/core: Don't return -ENODEV from sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3388
JIRA: https://issues.redhat.com/browse/RHEL-16613
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3388/edit
Upstream Status: RHEL only
Tested: A unit test was run to verify that CPU affinity reset only
happened when the full mask was empty, not when any of the
out-of-bound bits were set.
RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()") enables the use of an empty cpumask to reset
a user-provided CPU affinity via sched_setaffinity(2) syscall. It will
return a value of -ENODEV to indicate a success in resetting the user
provided CPU affinity.
However, the current way of checking for an empty cpumask using
cpumask_empty() is not robust enough to avoid many false positives
leading to an erroneous return of -ENODEV, which can confuse user
applications and lead to incorrect behavior. For example, if the system
has 28 CPUs and bit 28 is set, it will treat the cpumask as empty.
Instead of cpumask_empty(), bitmap_empty() should be used to check if
all the bits in the cpumask_size() buffer are zero. This should avoid
many false positives. However, a false positive can still happen if the
bit set is outside the range allowed by cpumask_size(). So we need to
check the full user_mask buffer to see if it is really empty to avoid
any false positive. By doing so, there should be no need to return a
-ENODEV error code, which is a workaround to handle the false positives.
A value of 0 will be returned if the reset is successful, or -EINVAL
will be returned if the user-provided CPU affinity hasn't been properly
set by sched_setaffinity(2).
Fixes: 05fddaaaac ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-15489
Conflicts: Applied with fuzz due to not having mm_cid changes,
specifically 223baf9d17f2 ("sched: Fix performance regression
introduced by mm_cid")
commit 5ebde09d91707a4a9bec1e3d213e3c12ffde348f
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Thu Oct 12 17:00:03 2023 +0800
sched/core: Fix RQCF_ACT_SKIP leak
Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning.
This warning may be triggered in the following situations:
    CPU0                                      CPU1
    __schedule()
      *rq->clock_update_flags <<= 1;*         unregister_fair_sched_group()
      pick_next_task_fair+0x4a/0x410          destroy_cfs_bandwidth()
      newidle_balance+0x115/0x3e0             for_each_possible_cpu(i) *i=0*
      rq_unpin_lock(this_rq, rf)              __cfsb_csd_unthrottle()
      raw_spin_rq_unlock(this_rq)
                                              rq_lock(*CPU0_rq*, &rf)
                                              rq_clock_start_loop_update()
                                              rq->clock_update_flags & RQCF_ACT_SKIP <--
      raw_spin_rq_lock(this_rq)
The purpose of RQCF_ACT_SKIP is to skip the rq clock update, but that
update happens very early in __schedule() while we clear RQCF_*_SKIP
very late, causing the flag to span the gap above and trigger this
warning.
In __schedule() we can clear the RQCF_*_SKIP flag immediately
after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning,
and set rq->clock_update_flags to RQCF_UPDATED to avoid the
rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later.
Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
Reported-by: Igor Raits <igor.raits@gmail.com>
Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
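A fragment showing the ordering described above as it would sit inside __schedule(); illustrative, not the literal diff.
    rq_lock(rq, &rf);
    update_rq_clock(rq);
    rq->clock_update_flags = RQCF_UPDATED;  /* clear the *_SKIP bits right away */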
JIRA: https://issues.redhat.com/browse/RHEL-16613
Upstream Status: RHEL only
Tested: A unit test was run to verify that CPU affinity reset only
happened when the full mask was empty, not when any of the
out-of-bound bits were set.
RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()") enables the use of an empty cpumask to reset
a user-provided CPU affinity via sched_setaffinity(2) syscall. It will
return a value of -ENODEV to indicate a success in resetting the user
provided CPU affinity.
However, the current way of checking for an empty cpumask using
cpumask_empty() is not robust enough to avoid many false positives
leading to an erroneous return of -ENODEV, which can confuse user
applications and lead to incorrect behavior. For example, if the system
has 28 CPUs and bit 28 is set, it will treat the cpumask as empty.
Instead of cpumask_empty(), bitmap_empty() should be used to check if
all the bits in the cpumask_size() buffer are zero. This should avoid
many false positives. However, a false positive can still happen if the
bit set is outside the range allowed by cpumask_size(). So we need to
check the full user_mask buffer to see if it is really empty to avoid
any false positive. By doing so, there should be no need to return a
-ENODEV error code, which is a workaround to handle the false positives.
A value of 0 will be returned if the reset is successful, or -EINVAL
will be returned if the user-provided CPU affinity hasn't been properly
set by sched_setaffinity(2).
Fixes: 05fddaaaac ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
Conflicts:
fs/exec.c - We already have
33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"")
so don't add call to timens_on_fork back in
include/linux/mmzone.h - We already have
e6ad640bc404 ("mm: deduplicate cacheline padding code")
so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_)
mm/vmscan.c - The backport of
badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")
added an #include <linux/debugfs.h>. Keep it.
JIRA: https://issues.redhat.com/browse/RHEL-1848
commit bd74fdaea146029e4fa12c6de89adbe0779348a9
Author: Yu Zhao <yuzhao@google.com>
Date: Sun Sep 18 02:00:05 2022 -0600
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing in common with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1848
commit c959924b0dc53bf6252793f41480bc01b9792570
Author: Huang Ying <ying.huang@intel.com>
Date: Wed Jul 13 16:39:53 2022 +0800
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration
dependent. So in this patch, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit. If the
hint page fault latency of a page is less than the hot threshold, we will
try to promote the page, and the page is called the candidate promotion
page.
If the number of the candidate promotion pages in the statistics interval
is much more than the promotion rate limit, the hot threshold will be
decreased to reduce the number of the candidate promotion pages.
Otherwise, the hot threshold will be increased to increase the number of
the candidate promotion pages.
To make the above method work, in each statistics interval, the total
number of the pages to check (on which the hint page faults occur) and the
hot/cold distribution need to be stable. Because the page tables are
scanned linearly in NUMA balancing, but the hot/cold distribution usually
isn't uniform along the address space, the statistics interval should be
larger than the NUMA balancing scan period. So in the patch, the max scan
period is used as the statistics interval and it works well in our tests.
Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3029
JIRA: https://issues.redhat.com/browse/RHEL-1536
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2208016
Tested: With scheduler stress tests, cfs bandwidth tests and unit tests on
specific pieces (such as enabling WARN_DOUBLE_CLOCK etc), in addition to cki
and perf QE testing.
Updates and fixes from up to v6.5 for scheduler related code. This includes
a revert of one of the RT merge patches which is then re-applied in the form
it took when added to Linus's tree (see the "wait_task_inactive()" commits).
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3080
Linux has had tracepoints tied to IPI reception for a while, but none tied to IPI emission.
This series adds tracepoints to the actual codepath sending the IPIs, which makes it possible to trace and track sources of IPIs with Ftrace. This is very useful for setups where IPIs to certain CPUs are /mostly/ undesired and a source of unwanted interference (e.g. CPU isolation).
Bugzilla: https://bugzilla.redhat.com/2192613
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-5228
commit 7c182722a0a9447e31f9645de4f311e5bc59b480
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date: Sat, 19 Nov 2022 17:25:05 +0800
sched: Add helper nr_context_switches_cpu()
Add a function nr_context_switches_cpu() that returns number of context
switches since boot on the specified CPU. This information will be used
to diagnose RCU CPU stalls.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
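A sketch of the helper described above: the per-CPU analogue of nr_context_switches(), written from the commit description.
    u64 nr_context_switches_cpu(int cpu)
    {
            return cpu_rq(cpu)->nr_switches;
    }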
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")
commit 68e2d17c9eb311ab59aeb6d0c38aad8985fa2596
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed Mar 22 11:28:36 2023 +0100
trace: Add trace_ipi_send_cpu()
Because copying cpumasks around when targeting a single CPU is a bit
daft...
Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts: Need to modify __smp_call_single_queue_debug() too. It was
removed upstream by commit 1771257cb447 ("locking/csd_lock: Remove
added data from CSD lock debugging")
commit 68f4ff04dbada18dad79659c266a8e5e29e458cd
Author: Valentin Schneider <vschneid@redhat.com>
Date: Tue Mar 7 14:35:58 2023 +0000
sched, smp: Trace smp callback causing an IPI
Context
=======
The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
which so far has only been fed with NULL.
While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
struct layout (meaning their callback func can be accessed without caring
about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
attached to its struct. This means we need to check the type of a CSD
before eventually dereferencing its associated callback.
This isn't as trivial as it sounds: the CSD type is stored in
__call_single_node.u_flags, which get cleared right before the callback is
executed via csd_unlock(). This implies checking the CSD type before it is
enqueued on the call_single_queue, as the target CPU's queue can be flushed
before we get to sending an IPI.
Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
would need to have an additional argument to trickle down the invoked
function. This is somewhat silly, as the extra argument will always be
pushed down to the function even when nothing is being traced, which is
unnecessary overhead.
Changes
=======
send_call_function_single_ipi() is only used by smp.c, and is defined in
sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of
a CPU's idle task).
Split it into two parts: the scheduler bits remain in sched/core.c, and the
actual IPI emission is moved into smp.c. This lets us define an
__always_inline helper function that can take the related callback as
parameter without creating useless register pressure in the non-traced path
which only gains a (disabled) static branch.
Do the same thing for the multi IPI case.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")
commit cc9cb0a71725aa8dd8d8f534a9b562bbf7981f75
Author: Valentin Schneider <vschneid@redhat.com>
Date: Tue Mar 7 14:35:53 2023 +0000
sched, smp: Trace IPIs sent via send_call_function_single_ipi()
send_call_function_single_ipi() is the thing that sends IPIs at the bottom
of smp_call_function*() via either generic_exec_single() or
smp_call_function_many_cond(). Give it an IPI-related tracepoint.
Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 96500560f0c73c71bca1b27536c6254fa0e8ce37
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Tue Jun 13 16:20:10 2023 +0800
sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
There is a double update_rq_clock() invocation:
    __balance_push_cpu_stop()
      update_rq_clock()
      __migrate_task()
        update_rq_clock()
Sadly select_fallback_rq() also needs update_rq_clock() for
__do_set_cpus_allowed(), it is not possible to remove the update from
__balance_push_cpu_stop(). So remove it from __migrate_task() and
ensure all callers of this function call update_rq_clock() prior to
calling it.
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit cab3ecaed5cdcc9c36a96874b4c45056a46ece45
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Tue Jun 13 16:20:09 2023 +0800
sched/core: Fixed missing rq clock update before calling set_rq_offline()
When using a cpufreq governor that uses
cpufreq_add_update_util_hook(), it is possible to trigger a missing
update_rq_clock() warning for the CPU hotplug path:
    rq_attach_root()
      set_rq_offline()
        rq_offline_rt()
          __disable_runtime()
            sched_rt_rq_enqueue()
              enqueue_top_rt_rq()
                cpufreq_update_util()
                  data->func(data, rq_clock(rq), flags)
Move update_rq_clock() from sched_cpu_deactivate() (one of it's
callers) into set_rq_offline() such that it covers all
set_rq_offline() usage.
Additionally change rq_attach_root() to use rq_lock_irqsave() so that
it will properly manage the runqueue clock flags.
Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 1c06918788e8ae6e69e4381a2806617312922524
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 16:39:07 2023 +0200
sched: Consider task_struct::saved_state in wait_task_inactive()
With the introduction of task_struct::saved_state in commit
5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
matching the task state has gotten more complicated. That same commit
changed try_to_wake_up() to consider both states, but
wait_task_inactive() has been neglected.
Sebastian noted that the wait_task_inactive() usage in
ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
the tasklist_lock after it sets TASK_TRACED.
Therefore extract a common helper from ttwu_state_match() and use that
to teach wait_task_inactive() about the PREEMPT_RT locks.
Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit d5e1586617be7093ea3419e3fa9387ed833cdbb1
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 2 10:42:53 2023 +0200
sched: Unconditionally use full-fat wait_task_inactive()
While modifying wait_task_inactive() for PREEMPT_RT; the build robot
noted that UP got broken. This led to audit and consideration of the
UP implementation of wait_task_inactive().
It looks like the UP implementation is also broken for PREEMPT;
consider task_current_syscall() getting preempted between the two
calls to wait_task_inactive().
Therefore move the wait_task_inactive() implementation out of
CONFIG_SMP and unconditionally use it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: This was applied out of order with f9fc8cad9728 ("sched:
Add TASK_ANY for wait_task_inactive()") so adjusted code to match
what the results should have been.
commit 9204a97f7ae862fc8a3330ec8335917534c3fb63
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon Aug 22 13:18:19 2022 +0200
sched: Change wait_task_inactive()s match_state
Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
That is, instead of an equal comparison, use it as a mask. This allows
matching multiple block conditions.
(removes the unlikely; it doesn't make sense how it's only part of the
condition)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org
Signed-off-by: Phil Auld <pauld@redhat.com>
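A before/after fragment of the comparison change described above; TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE is just an example mask.
    bool matched;

    /* before: exact state comparison */
    matched = (READ_ONCE(p->__state) == match_state);

    /* after: match_state is a mask, so several block states can match, e.g.
     * match_state = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE */
    matched = (READ_ONCE(p->__state) & match_state);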
JIRA: https://issues.redhat.com/browse/RHEL-1536
Upstream status: RHEL only
Conflicts: A later patch renamed task_running() to task_on_cpu() so this
did not revert cleanly. In addition match_state does not need to be checked
for 0 due to f9fc8cad9728 sched: Add TASK_ANY for wait_task_inactive().
This reverts commit 3673cc2e61.
This is commit a015745ca41f from the RT tree merge. It will be re-applied in
the form it was in when merged to Linus' tree as 1c06918788.
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date: Tue Apr 11 22:26:41 2023 -0700
sched/core: Make sched_dynamic_mutex static
The sched_dynamic_mutex is only used within the file. Make it static.
Fixes: e3ff7c609f39 ("livepatch,sched: Add livepatch task switching to cond_resched()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/oe-kbuild-all/202304062335.tNuUjgsl-lkp@intel.com/
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit eff6c8ce8d4d7faef75f66614dd20bb50595d261
Author: wuchi <wuchi.zero@gmail.com>
Date: Tue Mar 21 14:44:59 2023 +0800
sched/core: Reduce cost of sched_move_task when config autogroup
Some sched_move_task calls are useless because that
task_struct->sched_task_group maybe not changed (equals task_group
of cpu_cgroup) when system enable autogroup. So do some checks in
sched_move_task.
sched_move_task eg:
task A belongs to cpu_cgroup0 and autogroup0, it will always belong
to cpu_cgroup0 when do_exit. So there is no need to do {de|en}queue.
The call graph is as follow.
    do_exit
      sched_autogroup_exit_task
        sched_move_task
          dequeue_task
          sched_change_group
            A.sched_task_group = sched_get_task_group (=cpu_cgroup0)
          enqueue_task
Performance results:
===========================
1. env
cpu: bogomips=4600.00
kernel: 6.3.0-rc3
cpu_cgroup: 6:cpu,cpuacct:/user.slice
2. cmds
do_exit script:
    for i in {0..10000}; do
        sleep 0 &
    done
    wait
Run the above script, then use the following bpftrace cmd to get
the cost of sched_move_task:
bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; }
kr:sched_move_task /@ts[tid]/
{ @ns += nsecs - @ts[tid]; delete(@ts[tid]); }'
3. cost time(ns):
without patch: 43528033
with patch: 18541416
diff:-24986617 -57.4%
As the results show, the patch saves 57.4% in this scenario.
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
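A sketch of the early bail-out described above; sched_get_task_group() is the helper named in the call graph, the rest of the body is illustrative.
    void sched_move_task(struct task_struct *tsk)
    {
            struct task_group *group = sched_get_task_group(tsk);

            /* autogroup case: group unchanged, skip the dequeue/enqueue cycle */
            if (group == tsk->sched_task_group)
                    return;

            /* ... existing dequeue_task() / sched_change_group() / enqueue_task() ... */
    }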
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 530bfad1d53d103f98cec66a3e491a36d397884d
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Thu Mar 16 16:18:06 2023 +0800
sched/core: Avoid selecting the task that is throttled to run when core-sched enable
When an {rt, cfs}_rq or a dl task is throttled, cookied tasks are not
dequeued from the core tree, so sched_core_find() and
sched_core_next() may return a throttled task, which may
cause a throttled task to run on the CPU.
So we add checks in sched_core_find() and sched_core_next()
to make sure that the returned task is runnable and is
not throttled.
Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 6015b1aca1a233379625385feb01dd014aca60b5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Mar 14 19:32:38 2023 -0700
sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good. It is indeed the allocation size of a
cpumask.
But the code also assumes that the whole allocation is initialized
without actually doing so itself. That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.
See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.
Fix this by just zeroing the allocation before using it. Do the same
for the compat version of sched_getaffinity(), which had the same logic.
Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits. For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'. The compat case already did that.
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
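A fragment of one way to implement the fix described above inside the sched_getaffinity() syscall: use a zeroing allocator so the bytes beyond nr_cpu_ids that get copied to userspace are guaranteed to be clear. Not necessarily the exact upstream diff.
    cpumask_var_t mask;

    if (!zalloc_cpumask_var(&mask, GFP_KERNEL))     /* zeroed allocation */
            return -ENOMEM;

    ret = sched_getaffinity(pid, mask);             /* fills nr_cpu_ids bits only */
    if (ret == 0) {
            unsigned int retlen = min(len, cpumask_size());

            if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen))
                    ret = -EFAULT;
            else
                    ret = retlen;
    }
    free_cpumask_var(mask);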
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: Minor fixup due to already having 8df1947c71 ("livepatch:
Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure")
commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date: Fri Feb 24 08:50:00 2023 -0800
livepatch,sched: Add livepatch task switching to cond_resched()
There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.
Fix it by patching tasks in cond_resched().
There are four different flavors of cond_resched(), depending on the
kernel configuration. Hook into all of them.
A more elegant solution might be to use a preempt notifier. However,
non-ORC unwinders can't unwind a preempted task reliably.
[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
Conflicts: minor fuzz due to context.
commit 88c56cfeaec4642aee8aac58b38d5708c6aae0d3
Author: Phil Auld <pauld@redhat.com>
Date: Wed Jul 12 09:33:57 2023 -0400
sched/fair: Block nohz tick_stop when cfs bandwidth in use
CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.
Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.
Add a check in pick_next_task_fair() as well, since that is where
we have a handle on the task that is actually going to be running.
Add a check in sched_can_stop_tick() to cover edge cases such as
nr_running going from 2->1 while the remaining task stays running.
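For reference, the "runtime limit" here is the CFS quota configured through the cgroup interface. A minimal sketch of putting such a quota on a cgroup v2 group, assuming a v2 hierarchy mounted at /sys/fs/cgroup and an existing child group named "demo" (both assumptions for illustration):
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* "quota_us period_us": allow 20ms of CPU time per 100ms period. */
          const char *path = "/sys/fs/cgroup/demo/cpu.max";
          const char *limit = "20000 100000\n";
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open cpu.max");
                  return 1;
          }
          if (write(fd, limit, strlen(limit)) < 0)
                  perror("write cpu.max");
          close(fd);
          return 0;
  }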
Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
commit c98c18270be115678f4295b10a5af5dcc9c4efa0
Author: Phil Auld <pauld@redhat.com>
Date: Fri Jul 14 08:57:46 2023 -0400
sched, cgroup: Restore meaning to hierarchical_quota
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min. It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require a child's quota to be less than or
equal to that of its ancestors, the task group will still be
constrained by such a quota, so this should be shown here. Cgroupv1
continues to set this correctly.
In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.
Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2962
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.
One way to solve this problem is to provide a way to back off from
that user provided CPU affinity. This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.
Before this patch, passing in an empty cpumask to sched_setaffinity(2)
would always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2) has been
called before to set up user_cpus_ptr. In this case, the user_cpus_ptr
that stores the user provided affinity will be cleared and the task's
CPU affinity will be reset to that of the current cpuset. The
alternative error code of -ENODEV signals both that no CPU is specified
and that, as a side effect, the task's CPU affinity has been reset to
the cpuset default.
If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before. Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.
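A small user-space probe of the behavior described above (a sketch, not part of the patch; it assumes CPU 0 is present and allowed by the current cpuset, and the -ENODEV result is specific to this RHEL-only change):
  #define _GNU_SOURCE
  #include <errno.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          cpu_set_t set;

          /* Set an explicit affinity first so user_cpus_ptr gets populated. */
          CPU_ZERO(&set);
          CPU_SET(0, &set);
          if (sched_setaffinity(0, sizeof(set), &set))
                  perror("pin to CPU 0");

          /* Now pass an empty mask: with this change we expect ENODEV and a
           * reset to the cpuset default; on other kernels, EINVAL. */
          CPU_ZERO(&set);
          if (sched_setaffinity(0, sizeof(set), &set))
                  printf("empty mask: %s\n", strerror(errno));
          else
                  printf("empty mask unexpectedly succeeded\n");
          return 0;
  }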
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957
OpenShift requires support for disabling CPU load balancing for the Telco
use cases, and this is a gating factor in determining whether it can switch
to cgroup v2 as the default.
The current RHEL9 kernel is able to create an isolated cpuset partition
of exclusive CPUs with load balancing disabled for cgroup v2. However,
it currently has the limitation that isolated cpuset partitions can
only be formed clustered around the cgroup root. That doesn't fit the
current OpenShift use case where systemd is primarily responsible for
managing the cgroup filesystem and OpenShift can only manage child
cgroups further away from the cgroup root.
To address the need of OpenShift, a patch series [1] has been proposed
upstream to extend the v2 cpuset partition semantics to allow the
creation of isolated partitions further away from cgroup root by adding a
new cpuset control file "cpuset.cpus.exclusive" to distribute potential
exclusive CPUs down the cgroup hierarchy for the creation of isolated
cpuset partition.
This MR incorporates the proposed upstream patches along with their
dependency patches to provide a way for OpenShift to move forward with
switching the default cgroup from v1 to v2 for the 4.14 release.
The last 6 patches are the proposed upstream patches; the rest have
been merged upstream, either in the mainline or the cgroup maintainer's
tree.
[1] https://lore.kernel.org/lkml/20230817132454.755459-1-longman@redhat.com/
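Assuming the proposed interface lands as described, setting up an isolated child partition might look roughly like the sketch below; the mount point, the group name "demo" and the CPU range are illustrative assumptions, not part of the series:
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int write_str(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);

          if (fd < 0 || write(fd, val, strlen(val)) < 0) {
                  perror(path);
                  if (fd >= 0)
                          close(fd);
                  return -1;
          }
          close(fd);
          return 0;
  }

  int main(void)
  {
          /* Hand CPUs 2-3 to the child as exclusive, then make it an
           * isolated partition (no load balancing inside it). */
          write_str("/sys/fs/cgroup/demo/cpuset.cpus", "2-3");
          write_str("/sys/fs/cgroup/demo/cpuset.cpus.exclusive", "2-3");
          write_str("/sys/fs/cgroup/demo/cpuset.cpus.partition", "isolated");
          return 0;
  }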
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
commit 2ef269ef1ac006acf974793d975539244d77b28f
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 8 May 2023 09:58:54 +0200
cgroup/cpuset: Free DL BW in case can_attach() fails
cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
have been checked. DL BW is not allocated per-task but as a sum over
all DL tasks migrating.
If multiple controllers are attached to the cgroup next to the cpuset
controller a non-cpuset can_attach() can fail. In this case free DL BW
in cpuset_cancel_attach().
Finally, update cpuset DL task count (nr_deadline_tasks) only in
cpuset_attach().
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
commit 85989106feb734437e2d598b639991b9185a43a6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 8 May 2023 09:58:53 +0200
sched/deadline: Create DL BW alloc, free & check overflow interface
While moving a set of tasks between exclusive cpusets,
cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for
DL BW overflow checking and per-task DL BW allocation on the destination
root_domain for the DL tasks in this set.
This approach has the issue of not freeing already allocated DL BW in
the following error cases:
(1) The set of tasks includes multiple DL tasks and DL BW overflow
checking fails for one of the subsequent DL tasks.
(2) Another controller next to the cpuset controller which is attached
to the same cgroup fails in its can_attach().
To address this problem rework dl_cpu_busy():
(1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a
dedicated dl_bw_free().
(2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of
the `struct task_struct *p` used in dl_cpu_busy(). This allows
allocating DL BW for a set of tasks as well, rather than only for a
single task.
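The bandwidth being allocated and freed here is each task's runtime/period ratio. As a point of reference, a minimal user-space sketch of a task that consumes DL bandwidth via sched_setattr(2); the struct layout is spelled out by hand since glibc has no wrapper, and the runtime/deadline/period values are arbitrary for illustration:
  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  struct sched_attr {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;
          uint64_t sched_deadline;
          uint64_t sched_period;
  };

  #ifndef SCHED_DEADLINE
  #define SCHED_DEADLINE 6
  #endif

  int main(void)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.sched_policy   = SCHED_DEADLINE;
          attr.sched_runtime  = 10ULL * 1000 * 1000;   /* 10ms of budget... */
          attr.sched_deadline = 100ULL * 1000 * 1000;  /* ...every 100ms    */
          attr.sched_period   = 100ULL * 1000 * 1000;

          /* Admission control charges runtime/period against the root
           * domain's DL bandwidth and fails with EBUSY when exhausted. */
          if (syscall(SYS_sched_setattr, 0, &attr, 0))
                  perror("sched_setattr");
          else
                  printf("admitted as SCHED_DEADLINE\n");
          return 0;
  }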
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
commit 111cd11bbc54850f24191c52ff217da88a5e639b
Author: Juri Lelli <juri.lelli@redhat.com>
Date: Mon, 8 May 2023 09:58:50 +0200
sched/cpuset: Bring back cpuset_mutex
Turns out percpu_cpuset_rwsem - commit 1243dc518c ("cgroup/cpuset:
Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
as it has been reported to cause slowdowns in workloads that need to
change cpuset configuration frequently and it is also not implementing
priority inheritance (which causes troubles with realtime workloads).
Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
only for SCHED_DEADLINE tasks (other policies don't care about stable
cpusets anyway).
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232098
Upstream Status: RHEL only
Without __always_inline, this function breaks wchan.
schedule_loop() was added by patches from the upstream RT tree; a respin
of the patches for upstream has __always_inline.
Signed-off-by: Crystal Wood <swood@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.
One way to solve this problem is to provide a way to back off from
that user provided CPU affinity. This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.
Before this patch, passing in an empty cpumask to sched_setaffinity(2)
would always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2) has been
called before to set up user_cpus_ptr. In this case, the user_cpus_ptr
that stores the user provided affinity will be cleared and the task's
CPU affinity will be reset to that of the current cpuset. The
alternative error code of -ENODEV signals both that no CPU is specified
and that, as a side effect, the task's CPU affinity has been reset to
the cpuset default.
If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before. Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.
We will have to update the sched_setaffinity(2) manpage to document
this possible side effect of passing in an empty cpumask.
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
commit ca66ec3b9994e5f82b433697e37512f7d28b6d22
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Thu Apr 27 13:19:34 2023 +0200
sched/core: Provide sched_rtmutex() and expose sched work helpers
schedule() invokes sched_submit_work() before scheduling and
sched_update_worker() afterwards to ensure that queued block requests are
flushed and the (IO)worker machineries can instantiate new workers if
required. This avoids deadlocks and starvation.
With rt_mutexes this can lead to a subtle problem:
When an rtmutex blocks, current::pi_blocked_on points to the rtmutex it
blocks on. When one of the functions in sched_submit/resume_work()
contends on an rtmutex-based lock, then that would corrupt
current::pi_blocked_on.
Make it possible to let rtmutex issue the calls outside of the slowpath,
i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
- Exposing sched_submit_work() and moving the task_running() condition
into schedule()
- Renaming sched_update_worker() to sched_resume_work() and exposing it
too.
- Providing sched_rtmutex() which just does the inner loop of scheduling
until need_resched() is no longer set. Split out the loop so this does
not create yet another copy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Crystal Wood <swood@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit 2500ad1c7fa42ad734677853961a3a8bec0772c5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri Apr 29 08:43:34 2022 -0500
ptrace: Don't change __state
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
command is executing.
Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
implement a new jobctl flag TASK_PTRACE_FROZEN. This new flag is set
in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
jobctl_unfreeze_task (when ptrace_stop remains asleep).
In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
when the wake up is for a fatal signal. Skip adding __TASK_TRACED
when TASK_PTRACE_FROZEN is not set. This has the same effect as
changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
TASK_KILLABLE go through signal_wake_up.
Handle a ptrace_stop being called with a pending fatal signal.
Previously it would have been handled by schedule simply failing to
sleep. As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
will sleep with a fatal_signal_pending. The code in signal_wake_up
guarantees that the code will be awakened by any fatal signal that
comes after TASK_TRACED is set.
Previously the __state value of __TASK_TRACED was changed to
TASK_RUNNING when woken up or back to TASK_TRACED when the code was
left in ptrace_stop. Now ptrace_stop clears JOBCTL_PTRACE_FROZEN when
woken up, and ptrace_unfreeze_traced clears JOBCTL_PTRACE_FROZEN when
the task is left sleeping.
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375
# Merge Request Required Information
## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.
## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
This is actually just an optimization, and it has non-trivial conflicts
which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
This fix is incorrectly tagged. The code that it applies to is not present in our tree.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2392
JIRA: https://issues.redhat.com/browse/RHEL-282
Tested: With scheduler stress tests. Perf QE is running performance regression tests.
Update the kernel's core scheduler and related code with fixes and minor changes from
the upstream kernel. This will sync up to roughly linux v6.3-rc6. Added a couple of
cpumask things which fit better here.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2337
JIRA: https://issues.redhat.com/browse/RHEL-310
Tested: scheduler stress tests.
This is a collection of commits that update (mostly)
the uclamp code in the scheduler. We don't have
CONFIG_UCLAMP_TASK enabled right now but we might in
the future. We do, though, have EAS enabled, and this helps
keep the code in sync to reduce issues with other patches.
It's broken out of the main scheduler update for 9.3 to
keep it contained and make the other MR smaller.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2325
JIRA: https://issues.redhat.com/browse/RHEL-311
Tested: Enabled PSI and ran various stress tests.
Updates and bug fixes for the PSI subsystem. This brings
the code up to about v6.3-rc1. It does not include the
runtime enablement interface (34f26a15611 "sched/psi: Per-cgroup
PSI accounting disable/re-enable interface") that required a larger
set of cgroup and kernfs patches. That may be taken later if the
prerequisites are provided.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
commit ed29b0b4fd835b058ddd151c49d021e28d631ee6
Author: Jens Axboe <axboe@kernel.dk>
Date: Mon May 23 17:05:03 2022 -0600
io_uring: move to separate directory
In preparation for splitting io_uring up a bit, move it into its own
top level directory. It didn't really belong in fs/ anyway, as it's
not a file system only API.
This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2276
Bugzilla: https://bugzilla.redhat.com/1996625
commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f
Author: Yair Podemsky <ypodemsk@redhat.com>
Date: Wed Nov 30 14:51:21 2022 +0200
sched/core: Fix arch_scale_freq_tick() on tickless systems
In order for the scheduler to be frequency invariant we measure the
ratio between the maximum CPU frequency and the actual CPU frequency.
During long tickless periods of time the calculations that keep track
of that might overflow, in the function scale_freq_tick():
if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
goto error;
eventually forcing the kernel to disable the feature for all CPUs,
and show the warning message:
"Scheduler frequency invariance went wobbly, disabling!".
Let's avoid that by limiting the frequency invariant calculations
to CPUs with regular tick.
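For illustration, a stand-alone approximation of the shift-overflow check quoted above (a stand-in for check_shl_overflow(), not the kernel implementation):
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Report whether a << shift loses high bits in 64-bit arithmetic. */
  static bool shl_overflows_u64(uint64_t a, unsigned int shift, uint64_t *res)
  {
          if (shift >= 64) {
                  *res = 0;
                  return a != 0;
          }
          *res = a << shift;
          return shift != 0 && (a >> (64 - shift)) != 0;
  }

  int main(void)
  {
          /* A counter delta grown huge by a long tickless stretch. */
          uint64_t acnt = UINT64_MAX / 8;
          uint64_t scaled;

          if (shl_overflows_u64(acnt, 2 * 10 /* 2*SCHED_CAPACITY_SHIFT */, &scaled))
                  printf("overflow: would disable frequency invariance\n");
          else
                  printf("scaled value: %llu\n", (unsigned long long)scaled);
          return 0;
  }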
Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit a53ce18cacb477dd0513c607f187d16f0fa96f71
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Mar 17 17:08:10 2023 +0100
sched/fair: Sanitize vruntime of entity being migrated
Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflow bug, but ignores a case where se->exec_start is reset
after a migration.
To fix this case, we delay the reset of se->exec_start until after
placing the entity, which uses se->exec_start to detect a long sleeping task.
In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.
Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 585463f0d58aa4d29b744c7c53b222b8028de87f
Author: Valentin Schneider <vschneid@redhat.com>
Date: Mon Oct 3 16:34:20 2022 +0100
sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
This removes the second use of the sched_core_mask temporary mask.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit df14b7f9efcda35e59bb6f50351aac25c50f6e24
Author: Waiman Long <longman@redhat.com>
Date: Fri Feb 3 13:18:49 2023 -0500
sched/core: Fix a missed update of user_cpus_ptr
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), a successful call to sched_setaffinity() should always save
the user requested cpu affinity mask in a task's user_cpus_ptr. However,
when the given cpu mask is the same as the current one, user_cpus_ptr
is not updated. Fix this by saving the user mask in this case too.
Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 5657c116783545fb49cd7004994c187128552b12
Author: Waiman Long <longman@redhat.com>
Date: Sun Jan 15 14:31:22 2023 -0500
sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
The kernel commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in
do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP
configs. Calling sched_setaffinity() on such a uniprocessor kernel will
cause cpumask_copy() to be called with a NULL pointer leading to general
protection fault. This is not really a problem in real use cases as
there aren't that many uniprocessor kernel configs in use and calling
sched_setaffinity() on such a uniprocessor system doesn't make sense.
Fix this problem by making sure cpumask_copy() will not be called in
such a case.
Fixes: 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()")
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 160fb0d83f206b3429fc495864a022110f9e4978
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Fri Dec 23 18:32:57 2022 +0800
sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
ttwu_do_activate() is used for a complete wakeup, in which we
activate_task() and use ttwu_do_wakeup() to mark the task runnable
and perform wakeup-preemption, and also call the class->task_woken()
callback and update rq->idle_stamp.
Since ttwu_runnable() is not a complete wakeup, it doesn't need all of
what ttwu_do_wakeup() does, so we can move that work into
ttwu_do_activate() to simplify ttwu_do_wakeup(), making it only mark
the task runnable so it can be reused in ttwu_runnable() and
try_to_wake_up().
This patch should not have any functional changes.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit efe09385864f3441c71711f91e621992f9423c01
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Fri Dec 23 18:32:56 2022 +0800
sched/core: Micro-optimize ttwu_runnable()
ttwu_runnable() is used as a fast wakeup path when the wakee task
is running on a CPU or runnable on an RQ; in both cases we can just
set its state to TASK_RUNNING to prevent a sleep.
If the wakee task is on_cpu running, we don't need to update_rq_clock()
or check_preempt_curr().
But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before
the task got to schedule() and the task has been preempted), we should
check_preempt_curr() to see if it can preempt the currently running task.
This also removes the class->task_woken() callback from ttwu_runnable(),
which wasn't required per the RT/DL implementations: any required push
operation would have been queued during class->set_next_task() when p
got preempted.
ttwu_runnable() also loses the update to rq->idle_stamp, as by definition
the rq cannot be idle in this scenario.
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 904cbab71dda1689d41a240541179f21ff433c40
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Dec 12 14:49:46 2022 +0000
sched: Make const-safe
With a modified container_of() that preserves constness, the compiler
finds some pointers which should have been marked as const. task_of()
also needs to become const-preserving for the !FAIR_GROUP_SCHED case so
that cfs_rq_of() can take a const argument. No change to generated code.
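A compileable sketch of the idea, using C11 _Generic plus GNU typeof to build a const-preserving container_of; this illustrates the approach rather than reproducing the kernel's exact macro:
  #include <stddef.h>
  #include <stdio.h>

  /* Plain container_of(): always hands back a non-const pointer, even
   * when given a pointer to a const member. */
  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  /* Const-preserving variant: a pointer-to-const member yields a
   * pointer-to-const container. */
  #define container_of_const(ptr, type, member)                          \
          _Generic((ptr),                                                \
                  const typeof(*(ptr)) *:                                \
                          ((const type *)((const char *)(ptr) -          \
                                          offsetof(type, member))),      \
                  default: container_of(ptr, type, member))

  struct entity { int weight; };
  struct task_demo { int pid; struct entity se; };

  int main(void)
  {
          struct task_demo t = { .pid = 42, .se = { .weight = 1024 } };
          const struct entity *se = &t.se;

          /* The const qualifier of 'se' is preserved in the result. */
          const struct task_demo *tp = container_of_const(se, struct task_demo, se);

          printf("pid=%d weight=%d\n", tp->pid, se->weight);
          return 0;
  }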
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date: Fri Nov 4 10:36:01 2022 +0800
sched: Clear ttwu_pending after enqueue_task()
We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task is enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
wakelist if wakee cpu is idle") because it tends to use wakelist.
Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).
Latency percentiles (usec):
                  base   base+revert_f3dd3f674555   base+this_patch
  50.0000th:         9                          13                 9
  75.0000th:        12                          19                12
  90.0000th:        15                          22                15
  95.0000th:        18                          24                17
 *99.0000th:        27                          31                24
  99.5000th:      3364                          33                27
  99.9000th:     12560                          36                30
We also tested on unixbench and hackbench, and saw no performance
change.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit c59862f8265f8060b6650ee1dc12159fe5c89779
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Aug 25 14:27:24 2022 +0200
sched/fair: Cleanup loop_max and loop_break
sched_nr_migrate_break is set to a fixed value and never changes, so we can
replace it with a define, SCHED_NR_MIGRATE_BREAK.
Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the initial value
of sysctl_sched_nr_migrate, which can be initialized to different values.
Then, use SCHED_NR_MIGRATE_BREAK to initialize sysctl_sched_nr_migrate.
The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
through debugfs.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts: Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").
commit f9fc8cad9728124cefe8844fb53d1814c92c6bfc
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:39:55 2022 +0200
sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like
ttwu()) it is possible to replace the special !match_state case with
an 'all-states' value such that any blocked state will match.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts: Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").
commit 0b9d46fc5ef7a457cc635b30b010081228cb81ac
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:33:04 2022 +0200
sched: Rename task_running() to task_on_cpu()
There is some ambiguity about task_running() in that it is unrelated
to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
task_on_cpu().
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit df16b71c686cb096774e30153c9ce6756450796c
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Thu Aug 18 20:48:03 2022 +0800
sched/fair: Allow changing cgroup of new forked task
commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
introduced a TASK_NEW state and an unnecessary limitation that would fail
when changing the cgroup of a newly forked task.
At that time, task_change_group_fair() could not handle a newly forked
fair task that had not yet been woken up by wake_up_new_task(), which
would cause a detach-on-an-unattached-task sched_avg problem.
This patch removes this unnecessary limitation by adding a check before
doing detach or attach in task_change_group_fair().
So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks, and is
only defined under #ifdef CONFIG_RT_GROUP_SCHED.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 39c4261191bf05e7eb310f852980a6d0afe5582a
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Thu Aug 18 20:47:58 2022 +0800
sched/fair: Remove redundant cpu_cgrp_subsys->fork()
We use cpu_cgrp_subsys->fork() to set the task group for the new fair
task in cgroup_post_fork().
Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
already does set_task_rq() for the new fair task in sched_cgroup_fork(),
cpu_cgrp_subsys->fork() can be removed.
  cgroup_can_fork()            --> pin parent's sched_task_group
  sched_cgroup_fork()
    __set_task_cpu()
      set_task_rq()
  cgroup_post_fork()
    ss->fork() := cpu_cgroup_fork()
      sched_change_group(..., TASK_SET_GROUP)
        task_set_group_fair()
          set_task_rq()        --> can be removed
After this patch's change, task_change_group_fair() only needs to
care about task cgroup migration, making the code much simpler.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 09348d75a6ce60eec85c86dd0ab7babc4db3caf6
Author: Ingo Molnar <mingo@kernel.org>
Date: Thu Aug 11 08:54:52 2022 +0200
sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(),
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.
By using a WARN_ON_ONCE() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON_ONCE().
There's one exception: WARN_ON_ONCE() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON_ONCE(), such as in:
- BUG_ON(!lock_task_sighand(p, &flags));
+ if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+ return;
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 18c31c9711a90b48a77b78afb65012d9feec444c
Author: Bing Huang <huangbing@kylinos.cn>
Date: Sat Jul 23 05:36:09 2022 +0800
sched/fair: Make per-cpu cpumasks static
The load_balance_mask and select_rq_mask percpu variables are only used in
kernel/sched/fair.c.
Make them static and move their allocation into init_sched_fair_class().
Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the
CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask
allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL
class (local_cpu_mask_dl in init_sched_dl_class()).
[ mingo: Tidied up changelog & touched up the code. ]
Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 0f03d6805bfc454279169a1460abb3f6b3db317f
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date: Wed Jul 27 14:08:19 2022 +0800
sched/debug: Print each field value left-aligned in sched_show_task()
Currently, the values of some fields are printed right-aligned, causing
the field value to be next to the next field name rather than next to its
own field name. So print each field value left-aligned, to make it more
readable.
Before:
stack: 0 pid: 307 ppid: 2 flags:0x00000008
After:
stack:0 pid:308 ppid:2 flags:0x0000000a
This also makes them print in the same style as the other two fields:
task:demo0 state:R running task
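A generic printf illustration of the difference; the field widths are arbitrary and not the kernel's actual format strings:
  #include <stdio.h>

  int main(void)
  {
          /* Right-aligned: the value drifts toward the next label. */
          printf("stack:%10d pid:%10d ppid:%10d\n", 0, 307, 2);
          /* Left-aligned: the value stays next to its own label. */
          printf("stack:%-10d pid:%-10d ppid:%-10d\n", 0, 308, 2);
          return 0;
  }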
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 0569b245132c40015281610353935a50e282eb94
Author: Mark Rutland <mark.rutland@arm.com>
Date: Mon Nov 29 13:06:45 2021 +0000
sched: Snapshot thread flags
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.
To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.
Convert them all to the new flag accessor helpers.
The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in
set_nr_if_polling() is left as-is for clarity.
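A user-space analogue of the snapshot pattern; the flag names and the atomic flags word are made up for the demo, and the kernel's own accessor helpers are not shown here:
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define FLAG_NEED_RESCHED  (1u << 0)
  #define FLAG_SIGPENDING    (1u << 1)

  /* A flags word that another thread may modify at any time. */
  static _Atomic unsigned int thread_flags = FLAG_SIGPENDING;

  int main(void)
  {
          /* Snapshot once, then test bits on the snapshot only; re-reading
           * the live word for each test could see two different values. */
          unsigned int snap = atomic_load_explicit(&thread_flags,
                                                   memory_order_relaxed);

          bool resched = snap & FLAG_NEED_RESCHED;
          bool sigpend = snap & FLAG_SIGPENDING;

          printf("need_resched=%d sigpending=%d\n", resched, sigpend);
          return 0;
  }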
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-310
commit 8589018acc65e5ddfd111f0a7ee85f9afde3a830
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Fri Dec 16 14:24:06 2022 +0800
sched/core: Adjusting the order of scanning CPU
When select_idle_capacity() starts scanning for an idle CPU, it starts
with the target CPU that has already been checked in select_idle_sibling().
So we start checking from the next CPU and try the target CPU at the end.
Similarly for task_numa_assign(), we have just checked numa_migrate_on
of dst_cpu, so start from the next CPU. This also works for
steal_cookie_task(), where the first scan must fail, so start directly
from the next one.
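A toy sketch of the scan order being described, using a plain loop instead of the kernel's cpumask iterators; the idle predicate is a stand-in:
  #include <stdbool.h>
  #include <stdio.h>

  #define NR_CPUS 8

  static bool cpu_is_idle(int cpu)
  {
          return cpu == 5;        /* stand-in predicate for the demo */
  }

  /* Scan starting at target+1, wrapping around, and only retry the
   * already-checked target as the very last candidate. */
  static int scan_from_next(int target)
  {
          for (int i = 1; i <= NR_CPUS; i++) {
                  int cpu = (target + i) % NR_CPUS;

                  if (cpu_is_idle(cpu))
                          return cpu;
          }
          return -1;
  }

  int main(void)
  {
          printf("picked cpu %d\n", scan_from_next(2));
          return 0;
  }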
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>