Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Jerome Marchand	04a26afde2	trace: Add trace_ipi_send_cpu() Bugzilla: https://bugzilla.redhat.com/2192613 Conflicts: context change due to missing commit ed29b0b4fd83 ("io_uring: move to separate directory") commit 68e2d17c9eb311ab59aeb6d0c38aad8985fa2596 Author: Peter Zijlstra <peterz@infradead.org> Date: Wed Mar 22 11:28:36 2023 +0100 trace: Add trace_ipi_send_cpu() Because copying cpumasks around when targeting a single CPU is a bit daft... Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net Signed-off-by: Jerome Marchand <jmarchan@redhat.com>	2023-09-14 15:36:30 +02:00
Jerome Marchand	aa5786b04d	sched, smp: Trace smp callback causing an IPI Bugzilla: https://bugzilla.redhat.com/2192613 Conflicts: Need to modify __smp_call_single_queue_debug() too. It was removed upstream by commit 1771257cb447 ("locking/csd_lock: Remove added data from CSD lock debugging") commit 68f4ff04dbada18dad79659c266a8e5e29e458cd Author: Valentin Schneider <vschneid@redhat.com> Date: Tue Mar 7 14:35:58 2023 +0000 sched, smp: Trace smp callback causing an IPI Context ======= The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter which so far has only been fed with NULL. While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing struct layout (meaning their callback func can be accessed without caring about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function attached to its struct. This means we need to check the type of a CSD before eventually dereferencing its associated callback. This isn't as trivial as it sounds: the CSD type is stored in __call_single_node.u_flags, which get cleared right before the callback is executed via csd_unlock(). This implies checking the CSD type before it is enqueued on the call_single_queue, as the target CPU's queue can be flushed before we get to sending an IPI. Furthermore, send_call_function_single_ipi() only has a CPU parameter, and would need to have an additional argument to trickle down the invoked function. This is somewhat silly, as the extra argument will always be pushed down to the function even when nothing is being traced, which is unnecessary overhead. Changes ======= send_call_function_single_ipi() is only used by smp.c, and is defined in sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of a CPU's idle task). Split it into two parts: the scheduler bits remain in sched/core.c, and the actual IPI emission is moved into smp.c. 
This lets us define an __always_inline helper function that can take the related callback as parameter without creating useless register pressure in the non-traced path which only gains a (disabled) static branch. Do the same thing for the multi IPI case. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com>	2023-09-14 15:36:30 +02:00
Jerome Marchand	160dc2ad5b	sched, smp: Trace IPIs sent via send_call_function_single_ipi() Bugzilla: https://bugzilla.redhat.com/2192613 Conflicts: context change due to missing commit ed29b0b4fd83 ("io_uring: move to separate directory") commit cc9cb0a71725aa8dd8d8f534a9b562bbf7981f75 Author: Valentin Schneider <vschneid@redhat.com> Date: Tue Mar 7 14:35:53 2023 +0000 sched, smp: Trace IPIs sent via send_call_function_single_ipi() send_call_function_single_ipi() is the thing that sends IPIs at the bottom of smp_call_function*() via either generic_exec_single() or smp_call_function_many_cond(). Give it an IPI-related tracepoint. Note that this ends up tracing any IPI sent via __smp_call_single_queue(), which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free". Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com>	2023-09-14 15:36:30 +02:00
Phil Auld	c11309550b	sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() JIRA: https://issues.redhat.com/browse/RHEL-1536 commit 96500560f0c73c71bca1b27536c6254fa0e8ce37 Author: Hao Jia <jiahao.os@bytedance.com> Date: Tue Jun 13 16:20:10 2023 +0800 sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() There is a double update_rq_clock() invocation: __balance_push_cpu_stop() update_rq_clock() __migrate_task() update_rq_clock() Sadly select_fallback_rq() also needs update_rq_clock() for __do_set_cpus_allowed(), it is not possible to remove the update from __balance_push_cpu_stop(). So remove it from __migrate_task() and ensure all callers of this function call update_rq_clock() prior to calling it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:30:59 -04:00
Phil Auld	57aa0597d5	sched/core: Fixed missing rq clock update before calling set_rq_offline() JIRA: https://issues.redhat.com/browse/RHEL-1536 commit cab3ecaed5cdcc9c36a96874b4c45056a46ece45 Author: Hao Jia <jiahao.os@bytedance.com> Date: Tue Jun 13 16:20:09 2023 +0800 sched/core: Fixed missing rq clock update before calling set_rq_offline() When using a cpufreq governor that uses cpufreq_add_update_util_hook(), it is possible to trigger a missing update_rq_clock() warning for the CPU hotplug path: rq_attach_root() set_rq_offline() rq_offline_rt() __disable_runtime() sched_rt_rq_enqueue() enqueue_top_rt_rq() cpufreq_update_util() data->func(data, rq_clock(rq), flags) Move update_rq_clock() from sched_cpu_deactivate() (one of its callers) into set_rq_offline() such that it covers all set_rq_offline() usage. Additionally change rq_attach_root() to use rq_lock_irqsave() so that it will properly manage the runqueue clock flags. Suggested-by: Ben Segall <bsegall@google.com> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:30:59 -04:00
Phil Auld	51c8946826	sched: Consider task_struct::saved_state in wait_task_inactive() JIRA: https://issues.redhat.com/browse/RHEL-1536 commit 1c06918788e8ae6e69e4381a2806617312922524 Author: Peter Zijlstra <peterz@infradead.org> Date: Wed May 31 16:39:07 2023 +0200 sched: Consider task_struct::saved_state in wait_task_inactive() With the introduction of task_struct::saved_state in commit 5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks") matching the task state has gotten more complicated. That same commit changed try_to_wake_up() to consider both states, but wait_task_inactive() has been neglected. Sebastian noted that the wait_task_inactive() usage in ptrace_check_attach() can misbehave when ptrace_stop() is blocked on the tasklist_lock after it sets TASK_TRACED. Therefore extract a common helper from ttwu_state_match() and use that to teach wait_task_inactive() about the PREEMPT_RT locks. Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:30:59 -04:00
Phil Auld	9e3d063131	sched: Unconditionally use full-fat wait_task_inactive() JIRA: https://issues.redhat.com/browse/RHEL-1536 commit d5e1586617be7093ea3419e3fa9387ed833cdbb1 Author: Peter Zijlstra <peterz@infradead.org> Date: Fri Jun 2 10:42:53 2023 +0200 sched: Unconditionally use full-fat wait_task_inactive() While modifying wait_task_inactive() for PREEMPT_RT; the build robot noted that UP got broken. This led to audit and consideration of the UP implementation of wait_task_inactive(). It looks like the UP implementation is also broken for PREEMPT; consider task_current_syscall() getting preempted between the two calls to wait_task_inactive(). Therefore move the wait_task_inactive() implementation out of CONFIG_SMP and unconditionally use it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:30:59 -04:00
Phil Auld	ec56b1904a	sched: Change wait_task_inactive()s match_state JIRA: https://issues.redhat.com/browse/RHEL-1536 Conflicts: This was applied out of order with f9fc8cad9728 ("sched: Add TASK_ANY for wait_task_inactive()") so adjusted code to match what the results should have been. commit 9204a97f7ae862fc8a3330ec8335917534c3fb63 Author: Peter Zijlstra <peterz@infradead.org> Date: Mon Aug 22 13:18:19 2022 +0200 sched: Change wait_task_inactive()s match_state Make wait_task_inactive()'s @match_state work like ttwu()'s @state. That is, instead of an equal comparison, use it as a mask. This allows matching multiple block conditions. (removes the unlikely; it doesn't make sense how it's only part of the condition) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:30:58 -04:00
Phil Auld	418216578b	Revert "sched: Consider task_struct::saved_state in wait_task_inactive()." JIRA: https://issues.redhat.com/browse/RHEL-1536 Upstream status: RHEL only Conflicts: A later patch renamed task_running() to task_on_cpu() so this did not revert cleanly. In addition match_state does not need to be checked for 0 due to f9fc8cad9728 sched: Add TASK_ANY for wait_task_inactive(). This reverts commit `3673cc2e61`. This is commit a015745ca41f from the RT tree merge. It will be re-applied in the form it was in when merged to Linus' tree as 1c06918788. Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:30:58 -04:00
Phil Auld	c59c893622	sched/core: Make sched_dynamic_mutex static JIRA: https://issues.redhat.com/browse/RHEL-1536 commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d Author: Josh Poimboeuf <jpoimboe@kernel.org> Date: Tue Apr 11 22:26:41 2023 -0700 sched/core: Make sched_dynamic_mutex static The sched_dynamic_mutex is only used within the file. Make it static. Fixes: e3ff7c609f39 ("livepatch,sched: Add livepatch task switching to cond_resched()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/oe-kbuild-all/202304062335.tNuUjgsl-lkp@intel.com/ Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:26:06 -04:00
Phil Auld	63797cb734	sched/core: Reduce cost of sched_move_task when config autogroup JIRA: https://issues.redhat.com/browse/RHEL-1536 commit eff6c8ce8d4d7faef75f66614dd20bb50595d261 Author: wuchi <wuchi.zero@gmail.com> Date: Tue Mar 21 14:44:59 2023 +0800 sched/core: Reduce cost of sched_move_task when config autogroup Some sched_move_task calls are useless because task_struct->sched_task_group may not have changed (it equals the task_group of cpu_cgroup) when the system enables autogroup. So do some checks in sched_move_task. sched_move_task eg: task A belongs to cpu_cgroup0 and autogroup0, it will always belong to cpu_cgroup0 when do_exit. So there is no need to do {de\|en}queue. The call graph is as follows. do_exit sched_autogroup_exit_task sched_move_task dequeue_task sched_change_group A.sched_task_group = sched_get_task_group (=cpu_cgroup0) enqueue_task Performance results: =========================== 1. env cpu: bogomips=4600.00 kernel: 6.3.0-rc3 cpu_cgroup: 6:cpu,cpuacct:/user.slice 2. cmds do_exit script: for i in {0..10000}; do sleep 0 & done wait Run the above script, then use the following bpftrace cmd to get the cost of sched_move_task: bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; } kr:sched_move_task /@ts[tid]/ { @ns += nsecs - @ts[tid]; delete(@ts[tid]); }' 3. cost time(ns): without patch: 43528033 with patch: 18541416 diff:-24986617 -57.4% As the results show, the patch saves 57.4% in this scenario. Signed-off-by: wuchi <wuchi.zero@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:26:06 -04:00
Phil Auld	112765493a	sched/core: Avoid selecting the task that is throttled to run when core-sched enable JIRA: https://issues.redhat.com/browse/RHEL-1536 commit 530bfad1d53d103f98cec66a3e491a36d397884d Author: Hao Jia <jiahao.os@bytedance.com> Date: Thu Mar 16 16:18:06 2023 +0800 sched/core: Avoid selecting the task that is throttled to run when core-sched enable When an {rt, cfs}_rq or dl task is throttled, cookied tasks are not dequeued from the core tree, so sched_core_find() and sched_core_next() may return a throttled task, which may cause the throttled task to run on the CPU. So we add checks in sched_core_find() and sched_core_next() to make sure that the return is a runnable task that is not throttled. Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:26:06 -04:00
Phil Auld	2e4b079146	sched_getaffinity: don't assume 'cpumask_size()' is fully initialized JIRA: https://issues.redhat.com/browse/RHEL-1536 commit 6015b1aca1a233379625385feb01dd014aca60b5 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue Mar 14 19:32:38 2023 -0700 sched_getaffinity: don't assume 'cpumask_size()' is fully initialized The getaffinity() system call uses 'cpumask_size()' to decide how big the CPU mask is - so far so good. It is indeed the allocation size of a cpumask. But the code also assumes that the whole allocation is initialized without actually doing so itself. That's wrong, because we might have fixed-size allocations (making copying and clearing more efficient), but not all of it is then necessarily used if 'nr_cpu_ids' is smaller. Having checked other users of 'cpumask_size()', they all seem to be ok, either using it purely for the allocation size, or explicitly zeroing the cpumask before using the size in bytes to copy it. See for example the ublk_ctrl_get_queue_affinity() function that uses the proper 'zalloc_cpumask_var()' to make sure that the whole mask is cleared, whether the storage is on the stack or if it was an external allocation. Fix this by just zeroing the allocation before using it. Do the same for the compat version of sched_getaffinity(), which had the same logic. Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to access the bits. For a cpumask_var_t, it ends up being a pointer to the same data either way, but it's just a good idea to treat it like you would a 'cpumask_t'. The compat case already did that. Reported-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/ Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:26:06 -04:00
Phil Auld	8142c03a19	livepatch,sched: Add livepatch task switching to cond_resched() JIRA: https://issues.redhat.com/browse/RHEL-1536 Conflicts: Minor fixup due to already having `8df1947c71` ("livepatch: Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure") commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8 Author: Josh Poimboeuf <jpoimboe@kernel.org> Date: Fri Feb 24 08:50:00 2023 -0800 livepatch,sched: Add livepatch task switching to cond_resched() There have been reports [1][2] of live patches failing to complete within a reasonable amount of time due to CPU-bound kthreads. Fix it by patching tasks in cond_resched(). There are four different flavors of cond_resched(), depending on the kernel configuration. Hook into all of them. A more elegant solution might be to use a preempt notifier. However, non-ORC unwinders can't unwind a preempted task reliably. [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/ [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:26:06 -04:00
Phil Auld	4e3b05f4b0	sched/fair: Block nohz tick_stop when cfs bandwidth in use Bugzilla: https://bugzilla.redhat.com/2208016 Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core Conflicts: minor fuzz due to context. commit 88c56cfeaec4642aee8aac58b38d5708c6aae0d3 Author: Phil Auld <pauld@redhat.com> Date: Wed Jul 12 09:33:57 2023 -0400 sched/fair: Block nohz tick_stop when cfs bandwidth in use CFS bandwidth limits and NOHZ full don't play well together. Tasks can easily run well past their quotas before a remote tick does accounting. This leads to long, multi-period stalls before such tasks can run again. Currently, when presented with these conflicting requirements the scheduler is favoring nohz_full and letting the tick be stopped. However, nohz tick stopping is already best-effort, there are a number of conditions that can prevent it, whereas cfs runtime bandwidth is expected to be enforced. Make the scheduler favor bandwidth over stopping the tick by setting TICK_DEP_BIT_SCHED when the only running task is a cfs task with runtime limit enabled. We use cfs_b->hierarchical_quota to determine if the task requires the tick. Add check in pick_next_task_fair() as well since that is where we have a handle on the task that is actually going to be running. Add check in sched_can_stop_tick() to cover some edge cases such as nr_running going from 2->1 and the 1 remains the running task. Reviewed-By: Ben Segall <bsegall@google.com> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:25:42 -04:00
Phil Auld	3f3cb409d3	sched, cgroup: Restore meaning to hierarchical_quota Bugzilla: https://bugzilla.redhat.com/2208016 Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core commit c98c18270be115678f4295b10a5af5dcc9c4efa0 Author: Phil Auld <pauld@redhat.com> Date: Fri Jul 14 08:57:46 2023 -0400 sched, cgroup: Restore meaning to hierarchical_quota In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task groups due to the previous fix simply taking the min. It should reflect a limit imposed at that level or by an ancestor. Even though cgroupv2 does not require child quota to be less than or equal to that of its ancestors the task group will still be constrained by such a quota so this should be shown here. Cgroupv1 continues to set this correctly. In both cases, add initialization when a new task group is created based on the current parent's value (or RUNTIME_INF in the case of root_task_group). Otherwise, the field is wrong until a quota is changed after creation and __cfs_schedulable() is called. Fixes: `c53593e5cb` ("sched, cgroup: Don't reject lower cpu.max on ancestors") Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-09-07 14:25:41 -04:00
Jan Stancek	8d19d78fab	Merge: sched/core: Use empty mask to reset cpumasks in sched_setaffinity() MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2962 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681 Upstream Status: RHEL only Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask"), user provided CPU affinity via sched_setaffinity(2) is preserved even if the task is being moved to a different cpuset. However, that affinity is also being inherited by any subsequently created child processes which may not want or be aware of that affinity. One way to solve this problem is to provide a way to back off from that user provided CPU affinity. This patch implements such a scheme by using an empty cpumask to signal a reset of the cpumasks to the default as allowed by the current cpuset. Before this patch, passing in an empty cpumask to sched_setaffinity(2) will always return an -EINVAL error. With this patch, an alternative error of -ENODEV will be returned if sched_setaffinity(2) has been called before to set up user_cpus_ptr. In this case, the user_cpus_ptr that stores the user provided affinity will be cleared and the task's CPU affinity will be reset to that of the current cpuset. This alternative error code of -ENODEV signals both that no CPU is specified and that, as a side effect, the CPU affinity has been reset to the cpuset default. If sched_setaffinity(2) has not been called previously, an -EINVAL error will be returned with an empty cpumask just like before. Tests or tools that rely on the behavior that an empty cpumask will return an error code will not be affected. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: John B. Wyatt IV <jwyatt@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-09-01 21:26:13 +02:00
Jan Stancek	f2a2d5da21	Merge: cgroup/cpuset: Provide better cpuset API to enable creation of isolated partition MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568 OpenShift requires support to disable CPU load balancing for the Telco use cases and this is a gating factor in determining if it can switch to use cgroup v2 as the default. The current RHEL9 kernel is able to create an isolated cpuset partition of exclusive CPUs with load balancing disabled for cgroup v2. However, it currently has the limitation that isolated cpuset partitions can only be formed clustered around the cgroup root. That doesn't fit the current OpenShift use case where systemd is primarily responsible for managing the cgroup filesystem and OpenShift can only manage child cgroups further away from the cgroup root. To address the need of OpenShift, a patch series [1] has been proposed upstream to extend the v2 cpuset partition semantics to allow the creation of isolated partitions further away from cgroup root by adding a new cpuset control file "cpuset.cpus.exclusive" to distribute potential exclusive CPUs down the cgroup hierarchy for the creation of isolated cpuset partition. This MR incorporates the proposed upstream patches with its dependency patches to provide a way for OpenShift to move forward with switching the default cgroup from v1 to v2 for the 4.14 release. The last 6 patches are the proposed upstream patches and the rest have been merged upstream either in the mainline or the cgroup maintainer's tree. 
[1] https://lore.kernel.org/lkml/20230817132454.755459-1-longman@redhat.com/ Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Aristeu Rozanski <arozansk@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-09-01 21:26:08 +02:00
Waiman Long	132876f2ff	cgroup/cpuset: Free DL BW in case can_attach() fails Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568 commit 2ef269ef1ac006acf974793d975539244d77b28f Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Mon, 8 May 2023 09:58:54 +0200 cgroup/cpuset: Free DL BW in case can_attach() fails cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks have been checked. DL BW is not allocated per-task but as a sum over all DL tasks migrating. If multiple controllers are attached to the cgroup next to the cpuset controller a non-cpuset can_attach() can fail. In this case free DL BW in cpuset_cancel_attach(). Finally, update cpuset DL task count (nr_deadline_tasks) only in cpuset_attach(). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2023-08-28 11:07:05 -04:00
Waiman Long	5503327426	sched/deadline: Create DL BW alloc, free & check overflow interface Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568 commit 85989106feb734437e2d598b639991b9185a43a6 Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Mon, 8 May 2023 09:58:53 +0200 sched/deadline: Create DL BW alloc, free & check overflow interface While moving a set of tasks between exclusive cpusets, cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for DL BW overflow checking and per-task DL BW allocation on the destination root_domain for the DL tasks in this set. This approach has the issue of not freeing already allocated DL BW in the following error cases: (1) The set of tasks includes multiple DL tasks and DL BW overflow checking fails for one of the subsequent DL tasks. (2) Another controller next to the cpuset controller which is attached to the same cgroup fails in its can_attach(). To address this problem rework dl_cpu_busy(): (1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a dedicated dl_bw_free(). (2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of a `struct task_struct *p` used in dl_cpu_busy(). This allows to allocate DL BW for a set of tasks too rather than only for a single task. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2023-08-28 11:07:05 -04:00
Waiman Long	3493ed9e35	sched/cpuset: Bring back cpuset_mutex Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568 commit 111cd11bbc54850f24191c52ff217da88a5e639b Author: Juri Lelli <juri.lelli@redhat.com> Date: Mon, 8 May 2023 09:58:50 +0200 sched/cpuset: Bring back cpuset_mutex Turns out percpu_cpuset_rwsem - commit `1243dc518c` ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea, as it has been reported to cause slowdowns in workloads that need to change cpuset configuration frequently and it is also not implementing priority inheritance (which causes troubles with realtime workloads). Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it only for SCHED_DEADLINE tasks (other policies don't care about stable cpusets anyway). Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2023-08-28 11:07:04 -04:00
Crystal Wood	ec180d083a	sched/core: Add __always_inline to schedule_loop() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232098 Upstream Status: RHEL only Without __always_inline, this function breaks wchan. schedule_loop() was added by patches from the upstream RT tree; a respin of the patches for upstream has __always_inline. Signed-off-by: Crystal Wood <swood@redhat.com>	2023-08-21 09:57:26 -05:00
Waiman Long	05fddaaaac	sched/core: Use empty mask to reset cpumasks in sched_setaffinity() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681 Upstream Status: RHEL only Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask"), user provided CPU affinity via sched_setaffinity(2) is preserved even if the task is being moved to a different cpuset. However, that affinity is also being inherited by any subsequently created child processes which may not want or be aware of that affinity. One way to solve this problem is to provide a way to back off from that user provided CPU affinity. This patch implements such a scheme by using an empty cpumask to signal a reset of the cpumasks to the default as allowed by the current cpuset. Before this patch, passing in an empty cpumask to sched_setaffinity(2) will always return an -EINVAL error. With this patch, an alternative error of -ENODEV will be returned if sched_setaffinity(2) has been called before to set up user_cpus_ptr. In this case, the user_cpus_ptr that stores the user provided affinity will be cleared and the task's CPU affinity will be reset to that of the current cpuset. This alternative error code of -ENODEV signals both that no CPU is specified and that, as a side effect, the CPU affinity has been reset to the cpuset default. If sched_setaffinity(2) has not been called previously, an -EINVAL error will be returned with an empty cpumask just like before. Tests or tools that rely on the behavior that an empty cpumask will return an error code will not be affected. We will have to update the sched_setaffinity(2) manpage to document this possible side effect of passing in an empty cpumask. Signed-off-by: Waiman Long <longman@redhat.com>	2023-08-19 14:53:37 -04:00
Jan Stancek	b7217f6931	Merge: sched/core: Provide sched_rtmutex() and expose sched work helpers MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2829 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724 Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git Avoid corrupting lock state due to blocking on a lock in sched_submit_work() while in the process of blocking on another lock. Signed-off-by: Crystal Wood <swood@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-07-31 16:05:41 +02:00
Crystal Wood	09e4f82619	sched/core: Provide sched_rtmutex() and expose sched work helpers Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724 Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git commit ca66ec3b9994e5f82b433697e37512f7d28b6d22 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Apr 27 13:19:34 2023 +0200 sched/core: Provide sched_rtmutex() and expose sched work helpers schedule() invokes sched_submit_work() before scheduling and sched_update_worker() afterwards to ensure that queued block requests are flushed and the (IO)worker machineries can instantiate new workers if required. This avoids deadlocks and starvation. With rt_mutexes this can lead to a subtle problem: When rtmutex blocks, current::pi_blocked_on points to the rtmutex it blocks on. When one of the functions in sched_submit/resume_work() contends on a rtmutex based lock then that would corrupt current::pi_blocked_on. Make it possible to let rtmutex issue the calls outside of the slowpath, i.e. when it is guaranteed that current::pi_blocked_on is NULL, by: - Exposing sched_submit_work() and moving the task_running() condition into schedule() - Renaming sched_update_worker() to sched_resume_work() and exposing it too. - Providing sched_rtmutex() which just does the inner loop of scheduling until need_resched() is no longer set. Split out the loop so this does not create yet another copy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Crystal Wood <swood@redhat.com>	2023-07-18 17:22:36 -05:00
Oleg Nesterov	b85b393abb	ptrace: Don't change __state Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325 commit 2500ad1c7fa42ad734677853961a3a8bec0772c5 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Fri Apr 29 08:43:34 2022 -0500 ptrace: Don't change __state Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace command is executing. Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and implement a new jobctl flag TASK_PTRACE_FROZEN. This new flag is set in jobctl_freeze_task and cleared when ptrace_stop is awoken or in jobctl_unfreeze_task (when ptrace_stop remains asleep). In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL when the wake up is for a fatal signal. Skip adding __TASK_TRACED when TASK_PTRACE_FROZEN is not set. This has the same effect as changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use TASK_KILLABLE go through signal_wake_up. Handle a ptrace_stop being called with a pending fatal signal. Previously it would have been handled by schedule simply failing to sleep. As TASK_WAKEKILL is no longer part of TASK_TRACED schedule will sleep with a fatal_signal_pending. The code in signal_wake_up guarantees that the code will be awakened by any fatal signal that comes after TASK_TRACED is set. Previously the __state value of __TASK_TRACED was changed to TASK_RUNNING when woken up or back to TASK_TRACED when the code was left in ptrace_stop. Now when woken up ptrace_stop clears JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreeze_traced clears JOBCTL_PTRACE_FROZEN. Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2023-07-06 15:55:31 +02:00
Jan Stancek	704d11b087	Merge: enable io_uring MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375 # Merge Request Required Information ## Summary of Changes This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option. ## Approved Development Ticket Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214 Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation") This is actually just an optimization, and it has non-trivial conflicts which would require additional backports to resolve. Skip it. Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce") This fix is incorrectly tagged. The code that it applies to is not present in our tree. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Approved-by: John Meneghini <jmeneghi@redhat.com> Approved-by: Ming Lei <ming.lei@redhat.com> Approved-by: Maurizio Lombardi <mlombard@redhat.com> Approved-by: Brian Foster <bfoster@redhat.com> Approved-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-17 07:47:08 +02:00
Jan Stancek	3b12a1f1fc	Merge: Scheduler updates for 9.3 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2392 JIRA: https://issues.redhat.com/browse/RHEL-282 Tested: With scheduler stress tests. Perf QE is running performance regression tests. Update the kernel's core scheduler and related code with fixes and minor changes from the upstream kernel. This will sync up to roughly linux v6.3-rc6. Added a couple of cpumask things which fit better here. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-16 11:49:47 +02:00
Jan Stancek	eeab15fa15	Merge: Scheduler uclamp and asym updates to v6.3-rc1 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2337 JIRA: https://issues.redhat.com/browse/RHEL-310 Tested: scheduler stress tests. This is a collection of commits that update (mostly) the uclamp code in the scheduler. We don't have CONFIG_UCLAMP_TASK enabled right now but we might in the future. We do though have EAS enabled and this helps keep the code in sync to reduce issues with other patches. It's broken out of the main scheduler update for 9.3 to keep it contained and make the other MR smaller. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-15 09:35:54 +02:00
Jan Stancek	f58fc750ef	Merge: Sched/psi: updates to v6.3-rc1 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2325 JIRA: https://issues.redhat.com/browse/RHEL-311 Tested: Enabled PSI and ran various stress tests. Updates and bug fixes for the PSI subsystem. This brings the code up to about v6.3-rc1. It does not include the runtime enablement interface (34f26a15611 "sched/psi: Per-cgroup PSI accounting disable/re-enable interface") that required a larger set of cgroup and kernfs patches. That may be taken later if the prerequisites are provided. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Approved-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-11 12:12:13 +02:00
Jeff Moyer	2b8780eae3	io_uring: move to separate directory Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit ed29b0b4fd835b058ddd151c49d021e28d631ee6 Author: Jens Axboe <axboe@kernel.dk> Date: Mon May 23 17:05:03 2022 -0600 io_uring: move to separate directory In preparation for splitting io_uring up a bit, move it into its own top level directory. It didn't really belong in fs/ anyway, as it's not a file system only API. This adds io_uring/ and moves the core files in there, and updates the MAINTAINERS file for the new location. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 04:49:02 -04:00
Jan Stancek	567f50bcff	Merge: sched/core: Fix arch_scale_freq_tick() on tickless systems MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2276 Bugzilla: https://bugzilla.redhat.com/1996625 commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f Author: Yair Podemsky <ypodemsk@redhat.com> Date: Wed Nov 30 14:51:21 2022 +0200 sched/core: Fix arch_scale_freq_tick() on tickless systems In order for the scheduler to be frequency invariant we measure the ratio between the maximum CPU frequency and the actual CPU frequency. During long tickless periods of time the calculations that keep track of that might overflow, in the function scale_freq_tick(): if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt)) goto error; eventually forcing the kernel to disable the feature for all CPUs, and show the warning message: "Scheduler frequency invariance went wobbly, disabling!". Let's avoid that by limiting the frequency invariant calculations to CPUs with regular tick. Fixes: `e2b0d619b4` ("x86, sched: check for counters overflow in frequency invariant accounting") Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Yair Podemsky <ypodemsk@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz> Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Juri Lelli <juri.lelli@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-04-25 06:58:25 +02:00
Phil Auld	02c4ba58b7	sched/fair: Sanitize vruntime of entity being migrated JIRA: https://issues.redhat.com/browse/RHEL-282 commit a53ce18cacb477dd0513c607f187d16f0fa96f71 Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Fri Mar 17 17:08:10 2023 +0100 sched/fair: Sanitize vruntime of entity being migrated Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed") fixes an overflowing bug, but ignore a case that se->exec_start is reset after a migration. For fixing this case, we delay the reset of se->exec_start after placing the entity which se->exec_start to detect long sleeping task. In order to take into account a possible divergence between the clock_task of 2 rqs, we increase the threshold to around 104 days. Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed") Originally-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 14:16:00 -04:00
Phil Auld	d3f2df660a	sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() JIRA: https://issues.redhat.com/browse/RHEL-282 commit 585463f0d58aa4d29b744c7c53b222b8028de87f Author: Valentin Schneider <vschneid@redhat.com> Date: Mon Oct 3 16:34:20 2022 +0100 sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() This removes the second use of the sched_core_mask temporary mask. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 10:04:09 -04:00
Phil Auld	f089b6b716	sched/core: Fix a missed update of user_cpus_ptr JIRA: https://issues.redhat.com/browse/RHEL-282 commit df14b7f9efcda35e59bb6f50351aac25c50f6e24 Author: Waiman Long <longman@redhat.com> Date: Fri Feb 3 13:18:49 2023 -0500 sched/core: Fix a missed update of user_cpus_ptr Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask"), a successful call to sched_setaffinity() should always save the user requested cpu affinity mask in a task's user_cpus_ptr. However, when the given cpu mask is the same as the current one, user_cpus_ptr is not updated. Fix this by saving the user mask in this case too. Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:20 -04:00
Phil Auld	bf73c54d24	sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs JIRA: https://issues.redhat.com/browse/RHEL-282 commit 5657c116783545fb49cd7004994c187128552b12 Author: Waiman Long <longman@redhat.com> Date: Sun Jan 15 14:31:22 2023 -0500 sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs The kernel commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP configs. Calling sched_setaffinity() on such a uniprocessor kernel will cause cpumask_copy() to be called with a NULL pointer leading to general protection fault. This is not really a problem in real use cases as there aren't that many uniprocessor kernel configs in use and calling sched_setaffinity() on such a uniprocessor system doesn't make sense. Fix this problem by making sure cpumask_copy() will not be called in such a case. Fixes: 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()") Reported-by: kernel test robot <yujie.liu@intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:20 -04:00
Phil Auld	ffd9ddbf5a	sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate() JIRA: https://issues.redhat.com/browse/RHEL-282 commit 160fb0d83f206b3429fc495864a022110f9e4978 Author: Chengming Zhou <zhouchengming@bytedance.com> Date: Fri Dec 23 18:32:57 2022 +0800 sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate() ttwu_do_activate() is used for a complete wakeup, in which we will activate_task() and use ttwu_do_wakeup() to mark the task runnable and perform wakeup-preemption, also call class->task_woken() callback and update the rq->idle_stamp. Since ttwu_runnable() is not a complete wakeup, don't need all those done in ttwu_do_wakeup(), so we can move those to ttwu_do_activate() to simplify ttwu_do_wakeup(), making it only mark the task runnable to be reused in ttwu_runnable() and try_to_wake_up(). This patch should not have any functional changes. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:20 -04:00
Phil Auld	a09a99cf93	sched/core: Micro-optimize ttwu_runnable() JIRA: https://issues.redhat.com/browse/RHEL-282 commit efe09385864f3441c71711f91e621992f9423c01 Author: Chengming Zhou <zhouchengming@bytedance.com> Date: Fri Dec 23 18:32:56 2022 +0800 sched/core: Micro-optimize ttwu_runnable() ttwu_runnable() is used as a fast wakeup path when the wakee task is running on CPU or runnable on RQ, in both cases we can just set its state to TASK_RUNNING to prevent a sleep. If the wakee task is on_cpu running, we don't need to update_rq_clock() or check_preempt_curr(). But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before the task got to schedule() and the task been preempted), we should check_preempt_curr() to see if it can preempt the current running. This also removes the class->task_woken() callback from ttwu_runnable(), which wasn't required per the RT/DL implementations: any required push operation would have been queued during class->set_next_task() when p got preempted. ttwu_runnable() also loses the update to rq->idle_stamp, as by definition the rq cannot be idle in this scenario. Suggested-by: Valentin Schneider <vschneid@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:19 -04:00
Phil Auld	4881a62e1d	sched: Make const-safe JIRA: https://issues.redhat.com/browse/RHEL-282 commit 904cbab71dda1689d41a240541179f21ff433c40 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Mon Dec 12 14:49:46 2022 +0000 sched: Make const-safe With a modified container_of() that preserves constness, the compiler finds some pointers which should have been marked as const. task_of() also needs to become const-preserving for the !FAIR_GROUP_SCHED case so that cfs_rq_of() can take a const argument. No change to generated code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:19 -04:00
Phil Auld	6a7d52383a	sched: Clear ttwu_pending after enqueue_task() JIRA: https://issues.redhat.com/browse/RHEL-282 commit d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991 Author: Tianchen Ding <dtcccc@linux.alibaba.com> Date: Fri Nov 4 10:36:01 2022 +0800 sched: Clear ttwu_pending after enqueue_task() We found a long tail latency in schbench when mt is close to nr_cpus. (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.) This is because when the wakee cpu is idle, rq->ttwu_pending is cleared too early, and idle_cpu() will return true until the wakee task is enqueued. This will mislead the waker when selecting idle cpu, and wake multiple worker threads on the same wakee cpu. This situation is enlarged by commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use wakelist. Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu (Intel(R) Xeon(R) Platinum 8369B). Latency percentiles (usec): base base+revert_f3dd3f674555 base+this_patch 50.0000th: 9 13 9 75.0000th: 12 19 12 90.0000th: 15 22 15 95.0000th: 18 24 17 99.0000th: 27 31 24 99.5000th: 3364 33 27 99.9000th: 12560 36 30 We also tested on unixbench and hackbench, and saw no performance change. Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:19 -04:00
Phil Auld	a774282315	sched/fair: Cleanup loop_max and loop_break JIRA: https://issues.redhat.com/browse/RHEL-282 commit c59862f8265f8060b6650ee1dc12159fe5c89779 Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Thu Aug 25 14:27:24 2022 +0200 sched/fair: Cleanup loop_max and loop_break sched_nr_migrate_break is set to a fixed value and never changes so we can replace it by a define SCHED_NR_MIGRATE_BREAK. Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value of sysctl_sched_nr_migrate which can be init to different values. Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate. The behavior stays unchanged unless you modify sysctl_sched_nr_migrate through debugfs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:18 -04:00
Phil Auld	a9fcc51032	sched: Add TASK_ANY for wait_task_inactive() JIRA: https://issues.redhat.com/browse/RHEL-282 Conflicts: Context differences caused by having PREEMPT_RT merged, specifically a015745ca41f ("sched: Consider task_struct::saved_state in wait_task_inactive()"). commit f9fc8cad9728124cefe8844fb53d1814c92c6bfc Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Sep 6 12:39:55 2022 +0200 sched: Add TASK_ANY for wait_task_inactive() Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:18 -04:00
Phil Auld	7ab9d04d74	sched: Rename task_running() to task_on_cpu() JIRA: https://issues.redhat.com/browse/RHEL-282 Conflicts: Context differences caused by having PREEMPT_RT merged, specifically a015745ca41f ("sched: Consider task_struct::saved_state in wait_task_inactive()"). commit 0b9d46fc5ef7a457cc635b30b010081228cb81ac Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Sep 6 12:33:04 2022 +0200 sched: Rename task_running() to task_on_cpu() There is some ambiguity about task_running() in that it is unrelated to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing task_on_cpu(). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:18 -04:00
Phil Auld	3f3a0eeee3	sched/fair: Allow changing cgroup of new forked task JIRA: https://issues.redhat.com/browse/RHEL-282 commit df16b71c686cb096774e30153c9ce6756450796c Author: Chengming Zhou <zhouchengming@bytedance.com> Date: Thu Aug 18 20:48:03 2022 +0800 sched/fair: Allow changing cgroup of new forked task commit `7dc603c902` ("sched/fair: Fix PELT integrity for new tasks") introduced a TASK_NEW state and an unnecessary limitation that would fail when changing cgroup of new forked task. Because at that time, we can't handle task_change_group_fair() for new forked fair task which hasn't been woken up by wake_up_new_task(), which will cause detach on an unattached task sched_avg problem. This patch deletes this unnecessary limitation by adding check before do detach or attach in task_change_group_fair(). So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks, only define it in #ifdef CONFIG_RT_GROUP_SCHED. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:17 -04:00
Phil Auld	fb17b0f886	sched/fair: Remove redundant cpu_cgrp_subsys->fork() JIRA: https://issues.redhat.com/browse/RHEL-282 commit 39c4261191bf05e7eb310f852980a6d0afe5582a Author: Chengming Zhou <zhouchengming@bytedance.com> Date: Thu Aug 18 20:47:58 2022 +0800 sched/fair: Remove redundant cpu_cgrp_subsys->fork() We use cpu_cgrp_subsys->fork() to set task group for the new fair task in cgroup_post_fork(). Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races") has already set_task_rq() for the new fair task in sched_cgroup_fork(), so cpu_cgrp_subsys->fork() can be removed. cgroup_can_fork() --> pin parent's sched_task_group sched_cgroup_fork() __set_task_cpu() set_task_rq() cgroup_post_fork() ss->fork() := cpu_cgroup_fork() sched_change_group(..., TASK_SET_GROUP) task_set_group_fair() set_task_rq() --> can be removed After this patch's change, task_change_group_fair() only needs to care about task cgroup migration, making the code much simpler. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:17 -04:00
Phil Auld	9b10d97986	sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() JIRA: https://issues.redhat.com/browse/RHEL-282 commit 09348d75a6ce60eec85c86dd0ab7babc4db3caf6 Author: Ingo Molnar <mingo@kernel.org> Date: Thu Aug 11 08:54:52 2022 +0200 sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:17 -04:00
Phil Auld	30180b878d	sched/fair: Make per-cpu cpumasks static JIRA: https://issues.redhat.com/browse/RHEL-282 commit 18c31c9711a90b48a77b78afb65012d9feec444c Author: Bing Huang <huangbing@kylinos.cn> Date: Sat Jul 23 05:36:09 2022 +0800 sched/fair: Make per-cpu cpumasks static The load_balance_mask and select_rq_mask percpu variables are only used in kernel/sched/fair.c. Make them static and move their allocation into init_sched_fair_class(). Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL class (local_cpu_mask_dl in init_sched_dl_class()). [ mingo: Tidied up changelog & touched up the code. ] Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:16 -04:00
Phil Auld	680e019203	sched/debug: Print each field value left-aligned in sched_show_task() JIRA: https://issues.redhat.com/browse/RHEL-282 commit 0f03d6805bfc454279169a1460abb3f6b3db317f Author: Zhen Lei <thunder.leizhen@huawei.com> Date: Wed Jul 27 14:08:19 2022 +0800 sched/debug: Print each field value left-aligned in sched_show_task() Currently, the values of some fields are printed right-aligned, causing the field value to be next to the next field name rather than next to its own field name. So print each field value left-aligned, to make it more readable. Before: stack: 0 pid: 307 ppid: 2 flags:0x00000008 After: stack:0 pid:308 ppid:2 flags:0x0000000a This also makes them print in the same style as the other two fields: task:demo0 state:R running task Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:16 -04:00
Phil Auld	78210daf7b	sched: Snapshot thread flags JIRA: https://issues.redhat.com/browse/RHEL-282 commit 0569b245132c40015281610353935a50e282eb94 Author: Mark Rutland <mark.rutland@arm.com> Date: Mon Nov 29 13:06:45 2021 +0000 sched: Snapshot thread flags Some thread flags can be set remotely, and so even when IRQs are disabled, the flags can change under our feet. Generally this is unlikely to cause a problem in practice, but it is somewhat unsound, and KCSAN will legitimately warn that there is a data race. To avoid such issues, a snapshot of the flags has to be taken prior to using them. Some places already use READ_ONCE() for that, others do not. Convert them all to the new flag accessor helpers. The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in set_nr_if_polling() is left as-is for clarity. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-18 09:34:16 -04:00
Phil Auld	cb73223615	sched/core: Adjusting the order of scanning CPU JIRA: https://issues.redhat.com/browse/RHEL-310 commit 8589018acc65e5ddfd111f0a7ee85f9afde3a830 Author: Hao Jia <jiahao.os@bytedance.com> Date: Fri Dec 16 14:24:06 2022 +0800 sched/core: Adjusting the order of scanning CPU When select_idle_capacity() starts scanning for an idle CPU, it starts with target CPU that has already been checked in select_idle_sibling(). So we start checking from the next CPU and try the target CPU at the end. Similarly for task_numa_assign(), we have just checked numa_migrate_on of dst_cpu, so start from the next CPU. This also works for steal_cookie_task(), the first scan must fail and start directly from the next one. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-17 16:14:35 -04:00
Phil Auld	11d3f0cf26	sched: Introduce struct balance_callback to avoid CFI mismatches JIRA: https://issues.redhat.com/browse/RHEL-310 commit 8e5bad7dccec2014f24497b57d8a8ee0b752c290 Author: Kees Cook <keescook@chromium.org> Date: Fri Oct 7 17:07:58 2022 -0700 sched: Introduce struct balance_callback to avoid CFI mismatches Introduce distinct struct balance_callback instead of performing function pointer casting which will trip CFI. Avoids warnings as found by Clang's future -Wcast-function-type-strict option: In file included from kernel/sched/core.c:84: kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict] head->func = (void (*)(struct callback_head *))func; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ No binary differences result from this change. This patch is a cleanup based on Brad Spengler/PaX Team's modifications to sched code in their last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Reported-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1724 Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-10 11:35:02 -04:00
Phil Auld	830c2b71ea	sched/uclamp: Fix fits_capacity() check in feec() JIRA: https://issues.redhat.com/browse/RHEL-310 commit 244226035a1f9b2b6c326e55ae5188fab4f428cb Author: Qais Yousef <qyousef@layalina.io> Date: Thu Aug 4 15:36:03 2022 +0100 sched/uclamp: Fix fits_capacity() check in feec() As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024, it'll always pick the previous CPU because fits_capacity() will always return false in this case. The new util_fits_cpu() logic should handle this correctly for us beside more corner cases where similar failures could occur, like when using UCLAMP_MAX. We open code uclamp_rq_util_with() except for the clamp() part, util_fits_cpu() needs the 'raw' values to be passed to it. Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp value for the rq. Makes the code more readable and ensures the right rules (use READ_ONCE/WRITE_ONCE) are respected transparently. [1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html Fixes: `1d42509e47` ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") Reported-by: Yun Hsiang <hsiang023167@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-10 11:35:02 -04:00
Phil Auld	20844af32a	sched/psi: Use task->psi_flags to clear in CPU migration JIRA: https://issues.redhat.com/browse/RHEL-311 commit 52b33d87b9197c51e8ffdc61873739d90dd0a16f Author: Chengming Zhou <zhouchengming@bytedance.com> Date: Mon Sep 26 16:19:31 2022 +0800 sched/psi: Use task->psi_flags to clear in CPU migration The commit `d583d360a6` ("psi: Fix psi state corruption when schedule() races with cgroup move") fixed a race problem by making cgroup_move_task() use task->psi_flags instead of looking at the scheduler state. We can extend task->psi_flags usage to CPU migration, which should be a minor optimization for performance and code simplicity. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220926081931.45420-1-zhouchengming@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-07 09:17:26 -04:00
Phil Auld	c345ba1c0c	sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure JIRA: https://issues.redhat.com/browse/RHEL-311 commit 52b1364ba0b105122d6de0e719b36db705011ac1 Author: Chengming Zhou <zhouchengming@bytedance.com> Date: Fri Aug 26 00:41:08 2022 +0800 sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure PSI already tracks workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have an obvious impact on the productivity of some workloads, such as web service workloads. When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to the CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, so nothing productive could run even if it were runnable; we therefore only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-04-07 09:17:25 -04:00
Phil Auld	e59312ac89	sched/core: Fix arch_scale_freq_tick() on tickless systems Bugzilla: https://bugzilla.redhat.com/1996625 commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f Author: Yair Podemsky <ypodemsk@redhat.com> Date: Wed Nov 30 14:51:21 2022 +0200 sched/core: Fix arch_scale_freq_tick() on tickless systems In order for the scheduler to be frequency invariant we measure the ratio between the maximum CPU frequency and the actual CPU frequency. During long tickless periods of time the calculations that keep track of that might overflow, in the function scale_freq_tick(): if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt)) goto error; eventually forcing the kernel to disable the feature for all CPUs, and show the warning message: "Scheduler frequency invariance went wobbly, disabling!". Let's avoid that by limiting the frequency invariant calculations to CPUs with regular tick. Fixes: `e2b0d619b4` ("x86, sched: check for counters overflow in frequency invariant accounting") Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Yair Podemsky <ypodemsk@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz> Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2023-03-30 11:52:21 -04:00
Waiman Long	415317267b	sched/debug: Show the registers of 'current' in dump_cpu_task() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516 commit bc1cca97e6da6c7c34db7c5b864bb354ca5305ac Author: Zhen Lei <thunder.leizhen@huawei.com> Date: Thu, 4 Aug 2022 10:34:20 +0800 sched/debug: Show the registers of 'current' in dump_cpu_task() The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 
0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 
27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com>	2023-03-30 08:47:58 -04:00
Waiman Long	c6babad818	sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516 commit e73dfe30930b75c98746152e7a2f6a8ab6067b51 Author: Zhen Lei <thunder.leizhen@huawei.com> Date: Thu, 4 Aug 2022 10:34:19 +0800 sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task() The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com>	2023-03-30 08:47:57 -04:00
Waiman Long	6ddf329bf5	context_tracking: Split user tracking Kconfig Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516 Conflicts: 1) All the hunks from arch/mips/Kconfig, arch/loongarch/Kconfig and arch/xtensa/* as they are unsupported arches and cannot be applied directly. 2) Context diffs in arch/arm/Kconfig, arch/riscv/Kconfig, arch/riscv/kernel/entry.S and arch/x86/Kconfig. commit 24a9c54182b3758801b8ca6c8c237cc2ff654732 Author: Frederic Weisbecker <frederic@kernel.org> Date: Wed, 8 Jun 2022 16:40:24 +0200 context_tracking: Split user tracking Kconfig Context tracking is going to be used not only to track user transitions but also idle/IRQs/NMIs. The user tracking part will then become a separate feature. Prepare Kconfig for that. [ frederic: Apply Max Filippov feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com>	2023-03-30 08:36:16 -04:00
Waiman Long	0ceb37b5ca	rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516 commit e386b6725798eec07facedf4d4bb710c079fd25c Author: Paul E. McKenney <paulmck@kernel.org> Date: Thu, 2 Jun 2022 17:30:01 -0700 rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs Currently, the RCU Tasks Trace grace-period kthread IPIs each online CPU using smp_call_function_single() in order to track any tasks currently in RCU Tasks Trace read-side critical sections during which the corresponding task has neither blocked nor been preempted. These IPIs are annoying and are also not strictly necessary because any task that blocks or is preempted within its current RCU Tasks Trace read-side critical section will be tracked on one of the per-CPU rcu_tasks_percpu structure's ->rtp_blkd_tasks list. So the only time that this is a problem is if one of the CPUs runs through a long-duration RCU Tasks Trace read-side critical section without a context switch. Note that the task_call_func() function cannot help here because there is no safe way to identify the target task. Of course, the task_call_func() function will be very useful later, when processing the list of tasks, but it needs to know the task. This commit therefore creates a cpu_curr_snapshot() function that returns a pointer the task_struct structure of some task that happened to be running on the specified CPU more or less during the time that the cpu_curr_snapshot() function was executing. If there was no context switch during this time, this function will return a pointer to the task_struct structure of the task that was running throughout. If there was a context switch, then the outgoing task will be taken care of by RCU's context-switch hook, and the incoming task was either already taken care during some previous context switch, or it is not currently within an RCU Tasks Trace read-side critical section. 
And in this latter case, the grace period already started, so there is no need to wait on this task. This new cpu_curr_snapshot() function is invoked on each CPU early in the RCU Tasks Trace grace-period processing, and the resulting tasks are queued for later quiescent-state inspection. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <kafai@fb.com> Cc: KP Singh <kpsingh@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2023-03-30 08:36:12 -04:00
Juri Lelli	6bc27040eb	sched: Add support for lazy preemption Bugzilla: https://bugzilla.redhat.com/2171995 Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git commit ea622076b76f25526278b448dc8326db01758c0a Author: Thomas Gleixner <tglx@linutronix.de> Date: Fri Oct 26 18:50:54 2012 +0100 sched: Add support for lazy preemption It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count each tasks sports now a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect as for lazyness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefor allows to exit the waking task the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. 
This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and re-enabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>	2023-02-27 13:46:09 +01:00
Juri Lelli	3673cc2e61	sched: Consider task_struct::saved_state in wait_task_inactive(). Bugzilla: https://bugzilla.redhat.com/2171995 Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git commit a015745ca41f057beb9650166271fc6188f33d9b Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Wed Jun 22 12:27:05 2022 +0200 sched: Consider task_struct::saved_state in wait_task_inactive(). Ptrace is using wait_task_inactive() to wait for the tracee to reach a certain task state. On PREEMPT_RT that state may be stored in task_struct::saved_state while the tracee blocks on a sleeping lock and task_struct::__state is set to TASK_RTLOCK_WAIT. It is not possible to check only for TASK_RTLOCK_WAIT to be sure that the task is blocked on a sleeping lock because during wake up (after the sleeping lock has been acquired) the task state is set to TASK_RUNNING. After the task is on CPU and has acquired the pi_lock it will reset the state accordingly but until then TASK_RUNNING will be observed (with the desired state saved in saved_state). Check also for task_struct::saved_state if the desired match was not found in task_struct::__state on PREEMPT_RT. If the state was found in saved_state, wait until the task is idle and the state is visible in task_struct::__state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/Yt%2FpQAFQ1xKNK0RY@linutronix.de Signed-off-by: Juri Lelli <juri.lelli@redhat.com>	2023-02-27 13:46:08 +01:00
Waiman Long	22c20d7c8a	sched/core: Use kfree_rcu() in do_set_cpus_allowed() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143847 Upstream Status: tip commit 9a5418bc48babb313d2a62df29ebe21ce8c06c59 commit 9a5418bc48babb313d2a62df29ebe21ce8c06c59 Author: Waiman Long <longman@redhat.com> Date: Fri, 30 Dec 2022 23:11:20 -0500 sched/core: Use kfree_rcu() in do_set_cpus_allowed() Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") may call kfree() if user_cpus_ptr was previously set. Unfortunately, some of the callers of do_set_cpus_allowed() may have pi_lock held when calling it. So the following splats may be printed especially when running with a PREEMPT_RT kernel: WARNING: possible circular locking dependency detected BUG: sleeping function called from invalid context To avoid these problems, kfree_rcu() is used instead. An internal cpumask_rcuhead union is created for the sole purpose of facilitating the use of kfree_rcu() to free the cpumask. Since user_cpus_ptr is not being used in non-SMP configs, the newly introduced alloc_user_cpus_ptr() helper will return NULL in this case and sched_setaffinity() is modified to handle this special case. Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20221231041120.440785-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2023-01-09 14:59:01 -05:00
Waiman Long	f3e0ad343d	sched/core: Fix use-after-free bug in dup_user_cpus_ptr() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143847 Upstream Status: tip commit 87ca4f9efbd7cc649ff43b87970888f2812945b8 commit 87ca4f9efbd7cc649ff43b87970888f2812945b8 Author: Waiman Long <longman@redhat.com> Date: Fri, 30 Dec 2022 23:11:19 -0500 sched/core: Fix use-after-free bug in dup_user_cpus_ptr() Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems"), the setting and clearing of user_cpus_ptr are done under pi_lock for arm64 architecture. However, dup_user_cpus_ptr() accesses user_cpus_ptr without any lock protection. Since sched_setaffinity() can be invoked from another process, the process being modified may be undergoing fork() at the same time. When racing with the clearing of user_cpus_ptr in __set_cpus_allowed_ptr_locked(), it can lead to use-after-free and possibly double-free in arm64 kernel. Commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask") fixes this problem as user_cpus_ptr, once set, will never be cleared in a task's lifetime. However, this bug was re-introduced in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in do_set_cpus_allowed(). This time, it will affect all arches. Fix this bug by always clearing the user_cpus_ptr of the newly cloned/forked task before the copying process starts and check the user_cpus_ptr state of the source task under pi_lock. Note to stable, this patch won't be applicable to stable releases. Just copy the new dup_user_cpus_ptr() function over. 
Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems") Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") Reported-by: David Wang 王标 <wangbiao3@xiaomi.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2023-01-09 14:58:58 -05:00
Frantisek Hrbata	e73e910ed6	Merge: Scheduler updates for 9.2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1582 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2115520 Depends: !1372 Tested: Scheduler stress and regression tests. Ran nohz full and numa balancing tests. Perf QE ran performance regression tests. Omitted-fix: 62ebaf2f9261 ("ath6kl: avoid flush_scheduled_work() usage") Not enabled in rhel config. Omitted-fix: 0538fa09bb10 ("gpu/drm/bridge/cadence: avoid flush_scheduled_work() usage") Omitted-fix: a4345557527f ("scsi: qla2xxx: Always wait for qlt_sess_work_fn() from qlt_stop_phase1()") These 3 just reference c4f135d64382 ("workqueue: Wrap flush_workqueue() using a macro") in commit log since they depend on it. Series to keep core scheduler code close to upstream linux and to apply potentially needed fixes. This brings the scheduler and some related code up to roughly 6.0. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-17 03:54:24 -05:00
Frantisek Hrbata	10ec0ed632	Merge: Update drivers/powercap to enable support for Arm SystemReady IR platforms MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1372 ``` Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2126952 This is one of a series of patch sets to enable Arm SystemReady IR support in the kernel for compliant platforms. This set cleans up powercap and enables DTPM for edge systems to use in thermal and power management; this is all in drivers/powercap. This set has been tested via simple boot tests, and of course the CI loop. This may be difficult to test on Arm due to DTPM being a very new feature. However, this is exactly the same powercap framework used by intel_rapl, which should continue to function properly regardless. Signed-off-by: Al Stone <ahs3@redhat.com> ``` Approved-by: David Arcari <darcari@redhat.com> Approved-by: Mark Langsdorf <mlangsdo@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-15 07:30:57 -05:00
Phil Auld	a6fd92ef44	sched/core: Do not requeue task on CPU excluded from cpus_mask Bugzilla: https://bugzilla.redhat.com/2115520 commit 751d4cbc43879229dbc124afefe240b70fd29a85 Author: Mel Gorman <mgorman@techsingularity.net> Date: Thu Aug 4 10:21:19 2022 +0100 sched/core: Do not requeue task on CPU excluded from cpus_mask The following warning was triggered on a large machine early in boot on a distribution kernel but the same problem should also affect mainline. WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440 Call Trace: <TASK> rescuer_thread+0x1f6/0x360 kthread+0x156/0x180 ret_from_fork+0x22/0x30 </TASK> Commit `c6e7bd7afa` ("sched/core: Optimize ttwu() spinning on p->on_cpu") optimises ttwu by queueing a task that is descheduling on the wakelist, but does not check if the task descheduling is still allowed to run on that CPU. In this warning, the problematic task is a workqueue rescue thread which checks if the rescue is for a per-cpu workqueue and running on the wrong CPU. While this is early in boot and it should be possible to create workers, the rescue thread may still be used if the MAYDAY_INITIAL_TIMEOUT is reached or MAYDAY_INTERVAL and on a sufficiently large machine, the rescue thread is being used frequently. Tracing confirmed that the task should have migrated properly using the stopper thread to handle the migration. However, a parallel wakeup from udev running on another CPU that does not share CPU cache observes p->on_cpu and uses task_cpu(p), queues the task on the old CPU and triggers the warning. Check that the wakee task that is descheduling is still allowed to run on its current CPU and if not, wait for the descheduling to complete and select an allowed CPU. 
Fixes: `c6e7bd7afa` ("sched/core: Optimize ttwu() spinning on p->on_cpu") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:41 -04:00
Phil Auld	303a0ad632	sched/core: Always flush pending blk_plug Bugzilla: https://bugzilla.redhat.com/2115520 commit 401e4963bf45c800e3e9ea0d3a0289d738005fd4 Author: John Keeping <john@metanate.com> Date: Fri Jul 8 17:27:02 2022 +0100 sched/core: Always flush pending blk_plug With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero): INFO: task kworker/u8:0:8 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 Workqueue: writeback wb_workfn (flush-7:0) [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174) [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] +(rt_mutex_slowlock.constprop.0+0xac/0x174) [<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54) [<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec) [<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c) [<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8) [<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4) [<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4) [<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c) [<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8) [<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178) [<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38) Exception stack(0xc10e3fb0 to 0xc10e3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 INFO: task tar:2083 blocked for more 
than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000 [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24) [<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30) [<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8) [<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0) [<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144) [<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4) [<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250) [<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148) [<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98) [<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec) [<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c) Exception stack(0xc2e1bfa8 to 0xc2e1bff0) bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000 bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190 bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc Here the kworker is waiting on msdos_sb_info::s_lock which is held by tar which is in turn waiting for a buffer which is locked waiting to be flushed, but this operation is plugged in the kworker. The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT. It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted. 
Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument"). As described in [1]: The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue. [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:40 -04:00
Phil Auld	a1b1e51378	sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling Bugzilla: https://bugzilla.redhat.com/2115520 commit c02d5546ea34d589c83eda5055dbd727a396642b Author: Uros Bizjak <ubizjak@gmail.com> Date: Wed Jun 29 17:15:52 2022 +0200 sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg. The definition of cmpxchg based fetch_or was changed in the same way as atomic_fetch_##op definitions were changed in `e6790e4b5d`. Also declare these two functions as inline to ensure inlining. In the case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart itself by constructing the boolean return value with logic operations on the fetched value, and these extra operations enlarge the function over the inlining threshold value. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:40 -04:00
Phil Auld	9798738b93	sched/fair: Rename select_idle_mask to select_rq_mask Bugzilla: https://bugzilla.redhat.com/2115520 commit ec4fc801a02d96180c597238fe87141471b70971 Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Thu Jun 23 11:11:02 2022 +0200 sched/fair: Rename select_idle_mask to select_rq_mask On 21/06/2022 11:04, Vincent Donnefort wrote: > From: Dietmar Eggemann <dietmar.eggemann@arm.com> https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered that this patch doesn't build anymore (on tip sched/core or linux-next) because of commit f5b2eeb499910 ("sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()"). New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to select_rq_mask below. -- >8 -- Decouple the name of the per-cpu cpumask select_idle_mask from its usage in select_idle_[cpu/capacity]() of the CFS run-queue selection (select_task_rq_fair()). This is to support the reuse of this cpumask in the Energy Aware Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue selection. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:39 -04:00
Phil Auld	41e6e3a50f	sched: only perform capability check on privileged operation Bugzilla: https://bugzilla.redhat.com/2115520 commit 700a78335fc28a59c307f420857fd2d4521549f8 Author: Christian Göttsche <cgzones@googlemail.com> Date: Wed Jun 15 17:25:04 2022 +0200 sched: only perform capability check on privileged operation sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler() a CAP_SYS_NICE audit event unconditionally, even when the requested operation does not require that capability / is unprivileged, i.e. for reducing niceness. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to allow the task the capability CAP_SYS_NICE, while it does not need it, violating the principle of least privilege. Conduct privileged/unprivileged categorization first and perform a capable test (and at most once) only if needed. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:39 -04:00
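The ordering change in commit 41e6e3a50f above (classify the request first, consult the capability only if the request is privileged) can be sketched as follows. This is an illustrative userspace model, not the kernel code: `capable_sys_nice`, `set_nice_sketch`, `audit_events` and `has_cap_sys_nice` are hypothetical names, and the privilege rule is simplified to "lowering the nice value needs CAP_SYS_NICE", ignoring RLIMIT_NICE and the other scheduling attributes.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: a counter for audit events and the task's capability. */
static int audit_events;
static bool has_cap_sys_nice;

/* Models a capability check that always leaves an audit trail when called. */
static bool capable_sys_nice(void)
{
    audit_events++;
    return has_cap_sys_nice;
}

/*
 * Simplified model: returns 0 on success, -1 when permission is denied.
 * The capability is consulted only after the request has been classified
 * as privileged, so unprivileged requests generate no audit event.
 */
static int set_nice_sketch(int cur_nice, int new_nice)
{
    bool privileged = new_nice < cur_nice; /* asking for higher priority */

    if (privileged && !capable_sys_nice())
        return -1;
    return 0;
}
```

With this ordering, a task that only raises its nice value never triggers the check, so an SELinux-style policy sees denials only for requests that really needed the capability.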
Phil Auld	f7aa98b454	sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle Bugzilla: https://bugzilla.redhat.com/2115520 commit f3dd3f674555bd9455c5ae7fafce0696bd9931b3 Author: Tianchen Ding <dtcccc@linux.alibaba.com> Date: Thu Jun 9 07:34:12 2022 +0800 sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle Wakelist can help avoid cache bouncing and offload the overhead of waker cpu. So far, using wakelist within the same llc only happens on WF_ON_CPU, and this limitation could be removed to further improve wakeup performance. The commit `518cd62341` ("sched: Only queue remote wakeups when crossing cache boundaries") disabled queuing tasks on wakelist when the cpus share llc. This is because, at that time, the scheduler must send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also supports TIF_POLLING, so this is not a problem now when the wakee cpu is in idle polling. Benefits: Queuing the task on idle cpu can help improving performance on waker cpu and utilization on wakee cpu, and further improve locality because the wakee cpu can handle its own rq. This patch helps improving rt on our real java workloads where wakeup happens frequently. Consider the normal condition (CPU0 and CPU1 share same llc) Before this patch: CPU0 CPU1 select_task_rq() idle rq_lock(CPU1->rq) enqueue_task(CPU1->rq) notify CPU1 (by sending IPI or CPU1 polling) resched() After this patch: CPU0 CPU1 select_task_rq() idle add to wakelist of CPU1 notify CPU1 (by sending IPI or CPU1 polling) rq_lock(CPU1->rq) enqueue_task(CPU1->rq) resched() We see CPU0 can finish its work earlier. It only needs to put task to wakelist and return. While CPU1 is idle, so let itself handle its own runqueue data. This patch brings no difference about IPI. This patch only takes effect when the wakee cpu is: 1) idle polling 2) idle not polling For 1), there will be no IPI with or without this patch. For 2), there will always be an IPI before or after this patch. 
Before this patch: waker cpu will enqueue task and check preempt. Since "idle" will be sure to be preempted, waker cpu must send a resched IPI. After this patch: waker cpu will put the task to the wakelist of wakee cpu, and send an IPI. Benchmark: We've tested schbench, unixbench, and hackbench on both x86 and arm64. On x86 (Intel Xeon Platinum 8269CY): schbench -m 2 -t 8 Latency percentiles (usec) before after 50.0000th: 8 6 75.0000th: 10 7 90.0000th: 11 8 95.0000th: 12 8 99.0000th: 13 10 99.5000th: 15 11 99.9000th: 18 14 Unixbench with full threads (104) before after Dhrystone 2 using register variables 3011862938 3009935994 -0.06% Double-Precision Whetstone 617119.3 617298.5 0.03% Execl Throughput 27667.3 27627.3 -0.14% File Copy 1024 bufsize 2000 maxblocks 785871.4 784906.2 -0.12% File Copy 256 bufsize 500 maxblocks 210113.6 212635.4 1.20% File Copy 4096 bufsize 8000 maxblocks 2328862.2 2320529.1 -0.36% Pipe Throughput 145535622.8 145323033.2 -0.15% Pipe-based Context Switching 3221686.4 3583975.4 11.25% Process Creation 101347.1 103345.4 1.97% Shell Scripts (1 concurrent) 120193.5 123977.8 3.15% Shell Scripts (8 concurrent) 17233.4 17138.4 -0.55% System Call Overhead 5300604.8 5312213.6 0.22% hackbench -g 1 -l 100000 before after Time 3.246 2.251 On arm64 (Ampere Altra): schbench -m 2 -t 8 Latency percentiles (usec) before after 50.0000th: 14 10 75.0000th: 19 14 90.0000th: 22 16 95.0000th: 23 16 99.0000th: 24 17 99.5000th: 24 17 99.9000th: 28 25 Unixbench with full threads (80) before after Dhrystone 2 using register variables 3536194249 3537019613 0.02% Double-Precision Whetstone 629383.6 629431.6 0.01% Execl Throughput 65920.5 65846.2 -0.11% File Copy 1024 bufsize 2000 maxblocks 1063722.8 1064026.8 0.03% File Copy 256 bufsize 500 maxblocks 322684.5 318724.5 -1.23% File Copy 4096 bufsize 8000 maxblocks 2348285.3 2328804.8 -0.83% Pipe Throughput 133542875.3 131619389.8 -1.44% Pipe-based Context Switching 3215356.1 3576945.1 11.25% Process Creation 108520.5 
120184.6 10.75% Shell Scripts (1 concurrent) 122636.3 121888 -0.61% Shell Scripts (8 concurrent) 17462.1 17381.4 -0.46% System Call Overhead 4429998.9 4435006.7 0.11% hackbench -g 1 -l 100000 before after Time 4.217 2.916 Our patch has improvement on schbench, hackbench and Pipe-based Context Switching of unixbench when there exists idle cpus, and no obvious regression on other tests of unixbench. This can help improve rt in scenes where wakeup happens frequently. Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:39 -04:00
Phil Auld	eb73b8ff33	sched: Fix the check of nr_running at queue wakelist Bugzilla: https://bugzilla.redhat.com/2115520 commit 28156108fecb1f808b21d216e8ea8f0d205a530c Author: Tianchen Ding <dtcccc@linux.alibaba.com> Date: Thu Jun 9 07:34:11 2022 +0800 sched: Fix the check of nr_running at queue wakelist The commit `2ebb177175` ("sched/core: Offload wakee task activation if it the wakee is descheduling") checked rq->nr_running <= 1 to avoid task stacking when WF_ON_CPU. Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through the deactivate_task() in __schedule(), thus p has been accounted out of rq->nr_running. As such, the task being the only runnable task on the rq implies reading rq->nr_running == 0 at that point. The benchmark result is in [1]. [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/ Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:39 -04:00
Phil Auld	e208332528	sched: Reverse sched_class layout Bugzilla: https://bugzilla.redhat.com/2115520 commit 546a3fee174969ff323d70ff27b1ef181f0d7ceb Author: Peter Zijlstra <peterz@infradead.org> Date: Tue May 17 13:46:54 2022 +0200 sched: Reverse sched_class layout Because GCC-12 is fully stupid about array bounds and it's just really hard to get a solid array definition from a linker script, flip the array order to avoid needing negative offsets :-/ This makes the whole relational pointer magic a little less obvious, but alas. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/YoOLLmLG7HRTXeEm@hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:38 -04:00
Phil Auld	216aaa830b	sched/core: Avoid obvious double update_rq_clock warning Bugzilla: https://bugzilla.redhat.com/2115520 commit 2679a83731d51a744657f718fc02c3b077e47562 Author: Hao Jia <jiahao.os@bytedance.com> Date: Sat Apr 30 16:58:42 2022 +0800 sched/core: Avoid obvious double update_rq_clock warning When we use raw_spin_rq_lock() to acquire the rq lock and have to update the rq clock while holding the lock, the kernel may issue a WARN_DOUBLE_CLOCK warning. Since we directly use raw_spin_rq_lock() to acquire rq lock instead of rq_lock(), there is no corresponding change to rq->clock_update_flags. In particular, we have obtained the rq lock of other CPUs, the rq->clock_update_flags of this CPU may be RQCF_UPDATED at this time, and then calling update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning. So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid the WARN_DOUBLE_CLOCK warning. For the sched_rt_period_timer() and migrate_task_rq_dl() cases we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with rq_lock()/rq_unlock(). For the {pull,push}_{rt,dl}_task() cases, we add the double_rq_clock_clear_update() function to clear RQCF_UPDATED of rq->clock_update_flags, and call double_rq_clock_clear_update() before double_lock_balance()/double_rq_lock() returns to avoid the WARN_DOUBLE_CLOCK warning. Some call trace reports: Call Trace 1: <IRQ> sched_rt_period_timer+0x10f/0x3a0 ? enqueue_top_rt_rq+0x110/0x110 __hrtimer_run_queues+0x1a9/0x490 hrtimer_interrupt+0x10b/0x240 __sysvec_apic_timer_interrupt+0x8a/0x250 sysvec_apic_timer_interrupt+0x9a/0xd0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 Call Trace 2: <TASK> activate_task+0x8b/0x110 push_rt_task.part.108+0x241/0x2c0 push_rt_tasks+0x15/0x30 finish_task_switch+0xaa/0x2e0 ? __switch_to+0x134/0x420 __schedule+0x343/0x8e0 ? hrtimer_start_range_ns+0x101/0x340 schedule+0x4e/0xb0 do_nanosleep+0x8e/0x160 hrtimer_nanosleep+0x89/0x120 ? 
hrtimer_init_sleeper+0x90/0x90 __x64_sys_nanosleep+0x96/0xd0 do_syscall_64+0x34/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Call Trace 3: <TASK> deactivate_task+0x93/0xe0 pull_rt_task+0x33e/0x400 balance_rt+0x7e/0x90 __schedule+0x62f/0x8e0 do_task_dead+0x3f/0x50 do_exit+0x7b8/0xbb0 do_group_exit+0x2d/0x90 get_signal+0x9df/0x9e0 ? preempt_count_add+0x56/0xa0 ? __remove_hrtimer+0x35/0x70 arch_do_signal_or_restart+0x36/0x720 ? nanosleep_copyout+0x39/0x50 ? do_nanosleep+0x131/0x160 ? audit_filter_inodes+0xf5/0x120 exit_to_user_mode_prepare+0x10f/0x1e0 syscall_exit_to_user_mode+0x17/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Call Trace 4: update_rq_clock+0x128/0x1a0 migrate_task_rq_dl+0xec/0x310 set_task_cpu+0x84/0x1e4 try_to_wake_up+0x1d8/0x5c0 wake_up_process+0x1c/0x30 hrtimer_wakeup+0x24/0x3c __hrtimer_run_queues+0x114/0x270 hrtimer_interrupt+0xe8/0x244 arch_timer_handler_phys+0x30/0x50 handle_percpu_devid_irq+0x88/0x140 generic_handle_domain_irq+0x40/0x60 gic_handle_irq+0x48/0xe0 call_on_irq_stack+0x2c/0x60 do_interrupt_handler+0x80/0x84 Steps to reproduce: 1. Enable CONFIG_SCHED_DEBUG when compiling the kernel 2. echo 1 > /sys/kernel/debug/clear_warn_once echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features 3. Run some rt/dl tasks that periodically work and sleep, e.g. Create 2*n rt or dl (90% running) tasks via rt-app (on a system with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running on PREEMPT_RT kernel. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:38 -04:00
Phil Auld	26b81fc091	sched: Fix build warning without CONFIG_SYSCTL Bugzilla: https://bugzilla.redhat.com/2115520 commit 494dcdf46e5cdee926c9f441d37e3ea1db57d1da Author: YueHaibing <yuehaibing@huawei.com> Date: Wed Apr 27 21:10:02 2022 +0800 sched: Fix build warning without CONFIG_SYSCTL IF CONFIG_SYSCTL is n, build warn: kernel/sched/core.c:1782:12: warning: ‘sysctl_sched_uclamp_handler’ defined but not used [-Wunused-function] static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write, ^~~~~~~~~~~~~~~~~~~~~~~~~~~ sysctl_sched_uclamp_handler() is used while CONFIG_SYSCTL enabled, wrap all related code with CONFIG_SYSCTL to fix this. Fixes: 3267e0156c33 ("sched: Move uclamp_util sysctls to core.c") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:37 -04:00
Phil Auld	ad01f5836e	sched: Move uclamp_util sysctls to core.c Bugzilla: https://bugzilla.redhat.com/2115520 commit 3267e0156c3341ac25b37a0f60551cdae1634b60 Author: Zhen Ni <nizhen@uniontech.com> Date: Tue Feb 15 19:46:02 2022 +0800 sched: Move uclamp_util sysctls to core.c move uclamp_util sysctls to core.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:36 -04:00
Phil Auld	a3912174ad	sched: Move rt_period/runtime sysctls to rt.c Bugzilla: https://bugzilla.redhat.com/2115520 commit d9ab0e63fa7f8405fbb19e28c5191e0880a7f2db Author: Zhen Ni <nizhen@uniontech.com> Date: Tue Feb 15 19:45:59 2022 +0800 sched: Move rt_period/runtime sysctls to rt.c move rt_period/runtime sysctls to rt.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:36 -04:00
Phil Auld	da09e6d005	sched: Move schedstats sysctls to core.c Bugzilla: https://bugzilla.redhat.com/2115520 commit f5ef06d58be8311a9425e6a54a053ecb350952f3 Author: Zhen Ni <nizhen@uniontech.com> Date: Tue Feb 15 19:45:58 2022 +0800 sched: Move schedstats sysctls to core.c move schedstats sysctls to core.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:14:35 -04:00
Phil Auld	4241f8765f	Merge remote-tracking branch 'origin/merge-requests/1372' into bz2115520 Signed-off-by: Phil Auld <pauld@redhat.com>	2022-11-04 13:13:30 -04:00
Waiman Long	a8b188fafd	sched: Always clear user_cpus_ptr in do_set_cpus_allowed() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354 Upstream Status: tip commit 851a723e45d1c4c8f6f7b0d2cfbc5f53690bb4e9 commit 851a723e45d1c4c8f6f7b0d2cfbc5f53690bb4e9 Author: Waiman Long <longman@redhat.com> Date: Thu, 22 Sep 2022 14:00:41 -0400 sched: Always clear user_cpus_ptr in do_set_cpus_allowed() The do_set_cpus_allowed() function is used by either kthread_bind() or select_fallback_rq(). In both cases the user affinity (if any) should be destroyed too. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-6-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-10-27 19:48:17 -04:00
Waiman Long	a2add19e1a	sched: Enforce user requested affinity Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354 Upstream Status: tip commit da019032819a1f09943d3af676892ec8c627668e Conflicts: A merge conflict in kernel/sched/sched.h due to the presence of RH_KABI code requiring manual merge. commit da019032819a1f09943d3af676892ec8c627668e Author: Waiman Long <longman@redhat.com> Date: Thu, 22 Sep 2022 14:00:39 -0400 sched: Enforce user requested affinity It was found that the user requested affinity via sched_setaffinity() can be easily overwritten by other kernel subsystems without an easy way to reset it back to what the user requested. For example, any change to the current cpuset hierarchy may reset the cpumask of the tasks in the affected cpusets to the default cpuset value even if those tasks have pre-existing user requested affinity. That is especially easy to trigger under a cgroup v2 environment where writing "+cpuset" to the root cgroup's cgroup.subtree_control file will reset the cpus affinity of all the processes in the system. That is problematic in a nohz_full environment where the tasks running in the nohz_full CPUs usually have their cpus affinity explicitly set and will behave incorrectly if cpus affinity changes. Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr() and use it to restrict the given cpumask unless there is no overlap. In that case, it will fall back to the given one. The SCA_USER flag is reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr masking should be skipped. In addition, masking should also be skipped if any of the SCA_MIGRATE_* flag is set. All callers of set_cpus_allowed_ptr() will be affected by this change. A scratch cpumask is added to percpu runqueues structure for doing additional masking when user_cpus_ptr is set. 
Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-4-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-10-27 19:48:06 -04:00
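The masking rule described in commit a2add19e1a above (restrict the requested cpumask by the saved user mask, fall back to the requested mask when the two do not overlap, and skip masking when certain flags are set) can be sketched with plain 64-bit bitmasks. This is an illustrative model only: `effective_cpumask_sketch`, the `SKETCH_SCA_*` flag values and the use of `uint64_t` in place of `struct cpumask` are assumptions, and the single migrate flag stands in for the whole SCA_MIGRATE_* family.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define SKETCH_SCA_USER    (1u << 0) /* caller is setting the user mask itself */
#define SKETCH_SCA_MIGRATE (1u << 1) /* stands in for any SCA_MIGRATE_* flag   */

/*
 * Compute the cpumask that would actually be applied.
 * new_mask:      mask requested by the caller (for example a cpuset update)
 * user_mask:     affinity previously requested through sched_setaffinity()
 * has_user_mask: false when no user affinity has been recorded
 */
static uint64_t effective_cpumask_sketch(uint64_t new_mask, uint64_t user_mask,
                                         bool has_user_mask, unsigned int flags)
{
    uint64_t both;

    /* Masking is skipped when the user mask is being set or on migration. */
    if (!has_user_mask || (flags & (SKETCH_SCA_USER | SKETCH_SCA_MIGRATE)))
        return new_mask;

    both = new_mask & user_mask;

    /* No overlap: fall back to the mask that was given. */
    return both ? both : new_mask;
}
```

For example, a cpuset change that offers CPUs 0-7 to a task whose user affinity is CPUs 2-3 leaves the task on CPUs 2-3, while a change that offers only CPUs 4-7 moves it to CPUs 4-7 because the intersection is empty.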
Waiman Long	b5b3deb05e	sched: Always preserve the user requested cpumask Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354 Upstream Status: tip commit 8f9ea86fdf99b81458cc21fc1c591fcd4a0fa1f4 commit 8f9ea86fdf99b81458cc21fc1c591fcd4a0fa1f4 Author: Waiman Long <longman@redhat.com> Date: Thu, 22 Sep 2022 14:00:38 -0400 sched: Always preserve the user requested cpumask Unconditionally preserve the user requested cpumask on sched_setaffinity() calls. This allows using it outside of the fairly narrow restrict_cpus_allowed_ptr() use-case and fix some cpuset issues that currently suffer destruction of cpumasks. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-10-27 19:44:44 -04:00
Waiman Long	8a370625e9	sched: Introduce affinity_context Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354 Upstream Status: tip commit 713a2e21a5137e96d2594f53d19784ffde3ddbd0 commit 713a2e21a5137e96d2594f53d19784ffde3ddbd0 Author: Waiman Long <longman@redhat.com> Date: Thu, 22 Sep 2022 14:00:40 -0400 sched: Introduce affinity_context In order to prepare for passing through additional data through the affinity call-chains, convert the mask and flags argument into a structure. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-5-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-10-27 19:44:41 -04:00
Waiman Long	31d9c33c0e	sched: Add __releases annotations to affine_move_task() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107354 Upstream Status: tip commit 5584e8ac2c68280e5ac31d231c23cdb7dfa225db commit 5584e8ac2c68280e5ac31d231c23cdb7dfa225db Author: Waiman Long <longman@redhat.com> Date: Thu, 22 Sep 2022 14:00:37 -0400 sched: Add __releases annotations to affine_move_task() affine_move_task() assumes task_rq_lock() has been called and it does an implicit task_rq_unlock() before returning. Add the appropriate __releases annotations to make this clear. A typo error in comment is also fixed. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-10-27 19:44:37 -04:00
Al Stone	0d2d511544	sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2126952 Tested: This is one of a series of patch sets to enable Arm SystemReady IR support in the kernel for compliant platforms. This set cleans up powercap and enables DTPM for edge systems to use in thermal and power management; this is all in drivers/powercap. This set has been tested via simple boot tests, and of course the CI loop. This may be difficult to test on Arm due to DTPM being a very new feature. However, this is exactly the same powercap framework used by intel_rapl, which should continue to function properly regardless. commit bb4479994945e9170534389a7762eb56149320ac Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Tue Jun 21 10:04:10 2022 +0100 sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a compile-time constant (SCHED_CAPACITY_SCALE). 
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com (cherry picked from commit bb4479994945e9170534389a7762eb56149320ac) Signed-off-by: Al Stone <ahs3@redhat.com>	2022-10-24 09:08:12 -06:00
Chris von Recklinghausen	63534db797	NUMA balancing: optimize page placement for memory tiering system Bugzilla: https://bugzilla.redhat.com/2120352 commit c574bbe917036c8968b984c82c7b13194fe5ce98 Author: Huang Ying <ying.huang@intel.com> Date: Tue Mar 22 14:46:23 2022 -0700 NUMA balancing: optimize page placement for memory tiering system With the advent of various new memory types, some machines will have multiple types of memory, e.g. DRAM and PMEM (persistent memory). The memory subsystem of these machines can be called memory tiering system, because the performance of the different types of memory are usually different. In such system, because of the memory accessing pattern changing etc, some pages in the slow memory may become hot globally. So in this patch, the NUMA balancing mechanism is enhanced to optimize the page placement among the different memory types according to hot/cold dynamically. In a typical memory tiering system, there are CPUs, fast memory and slow memory in each physical NUMA node. The CPUs and the fast memory will be put in one logical node (called fast memory node), while the slow memory will be put in another (faked) logical node (called slow memory node). That is, the fast memory is regarded as local while the slow memory is regarded as remote. So it's possible for the recently accessed pages in the slow memory node to be promoted to the fast memory node via the existing NUMA balancing mechanism. The original NUMA balancing mechanism will stop to migrate pages if the free memory of the target node becomes below the high watermark. This is a reasonable policy if there's only one memory type. But this makes the original NUMA balancing mechanism almost do not work to optimize page placement among different memory types. Details are as follows. It's the common cases that the working-set size of the workload is larger than the size of the fast memory nodes. Otherwise, it's unnecessary to use the slow memory at all. 
So, there are almost always no enough free pages in the fast memory nodes, so that the globally hot pages in the slow memory node cannot be promoted to the fast memory node. To solve the issue, we have 2 choices as follows, a. Ignore the free pages watermark checking when promoting hot pages from the slow memory node to the fast memory node. This will create some memory pressure in the fast memory node, thus trigger the memory reclaiming. So that, the cold pages in the fast memory node will be demoted to the slow memory node. b. Define a new watermark called wmark_promo which is higher than wmark_high, and have kswapd reclaiming pages until free pages reach such watermark. The scenario is as follows: when we want to promote hot-pages from a slow memory to a fast memory, but fast memory's free pages would go lower than high watermark with such promotion, we wake up kswapd with wmark_promo watermark in order to demote cold pages and free us up some space. So, next time we want to promote hot-pages we might have a chance of doing so. The choice "a" may create high memory pressure in the fast memory node. If the memory pressure of the workload is high, the memory pressure may become so high that the memory allocation latency of the workload is influenced, e.g. the direct reclaiming may be triggered. The choice "b" works much better at this aspect. If the memory pressure of the workload is high, the hot pages promotion will stop earlier because its allocation watermark is higher than that of the normal memory allocation. So in this patch, choice "b" is implemented. A new zone watermark (WMARK_PROMO) is added. Which is larger than the high watermark and can be controlled via watermark_scale_factor. In addition to the original page placement optimization among sockets, the NUMA balancing mechanism is extended to be used to optimize page placement according to hot/cold among different memory types. 
So the sysctl user space interface (numa_balancing) is extended in a backward compatible way as follow, so that the users can enable/disable these functionality individually. The sysctl is converted from a Boolean value to a bits field. The definition of the flags is, - 0: NUMA_BALANCING_DISABLED - 1: NUMA_BALANCING_NORMAL - 2: NUMA_BALANCING_MEMORY_TIERING We have tested the patch with the pmbench memory accessing benchmark with the 80:20 read/write ratio and the Gauss access address distribution on a 2 socket Intel server with Optane DC Persistent Memory Model. The test results shows that the pmbench score can improve up to 95.9%. Thanks Andrew Morton to help fix the document format error. Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:54 -04:00
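The conversion of the numa_balancing sysctl from a Boolean to a bit field, as described in commit 63534db797 above, can be sketched like this. The three flag values are the ones listed in the commit message; the helper names `numa_normal_enabled` and `numa_tiering_enabled` are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Flag values as listed in the commit message. */
#define NUMA_BALANCING_DISABLED       0x0
#define NUMA_BALANCING_NORMAL         0x1
#define NUMA_BALANCING_MEMORY_TIERING 0x2

/* Hypothetical helpers: test one bit of the sysctl value each. */
static bool numa_normal_enabled(unsigned int mode)
{
    return (mode & NUMA_BALANCING_NORMAL) != 0;
}

static bool numa_tiering_enabled(unsigned int mode)
{
    return (mode & NUMA_BALANCING_MEMORY_TIERING) != 0;
}
```

Because the old Boolean values 0 and 1 map onto "disabled" and "normal only", existing configurations keep their meaning, which is the backward compatibility the message refers to. A value of 3 would enable both behaviours, and 2 would enable only the memory tiering placement.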
Chris von Recklinghausen	2d6179b3cd	kthread: Generalize pf_io_worker so it can point to struct kthread Bugzilla: https://bugzilla.redhat.com/2120352 commit e32cf5dfbe227b355776948b2c9b5691b84d1cbd Author: Eric W. Biederman <ebiederm@xmission.com> Date: Wed Dec 22 22:10:09 2021 -0600 kthread: Generalize pf_io_worker so it can point to struct kthread The point of using set_child_tid to hold the kthread pointer was that it already did what is necessary. There are now restrictions on when set_child_tid can be initialized and when set_child_tid can be used in schedule_tail. Which indicates that continuing to use set_child_tid to hold the kthread pointer is a bad idea. Instead of continuing to use the set_child_tid field of task_struct generalize the pf_io_worker field of task_struct and use it to hold the kthread pointer. Rename pf_io_worker (which is a void * pointer) to worker_private so it can be used to store kthreads struct kthread pointer. Update the kthread code to store the kthread pointer in the worker_private field. Remove the places where set_child_tid had to be dealt with carefully because kthreads also used it. Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:35 -04:00
Chris von Recklinghausen	34dce2be8d	kthread: Never put_user the set_child_tid address Bugzilla: https://bugzilla.redhat.com/2120352 commit 00580f03af5eb2a527875b4a80a5effd95bda2fa Author: Eric W. Biederman <ebiederm@xmission.com> Date: Wed Dec 22 16:57:50 2021 -0600 kthread: Never put_user the set_child_tid address Kernel threads abuse set_child_tid. Historically that has been fine as set_child_tid was initialized after the kernel thread had been forked. Unfortunately storing struct kthread in set_child_tid after the thread is running makes struct kthread unusable for storing result codes of the thread. When set_child_tid is set to struct kthread during fork that results in schedule_tail writing the thread id to the beginning of struct kthread (if put_user does not realize it is a kernel address). Solve this by skipping the put_user for all kthreads. Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:34 -04:00
Chris von Recklinghausen	5b51e7ace6	kthread: Warn about failed allocations for the init kthread Bugzilla: https://bugzilla.redhat.com/2120352 commit dd621ee0cf8eb32445c8f5f26d3b7555953071d8 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Tue Dec 21 11:41:14 2021 -0600 kthread: Warn about failed allocations for the init kthread Failed allocations are not expected when setting up the initial task and it is not really possible to handle them either. So I added a warning to report if such an allocation failure ever happens. Correct the sense of the warning so it warns when an allocation failure happens, not when the allocation succeeded. Oops. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Link: https://lkml.kernel.org/r/20211221231611.785b74cf@canb.auug.org.au Link: https://lkml.kernel.org/r/CA+G9fYvLaR5CF777CKeWTO+qJFTN6vAvm95gtzN+7fw3Wi5hkA@mail.gmail.com Link: https://lkml.kernel.org/r/20211216102956.GC10708@xsang-OptiPlex-9020 Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:34 -04:00
Chris von Recklinghausen	e1e51160dc	kthread: Ensure struct kthread is present for all kthreads Bugzilla: https://bugzilla.redhat.com/2120352 commit 40966e316f86b8cfd83abd31ccb4df729309d3e7 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Thu Dec 2 09:56:14 2021 -0600 kthread: Ensure struct kthread is present for all kthreads Today the rules are a bit iffy and arbitrary about which kernel threads have struct kthread present. Both idle threads and threads started with create_kthread want struct kthread present so that is effectively all kernel threads. Make the rule that if PF_KTHREAD and the task is running then struct kthread is present. This will allow the kernel thread code to use tsk->exit_code with different semantics from ordinary processes. To ensure that struct kthread is present for all kernel threads move its allocation into copy_process. Add a deallocation of struct kthread in exec for processes that were kernel threads. Move the allocation of struct kthread for the initial thread earlier so that it is not repeated for each additional idle thread. Move the initialization of struct kthread into set_kthread_struct so that the structure is always and reliably initialized. Clear set_child_tid in free_kthread_struct to ensure the kthread struct is reliably freed during exec. The function free_kthread_struct does not need to clear vfork_done during exec as exec_mm_release called from exec_mmap has already cleared vfork_done. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:33 -04:00
Frantisek Hrbata	37715a7ab5	Merge: Backport scheduler related v5.19 and earlier commits for kernel-rt MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1319 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120671 Tested: By me with scheduler stress tests. Series of prerequisites for the RT patch set that touches scheduler code. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-09-27 08:47:30 -04:00
Phil Auld	035866b87a	smp: Rename flush_smp_call_function_from_idle() Bugzilla: https://bugzilla.redhat.com/2120671 commit 16bf5a5e1ec56474ed2a19d72f272ed09a5d3ea1 Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed Apr 13 15:31:03 2022 +0200 smp: Rename flush_smp_call_function_from_idle() This is invoked from the stopper thread too, which is definitely not idle. Rename it to flush_smp_call_function_queue() and fixup the callers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220413133024.305001096@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-09-08 11:25:07 -04:00
Phil Auld	bb91a8ff6d	sched: Fix missing prototype warnings Bugzilla: https://bugzilla.redhat.com/2120671 commit d664e399128bd78b905ff480917e2c2d4949e101 Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed Apr 13 15:31:02 2022 +0200 sched: Fix missing prototype warnings A W=1 build emits more than a dozen missing prototype warnings related to scheduler and scheduler specific includes. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-09-08 11:25:07 -04:00
Waiman Long	d42238049b	preempt/dynamic: Introduce preemption model accessors Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491 commit cfe43f478b79ba45573ca22d52d0d8823be068fa Author: Valentin Schneider <vschneid@redhat.com> Date: Fri, 12 Nov 2021 18:52:01 +0000 preempt/dynamic: Introduce preemption model accessors CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either: o The build-time preemption model when !PREEMPT_DYNAMIC o The default boot-time preemption model when PREEMPT_DYNAMIC IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual model could have been set to something else by the "preempt=foo" cmdline parameter. Same problem applies to CONFIG_PREEMPTION. Introduce a set of helpers to determine the actual preemption model used by the live kernel. Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20211112185203.280040-3-valentin.schneider@arm.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-08-30 17:21:52 -04:00
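A note on the accessor idea in the entry above: the following is a minimal userspace sketch, not the kernel's implementation. The enum, `DEFAULT_MODEL`, `set_preempt_model()` and the `model_is_*()` helpers are all hypothetical names chosen for illustration. It shows why a build-time default cannot answer "which model is live?" once a boot parameter can override it, and why callers should ask an accessor instead.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel definitions. */
enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };

/* Stands in for the build-time default that a CONFIG_PREEMPT_* option expresses. */
#define DEFAULT_MODEL MODEL_VOLUNTARY

/* The live model: starts at the default and may be overridden at "boot". */
static enum preempt_model live_model = DEFAULT_MODEL;

/*
 * Parse a "preempt=" style value. Returns false and leaves the live
 * model unchanged when the value is not recognised.
 */
static bool set_preempt_model(const char *arg)
{
    if (strcmp(arg, "none") == 0)
        live_model = MODEL_NONE;
    else if (strcmp(arg, "voluntary") == 0)
        live_model = MODEL_VOLUNTARY;
    else if (strcmp(arg, "full") == 0)
        live_model = MODEL_FULL;
    else
        return false;
    return true;
}

/* Accessors report the live model, never the build-time default. */
static bool model_is_none(void)      { return live_model == MODEL_NONE; }
static bool model_is_voluntary(void) { return live_model == MODEL_VOLUNTARY; }
static bool model_is_full(void)      { return live_model == MODEL_FULL; }
```

Code that tests `DEFAULT_MODEL` directly keeps reporting "voluntary" after `set_preempt_model("full")`, which is the mismatch the commit message describes for PREEMPT_DYNAMIC kernels.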
Waiman Long	1a0eb66558	sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2104946 Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/urgent&id=b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 Author: Waiman Long <longman@redhat.com> Date: Tue, 2 Aug 2022 21:54:51 -0400 sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating that the cpuset will just use the effective CPUs of its parent. So cpuset_can_attach() can call task_can_attach() with an empty mask. This can lead to cpumask_any_and() returning nr_cpu_ids, causing the call to dl_bw_of() to crash due to percpu value access of an out-of-bounds CPU value. For example: [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0 : [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0 : [80468.207946] Call Trace: [80468.208947] cpuset_can_attach+0xa0/0x140 [80468.209953] cgroup_migrate_execute+0x8c/0x490 [80468.210931] cgroup_update_dfl_csses+0x254/0x270 [80468.211898] cgroup_subtree_control_write+0x322/0x400 [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0 [80468.213777] new_sync_write+0x11f/0x1b0 [80468.214689] vfs_write+0x1eb/0x280 [80468.215592] ksys_write+0x5f/0xe0 [80468.216463] do_syscall_64+0x5c/0x80 [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix that by using effective_cpus instead. For cgroup v1, effective_cpus is the same as cpus_allowed. For v2, effective_cpus is the real cpumask to be used by tasks within the cpuset anyway. Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to reflect the change. In addition, a check is added to task_can_attach() to guard against the possibility that cpumask_any_and() may return a value >= nr_cpu_ids. Fixes: `7f51412a41` ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com>	2022-08-03 10:41:05 -04:00
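The guard described in the entry above can be sketched in userspace C. This is an illustration only: `NR_CPU_IDS`, `mask_any_and()`, `per_cpu_bw` and `can_attach()` are hypothetical stand-ins, and CPU masks are modelled as plain 32-bit words rather than the kernel's cpumask type. The point is the ordering: check the "no CPU found" result before using it as a per-CPU index.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPU_IDS 8u /* hypothetical CPU count for this model */

/*
 * Return the lowest CPU set in both masks, or NR_CPU_IDS when the
 * intersection is empty (the "no CPU found" convention from the
 * commit message).
 */
static unsigned int mask_any_and(uint32_t a, uint32_t b)
{
    uint32_t both = a & b;

    for (unsigned int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (both & (1u << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/* Hypothetical per-CPU data; indexing it with NR_CPU_IDS would be out of bounds. */
static int per_cpu_bw[NR_CPU_IDS];

/* Refuse the attach instead of indexing per-CPU data with an invalid CPU. */
static bool can_attach(uint32_t active_mask, uint32_t cs_effective_cpus)
{
    unsigned int cpu = mask_any_and(active_mask, cs_effective_cpus);

    if (cpu >= NR_CPU_IDS)
        return false;
    return per_cpu_bw[cpu] >= 0;
}
```

With an empty effective mask, `can_attach(0xFF, 0x00)` returns false rather than reading past the end of `per_cpu_bw`.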
Patrick Talbert	5cbac754a7	Merge: sched: Fix balance_push() vs __sched_setscheduler() MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1059 Bugzilla: https://bugzilla.redhat.com/2100215 Tested: Ran cpu hot[un]plug for 24+ hours while stress tests were running. commit 04193d590b390ec7a0592630f46d559ec6564ba1 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Jun 7 22:41:55 2022 +0200 sched: Fix balance_push() vs __sched_setscheduler() The purpose of balance_push() is to act as a filter on task selection in the case of CPU hotplug, specifically when taking the CPU out. It does this by (ab)using the balance callback infrastructure, with the express purpose of keeping all the unlikely/odd cases in a single place. In order to serve its purpose, the balance_push_callback needs to be (exclusively) on the callback list at all times (noting that the callback always places itself back on the list the moment it runs, also noting that when the CPU goes down, regular balancing concerns are moot, so ignoring them is fine). And here-in lies the problem, __sched_setscheduler()'s use of splice_balance_callbacks() takes the callbacks off the list across a lock-break, making it possible for, an interleaving, __schedule() to see an empty list and not get filtered. Fixes: `ae79270232` ("sched: Optimize finish_lock_switch()") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Valentin Schneider <vschneid@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-07-12 10:32:54 +02:00
Phil Auld	d560b649e2	sched: Fix balance_push() vs __sched_setscheduler() Bugzilla: https://bugzilla.redhat.com/2100215 commit 04193d590b390ec7a0592630f46d559ec6564ba1 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Jun 7 22:41:55 2022 +0200 sched: Fix balance_push() vs __sched_setscheduler() The purpose of balance_push() is to act as a filter on task selection in the case of CPU hotplug, specifically when taking the CPU out. It does this by (ab)using the balance callback infrastructure, with the express purpose of keeping all the unlikely/odd cases in a single place. In order to serve its purpose, the balance_push_callback needs to be (exclusively) on the callback list at all times (noting that the callback always places itself back on the list the moment it runs, also noting that when the CPU goes down, regular balancing concerns are moot, so ignoring them is fine). And here-in lies the problem, __sched_setscheduler()'s use of splice_balance_callbacks() takes the callbacks off the list across a lock-break, making it possible for, an interleaving, __schedule() to see an empty list and not get filtered. Fixes: `ae79270232` ("sched: Optimize finish_lock_switch()") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2022-06-22 16:22:26 -04:00
Ming Lei	4415be8560	block: check that there is a plug in blk_flush_plug Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917 commit aa8dcccaf32bfdc09f2aff089d5d60c37da5b7b5 Author: Christoph Hellwig <hch@lst.de> Date: Thu Jan 27 08:05:49 2022 +0100 block: check that there is a plug in blk_flush_plug Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes the NULL check instead of open coding that check everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2022-06-22 08:56:20 +08:00
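The wrapper pattern in the entry above (an inner function that assumes a valid plug, plus a thin wrapper that carries the NULL check) can be sketched as follows. This is a userspace illustration under stated assumptions: `struct plug`, `do_flush_plug()`, `flush_plug()` and `flush_calls` are hypothetical names, not the block layer's types or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-task plug holding queued work. */
struct plug {
    int pending;
};

/* Counts how often the inner helper actually ran (for illustration). */
static int flush_calls;

/* Inner helper: assumes plug is non-NULL and does the real work. */
static void do_flush_plug(struct plug *plug, bool from_schedule)
{
    (void)from_schedule; /* unused in this model */
    plug->pending = 0;
    flush_calls++;
}

/* Wrapper: carries the NULL check so callers need not open-code it. */
static inline void flush_plug(struct plug *plug, bool from_schedule)
{
    if (plug)
        do_flush_plug(plug, from_schedule);
}
```

Callers can then write `flush_plug(maybe_null_plug, false)` unconditionally, which is the simplification the commit message describes.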
Ming Lei	d5d4963cf5	block: remove blk_needs_flush_plug Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917 commit b1f866b013e6e5583f2f0bf4a61d13eddb9a1799 Author: Christoph Hellwig <hch@lst.de> Date: Thu Jan 27 08:05:48 2022 +0100 block: remove blk_needs_flush_plug blk_needs_flush_plug fails to account for the cb_list, which needs flushing as well. Remove it and just check if there is a plug instead of poking into the internals of the plug structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2022-06-22 08:56:19 +08:00
Phil Auld	5087f87023	sched/tracing: Append prev_state to tp args instead Bugzilla: https://bugzilla.redhat.com/2078906 Conflicts: Skipped one hunk, in samples, due to not having 3a73333fb370 ("tracing: Add TRACE_CUSTOM_EVENT() macro"). commit 9c2136be0878c88c53dea26943ce40bb03ad8d8d Author: Delyan Kratunov <delyank@fb.com> Date: Wed May 11 18:28:36 2022 +0000 sched/tracing: Append prev_state to tp args instead Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20) added a new prev_state argument to the sched_switch tracepoint, before the prev task_struct pointer. This reordering of arguments broke BPF programs that use the raw tracepoint (e.g. tp_btf programs). The type of the second argument has changed and existing programs that assume a task_struct* argument (e.g. for bpf_task_storage access) will now fail to verify. If we instead append the new argument to the end, all existing programs would continue to work and can conditionally extract the prev_state argument on supported kernel versions. Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20) Signed-off-by: Delyan Kratunov <delyank@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/c8a6930dfdd58a4a5755fc01732675472979732b.camel@fb.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-06-02 09:20:55 -04:00
Phil Auld	2dfe14261a	sched: Teach the forced-newidle balancer about CPU affinity limitation. Bugzilla: https://bugzilla.redhat.com/2078906 commit 386ef214c3c6ab111d05e1790e79475363abaa05 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Thu Mar 17 15:51:32 2022 +0100 sched: Teach the forced-newidle balancer about CPU affinity limitation. try_steal_cookie() looks at task_struct::cpus_mask to decide if the task could be moved to `this' CPU. It ignores that the task might be in a migration disabled section while not on the CPU. In this case the task must not be moved otherwise per-CPU assumptions are broken. Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if a task can be moved. Fixes: `d2dfa17bc7` ("sched: Trivial forced-newidle balancer") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-06-01 13:54:12 -04:00
Phil Auld	35c29596b9	sched/core: Fix forceidle balancing Bugzilla: https://bugzilla.redhat.com/2078906 commit 5b6547ed97f4f5dfc23f8e3970af6d11d7b7ed7e Author: Peter Zijlstra <peterz@infradead.org> Date: Wed Mar 16 22:03:41 2022 +0100 sched/core: Fix forceidle balancing Steve reported that ChromeOS encounters the forceidle balancer being run from rt_mutex_setprio()'s balance_callback() invocation and explodes. Now, the forceidle balancer gets queued every time the idle task gets selected, set_next_task(), which is strictly too often. rt_mutex_setprio() also uses set_next_task() in the 'change' pattern: queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */ running = task_current(rq, p); /* rq->curr == p */ if (queued) dequeue_task(...); if (running) put_prev_task(...); /* change task properties */ if (queued) enqueue_task(...); if (running) set_next_task(...); However, rt_mutex_setprio() will explicitly not run this pattern on the idle task (since priority boosting the idle task is quite insane). Most other 'change' pattern users are pidhash based and would also not apply to idle. Also, the change pattern doesn't contain a __balance_callback() invocation and hence we could have an out-of-band balance-callback, which *should* trigger the WARN in rq_pin_lock() (which guards against this exact anti-pattern). So while none of that explains how this happens, it does indicate that having it in set_next_task() might not be the most robust option. Instead, explicitly queue the forceidle balancer from pick_next_task() when it does indeed result in forceidle selection. Having it here, ensures it can only be triggered under the __schedule() rq->lock instance, and hence must be run from that context. This also happens to clean up the code a little, so win-win. Fixes: `d2dfa17bc7` ("sched: Trivial forced-newidle balancer") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: T.J. Alumbaugh <talumbau@chromium.org> Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2022-06-01 13:54:12 -04:00
Patrick Talbert	f9a5b7f4d0	Merge: Scheduler RT prerequisites MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/754 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076594 Tested: Sanity tested with scheduler stress tests. This is a handful of commits to help the RT merge. Keeping the differences as small as possible reduces the maintenance. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Fernando Pacheco <fpacheco@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-05-12 09:28:27 +02:00
Patrick Talbert	d92575ea9d	Merge: sched/deadline: code cleanup MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/729 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065219 Upstream Status: Linux Tested: by me with scheduler stress tests using deadline class, admission control failures and general stress tests. A series of fixes and cleanup for the deadline scheduler class. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Juri Lelli <juri.lelli@redhat.com> Approved-by: Fernando Pacheco <fpacheco@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-05-12 09:28:24 +02:00
Phil Auld	83eb03a64e	sched: Make RCU nest depth distinct in __might_resched() Bugzilla: https://bugzilla.redhat.com/2076594 commit 50e081b96e35e43b65591f40f7376204decd1cb5 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Sep 23 18:54:43 2021 +0200 sched: Make RCU nest depth distinct in __might_resched() For !RT kernels RCU nest depth in __might_resched() is always expected to be 0, but on RT kernels it can be non zero while the preempt count is expected to be always 0. Instead of playing magic games in interpreting the 'preempt_offset' argument, rename it to 'offsets' and use the lower 8 bits for the expected preempt count, allow to hand in the expected RCU nest depth in the upper bits and adopt the __might_resched() code and related checks and printks. The affected call sites are updated in subsequent steps. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.243232823@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-19 15:01:33 -04:00
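The packing scheme described in the entry above (lower 8 bits for the expected preempt count, upper bits for the expected RCU nest depth) can be sketched as below. The macro and function names are hypothetical and chosen for illustration; only the bit layout comes from the commit message.

```c
/* Hypothetical names; the 8-bit split follows the commit message. */
#define PREEMPT_BITS 8u
#define PREEMPT_MASK ((1u << PREEMPT_BITS) - 1u)

/* Pack the expected preempt count (low 8 bits) and RCU nest depth (upper bits). */
static inline unsigned int pack_offsets(unsigned int preempt, unsigned int rcu_depth)
{
    return (rcu_depth << PREEMPT_BITS) | (preempt & PREEMPT_MASK);
}

/* Extract the expected preempt count from the packed value. */
static inline unsigned int expected_preempt(unsigned int offsets)
{
    return offsets & PREEMPT_MASK;
}

/* Extract the expected RCU nest depth from the packed value. */
static inline unsigned int expected_rcu_depth(unsigned int offsets)
{
    return offsets >> PREEMPT_BITS;
}
```

A call site that holds one spinlock inside one RCU read-side section would pass `pack_offsets(1, 1)`, and the checker can compare each half against the live counters separately instead of guessing what a single combined offset meant.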
Phil Auld	627bfaffba	sched: Make might_sleep() output less confusing Bugzilla: https://bugzilla.redhat.com/2076594 commit 8d713b699e84aade6b64e241a35f22e166fc8174 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Sep 23 18:54:41 2021 +0200 sched: Make might_sleep() output less confusing might_sleep() output is pretty informative, but can be confusing at times especially with PREEMPT_RCU when the check triggers due to a voluntary sleep inside a RCU read side critical section: BUG: sleeping function called from invalid context at kernel/test.c:110 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 Preemption disabled at: migrate_disable+0x33/0xa0 in_atomic() is 0, but it still tells that preemption was disabled at migrate_disable(), which is completely useless because preemption is not disabled. But the interesting information to decode the above, i.e. the RCU nesting depth, is not printed. That becomes even more confusing when might_sleep() is invoked from cond_resched_lock() within a RCU read side critical section. Here the expected preemption count is 1 and not 0. BUG: sleeping function called from invalid context at kernel/test.c:131 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 Preemption disabled at: test_cond_lock+0xf3/0x1c0 So in_atomic() is set, which is expected as the caller holds a spinlock, but it's unclear why this is broken and the preempt disable IP is just pointing at the correct place, i.e. spin_lock(), which is obviously not helpful either. Make that more useful in general: - Print preempt_count() and the expected value and for the CONFIG_PREEMPT_RCU case: - Print the RCU read side critical section nesting depth - Print the preempt disable IP only when preempt count does not have the expected value. So the might_sleep() dump from within a preemptible RCU read side critical section becomes: BUG: sleeping function called from invalid context at kernel/test.c:110 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 and the cond_resched_lock() case becomes: BUG: sleeping function called from invalid context at kernel/test.c:141 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 preempt_count: 1, expected: 1 RCU nest depth: 1, expected: 0 which makes it pretty obvious what's going on. For all other cases the preempt disable IP is still printed as before: BUG: sleeping function called from invalid context at kernel/test.c: 156 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 Preemption disabled at: [<ffffffff82b48326>] test_might_sleep+0xbe/0xf8 BUG: sleeping function called from invalid context at kernel/test.c: 163 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 RCU nest depth: 1, expected: 0 Preemption disabled at: [<ffffffff82b48326>] test_might_sleep+0x1e4/0x280 This also prepares to provide a better debugging output for RT enabled kernels and their spinlock substitutions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.181022656@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-19 15:01:33 -04:00
Phil Auld	ace9ad8221	sched: Cleanup might_sleep() printks Bugzilla: https://bugzilla.redhat.com/2076594 commit a45ed302b6e6fe5b03166321c08b4f2ad4a92a35 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Sep 23 18:54:40 2021 +0200 sched: Cleanup might_sleep() printks Convert them to pr_*(). No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.117496067@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-19 15:01:33 -04:00
Phil Auld	5d55e0afeb	sched: Remove preempt_offset argument from __might_sleep() Bugzilla: https://bugzilla.redhat.com/2076594 commit 42a387566c567603bafa1ec0c5b71c35cba83e86 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Sep 23 18:54:38 2021 +0200 sched: Remove preempt_offset argument from __might_sleep() All callers hand in 0 and never will hand in anything else. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.054321586@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-19 15:01:33 -04:00
Phil Auld	acc9612b04	sched: Clean up the might_sleep() underscore zoo Bugzilla: https://bugzilla.redhat.com/2076594 commit 874f670e6088d3bff3972ecd44c1cb00610f9183 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Sep 23 18:54:35 2021 +0200 sched: Clean up the might_sleep() underscore zoo __might_sleep() vs. ___might_sleep() is hard to distinguish. Aside of that the three underscore variant is exposed to provide a checkpoint for rescheduling points which are distinct from blocking points. They are semantically a preemption point which means that scheduling is state preserving. A real blocking operation, e.g. mutex_lock(), wait*(), which cannot preserve a task state which is not equal to RUNNING. While technically blocking on a "sleeping" spinlock in RT enabled kernels falls into the voluntary scheduling category because it has to wait until the contended spin/rw lock becomes available, the RT lock substitution code can semantically be mapped to a voluntary preemption because the RT lock substitution code and the scheduler are providing mechanisms to preserve the task state and to take regular non-lock related wakeups into account. Rename ___might_sleep() to __might_resched() to make the distinction of these functions clear. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165357.928693482@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-19 15:01:33 -04:00
Phil Auld	293846bc7d	sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy() Bugzilla: http://bugzilla.redhat.com/2065219 commit 772b6539fdda31462cc08368e78df60b31a58bab Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Wed Mar 2 19:34:30 2022 +0100 sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy() Both functions are doing almost the same, that is checking if admission control is still respected. With exclusive cpusets, dl_task_can_attach() checks if the destination cpuset (i.e. its root domain) has enough CPU capacity to accommodate the task. dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in case the CPU is hot-plugged out. dl_task_can_attach() is used to check if a task can be admitted while dl_cpu_busy() is used to check if a CPU can be hotplugged out. Make dl_cpu_busy() able to deal with a task and use it instead of dl_task_can_attach() in task_can_attach(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-4-dietmar.eggemann@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-13 12:56:36 -04:00
Phil Auld	c812902eec	sched/deadline: Remove unused def_dl_bandwidth Bugzilla: http://bugzilla.redhat.com/2065219 commit eb77cf1c151c4a1c2147cbf24d84bcf0ba504e7c Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Wed Mar 2 19:34:28 2022 +0100 sched/deadline: Remove unused def_dl_bandwidth Since commit `1724813d9f` ("sched/deadline: Remove the sysctl_sched_dl knobs") the default deadline bandwidth control structure has no purpose. Remove it. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-2-dietmar.eggemann@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-13 10:46:56 -04:00
Phil Auld	20c10cc17b	sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y Bugzilla: http://bugzilla.redhat.com/2069275 commit a7b2553b5ece1aba4b5994eef150d0a1269b5805 Author: Ingo Molnar <mingo@kernel.org> Date: Tue Mar 15 10:33:53 2022 +0100 sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y This header is not (yet) standalone. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-11 17:38:23 -04:00
Phil Auld	6f4ee2303a	sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies Bugzilla: http://bugzilla.redhat.com/2069275 commit e66f6481a8c748ce2d4b37a3d5e10c4dd0d65e80 Author: Ingo Molnar <mingo@kernel.org> Date: Wed Feb 23 08:17:15 2022 +0100 sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies Use all generic headers from kernel/sched/sched.h that are required for it to build. Sort the sections alphabetically. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-11 17:38:22 -04:00
Phil Auld	b0b1db90ca	sched/headers: Standardize kernel/sched/sched.h header dependencies Bugzilla: http://bugzilla.redhat.com/2069275 commit b9e9c6ca6e54b5d58a57663f76c5cb33c12ea98f Author: Ingo Molnar <mingo@kernel.org> Date: Sun Feb 13 08:19:43 2022 +0100 sched/headers: Standardize kernel/sched/sched.h header dependencies kernel/sched/sched.h is a weird mix of ad-hoc headers included in the middle of the header. Two of them rely on being included in the middle of kernel/sched/sched.h, due to definitions they require: - "stat.h" needs the rq definitions. - "autogroup.h" needs the task_group definition. Move the inclusion of these two files out of kernel/sched/sched.h, and include them in all files that require them. Move the rest of the header dependencies to the top of the kernel/sched/sched.h file. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-11 17:38:22 -04:00
Phil Auld	c08b78797d	Merge remote-tracking branch 'origin/merge-requests/673' into bz2069275 Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-11 12:37:32 -04:00
Phil Auld	233aa69d39	Merge remote-tracking branch 'origin/merge-requests/671' into bz2069275 Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-11 12:37:07 -04:00
Phil Auld	0e372dcf73	sched/preempt: Add PREEMPT_DYNAMIC using static keys Bugzilla: http://bugzilla.redhat.com/2065226 commit 99cf983cc8bca4adb461b519664c939a565cfd4d Author: Mark Rutland <mark.rutland@arm.com> Date: Mon Feb 14 16:52:14 2022 +0000 sched/preempt: Add PREEMPT_DYNAMIC using static keys Where an architecture selects HAVE_STATIC_CALL but not HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline which will either branch to a callee or return to the caller. On such architectures, a number of constraints can conspire to make those trampolines more complicated and potentially less useful than we'd like. For example: * Hardware and software control flow integrity schemes can require the addition of "landing pad" instructions (e.g. `BTI` for arm64), which will also be present at the "real" callee. * Limited branch ranges can require that trampolines generate or load an address into a register and perform an indirect branch (or at least have a slow path that does so). This loses some of the benefits of having a direct branch. * Interaction with SW CFI schemes can be complicated and fragile, e.g. requiring that we can recognise idiomatic codegen and remove indirections understand, at least until clang provides more helpful mechanisms for dealing with this. For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we really only need to enable/disable specific preemption functions. We can achieve the same effect without a number of the pain points above by using static keys to fold early returns into the preemption functions themselves rather than in an out-of-line trampoline, effectively inlining the trampoline into the start of the function. For arm64, this results in good code generation. For example, the dynamic_cond_resched() wrapper looks as follows when enabled. When disabled, the first `B` is replaced with a `NOP`, resulting in an early return. \| <dynamic_cond_resched>: \| bti c \| b <dynamic_cond_resched+0x10> // or `nop` \| mov w0, #0x0 \| ret \| mrs x0, sp_el0 \| ldr x0, [x0, #8] \| cbnz x0, <dynamic_cond_resched+0x8> \| paciasp \| stp x29, x30, [sp, #-16]! \| mov x29, sp \| bl <preempt_schedule_common> \| mov w0, #0x1 \| ldp x29, x30, [sp], #16 \| autiasp \| ret ... compared to the regular form of the function: \| <__cond_resched>: \| bti c \| mrs x0, sp_el0 \| ldr x1, [x0, #8] \| cbz x1, <__cond_resched+0x18> \| mov w0, #0x0 \| ret \| paciasp \| stp x29, x30, [sp, #-16]! \| mov x29, sp \| bl <preempt_schedule_common> \| mov w0, #0x1 \| ldp x29, x30, [sp], #16 \| autiasp \| ret Any architecture which implements static keys should be able to use this to implement PREEMPT_DYNAMIC with similar cost to non-inlined static calls. Since this is likely to have greater overhead than (inlined) static calls, PREEMPT_DYNAMIC is only defaulted to enabled when HAVE_PREEMPT_DYNAMIC_CALL is selected. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-07 09:35:08 -04:00
Phil Auld	c14a9a1c67	sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY Bugzilla: http://bugzilla.redhat.com/2065226 commit 33c64734be3461222a8aa27d3dadc477ebca62de Author: Mark Rutland <mark.rutland@arm.com> Date: Mon Feb 14 16:52:13 2022 +0000 sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY Now that the enabled/disabled states for the preemption functions are declared alongside their definitions, the core PREEMPT_DYNAMIC logic is no longer tied to GENERIC_ENTRY, and can safely be selected so long as an architecture provides enabled/disabled states for irqentry_exit_cond_resched(). Make it possible to select HAVE_PREEMPT_DYNAMIC without GENERIC_ENTRY. For existing users of HAVE_PREEMPT_DYNAMIC there should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-5-mark.rutland@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-07 09:35:08 -04:00
Phil Auld	4d8e4a6697	sched/preempt: Refactor sched_dynamic_update() Bugzilla: http://bugzilla.redhat.com/2065226 commit 8a69fe0be143b0a1af829f85f0e9a1ae7d6a04db Author: Mark Rutland <mark.rutland@arm.com> Date: Mon Feb 14 16:52:11 2022 +0000 sched/preempt: Refactor sched_dynamic_update() Currently sched_dynamic_update needs to open-code the enabled/disabled function names for each preemption model it supports, when in practice this is a boolean enabled/disabled state for each function. Make this clearer and avoid repetition by defining the enabled/disabled states at the function definition, and using helper macros to perform the static_call_update(). Where x86 currently overrides the enabled function, it is made to provide both the enabled and disabled states for consistency, with defaults provided by the core code otherwise. In subsequent patches this will allow us to support PREEMPT_DYNAMIC without static calls. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-3-mark.rutland@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-07 09:35:07 -04:00
Phil Auld	cd905af16b	sched/preempt: Move PREEMPT_DYNAMIC logic later Bugzilla: http://bugzilla.redhat.com/2065226 commit 4c7485584d48f60b1e742c7c6a3a1fa503d48d97 Author: Mark Rutland <mark.rutland@arm.com> Date: Mon Feb 14 16:52:10 2022 +0000 sched/preempt: Move PREEMPT_DYNAMIC logic later The PREEMPT_DYNAMIC logic in kernel/sched/core.c patches static calls for a bunch of preemption functions. While most are defined prior to this, the definition of cond_resched() is later in the file, and so we only have its declarations from include/linux/sched.h. In subsequent patches we'd like to define some macros alongside the definition of each of the preemption functions, which we can use within sched_dynamic_update(). For this to be possible, the PREEMPT_DYNAMIC logic needs to be placed after the various preemption functions. As a preparatory step, this patch moves the PREEMPT_DYNAMIC logic after the various preemption functions, with no other changes -- this is purely a move. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-2-mark.rutland@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-04-07 09:35:07 -04:00
Phil Auld	1cf795c344	sched/isolation: Use single feature type while referring to housekeeping cpumask Bugzilla: http://bugzilla.redhat.com/2065222 commit 04d4e665a60902cf36e7ad39af1179cb5df542ad Author: Frederic Weisbecker <frederic@kernel.org> Date: Mon Feb 7 16:59:06 2022 +0100 sched/isolation: Use single feature type while referring to housekeeping cpumask Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-31 10:40:39 -04:00
Phil Auld	a5be4d79e1	sched/numa: Fix NUMA topology for systems with CPU-less nodes Bugzilla: http://bugzilla.redhat.com/2062831 commit 0fb3978b0aac3a5c08637aed03cc2d65f793508f Author: Huang Ying <ying.huang@intel.com> Date: Mon Feb 14 20:15:52 2022 +0800 sched/numa: Fix NUMA topology for systems with CPU-less nodes The NUMA topology parameters (sched_numa_topology_type, sched_domains_numa_levels, and sched_max_numa_distance, etc.) identified by scheduler may be wrong for systems with CPU-less nodes. For example, the ACPI SLIT of a system with CPU-less persistent memory (Intel Optane DCPMM) nodes is as follows, [000h 0000 4] Signature : "SLIT" [System Locality Information Table] [004h 0004 4] Table Length : 0000042C [008h 0008 1] Revision : 01 [009h 0009 1] Checksum : 59 [00Ah 0010 6] Oem ID : "XXXX" [010h 0016 8] Oem Table ID : "XXXXXXX" [018h 0024 4] Oem Revision : 00000001 [01Ch 0028 4] Asl Compiler ID : "INTL" [020h 0032 4] Asl Compiler Revision : 20091013 [024h 0036 8] Localities : 0000000000000004 [02Ch 0044 4] Locality 0 : 0A 15 11 1C [030h 0048 4] Locality 1 : 15 0A 1C 11 [034h 0052 4] Locality 2 : 11 1C 0A 1C [038h 0056 4] Locality 3 : 1C 11 1C 0A While the `numactl -H` output is as follows, available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 64136 MB node 0 free: 5981 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 64466 MB node 1 free: 10415 MB node 2 cpus: node 2 size: 253952 MB node 2 free: 253920 MB node 3 cpus: node 3 size: 253952 MB node 3 free: 253951 MB node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 In this system, there are only 2 sockets. In each memory controller, both DRAM and PMEM DIMMs are installed. 
Although the physical NUMA topology is simple, the logical NUMA topology becomes a little complex. Because both distance(0, 1) and distance(1, 3) are less than distance(0, 3), it appears that node 1 sits between node 0 and node 3, and the whole system appears to be a glueless mesh NUMA topology type. But it's definitely not: there is not even a CPU in node 3. This isn't a practical problem yet, because the PMEM nodes (node 2 and node 3 in the example system) are offlined by default during system boot. So init_numa_topology_type() called during system boot will ignore them and set sched_numa_topology_type to NUMA_DIRECT. And init_numa_topology_type() is only called at runtime when a CPU of a never-onlined-before node gets plugged in. And there's no CPU in the PMEM nodes. But it appears better to fix this to make the code more robust. To test the potential problem, we used a debug patch to call init_numa_topology_type() when the PMEM node is onlined (in __set_migration_target_nodes()). With that, the NUMA parameters identified by the scheduler are as follows, sched_numa_topology_type: NUMA_GLUELESS_MESH sched_domains_numa_levels: 4 sched_max_numa_distance: 28 To fix the issue, the CPU-less nodes are ignored when the NUMA topology parameters are identified. Because a node may become CPU-less or not at run time because of CPU hotplug, the NUMA topology parameters need to be re-initialized at runtime for CPU hotplug too. With the patch, the NUMA parameters identified for the example system above are as follows, sched_numa_topology_type: NUMA_DIRECT sched_domains_numa_levels: 2 sched_max_numa_distance: 21 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:37 -04:00
Phil Auld	6c8a46d512	sched: replace cpumask_weight with cpumask_empty where appropriate Bugzilla: http://bugzilla.redhat.com/2062831 commit 1087ad4e3f88c474b8134a482720782922bf3fdf Author: Yury Norov <yury.norov@gmail.com> Date: Thu Feb 10 14:49:06 2022 -0800 sched: replace cpumask_weight with cpumask_empty where appropriate In some places, kernel/sched code calls cpumask_weight() to check if any bit of a given cpumask is set. We can do it more efficiently with cpumask_empty() because cpumask_empty() stops traversing the cpumask as soon as it finds first set bit, while cpumask_weight() counts all bits unconditionally. Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:37 -04:00
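The difference between the two checks described in the commit above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: `struct sketch_mask`, `sketch_mask_weight()` and `sketch_mask_empty()` are hypothetical stand-ins for a cpumask and for `cpumask_weight()` / `cpumask_empty()`. It only shows why an emptiness test can stop early while a population count has to visit every word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cpumask: a small fixed array of words. */
#define SKETCH_MASK_WORDS 4

struct sketch_mask {
	unsigned long bits[SKETCH_MASK_WORDS];
};

/* Counts every set bit, so it always walks the whole mask. */
static unsigned int sketch_mask_weight(const struct sketch_mask *m)
{
	unsigned int count = 0;

	assert(m != NULL);
	for (size_t i = 0; i < SKETCH_MASK_WORDS; i++) {
		unsigned long w = m->bits[i];

		while (w) {
			w &= w - 1;	/* clear the lowest set bit */
			count++;
		}
	}
	return count;
}

/* Returns as soon as a non-zero word (i.e. any set bit) is seen. */
static bool sketch_mask_empty(const struct sketch_mask *m)
{
	assert(m != NULL);
	for (size_t i = 0; i < SKETCH_MASK_WORDS; i++) {
		if (m->bits[i])
			return false;
	}
	return true;
}
```

A caller that only needs "is any CPU set?" can use `!sketch_mask_empty(&m)` in place of `sketch_mask_weight(&m) != 0`; both give the same answer, but the former may finish after the first word.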
Phil Auld	090df5874d	sched/tracing: Don't re-read p->state when emitting sched_switch event Bugzilla: http://bugzilla.redhat.com/2062831 commit fa2c3254d7cfff5f7a916ab928a562d1165f17bb Author: Valentin Schneider <valentin.schneider@arm.com> Date: Thu Jan 20 16:25:19 2022 +0000 sched/tracing: Don't re-read p->state when emitting sched_switch event As of commit `c6e7bd7afa` ("sched/core: Optimize ttwu() spinning on p->on_cpu") the following sequence becomes possible: p->__state = TASK_INTERRUPTIBLE; __schedule() deactivate_task(p); ttwu() READ !p->on_rq p->__state=TASK_WAKING trace_sched_switch() __trace_sched_switch_state() task_state_index() return 0; TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in the trace event. Prevent this by pushing the value read from __schedule() down the trace event. Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:37 -04:00
Phil Auld	eb2eff33b3	sched/core: Export pelt_thermal_tp Bugzilla: http://bugzilla.redhat.com/2062831 commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b Author: Qais Yousef <qais.yousef@arm.com> Date: Thu Oct 28 12:50:05 2021 +0100 sched/core: Export pelt_thermal_tp We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: `765047932f` ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:37 -04:00
Phil Auld	e99d064416	sched/core: Accounting forceidle time for all tasks except idle task Bugzilla: http://bugzilla.redhat.com/2062831 commit b171501f258063f5c56dd2c5fdf310802d8d7dc1 Author: Cruz Zhao <CruzZhao@linux.alibaba.com> Date: Tue Jan 11 17:55:59 2022 +0800 sched/core: Accounting forceidle time for all tasks except idle task There are two types of forced idle time: forced idle time from cookie'd task and forced idle time from uncookie'd task. The forced idle time from uncookie'd task is actually caused by the cookie'd task in runqueue indirectly, and it's more accurate to measure the capacity loss with the sum of both. Assuming cpu x and cpu y are a pair of SMT siblings, consider the following scenarios: 1. There's a cookie'd task running on cpu x, and there're 4 uncookie'd tasks running on cpu y. For cpu x, there will be 80% forced idle time (from uncookie'd task); for cpu y, there will be 20% forced idle time (from cookie'd task). 2. There's an uncookie'd task running on cpu x, and there're 4 cookie'd tasks running on cpu y. For cpu x, there will be 80% forced idle time (from cookie'd task); for cpu y, there will be 20% forced idle time (from uncookie'd task). Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced similarly): (cookie'd) taskset -c x stress-ng -c 1 -l 100 (uncookie'd) taskset -c y stress-ng -c 4 -l 100 In the above two scenarios, the total capacity loss is 1 cpu, but in scenario 1, the cookie'd forced idle time tells us 20% cpu capacity loss, and in scenario 2, the cookie'd forced idle time tells us 80% cpu capacity loss, neither of which is accurate. It'll be more accurate to measure with cookie'd forced idle time and uncookie'd forced idle time. 
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Link: https://lore.kernel.org/r/1641894961-9241-2-git-send-email-CruzZhao@linux.alibaba.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:37 -04:00
Phil Auld	ad2f1aab97	sched: Avoid double preemption in __cond_resched_lock() Bugzilla: http://bugzilla.redhat.com/2062831 commit 7e406d1ff39b8ee574036418a5043c86723170cf Author: Peter Zijlstra <peterz@infradead.org> Date: Sat Dec 25 01:04:57 2021 +0100 sched: Avoid double preemption in __cond_resched_lock() For PREEMPT/DYNAMIC_PREEMPT the _unlock() will already trigger a preemption, no point in then calling preempt_schedule_common() again. Use _cond_resched() instead, since this is a NOP for the preemptible configs while it provides a preemption point for the others. Reported-by: xuhaifeng <xuhaifeng@oppo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:36 -04:00
Phil Auld	2d595edebf	sched: Trigger warning if ->migration_disabled counter underflows. Bugzilla: http://bugzilla.redhat.com/2062831 commit 9d0df37797453f168afdb2e6fd0353c73718ae9a Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Mon Nov 29 18:46:44 2021 +0100 sched: Trigger warning if ->migration_disabled counter underflows. If migrate_enable() is used more often than its counterpart then it remains undetected and rq::nr_pinned will underflow, too. Add a warning if migrate_enable() is attempted without a matching migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211129174654.668506-2-bigeasy@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:36 -04:00
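The underflow check described in the commit above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: `struct sketch_task`, `sketch_migrate_disable()` and `sketch_migrate_enable()` are hypothetical names, and a message on stderr stands in for the kernel warning. The point is only that an unbalanced enable is detected before the unsigned counter wraps around.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-task state: a nesting counter for disabled migration. */
struct sketch_task {
	unsigned short migration_disabled;
};

static void sketch_migrate_disable(struct sketch_task *p)
{
	p->migration_disabled++;
}

/*
 * Returns false and warns when called without a matching disable,
 * instead of letting the unsigned counter wrap around.
 */
static bool sketch_migrate_enable(struct sketch_task *p)
{
	if (p->migration_disabled == 0) {
		fprintf(stderr, "warning: unbalanced migrate enable\n");
		return false;
	}
	p->migration_disabled--;
	return true;
}
```

With a balanced disable/enable pair the counter returns to zero; a second enable is rejected and leaves the counter at zero.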
Phil Auld	fb7c2476b1	sched/fair: Replace CFS internal cpu_util() with cpu_util_cfs() Bugzilla: http://bugzilla.redhat.com/2062831 commit 82762d2af31a60081162890983a83499c9c7dd74 Author: Dietmar Eggemann <dietmar.eggemann@arm.com> Date: Thu Nov 18 17:42:40 2021 +0100 sched/fair: Replace CFS internal cpu_util() with cpu_util_cfs() cpu_util_cfs() was created by commit `d4edd662ac` ("sched/cpufreq: Use the DEADLINE utilization signal") to enable the access to CPU utilization from the Schedutil CPUfreq governor. Commit `a07630b8b2` ("sched/cpufreq/schedutil: Use util_est for OPP selection") added util_est support later. The only thing cpu_util() is doing on top of what cpu_util_cfs() already does is to clamp the return value to the [0..capacity_orig] capacity range of the CPU. Integrating this into cpu_util_cfs() is not harming the existing users (Schedutil and CPUfreq cooling (latter via sched_cpu_util() wrapper)). For straightforwardness, prefer to keep using `int cpu` as the function parameter over using `struct rq *rq` which might avoid some calls to cpu_rq(cpu) -> per_cpu(runqueues, cpu) -> RELOC_HIDE(). Update cfs_util()'s documentation and reuse it for cpu_util_cfs(). Remove cpu_util(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211118164240.623551-1-dietmar.eggemann@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:36 -04:00
Phil Auld	b5be4938b7	sched/core: Forced idle accounting Bugzilla: http://bugzilla.redhat.com/2062831 Conflicts: fuzz due to kabi padding in struct rq commit 4feee7d12603deca8775f9f9ae5e121093837444 Author: Josh Don <joshdon@google.com> Date: Mon Oct 18 13:34:28 2021 -0700 sched/core: Forced idle accounting Adds accounting for "forced idle" time, which is time where a cookie'd task forces its SMT sibling to idle, despite the presence of runnable tasks. Forced idle time is one means to measure the cost of enabling core scheduling (ie. the capacity lost due to the need to force idle). Forced idle time is attributed to the thread responsible for causing the forced idle. A few details: - Forced idle time is displayed via /proc/PID/sched. It also requires that schedstats is enabled. - Forced idle is only accounted when a sibling hyperthread is held idle despite the presence of runnable tasks. No time is charged if a sibling is idle but has no runnable tasks. - Tasks with 0 cookie are never charged forced idle. - For SMT > 2, we scale the amount of forced idle charged based on the number of forced idle siblings. Additionally, we split the time up and evenly charge it to all running tasks, as each is equally responsible for the forced idle. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-28 09:28:35 -04:00
Phil Auld	fc23011d32	sched: Fix yet more sched_fork() races Bugzilla: http://bugzilla.redhat.com/2062836 commit b1e8206582f9d680cff7d04828708c8b6ab32957 Author: Peter Zijlstra <peterz@infradead.org> Date: Mon Feb 14 10:16:57 2022 +0100 sched: Fix yet more sched_fork() races Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-14 09:25:27 -04:00
Phil Auld	4f57951af2	sched/fair: Fix fault in reweight_entity Bugzilla: http://bugzilla.redhat.com/2062836 commit 13765de8148f71fa795e0a6607de37c49ea5915a Author: Tadeusz Struk <tadeusz.struk@linaro.org> Date: Thu Feb 3 08:18:46 2022 -0800 sched/fair: Fix fault in reweight_entity Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns a number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task was set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in sched_post_fork(), which is called after that. To fix the issue, remove the update_load param from the set_load_weight() function and call reweight_task() only if the task does not have the TASK_NEW flag set. 
Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>	2022-03-14 09:25:18 -04:00
Herton R. Krzesinski	f13f32b81b	Merge: sched: backports from 5.16 merge window MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/217 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2020279 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2029640 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1921343 Upstream Status: Linux Tested: By me, with scheduler stress and sanity tests. Boot tested on Alderlake for topology changes. 5.16+ scheduler fixes. This includes some commits requested by the Livepatch team and some AlderLake topology changes. A few additional patches were pulled in to make the rest apply. With those and the dependency all patches apply cleanly. v2: added 3 more commits from sched/urgent. Added one last (hopefully) fix from sched/urgent. Signed-off-by: Phil Auld <pauld@redhat.com> RH-Acked-by: David Arcari <darcari@redhat.com> RH-Acked-by: Rafael Aquini <aquini@redhat.com> RH-Acked-by: Wander Lairson Costa <wander@redhat.com> RH-Acked-by: Waiman Long <longman@redhat.com> RH-Acked-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2021-12-22 10:22:13 -03:00
Herton R. Krzesinski	bc4cd05211	Merge: block: update to v5.16 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/148 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2023396 Upstream Status: merged to linus tree already Update block layer and related drivers(drivers/block) with v5.16. Signed-off-by: Ming Lei <ming.lei@redhat.com> RH-Acked-by: Rafael Aquini <aquini@redhat.com> RH-Acked-by: Lyude Paul <lyude@redhat.com> RH-Acked-by: Gopal Tiwari <gtiwari@redhat.com> RH-Acked-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2021-12-17 09:07:53 -03:00
Phil Auld	8ad4a5d307	sched/uclamp: Fix rq->uclamp_max not set on first enqueue Bugzilla: http://bugzilla.redhat.com/2020279 commit 315c4f884800c45cb6bd8c90422fad554a8b9588 Author: Qais Yousef <qais.yousef@arm.com> Date: Thu Dec 2 11:20:33 2021 +0000 sched/uclamp: Fix rq->uclamp_max not set on first enqueue Commit `d81ae8aac8` ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit `d81ae8aac8` changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: `d81ae8aac8` ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:52 -05:00
Phil Auld	29674bf319	preempt/dynamic: Fix setup_preempt_mode() return value Bugzilla: http://bugzilla.redhat.com/2020279 commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86 Author: Andrew Halaney <ahalaney@redhat.com> Date: Fri Dec 3 17:32:03 2021 -0600 preempt/dynamic: Fix setup_preempt_mode() return value __setup() callbacks expect 1 for success and 0 for failure. Correct the usage here to reflect that. Fixes: `826bfeb37b` ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:52 -05:00
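The return convention that the commit above corrects (1 when the option was handled, 0 when it was not) can be sketched as a small userspace parser. This is a hedged illustration only: `sketch_setup_preempt_mode()`, `sketch_mode` and the enum are hypothetical names, and the accepted strings are assumed to be the `none`, `voluntary` and `full` values of the `preempt=` boot option.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical preemption modes selectable from a boot-style option string. */
enum sketch_preempt_mode {
	SKETCH_PREEMPT_NONE,
	SKETCH_PREEMPT_VOLUNTARY,
	SKETCH_PREEMPT_FULL,
};

static enum sketch_preempt_mode sketch_mode = SKETCH_PREEMPT_NONE;

/*
 * Follows the convention named in the commit message:
 * return 1 when the option was handled, 0 when it was not.
 */
static int sketch_setup_preempt_mode(const char *str)
{
	static const char *const names[] = { "none", "voluntary", "full" };

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (strcmp(str, names[i]) == 0) {
			sketch_mode = (enum sketch_preempt_mode)i;
			return 1;	/* recognised and applied */
		}
	}
	return 0;	/* unknown value: leave the mode unchanged */
}
```

An unrecognised string leaves `sketch_mode` untouched and reports failure through the 0 return value.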
Phil Auld	9ce1981ceb	sched/scs: Reset task stack state in bringup_cpu() Bugzilla: http://bugzilla.redhat.com/2020279 commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3 Author: Mark Rutland <mark.rutland@arm.com> Date: Tue Nov 23 11:40:47 2021 +0000 sched/scs: Reset task stack state in bringup_cpu() To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames. When shadow call stacks (SCS) are in use, the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined then onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: `e1b77c9298` ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: `f1a0a376ca` ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). 
This ensures the idle task always has a consistent state when it is running, and removes the need to do so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for the idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with: * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK ... offlining and onlining CPUs with: \| while true; do \| for C in /sys/devices/system/cpu/cpu*/online; do \| echo 0 > $C; \| echo 1 > $C; \| done \| done Fixes: `f1a0a376ca` ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/ Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:52 -05:00
Phil Auld	4402fa0cd3	sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain() Bugzilla: http://bugzilla.redhat.com/2020279 commit 42dc938a590c96eeb429e1830123fef2366d9c80 Author: Vincent Donnefort <vincent.donnefort@arm.com> Date: Thu Nov 4 17:51:20 2021 +0000 sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain() Nothing protects the access to the per_cpu variable sd_llc_id. When testing the same CPU (i.e. this_cpu == that_cpu), a race condition exists with update_top_cache_domain(). One scenario being: CPU1 CPU2 ================================================================== per_cpu(sd_llc_id, CPUX) => 0 partition_sched_domains_locked() detach_destroy_domains() cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX) per_cpu(sd_llc_id, CPUX) => 0 per_cpu(sd_llc_id, CPUX) = CPUX per_cpu(sd_llc_id, CPUX) => CPUX return false ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result is a warning triggered from ttwu_queue_wakelist(). Avoid such a race in cpus_share_cache() by always returning true when this_cpu == that_cpu. Fixes: `518cd62341` ("sched: Only queue remote wakeups when crossing cache boundaries") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:51 -05:00
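The mitigation described in the commit above amounts to a short-circuit before the two ID reads. A minimal userspace sketch, assuming a plain array in place of the per-CPU `sd_llc_id` variable (`sketch_llc_id`, `SKETCH_NR_CPUS` and `sketch_cpus_share_cache()` are hypothetical names):

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 8

/* Hypothetical stand-in for the per-CPU last-level-cache domain ID. */
static int sketch_llc_id[SKETCH_NR_CPUS];

/*
 * A CPU always shares a cache with itself, so answer that case without
 * reading the ID twice; two reads could observe different values if the
 * ID were being updated concurrently.
 */
static bool sketch_cpus_share_cache(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;

	return sketch_llc_id[this_cpu] == sketch_llc_id[that_cpu];
}
```

The sketch is single-threaded and does not reproduce the race itself; it only shows that the same-CPU answer no longer depends on the stored IDs.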
Phil Auld	70d81a5df7	sched/fair: Prevent dead task groups from regaining cfs_rq's Bugzilla: http://bugzilla.redhat.com/2020279 commit b027789e5e50494c2325cc70c8642e7fd6059479 Author: Mathias Krause <minipli@grsecurity.net> Date: Wed Nov 3 20:06:13 2021 +0100 sched/fair: Prevent dead task groups from regaining cfs_rq's Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit `a7b359fc6a` ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. 
These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) \|\| cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. 
This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ Fixes: `a7b359fc6a` ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:51 -05:00
Phil Auld	167140e68e	sched: Remove pointless preemption disable in sched_submit_work() Bugzilla: http://bugzilla.redhat.com/2020279 commit b945efcdd07d86cece1cce68503aae91f107eacb Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed Sep 29 11:37:32 2021 +0200 sched: Remove pointless preemption disable in sched_submit_work() Neither wq_worker_sleeping() nor io_wq_worker_sleeping() require to be invoked with preemption disabled: - The worker flag checks operations only need to be serialized against the worker thread itself. - The accounting and worker pool operations are serialized with locks. which means that disabling preemption has neither a reason nor a value. Remove it and update the stale comment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lkml.kernel.org/r/8735pnafj7.ffs@tglx Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:49 -05:00
Phil Auld	e9a43d5999	sched: Move mmdrop to RCU on RT Bugzilla: http://bugzilla.redhat.com/2020279 commit 8d491de6edc27138806cae6e8eca455beb325b62 Author: Thomas Gleixner <tglx@linutronix.de> Date: Tue Sep 28 14:24:32 2021 +0200 sched: Move mmdrop to RCU on RT mmdrop() is invoked from finish_task_switch() by the incoming task to drop the mm which was handed over by the previous task. mmdrop() can be quite expensive which prevents an incoming real-time task from getting useful work done. Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels it delegates the eventually required invocation of __mmdrop() to RCU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:49 -05:00
Phil Auld	395c062ef5	sched: Move kprobes cleanup out of finish_task_switch() Bugzilla: http://bugzilla.redhat.com/2020279 commit 670721c7bd2a6e16e40db29b2707a27bdecd6928 Author: Thomas Gleixner <tglx@linutronix.de> Date: Tue Sep 28 14:24:28 2021 +0200 sched: Move kprobes cleanup out of finish_task_switch() Doing cleanups in the tail of schedule() is a latency punishment for the incoming task. The point of invoking kprobes_task_flush() for a dead task is that the instances are returned and cannot leak when __schedule() is kprobed. Move it into the delayed cleanup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:49 -05:00
Phil Auld	85d252d9f1	sched: Limit the number of task migrations per batch on RT Bugzilla: http://bugzilla.redhat.com/2020279 commit 691925f3ddccea832cf2d162dc277d2623a816e3 Author: Thomas Gleixner <tglx@linutronix.de> Date: Tue Sep 28 14:24:25 2021 +0200 sched: Limit the number of task migrations per batch on RT Batched task migrations are a source for large latencies as they keep the scheduler from running while processing the migrations. Limit the batch size to 8 instead of 32 when running on a RT enabled kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:49 -05:00
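As a rough illustration of the batch limit described in the commit above, the sketch below picks the batch size at compile time. It is a userspace model under stated assumptions: `MODEL_PREEMPT_RT`, `NR_MIGRATE_BATCH` and `migrate_batch()` are hypothetical names introduced here, and only the values 8 and 32 come from the commit message.

```c
#include <assert.h>

/*
 * MODEL_PREEMPT_RT is a hypothetical stand-in for an RT-enabled build.
 * Define it at compile time to select the smaller batch.
 */
#ifdef MODEL_PREEMPT_RT
#define NR_MIGRATE_BATCH 8
#else
#define NR_MIGRATE_BATCH 32
#endif

/* Move at most one batch of queued tasks; return how many remain queued. */
static unsigned int migrate_batch(unsigned int queued)
{
	unsigned int moved = queued < NR_MIGRATE_BATCH ? queued : NR_MIGRATE_BATCH;

	return queued - moved;
}
```

A smaller batch means each pass holds off the scheduler for less time, at the cost of needing more passes to drain a long queue.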
Phil Auld	1028c3ee10	sched: Simplify wake_up_idle() Bugzilla: http://bugzilla.redhat.com/2020279 commit 8850cb663b5cda04d33f9cfbc38889d73d3c8e24 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Sep 21 22:16:02 2021 +0200 sched: Simplify wake_up_idle() Simplify and make wake_up_if_idle() more robust, also don't iterate the whole machine with preempt_disable() in its caller: wake_up_all_idle_cpus(). This prepares for another wake_up_if_idle() user that needs a full do_idle() cycle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.769328779@infradead.org Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:48 -05:00
Phil Auld	c693803b2b	sched,rcu: Rework try_invoke_on_locked_down_task() Bugzilla: http://bugzilla.redhat.com/2020279 commit 9b3c4ab3045e953670c7de9c1165fae5358a7237 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Sep 21 21:54:32 2021 +0200 sched,rcu: Rework try_invoke_on_locked_down_task() Give try_invoke_on_locked_down_task() a saner name and have it return an int so that the caller might distinguish between different reasons of failure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:48 -05:00
Phil Auld	a2ee57d025	sched: Improve try_invoke_on_locked_down_task() Bugzilla: http://bugzilla.redhat.com/2020279 commit f6ac18fafcf6cc5e41c26766d12ad335ed81012e Author: Peter Zijlstra <peterz@infradead.org> Date: Wed Sep 22 10:14:15 2021 +0200 sched: Improve try_invoke_on_locked_down_task() Clarify and tighten try_invoke_on_locked_down_task(). Basically the function calls @func under task_rq_lock(), except it avoids taking rq->lock when possible. This makes calling @func unconditional (the function will get renamed in a later patch to remove the try). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.589323576@infradead.org Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:47 -05:00
Phil Auld	eec2be9bfb	kernel/sched: Fix sched_fork() access an invalid sched_task_group Bugzilla: http://bugzilla.redhat.com/2020279 commit 4ef0c5c6b5ba1f38f0ea1cedad0cad722f00c14a Author: Zhang Qiao <zhangqiao22@huawei.com> Date: Wed Sep 15 14:40:30 2021 +0800 kernel/sched: Fix sched_fork() access an invalid sched_task_group There is a small race between copy_process() and sched_fork() where child->sched_task_group points to an already freed pointer. parent doing fork() \| someone moving the parent \| to another cgroup -------------------------------+------------------------------- copy_process() + dup_task_struct()<1> parent move to another cgroup, and free the old cgroup. <2> + sched_fork() + __set_task_cpu()<3> + task_fork_fair() + sched_slice()<4> In the worst case, this bug can lead to "use-after-free" and cause a panic as shown above: (1) the parent copies its sched_task_group to the child at <1>; (2) someone moves the parent to another cgroup and frees the old cgroup at <2>; (3) the sched_task_group and cfs_rq that belong to the old cgroup will be accessed at <3> and <4>, which causes a panic: [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0 [] Oops: 0000 [#1] SMP PTI [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1 [] RIP: 0010:sched_slice+0x84/0xc0 [] Call Trace: [] task_fork_fair+0x81/0x120 [] sched_fork+0x132/0x240 [] copy_process.part.5+0x675/0x20e0 [] ? __handle_mm_fault+0x63f/0x690 [] _do_fork+0xcd/0x3b0 [] do_syscall_64+0x5d/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x65/0xca [] RIP: 0033:0x7f04418cd7e1 Between cgroup_can_fork() and cgroup_post_fork(), the cgroup membership and thus sched_task_group can't change. So update child's sched_task_group at sched_post_fork() and move task_fork() and __set_task_cpu() (which access the sched_task_group) from sched_fork() to sched_post_fork(). 
Fixes: `8323f26ce3` ("sched: Fix race in task_group") Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:47 -05:00
Phil Auld	fbc84644bc	sched: Make struct sched_statistics independent of fair sched class Bugzilla: http://bugzilla.redhat.com/2020279 commit ceeadb83aea28372e54857bf88ab7e17af48ab7b Author: Yafang Shao <laoar.shao@gmail.com> Date: Sun Sep 5 14:35:41 2021 +0000 sched: Make struct sched_statistics independent of fair sched class If we want to use the schedstats facility to trace other sched classes, we should make it independent of fair sched class. The struct sched_statistics is the scheduler statistics of a task_struct or a task_group. So we can move it into struct task_struct and struct task_group to achieve the goal. After the patch, schedstats are organized as follows: struct task_struct { ... struct sched_entity se; struct sched_rt_entity rt; struct sched_dl_entity dl; ... struct sched_statistics stats; ... }; Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, suggested by Peter - struct sched_entity_stats { struct sched_entity se; struct sched_statistics stats; } __no_randomize_layout; Then with the se in a task_group, we can easily get the stats. The sched_statistics members may be frequently modified when schedstats is enabled. To avoid impacting unrelated data that may share a cacheline with them, the struct sched_statistics is defined as cacheline aligned. As this patch changes the core struct of the scheduler, I verified its performance impact on the scheduler with 'perf bench sched pipe', suggested by Mel. Below is the result, in which all the values are in usecs/op. Before After kernel.sched_schedstats=0 5.2~5.4 5.2~5.4 kernel.sched_schedstats=1 5.3~5.5 5.3~5.5 [These data differ slightly from the earlier version because my old test machine was destroyed, so I had to use a different test machine.] Almost no impact on the sched performance. No functional change. 
[lkp@intel.com: reported build failure in earlier version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:46 -05:00
Phil Auld	b3e5bde075	sched/fair: Add cfs bandwidth burst statistics Bugzilla: http://bugzilla.redhat.com/2020279 commit bcb1704a1ed2de580a46f28922e223a65f16e0f5 Author: Huaixin Chang <changhuaixin@linux.alibaba.com> Date: Mon Aug 30 11:22:14 2021 +0800 sched/fair: Add cfs bandwidth burst statistics Two new statistics are introduced to show the internal of burst feature and explain why burst helps or not. nr_bursts: number of periods bandwidth burst occurs burst_time: cumulative wall-time (in nanoseconds) that any cpus has used above quota in respective periods Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:46 -05:00
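One way the two statistics in the commit above could be accounted at each period refill is sketched below. This is a simplified userspace model, not the kernel implementation: the struct `cfs_bandwidth_model`, its field layout and `refill_period()` are assumptions made for illustration; only the names `nr_bursts` and `burst_time` and their meaning come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a bandwidth pool; all times are in nanoseconds. */
struct cfs_bandwidth_model {
	int64_t quota;        /* runtime granted per period */
	int64_t burst;        /* extra runtime that may be accumulated */
	int64_t runtime;      /* runtime currently available */
	int64_t runtime_snap; /* runtime available at the last refill */
	uint64_t nr_bursts;   /* periods in which usage exceeded quota */
	uint64_t burst_time;  /* cumulative time used above quota */
};

/* Refill at a period boundary and account any usage above quota. */
static void refill_period(struct cfs_bandwidth_model *b)
{
	int64_t cap = b->quota + b->burst;
	int64_t over;

	b->runtime += b->quota;

	/* (snapshot - remaining) is the usage; minus quota is the excess. */
	over = b->runtime_snap - b->runtime;
	if (over > 0) {
		b->burst_time += (uint64_t)over;
		b->nr_bursts++;
	}

	if (b->runtime > cap)
		b->runtime = cap;
	b->runtime_snap = b->runtime;
}
```

For example, with a quota of 100 and a burst of 50, a period that consumes 120 is counted as one burst with 20 of burst time, while a following period that consumes 80 leaves both counters unchanged.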
Phil Auld	dfabd0fc37	sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD Bugzilla: http://bugzilla.redhat.com/2020279 commit c33627e9a1143afb988fb98d917c4a2faa16f9d9 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Thu Aug 26 19:04:08 2021 +0200 sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD With PREEMPT_RT enabled all hrtimers callbacks will be invoked in softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD. During boot kthread_bind() is used for the creation of per-CPU threads and then hangs in wait_task_inactive() if the ksoftirqd is not yet up and running. The hang disappeared since commit `26c7295be0` ("kthread: Do not preempt current task if it is going to call schedule()") but enabling function trace on boot reliably leads to the freeze on boot behaviour again. The timer in wait_task_inactive() can not be directly used by a user interface to abuse it and create a mass wake up of several tasks at the same time leading to long sections with disabled interrupts. Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD. Switch the timer to HRTIMER_MODE_REL_HARD. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2021-12-13 16:07:45 -05:00

1596 Commits