JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz due to unrelated whitespace difference from
upstream.
commit 9e07d45c5210f5dd6701c00d55791983db7320fa
Author: Peter Zijlstra <peterz@infradead.org>
Date: Sat Nov 4 11:59:19 2023 +0100
sched/deadline: Collect sched_dl_entity initialization
Create a single function that initializes a sched_dl_entity.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit d6111cf45c5787282b2e20d77bdb6b28881d516a
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Tue Oct 31 11:12:01 2023 -0700
sched: Use WRITE_ONCE() for p->on_rq
Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side
matches with WRITE_ONCE().
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop
Signed-off-by: Phil Auld <pauld@redhat.com>
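A minimal sketch of the store/load pairing described above, assuming kernel context; the surrounding lines are illustrative, only the WRITE_ONCE()/READ_ONCE() discipline on p->on_rq is the point.
    /* writer side (scheduler), e.g. when the task is enqueued: */
    WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);

    /* lockless reader side (e.g. RCU-tasks): */
    if (READ_ONCE(p->on_rq))
            ;       /* task is, or is about to become, runnable */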
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 984ffb6a4366752c949f7b39640aecdce222607f
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Oct 20 12:35:33 2023 +0200
sched/fair: Remove SIS_PROP
SIS_UTIL seems to work well, let's remove the old thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-15622
commit b95303e0aeaf446b65169dd4142cacdaeb7d4c8b
Author: Barry Song <song.bao.hua@hisilicon.com>
Date: Thu Oct 19 11:33:21 2023 +0800
sched: Add cpus_share_resources API
Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.
On a machine with clusters, cpus_share_resources() will test whether
two CPUs are within the same cluster. On a non-cluster machine it
will behave the same as cpus_share_cache(). So we use "resources"
here for cache resources.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
Signed-off-by: Phil Auld <pauld@redhat.com>
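An illustrative usage sketch of the new API described above; pick_cpu(), the candidate mask and the two-pass loop are assumptions made for the example, only cpus_share_resources()/cpus_share_cache() are the interfaces the commit talks about.
    static int pick_cpu(int this_cpu, const struct cpumask *candidates)
    {
            int cpu;

            /* first pass: idle CPU sharing cluster-level resources */
            for_each_cpu(cpu, candidates) {
                    if (cpus_share_resources(this_cpu, cpu) && idle_cpu(cpu))
                            return cpu;
            }

            /* second pass: fall back to an idle CPU sharing the LLC */
            for_each_cpu(cpu, candidates) {
                    if (cpus_share_cache(this_cpu, cpu) && idle_cpu(cpu))
                            return cpu;
            }

            return this_cpu;
    }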
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit bc87127a45928de5fdf0ec39d7a86e1edd0e179e
Author: Yajun Deng <yajun.deng@linux.dev>
Date: Thu Jul 20 16:05:16 2023 +0800
sched/debug: Print 'tgid' in sched_show_task()
Multiple blocked tasks are printed when the system hangs. They may have
the same parent pid, but belong to different task groups.
Printing tgid lets users better know whether these tasks are from the same
task group or not.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 3eafe225995c67f8c179011ec2d6e4c12b32a53d
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date: Sun Aug 20 20:53:17 2023 +0800
sched/core: Refactor the task_flags check for worker sleeping in sched_submit_work()
Simplify the conditional logic for checking worker flags
by splitting the original compound `if` statement into
separate `if` and `else if` clauses.
This modification not only retains the previous functionality,
but also reduces a single `if` check, improving code clarity
and potentially enhancing performance.
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZOIMvURE99ZRAYEj@fedora
Signed-off-by: Phil Auld <pauld@redhat.com>
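A before/after sketch of the restructuring described above, in the shape of the sched_submit_work() flag check; written from the commit description, not the exact diff.
    /* before: compound test plus a nested branch */
    if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
            if (task_flags & PF_WQ_WORKER)
                    wq_worker_sleeping(tsk);
            else
                    io_wq_worker_sleeping(tsk);
    }

    /* after: flat if / else if, one check fewer */
    if (task_flags & PF_WQ_WORKER)
            wq_worker_sleeping(tsk);
    else if (task_flags & PF_IO_WORKER)
            io_wq_worker_sleeping(tsk);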
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit dc461c48deda8a2d243fbaf49e276d555eb833d8
Author: Liming Wu <liming.wu@jaguarmicro.com>
Date: Fri Aug 25 10:35:00 2023 +0800
sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug()
in_atomic_preempt_off() already gets called in schedule_debug() once,
which is the only caller of __schedule_bug().
Skip the second call within __schedule_bug(); it should always be true
at this point.
[ mingo: Clarified the changelog. ]
Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230825023501.1848-1-liming.wu@jaguarmicro.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit cff9b2332ab762b7e0586c793c431a8f2ea4db04
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date: Fri Sep 15 13:44:44 2023 -0400
kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52.
Fixes: 5f6130fa52 ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Minor fuzz in fair.c due to having RT merged,
specifically: ea622076b76f ("sched: Add support for lazy preemption")
commit e23edc86b09df655bf8963bbcb16647adc787395
Author: Ingo Molnar <mingo@kernel.org>
Date: Tue Sep 19 10:38:21 2023 +0200
sched/fair: Rename check_preempt_curr() to wakeup_preempt()
The name is a bit opaque - make it clear that this is about wakeup
preemption.
Also rename the ->check_preempt_curr() methods similarly.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 4ff34ad3d39377d9f6953f3606ccf611ce636767
Author: Uros Bizjak <ubizjak@gmail.com>
Date: Tue Feb 28 17:14:26 2023 +0100
sched/core: Use do-while instead of for loop in set_nr_if_polling()
Use equivalent do-while loop instead of infinite for loop.
There are no asm code changes.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230228161426.4508-1-ubizjak@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
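A sketch of the loop shape described above: a try_cmpxchg() retry loop expressed as do-while instead of an infinite for loop. The body of set_nr_if_polling() is reconstructed from the description and may differ slightly from the exact source.
    static bool set_nr_if_polling(struct task_struct *p)
    {
            struct thread_info *ti = task_thread_info(p);
            typeof(ti->flags) val = READ_ONCE(ti->flags);

            do {
                    if (!(val & _TIF_POLLING_NRFLAG))
                            return false;
                    if (val & _TIF_NEED_RESCHED)
                            return true;
            } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

            return true;
    }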
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit ab83f455f04df5b2f7c6d4de03b6d2eaeaa27b8a
Author: Peter Oskolkov <posk@google.com>
Date: Tue Mar 7 23:31:57 2023 -0800
sched: add WF_CURRENT_CPU and externise ttwu
Add WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.
In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 548796e2e70b44b4661fd7feee6eb239245ff1f8
Author: Cruz Zhao <CruzZhao@linux.alibaba.com>
Date: Thu Jun 29 12:02:04 2023 +0800
sched/core: introduce sched_core_idle_cpu()
With core scheduling, a new idle state is defined: force idle, in which
the CPU runs the idle task while nr_running is greater than zero.
If a cpu is in force idle state, idle_cpu() will return zero. This
result makes sense in some scenarios, e.g., load balance,
showacpu when dumping, and judging whether the RCU boost kthread is starving.
But it causes errors in other scenarios, e.g., tick_irq_exit():
When force idle, rq->curr == rq->idle but rq->nr_running > 0, so
idle_cpu() returns 0. In tick_irq_exit(), if idle_cpu() is 0,
tick_nohz_irq_exit() will not be called, and ts->idle_active will
not become 1 again after being cleared in tick_nohz_irq_enter().
With ts->idle_active stuck at 0 when it should be 1, ts->idle_sleeptime
is not updated in update_ts_time_stats(). As a result ts->idle_sleeptime
ends up less than the actual value, and finally the idle time reported
in /proc/stat is less than the actual value.
To solve this problem, we introduce sched_core_idle_cpu(), which
returns 1 when force idle. We audited all users of idle_cpu(), and
change idle_cpu() into sched_core_idle_cpu() in function
tick_irq_exit().
v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
function tick_irq_exit(). And modify the corresponding commit log.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com
Signed-off-by: Phil Auld <pauld@redhat.com>
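A sketch of the helper described above, written from the commit description rather than the exact diff; roughly what sched_core_idle_cpu() amounts to.
    int sched_core_idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            /* force idle: running the idle task while nr_running > 0 */
            if (sched_core_enabled(rq) && rq->curr == rq->idle)
                    return 1;

            return idle_cpu(cpu);
    }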
JIRA: https://issues.redhat.com/browse/RHEL-25535
commit 677ea015f231aa38b3972aa7be54ecd2637e99fd
Author: Josh Don <joshdon@google.com>
Date: Tue Jun 20 11:32:47 2023 -0700
sched: add throttled time stat for throttled children
We currently export the total throttled time for cgroups that are given
a bandwidth limit. This patch extends this accounting to also account
the total time that each child cgroup has been throttled.
This is useful to understand the degree to which children have been
affected by the throttling control. Children which are not runnable
during the entire throttled period, for example, will not show any
self-throttling time during this period.
Expose this in a new interface, 'cpu.stat.local', which is similar to
how non-hierarchical events are accounted in 'memory.events.local'.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed Nov 22 14:39:03 2023 +0100
sched/cpufreq: Rework schedutil governor performance estimation
The current method of taking uclamp hints into account when estimating the
target frequency can end up with a selected target frequency that is
higher than the uclamp hints even though there is no real need for it.
Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we lose important information when
it comes to selecting the target frequency, and we have to make some
assumptions which can't fit all cases.
Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.
effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:
- The minimum performance for this CPU; typically the capacity to handle
the deadline task and the interrupt pressure. But also uclamp_min
request when available.
- The maximum targeting performance for this CPU which reflects the
maximum level that we would like to not exceed. By default it will be
the CPU capacity but can be reduced because of some performance hints
set with uclamp. The value can be lower than actual utilization and/or
min performance level.
A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.
With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.
The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.
eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.
[ mingo: Refined the changelog & C comments. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
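An illustrative sketch (not the upstream code) of the min/max clamping idea described above: apply a cpufreq headroom to the actual utilization, do not target more than the maximum performance hint, but never go below the minimum. effective_perf() and the 1.25x headroom are assumptions made for the example; the real headroom policy lives in schedutil.
    static unsigned long effective_perf(unsigned long actual,
                                        unsigned long min_perf,
                                        unsigned long max_perf)
    {
            actual += actual >> 2;          /* illustrative ~1.25x headroom */

            if (actual < max_perf)          /* no need to run faster than needed */
                    max_perf = actual;

            return max(min_perf, max_perf); /* but honour the minimum performance */
    }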
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Mon Oct 9 12:36:16 2023 +0200
sched/topology: Consolidate and clean up access to a CPU's max compute capacity
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large smp (thx2): hackbench and tbench shows minor improvement (1%)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
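A before/after fragment of the access change described above (kernel context assumed, shown as two alternatives):
    /* before: max capacity cached in the runqueue */
    unsigned long cap = capacity_orig_of(cpu);      /* rq->cpu_capacity_orig */

    /* after: read it directly from the architecture hook */
    unsigned long cap = arch_scale_cpu_capacity(cpu);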
JIRA: https://issues.redhat.com/browse/RHEL-29020
commit 194600008d5c43b5a4ba98c4b81633397e34ffad
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Tue Nov 14 14:38:40 2023 -0500
sched/timers: Explain why idle task schedules out on remote timer enqueue
Trying to avoid that didn't bring much value after testing; add a comment
about this.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
commit 6b596e62ed9f90c4a97e68ae1f7b1af5beeb3c05
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 8 Sep 2023 18:22:51 +0200
sched: Provide rt_mutex specific scheduler helpers
With PREEMPT_RT there is a rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically what happens is:
    mutex_lock() /* really rt_mutex */
      ...
        __rt_mutex_slowlock_locked()
          task_blocks_on_rt_mutex()
            // enqueue current task as waiter
            // do PI chain walk
          rt_mutex_slowlock_block()
            schedule()
              sched_submit_work()
                ...
                  spin_lock() /* really rtlock */
                    ...
                      __rt_mutex_slowlock_locked()
                        task_blocks_on_rt_mutex()
                          // enqueue current task as waiter *AGAIN*
                          // *CONFUSION*
Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.
[[ basically Thomas' patch but with different naming and a few asserts
added ]]
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-5-bigeasy@linutronix.de
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
commit de1474b46d889ee0367f6e71d9adfeb0711e4a8d
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 8 Sep 2023 18:22:50 +0200
sched: Extract __schedule_loop()
There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-4-bigeasy@linutronix.de
Signed-off-by: Waiman Long <longman@redhat.com>
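A sketch of the extracted helper described above, written from the commit description; schedule() would call it with SM_NONE and the rtlock variant with SM_RTLOCK_WAIT.
    static __always_inline void __schedule_loop(unsigned int sched_mode)
    {
            do {
                    preempt_disable();
                    __schedule(sched_mode);
                    sched_preempt_enable_no_resched();
            } while (need_resched());
    }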
JIRA: https://issues.redhat.com/browse/RHEL-28616
commit 28bc55f654de49f6122c7475b01b5d5ef4bdf0d4
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 8 Sep 2023 18:22:48 +0200
sched: Constrain locks in sched_submit_work()
Even though sched_submit_work() is run from preemptible context,
it is discouraged to have it use blocking locks due to the recursion
potential.
Enforce this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-2-bigeasy@linutronix.de
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
Upstream Status: RHEL only
Revert linux-rt-devel specific commit ca66ec3b9994 ("sched/core:
Provide sched_rtmutex() and expose sched work helpers") to prepare for
the submission of upstream equivalent.
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28616
Upstream Status: RHEL only
Revert RHEL only commit ec180d083a ("sched/core: Add __always_inline
to schedule_loop()").
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
Conflicts: Context differences due to having RT patchset 6bc27040eb
("sched: Add support for lazy preemption"). A number of hunks skipped due to not
having af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") and related
commits in that series.
commit 0e34600ac9317dbe5f0a7bfaa3d7187d757572ed
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:52:49 2023 +0200
sched: Misc cleanups
Random remaining guard use...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit 6fb45460615358157a6d3c990e74f9c1395247e2
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:45:16 2023 +0200
sched: Simplify tg_set_cfs_bandwidth()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit fa614b4feb5a246474ac71b45e520a8ddefc809c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:41:09 2023 +0200
sched: Simplify sched_move_task()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit af7c5763f5e8bc1b3f827354a283ccaf6a8c8098
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:59:05 2023 +0200
sched: Simplify sched_rr_get_interval()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit 7a50f76674f8b6f4f30a1cec954179f10e20110c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:58:23 2023 +0200
sched: Simplify yield_to()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
Conflicts: Minor manual fixup needed due to having RHEL-only
patch 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()").
commit 92c2ec5bc1081e6bbbe172bcfb1a566ad7b4f809
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:57:35 2023 +0200
sched: Simplify sched_{set,get}affinity()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit febe162d4d9158cf2b5d48fdd440db7bb55dd622
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 16:54:54 2023 +0200
sched: Simplify syscalls
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29017
commit 94b548a15e8ec47dfbf6925bdfb64bb5657dce0c
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 9 20:52:55 2023 +0200
sched: Simplify set_user_nice()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
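A sketch of the guard() pattern used throughout the series above; example_lock and example_set() are illustrative names, guard(mutex) comes from <linux/cleanup.h>.
    #include <linux/cleanup.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(example_lock);

    static int example_set(int val)
    {
            guard(mutex)(&example_lock);    /* released on every return path */

            if (val < 0)
                    return -EINVAL;         /* no goto-based unlock label needed */

            /* ... update state while holding example_lock ... */
            return 0;
    }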
JIRA: https://issues.redhat.com/browse/RHEL-21440
Upstream Status: RHEL only
Since RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset
cpumasks in sched_setaffinity()"), an empty cpumask is used for resetting
a user requested CPU affinity set by a previous sched_setaffinity() call.
An error code of -ENODEV is returned for a successful reset. However,
this broke some test cases that only expect an error code of -EINVAL.
So another RHEL commit 90f7bb0c18 ("sched/core: Don't return -ENODEV
from sched_setaffinity()") was merged to return 0 in that case. Again,
this still broke some other test cases.
This patch restores the old behavior of always returning -EINVAL on an
empty cpumask, with the exception that 0 may still be returned if the
empty cpumask is caused by an input len parameter of 0, which is another
way of resetting the user requested CPU affinity that I had proposed to
upstream before.
Fixes: 90f7bb0c18 ("sched/core: Don't return -ENODEV from sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3388
JIRA: https://issues.redhat.com/browse/RHEL-16613
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3388/edit
Upstream Status: RHEL only
Tested: A unit test was run to verify that CPU affinity reset only
happened when the full mask was empty, not when any of the
out-of-bound bits were set.
RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()") enables the use of an empty cpumask to reset
a user-provided CPU affinity via sched_setaffinity(2) syscall. It will
return a value of -ENODEV to indicate a success in resetting the user
provided CPU affinity.
However, the current way of checking for an empty cpumask using
cpumask_empty() is not robust enough to avoid many false positives
leading to an erroneous return of -ENODEV, which can confuse user
applications and lead to incorrect behavior. For example, if the system
has 28 CPUs and bit 28 is set, it will treat the cpumask as empty.
Instead of cpumask_empty(), bitmap_empty() should be used to check if
all the bits in the cpumask_size() buffer are zero. This should avoid
many false positives. However, a false positive can still happen if the
bit set is outside the range allowed by cpumask_size(). So we need to
check the full user_mask buffer to see if it is really empty to avoid
any false positive. By doing so, there should be no need to return a
-ENODEV error code, which is a workaround to handle the false positives.
A value of 0 will be returned if the reset is successful, or -EINVAL
will be returned if the user-provided CPU affinity hasn't been properly
set by sched_setaffinity(2).
Fixes: 05fddaaaac ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-15489
Conflicts: Applied with fuzz due to not having mm_cid changes,
specifically 223baf9d17f2 ("sched: Fix performance regression
introduced by mm_cid")
commit 5ebde09d91707a4a9bec1e3d213e3c12ffde348f
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Thu Oct 12 17:00:03 2023 +0800
sched/core: Fix RQCF_ACT_SKIP leak
Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning.
This warning may be triggered in the following situations:
    CPU0                                      CPU1
    __schedule()
      *rq->clock_update_flags <<= 1;*         unregister_fair_sched_group()
      pick_next_task_fair+0x4a/0x410          destroy_cfs_bandwidth()
      newidle_balance+0x115/0x3e0             for_each_possible_cpu(i) *i=0*
      rq_unpin_lock(this_rq, rf)              __cfsb_csd_unthrottle()
      raw_spin_rq_unlock(this_rq)
                                              rq_lock(*CPU0_rq*, &rf)
                                              rq_clock_start_loop_update()
                                              rq->clock_update_flags & RQCF_ACT_SKIP <--
      raw_spin_rq_lock(this_rq)
The purpose of RQCF_ACT_SKIP is to skip the rq clock update, but that
update happens very early in __schedule() while we clear RQCF_*_SKIP
very late, causing the flag to span the gap above and trigger this
warning.
In __schedule() we can clear the RQCF_*_SKIP flag immediately
after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning,
and set rq->clock_update_flags to RQCF_UPDATED to avoid the
rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later.
Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
Reported-by: Igor Raits <igor.raits@gmail.com>
Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
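A fragment showing the ordering described above as it would sit inside __schedule(); illustrative, not the literal diff.
    rq_lock(rq, &rf);
    update_rq_clock(rq);
    rq->clock_update_flags = RQCF_UPDATED;  /* clear the *_SKIP bits right away */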
JIRA: https://issues.redhat.com/browse/RHEL-16613
Upstream Status: RHEL only
Tested: A unit test was run to verify that CPU affinity reset only
happened when the full mask was empty, not when any of the
out-of-bound bits were set.
RHEL commit 05fddaaaac ("sched/core: Use empty mask to reset cpumasks
in sched_setaffinity()") enables the use of an empty cpumask to reset
a user-provided CPU affinity via sched_setaffinity(2) syscall. It will
return a value of -ENODEV to indicate a success in resetting the user
provided CPU affinity.
However, the current way of checking for an empty cpumask using
cpumask_empty() is not robust enough to avoid many false positives
leading to an erroneous return of -ENODEV, which can confuse user
applications and lead to incorrect behavior. For example, if the system
has 28 CPUs and bit 28 is set, it will treat the cpumask as empty.
Instead of cpumask_empty(), bitmap_empty() should be used to check if
all the bits in the cpumask_size() buffer are zero. This should avoid
many false positives. However, a false positive can still happen if the
bit set is outside the range allowed by cpumask_size(). So we need to
check the full user_mask buffer to see if it is really empty to avoid
any false positive. By doing so, there should be no need to return a
-ENODEV error code, which is a workaround to handle the false positives.
A value of 0 will be returned if the reset is successful, or -EINVAL
will be returned if the user-provided CPU affinity hasn't been properly
set by sched_setaffinity(2).
Fixes: 05fddaaaac ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()")
Signed-off-by: Waiman Long <longman@redhat.com>
Conflicts:
fs/exec.c - We already have
33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"")
so don't add call to timens_on_fork back in
include/linux/mmzone.h - We already have
e6ad640bc404 ("mm: deduplicate cacheline padding code")
so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_)
mm/vmscan.c - The backport of
badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")
added an #include <linux/debugfs.h>. Keep it.
JIRA: https://issues.redhat.com/browse/RHEL-1848
commit bd74fdaea146029e4fa12c6de89adbe0779348a9
Author: Yu Zhao <yuzhao@google.com>
Date: Sun Sep 18 02:00:05 2022 -0600
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing in common with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1848
commit c959924b0dc53bf6252793f41480bc01b9792570
Author: Huang Ying <ying.huang@intel.com>
Date: Wed Jul 13 16:39:53 2022 +0800
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration
dependent. So in this patch, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit. If the
hint page fault latency of a page is less than the hot threshold, we will
try to promote the page, and the page is called the candidate promotion
page.
If the number of the candidate promotion pages in the statistics interval
is much more than the promotion rate limit, the hot threshold will be
decreased to reduce the number of the candidate promotion pages.
Otherwise, the hot threshold will be increased to increase the number of
the candidate promotion pages.
To make the above method work, in each statistics interval, the total
number of the pages to check (on which the hint page faults occur) and the
hot/cold distribution need to be stable. Because the page tables are
scanned linearly in NUMA balancing, but the hot/cold distribution usually
isn't uniform along the address space, the statistics interval should be
larger than the NUMA balancing scan period. So in the patch, the max scan
period is used as the statistics interval and it works well in our tests.
Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3029
JIRA: https://issues.redhat.com/browse/RHEL-1536
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2208016
Tested: With scheduler stress tests, cfs bandwidth tests and unit tests on
specific pieces (such as enabling WARN_DOUBLE_CLOCK etc), in addition to cki
and perf QE testing.
Updates and fixes from up to v6.5 for scheduler related code. This includes
a revert of one of the RT merge patches which is then re-applied in the form
it took when added to Linus's tree (see the "wait_task_inactive()" commits).
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3080
Linux has had tracepoints tied to IPI reception for a while, but none tied to IPI emission.
This series adds tracepoints to the actual codepath sending the IPIs, which makes it possible to trace and track sources of IPIs with Ftrace. This is very useful for setups where IPIs to certain CPUs are /mostly/ undesired and a source of unwanted interference (e.g. CPU isolation).
Bugzilla: https://bugzilla.redhat.com/2192613
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-5228
commit 7c182722a0a9447e31f9645de4f311e5bc59b480
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date: Sat, 19 Nov 2022 17:25:05 +0800
sched: Add helper nr_context_switches_cpu()
Add a function nr_context_switches_cpu() that returns number of context
switches since boot on the specified CPU. This information will be used
to diagnose RCU CPU stalls.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
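A sketch of the helper described above: the per-CPU analogue of nr_context_switches(), written from the commit description.
    u64 nr_context_switches_cpu(int cpu)
    {
            return cpu_rq(cpu)->nr_switches;
    }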
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")
commit 68e2d17c9eb311ab59aeb6d0c38aad8985fa2596
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed Mar 22 11:28:36 2023 +0100
trace: Add trace_ipi_send_cpu()
Because copying cpumasks around when targeting a single CPU is a bit
daft...
Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts: Need to modify __smp_call_single_queue_debug() too. It was
removed upstream by commit 1771257cb447 ("locking/csd_lock: Remove
added data from CSD lock debugging")
commit 68f4ff04dbada18dad79659c266a8e5e29e458cd
Author: Valentin Schneider <vschneid@redhat.com>
Date: Tue Mar 7 14:35:58 2023 +0000
sched, smp: Trace smp callback causing an IPI
Context
=======
The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
which so far has only been fed with NULL.
While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
struct layout (meaning their callback func can be accessed without caring
about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
attached to its struct. This means we need to check the type of a CSD
before eventually dereferencing its associated callback.
This isn't as trivial as it sounds: the CSD type is stored in
__call_single_node.u_flags, which get cleared right before the callback is
executed via csd_unlock(). This implies checking the CSD type before it is
enqueued on the call_single_queue, as the target CPU's queue can be flushed
before we get to sending an IPI.
Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
would need to have an additional argument to trickle down the invoked
function. This is somewhat silly, as the extra argument will always be
pushed down to the function even when nothing is being traced, which is
unnecessary overhead.
Changes
=======
send_call_function_single_ipi() is only used by smp.c, and is defined in
sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of
a CPU's idle task).
Split it into two parts: the scheduler bits remain in sched/core.c, and the
actual IPI emission is moved into smp.c. This lets us define an
__always_inline helper function that can take the related callback as
parameter without creating useless register pressure in the non-traced path
which only gains a (disabled) static branch.
Do the same thing for the multi IPI case.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts: context change due to missing commit ed29b0b4fd83
("io_uring: move to separate directory")
commit cc9cb0a71725aa8dd8d8f534a9b562bbf7981f75
Author: Valentin Schneider <vschneid@redhat.com>
Date: Tue Mar 7 14:35:53 2023 +0000
sched, smp: Trace IPIs sent via send_call_function_single_ipi()
send_call_function_single_ipi() is the thing that sends IPIs at the bottom
of smp_call_function*() via either generic_exec_single() or
smp_call_function_many_cond(). Give it an IPI-related tracepoint.
Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 96500560f0c73c71bca1b27536c6254fa0e8ce37
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Tue Jun 13 16:20:10 2023 +0800
sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
There is a double update_rq_clock() invocation:
    __balance_push_cpu_stop()
      update_rq_clock()
      __migrate_task()
        update_rq_clock()
Sadly select_fallback_rq() also needs update_rq_clock() for
__do_set_cpus_allowed(), it is not possible to remove the update from
__balance_push_cpu_stop(). So remove it from __migrate_task() and
ensure all callers of this function call update_rq_clock() prior to
calling it.
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit cab3ecaed5cdcc9c36a96874b4c45056a46ece45
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Tue Jun 13 16:20:09 2023 +0800
sched/core: Fixed missing rq clock update before calling set_rq_offline()
When using a cpufreq governor that uses
cpufreq_add_update_util_hook(), it is possible to trigger a missing
update_rq_clock() warning for the CPU hotplug path:
    rq_attach_root()
      set_rq_offline()
        rq_offline_rt()
          __disable_runtime()
            sched_rt_rq_enqueue()
              enqueue_top_rt_rq()
                cpufreq_update_util()
                  data->func(data, rq_clock(rq), flags)
Move update_rq_clock() from sched_cpu_deactivate() (one of it's
callers) into set_rq_offline() such that it covers all
set_rq_offline() usage.
Additionally change rq_attach_root() to use rq_lock_irqsave() so that
it will properly manage the runqueue clock flags.
Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 1c06918788e8ae6e69e4381a2806617312922524
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 16:39:07 2023 +0200
sched: Consider task_struct::saved_state in wait_task_inactive()
With the introduction of task_struct::saved_state in commit
5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
matching the task state has gotten more complicated. That same commit
changed try_to_wake_up() to consider both states, but
wait_task_inactive() has been neglected.
Sebastian noted that the wait_task_inactive() usage in
ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
the tasklist_lock after it sets TASK_TRACED.
Therefore extract a common helper from ttwu_state_match() and use that
to teach wait_task_inactive() about the PREEMPT_RT locks.
Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit d5e1586617be7093ea3419e3fa9387ed833cdbb1
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 2 10:42:53 2023 +0200
sched: Unconditionally use full-fat wait_task_inactive()
While modifying wait_task_inactive() for PREEMPT_RT; the build robot
noted that UP got broken. This led to audit and consideration of the
UP implementation of wait_task_inactive().
It looks like the UP implementation is also broken for PREEMPT;
consider task_current_syscall() getting preempted between the two
calls to wait_task_inactive().
Therefore move the wait_task_inactive() implementation out of
CONFIG_SMP and unconditionally use it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: This was applied out of order with f9fc8cad9728 ("sched:
Add TASK_ANY for wait_task_inactive()") so adjusted code to match
what the results should have been.
commit 9204a97f7ae862fc8a3330ec8335917534c3fb63
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon Aug 22 13:18:19 2022 +0200
sched: Change wait_task_inactive()s match_state
Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
That is, instead of an equal comparison, use it as a mask. This allows
matching multiple block conditions.
(removes the unlikely; it doesn't make sense how it's only part of the
condition)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org
Signed-off-by: Phil Auld <pauld@redhat.com>
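A before/after fragment of the comparison change described above; TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE is just an example mask.
    bool matched;

    /* before: exact state comparison */
    matched = (READ_ONCE(p->__state) == match_state);

    /* after: match_state is a mask, so several block states can match, e.g.
     * match_state = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE */
    matched = (READ_ONCE(p->__state) & match_state);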
JIRA: https://issues.redhat.com/browse/RHEL-1536
Upstream status: RHEL only
Conflicts: A later patch renamed task_running() to task_on_cpu() so this
did not revert cleanly. In addition match_state does not need to be checked
for 0 due to f9fc8cad9728 sched: Add TASK_ANY for wait_task_inactive().
This reverts commit 3673cc2e61.
This is commit a015745ca41f from the RT tree merge. It will be re-applied in
the form it was in when merged to Linus' tree as 1c06918788.
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 9b8e17813aeccc29c2f9f2e6e68997a6eac2d26d
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date: Tue Apr 11 22:26:41 2023 -0700
sched/core: Make sched_dynamic_mutex static
The sched_dynamic_mutex is only used within the file. Make it static.
Fixes: e3ff7c609f39 ("livepatch,sched: Add livepatch task switching to cond_resched()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/oe-kbuild-all/202304062335.tNuUjgsl-lkp@intel.com/
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit eff6c8ce8d4d7faef75f66614dd20bb50595d261
Author: wuchi <wuchi.zero@gmail.com>
Date: Tue Mar 21 14:44:59 2023 +0800
sched/core: Reduce cost of sched_move_task when config autogroup
Some sched_move_task calls are useless because that
task_struct->sched_task_group maybe not changed (equals task_group
of cpu_cgroup) when system enable autogroup. So do some checks in
sched_move_task.
sched_move_task eg:
task A belongs to cpu_cgroup0 and autogroup0, it will always belong
to cpu_cgroup0 when do_exit. So there is no need to do {de|en}queue.
The call graph is as follow.
    do_exit
      sched_autogroup_exit_task
        sched_move_task
          dequeue_task
          sched_change_group
            A.sched_task_group = sched_get_task_group (=cpu_cgroup0)
          enqueue_task
Performance results:
===========================
1. env
cpu: bogomips=4600.00
kernel: 6.3.0-rc3
cpu_cgroup: 6:cpu,cpuacct:/user.slice
2. cmds
do_exit script:
    for i in {0..10000}; do
        sleep 0 &
    done
    wait
Run the above script, then use the following bpftrace cmd to get
the cost of sched_move_task:
bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; }
kr:sched_move_task /@ts[tid]/
{ @ns += nsecs - @ts[tid]; delete(@ts[tid]); }'
3. cost time(ns):
without patch: 43528033
with patch: 18541416
diff:-24986617 -57.4%
As the results show, the patch saves 57.4% in this scenario.
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
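A sketch of the early bail-out described above; sched_get_task_group() is the helper named in the call graph, the rest of the body is illustrative.
    void sched_move_task(struct task_struct *tsk)
    {
            struct task_group *group = sched_get_task_group(tsk);

            /* autogroup case: group unchanged, skip the dequeue/enqueue cycle */
            if (group == tsk->sched_task_group)
                    return;

            /* ... existing dequeue_task() / sched_change_group() / enqueue_task() ... */
    }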
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 530bfad1d53d103f98cec66a3e491a36d397884d
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Thu Mar 16 16:18:06 2023 +0800
sched/core: Avoid selecting the task that is throttled to run when core-sched enable
When an {rt, cfs}_rq or a dl task is throttled, cookied tasks are not
dequeued from the core tree, so sched_core_find() and
sched_core_next() may return a throttled task, which may
cause a throttled task to run on the CPU.
So we add checks in sched_core_find() and sched_core_next()
to make sure that the returned task is runnable and is
not throttled.
Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1536
commit 6015b1aca1a233379625385feb01dd014aca60b5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Mar 14 19:32:38 2023 -0700
sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good. It is indeed the allocation size of a
cpumask.
But the code also assumes that the whole allocation is initialized
without actually doing so itself. That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.
See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.
Fix this by just zeroing the allocation before using it. Do the same
for the compat version of sched_getaffinity(), which had the same logic.
Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits. For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'. The compat case already did that.
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
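A fragment of one way to implement the fix described above inside the sched_getaffinity() syscall: use a zeroing allocator so the bytes beyond nr_cpu_ids that get copied to userspace are guaranteed to be clear. Not necessarily the exact upstream diff.
    cpumask_var_t mask;

    if (!zalloc_cpumask_var(&mask, GFP_KERNEL))     /* zeroed allocation */
            return -ENOMEM;

    ret = sched_getaffinity(pid, mask);             /* fills nr_cpu_ids bits only */
    if (ret == 0) {
            unsigned int retlen = min(len, cpumask_size());

            if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen))
                    ret = -EFAULT;
            else
                    ret = retlen;
    }
    free_cpumask_var(mask);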
JIRA: https://issues.redhat.com/browse/RHEL-1536
Conflicts: Minor fixup due to already having 8df1947c71 ("livepatch:
Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure")
commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date: Fri Feb 24 08:50:00 2023 -0800
livepatch,sched: Add livepatch task switching to cond_resched()
There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.
Fix it by patching tasks in cond_resched().
There are four different flavors of cond_resched(), depending on the
kernel configuration. Hook into all of them.
A more elegant solution might be to use a preempt notifier. However,
non-ORC unwinders can't unwind a preempted task reliably.
[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
Conflicts: minor fuzz due to context.
commit 88c56cfeaec4642aee8aac58b38d5708c6aae0d3
Author: Phil Auld <pauld@redhat.com>
Date: Wed Jul 12 09:33:57 2023 -0400
sched/fair: Block nohz tick_stop when cfs bandwidth in use
CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.
Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.
Add a check in pick_next_task_fair() as well, since that is where
we have a handle on the task that is actually going to be running.
Add a check in sched_can_stop_tick() to cover edge cases such as
nr_running going from 2->1 while the remaining task stays running.
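For reference, the "runtime limit" here is the CFS quota configured through the cgroup interface. A minimal sketch of putting such a quota on a cgroup v2 group, assuming a v2 hierarchy mounted at /sys/fs/cgroup and an existing child group named "demo" (both assumptions for illustration):
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* "quota_us period_us": allow 20ms of CPU time per 100ms period. */
          const char *path = "/sys/fs/cgroup/demo/cpu.max";
          const char *limit = "20000 100000\n";
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open cpu.max");
                  return 1;
          }
          if (write(fd, limit, strlen(limit)) < 0)
                  perror("write cpu.max");
          close(fd);
          return 0;
  }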
Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2208016
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
commit c98c18270be115678f4295b10a5af5dcc9c4efa0
Author: Phil Auld <pauld@redhat.com>
Date: Fri Jul 14 08:57:46 2023 -0400
sched, cgroup: Restore meaning to hierarchical_quota
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min. It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require a child's quota to be less than or
equal to that of its ancestors, the task group will still be
constrained by such a quota, so this should be shown here. Cgroupv1
continues to set this correctly.
In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.
Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2962
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.
One way to solve this problem is to provide a way to back off from
that user provided CPU affinity. This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.
Before this patch, passing in an empty cpumask to sched_setaffinity(2)
would always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2) has been
called before to set up user_cpus_ptr. In this case, the user_cpus_ptr
that stores the user provided affinity will be cleared and the task's
CPU affinity will be reset to that of the current cpuset. The
alternative error code of -ENODEV signals both that no CPU is specified
and that, as a side effect, the task's CPU affinity has been reset to
the cpuset default.
If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before. Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.
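A small user-space probe of the behavior described above (a sketch, not part of the patch; it assumes CPU 0 is present and allowed by the current cpuset, and the -ENODEV result is specific to this RHEL-only change):
  #define _GNU_SOURCE
  #include <errno.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          cpu_set_t set;

          /* Set an explicit affinity first so user_cpus_ptr gets populated. */
          CPU_ZERO(&set);
          CPU_SET(0, &set);
          if (sched_setaffinity(0, sizeof(set), &set))
                  perror("pin to CPU 0");

          /* Now pass an empty mask: with this change we expect ENODEV and a
           * reset to the cpuset default; on other kernels, EINVAL. */
          CPU_ZERO(&set);
          if (sched_setaffinity(0, sizeof(set), &set))
                  printf("empty mask: %s\n", strerror(errno));
          else
                  printf("empty mask unexpectedly succeeded\n");
          return 0;
  }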
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2957
OpenShift requires support for disabling CPU load balancing for the Telco
use cases, and this is a gating factor in determining whether it can switch
to cgroup v2 as the default.
The current RHEL9 kernel is able to create an isolated cpuset partition
of exclusive CPUs with load balancing disabled for cgroup v2. However,
it currently has the limitation that isolated cpuset partitions can
only be formed clustered around the cgroup root. That doesn't fit the
current OpenShift use case where systemd is primarily responsible for
managing the cgroup filesystem and OpenShift can only manage child
cgroups further away from the cgroup root.
To address the need of OpenShift, a patch series [1] has been proposed
upstream to extend the v2 cpuset partition semantics to allow the
creation of isolated partitions further away from cgroup root by adding a
new cpuset control file "cpuset.cpus.exclusive" to distribute potential
exclusive CPUs down the cgroup hierarchy for the creation of isolated
cpuset partition.
This MR incorporates the proposed upstream patches along with their
dependency patches to provide a way for OpenShift to move forward with
switching the default cgroup from v1 to v2 for the 4.14 release.
The last 6 patches are the proposed upstream patches; the rest have
been merged upstream, either in the mainline or the cgroup maintainer's
tree.
[1] https://lore.kernel.org/lkml/20230817132454.755459-1-longman@redhat.com/
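Assuming the proposed interface lands as described, setting up an isolated child partition might look roughly like the sketch below; the mount point, the group name "demo" and the CPU range are illustrative assumptions, not part of the series:
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int write_str(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);

          if (fd < 0 || write(fd, val, strlen(val)) < 0) {
                  perror(path);
                  if (fd >= 0)
                          close(fd);
                  return -1;
          }
          close(fd);
          return 0;
  }

  int main(void)
  {
          /* Hand CPUs 2-3 to the child as exclusive, then make it an
           * isolated partition (no load balancing inside it). */
          write_str("/sys/fs/cgroup/demo/cpuset.cpus", "2-3");
          write_str("/sys/fs/cgroup/demo/cpuset.cpus.exclusive", "2-3");
          write_str("/sys/fs/cgroup/demo/cpuset.cpus.partition", "isolated");
          return 0;
  }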
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
commit 2ef269ef1ac006acf974793d975539244d77b28f
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 8 May 2023 09:58:54 +0200
cgroup/cpuset: Free DL BW in case can_attach() fails
cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
have been checked. DL BW is not allocated per-task but as a sum over
all DL tasks migrating.
If multiple controllers are attached to the cgroup next to the cpuset
controller a non-cpuset can_attach() can fail. In this case free DL BW
in cpuset_cancel_attach().
Finally, update cpuset DL task count (nr_deadline_tasks) only in
cpuset_attach().
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
commit 85989106feb734437e2d598b639991b9185a43a6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 8 May 2023 09:58:53 +0200
sched/deadline: Create DL BW alloc, free & check overflow interface
While moving a set of tasks between exclusive cpusets,
cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for
DL BW overflow checking and per-task DL BW allocation on the destination
root_domain for the DL tasks in this set.
This approach has the issue of not freeing already allocated DL BW in
the following error cases:
(1) The set of tasks includes multiple DL tasks and DL BW overflow
checking fails for one of the subsequent DL tasks.
(2) Another controller next to the cpuset controller which is attached
to the same cgroup fails in its can_attach().
To address this problem rework dl_cpu_busy():
(1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a
dedicated dl_bw_free().
(2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of
the `struct task_struct *p` used in dl_cpu_busy(). This allows
allocating DL BW for a set of tasks as well, rather than only for a
single task.
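The bandwidth being allocated and freed here is each task's runtime/period ratio. As a point of reference, a minimal user-space sketch of a task that consumes DL bandwidth via sched_setattr(2); the struct layout is spelled out by hand since glibc has no wrapper, and the runtime/deadline/period values are arbitrary for illustration:
  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  struct sched_attr {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;
          uint64_t sched_deadline;
          uint64_t sched_period;
  };

  #ifndef SCHED_DEADLINE
  #define SCHED_DEADLINE 6
  #endif

  int main(void)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.sched_policy   = SCHED_DEADLINE;
          attr.sched_runtime  = 10ULL * 1000 * 1000;   /* 10ms of budget... */
          attr.sched_deadline = 100ULL * 1000 * 1000;  /* ...every 100ms    */
          attr.sched_period   = 100ULL * 1000 * 1000;

          /* Admission control charges runtime/period against the root
           * domain's DL bandwidth and fails with EBUSY when exhausted. */
          if (syscall(SYS_sched_setattr, 0, &attr, 0))
                  perror("sched_setattr");
          else
                  printf("admitted as SCHED_DEADLINE\n");
          return 0;
  }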
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174568
commit 111cd11bbc54850f24191c52ff217da88a5e639b
Author: Juri Lelli <juri.lelli@redhat.com>
Date: Mon, 8 May 2023 09:58:50 +0200
sched/cpuset: Bring back cpuset_mutex
Turns out percpu_cpuset_rwsem - commit 1243dc518c ("cgroup/cpuset:
Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
as it has been reported to cause slowdowns in workloads that need to
change cpuset configuration frequently and it is also not implementing
priority inheritance (which causes troubles with realtime workloads).
Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
only for SCHED_DEADLINE tasks (other policies don't care about stable
cpusets anyway).
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232098
Upstream Status: RHEL only
Without __always_inline, this function breaks wchan.
schedule_loop() was added by patches from the upstream RT tree; a respin
of the patches for upstream has __always_inline.
Signed-off-by: Crystal Wood <swood@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219681
Upstream Status: RHEL only
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), user provided CPU affinity via sched_setaffinity(2) is
preserved even if the task is being moved to a different cpuset. However,
that affinity is also being inherited by any subsequently created child
processes which may not want or be aware of that affinity.
One way to solve this problem is to provide a way to back off from
that user provided CPU affinity. This patch implements such a scheme
by using an empty cpumask to signal a reset of the cpumasks to the
default as allowed by the current cpuset.
Before this patch, passing in an empty cpumask to sched_setaffinity(2)
would always return an -EINVAL error. With this patch, an alternative
error of -ENODEV will be returned if sched_setaffinity(2) has been
called before to set up user_cpus_ptr. In this case, the user_cpus_ptr
that stores the user provided affinity will be cleared and the task's
CPU affinity will be reset to that of the current cpuset. The
alternative error code of -ENODEV signals both that no CPU is specified
and that, as a side effect, the task's CPU affinity has been reset to
the cpuset default.
If sched_setaffinity(2) has not been called previously, an EINVAL error
will be returned with an empty cpumask just like before. Tests or
tools that rely on the behavior that an empty cpumask will return an
error code will not be affected.
We will have to update the sched_setaffinity(2) manpage to document
this possible side effect of passing in an empty cpumask.
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218724
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
commit ca66ec3b9994e5f82b433697e37512f7d28b6d22
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Thu Apr 27 13:19:34 2023 +0200
sched/core: Provide sched_rtmutex() and expose sched work helpers
schedule() invokes sched_submit_work() before scheduling and
sched_update_worker() afterwards to ensure that queued block requests are
flushed and the (IO)worker machineries can instantiate new workers if
required. This avoids deadlocks and starvation.
With rt_mutexes this can lead to a subtle problem:
When an rtmutex blocks, current::pi_blocked_on points to the rtmutex it
blocks on. When one of the functions in sched_submit/resume_work()
contends on an rtmutex-based lock, then that would corrupt
current::pi_blocked_on.
Make it possible to let rtmutex issue the calls outside of the slowpath,
i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
- Exposing sched_submit_work() and moving the task_running() condition
into schedule()
- Renaming sched_update_worker() to sched_resume_work() and exposing it
too.
- Providing sched_rtmutex() which just does the inner loop of scheduling
until need_resched() is no longer set. Split out the loop so this does
not create yet another copy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Crystal Wood <swood@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit 2500ad1c7fa42ad734677853961a3a8bec0772c5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri Apr 29 08:43:34 2022 -0500
ptrace: Don't change __state
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
command is executing.
Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
implement a new jobctl flag TASK_PTRACE_FROZEN. This new flag is set
in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
jobctl_unfreeze_task (when ptrace_stop remains asleep).
In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
when the wake up is for a fatal signal. Skip adding __TASK_TRACED
when TASK_PTRACE_FROZEN is not set. This has the same effect as
changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
TASK_KILLABLE go through signal_wake_up.
Handle a ptrace_stop being called with a pending fatal signal.
Previously it would have been handled by schedule simply failing to
sleep. As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
will sleep with a fatal_signal_pending. The code in signal_wake_up
guarantees that the code will be awakened by any fatal signal that
comes after TASK_TRACED is set.
Previously the __state value of __TASK_TRACED was changed to
TASK_RUNNING when woken up or back to TASK_TRACED when the code was
left in ptrace_stop. Now ptrace_stop clears JOBCTL_PTRACE_FROZEN when
woken up, and ptrace_unfreeze_traced clears JOBCTL_PTRACE_FROZEN when
the task is left sleeping.
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375
# Merge Request Required Information
## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.
## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
This is actually just an optimization, and it has non-trivial conflicts
which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
This fix is incorrectly tagged. The code that it applies to is not present in our tree.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2392
JIRA: https://issues.redhat.com/browse/RHEL-282
Tested: With scheduler stress tests. Perf QE is running performance regression tests.
Update the kernel's core scheduler and related code with fixes and minor changes from
the upstream kernel. This will sync up to roughly linux v6.3-rc6. Added a couple of
cpumask things which fit better here.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2337
JIRA: https://issues.redhat.com/browse/RHEL-310
Tested: scheduler stress tests.
This is a collection of commits that update (mostly)
the uclamp code in the scheduler. We don't have
CONFIG_UCLAMP_TASK enabled right now but we might in
the future. We do, though, have EAS enabled, and this helps
keep the code in sync to reduce issues with other patches.
It's broken out of the main scheduler update for 9.3 to
keep it contained and make the other MR smaller.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2325
JIRA: https://issues.redhat.com/browse/RHEL-311
Tested: Enabled PSI and ran various stress tests.
Updates and bug fixes for the PSI subsystem. This brings
the code up to about v6.3-rc1. It does not include the
runtime enablement interface (34f26a15611 "sched/psi: Per-cgroup
PSI accounting disable/re-enable interface") that required a larger
set of cgroup and kernfs patches. That may be taken later if the
prerequisites are provided.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
commit ed29b0b4fd835b058ddd151c49d021e28d631ee6
Author: Jens Axboe <axboe@kernel.dk>
Date: Mon May 23 17:05:03 2022 -0600
io_uring: move to separate directory
In preparation for splitting io_uring up a bit, move it into its own
top level directory. It didn't really belong in fs/ anyway, as it's
not a file system only API.
This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2276
Bugzilla: https://bugzilla.redhat.com/1996625
commit 7fb3ff22ad8772bbf0e3ce1ef3eb7b09f431807f
Author: Yair Podemsky <ypodemsk@redhat.com>
Date: Wed Nov 30 14:51:21 2022 +0200
sched/core: Fix arch_scale_freq_tick() on tickless systems
In order for the scheduler to be frequency invariant we measure the
ratio between the maximum CPU frequency and the actual CPU frequency.
During long tickless periods of time the calculations that keep track
of that might overflow, in the function scale_freq_tick():
if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
goto error;
eventually forcing the kernel to disable the feature for all CPUs,
and show the warning message:
"Scheduler frequency invariance went wobbly, disabling!".
Let's avoid that by limiting the frequency invariant calculations
to CPUs with regular tick.
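For illustration, a stand-alone approximation of the shift-overflow check quoted above (a stand-in for check_shl_overflow(), not the kernel implementation):
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Report whether a << shift loses high bits in 64-bit arithmetic. */
  static bool shl_overflows_u64(uint64_t a, unsigned int shift, uint64_t *res)
  {
          if (shift >= 64) {
                  *res = 0;
                  return a != 0;
          }
          *res = a << shift;
          return shift != 0 && (a >> (64 - shift)) != 0;
  }

  int main(void)
  {
          /* A counter delta grown huge by a long tickless stretch. */
          uint64_t acnt = UINT64_MAX / 8;
          uint64_t scaled;

          if (shl_overflows_u64(acnt, 2 * 10 /* 2*SCHED_CAPACITY_SHIFT */, &scaled))
                  printf("overflow: would disable frequency invariance\n");
          else
                  printf("scaled value: %llu\n", (unsigned long long)scaled);
          return 0;
  }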
Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit a53ce18cacb477dd0513c607f187d16f0fa96f71
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Mar 17 17:08:10 2023 +0100
sched/fair: Sanitize vruntime of entity being migrated
Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflow bug, but ignores a case where se->exec_start is reset
after a migration.
To fix this case, we delay the reset of se->exec_start until after
placing the entity, which uses se->exec_start to detect a long sleeping task.
In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.
Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 585463f0d58aa4d29b744c7c53b222b8028de87f
Author: Valentin Schneider <vschneid@redhat.com>
Date: Mon Oct 3 16:34:20 2022 +0100
sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
This removes the second use of the sched_core_mask temporary mask.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit df14b7f9efcda35e59bb6f50351aac25c50f6e24
Author: Waiman Long <longman@redhat.com>
Date: Fri Feb 3 13:18:49 2023 -0500
sched/core: Fix a missed update of user_cpus_ptr
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), a successful call to sched_setaffinity() should always save
the user requested cpu affinity mask in a task's user_cpus_ptr. However,
when the given cpu mask is the same as the current one, user_cpus_ptr
is not updated. Fix this by saving the user mask in this case too.
Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 5657c116783545fb49cd7004994c187128552b12
Author: Waiman Long <longman@redhat.com>
Date: Sun Jan 15 14:31:22 2023 -0500
sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
The kernel commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in
do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP
configs. Calling sched_setaffinity() on such a uniprocessor kernel will
cause cpumask_copy() to be called with a NULL pointer leading to general
protection fault. This is not really a problem in real use cases as
there aren't that many uniprocessor kernel configs in use and calling
sched_setaffinity() on such a uniprocessor system doesn't make sense.
Fix this problem by making sure cpumask_copy() will not be called in
such a case.
Fixes: 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()")
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 160fb0d83f206b3429fc495864a022110f9e4978
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Fri Dec 23 18:32:57 2022 +0800
sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
ttwu_do_activate() is used for a complete wakeup, in which we
activate_task() and use ttwu_do_wakeup() to mark the task runnable
and perform wakeup-preemption, and also call the class->task_woken()
callback and update rq->idle_stamp.
Since ttwu_runnable() is not a complete wakeup, it doesn't need all of
what ttwu_do_wakeup() does, so we can move that work into
ttwu_do_activate() to simplify ttwu_do_wakeup(), making it only mark
the task runnable so it can be reused in ttwu_runnable() and
try_to_wake_up().
This patch should not have any functional changes.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit efe09385864f3441c71711f91e621992f9423c01
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Fri Dec 23 18:32:56 2022 +0800
sched/core: Micro-optimize ttwu_runnable()
ttwu_runnable() is used as a fast wakeup path when the wakee task
is running on a CPU or runnable on an RQ; in both cases we can just
set its state to TASK_RUNNING to prevent a sleep.
If the wakee task is on_cpu running, we don't need to update_rq_clock()
or check_preempt_curr().
But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before
the task got to schedule() and the task has been preempted), we should
check_preempt_curr() to see if it can preempt the currently running task.
This also removes the class->task_woken() callback from ttwu_runnable(),
which wasn't required per the RT/DL implementations: any required push
operation would have been queued during class->set_next_task() when p
got preempted.
ttwu_runnable() also loses the update to rq->idle_stamp, as by definition
the rq cannot be idle in this scenario.
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 904cbab71dda1689d41a240541179f21ff433c40
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Dec 12 14:49:46 2022 +0000
sched: Make const-safe
With a modified container_of() that preserves constness, the compiler
finds some pointers which should have been marked as const. task_of()
also needs to become const-preserving for the !FAIR_GROUP_SCHED case so
that cfs_rq_of() can take a const argument. No change to generated code.
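A compileable sketch of the idea, using C11 _Generic plus GNU typeof to build a const-preserving container_of; this illustrates the approach rather than reproducing the kernel's exact macro:
  #include <stddef.h>
  #include <stdio.h>

  /* Plain container_of(): always hands back a non-const pointer, even
   * when given a pointer to a const member. */
  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  /* Const-preserving variant: a pointer-to-const member yields a
   * pointer-to-const container. */
  #define container_of_const(ptr, type, member)                          \
          _Generic((ptr),                                                \
                  const typeof(*(ptr)) *:                                \
                          ((const type *)((const char *)(ptr) -          \
                                          offsetof(type, member))),      \
                  default: container_of(ptr, type, member))

  struct entity { int weight; };
  struct task_demo { int pid; struct entity se; };

  int main(void)
  {
          struct task_demo t = { .pid = 42, .se = { .weight = 1024 } };
          const struct entity *se = &t.se;

          /* The const qualifier of 'se' is preserved in the result. */
          const struct task_demo *tp = container_of_const(se, struct task_demo, se);

          printf("pid=%d weight=%d\n", tp->pid, se->weight);
          return 0;
  }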
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Author: Tianchen Ding <dtcccc@linux.alibaba.com>
Date: Fri Nov 4 10:36:01 2022 +0800
sched: Clear ttwu_pending after enqueue_task()
We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task is enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
wakelist if wakee cpu is idle") because it tends to use wakelist.
Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).
Latency percentiles (usec):
                  base   base+revert_f3dd3f674555   base+this_patch
  50.0000th:         9                          13                 9
  75.0000th:        12                          19                12
  90.0000th:        15                          22                15
  95.0000th:        18                          24                17
 *99.0000th:        27                          31                24
  99.5000th:      3364                          33                27
  99.9000th:     12560                          36                30
We also tested on unixbench and hackbench, and saw no performance
change.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit c59862f8265f8060b6650ee1dc12159fe5c89779
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu Aug 25 14:27:24 2022 +0200
sched/fair: Cleanup loop_max and loop_break
sched_nr_migrate_break is set to a fixed value and never changes, so we can
replace it with a define, SCHED_NR_MIGRATE_BREAK.
Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the initial value
of sysctl_sched_nr_migrate, which can be initialized to different values.
Then, use SCHED_NR_MIGRATE_BREAK to initialize sysctl_sched_nr_migrate.
The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
through debugfs.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts: Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").
commit f9fc8cad9728124cefe8844fb53d1814c92c6bfc
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:39:55 2022 +0200
sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like
ttwu()) it is possible to replace the special !match_state case with
an 'all-states' value such that any blocked state will match.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
Conflicts: Context differences caused by having PREEMPT_RT
merged, specifically a015745ca41f ("sched: Consider
task_struct::saved_state in wait_task_inactive()").
commit 0b9d46fc5ef7a457cc635b30b010081228cb81ac
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:33:04 2022 +0200
sched: Rename task_running() to task_on_cpu()
There is some ambiguity about task_running() in that it is unrelated
to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
task_on_cpu().
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit df16b71c686cb096774e30153c9ce6756450796c
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Thu Aug 18 20:48:03 2022 +0800
sched/fair: Allow changing cgroup of new forked task
commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
introduced a TASK_NEW state and an unnecessary limitation that would fail
when changing the cgroup of a newly forked task.
At that time, task_change_group_fair() could not handle a newly forked
fair task that had not yet been woken up by wake_up_new_task(), which
would cause a detach-on-an-unattached-task sched_avg problem.
This patch removes this unnecessary limitation by adding a check before
doing detach or attach in task_change_group_fair().
So cpu_cgrp_subsys.can_attach() has nothing to do for fair tasks, and is
only defined under #ifdef CONFIG_RT_GROUP_SCHED.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-8-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 39c4261191bf05e7eb310f852980a6d0afe5582a
Author: Chengming Zhou <zhouchengming@bytedance.com>
Date: Thu Aug 18 20:47:58 2022 +0800
sched/fair: Remove redundant cpu_cgrp_subsys->fork()
We use cpu_cgrp_subsys->fork() to set the task group for the new fair
task in cgroup_post_fork().
Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
already does set_task_rq() for the new fair task in sched_cgroup_fork(),
cpu_cgrp_subsys->fork() can be removed.
  cgroup_can_fork()            --> pin parent's sched_task_group
  sched_cgroup_fork()
    __set_task_cpu()
      set_task_rq()
  cgroup_post_fork()
    ss->fork() := cpu_cgroup_fork()
      sched_change_group(..., TASK_SET_GROUP)
        task_set_group_fair()
          set_task_rq()        --> can be removed
After this patch's change, task_change_group_fair() only needs to
care about task cgroup migration, making the code much simpler.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 09348d75a6ce60eec85c86dd0ab7babc4db3caf6
Author: Ingo Molnar <mingo@kernel.org>
Date: Thu Aug 11 08:54:52 2022 +0200
sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(),
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.
By using a WARN_ON_ONCE() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON_ONCE().
There's one exception: WARN_ON_ONCE() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON_ONCE(), such as in:
- BUG_ON(!lock_task_sighand(p, &flags));
+ if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+ return;
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 18c31c9711a90b48a77b78afb65012d9feec444c
Author: Bing Huang <huangbing@kylinos.cn>
Date: Sat Jul 23 05:36:09 2022 +0800
sched/fair: Make per-cpu cpumasks static
The load_balance_mask and select_rq_mask percpu variables are only used in
kernel/sched/fair.c.
Make them static and move their allocation into init_sched_fair_class().
Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the
CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask
allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL
class (local_cpu_mask_dl in init_sched_dl_class()).
[ mingo: Tidied up changelog & touched up the code. ]
Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 0f03d6805bfc454279169a1460abb3f6b3db317f
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date: Wed Jul 27 14:08:19 2022 +0800
sched/debug: Print each field value left-aligned in sched_show_task()
Currently, the values of some fields are printed right-aligned, causing
the field value to be next to the next field name rather than next to its
own field name. So print each field value left-aligned, to make it more
readable.
Before:
stack: 0 pid: 307 ppid: 2 flags:0x00000008
After:
stack:0 pid:308 ppid:2 flags:0x0000000a
This also makes them print in the same style as the other two fields:
task:demo0 state:R running task
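A generic printf illustration of the difference; the field widths are arbitrary and not the kernel's actual format strings:
  #include <stdio.h>

  int main(void)
  {
          /* Right-aligned: the value drifts toward the next label. */
          printf("stack:%10d pid:%10d ppid:%10d\n", 0, 307, 2);
          /* Left-aligned: the value stays next to its own label. */
          printf("stack:%-10d pid:%-10d ppid:%-10d\n", 0, 308, 2);
          return 0;
  }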
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-282
commit 0569b245132c40015281610353935a50e282eb94
Author: Mark Rutland <mark.rutland@arm.com>
Date: Mon Nov 29 13:06:45 2021 +0000
sched: Snapshot thread flags
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.
To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.
Convert them all to the new flag accessor helpers.
The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in
set_nr_if_polling() is left as-is for clarity.
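A user-space analogue of the snapshot pattern; the flag names and the atomic flags word are made up for the demo, and the kernel's own accessor helpers are not shown here:
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define FLAG_NEED_RESCHED  (1u << 0)
  #define FLAG_SIGPENDING    (1u << 1)

  /* A flags word that another thread may modify at any time. */
  static _Atomic unsigned int thread_flags = FLAG_SIGPENDING;

  int main(void)
  {
          /* Snapshot once, then test bits on the snapshot only; re-reading
           * the live word for each test could see two different values. */
          unsigned int snap = atomic_load_explicit(&thread_flags,
                                                   memory_order_relaxed);

          bool resched = snap & FLAG_NEED_RESCHED;
          bool sigpend = snap & FLAG_SIGPENDING;

          printf("need_resched=%d sigpending=%d\n", resched, sigpend);
          return 0;
  }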
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-310
commit 8589018acc65e5ddfd111f0a7ee85f9afde3a830
Author: Hao Jia <jiahao.os@bytedance.com>
Date: Fri Dec 16 14:24:06 2022 +0800
sched/core: Adjusting the order of scanning CPU
When select_idle_capacity() starts scanning for an idle CPU, it starts
with the target CPU that has already been checked in select_idle_sibling().
So we start checking from the next CPU and try the target CPU at the end.
Similarly for task_numa_assign(), we have just checked numa_migrate_on
of dst_cpu, so start from the next CPU. This also works for
steal_cookie_task(), where the first scan must fail, so start directly
from the next one.
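A toy sketch of the scan order being described, using a plain loop instead of the kernel's cpumask iterators; the idle predicate is a stand-in:
  #include <stdbool.h>
  #include <stdio.h>

  #define NR_CPUS 8

  static bool cpu_is_idle(int cpu)
  {
          return cpu == 5;        /* stand-in predicate for the demo */
  }

  /* Scan starting at target+1, wrapping around, and only retry the
   * already-checked target as the very last candidate. */
  static int scan_from_next(int target)
  {
          for (int i = 1; i <= NR_CPUS; i++) {
                  int cpu = (target + i) % NR_CPUS;

                  if (cpu_is_idle(cpu))
                          return cpu;
          }
          return -1;
  }

  int main(void)
  {
          printf("picked cpu %d\n", scan_from_next(2));
          return 0;
  }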
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com
Signed-off-by: Phil Auld <pauld@redhat.com>