Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Luis Claudio R. Goncalves	0857fd208a	sched: Fix stop_one_cpu_nowait() vs hotplug JIRA: https://issues.redhat.com/browse/RHEL-84526 commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Oct 10 20:57:39 2023 +0200 sched: Fix stop_one_cpu_nowait() vs hotplug Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete(). The race scenario is: CPU0 CPU1 // doing _cpu_down() __set_cpus_allowed_ptr() task_rq_lock(); takedown_cpu() stop_machine_cpuslocked(take_cpu_down..) <PREEMPT: cpu_stopper_thread() MULTI_STOP_PREPARE ... __set_cpus_allowed_ptr_locked() affine_move_task() task_rq_unlock(); <PREEMPT: cpu_stopper_thread()\> ack_state() MULTI_STOP_RUN take_cpu_down() __cpu_disable(); stop_machine_park(); stopper->enabled = false; /> /> stop_one_cpu_nowait(.fn = migration_cpu_stop); if (stopper->enabled) // false!!! That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete. OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be ran under the scheduler locks. Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination: preempt_disable(); task_rq_unlock(...); if (!stop_pending) stop_one_cpu_nowait(...) preempt_enable(); This respects the lock ordering contraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed. Apply this pattern to all similar stop_one_cpu_nowait() invocations. 
Fixes: `6d337eab04` ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>	2025-03-21 18:50:20 -03:00
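The ordering argument in the commit above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical and none of it is kernel API. Dropping the lock is modelled as a preemption point at which a pending stopper may run and disable itself, which is what makes the later queueing attempt fail.

```c
#include <stdbool.h>

/* Toy single-threaded model; all names are illustrative, not kernel API. */
struct toy_cpu {
    int  preempt_count;    /* > 0 means preemption is disabled */
    bool stopper_enabled;  /* cleared once the CPU-down path has run */
    bool hotplug_pending;  /* a stopper thread is waiting to preempt us */
};

/* Preemption point: the pending stopper runs only if preemption is enabled. */
static void toy_preempt_point(struct toy_cpu *c)
{
    if (c->preempt_count == 0 && c->hotplug_pending) {
        c->stopper_enabled = false;  /* stands in for stop_machine_park() */
        c->hotplug_pending = false;
    }
}

static void toy_preempt_disable(struct toy_cpu *c)
{
    c->preempt_count++;
}

static void toy_preempt_enable(struct toy_cpu *c)
{
    c->preempt_count--;
    toy_preempt_point(c);
}

/* In this model, dropping the rq lock is a preemption point. */
static void toy_rq_unlock(struct toy_cpu *c)
{
    toy_preempt_point(c);
}

/* Stands in for stop_one_cpu_nowait(): fails once the stopper is disabled. */
static bool toy_queue_stop_work(const struct toy_cpu *c)
{
    return c->stopper_enabled;
}

/* Racy ordering: unlock first, then try to queue the stopper work. */
static bool toy_move_racy(struct toy_cpu *c)
{
    toy_rq_unlock(c);
    return toy_queue_stop_work(c);
}

/* Fixed ordering: keep preemption disabled across unlock + queue. */
static bool toy_move_fixed(struct toy_cpu *c)
{
    bool queued;

    toy_preempt_disable(c);
    toy_rq_unlock(c);
    queued = toy_queue_stop_work(c);
    toy_preempt_enable(c);
    return queued;
}
```

With a pending hotplug, `toy_move_racy()` fails to queue the work while `toy_move_fixed()` succeeds and only lets the CPU-down path proceed afterwards. The model says nothing about the real locking rules beyond that ordering.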
Phil Auld	cb633355af	sched: Don't account irq time if sched_clock_irqtime is disabled JIRA: https://issues.redhat.com/browse/RHEL-78821 commit 763a744e24a8cfbcc13f699dcdae13a627b8588e Author: Yafang Shao <laoar.shao@gmail.com> Date: Fri Jan 3 10:24:07 2025 +0800 sched: Don't account irq time if sched_clock_irqtime is disabled sched_clock_irqtime may be disabled due to the clock source, in which case IRQ time should not be accounted. Let's add a conditional check to avoid unnecessary logic. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-27 15:13:11 +00:00
Phil Auld	9228c0fc48	sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE JIRA: https://issues.redhat.com/browse/RHEL-78821 commit c907cd44a108eff7005a2b5689bb91f50637df8b Author: Waiman Long <longman@redhat.com> Date: Wed Oct 30 13:52:53 2024 -0400 sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE As all the non-domain and non-managed_irq housekeeping types have been unified to HK_TYPE_KERNEL_NOISE, replace all these references in the scheduler to use HK_TYPE_KERNEL_NOISE. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-27 15:13:10 +00:00
Phil Auld	699009fa9c	sched: Don't try to catch up excess steal time. JIRA: https://issues.redhat.com/browse/RHEL-78821 commit 108ad0999085df2366dd9ef437573955cb3f5586 Author: Suleiman Souhlal <suleiman@google.com> Date: Mon Nov 18 13:37:45 2024 +0900 sched: Don't try to catch up excess steal time. When steal time exceeds the measured delta when updating clock_task, we currently try to catch up the excess in future updates. However, this results in inaccurate run times for the future things using clock_task, in some situations, as they end up getting additional steal time that did not actually happen. This is because there is a window between reading the elapsed time in update_rq_clock() and sampling the steal time in update_rq_clock_task(). If the VCPU gets preempted between those two points, any additional steal time is accounted to the outgoing task even though the calculated delta did not actually contain any of that "stolen" time. When this race happens, we can end up with steal time that exceeds the calculated delta, and the previous code would try to catch up that excess steal time in future clock updates, which is given to the next, incoming task, even though it did not actually have any time stolen. This behavior is particularly bad when steal time can be very long, which we've seen when trying to extend steal time to contain the duration that the host was suspended [0]. When this happens, clock_task stays frozen, during which the running task stays running for the whole duration, since its run time doesn't increase. However the race can happen even under normal operation. Ideally we would read the elapsed cpu time and the steal time atomically, to prevent this race from happening in the first place, but doing so is non-trivial. 
Since the time between those two points isn't otherwise accounted anywhere, neither to the outgoing task nor the incoming task (because the "end of outgoing task" and "start of incoming task" timestamps are the same), I would argue that the right thing to do is to simply drop any excess steal time, in order to prevent these issues. [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/ Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-27 15:13:09 +00:00
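The difference between catching up excess steal time and dropping it can be sketched with a small userspace model. This is a simplified illustration, not the kernel's `update_rq_clock_task()`: the `toy_rq` structure and both function names are assumptions made for the example, and the steal clock is passed in as a plain number.

```c
#include <stdint.h>

/* Toy runqueue clock state; field and function names are hypothetical. */
struct toy_rq {
    uint64_t prev_steal;  /* steal clock value accounted so far */
    uint64_t clock_task;  /* task clock: elapsed time minus steal time */
};

/* Old behaviour: steal beyond delta stays pending and is charged later. */
static void toy_update_catch_up(struct toy_rq *rq, uint64_t delta,
                                uint64_t steal_clock)
{
    uint64_t steal = steal_clock - rq->prev_steal;

    if (steal > delta)
        steal = delta;
    rq->prev_steal += steal;         /* excess remains unaccounted */
    rq->clock_task += delta - steal;
}

/* New behaviour: steal beyond delta is dropped at this update. */
static void toy_update_drop_excess(struct toy_rq *rq, uint64_t delta,
                                   uint64_t steal_clock)
{
    uint64_t steal = steal_clock - rq->prev_steal;

    if (steal > delta)
        steal = delta;
    rq->prev_steal = steal_clock;    /* excess is forgotten */
    rq->clock_task += delta - steal;
}
```

Feeding both variants two updates of `delta = 10` while the steal clock stays at 25 shows the effect: the catch-up variant leaves `clock_task` at 0 after the second update, because the leftover steal is charged to the next interval, while the drop variant advances `clock_task` by the full 10.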
Phil Auld	bcbc94b5b8	sched: Initialize idle tasks only once JIRA: https://issues.redhat.com/browse/RHEL-78821 commit b23decf8ac9102fc52c4de5196f4dc0a5f3eb80b Author: Thomas Gleixner <tglx@linutronix.de> Date: Mon Oct 28 11:43:42 2024 +0100 sched: Initialize idle tasks only once Idle tasks are initialized via __sched_fork() twice: fork_idle() copy_process() sched_fork() __sched_fork() init_idle() __sched_fork() Instead of cleaning this up, sched_ext hacked around it. Even when analyis and solution were provided in a discussion, nobody cared to clean this up. init_idle() is also invoked from sched_init() to initialize the boot CPU's idle task, which requires the __sched_fork() invocation. But this can be trivially solved by invoking __sched_fork() before init_idle() in sched_init() and removing the __sched_fork() invocation from init_idle(). Do so and clean up the comments explaining this historical leftover. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-27 15:13:09 +00:00
Phil Auld	aa347a1976	sched/uclamp: Fix unnused variable warning JIRA: https://issues.redhat.com/browse/RHEL-78821 commit 23f1178ad706a1aa69ac3dfaa6559f1fb876c14e Author: Christian Loehle <christian.loehle@arm.com> Date: Fri Oct 25 11:53:17 2024 +0100 sched/uclamp: Fix unnused variable warning uclamp_mutex is only used for CONFIG_SYSCTL or CONFIG_UCLAMP_TASK_GROUP so declare it __maybe_unused. Closes: https://lore.kernel.org/oe-kbuild-all/202410060258.bPl2ZoUo-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202410250459.EJe6PJI5-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a1e9c342-01c9-44f0-a789-2c908e57942b@arm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-27 15:13:08 +00:00
Phil Auld	8851a9b9ae	sched: Add move_queued_task_locked helper JIRA: https://issues.redhat.com/browse/RHEL-78821 Conflicts: Context diffs in sched.h due to not having eevdf code. commit 2b05a0b4c08ffd6dedfbd27af8708742cde39b95 Author: Connor O'Brien <connoro@google.com> Date: Wed Oct 9 16:53:37 2024 -0700 sched: Add move_queued_task_locked helper Switch logic that deactivates, sets the task cpu, and reactivates a task on a different rq to use a helper that will be later extended to push entire blocked task chains. This patch was broken out from a larger chain migration patch originally by Connor O'Brien. [jstultz: split out from larger chain migration patch] Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-27 15:13:08 +00:00
Phil Auld	9d954525ab	sched/core: Add clearing of ->dl_server in put_prev_task_balance() JIRA: https://issues.redhat.com/browse/RHEL-78821 commit c245910049d04fbfa85bb2f5acd591c24e9907c7 Author: Joel Fernandes (Google) <joel@joelfernandes.org> Date: Mon May 27 14:06:48 2024 +0200 sched/core: Add clearing of ->dl_server in put_prev_task_balance() Paths using put_prev_task_balance() need to do a pick shortly after. Make sure they also clear the ->dl_server on prev as a part of that. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-12 15:42:08 +00:00
Phil Auld	b26aeaeafa	sched/core: Clear prev->dl_server in CFS pick fast path JIRA: https://issues.redhat.com/browse/RHEL-78821 commit a741b82423f41501e301eb6f9820b45ca202e877 Author: Youssef Esmat <youssefesmat@google.com> Date: Mon May 27 14:06:49 2024 +0200 sched/core: Clear prev->dl_server in CFS pick fast path In case the previous pick was a DL server pick, ->dl_server might be set. Clear it in the fast path as well. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Youssef Esmat <youssefesmat@google.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-12 15:41:45 +00:00
Phil Auld	5b8b5283a9	sched: remove HZ_BW feature hedge JIRA: https://issues.redhat.com/browse/RHEL-78821 Conflicts: Minor context difference in features.h. commit a58501fb8320d6232507f722b4c9dcd4e03362ee Author: Phil Auld <pauld@redhat.com> Date: Wed May 15 09:37:05 2024 -0400 sched: remove HZ_BW feature hedge As a hedge against unexpected user issues commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use") included a scheduler feature to disable the new functionality. It's been a few releases (v6.6) and no screams, so remove it. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com Signed-off-by: Phil Auld <pauld@redhat.com>	2025-02-12 15:24:41 +00:00
Phil Auld	5dc87bb1b4	sched/core: Prevent wakeup of ksoftirqd during idle load balance JIRA: https://issues.redhat.com/browse/RHEL-70904 commit e932c4ab38f072ce5894b2851fea8bc5754bb8e5 Author: K Prateek Nayak <kprateek.nayak@amd.com> Date: Tue Nov 19 05:44:32 2024 +0000 sched/core: Prevent wakeup of ksoftirqd during idle load balance Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event on from the IPI handler on the idle CPU. If the SMP function is invoked from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd because soft interrupts are handled before ksoftirqd get on the CPU. Adding a trace_printk() in nohz_csd_func() at the spot of raising SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup, and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the current behavior: <idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func <idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000 <idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED] <idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED] <idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120 ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 ... Use __raise_softirq_irqoff() to raise the softirq. The SMP function call is always invoked on the requested CPU in an interrupt handler. It is guaranteed that soft interrupts are handled at the end. 
Following are the observations with the changes when enabling the same set of events: <idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance <idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED] <idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED] No unnecessary ksoftirqd wakeups are seen from idle task's context to service the softirq. Fixes: `b2a02fc43a` ("smp: Optimize send_call_function_single_ipi()") Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1] Reported-by: Julia Lawall <julia.lawall@inria.fr> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-12-11 19:38:11 +00:00
Phil Auld	04d352ce17	sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() JIRA: https://issues.redhat.com/browse/RHEL-70904 commit ea9cffc0a154124821531991d5afdd7e8b20d7aa Author: K Prateek Nayak <kprateek.nayak@amd.com> Date: Tue Nov 19 05:44:30 2024 +0000 sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() The need_resched() check currently in nohz_csd_func() can be tracked to have been added in scheduler_ipi() back in 2011 via commit `ca38062e57` ("sched: Use resched IPI to kick off the nohz idle balance") Since then, it has travelled quite a bit but it seems like an idle_cpu() check currently is sufficient to detect the need to bail out from an idle load balancing. To justify this removal, consider all the following case where an idle load balancing could race with a task wakeup: o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") a target perceived to be idle (target_rq->nr_running == 0) will return true for ttwu_queue_cond(target) which will offload the task wakeup to the idle target via an IPI. In all such cases target_rq->ttwu_pending will be set to 1 before queuing the wake function. If an idle load balance races here, following scenarios are possible: - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual IPI is sent to the CPU to wake it out of idle. If the nohz_csd_func() queues before sched_ttwu_pending(), the idle load balance will bail out since idle_cpu(target) returns 0 since target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after sched_ttwu_pending() it should see rq->nr_running to be non-zero and bail out of idle load balancing. - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI, the sender will simply set TIF_NEED_RESCHED for the target to put it out of idle and flush_smp_call_function_queue() in do_idle() will execute the call function. 
Depending on the ordering of the queuing of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in nohz_csd_func() should either see target_rq->ttwu_pending = 1 or target_rq->nr_running to be non-zero if there is a genuine task wakeup racing with the idle load balance kick. o The waker CPU perceives the target CPU to be busy (targer_rq->nr_running != 0) but the CPU is in fact going idle and due to a series of unfortunate events, the system reaches a case where the waker CPU decides to perform the wakeup by itself in ttwu_queue() on the target CPU but target is concurrently selected for idle load balance (XXX: Can this happen? I'm not sure, but we'll consider the mother of all coincidences to estimate the worst case scenario). ttwu_do_activate() calls enqueue_task() which would increment "rq->nr_running" post which it calls wakeup_preempt() which is responsible for setting TIF_NEED_RESCHED (via a resched IPI or by setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key thing to note in this case is that rq->nr_running is already non-zero in case of a wakeup before TIF_NEED_RESCHED is set which would lead to idle_cpu() check returning false. In all cases, it seems that need_resched() check is unnecessary when checking for idle_cpu() first since an impending wakeup racing with idle load balancer will either set the "rq->ttwu_pending" or indicate a newly woken task via "rq->nr_running". 
Chasing the reason why this check might have existed in the first place, I came across Peter's suggestion on the fist iteration of Suresh's patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was: sched_ttwu_do_pending(list); if (unlikely((rq->idle == current) && rq->nohz_balance_kick && !need_resched())) raise_softirq_irqoff(SCHED_SOFTIRQ); Since the condition to raise the SCHED_SOFIRQ was preceded by sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in the current upstream kernel, the need_resched() check was necessary to catch a newly queued task. Peter suggested modifying it to: if (idle_cpu() && rq->nohz_balance_kick && !need_resched()) raise_softirq_irqoff(SCHED_SOFTIRQ); where idle_cpu() seems to have replaced "rq->idle == current" check. Even back then, the idle_cpu() check would have been sufficient to catch a new task being enqueued. Since commit `b2a02fc43a` ("smp: Optimize send_call_function_single_ipi()") overloads the interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based on Peter's suggestion. Fixes: `b2a02fc43a` ("smp: Optimize send_call_function_single_ipi()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-12-11 19:37:47 +00:00
Audra Mitchell	8947f5b14c	lazy tlb: introduce lazy tlb mm refcount helper functions JIRA: https://issues.redhat.com/browse/RHEL-55462 This patch is a backport of the following upstream commit: commit aa464ba9a1e444d5ef95bb63ee3b2ef26fc96ed7 Author: Nicholas Piggin <npiggin@gmail.com> Date: Fri Feb 3 17:18:34 2023 +1000 lazy tlb: introduce lazy tlb mm refcount helper functions Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Audra Mitchell <audra@redhat.com>	2024-11-04 09:14:17 -05:00
Rado Vrbovsky	d2bd7080ef	Merge: Sched: Updates and fixes for 9.6 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5250 JIRA: https://issues.redhat.com/browse/RHEL-56494 JIRA: https://issues.redhat.com/browse/RHEL-57142 CVE: CVE-2024-44958 Tested: Ran scheduler tests and general stress testing. Have asked perf QE for sanity tests. Omitted-fix: c049acee3c71 ("selftests/ftrace: Fix test to handle both old and new kernels"): Somewhat out of scope for this MR and should not need to run test against old kernels in RHEL. Series of scheduler related fixes and updates, up to v6.11. A large number of these are refactoring (making naming consistent, breaking out code into new files etc) with no functional changes. Otherwise, primarily bug fixes and cleanups, no real feature additions. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Tony Camuso <tcamuso@redhat.com> Approved-by: Mark Langsdorf <mlangsdo@redhat.com> Approved-by: Juri Lelli <juri.lelli@redhat.com> Approved-by: Eric Chanudet <echanude@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-10-25 16:52:35 +00:00
Rado Vrbovsky	d30d477e21	Merge: rcu: Backport upstream RCU commits up to v6.10 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5074 JIRA: https://issues.redhat.com/browse/RHEL-55557 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5074 This MR backports upstream RCU commits up to v6.10 with relevant bug fixes, if applicable. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Tony Camuso <tcamuso@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-10-25 16:11:27 +00:00
Ming Lei	23a842d0ff	sched: Add a new function to compare if two cpus have the same capacity JIRA: https://issues.redhat.com/browse/RHEL-56837 commit b361c9027b4e4159e7bcca4eb64fd26507c19994 Author: Qais Yousef <qyousef@layalina.io> Date: Fri Feb 23 15:57:48 2024 +0000 sched: Add a new function to compare if two cpus have the same capacity The new helper function is needed to help blk-mq check if it needs to dispatch the softirq on another CPU to match the performance level the IO requester is running at. This is important on HMP systems where not all CPUs have the same compute capacity. Signed-off-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2024-09-27 11:18:38 +08:00
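As a rough illustration of what a same-capacity check amounts to, the sketch below compares entries in a made-up per-CPU capacity table. The table, its values, and the `toy_cpus_equal_capacity()` name are assumptions for this example only; the helper added by the commit reads capacities from the scheduler rather than from a static array.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Hypothetical capacities for a 2 big + 2 little system. */
static const unsigned long toy_capacity[TOY_NR_CPUS] = {
    1024, 1024, 512, 512
};

/* Returns true when both CPUs exist and have the same compute capacity. */
static bool toy_cpus_equal_capacity(size_t this_cpu, size_t that_cpu)
{
    if (this_cpu >= TOY_NR_CPUS || that_cpu >= TOY_NR_CPUS)
        return false;
    if (this_cpu == that_cpu)
        return true;
    return toy_capacity[this_cpu] == toy_capacity[that_cpu];
}
```

A caller in the position described above would complete the request locally when the check returns true and hand it to the requester's CPU otherwise.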
Ming Lei	3a1c398968	block: update cached timestamp post schedule/preemption JIRA: https://issues.redhat.com/browse/RHEL-56837 commit 06b23f92af87a84d70881b2ecaa72e00f7838264 Author: Jens Axboe <axboe@kernel.dk> Date: Tue Jan 16 09:18:39 2024 -0700 block: update cached timestamp post schedule/preemption Mark the task as having a cached timestamp when set assign it, so we can efficiently check if it needs updating post being scheduled back in. This covers both the actual schedule out case, which would've flushed the plug, and the preemption case which doesn't touch the plugged requests (for many reasons, one of them being then we'd need to have preemption disabled around plug state manipulation). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2024-09-27 11:18:33 +08:00
Phil Auld	f2e299d329	sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate() JIRA: https://issues.redhat.com/browse/RHEL-56494 commit fe7a11c78d2a9bdb8b50afc278a31ac177000948 Author: Yang Yingliang <yangyingliang@huawei.com> Date: Wed Jul 3 11:16:10 2024 +0800 sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate() If cpuset_cpu_inactive() fails, set_rq_online() need be called to rollback. Fixes: `120455c514` ("sched: Fix hotplug vs CPU bandwidth control") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-26 08:11:31 -04:00
Phil Auld	bcf83b5d1b	sched/smt: Fix unbalance sched_smt_present dec/inc JIRA: https://issues.redhat.com/browse/RHEL-57142 CVE: CVE-2024-44958 commit e22f910a26cc2a3ac9c66b8e935ef2a7dd881117 Author: Yang Yingliang <yangyingliang@huawei.com> Date: Wed Jul 3 11:16:08 2024 +0800 sched/smt: Fix unbalance sched_smt_present dec/inc I got the following warn report while doing stress test: jump label: negative count! WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0 Call Trace: <TASK> __static_key_slow_dec_cpuslocked+0x16/0x70 sched_cpu_deactivate+0x26e/0x2a0 cpuhp_invoke_callback+0x3ad/0x10d0 cpuhp_thread_fun+0x3f5/0x680 smpboot_thread_fn+0x56d/0x8d0 kthread+0x309/0x400 ret_from_fork+0x41/0x70 ret_from_fork_asm+0x1b/0x30 </TASK> Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the cpu offline failed, but sched_smt_present is decremented before calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so fix it by incrementing sched_smt_present in the error path. Fixes: `c5511d03ec` ("sched/smt: Make sched_smt_present track topology") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-26 08:11:14 -04:00
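The unbalanced dec/inc described above can be reproduced in a toy model. This is a sketch with hypothetical names (`toy_smt`, `toy_deactivate`) and a plain integer standing in for the static key; it only shows why the error path needs to undo the earlier decrement.

```c
#include <errno.h>
#include <stdbool.h>

/* Plain counter standing in for the sched_smt_present static key. */
struct toy_smt {
    int present;
};

/*
 * Toy offline path: decrement first, then run a step that may fail.
 * 'rollback' selects whether the error path re-increments the counter.
 */
static int toy_deactivate(struct toy_smt *s, bool step_fails, bool rollback)
{
    s->present--;

    if (step_fails) {
        if (rollback)
            s->present++;  /* restore the count on failure */
        return -EBUSY;
    }
    return 0;
}
```

Without the rollback, one failed attempt followed by a successful one drives the counter from 1 to -1, which corresponds to the "negative count" warning in the report. With the rollback, the same sequence ends at 0.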
Phil Auld	9d3f3b5053	sched/core: Introduce sched_set_rq_on/offline() helper JIRA: https://issues.redhat.com/browse/RHEL-56494 commit 2f027354122f58ee846468a6f6b48672fff92e9b Author: Yang Yingliang <yangyingliang@huawei.com> Date: Wed Jul 3 11:16:09 2024 +0800 sched/core: Introduce sched_set_rq_on/offline() helper Introduce sched_set_rq_on/offline() helper, so it can be called in normal or error path simply. No functional changed. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-26 08:11:14 -04:00
Phil Auld	0fc32c5847	sched/smt: Introduce sched_smt_present_inc/dec() helper JIRA: https://issues.redhat.com/browse/RHEL-57142 CVE: CVE-2024-44958 commit 31b164e2e4af84d08d2498083676e7eeaa102493 Author: Yang Yingliang <yangyingliang@huawei.com> Date: Wed Jul 3 11:16:07 2024 +0800 sched/smt: Introduce sched_smt_present_inc/dec() helper Introduce sched_smt_present_inc/dec() helper, so it can be called in normal or error path simply. No functional changed. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-26 08:10:44 -04:00
Phil Auld	a1d464759c	rcu/tasks: Fix stale task snaphot for Tasks Trace JIRA: https://issues.redhat.com/browse/RHEL-56494 commit 399ced9594dfab51b782798efe60a2376cd5b724 Author: Frederic Weisbecker <frederic@kernel.org> Date: Fri May 17 17:23:02 2024 +0200 rcu/tasks: Fix stale task snaphot for Tasks Trace When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running on all online CPUs, no explicit ordering synchronizes properly with a context switch. This lack of ordering can permit the new task to miss pre-grace-period update-side accesses. The following diagram, courtesy of Paul, shows the possible bad scenario: CPU 0 CPU 1 ----- ----- // Pre-GP update side access WRITE_ONCE(X, 1); smp_mb(); r0 = rq->curr; RCU_INIT_POINTER(rq->curr, TASK_B) spin_unlock(rq) rcu_read_lock_trace() r1 = X; /* ignore TASK_B */ Either r0==TASK_B or r1==1 is needed but neither is guaranteed. One possible solution to solve this is to wait for an RCU grace period at the beginning of the RCU-tasks-trace grace period before taking the current tasks snaphot. However this would introduce large additional latencies to RCU-tasks-trace grace periods. Another solution is to lock the target runqueue while taking the current task snapshot. This ensures that the update side sees the latest context switch and subsequent context switches will see the pre-grace-period update side accesses. This commit therefore adds runqueue locking to cpu_curr_snapshot(). Fixes: e386b6725798 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-23 13:33:03 -04:00
Phil Auld	fb5d3f735e	sched/core: Simplify prefetch_curr_exec_start() JIRA: https://issues.redhat.com/browse/RHEL-56494 commit 85c9a8f4531c6c0862ecda50cac662b0b78d1974 Author: Ingo Molnar <mingo@kernel.org> Date: Wed Jun 5 13:01:44 2024 +0200 sched/core: Simplify prefetch_curr_exec_start() Remove unnecessary use of the address operator. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-23 13:33:03 -04:00
Phil Auld	e8bf69e6e0	sched: Fix spelling in comments JIRA: https://issues.redhat.com/browse/RHEL-56494 Conflicts: Dropped hunks in mm_cid code which we don't have. Minor context diffs due to still having IA64 in tree and previous Kabi workarounds. commit 402de7fc880fef055bc984957454b532987e9ad0 Author: Ingo Molnar <mingo@kernel.org> Date: Mon May 27 16:54:52 2024 +0200 sched: Fix spelling in comments Do a spell-checking pass. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-23 13:33:02 -04:00
Phil Auld	10626dfce1	sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c JIRA: https://issues.redhat.com/browse/RHEL-56494 Conflicts: Worked around RHEL-only commits `9b35f92491` ("sched/core: Make sched_setaffinity() always return -EINVAL on empty cpumask"),90f7bb0c1823 ("sched/core: Don't return -ENODEV from sched_setaffinity()") and `05fddaaaac` ("sched/core: Use empty mask to reset cpumasks in sched_setaffinity()") by removing the changes and re-applying them to the new syscalls.c file. Reverting and re-applying was not possible since there have been other changes on top of these as well. commit 04746ed80bcf3130951ed4d5c1bc5b0bcabdde22 Author: Ingo Molnar <mingo@kernel.org> Date: Sun Apr 7 10:43:15 2024 +0200 sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c core.c has become rather large, move most scheduler syscall related functionality into a separate file, syscalls.c. This is about ~15% of core.c's raw linecount. Move the alloc_user_cpus_ptr(), __rt_effective_prio(), rt_effective_prio(), uclamp_none(), uclamp_se_set() and uclamp_bucket_id() inlines to kernel/sched/sched.h. Internally export the __sched_setscheduler(), __sched_setaffinity(), __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(), check_class_changed(), splice_balance_callbacks() and balance_callbacks() methods to better facilitate this. Move the new file's build to sched_policy.c, because it fits there semantically, but also because it's the smallest of the 4 build units under an allmodconfig build: -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i This better balances build time for scheduler subsystem rebuilds. 
I build-tested this new file as a standalone syscalls.o file for a bit, to make sure all the encapsulations & abstractions are robust. Also update/add my copyright notices to these files. Build time measurements: # -Before/+After: kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null Performance counter stats for 'm kernel/sched/built-in.a' (5 runs): - 71,938,508,607 cycles ( +- 0.17% ) + 71,992,916,493 cycles ( +- 0.22% ) - 106,214,780,964 instructions # 1.48 insn per cycle ( +- 0.01% ) + 105,450,231,154 instructions # 1.46 insn per cycle ( +- 0.01% ) - 5,878,232,620 ns duration_time ( +- 0.38% ) + 5,290,085,069 ns duration_time ( +- 0.21% ) - 5.8782 +- 0.0221 seconds time elapsed ( +- 0.38% ) + 5.2901 +- 0.0111 seconds time elapsed ( +- 0.21% ) Build time improvement of -11.1% (duration_time) is expected: the parallel build time of the scheduler subsystem is determined by the largest, slowest to build object file, which is kernel/sched/core.o. By moving ~15% of its complexity into another build unit, we reduced build time by -11%. Measured cycles spent on building is within its ~0.2% stddev noise envelope. The -0.7% reduction in instructions spent on building the scheduler is statistically reliable and somewhat surprising - I can only speculate: maybe compilers aren't that efficient at building & optimizing 10+ KLOC files (core.c), and it's an overall win to balance the linecount a bit. Anyway, this might be a data point that suggests that reducing the linecount of our largest files will improve not just code readability and maintainability, but might also improve build times a bit. 
Code generation got a bit worse, by 0.5kb text on an x86 defconfig build: # -Before/+After: kepler:~/tip> size vmlinux text data bss dec hex filename -26475475 10439178 1740804 38655457 24dd5e1 vmlinux +26476003 10439178 1740804 38655985 24dd7f1 vmlinux kepler:~/tip> size kernel/sched/built-in.a text data bss dec hex filename - 76056 30025 489 106570 1a04a kernel/sched/core.o (ex kernel/sched/built-in.a) + 63452 29453 489 93394 16cd2 kernel/sched/core.o (ex kernel/sched/built-in.a) 44299 2181 104 46584 b5f8 kernel/sched/fair.o (ex kernel/sched/built-in.a) - 42764 3424 120 46308 b4e4 kernel/sched/build_policy.o (ex kernel/sched/built-in.a) + 55651 4044 120 59815 e9a7 kernel/sched/build_policy.o (ex kernel/sched/built-in.a) 44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a) 44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a) This is primarily due to the extra functions exported, and the size gets exaggerated somewhat by __pfx CFI function padding: ffffffff810cc710 <__pfx_enqueue_task>: ffffffff810cc710: 90 nop ffffffff810cc711: 90 nop ffffffff810cc712: 90 nop ffffffff810cc713: 90 nop ffffffff810cc714: 90 nop ffffffff810cc715: 90 nop ffffffff810cc716: 90 nop ffffffff810cc717: 90 nop ffffffff810cc718: 90 nop ffffffff810cc719: 90 nop ffffffff810cc71a: 90 nop ffffffff810cc71b: 90 nop ffffffff810cc71c: 90 nop ffffffff810cc71d: 90 nop ffffffff810cc71e: 90 nop ffffffff810cc71f: 90 nop AFAICS the cost is primarily not to core.o and fair.o though (which contain most performance sensitive scheduler functions), only to syscalls.o that get called with much lower frequency - so I think this is an acceptable trade-off for better code separation. 
Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-23 13:33:02 -04:00
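The percentages quoted in the message above can be re-derived from the raw `perf stat` figures it lists. This is only an arithmetic sketch; the helper name `pct_change` is made up for illustration and is not part of perf or the kernel tree. Under that reading, the quoted 11.1% appears to match the before/after ratio of `duration_time`, while the relative reduction comes out closer to 10.0%.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Raw values copied from the perf stat output in the commit message.
duration_before_ns = 5_878_232_620
duration_after_ns = 5_290_085_069
insn_before = 106_214_780_964
insn_after = 105_450_231_154

# Relative reduction in wall-clock build time.
duration_change = pct_change(duration_before_ns, duration_after_ns)
# The same improvement expressed as an old/new ratio.
speedup = (duration_before_ns / duration_after_ns - 1.0) * 100.0
# Relative reduction in instructions retired during the build.
insn_change = pct_change(insn_before, insn_after)

print(f"duration_time change: {duration_change:.1f}%")
print(f"before/after ratio:   +{speedup:.1f}%")
print(f"instructions change:  {insn_change:.1f}%")
```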
Phil Auld	14a470e760	sched/pelt: Remove shift of thermal clock JIRA: https://issues.redhat.com/browse/RHEL-56494 commit 97450eb909658573dcacc1063b06d3d08642c0c1 Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Tue Mar 26 10:16:16 2024 +0100 sched/pelt: Remove shift of thermal clock The optional shift of the clock used by thermal/hw load avg has been introduced to handle case where the signal was not always a high frequency hw signal. Now that cpufreq provides a signal for firmware and SW pressure, we can remove this exception and always keep this PELT signal aligned with other signals. Mark sysctl_sched_migration_cost boot parameter as deprecated Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-23 13:33:02 -04:00
Phil Auld	51c743b331	sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() JIRA: https://issues.redhat.com/browse/RHEL-56494 Conflicts: Minor differences since we already have ddae0ca2a8f ("sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath") which changes some nearby code. commit d4dbc991714eefcbd8d54a3204bd77a0a52bd32d Author: Vincent Guittot <vincent.guittot@linaro.org> Date: Tue Mar 26 10:16:15 2024 +0100 sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such high frequency signal needs filtering to be smoothed and provide an value that reflects the average available capacity into the scheduler time scale. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-23 13:33:02 -04:00
Phil Auld	32938a738c	sched/balancing: Rename trigger_load_balance() => sched_balance_trigger() JIRA: https://issues.redhat.com/browse/RHEL-56494 Conflicts: Dropped CN documentation since not in RHEL. Minor fuzz in sched-domains.rst. commit 983be0628c061989b6cc175d2f5e429b40699fbb Author: Ingo Molnar <mingo@kernel.org> Date: Fri Mar 8 12:18:09 2024 +0100 sched/balancing: Rename trigger_load_balance() => sched_balance_trigger() Standardize scheduler load-balancing function names on the sched_balance_() prefix. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20240308111819.1101550-4-mingo@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-20 04:38:48 -04:00
Phil Auld	88d1c5d2ed	sched/balancing: Rename scheduler_tick() => sched_tick() JIRA: https://issues.redhat.com/browse/RHEL-56494 Conflicts: Dropped CN documentation since not in RHEL, context diffs in sched-domains.rst. Skipped hunk in func_set_ftrace_file.tc due to not having 6fec1ab67f8 ("selftests/ftrace: Do not trace do_softirq because of PREEMPT_RT") in tree. commit 86dd6c04ef9f213e14d60c9f64bce1cc019f816e Author: Ingo Molnar <mingo@kernel.org> Date: Fri Mar 8 12:18:08 2024 +0100 sched/balancing: Rename scheduler_tick() => sched_tick() - Standardize on prefixing scheduler-internal functions defined in <linux/sched.h> with sched_() prefix. scheduler_tick() was the only function using the scheduler_ prefix. Harmonize it. - The other reason to rename it is the NOHZ scheduler tick handling functions are already named sched_tick_(). Make the 'git grep sched_tick' more meaningful. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-20 04:38:48 -04:00
Phil Auld	79fb2c348b	sched/core: Simplify code by removing duplicate #ifdefs JIRA: https://issues.redhat.com/browse/RHEL-56494 commit 8cec3dd9e5930c82c6bd0af3fdb3a36bcd428310 Author: Shrikanth Hegde <sshegde@linux.ibm.com> Date: Fri Feb 16 11:44:33 2024 +0530 sched/core: Simplify code by removing duplicate #ifdefs There's a few cases of nested #ifdefs in the scheduler code that can be simplified: #ifdef DEFINE_A ...code block... #ifdef DEFINE_A <-- This is a duplicate. ...code block... #endif #else #ifndef DEFINE_A <-- This is also duplicate. ...code block... #endif #endif More details about the script and methods used to find these code patterns can be found at: https://lore.kernel.org/all/20240118080326.13137-1-sshegde@linux.ibm.com/ No change in functionality intended. [ mingo: Clarified the changelog. ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240216061433.535522-1-sshegde@linux.ibm.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-09-20 04:38:46 -04:00
Waiman Long	dcf25e4ab2	rcu/tasks: Fix stale task snaphot for Tasks Trace JIRA: https://issues.redhat.com/browse/RHEL-55557 commit 399ced9594dfab51b782798efe60a2376cd5b724 Author: Frederic Weisbecker <frederic@kernel.org> Date: Fri, 17 May 2024 17:23:02 +0200 rcu/tasks: Fix stale task snaphot for Tasks Trace When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running on all online CPUs, no explicit ordering synchronizes properly with a context switch. This lack of ordering can permit the new task to miss pre-grace-period update-side accesses. The following diagram, courtesy of Paul, shows the possible bad scenario: CPU 0 CPU 1 ----- ----- // Pre-GP update side access WRITE_ONCE(X, 1); smp_mb(); r0 = rq->curr; RCU_INIT_POINTER(rq->curr, TASK_B) spin_unlock(rq) rcu_read_lock_trace() r1 = X; /* ignore TASK_B */ Either r0==TASK_B or r1==1 is needed but neither is guaranteed. One possible solution to solve this is to wait for an RCU grace period at the beginning of the RCU-tasks-trace grace period before taking the current tasks snaphot. However this would introduce large additional latencies to RCU-tasks-trace grace periods. Another solution is to lock the target runqueue while taking the current task snapshot. This ensures that the update side sees the latest context switch and subsequent context switches will see the pre-grace-period update side accesses. This commit therefore adds runqueue locking to cpu_curr_snapshot(). Fixes: e386b6725798 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-08-26 10:57:51 -04:00
Phil Auld	d414c1e069	sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath JIRA: https://issues.redhat.com/browse/RHEL-48226 Conflicts: Minor context differences in sched/core.c due to not having scheduler_tick() renamed sched_tick and d4dbc991714e ("sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()"). commit ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8 Author: John Stultz <jstultz@google.com> Date: Tue Jun 18 14:58:55 2024 -0700 sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath It was reported that in moving to 6.1, a larger then 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID,...). Using a simple reproducer, I found: 5.10: 100000000 calls in 24345994193 ns => 243.460 ns per call 100000000 calls in 24288172050 ns => 242.882 ns per call 100000000 calls in 24289135225 ns => 242.891 ns per call 6.1: 100000000 calls in 28248646742 ns => 282.486 ns per call 100000000 calls in 28227055067 ns => 282.271 ns per call 100000000 calls in 28177471287 ns => 281.775 ns per call The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure"). In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach: Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule(). In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear. 
With this change the performance is back in-line with 5.10: 6.1+fix: 100000000 calls in 24297324597 ns => 242.973 ns per call 100000000 calls in 24318869234 ns => 243.189 ns per call 100000000 calls in 24291564588 ns => 242.916 ns per call Reported-by: Jimmy Shiu <jimmyshiu@google.com> Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-07-15 11:13:20 -04:00
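The per-call figures in the message above are the total elapsed time divided by the number of calls. The sketch below checks that arithmetic and shows roughly what a measurement loop could look like. It is not the author's reproducer: the function names are hypothetical, and in Python the interpreter overhead dominates, so the absolute numbers from `measure()` are not comparable with the ones quoted.

```python
import time


def ns_per_call(total_ns: int, calls: int) -> float:
    """Average cost of one call, in nanoseconds."""
    return total_ns / calls


def measure(calls: int = 100_000) -> float:
    """Time repeated reads of the per-thread CPU clock (Unix only)."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.clock_gettime_ns(time.CLOCK_THREAD_CPUTIME_ID)
    elapsed = time.perf_counter_ns() - start
    return ns_per_call(elapsed, calls)


# First sample of each series, copied from the commit message.
reported_5_10 = ns_per_call(24_345_994_193, 100_000_000)
reported_6_1 = ns_per_call(28_248_646_742, 100_000_000)

# Size of the regression that the message describes as "larger than 10%".
regression_pct = (reported_6_1 / reported_5_10 - 1.0) * 100.0

print(f"5.10: {reported_5_10:.3f} ns per call")
print(f"6.1:  {reported_6_1:.3f} ns per call")
print(f"regression: {regression_pct:.1f}%")
```

Calling `measure()` on a Linux machine gives a local ns-per-call figure, which is only meaningful when compared between kernels on the same hardware.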
Phil Auld	f2ab2fc5c2	sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write() JIRA: https://issues.redhat.com/browse/RHEL-48226 commit 49217ea147df7647cb89161b805c797487783fc0 Author: Cheng Yu <serein.chengyu@huawei.com> Date: Wed Apr 24 21:24:38 2024 +0800 sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write() In the cgroup v2 CPU subsystem, assuming we have a cgroup named 'test', and we set cpu.max and cpu.max.burst: # echo 1000000 > /sys/fs/cgroup/test/cpu.max # echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst then we check cpu.max and cpu.max.burst: # cat /sys/fs/cgroup/test/cpu.max 1000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000000 Next we set cpu.max again and check cpu.max and cpu.max.burst: # echo 2000000 > /sys/fs/cgroup/test/cpu.max # cat /sys/fs/cgroup/test/cpu.max 2000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000 ... we find that the cpu.max.burst value changed unexpectedly. In cpu_max_write(), the unit of the burst value returned by tg_get_cfs_burst() is microseconds, while in cpu_max_write(), the burst unit used for calculation should be nanoseconds, which leads to the bug. To fix it, get the burst value directly from tg->cfs_bandwidth.burst. Fixes: `f4183717b3` ("sched/fair: Introduce the burstable CFS controller") Reported-by: Qixin Liao <liaoqixin@huawei.com> Signed-off-by: Cheng Yu <serein.chengyu@huawei.com> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com Signed-off-by: Phil Auld <pauld@redhat.com>	2024-07-15 11:13:08 -04:00
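The unit mismatch described in the message above can be modelled in a few lines. This is a toy sketch under stated assumptions: `TaskGroupModel` and its methods are hypothetical stand-ins, not kernel code. The only things taken from the message are that the getter returns microseconds, that the write path expects nanoseconds, and that the fix reads the stored value directly.

```python
NSEC_PER_USEC = 1000


class TaskGroupModel:
    """Toy stand-in for the bandwidth state; burst is stored in nanoseconds."""

    def __init__(self) -> None:
        self.burst_ns = 0

    def get_burst_us(self) -> int:
        # Like the getter described in the message: reports microseconds.
        return self.burst_ns // NSEC_PER_USEC

    def write_burst(self, burst_us: int) -> None:
        # Models a write to cpu.max.burst: user value is in microseconds.
        self.burst_ns = burst_us * NSEC_PER_USEC

    def write_max_buggy(self) -> None:
        # Bug: the microsecond value is reused as if it were nanoseconds.
        burst = self.get_burst_us()
        self.burst_ns = burst

    def write_max_fixed(self) -> None:
        # Fix: take the stored nanosecond value directly, no unit conversion.
        burst = self.burst_ns
        self.burst_ns = burst


buggy = TaskGroupModel()
buggy.write_burst(1_000_000)
buggy.write_max_buggy()
buggy_result = buggy.get_burst_us()

fixed = TaskGroupModel()
fixed.write_burst(1_000_000)
fixed.write_max_fixed()
fixed_result = fixed.get_burst_us()

print(buggy_result, fixed_result)
```

In the buggy path the burst shrinks by a factor of 1000 on each write to cpu.max, which is consistent with the jump from 1000000 to 1000 shown in the message.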
Phil Auld	e997ca8943	delayacct: track delays from IRQ/SOFTIRQ JIRA: https://issues.redhat.com/browse/RHEL-48226 Conflicts: Context difference in delayacct.h due to different location of delayaact_tsk_init in rhel codebase. commit a3b2aeac9d154e5e15ddbf19de934c0c606b6acd Author: Yang Yang <yang.yang19@zte.com.cn> Date: Sat Apr 8 17:28:35 2023 +0800 delayacct: track delays from IRQ/SOFTIRQ Delay accounting does not track the delay of IRQ/SOFTIRQ. While IRQ/SOFTIRQ could have obvious impact on some workloads productivity, such as when workloads are running on system which is busy handling network IRQ/SOFTIRQ. Get the delay of IRQ/SOFTIRQ could help users to reduce such delay. Such as setting interrupt affinity or task affinity, using kernel thread for NAPI etc. This is inspired by "sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure"[1]. Also fix some code indent problems of older code. And update tools/accounting/getdelays.c: / # ./getdelays -p 156 -di print delayacct stats ON printing IO accounting PID 156 CPU count real total virtual total delay total delay average 15 15836008 16218149 275700790 18.380ms IO count delay total delay average 0 0 0.000ms SWAP count delay total delay average 0 0 0.000ms RECLAIM count delay total delay average 0 0 0.000ms THRASHING count delay total delay average 0 0 0.000ms COMPACT count delay total delay average 0 0 0.000ms WPCOPY count delay total delay average 36 7586118 0.211ms IRQ count delay total delay average 42 929161 0.022ms [1] commit 52b1364ba0b1("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure") Link: https://lkml.kernel.org/r/202304081728353557233@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: junhua huang <huang.junhua@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra 
<a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Phil Auld <pauld@redhat.com>	2024-07-15 11:12:08 -04:00
Lucas Zampieri	f6029bf351	Merge: workqueue: Backport workqueue commits to v6.9 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3910 JIRA: https://issues.redhat.com/browse/RHEL-25103 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3910 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3847 The primary purpose of this MR is to backport those upstream workqueue commits which enables ordered workqueues and rescuers to follow changes in workqueue unbound cpumask which is necessary to make sure that isolated CPUs won't be disturbed due to unbound work items being handled by those CPUs. These upstream commits were merged into the v6.9 kernel which also contains some major changes in workqueue code. This makes the required commits dependent on some of the v6.9 workqueue commits. It is less risky to sync the workqueue code up to v6.9 instead of selective backports of some dependent commits. This MR also includes some miscellaneous commits in other subsystems due to changes in the underlying workqueue implementations. A follow-up proactive workqueue fixes MR will be created later on, if necessary. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Tony Camuso <tcamuso@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Approved-by: Vladis Dronov <vdronov@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Radu Rendec <rrendec@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-06-13 13:07:43 +00:00
Waiman Long	6d0328a7cf	Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"" JIRA: https://issues.redhat.com/browse/RHEL-36683 Upstream Status: RHEL only This reverts commit `08637d76a2` which is a revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8" Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-18 21:38:20 -04:00
Lucas Zampieri	08637d76a2	Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8" This reverts merge request !4128	2024-05-16 15:26:41 +00:00
Lucas Zampieri	1ce55b7cbb	Merge: cgroup: Backport upstream cgroup commits up to v6.8 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128 JIRA: https://issues.redhat.com/browse/RHEL-34600 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128 This MR backports upstream cgroup commits up to v6.8 with related fixes, if applicable. It also pulls in a number of scheduler and PSI related commits due to their interaction with cgroup. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Tony Camuso <tcamuso@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Xin Long <lxin@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-05-16 13:28:22 +00:00
Lucas Zampieri	f67ab7550c	Merge: Scheduler: rhel9.5 updates MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3975 JIRA: https://issues.redhat.com/browse/RHEL-25535 JIRA: https://issues.redhat.com/browse/RHEL-20158 JIRA: https://issues.redhat.com/browse/RHEL-15622 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935 Tested: Scheduler stress tests. Perf Qe will do a performance regression test. A collection of fixes and updates that brings the core scheduler code up to v6.8. EEVDF related commits are skipped since we are not planning to take the new task scheduler in rhel9. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Juri Lelli <juri.lelli@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-05-08 20:13:47 +00:00
Waiman Long	1665f6ac9c	workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 616db8779b1e3f93075df691432cccc5ef3c3ba0 Author: Tejun Heo <tj@kernel.org> Date: Wed, 17 May 2023 17:02:08 -1000 workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE If a per-cpu work item hogs the CPU, it can prevent other work items from starting through concurrency management. A per-cpu workqueue which intends to host such CPU-hogging work items can choose to not participate in concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be error-prone and difficult to debug when missed. This patch adds an automatic CPU usage based detection. If a concurrency-managed work item consumes more CPU time than the threshold (10ms by default) continuously without intervening sleeps, wq_worker_tick() which is called from scheduler_tick() will detect the condition and automatically mark it CPU_INTENSIVE. The mechanism isn't foolproof: * Detection depends on tick hitting the work item. Getting preempted at the right timings may allow a violating work item to evade detection at least temporarily. * nohz_full CPUs may not be running ticks and thus can fail detection. * Even when detection is working, the 10ms detection delays can add up if many CPU-hogging work items are queued at the same time. However, in vast majority of cases, this should be able to detect violations reliably and provide reasonable protection with a small increase in code complexity. If some work items trigger this condition repeatedly, the bigger problem likely is the CPU being saturated with such per-cpu work items and the solution would be making them UNBOUND. The next patch will add a debug mechanism to help spot such cases. v4: Documentation for workqueue.cpu_intensive_thresh_us added to kernel-parameters.txt. v3: Switch to use wq_worker_tick() instead of hooking into preemptions as suggested by Peter. 
v2: Lai pointed out that wq_worker_stopping() also needs to be called from preemption and rtlock paths and an earlier patch was updated accordingly. This patch adds a comment describing the risk of infinte recursions and how they're avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:24 -04:00
Lucas Zampieri	d83249ff08	Merge: futex: Rebase futex code to v6.8 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3827 JIRA: https://issues.redhat.com/browse/RHEL-28616 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3827 This MR rebases the RHEL9 futex code base to align to the v6.8 upstream kernel to gain access to new futex syscalls and functionality that are likely needed by userspace applications and other kernel subsystems. It also includes the reverting of some linux-rt-devel specific rt-mutex and scheduler patches and replacing them with upstream linux equivalents. It also includes some unrelated syscall patches. These are all done to ease the current and future backporting effort. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Crystal Wood <crwood@redhat.com> Approved-by: David Arcari <darcari@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-04-29 14:06:58 +00:00
Waiman Long	fe6fb41529	sched/core: Update stale comment in try_to_wake_up() JIRA: https://issues.redhat.com/browse/RHEL-34600 commit ea41bb514fe286bf50498b3c6d7f7a5dc2b6c5e0 Author: Ingo Molnar <mingo@kernel.org> Date: Wed, 4 Oct 2023 11:33:36 +0200 sched/core: Update stale comment in try_to_wake_up() The following commit: 9b3c4ab3045e ("sched,rcu: Rework try_invoke_on_locked_down_task()") ... renamed try_invoke_on_locked_down_task() to task_call_func(), but forgot to update the comment in try_to_wake_up(). But it turns out that the smp_rmb() doesn't live in task_call_func() either, it was moved to __task_needs_rq_lock() in: 91dabf33ae5d ("sched: Fix race in task_call_func()") Fix that now. Also fix the s/smb/smp typo while at it. Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230731085759.11443-1-zhangqiao22@huawei.com Signed-off-by: Waiman Long <longman@redhat.com>	2024-04-26 22:49:23 -04:00
Waiman Long	5246c3da80	sched: add throttled time stat for throttled children JIRA: https://issues.redhat.com/browse/RHEL-34600 commit 677ea015f231aa38b3972aa7be54ecd2637e99fd Author: Josh Don <joshdon@google.com> Date: Tue, 20 Jun 2023 11:32:47 -0700 sched: add throttled time stat for throttled children We currently export the total throttled time for cgroups that are given a bandwidth limit. This patch extends this accounting to also account the total time that each children cgroup has been throttled. This is useful to understand the degree to which children have been affected by the throttling control. Children which are not runnable during the entire throttled period, for example, will not show any self-throttling time during this period. Expose this in a new interface, 'cpu.stat.local', which is similar to how non-hierarchical events are accounted in 'memory.events.local'. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com Signed-off-by: Waiman Long <longman@redhat.com>	2024-04-26 22:49:11 -04:00
Waiman Long	9dedff9054	sched: Fix race in task_call_func() JIRA: https://issues.redhat.com/browse/RHEL-34600 commit 91dabf33ae5df271da63e87ad7833e5fdb4a44b9 Author: Peter Zijlstra <peterz@infradead.org> Date: Wed, 26 Oct 2022 13:43:00 +0200 sched: Fix race in task_call_func() There is a very narrow race between schedule() and task_call_func(). CPU0 CPU1 __schedule() rq_lock(); prev_state = READ_ONCE(prev->__state); if (... && prev_state) { deactivate_tasl(rq, prev, ...) prev->on_rq = 0; task_call_func() raw_spin_lock_irqsave(p->pi_lock); state = READ_ONCE(p->__state); smp_rmb(); if (... || p->on_rq) // false!!! rq = __task_rq_lock() ret = func(); next = pick_next_task(); rq = context_switch(prev, next) prepare_lock_switch() spin_release(&__rq_lockp(rq)->dep_map...) So while the task is on it's way out, it still holds rq->lock for a little while, and right then task_call_func() comes in and figures it doesn't need rq->lock anymore (because the task is already dequeued -- but still running there) and then the __set_task_frozen() thing observes it's holding rq->lock and yells murder. Avoid this by waiting for p->on_cpu to get cleared, which guarantees the task is fully finished on the old CPU. ( While arguably the fixes tag is 'wrong' -- none of the previous task_call_func() users appears to care for this case. ) Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://lkml.kernel.org/r/Y1kdRNNfUeAU+FNl@hirez.programming.kicks-ass.net Signed-off-by: Waiman Long <longman@redhat.com>	2024-04-26 22:49:08 -04:00
Waiman Long	812de711d8	sched: Fix TASK_state comparisons JIRA: https://issues.redhat.com/browse/RHEL-34600 commit 5aec788aeb8eb74282b75ac1b317beb0fbb69a42 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue, 27 Sep 2022 21:02:34 +0200 sched: Fix TASK_state comparisons Task state is fundamentally a bitmask; direct comparisons are probably not working as intended. Specifically the normal wait-state have a number of possible modifiers: TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE TASK_INTERRUPTIBLE: TASK_FREEZABLE Specifically, the addition of TASK_FREEZABLE wrecked __wait_is_interruptible(). This however led to an audit of direct comparisons yielding the rest of the changes. Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com> Debugged-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Waiman Long <longman@redhat.com>	2024-04-26 22:49:07 -04:00
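The point about task state being a bitmask can be illustrated with a small model. The flag values below are assumptions chosen for the sketch and the helper names are hypothetical; this is not the kernel's actual check, only a demonstration of why an equality test stops matching once a modifier such as TASK_FREEZABLE is OR-ed into the state.

```python
# Illustrative flag values (assumed for this sketch, not authoritative).
TASK_INTERRUPTIBLE = 0x0001
TASK_UNINTERRUPTIBLE = 0x0002
TASK_FREEZABLE = 0x2000


def is_interruptible_by_equality(state: int) -> bool:
    # Direct comparison: breaks as soon as any modifier bit is set.
    return state == TASK_INTERRUPTIBLE


def is_interruptible_by_mask(state: int) -> bool:
    # Bit test: ignores modifier bits that may accompany the base state.
    return bool(state & TASK_INTERRUPTIBLE)


plain = TASK_INTERRUPTIBLE
freezable = TASK_INTERRUPTIBLE | TASK_FREEZABLE

print(is_interruptible_by_equality(plain), is_interruptible_by_equality(freezable))
print(is_interruptible_by_mask(plain), is_interruptible_by_mask(freezable))
```

The equality helper accepts the plain state but rejects the freezable variant, while the mask helper accepts both and still rejects an uninterruptible state.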
Waiman Long	724656e7cf	freezer,sched: Rewrite core freezer logic JIRA: https://issues.redhat.com/browse/RHEL-34600 Conflicts: 1) A merge conflict in the kernel/signal.c hunk due to the presence of RHEL-only commit `975d318867` ("signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT."). 2) A merge conflict in the kernel/time/hrtimer.c hunk due to the presence of RHEL-only commit `5f76194136` ("time/hrtimer: Embed hrtimer mode into hrtimer_sleeper"). 3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due to the presence of upstream commit 38c8a9a52082 ("smb: move client and server files to common directory fs/smb"). 4) Similarly, the fs/cifs/transport.c hunk was applied to fs/smb/client/transport.c manually due to the presence of a later upstream commit d527f51331ca ("cifs: Fix UAF in cifs_demultiplex_thread()"). Note that all the prerequisite patches in the same patch series (https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/) had already been merged into RHEL9. commit f5d39b020809146cc28e6e73369bf8065e0310aa Author: Peter Zijlstra <peterz@infradead.org> Date: Mon, 22 Aug 2022 13:18:22 +0200 freezer,sched: Rewrite core freezer logic Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freeze() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before bringing SMP back, before userspace, etc. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states can be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org Signed-off-by: Waiman Long <longman@redhat.com>	2024-04-26 22:49:06 -04:00
Lucas Zampieri	d23522d08a	Merge: Sched: schedutil/cpufreq updates MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3935 JIRA: https://issues.redhat.com/browse/RHEL-29020 Bring schedutil code up to about v6.8. This includes some fixes for code in rhel9 from the 5.14 rebase. There are a few pieces in the cpufreq driver code and the arm architectures needed to make it complete. Tested: Ran stress tests with schedutil governor. Ran general scheduler stress and performance tests. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Mark Langsdorf <mlangsdo@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-04-26 12:34:20 +00:00
Lucas Zampieri	79eb65d175	Merge: sched: apply class and guard cleanups MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3865 JIRA: https://issues.redhat.com/browse/RHEL-29017 Apply the changes using the macros in include/linux/cleanup.h providing scoped guards. There is no real functional change. We rely on the compiler to clean up rather than having explicit unwinding with gotos. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Juri Lelli <juri.lelli@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-04-22 12:41:20 +00:00
Audra Mitchell	0f85367c44	panic: Consolidate open-coded panic_on_warn checks JIRA: https://issues.redhat.com/browse/RHEL-27739 This patch is a backport of the following upstream commit: commit 79cc1ba7badf9e7a12af99695a557e9ce27ee967 Author: Kees Cook <keescook@chromium.org> Date: Thu Nov 17 15:43:24 2022 -0800 panic: Consolidate open-coded panic_on_warn checks Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org Signed-off-by: Audra Mitchell <audra@redhat.com>	2024-04-09 09:43:00 -04:00
Phil Auld	8e7f4729fa	sched/deadline: Introduce deadline servers JIRA: https://issues.redhat.com/browse/RHEL-25535 Conflicts: Context diff in include/linux/sched.h mostly due to not having fd593511cdfc ("tracing/user_events: Track fork/exec/exit for mm lifetime"). commit 63ba8422f876e32ee564ea95da9a7313b13ff0a1 Author: Peter Zijlstra <peterz@infradead.org> Date: Sat Nov 4 11:59:21 2023 +0100 sched/deadline: Introduce deadline servers Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks with higher priority (e.g., SCHED_FIFO) monopolize CPU(s). RT Throttling was introduced a while ago as a (mostly debug) countermeasure one can utilize to reserve some CPU time for low priority tasks (usually background type of work, e.g. workqueues, timers, etc.). It however has its own problems (see documentation) and the undesired effect of unconditionally throttling FIFO tasks even when no lower priority activity needs to run (there are mechanisms to fix this issue as well, but, again, with their own problems). Introduce deadline servers to service the needs of low priority tasks under starvation conditions. Deadline servers are built by extending the SCHED_DEADLINE implementation to allow 2-level scheduling (a sched_deadline entity becomes a container for lower priority scheduling entities). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org Signed-off-by: Phil Auld <pauld@redhat.com>	2024-04-08 15:47:16 -04:00
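
The "sched: Fix TASK_state comparisons" entry above notes that task state is a bitmask, so a direct equality test stops matching once a modifier such as TASK_FREEZABLE is ORed in. A minimal sketch of that pitfall follows. It is not the kernel's code: the `DEMO_` macros, their bit values, and both helper functions are hypothetical names chosen for illustration, and the real definitions live in include/linux/sched.h.

```c
#include <stdbool.h>

/*
 * Illustrative values only (assumption): stand-ins for the kernel's
 * task state bits, not copied from include/linux/sched.h.
 */
#define DEMO_TASK_INTERRUPTIBLE   0x0001u
#define DEMO_TASK_UNINTERRUPTIBLE 0x0002u
#define DEMO_TASK_FREEZABLE       0x2000u /* modifier bit */

/* Fragile: exact equality fails as soon as any modifier bit is set. */
static bool demo_is_interruptible_eq(unsigned int state)
{
	return state == DEMO_TASK_INTERRUPTIBLE;
}

/* Robust: test only the base bit and ignore any modifiers. */
static bool demo_is_interruptible_mask(unsigned int state)
{
	return (state & DEMO_TASK_INTERRUPTIBLE) != 0;
}
```

With these hypothetical helpers, `demo_is_interruptible_eq(DEMO_TASK_INTERRUPTIBLE | DEMO_TASK_FREEZABLE)` is false while `demo_is_interruptible_mask()` on the same value is true, which is the kind of mismatch the commit message describes for `__wait_is_interruptible()`.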

1596 Commits