Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Radostin Stoyanov	3a901bcc52	cgroup: Do not report unavailable v1 controllers in /proc/cgroups JIRA: https://issues.redhat.com/browse/RHEL-80382 commit af000ce85293b8e608f696f0c6c280bc3a75887f Author: Michal Koutný <mkoutny@suse.com> Date: Mon Sep 9 18:32:23 2024 +0200 cgroup: Do not report unavailable v1 controllers in /proc/cgroups This is a followup to CONFIG-urability of cpuset and memory controllers for v1 hierarchies. Make the output in /proc/cgroups reflect that !CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and !CONFIG_MEMCG_V1 is like !CONFIG_MEMCG. The intended effect is that hiding the unavailable controllers will hint users not to try mounting them on v1. Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 11:26:39 +01:00
Radostin Stoyanov	b16f7d3e66	cgroup: Disallow mounting v1 hierarchies without controller implementation JIRA: https://issues.redhat.com/browse/RHEL-80382 commit 3c41382e920f1dd5c9f432948fe799c07af1cced Author: Michal Koutný <mkoutny@suse.com> Date: Mon Sep 9 18:32:22 2024 +0200 cgroup: Disallow mounting v1 hierarchies without controller implementation The configs that disable some v1 controllers would still allow mounting them but with no controller-specific files. (Making such hierarchies equivalent to named v1 hierarchies.) To achieve behavior consistent with actual out-compilation of a whole controller, the mounts should treat respective controllers as non-existent. Wrap implementation into a helper function, leverage legacy_files to detect compiled out controllers. The effect is that mounts on v1 would fail and produce a message like: [ 1543.999081] cgroup: Unknown subsys name 'memory' Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 11:26:39 +01:00
Radostin Stoyanov	1e0823a037	cgroup: Fix potential overflow issue when checking max_depth JIRA: https://issues.redhat.com/browse/RHEL-80382 commit 3cc4e13bb1617f6a13e5e6882465984148743cf4 Author: Xiu Jianfeng <xiujianfeng@huawei.com> Date: Sat Oct 12 07:22:46 2024 +0000 cgroup: Fix potential overflow issue when checking max_depth cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. However due to the cgroup->max_depth is of int type and having the default value INT_MAX, the condition 'level > cgroup->max_depth' will never be satisfied, and it will cause an overflow of the level after it reaches to INT_MAX. Fix it by starting the level from 0 and using '>=' instead. It's worth mentioning that this issue is unlikely to occur in reality, as it's impossible to have a depth of INT_MAX hierarchy, but should be be avoided logically. Fixes: `1a926e0bba` ("cgroup: implement hierarchy limits") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 10:54:34 +01:00
Radostin Stoyanov	d795c506d2	cgroup/cpuset: Check for partition roots with overlapping CPUs JIRA: https://issues.redhat.com/browse/RHEL-80382 commit 99570300d3b4c8a1463491754d58e7a8d87cacef Author: Waiman Long <longman@redhat.com> Date: Sun Aug 4 21:30:18 2024 -0400 cgroup/cpuset: Check for partition roots with overlapping CPUs With the previous commit that eliminates the overlapping partition root corner cases in the hotplug code, the partition roots passed down to generate_sched_domains() should not have overlapping CPUs. Enable overlapping cpuset check for v2 and warn if that happens. This patch also has the benefit of increasing test coverage of the new Union-Find cpuset merging code to cgroup v2. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 10:54:34 +01:00
Radostin Stoyanov	a387509480	cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU JIRA: https://issues.redhat.com/browse/RHEL-80382 commit 0e40cf2a8b2c847950e025d5aa594bd545118d26 Author: Kinsey Ho <kinseyho@google.com> Date: Thu Sep 5 00:30:50 2024 +0000 cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Patch series "Improve mem_cgroup_iter()", v4. Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability. [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/ This patch (of 5): Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos. The following changes in this patchset rely on this clarification for simplification in memcg iteration code. Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com Suggested-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 10:54:33 +01:00
Radostin Stoyanov	702a637010	cgroup/cpuset: Remove cpuset_slab_spread_rotor JIRA: https://issues.redhat.com/browse/RHEL-80382 commit c149c4a48b19afbf0c383614e57b452d39b154de Author: Xiu Jianfeng <xiujianfeng@huawei.com> Date: Sat Jul 13 08:59:16 2024 +0000 cgroup/cpuset: Remove cpuset_slab_spread_rotor Since the SLAB implementation was removed in v6.8, so the cpuset_slab_spread_rotor is no longer used and can be removed. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 10:54:33 +01:00
Radostin Stoyanov	6d25e55fb0	cgroup: update some statememt about delegation JIRA: https://issues.redhat.com/browse/RHEL-80382 commit d1a92d2d6c5dbeba9a87bfb57fa0142cdae7b206 Author: Chen Ridong <chenridong@huawei.com> Date: Thu Aug 15 13:14:08 2024 +0000 cgroup: update some statememt about delegation The comment in cgroup_file_write is missing some interfaces, such as 'cgroup.threads'. All delegatable files are listed in '/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write. Besides, add a statement that files outside the namespace shouldn't be visible from inside the delegated namespace. tj: Reflowed text for consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>	2025-04-28 10:54:33 +01:00
Augusto Caringi	e2a8c62ac1	Merge: livepatch: selected fixes for rhel-9.7 v2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6654 JIRA: https://issues.redhat.com/browse/RHEL-85303 A small series of fixes for the RHEL9.7 livepatch subsystem. Signed-off-by: Denis Aleksandrov <daleksan@redhat.com> Approved-by: Joe Lawrence <joe.lawrence@redhat.com> Approved-by: Ryan Sullivan <rysulliv@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-24 12:23:31 -03:00
Augusto Caringi	63ccd7ece5	Merge: mm: backport of proactive fixes MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6650 JIRA: https://issues.redhat.com/browse/RHEL-78989 JIRA: https://issues.redhat.com/browse/RHEL-80529 JIRA: https://issues.redhat.com/browse/RHEL-83249 JIRA: https://issues.redhat.com/browse/RHEL-84184 CVE: CVE-2025-21691 CVE: CVE-2025-21696 CVE: CVE-2025-21861 Proactively backport a set of selected follow-up Fixes for the MM patches previously backported into RHEL-9 minor releases. Dependencies and follow-up fixes for the selected commits are also selectively backported. Omitted-fix: e080a26725fb ("erofs: allow large folios for compressed files") Omitted-fix: 3488af097044 ("mm/damon/core: handle zero {aggregation,ops_update} intervals") Omitted-fix: 5e06ad590096 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero") Omitted-fix: 25e8acbcf19c ("mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero") Omitted-fix: 1390a3334a48 ("mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio") Omitted-fix: 7ddeb91f5b03 ("mm: kmemleak: add support for dumping physical and __percpu object info") Signed-off-by: Rafael Aquini <raquini@redhat.com> Approved-by: David Arcari <darcari@redhat.com> Approved-by: Čestmír Kalina <ckalina@redhat.com> Approved-by: Herton R. Krzesinski <herton@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-24 12:23:31 -03:00
Augusto Caringi	3f25b9462f	Merge: sched: Fix stop_one_cpu_nowait() vs hotplug [rhel-9] MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6623 JIRA: https://issues.redhat.com/browse/RHEL-84526 Sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test have been reported and fixed upstream – Notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Both C10S and RHEL-10 already carry the fix from this upstream commit: f0498d2a54e79 sched: Fix stop_one_cpu_nowait() vs hotplug Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Approved-by: Gabriele Monaco <gmonaco@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-24 12:23:25 -03:00
Augusto Caringi	ac837b9b45	Merge: cgroup/cpuset: Fix issues in the cpuset partition code MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6722 JIRA: https://issues.redhat.com/browse/RHEL-83455 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6722 The Jira ticket reports a loss of isolated CPUs when isolated partitions are being created, i.e. Some of the isolated CPUs are missing and are not in any of the existing cpusets. We are not able to reproduce this problem in-house, but detailed analysis of the cpuset partition code does reveal issues that need to be fixed. This MR incorporates the latest cpuset fixes that were merged upstream and hopefully will be able to address the issues seen by the customers. To reduce conflicts and other complications, some other recent cpuset commits are also included as well. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Herton R. Krzesinski <herton@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Radostin Stoyanov <rstoyano@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-24 12:23:21 -03:00
Augusto Caringi	b65573c720	Merge: rtla: Add timerlat BPF sample collection, Set all tracer options by default [rhel-9] MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6686 # Merge Request Required Information JIRA: https://issues.redhat.com/browse/RHEL-77358 JIRA: https://issues.redhat.com/browse/RHEL-86051 ## Summary of Changes Two upstream patchsets are contained in this merge request: * Collect timerlat samples using a BPF program instead of pulling them through a tracefs pipe. This helps with both performance and CPU usage, and fixes an issue where on systems with >100 CPUs, rtla cannot keep up with timerlat samples and drops most of them, making it useless. * Always set default values of all tracer options (osnoise or timerlat) if not specified otherwise, as they might be set to unexpected values either by another user of the tracers or by a previous abnormally exited run of rtla. Those are combined into a single MR, because the latter depends on a refactoring done in the former. A dependency (rtla test suite) is also pulled. ## Approved Development Ticket(s) All submissions to CentOS Stream must reference a ticket in [Red Hat Jira](https://issues.redhat.com/). <details><summary>Click for formatting instructions</summary> Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved. Signed-off-by: Tomas Glozar <tglozar@redhat.com> List tickets each on their own line of this description using the format "Resolves: RHEL-76229", "Related: RHEL-76229" or "Reverts: RHEL-76229", as appropriate. </details> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: John Kacur <jkacur@redhat.com> Approved-by: Gabriele Monaco <gmonaco@redhat.com> Approved-by: Derek Barbosa <debarbos@redhat.com> Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-24 12:23:18 -03:00
Augusto Caringi	921043e372	Merge: watch_queue: fix pipe accounting mismatch MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6667 JIRA: https://issues.redhat.com/browse/RHEL-78249 commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0 Author: Eric Sandeen <sandeen@redhat.com> Date: Thu Feb 27 11:41:08 2025 -0600 watch_queue: fix pipe accounting mismatch Currently, watch_queue_set_size() modifies the pipe buffers charged to user->pipe_bufs without updating the pipe->nr_accounted on the pipe itself, due to the if (!pipe_has_watch_queue()) test in pipe_resize_ring(). This means that when the pipe is ultimately freed, we decrement user->pipe_bufs by something other than what than we had charged to it, potentially leading to an underflow. This in turn can cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM. To remedy this, explicitly account for the pipe usage in watch_queue_set_size() to match the number set via account_pipe_buffers() (It's unclear why watch_queue_set_size() does not update nr_accounted; it may be due to intentional overprovisioning in watch_queue_set_size()?) Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: David Howells <dhowells@redhat.com> Approved-by: Pavel Reichl <preichl@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-24 12:23:17 -03:00
Rafael Aquini	e4205ccf96	kernel: be more careful about dup_mmap() failures and uprobe registering JIRA: https://issues.redhat.com/browse/RHEL-84184 CVE: CVE-2025-21709 Conflicts: * kernel/events/uprobes.c: a notable context difference in the 1st hunk due to RHEL-9 missing the following upstream commits: 87195a1ee332a, 2bf8e5aceff89, and dd1a7567784e2; and a notable contex difference in the 2nd hunk due to RHEL-9 missing the following upstream commits: 84455e6923c7 and 8617408f7a01. None of the aforelisted commits are of any relevance for this backport work. This patch is a backport of the following upstream commit: commit 64c37e134b120fb462fb4a80694bfb8e7be77b14 Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Mon Jan 27 12:02:21 2025 -0500 kernel: be more careful about dup_mmap() failures and uprobe registering If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct. Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below. Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race. This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct. When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised. Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2025-04-18 08:39:53 -04:00
Rafael Aquini	6abde438ad	fork: avoid inappropriate uprobe access to invalid mm JIRA: https://issues.redhat.com/browse/RHEL-84184 Conflicts: * kernel/fork.c: minor difference from upstream due to an extra blank line that was left behind when commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") was backported into RHEL-9 This patch is a backport of the following upstream commit: commit 8ac662f5da19f5873fdd94c48a5cdb45b2e1b58f Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Date: Tue Dec 10 17:24:12 2024 +0000 fork: avoid inappropriate uprobe access to invalid mm If dup_mmap() encounters an issue, currently uprobe is able to access the relevant mm via the reverse mapping (in build_map_info()), and if we are very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which we establish as part of the fork error path. This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which in turn invokes find_mergeable_anon_vma() that uses a VMA iterator, invoking vma_iter_load() which uses the advanced maple tree API and thus is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()"). This change was made on the assumption that only process tear-down code would actually observe (and make use of) these values. However this very unlikely but still possible edge case with uprobes exists and unfortunately does make these observable. The uprobe operation prevents races against the dup_mmap() operation via the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap() and dropped via uprobe_end_dup_mmap(), and held across register_for_each_vma() prior to invoking build_map_info() which does the reverse mapping lookup. Currently these are acquired and dropped within dup_mmap(), which exposes the race window prior to error handling in the invoking dup_mm() which tears down the mm. We can avoid all this by just moving the invocation of uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm() and only release this lock once the dup_mmap() operation succeeds or clean up is done. This means that the uprobe code can never observe an incompletely constructed mm and resolves the issue in this case. Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2025-04-18 08:39:52 -04:00
Waiman Long	f397624dbd	cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs JIRA: https://issues.redhat.com/browse/RHEL-83455 Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git commit 86888c7bd117c29eab169c37e5f6bbbf583da983 Author: Waiman Long <longman@redhat.com> Date: Mon, 7 Apr 2025 17:21:05 -0400 cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs Add WARN_ON_ONCE() statements whenever new exclusive CPUs are being added to a partition root to catch inconsistency in the way exclusive CPUs are being handled in the cpuset code. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:43 -04:00
Waiman Long	19950eb6cd	cgroup/cpuset: Code cleanup and comment update JIRA: https://issues.redhat.com/browse/RHEL-83455 Conflicts: Some minor context diffs due to missing upstream commit 381b53c3b549 ("cgroup/cpuset: rename functions shared between v1 and v2") and commit 2ff899e35164 ("sched/deadline: Rebuild root domain accounting after every update"). commit f0a0bd3d23a44a2c5f628e8ca8ad882498ca5aae Author: Waiman Long <longman@redhat.com> Date: Sun, 30 Mar 2025 17:52:44 -0400 cgroup/cpuset: Code cleanup and comment update Rename partition_xcpus_newstate() to isolated_cpus_update(), update_partition_exclusive() to update_partition_exclusive_flag() and the new_xcpus_state variable to isolcpus_updated to make their meanings more explicit. Also add some comments to further clarify the code. No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:41 -04:00
Waiman Long	710ee1a4b8	cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition JIRA: https://issues.redhat.com/browse/RHEL-83455 commit f62a5d39368e34a966c8df63e1f05eed7fe9c5de Author: Waiman Long <longman@redhat.com> Date: Sun, 30 Mar 2025 17:52:42 -0400 cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition Currently, changes in exclusive CPUs are being handled in remote_partition_check() by disabling conflicting remote partitions. However, that may lead to results unexpected by the users. Fix this problem by removing remote_partition_check() and making update_cpumasks_hier() handle changes in descendant remote partitions properly. The compute_effective_exclusive_cpumask() function is enhanced to check the exclusive_cpus and effective_xcpus from siblings and excluded them in its effective exclusive CPUs computation and return a value to show if there is any sibling conflicts. This is somewhat like the cpu_exclusive flag check in validate_change(). This is the initial step to enable us to retire the use of cpu_exclusive flag in cgroup v2 in the future. One of the tests in the TEST_MATRIX of the test_cpuset_prs.sh script has to be updated due to changes in the way a child remote partition root is being handled (updated instead of invalidation) in update_cpumasks_hier(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:41 -04:00
Waiman Long	b6ee781e20	cgroup/cpuset: Fix error handling in remote_partition_disable() JIRA: https://issues.redhat.com/browse/RHEL-83455 commit 8bf450f3aec3d1bbd725d179502c64b8992588e4 Author: Waiman Long <longman@redhat.com> Date: Sun, 30 Mar 2025 17:52:41 -0400 cgroup/cpuset: Fix error handling in remote_partition_disable() When remote_partition_disable() is called to disable a remote partition, it always sets the partition to an invalid partition state. It should only do so if an error code (prs_err) has been set. Correct that and add proper error code in places where remote_partition_disable() is called due to error. Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:41 -04:00
Waiman Long	d6a8d6bd83	cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask() JIRA: https://issues.redhat.com/browse/RHEL-83455 commit 668e041662e92ab3ebcb9eb606d3ec01884546ab Author: Waiman Long <longman@redhat.com> Date: Sun, 30 Mar 2025 17:52:40 -0400 cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask() Before commit f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to partition & cpus changes"), a cpuset partition cannot be enabled if not all the requested CPUs can be granted from the parent cpuset. After that commit, a cpuset partition can be created even if the requested exclusive CPUs contain CPUs not allowed its parent. The delmask containing exclusive CPUs to be removed from its parent wasn't adjusted accordingly. That is not a problem until the introduction of a new isolated_cpus mask in commit 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") as the CPUs in the delmask may be added directly into isolated_cpus. As a result, isolated_cpus may incorrectly contain CPUs that are not isolated leading to incorrect data reporting. Fix this by adjusting the delmask to reflect the actual exclusive CPUs for the creation of the partition. Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:40 -04:00
Waiman Long	9022c81a98	cgroup/cpuset: Fix race between newly created partition and dying one JIRA: https://issues.redhat.com/browse/RHEL-83455 Conflicts: A merge conflict in the cpuset_css_offline() hunk due to missing upstream commit c4c9cebe2fb9 ("cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set"). commit a22b3d54de94f82ca057cc2ebf9496fa91ebf698 Author: Waiman Long <longman@redhat.com> Date: Sun, 30 Mar 2025 17:52:39 -0400 cgroup/cpuset: Fix race between newly created partition and dying one There is a possible race between removing a cgroup diectory that is a partition root and the creation of a new partition. The partition to be removed can be dying but still online, it doesn't not currently participate in checking for exclusive CPUs conflict, but the exclusive CPUs are still there in subpartitions_cpus and isolated_cpus. These two cpumasks are global states that affect the operation of cpuset partitions. The exclusive CPUs in dying cpusets will only be removed when cpuset_css_offline() function is called after an RCU delay. As a result, it is possible that a new partition can be created with exclusive CPUs that overlap with those of a dying one. When that dying partition is finally offlined, it removes those overlapping exclusive CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an incorrect CPU configuration. This bug was found when a warning was triggered in remote_partition_disable() during testing because the subpartitions_cpus mask was empty. One possible way to fix this is to iterate the dying cpusets as well and avoid using the exclusive CPUs in those dying cpusets. However, this can still cause random partition creation failures or other anomalies due to racing. A better way to fix this race is to reset the partition state at the moment when a cpuset is being killed. Introduce a new css_killed() CSS function pointer and call it, if defined, before setting CSS_DYING flag in kill_css(). Also update the css_is_dying() helper to use the CSS_DYING flag introduced by commit `33c35aa481` ("cgroup: Prevent kill_css() from being called more than once") for proper synchronization. Add a new cpuset_css_killed() function to reset the partition state of a valid partition root if it is being killed. Fixes: `ee8dde0cd2` ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:40 -04:00
Waiman Long	16bd4a6994	cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains JIRA: https://issues.redhat.com/browse/RHEL-83455 commit 9b496a8bbed9cc292b0dfd796f38ec58b6d0375f Author: Waiman Long <longman@redhat.com> Date: Thu, 5 Dec 2024 14:51:01 -0500 cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains Isolated CPUs are not allowed to be used in a non-isolated partition. The only exception is the top cpuset which is allowed to contain boot time isolated CPUs. Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem") introduces a simplified scheme of including only partition roots in sched domain generation. However, it does not properly account for this exception case. This can result in leakage of isolated CPUs into a sched domain. Fix it by making sure that isolated CPUs are excluded from the top cpuset before generating sched domains. Also update the way the boot time isolated CPUs are handled in test_cpuset_prs.sh to make sure that those isolated CPUs are really isolated instead of just skipping them in the tests. Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:39 -04:00
Waiman Long	9a10873e27	cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation JIRA: https://issues.redhat.com/browse/RHEL-83455 Conflicts: A context diff in the cpuset_update_flag() hunk due to missing upstream commit 381b53c3b549 ("cgroup/cpuset: rename functions shared between v1 and v2") and the removal of IS_ENABLED(CONFIG_CPUSETS_V1) check. commit a040c351283e3ac75422621ea205b1d8d687e108 Author: Waiman Long <longman@redhat.com> Date: Sat, 9 Nov 2024 21:50:22 -0500 cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation Since commit ff0ce721ec21 ("cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug"), there is only one rebuild_sched_domains_locked() call per hotplug operation. However, writing to the various cpuset control files may still casue more than one rebuild_sched_domains_locked() call to happen in some cases. Juri had found that two rebuild_sched_domains_locked() calls in update_prstate(), one from update_cpumasks_hier() and another one from update_partition_sd_lb() could cause cpuset partition to be created with null total_bw for DL tasks. IOW, DL tasks may not be scheduled correctly in such a partition. A sample command sequence that can reproduce null total_bw is as follows. # echo Y >/sys/kernel/debug/sched/verbose # echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control # mkdir /sys/fs/cgroup/test # echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus # echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive # echo root >/sys/fs/cgroup/test/cpuset.cpus.partition Fix this double rebuild_sched_domains_locked() calls problem by replacing existing calls with cpuset_force_rebuild() except the rebuild_sched_domains_cpuslocked() call at the end of cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is now done at the end of cpuset_write_resmask() and update_prstate() to determine if rebuild_sched_domains_locked() should be called or not. The cpuset v1 code can still call rebuild_sched_domains_locked() directly as double rebuild_sched_domains_locked() calls is not possible. Reported-by: Juri Lelli <juri.lelli@redhat.com> Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/ Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:39 -04:00
Waiman Long	d55b5f7e82	cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()" JIRA: https://issues.redhat.com/browse/RHEL-83455 commit bcd7012afd7bcd45fcd7a0e2f48e57b273702317 Author: Waiman Long <longman@redhat.com> Date: Sat, 9 Nov 2024 21:50:21 -0500 cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()" Revert commit 3ae0b773211e ("cgroup/cpuset: Allow suppression of sched domain rebuild in update_cpumasks_hier()") to allow for an alternative way to suppress unnecessary rebuild_sched_domains_locked() calls in update_cpumasks_hier() and elsewhere in a following commit. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:39 -04:00
Waiman Long	1a2685707f	cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c JIRA: https://issues.redhat.com/browse/RHEL-83455 Conflicts: Minor context diff in 3 hunks due to missing upstream commit 381b53c3b549 ("cgroup/cpuset: rename functions shared between v1 and v2"). commit 95a616d89ccd2d2af0bd26c13c50143b301d82e8 Author: everestkc <everestkc@everestkc.com.np> Date: Sun, 15 Sep 2024 02:29:21 -0600 cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c Corrected the spelling errors repoted by codespell as follows: temparary ==> temporary Proprogate ==> Propagate constrainted ==> constrained Signed-off-by: Everest K.C. <everestkc@everestkc.com.np> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:38 -04:00
Waiman Long	4916b3852c	cgroup/cpuset: Account for boot time isolated CPUs JIRA: https://issues.redhat.com/browse/RHEL-83455 commit c188f33c864e3dba49a1ad0dc9fddf2f49ac42ae Author: Waiman Long <longman@redhat.com> Date: Tue, 20 Aug 2024 15:55:35 -0400 cgroup/cpuset: Account for boot time isolated CPUs With the "isolcpus" boot command line parameter, we are able to create isolated CPUs at boot time. These isolated CPUs aren't fully accounted for in the cpuset code. For instance, the root cgroup's "cpuset.cpus.isolated" control file does not include the boot time isolated CPUs. Fix that by looking for pre-isolated CPUs at init time. The prstate_housekeeping_conflict() function does check the HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it can only be used in isolated partition. Given the fact that we are going to make housekeeping cpumasks dynamic, the current check may not be right anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:37 -04:00
Waiman Long	cc3669da09	cgroup/cpuset: remove use_parent_ecpus of cpuset JIRA: https://issues.redhat.com/browse/RHEL-83455 commit 3c2acae88844e7423a50b5cbe0a2c9d430fcd20c Author: Chen Ridong <chenridong@huawei.com> Date: Tue, 20 Aug 2024 03:01:26 +0000 cgroup/cpuset: remove use_parent_ecpus of cpuset use_parent_ecpus is used to track whether the children are using the parent's effective_cpus. When a parent's effective_cpus is changed due to changes in a child partition's effective_xcpus, any child using parent's effective_cpus must call update_cpumasks_hier. However, if a child is not a valid partition, it is sufficient to determine whether to call update_cpumasks_hier based on whether the child's effective_cpus is going to change. To make the code more succinct, it is suggested to remove use_parent_ecpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:37 -04:00
Waiman Long	09205586fe	cgroup/cpuset: remove fetch_xcpus JIRA: https://issues.redhat.com/browse/RHEL-83455 commit 9414f68d454529ff7e68f0c2aefe0a007060c66a Author: Chen Ridong <chenridong@huawei.com> Date: Tue, 20 Aug 2024 03:01:25 +0000 cgroup/cpuset: remove fetch_xcpus Both fetch_xcpus and user_xcpus functions are used to retrieve the value of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the implicit value used as exclusive in a local partition. I can not imagine a scenario where effective_xcpus is not empty when exclusive_cpus is empty. Therefore, I suggest removing the fetch_xcpus function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:37 -04:00
Waiman Long	6981025ee0	cgroup/cpuset: remove child_ecpus_count JIRA: https://issues.redhat.com/browse/RHEL-83455 commit d6326047576266991d88639e1e9739a9a9a20ef4 Author: Chen Ridong <chenridong@huawei.com> Date: Wed, 24 Jul 2024 10:24:18 +0000 cgroup/cpuset: remove child_ecpus_count The child_ecpus_count variable was previously used to update sibling cpumask when parent's effective_cpus is updated. However, it became obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2"). It should be removed. tj: Restored {} for style consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:36 -04:00
Waiman Long	73f605febe	cpuset: use Union-Find to optimize the merging of cpumasks JIRA: https://issues.redhat.com/browse/RHEL-83455 commit 8a895c2e6a7ed264a1b917616db205ed934e8306 Author: Xavier <xavier_qy@163.com> Date: Thu, 4 Jul 2024 14:24:44 +0800 cpuset: use Union-Find to optimize the merging of cpumasks The process of constructing scheduling domains involves multiple loops and repeated evaluations, leading to numerous redundant and ineffective assessments that impact code efficiency. Here, we use union-find to optimize the merging of cpumasks. By employing path compression and union by rank, we effectively reduce the number of lookups and merge comparisons. Signed-off-by: Xavier <xavier_qy@163.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:35 -04:00
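The union-find technique named in commit 73f605febe can be illustrated with a small sketch. This is not the kernel's C implementation: `UnionFind` and `merge_overlapping` are hypothetical names, CPU masks are modeled as Python sets, and the overlap detection through an `owner` map is an assumption made to keep the example short. It only shows how path compression and union by rank group masks that overlap directly or transitively.

```python
class UnionFind:
    """Disjoint-set structure with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # First pass: locate the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


def merge_overlapping(masks):
    """Merge CPU masks (sets of CPU ids) that share at least one CPU."""
    uf = UnionFind(len(masks))
    owner = {}  # CPU id -> index of the first mask that contained it
    for i, mask in enumerate(masks):
        for cpu in mask:
            if cpu in owner:
                uf.union(i, owner[cpu])
            else:
                owner[cpu] = i
    groups = {}
    for i, mask in enumerate(masks):
        groups.setdefault(uf.find(i), set()).update(mask)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    print(merge_overlapping([{0, 1}, {1, 2}, {4, 5}, {6}, {5, 7}]))
```

Each mask is merged at most once per shared CPU, so repeated pairwise re-scans of already merged masks are avoided, which is the kind of redundant evaluation the commit message describes.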
Augusto Caringi	74782eb600	Merge: block: update with v6.14 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6580 JIRA: https://issues.redhat.com/browse/RHEL-79409 we don't backport "block: Fix potential deadlock while freezing queue and acquiring sysfs_lock Omitted-Fix: 224749be6c23 ("block: Revert "block: Fix potential deadlock while freezing queue and acquiring sysfs_lock"") Omitted-Fix: 2fa07d7a0f00 ("btrfs: pass write-hint for buffered IO") Omitted-Fix: e559ee022658 ("btrfs: validate queue limits") Omitted-Fix: 7467bc5959bf ("btrfs: zoned: calculate max_extent_size properly on non-zoned setup") Omitted-Fix: c7c97ceff98c ("btrfs: handle bio_split() errors") Signed-off-by: Ming Lei <ming.lei@redhat.com> Approved-by: Ewan D. Milne <emilne@redhat.com> Approved-by: Maurizio Lombardi <mlombard@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: Vitaly Kuznetsov <vkuznets@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-04 12:34:54 -03:00
Augusto Caringi	c82044acec	Merge: Sched: /proc/schedstat improvements MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6405 JIRA: https://issues.redhat.com/browse/RHEL-23495 Update /proc/schedstat with fixes and improved information from upstream. AMD requested these and they don't carry a large risk. Signed-off-by: Phil Auld <pauld@redhat.com> Approved-by: Juri Lelli <juri.lelli@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-04 12:34:51 -03:00
Augusto Caringi	652f8a293b	Merge: CVE-2025-21726: padata: avoid UAF for reorder_work MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6499 JIRA: https://issues.redhat.com/browse/RHEL-81522 CVE: CVE-2025-21726 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6499 This MR backports the 3-patch series that includes the fix to CVE-2025-21726 as well as two other minor cleanup and fix patches. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Herton R. Krzesinski <herton@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-04-04 12:34:50 -03:00
Tomas Glozar	331cb7536e	trace/osnoise: Add trace events for samples JIRA: https://issues.redhat.com/browse/RHEL-77358 commit a065bbf776d32a71e748bd948861e6deca803d78 Author: Tomas Glozar <tglozar@redhat.com> Date: Mon Feb 3 10:04:18 2025 +0100 trace/osnoise: Add trace events for samples Add trace events that fire at osnoise and timerlat sample generation, in addition to the already existing noise and threshold events. This allows processing the samples directly in the kernel, either with ftrace triggers or with BPF. Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com Signed-off-by: Tomas Glozar <tglozar@redhat.com> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-04-04 11:14:44 +02:00
Denis Aleksandrov	7c3f326164	livepatch: Add stack_order sysfs attribute JIRA: https://issues.redhat.com/browse/RHEL-85303 Add "stack_order" sysfs attribute which holds the order in which a live patch module was loaded into the system. A user can then determine an active live patched version of a function. cat /sys/kernel/livepatch/livepatch_1/stack_order -> 1 means that livepatch_1 is the first live patch applied cat /sys/kernel/livepatch/livepatch_module/stack_order -> N means that livepatch_module is the Nth live patch applied Suggested-by: Petr Mladek <pmladek@suse.com> Suggested-by: Miroslav Benes <mbenes@suse.cz> Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Wardenjohn <zhangwarden@gmail.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/20241008014856.3729-2-zhangwarden@gmail.com [pmladek@suse.com: Updated kernel version and date in the ABI documentation.] Signed-off-by: Petr Mladek <pmladek@suse.com> (cherry picked from commit 3dae09de406167123449d9ece1f51855d5bac01a) Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>	2025-04-03 13:23:15 -04:00
Denis Aleksandrov	cab71a4b8d	livepatch: Use kallsyms_on_each_match_symbol() to improve performance JIRA: https://issues.redhat.com/browse/RHEL-85303 Based on the test results of kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol(), the average performance can be improved by more than 1500 times. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> (cherry picked from commit 9cb37357dfce1b596041ad68a20407c8b4e76635) Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>	2025-04-03 13:22:35 -04:00
Denis Aleksandrov	579531376b	livepatch: Fix build failure on 32 bits processors JIRA: https://issues.redhat.com/browse/RHEL-85303 Trying to build livepatch on powerpc/32 results in: kernel/livepatch/core.c: In function 'klp_resolve_symbols': kernel/livepatch/core.c:221:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info); | ^ kernel/livepatch/core.c:221:21: error: assignment to 'Elf32_Sym *' {aka 'struct elf32_sym *'} from incompatible pointer type 'Elf64_Sym *' {aka 'struct elf64_sym *'} [-Werror=incompatible-pointer-types] 221 | sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info); | ^ kernel/livepatch/core.c: In function 'klp_apply_section_relocs': kernel/livepatch/core.c:312:35: error: passing argument 1 of 'klp_resolve_symbols' from incompatible pointer type [-Werror=incompatible-pointer-types] 312 | ret = klp_resolve_symbols(sechdrs, strtab, symndx, sec, sec_objname); | ^~~~~~~ | | | Elf32_Shdr * {aka struct elf32_shdr *} kernel/livepatch/core.c:193:44: note: expected 'Elf64_Shdr *' {aka 'struct elf64_shdr *'} but argument is of type 'Elf32_Shdr *' {aka 'struct elf32_shdr *'} 193 | static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab, | ~~~~~~~~~~~~^~~~~~~ Fix it by using the right types instead of forcing 64 bits types. Fixes: 7c8e2bdd5f ("livepatch: Apply vmlinux-specific KLP relocations early") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Petr Mladek <pmladek@suse.com> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5288e11b018a762ea3351cc8fb2d4f15093a4457.1640017960.git.christophe.leroy@csgroup.eu (cherry picked from commit 2f293651eca3eacaeb56747dede31edace7329d2) Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>	2025-04-03 13:21:39 -04:00
Carlos Maiolino	972aed5f39	watch_queue: fix pipe accounting mismatch JIRA: https://issues.redhat.com/browse/RHEL-78249 commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0 Author: Eric Sandeen <sandeen@redhat.com> Date: Thu Feb 27 11:41:08 2025 -0600 watch_queue: fix pipe accounting mismatch Currently, watch_queue_set_size() modifies the pipe buffers charged to user->pipe_bufs without updating the pipe->nr_accounted on the pipe itself, due to the if (!pipe_has_watch_queue()) test in pipe_resize_ring(). This means that when the pipe is ultimately freed, we decrement user->pipe_bufs by something other than what we had charged to it, potentially leading to an underflow. This in turn can cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM. To remedy this, explicitly account for the pipe usage in watch_queue_set_size() to match the number set via account_pipe_buffers() (It's unclear why watch_queue_set_size() does not update nr_accounted; it may be due to intentional overprovisioning in watch_queue_set_size()?) Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>	2025-04-02 10:32:41 +02:00
Augusto Caringi	f4ca2d23b8	Merge: tracing: Add division and multiplication support for hist triggers [rhel-9] MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6586 # Merge Request Required Information JIRA: https://issues.redhat.com/browse/RHEL-67679 ## Summary of Changes Backport support for division and multiplication in histogram triggers as well as fixes and optimizations for it. Support for creating hist trigger variables from literal is added, too, as it is a dependency of one of the fixes and is documented together with the former. ## Approved Development Ticket(s) All submissions to CentOS Stream must reference a ticket in [Red Hat Jira](https://issues.redhat.com/). <details><summary>Click for formatting instructions</summary> Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved. List tickets each on their own line of this description using the format "Resolves: RHEL-76229", "Related: RHEL-76229" or "Reverts: RHEL-76229", as appropriate. </details> Signed-off-by: Tomas Glozar <tglozar@redhat.com> Approved-by: Joe Lawrence <joe.lawrence@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-03-31 16:55:04 -03:00
Augusto Caringi	201242e8d4	Merge: cgroup: Remove steal time from usage_usec MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6423 JIRA: https://issues.redhat.com/browse/RHEL-79933 commit db5fd3cf8bf41b84b577b8ad5234ea95f327c9be Author: Muhammad Adeel <Muhammad.Adeel@ibm.com> Date: Fri, 7 Feb 2025 14:24:32 +0000 cgroup: Remove steal time from usage_usec The CPU usage time is the time when user, system or both are using the CPU. Steal time is the time when CPU is waiting to be run by the Hypervisor. It should not be added to the CPU usage time, hence removing it from the usage_usec entry. Fixes: 936f2a70f2 ("cgroup: add cpu.stat file to root cgroup") Acked-by: Axel Busch <axel.busch@ibm.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Thomas Huth <thuth@redhat.com> Approved-by: Radostin Stoyanov <rstoyano@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Augusto Caringi <acaringi@redhat.com>	2025-03-27 16:28:29 -03:00
Luis Claudio R. Goncalves	0857fd208a	sched: Fix stop_one_cpu_nowait() vs hotplug JIRA: https://issues.redhat.com/browse/RHEL-84526 commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07 Author: Peter Zijlstra <peterz@infradead.org> Date: Tue Oct 10 20:57:39 2023 +0200 sched: Fix stop_one_cpu_nowait() vs hotplug Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete(). The race scenario is: CPU0 CPU1 // doing _cpu_down() __set_cpus_allowed_ptr() task_rq_lock(); takedown_cpu() stop_machine_cpuslocked(take_cpu_down..) <PREEMPT: cpu_stopper_thread() MULTI_STOP_PREPARE ... __set_cpus_allowed_ptr_locked() affine_move_task() task_rq_unlock(); <PREEMPT: cpu_stopper_thread()\> ack_state() MULTI_STOP_RUN take_cpu_down() __cpu_disable(); stop_machine_park(); stopper->enabled = false; /> /> stop_one_cpu_nowait(.fn = migration_cpu_stop); if (stopper->enabled) // false!!! That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete. OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be ran under the scheduler locks. Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination: preempt_disable(); task_rq_unlock(...); if (!stop_pending) stop_one_cpu_nowait(...) preempt_enable(); This respects the lock ordering contraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed. Apply this pattern to all similar stop_one_cpu_nowait() invocations. 
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>	2025-03-21 18:50:20 -03:00
Tomas Glozar	2b200783ae	tracing/histogram: Fix semicolon.cocci warnings JIRA: https://issues.redhat.com/browse/RHEL-67679 commit feea69ec121f067073868cebe0cb9d003e64ad80 Author: kernel test robot <lkp@intel.com> Date: Sat Oct 30 08:56:15 2021 +0800 tracing/histogram: Fix semicolon.cocci warnings kernel/trace/trace_events_hist.c:6039:2-3: Unneeded semicolon Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Link: https://lkml.kernel.org/r/20211030005615.GA41257@3074f0d39c61 Fixes: c5eac6ee8bc5 ("tracing/histogram: Simplify handling of .sym-offset in expressions") CC: Kalesh Singh <kaleshsingh@google.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
Tomas Glozar	68f93cbb54	tracing/histogram: Fix check for missing operands in an expression JIRA: https://issues.redhat.com/browse/RHEL-67679 commit 1cab6bce42e62bba2ff2c2370d139618c1828b42 Author: Kalesh Singh <kaleshsingh@google.com> Date: Fri Nov 12 11:13:24 2021 -0800 tracing/histogram: Fix check for missing operands in an expression If a binary operation is detected while parsing an expression string, the operand strings are deduced by splitting the expression string at the position of the detected binary operator. Both operand strings are sub-strings (can be empty string) of the expression string but will never be NULL. Currently a NULL check is used for missing operands, fix this by checking for empty strings instead. Link: https://lkml.kernel.org/r/20211112191324.1302505-1-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Fixes: 9710b2f341a0 ("tracing: Fix operator precedence for hist triggers expression") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
Tomas Glozar	adca273068	tracing/histogram: Optimize division by a power of 2 JIRA: https://issues.redhat.com/browse/RHEL-67679 commit 722eddaa4043acee8f031cf238ced5f7514ad638 Author: Kalesh Singh <kaleshsingh@google.com> Date: Mon Oct 25 13:08:38 2021 -0700 tracing/histogram: Optimize division by a power of 2 The division is a slow operation. If the divisor is a power of 2, use a shift instead. Results were obtained using Android's version of perf (simpleperf[1]) as described below: 1. hist_field_div() is modified to call 2 test functions: test_hist_field_div_[not]_optimized(); passing them the same args. Use noinline and volatile to ensure these are not optimized out by the compiler. 2. Create a hist event trigger that uses division: events/kmem/rss_stat$ echo 'hist:keys=common_pid:x=size/<divisor>' >> trigger events/kmem/rss_stat$ echo 'hist:keys=common_pid:vals=$x' >> trigger 3. Run Android's lmkd_test[2] to generate rss_stat events, and record CPU samples with Android's simpleperf: simpleperf record -a --exclude-perf --post-unwind=yes -m 16384 -g -f 2000 -o perf.data == Results == Divisor is a power of 2 (divisor == 32): test_hist_field_div_not_optimized | 8,717,091 cpu-cycles test_hist_field_div_optimized | 1,643,137 cpu-cycles If the divisor is a power of 2, the optimized version is ~5.3x faster. Divisor is not a power of 2 (divisor == 33): test_hist_field_div_not_optimized | 4,444,324 cpu-cycles test_hist_field_div_optimized | 5,497,958 cpu-cycles If the divisor is not a power of 2, as expected, the optimized version is slightly slower (~24% slower).
[1] https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md [2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp Link: https://lkml.kernel.org/r/20211025200852.3002369-7-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
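The shift-for-division idea from commit adca273068 can be sketched as follows. This is an illustration only, not the kernel's `hist_field_div()`: the function name `div_sketch` is hypothetical, values are modeled as unsigned 64-bit integers, and the division-by-zero result of `(u64)(-1)` is taken from the "Add division and multiplication support for hist triggers" commit message in this log.

```python
U64_MAX = (1 << 64) - 1  # models (u64)(-1)


def is_power_of_two(n):
    """True when exactly one bit is set in n."""
    return n > 0 and (n & (n - 1)) == 0


def div_sketch(val, divisor):
    """Divide val by divisor, using a right shift when divisor is 2**k."""
    if divisor == 0:
        # Undefined case: the hist trigger expression evaluates to (u64)(-1).
        return U64_MAX
    if is_power_of_two(divisor):
        # For divisor == 2**k, val >> k equals val // divisor.
        shift = divisor.bit_length() - 1
        return val >> shift
    return val // divisor


if __name__ == "__main__":
    print(div_sketch(1000, 32), div_sketch(1000, 33), div_sketch(1000, 0))
```

The extra power-of-2 check on every call is consistent with the measurements above: it pays off when the divisor is a power of 2 and adds a small cost otherwise.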
Tomas Glozar	f234ceb346	tracing/histogram: Covert expr to const if both operands are constants JIRA: https://issues.redhat.com/browse/RHEL-67679 commit f47716b7a955e40e2591b960d1eccb1fde967a70 Author: Kalesh Singh <kaleshsingh@google.com> Date: Mon Oct 25 13:08:37 2021 -0700 tracing/histogram: Covert expr to const if both operands are constants If both operands of a hist trigger expression are constants, convert the expression to a constant. This optimization avoids having to perform the same calculation multiple times and also saves on memory since the merged constants are represented by a single struct hist_field instead of multiple. Link: https://lkml.kernel.org/r/20211025200852.3002369-6-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
Tomas Glozar	65b7fbbb4e	tracing/histogram: Simplify handling of .sym-offset in expressions JIRA: https://issues.redhat.com/browse/RHEL-67679 commit c5eac6ee8bc5d32e48b3845472b547574061f49f Author: Kalesh Singh <kaleshsingh@google.com> Date: Mon Oct 25 13:08:36 2021 -0700 tracing/histogram: Simplify handling of .sym-offset in expressions The '-' in .sym-offset can confuse the hist trigger arithmetic expression parsing. Simplify the handling of this by replacing the 'sym-offset' with 'symXoffset'. This allows us to correctly evaluate expressions where the user may have inadvertently added a .sym-offset modifier to one of the operands in an expression, instead of bailing out. In this case the .sym-offset has no effect on the evaluation of the expression. The only valid use of the .sym-offset is as a hist key modifier. Link: https://lkml.kernel.org/r/20211025200852.3002369-5-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
Tomas Glozar	c61b871f9d	tracing: Fix operator precedence for hist triggers expression JIRA: https://issues.redhat.com/browse/RHEL-67679 commit 9710b2f341a0d96f35b911580639853cfda4677d Author: Kalesh Singh <kaleshsingh@google.com> Date: Mon Oct 25 13:08:35 2021 -0700 tracing: Fix operator precedence for hist triggers expression The current histogram expression evaluation logic evaluates the expression from right to left. This can lead to incorrect results if the operations are not associative (as is the case for subtraction and, the now added, division operators). e.g. 16-8-4-2 should be 2 not 10 --> 16-8-4-2 = ((16-8)-4)-2 64/8/4/2 should be 1 not 16 --> 64/8/4/2 = ((64/8)/4)/2 Division and multiplication are currently limited to single operation expression due to operator precedence support not yet implemented. Rework the expression parsing to support the correct evaluation of expressions containing operators of different precedences; and fix the associativity error by evaluating expressions with operators of the same precedence from left to right. Examples: (1) echo 'hist:keys=common_pid:a=8,b=4,c=2,d=1,w=$a-$b-$c-$d' \ >> event/trigger (2) echo 'hist:keys=common_pid:x=$a/$b/3/2' >> event/trigger (3) echo 'hist:keys=common_pid:y=$a+10/$c*1024' >> event/trigger (4) echo 'hist:keys=common_pid:z=$a/$b+$c*$d' >> event/trigger Link: https://lkml.kernel.org/r/20211025200852.3002369-4-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
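The associativity problem described in commit c61b871f9d can be reproduced with a small sketch. It is not the kernel parser: `evaluate`, `evaluate_right_to_left`, and the `PRECEDENCE` table are hypothetical names, only non-negative integer literals and the four binary operators are handled, and precedence climbing is used here simply as one common way to get left-to-right evaluation with operator precedence.

```python
import re

# Higher number binds tighter; operators of equal precedence associate left.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def tokenize(expr):
    """Split an expression such as '16-8-4-2' into numbers and operators."""
    return re.findall(r"\d+|[-+*/]", expr)


def apply_op(op, lhs, rhs):
    if op == "+":
        return lhs + rhs
    if op == "-":
        return lhs - rhs
    if op == "*":
        return lhs * rhs
    return lhs // rhs  # integer division; a zero divisor raises in this sketch


def evaluate(expr):
    """Evaluate with precedence, left to right among equal-precedence operators."""
    tokens = tokenize(expr)
    pos = 0

    def parse(min_prec):
        nonlocal pos
        lhs = int(tokens[pos])
        pos += 1
        while pos < len(tokens) and PRECEDENCE[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            # Requiring strictly higher precedence on the right gives left associativity.
            rhs = parse(PRECEDENCE[op] + 1)
            lhs = apply_op(op, lhs, rhs)
        return lhs

    return parse(1)


def evaluate_right_to_left(expr):
    """Model of the old behavior: fold the expression from the right, ignoring precedence."""
    tokens = tokenize(expr)
    value = int(tokens[-1])
    for i in range(len(tokens) - 2, 0, -2):
        value = apply_op(tokens[i], int(tokens[i - 1]), value)
    return value


if __name__ == "__main__":
    for e in ("16-8-4-2", "64/8/4/2"):
        print(e, evaluate(e), evaluate_right_to_left(e))
```

With the two expressions from the commit message, the left-to-right evaluator yields 2 and 1, while the right-to-left model yields 10 and 16.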
Tomas Glozar	41bc755fde	tracing: Add division and multiplication support for hist triggers JIRA: https://issues.redhat.com/browse/RHEL-67679 commit bcef044150320217e2a00c65050114e509c222b8 Author: Kalesh Singh <kaleshsingh@google.com> Date: Mon Oct 25 13:08:34 2021 -0700 tracing: Add division and multiplication support for hist triggers Adds basic support for division and multiplication operations for hist trigger variable expressions. For simplicity this patch only supports, division and multiplication for a single operation expression (e.g. x=$a/$b), as currently expressions are always evaluated right to left. This can lead to some incorrect results: e.g. echo 'hist:keys=common_pid:x=8-4-2' >> event/trigger 8-4-2 should evaluate to 2 i.e. (8-4)-2 but currently x evaluate to 6 i.e. 8-(4-2) Multiplication and division in sub-expressions will work correctly, once correct operator precedence support is added (See next patch in this series). For the undefined case of division by 0, the histogram expression evaluates to (u64)(-1). Since this cannot be detected when the expression is created, it is the responsibility of the user to be aware and account for this possibility. Examples: echo 'hist:keys=common_pid:a=8,b=4,x=$a/$b' \ >> event/trigger echo 'hist:keys=common_pid:y=5*$b' \ >> event/trigger Link: https://lkml.kernel.org/r/20211025200852.3002369-3-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
Tomas Glozar	1f1a764877	tracing: Add support for creating hist trigger variables from literal JIRA: https://issues.redhat.com/browse/RHEL-67679 commit 52cfb373536a7fb744b0ec4b748518e5dc874fb7 Author: Kalesh Singh <kaleshsingh@google.com> Date: Mon Oct 25 13:08:33 2021 -0700 tracing: Add support for creating hist trigger variables from literal Currently hist trigger expressions don't support the use of numeric literals: e.g. echo 'hist:keys=common_pid:x=$y-1234' --> is not valid expression syntax Having the ability to use numeric constants in hist triggers supports a wider range of expressions for creating variables. Add support for creating trace event histogram variables from numeric literals. e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger A negative numeric constant is created, using unary minus operator (parentheses are required). e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger Constants can be used with division/multiplication (added in the next patch in this series) to implement granularity filters for frequent trace events. For instance we can limit emitting the rss_stat trace event to when there is a 512KB cross over in the rss size: # Create a synthetic event to monitor instead of the high frequency # rss_stat event echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr; int member; long size' >> tracing/synthetic_events # Create a hist trigger that emits the synthetic rss_stat_throttled # event only when the rss size crosses a 512KB boundary. echo 'hist:keys=keys=mm_id,member:bucket=size/0x80000:onchange($bucket) .rss_stat_throttled(mm_id,curr,member,size)' >> events/kmem/rss_stat/trigger A use case for using constants with addition/subtraction is not yet known, but for completeness the use of constants are supported for all operators. 
Link: https://lkml.kernel.org/r/20211025200852.3002369-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00
Tomas Glozar	ffbbdfa334	tracing: Have histogram types be constant when possible JIRA: https://issues.redhat.com/browse/RHEL-67679 commit 3347d80baa41c357cf263923f60aa8051a753d76 Author: Steven Rostedt (VMware) <rostedt@goodmis.org> Date: Thu Jul 22 10:27:06 2021 -0400 tracing: Have histogram types be constant when possible Instead of kstrdup("const", GFP_KERNEL), have the hist_field type simply assign the constant hist_field->type = "const"; And when the value passed to it is a variable, use "kstrdup_const(var, GFP_KERNEL);" which will just copy the value if the variable is already a constant. This saves on having to allocate when not needed. All frees of the hist_field->type will need to use kfree_const(). Link: https://lkml.kernel.org/r/20210722142837.280718447@goodmis.org Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tomas Glozar <tglozar@redhat.com>	2025-03-21 08:09:00 +01:00

42911 Commits