JIRA: https://issues.redhat.com/browse/RHEL-80382
commit af000ce85293b8e608f696f0c6c280bc3a75887f
Author: Michal Koutný <mkoutny@suse.com>
Date: Mon Sep 9 18:32:23 2024 +0200
cgroup: Do not report unavailable v1 controllers in /proc/cgroups
This is a followup to CONFIG-urability of cpuset and memory controllers
for v1 hierarchies. Make the output in /proc/cgroups reflect that
!CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and
!CONFIG_MEMCG_V1 is like !CONFIG_MEMCG.
The intended effect is that hiding the unavailable controllers will hint
users not to try mounting them on v1.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-80382
commit 3c41382e920f1dd5c9f432948fe799c07af1cced
Author: Michal Koutný <mkoutny@suse.com>
Date: Mon Sep 9 18:32:22 2024 +0200
cgroup: Disallow mounting v1 hierarchies without controller implementation
The configs that disable some v1 controllers would still allow mounting
them but with no controller-specific files. (Making such hierarchies
equivalent to named v1 hierarchies.) To achieve behavior consistent with
actual out-compilation of a whole controller, the mounts should treat
respective controllers as non-existent.
Wrap implementation into a helper function, leverage legacy_files to
detect compiled out controllers. The effect is that mounts on v1 would
fail and produce a message like:
[ 1543.999081] cgroup: Unknown subsys name 'memory'
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-80382
commit 3cc4e13bb1617f6a13e5e6882465984148743cf4
Author: Xiu Jianfeng <xiujianfeng@huawei.com>
Date: Sat Oct 12 07:22:46 2024 +0000
cgroup: Fix potential overflow issue when checking max_depth
cgroup.max.depth is the maximum allowed descent depth below the current
cgroup. If the actual descent depth is equal or larger, an attempt to
create a new child cgroup will fail. However due to the cgroup->max_depth
is of int type and having the default value INT_MAX, the condition
'level > cgroup->max_depth' will never be satisfied, and it will cause
an overflow of the level after it reaches to INT_MAX.
Fix it by starting the level from 0 and using '>=' instead.
It's worth mentioning that this issue is unlikely to occur in reality,
as it's impossible to have a depth of INT_MAX hierarchy, but should be
be avoided logically.
Fixes: 1a926e0bba ("cgroup: implement hierarchy limits")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-80382
commit 99570300d3b4c8a1463491754d58e7a8d87cacef
Author: Waiman Long <longman@redhat.com>
Date: Sun Aug 4 21:30:18 2024 -0400
cgroup/cpuset: Check for partition roots with overlapping CPUs
With the previous commit that eliminates the overlapping partition
root corner cases in the hotplug code, the partition roots passed down
to generate_sched_domains() should not have overlapping CPUs. Enable
overlapping cpuset check for v2 and warn if that happens.
This patch also has the benefit of increasing test coverage of the new
Union-Find cpuset merging code to cgroup v2.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-80382
commit 0e40cf2a8b2c847950e025d5aa594bd545118d26
Author: Kinsey Ho <kinseyho@google.com>
Date: Thu Sep 5 00:30:50 2024 +0000
cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
Patch series "Improve mem_cgroup_iter()", v4.
Incremental cgroup iteration is being used again [1]. This patchset
improves the reliability of mem_cgroup_iter(). It also improves
simplicity and code readability.
[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/
This patch (of 5):
Explicitly document that css sibling/descendant linkage is protected by
cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and
similar functions that it isn't necessary to hold a ref on @pos.
The following changes in this patchset rely on this clarification for
simplification in memcg iteration code.
Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com
Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-80382
commit c149c4a48b19afbf0c383614e57b452d39b154de
Author: Xiu Jianfeng <xiujianfeng@huawei.com>
Date: Sat Jul 13 08:59:16 2024 +0000
cgroup/cpuset: Remove cpuset_slab_spread_rotor
Since the SLAB implementation was removed in v6.8, so the
cpuset_slab_spread_rotor is no longer used and can be removed.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-80382
commit d1a92d2d6c5dbeba9a87bfb57fa0142cdae7b206
Author: Chen Ridong <chenridong@huawei.com>
Date: Thu Aug 15 13:14:08 2024 +0000
cgroup: update some statememt about delegation
The comment in cgroup_file_write is missing some interfaces, such as
'cgroup.threads'. All delegatable files are listed in
'/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write.
Besides, add a statement that files outside the namespace shouldn't be
visible from inside the delegated namespace.
tj: Reflowed text for consistency.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6650
JIRA: https://issues.redhat.com/browse/RHEL-78989
JIRA: https://issues.redhat.com/browse/RHEL-80529
JIRA: https://issues.redhat.com/browse/RHEL-83249
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2025-21691
CVE: CVE-2025-21696
CVE: CVE-2025-21861
Proactively backport a set of selected follow-up Fixes for the
MM patches previously backported into RHEL-9 minor releases.
Dependencies and follow-up fixes for the selected commits
are also selectively backported.
Omitted-fix: e080a26725fb ("erofs: allow large folios for compressed files")
Omitted-fix: 3488af097044 ("mm/damon/core: handle zero {aggregation,ops_update} intervals")
Omitted-fix: 5e06ad590096 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero")
Omitted-fix: 25e8acbcf19c ("mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero")
Omitted-fix: 1390a3334a48 ("mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio")
Omitted-fix: 7ddeb91f5b03 ("mm: kmemleak: add support for dumping physical and __percpu object info")
Signed-off-by: Rafael Aquini <raquini@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Čestmír Kalina <ckalina@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6623
JIRA: https://issues.redhat.com/browse/RHEL-84526
Sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test
have been reported and fixed upstream – Notably affine_move_task()
remains stuck in wait_for_completion(), leading to a hung-task detector
warning.
Both C10S and RHEL-10 already carry the fix from this upstream commit:
f0498d2a54e79 sched: Fix stop_one_cpu_nowait() vs hotplug
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Gabriele Monaco <gmonaco@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6722
JIRA: https://issues.redhat.com/browse/RHEL-83455
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6722
The Jira ticket reports a loss of isolated CPUs when isolated partitions
are being created, i.e. Some of the isolated CPUs are missing and are
not in any of the existing cpusets. We are not able to reproduce this
problem in-house, but detailed analysis of the cpuset partition code
does reveal issues that need to be fixed.
This MR incorporates the latest cpuset fixes that were merged upstream
and hopefully will be able to address the issues seen by the customers.
To reduce conflicts and other complications, some other recent cpuset
commits are also included as well.
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Radostin Stoyanov <rstoyano@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6686
# Merge Request Required Information
JIRA: https://issues.redhat.com/browse/RHEL-77358
JIRA: https://issues.redhat.com/browse/RHEL-86051
## Summary of Changes
Two upstream patchsets are contained in this merge request:
* Collect timerlat samples using a BPF program instead of pulling them through a tracefs pipe. This helps with both performance and CPU usage, and fixes an issue where on systems with \>100 CPUs, rtla cannot keep up with timerlat samples and drops most of them, making it useless.
* Always set default values of all tracer options (osnoise or timerlat) if not specified otherwise, as they might be set to unexpected values either by another user of the tracers or by a previous abnormally exited run of rtla.
Those are combined into a single MR, because the latter depends on a refactoring done in the former. A dependency (rtla test suite) is also pulled.
## Approved Development Ticket(s)
All submissions to CentOS Stream must reference a ticket in [Red Hat Jira](https://issues.redhat.com/).
<details><summary>Click for formatting instructions</summary>
Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
List tickets each on their own line of this description using the format "Resolves: RHEL-76229", "Related: RHEL-76229" or "Reverts: RHEL-76229", as appropriate.
</details>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: John Kacur <jkacur@redhat.com>
Approved-by: Gabriele Monaco <gmonaco@redhat.com>
Approved-by: Derek Barbosa <debarbos@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6667
JIRA: https://issues.redhat.com/browse/RHEL-78249
commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0
Author: Eric Sandeen <sandeen@redhat.com>
Date: Thu Feb 27 11:41:08 2025 -0600
watch_queue: fix pipe accounting mismatch
Currently, watch_queue_set_size() modifies the pipe buffers charged to
user->pipe_bufs without updating the pipe->nr_accounted on the pipe
itself, due to the if (!pipe_has_watch_queue()) test in
pipe_resize_ring(). This means that when the pipe is ultimately freed,
we decrement user->pipe_bufs by something other than what than we had
charged to it, potentially leading to an underflow. This in turn can
cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM.
To remedy this, explicitly account for the pipe usage in
watch_queue_set_size() to match the number set via account_pipe_buffers()
(It's unclear why watch_queue_set_size() does not update nr_accounted;
it may be due to intentional overprovisioning in watch_queue_set_size()?)
Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: David Howells <dhowells@redhat.com>
Approved-by: Pavel Reichl <preichl@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2025-21709
Conflicts:
* kernel/events/uprobes.c: a notable context difference in the 1st hunk
due to RHEL-9 missing the following upstream commits: 87195a1ee332a,
2bf8e5aceff89, and dd1a7567784e2; and a notable contex difference in
the 2nd hunk due to RHEL-9 missing the following upstream commits:
84455e6923c7 and 8617408f7a01. None of the aforelisted commits are
of any relevance for this backport work.
This patch is a backport of the following upstream commit:
commit 64c37e134b120fb462fb4a80694bfb8e7be77b14
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date: Mon Jan 27 12:02:21 2025 -0500
kernel: be more careful about dup_mmap() failures and uprobe registering
If a memory allocation fails during dup_mmap(), the maple tree can be left
in an unsafe state for other iterators besides the exit path. All the
locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
incomplete mm_struct can be reached through (at least) the rmap finding
the vmas which have a pointer back to the mm_struct.
Up to this point, there have been no issues with being able to find an
mm_struct that was only partially initialised. Syzbot was able to make
the incomplete mm_struct fail with recent forking changes, so it has been
proven unsafe to use the mm_struct that hasn't been initialised, as
referenced in the link below.
Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
invalid mm") fixed the uprobe access, it does not completely remove the
race.
This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
oom side (even though this is extremely unlikely to be selected as an oom
victim in the race window), and sets MMF_UNSTABLE to avoid other potential
users from using a partially initialised mm_struct.
When registering vmas for uprobe, skip the vmas in an mm that is marked
unstable. Modifying a vma in an unstable mm may cause issues if the mm
isn't fully initialised.
Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
* kernel/fork.c: minor difference from upstream due to an extra blank line
that was left behind when commit d24062914837 ("fork: use __mt_dup()
to duplicate maple tree in dup_mmap()") was backported into RHEL-9
This patch is a backport of the following upstream commit:
commit 8ac662f5da19f5873fdd94c48a5cdb45b2e1b58f
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue Dec 10 17:24:12 2024 +0000
fork: avoid inappropriate uprobe access to invalid mm
If dup_mmap() encounters an issue, currently uprobe is able to access the
relevant mm via the reverse mapping (in build_map_info()), and if we are
very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
we establish as part of the fork error path.
This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
invoking vma_iter_load() which uses the advanced maple tree API and thus
is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()").
This change was made on the assumption that only process tear-down code
would actually observe (and make use of) these values. However this very
unlikely but still possible edge case with uprobes exists and
unfortunately does make these observable.
The uprobe operation prevents races against the dup_mmap() operation via
the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
and dropped via uprobe_end_dup_mmap(), and held across
register_for_each_vma() prior to invoking build_map_info() which does the
reverse mapping lookup.
Currently these are acquired and dropped within dup_mmap(), which exposes
the race window prior to error handling in the invoking dup_mm() which
tears down the mm.
We can avoid all this by just moving the invocation of
uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
and only release this lock once the dup_mmap() operation succeeds or clean
up is done.
This means that the uprobe code can never observe an incompletely
constructed mm and resolves the issue in this case.
Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
commit 86888c7bd117c29eab169c37e5f6bbbf583da983
Author: Waiman Long <longman@redhat.com>
Date: Mon, 7 Apr 2025 17:21:05 -0400
cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
Add WARN_ON_ONCE() statements whenever new exclusive CPUs are being
added to a partition root to catch inconsistency in the way exclusive
CPUs are being handled in the cpuset code.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: Some minor context diffs due to missing upstream commit
381b53c3b549 ("cgroup/cpuset: rename functions shared
between v1 and v2") and commit 2ff899e35164 ("sched/deadline:
Rebuild root domain accounting after every update").
commit f0a0bd3d23a44a2c5f628e8ca8ad882498ca5aae
Author: Waiman Long <longman@redhat.com>
Date: Sun, 30 Mar 2025 17:52:44 -0400
cgroup/cpuset: Code cleanup and comment update
Rename partition_xcpus_newstate() to isolated_cpus_update(),
update_partition_exclusive() to update_partition_exclusive_flag() and
the new_xcpus_state variable to isolcpus_updated to make their meanings
more explicit. Also add some comments to further clarify the code.
No functional change is expected.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit f62a5d39368e34a966c8df63e1f05eed7fe9c5de
Author: Waiman Long <longman@redhat.com>
Date: Sun, 30 Mar 2025 17:52:42 -0400
cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition
Currently, changes in exclusive CPUs are being handled in
remote_partition_check() by disabling conflicting remote partitions.
However, that may lead to results unexpected by the users. Fix
this problem by removing remote_partition_check() and making
update_cpumasks_hier() handle changes in descendant remote partitions
properly.
The compute_effective_exclusive_cpumask() function is enhanced to check
the exclusive_cpus and effective_xcpus from siblings and excluded them
in its effective exclusive CPUs computation and return a value to show if
there is any sibling conflicts. This is somewhat like the cpu_exclusive
flag check in validate_change(). This is the initial step to enable us
to retire the use of cpu_exclusive flag in cgroup v2 in the future.
One of the tests in the TEST_MATRIX of the test_cpuset_prs.sh
script has to be updated due to changes in the way a child remote
partition root is being handled (updated instead of invalidation)
in update_cpumasks_hier().
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit 8bf450f3aec3d1bbd725d179502c64b8992588e4
Author: Waiman Long <longman@redhat.com>
Date: Sun, 30 Mar 2025 17:52:41 -0400
cgroup/cpuset: Fix error handling in remote_partition_disable()
When remote_partition_disable() is called to disable a remote partition,
it always sets the partition to an invalid partition state. It should
only do so if an error code (prs_err) has been set. Correct that and
add proper error code in places where remote_partition_disable() is
called due to error.
Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit 668e041662e92ab3ebcb9eb606d3ec01884546ab
Author: Waiman Long <longman@redhat.com>
Date: Sun, 30 Mar 2025 17:52:40 -0400
cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()
Before commit f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to
partition & cpus changes"), a cpuset partition cannot be enabled if not
all the requested CPUs can be granted from the parent cpuset. After
that commit, a cpuset partition can be created even if the requested
exclusive CPUs contain CPUs not allowed its parent. The delmask
containing exclusive CPUs to be removed from its parent wasn't
adjusted accordingly.
That is not a problem until the introduction of a new isolated_cpus
mask in commit 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in
isolated partitions") as the CPUs in the delmask may be added directly
into isolated_cpus.
As a result, isolated_cpus may incorrectly contain CPUs that are not
isolated leading to incorrect data reporting. Fix this by adjusting
the delmask to reflect the actual exclusive CPUs for the creation of
the partition.
Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: A merge conflict in the cpuset_css_offline() hunk due to
missing upstream commit c4c9cebe2fb9 ("cgroup/cpuset:
Further optimize code if CONFIG_CPUSETS_V1 not set").
commit a22b3d54de94f82ca057cc2ebf9496fa91ebf698
Author: Waiman Long <longman@redhat.com>
Date: Sun, 30 Mar 2025 17:52:39 -0400
cgroup/cpuset: Fix race between newly created partition and dying one
There is a possible race between removing a cgroup diectory that is
a partition root and the creation of a new partition. The partition
to be removed can be dying but still online, it doesn't not currently
participate in checking for exclusive CPUs conflict, but the exclusive
CPUs are still there in subpartitions_cpus and isolated_cpus. These
two cpumasks are global states that affect the operation of cpuset
partitions. The exclusive CPUs in dying cpusets will only be removed
when cpuset_css_offline() function is called after an RCU delay.
As a result, it is possible that a new partition can be created with
exclusive CPUs that overlap with those of a dying one. When that dying
partition is finally offlined, it removes those overlapping exclusive
CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an
incorrect CPU configuration.
This bug was found when a warning was triggered in
remote_partition_disable() during testing because the subpartitions_cpus
mask was empty.
One possible way to fix this is to iterate the dying cpusets as well and
avoid using the exclusive CPUs in those dying cpusets. However, this
can still cause random partition creation failures or other anomalies
due to racing. A better way to fix this race is to reset the partition
state at the moment when a cpuset is being killed.
Introduce a new css_killed() CSS function pointer and call it, if
defined, before setting CSS_DYING flag in kill_css(). Also update the
css_is_dying() helper to use the CSS_DYING flag introduced by commit
33c35aa481 ("cgroup: Prevent kill_css() from being called more than
once") for proper synchronization.
Add a new cpuset_css_killed() function to reset the partition state of
a valid partition root if it is being killed.
Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit 9b496a8bbed9cc292b0dfd796f38ec58b6d0375f
Author: Waiman Long <longman@redhat.com>
Date: Thu, 5 Dec 2024 14:51:01 -0500
cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains
Isolated CPUs are not allowed to be used in a non-isolated partition.
The only exception is the top cpuset which is allowed to contain boot
time isolated CPUs.
Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation
problem") introduces a simplified scheme of including only partition
roots in sched domain generation. However, it does not properly account
for this exception case. This can result in leakage of isolated CPUs
into a sched domain.
Fix it by making sure that isolated CPUs are excluded from the top
cpuset before generating sched domains.
Also update the way the boot time isolated CPUs are handled in
test_cpuset_prs.sh to make sure that those isolated CPUs are really
isolated instead of just skipping them in the tests.
Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: A context diff in the cpuset_update_flag() hunk due to
missing upstream commit 381b53c3b549 ("cgroup/cpuset: rename
functions shared between v1 and v2") and the removal of
IS_ENABLED(CONFIG_CPUSETS_V1) check.
commit a040c351283e3ac75422621ea205b1d8d687e108
Author: Waiman Long <longman@redhat.com>
Date: Sat, 9 Nov 2024 21:50:22 -0500
cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation
Since commit ff0ce721ec21 ("cgroup/cpuset: Eliminate unncessary
sched domains rebuilds in hotplug"), there is only one
rebuild_sched_domains_locked() call per hotplug operation. However,
writing to the various cpuset control files may still casue more than
one rebuild_sched_domains_locked() call to happen in some cases.
Juri had found that two rebuild_sched_domains_locked() calls in
update_prstate(), one from update_cpumasks_hier() and another one from
update_partition_sd_lb() could cause cpuset partition to be created
with null total_bw for DL tasks. IOW, DL tasks may not be scheduled
correctly in such a partition.
A sample command sequence that can reproduce null total_bw is as
follows.
# echo Y >/sys/kernel/debug/sched/verbose
# echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
# mkdir /sys/fs/cgroup/test
# echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus
# echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive
# echo root >/sys/fs/cgroup/test/cpuset.cpus.partition
Fix this double rebuild_sched_domains_locked() calls problem
by replacing existing calls with cpuset_force_rebuild() except
the rebuild_sched_domains_cpuslocked() call at the end of
cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is
now done at the end of cpuset_write_resmask() and update_prstate()
to determine if rebuild_sched_domains_locked() should be called or not.
The cpuset v1 code can still call rebuild_sched_domains_locked()
directly as double rebuild_sched_domains_locked() calls is not possible.
Reported-by: Juri Lelli <juri.lelli@redhat.com>
Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit bcd7012afd7bcd45fcd7a0e2f48e57b273702317
Author: Waiman Long <longman@redhat.com>
Date: Sat, 9 Nov 2024 21:50:21 -0500
cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"
Revert commit 3ae0b773211e ("cgroup/cpuset: Allow suppression of sched
domain rebuild in update_cpumasks_hier()") to allow for an alternative
way to suppress unnecessary rebuild_sched_domains_locked() calls in
update_cpumasks_hier() and elsewhere in a following commit.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit c188f33c864e3dba49a1ad0dc9fddf2f49ac42ae
Author: Waiman Long <longman@redhat.com>
Date: Tue, 20 Aug 2024 15:55:35 -0400
cgroup/cpuset: Account for boot time isolated CPUs
With the "isolcpus" boot command line parameter, we are able to
create isolated CPUs at boot time. These isolated CPUs aren't fully
accounted for in the cpuset code. For instance, the root cgroup's
"cpuset.cpus.isolated" control file does not include the boot time
isolated CPUs. Fix that by looking for pre-isolated CPUs at init time.
The prstate_housekeeping_conflict() function does check the
HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it
can only be used in isolated partition. Given the fact that we are going
to make housekeeping cpumasks dynamic, the current check may not be right
anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against
it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit 3c2acae88844e7423a50b5cbe0a2c9d430fcd20c
Author: Chen Ridong <chenridong@huawei.com>
Date: Tue, 20 Aug 2024 03:01:26 +0000
cgroup/cpuset: remove use_parent_ecpus of cpuset
use_parent_ecpus is used to track whether the children are using the
parent's effective_cpus. When a parent's effective_cpus is changed
due to changes in a child partition's effective_xcpus, any child
using parent'effective_cpus must call update_cpumasks_hier. However,
if a child is not a valid partition, it is sufficient to determine
whether to call update_cpumasks_hier based on whether the child's
effective_cpus is going to change. To make the code more succinct,
it is suggested to remove use_parent_ecpus.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit 9414f68d454529ff7e68f0c2aefe0a007060c66a
Author: Chen Ridong <chenridong@huawei.com>
Date: Tue, 20 Aug 2024 03:01:25 +0000
cgroup/cpuset: remove fetch_xcpus
Both fetch_xcpus and user_xcpus functions are used to retrieve the value
of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the
implicit value used as exclusive in a local partition. I can not imagine
a scenario where effective_xcpus is not empty when exclusive_cpus is
empty. Therefore, I suggest removing the fetch_xcpus function.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit d6326047576266991d88639e1e9739a9a9a20ef4
Author: Chen Ridong <chenridong@huawei.com>
Date: Wed, 24 Jul 2024 10:24:18 +0000
cgroup/cpuset: remove child_ecpus_count
The child_ecpus_count variable was previously used to update
sibling cpumask when parent's effective_cpus is updated. However, it became
obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add
cpuset.cpus.exclusive for v2"). It should be removed.
tj: Restored {} for style consistency.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83455
commit 8a895c2e6a7ed264a1b917616db205ed934e8306
Author: Xavier <xavier_qy@163.com>
Date: Thu, 4 Jul 2024 14:24:44 +0800
cpuset: use Union-Find to optimize the merging of cpumasks
The process of constructing scheduling domains
involves multiple loops and repeated evaluations, leading to numerous
redundant and ineffective assessments that impact code efficiency.
Here, we use union-find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.
Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6405
JIRA: https://issues.redhat.com/browse/RHEL-23495
Update /proc/schedstat with fixes and improved information
from upstream. AMD requested these and they don't carry
a large risk.
Signed-off-by: Phil Auld <pauld@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-77358
commit a065bbf776d32a71e748bd948861e6deca803d78
Author: Tomas Glozar <tglozar@redhat.com>
Date: Mon Feb 3 10:04:18 2025 +0100
trace/osnoise: Add trace events for samples
Add trace events that fire at osnoise and timerlat sample generation, in
addition to the already existing noise and threshold events.
This allows processing the samples directly in the kernel, either with
ftrace triggers or with BPF.
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-85303
Add "stack_order" sysfs attribute which holds the order in which a live
patch module was loaded into the system. A user can then determine an
active live patched version of a function.
cat /sys/kernel/livepatch/livepatch_1/stack_order -> 1
means that livepatch_1 is the first live patch applied
cat /sys/kernel/livepatch/livepatch_module/stack_order -> N
means that livepatch_module is the Nth live patch applied
Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Wardenjohn <zhangwarden@gmail.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20241008014856.3729-2-zhangwarden@gmail.com
[pmladek@suse.com: Updated kernel version and date in the ABI documentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
(cherry picked from commit 3dae09de406167123449d9ece1f51855d5bac01a)
Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-85303
Based on the test results of kallsyms_on_each_match_symbol() and
kallsyms_on_each_symbol(), the average performance can be improved by
more than 1500 times.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
(cherry picked from commit 9cb37357dfce1b596041ad68a20407c8b4e76635)
Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-78249
commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0
Author: Eric Sandeen <sandeen@redhat.com>
Date: Thu Feb 27 11:41:08 2025 -0600
watch_queue: fix pipe accounting mismatch
Currently, watch_queue_set_size() modifies the pipe buffers charged to
user->pipe_bufs without updating the pipe->nr_accounted on the pipe
itself, due to the if (!pipe_has_watch_queue()) test in
pipe_resize_ring(). This means that when the pipe is ultimately freed,
we decrement user->pipe_bufs by something other than what than we had
charged to it, potentially leading to an underflow. This in turn can
cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM.
To remedy this, explicitly account for the pipe usage in
watch_queue_set_size() to match the number set via account_pipe_buffers()
(It's unclear why watch_queue_set_size() does not update nr_accounted;
it may be due to intentional overprovisioning in watch_queue_set_size()?)
Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6586
# Merge Request Required Information
JIRA: https://issues.redhat.com/browse/RHEL-67679
## Summary of Changes
Backport support for division and multiplication in histogram triggers as well as fixes and optimizations for it. Support for creating hist trigger variables from literal is added, too, as it is a dependency of one of the fixes and is documented together with the former.
## Approved Development Ticket(s)
All submissions to CentOS Stream must reference a ticket in [Red Hat Jira](https://issues.redhat.com/).
<details><summary>Click for formatting instructions</summary>
Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.
List tickets each on their own line of this description using the format "Resolves: RHEL-76229", "Related: RHEL-76229" or "Reverts: RHEL-76229", as appropriate.
</details>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Approved-by: Joe Lawrence <joe.lawrence@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6423
JIRA: https://issues.redhat.com/browse/RHEL-79933
commit db5fd3cf8bf41b84b577b8ad5234ea95f327c9be
Author: Muhammad Adeel <Muhammad.Adeel@ibm.com>
Date: Fri, 7 Feb 2025 14:24:32 +0000
cgroup: Remove steal time from usage_usec
The CPU usage time is the time when user, system or both are using the CPU.
Steal time is the time when CPU is waiting to be run by the Hypervisor. It
should not be added to the CPU usage time, hence removing it from the
usage_usec entry.
Fixes: 936f2a70f2 ("cgroup: add cpu.stat file to root cgroup")
Acked-by: Axel Busch <axel.busch@ibm.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Approved-by: Thomas Huth <thuth@redhat.com>
Approved-by: Radostin Stoyanov <rstoyano@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Augusto Caringi <acaringi@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-84526
commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue Oct 10 20:57:39 2023 +0200
sched: Fix stop_one_cpu_nowait() vs hotplug
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
hotplug stress-test -- notably affine_move_task() remains stuck in
wait_for_completion(), leading to a hung-task detector warning.
Specifically, it was reported that stop_one_cpu_nowait(.fn =
migration_cpu_stop) returns false -- this stopper is responsible for
the matching complete().
The race scenario is:
CPU0 CPU1
// doing _cpu_down()
__set_cpus_allowed_ptr()
task_rq_lock();
takedown_cpu()
stop_machine_cpuslocked(take_cpu_down..)
<PREEMPT: cpu_stopper_thread()
MULTI_STOP_PREPARE
...
__set_cpus_allowed_ptr_locked()
affine_move_task()
task_rq_unlock();
<PREEMPT: cpu_stopper_thread()\>
ack_state()
MULTI_STOP_RUN
take_cpu_down()
__cpu_disable();
stop_machine_park();
stopper->enabled = false;
/>
/>
stop_one_cpu_nowait(.fn = migration_cpu_stop);
if (stopper->enabled) // false!!!
That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
stopper thread gets a chance to preempt and allows the cpu-down for
the target CPU to complete.
OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
issue a wakeup, it must not be ran under the scheduler locks.
Solve this apparent contradiction by keeping preemption disabled over
the unlock + queue_stopper combination:
preempt_disable();
task_rq_unlock(...);
if (!stop_pending)
stop_one_cpu_nowait(...)
preempt_enable();
This respects the lock ordering contraints while still avoiding the
above race. That is, if we find the CPU is online under rq-lock, the
targeted stop_one_cpu_nowait() must succeed.
Apply this pattern to all similar stop_one_cpu_nowait() invocations.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit 1cab6bce42e62bba2ff2c2370d139618c1828b42
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Fri Nov 12 11:13:24 2021 -0800
tracing/histogram: Fix check for missing operands in an expression
If a binary operation is detected while parsing an expression string,
the operand strings are deduced by splitting the experssion string at
the position of the detected binary operator. Both operand strings are
sub-strings (can be empty string) of the expression string but will
never be NULL.
Currently a NULL check is used for missing operands, fix this by
checking for empty strings instead.
Link: https://lkml.kernel.org/r/20211112191324.1302505-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9710b2f341a0 ("tracing: Fix operator precedence for hist triggers expression")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit 722eddaa4043acee8f031cf238ced5f7514ad638
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Mon Oct 25 13:08:38 2021 -0700
tracing/histogram: Optimize division by a power of 2
The division is a slow operation. If the divisor is a power of 2, use a
shift instead.
Results were obtained using Android's version of perf (simpleperf[1]) as
described below:
1. hist_field_div() is modified to call 2 test functions:
test_hist_field_div_[not]_optimized(); passing them the
same args. Use noinline and volatile to ensure these are
not optimized out by the compiler.
2. Create a hist event trigger that uses division:
events/kmem/rss_stat$ echo 'hist:keys=common_pid:x=size/<divisor>'
>> trigger
events/kmem/rss_stat$ echo 'hist:keys=common_pid:vals=$x'
>> trigger
3. Run Android's lmkd_test[2] to generate rss_stat events, and
record CPU samples with Android's simpleperf:
simpleperf record -a --exclude-perf --post-unwind=yes -m 16384 -g
-f 2000 -o perf.data
== Results ==
Divisor is a power of 2 (divisor == 32):
test_hist_field_div_not_optimized | 8,717,091 cpu-cycles
test_hist_field_div_optimized | 1,643,137 cpu-cycles
If the divisor is a power of 2, the optimized version is ~5.3x faster.
Divisor is not a power of 2 (divisor == 33):
test_hist_field_div_not_optimized | 4,444,324 cpu-cycles
test_hist_field_div_optimized | 5,497,958 cpu-cycles
If the divisor is not a power of 2, as expected, the optimized version is
slightly slower (~24% slower).
[1] https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md
[2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp
Link: https://lkml.kernel.org/r/20211025200852.3002369-7-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit f47716b7a955e40e2591b960d1eccb1fde967a70
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Mon Oct 25 13:08:37 2021 -0700
tracing/histogram: Covert expr to const if both operands are constants
If both operands of a hist trigger expression are constants, convert the
expression to a constant. This optimization avoids having to perform the
same calculation multiple times and also saves on memory since the
merged constants are represented by a single struct hist_field instead
or multiple.
Link: https://lkml.kernel.org/r/20211025200852.3002369-6-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit c5eac6ee8bc5d32e48b3845472b547574061f49f
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Mon Oct 25 13:08:36 2021 -0700
tracing/histogram: Simplify handling of .sym-offset in expressions
The '-' in .sym-offset can confuse the hist trigger arithmetic
expression parsing. Simplify the handling of this by replacing the
'sym-offset' with 'symXoffset'. This allows us to correctly evaluate
expressions where the user may have inadvertently added a .sym-offset
modifier to one of the operands in an expression, instead of bailing
out. In this case the .sym-offset has no effect on the evaluation of the
expression. The only valid use of the .sym-offset is as a hist key
modifier.
Link: https://lkml.kernel.org/r/20211025200852.3002369-5-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit 9710b2f341a0d96f35b911580639853cfda4677d
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Mon Oct 25 13:08:35 2021 -0700
tracing: Fix operator precedence for hist triggers expression
The current histogram expression evaluation logic evaluates the
expression from right to left. This can lead to incorrect results
if the operations are not associative (as is the case for subtraction
and, the now added, division operators).
e.g. 16-8-4-2 should be 2 not 10 --> 16-8-4-2 = ((16-8)-4)-2
64/8/4/2 should be 1 not 16 --> 64/8/4/2 = ((64/8)/4)/2
Division and multiplication are currently limited to single operation
expression due to operator precedence support not yet implemented.
Rework the expression parsing to support the correct evaluation of
expressions containing operators of different precedences; and fix
the associativity error by evaluating expressions with operators of
the same precedence from left to right.
Examples:
(1) echo 'hist:keys=common_pid:a=8,b=4,c=2,d=1,w=$a-$b-$c-$d' \
>> event/trigger
(2) echo 'hist:keys=common_pid:x=$a/$b/3/2' >> event/trigger
(3) echo 'hist:keys=common_pid:y=$a+10/$c*1024' >> event/trigger
(4) echo 'hist:keys=common_pid:z=$a/$b+$c*$d' >> event/trigger
Link: https://lkml.kernel.org/r/20211025200852.3002369-4-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit bcef044150320217e2a00c65050114e509c222b8
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Mon Oct 25 13:08:34 2021 -0700
tracing: Add division and multiplication support for hist triggers
Adds basic support for division and multiplication operations for
hist trigger variable expressions.
For simplicity this patch only supports, division and multiplication
for a single operation expression (e.g. x=$a/$b), as currently
expressions are always evaluated right to left. This can lead to some
incorrect results:
e.g. echo 'hist:keys=common_pid:x=8-4-2' >> event/trigger
8-4-2 should evaluate to 2 i.e. (8-4)-2
but currently x evaluate to 6 i.e. 8-(4-2)
Multiplication and division in sub-expressions will work correctly, once
correct operator precedence support is added (See next patch in this
series).
For the undefined case of division by 0, the histogram expression
evaluates to (u64)(-1). Since this cannot be detected when the
expression is created, it is the responsibility of the user to be
aware and account for this possibility.
Examples:
echo 'hist:keys=common_pid:a=8,b=4,x=$a/$b' \
>> event/trigger
echo 'hist:keys=common_pid:y=5*$b' \
>> event/trigger
Link: https://lkml.kernel.org/r/20211025200852.3002369-3-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit 52cfb373536a7fb744b0ec4b748518e5dc874fb7
Author: Kalesh Singh <kaleshsingh@google.com>
Date: Mon Oct 25 13:08:33 2021 -0700
tracing: Add support for creating hist trigger variables from literal
Currently hist trigger expressions don't support the use of numeric
literals:
e.g. echo 'hist:keys=common_pid:x=$y-1234'
--> is not valid expression syntax
Having the ability to use numeric constants in hist triggers supports
a wider range of expressions for creating variables.
Add support for creating trace event histogram variables from numeric
literals.
e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger
A negative numeric constant is created, using unary minus operator
(parentheses are required).
e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger
Constants can be used with division/multiplication (added in the
next patch in this series) to implement granularity filters for frequent
trace events. For instance we can limit emitting the rss_stat
trace event to when there is a 512KB cross over in the rss size:
# Create a synthetic event to monitor instead of the high frequency
# rss_stat event
echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
int member; long size' >> tracing/synthetic_events
# Create a hist trigger that emits the synthetic rss_stat_throttled
# event only when the rss size crosses a 512KB boundary.
echo 'hist:keys=keys=mm_id,member:bucket=size/0x80000:onchange($bucket)
.rss_stat_throttled(mm_id,curr,member,size)'
>> events/kmem/rss_stat/trigger
A use case for using constants with addition/subtraction is not yet
known, but for completeness the use of constants are supported for all
operators.
Link: https://lkml.kernel.org/r/20211025200852.3002369-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-67679
commit 3347d80baa41c357cf263923f60aa8051a753d76
Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
Date: Thu Jul 22 10:27:06 2021 -0400
tracing: Have histogram types be constant when possible
Instead of kstrdup("const", GFP_KERNEL), have the hist_field type simply
assign the constant hist_field->type = "const"; And when the value passed
to it is a variable, use "kstrdup_const(var, GFP_KERNEL);" which will just
copy the value if the variable is already a constant. This saves on having
to allocate when not needed.
All frees of the hist_field->type will need to use kfree_const().
Link: https://lkml.kernel.org/r/20210722142837.280718447@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>