Commit Graph

42911 Commits

Author SHA1 Message Date
Radostin Stoyanov 3a901bcc52 cgroup: Do not report unavailable v1 controllers in /proc/cgroups
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit af000ce85293b8e608f696f0c6c280bc3a75887f
Author: Michal Koutný <mkoutny@suse.com>
Date:   Mon Sep 9 18:32:23 2024 +0200

    cgroup: Do not report unavailable v1 controllers in /proc/cgroups

    This is a followup to CONFIG-urability of cpuset and memory controllers
    for v1 hierarchies. Make the output in /proc/cgroups reflect that
    !CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and
    !CONFIG_MEMCG_V1 is like !CONFIG_MEMCG.

    The intended effect is that hiding the unavailable controllers will hint
    users not to try mounting them on v1.

    Signed-off-by: Michal Koutný <mkoutny@suse.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 11:26:39 +01:00
Radostin Stoyanov b16f7d3e66 cgroup: Disallow mounting v1 hierarchies without controller implementation
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit 3c41382e920f1dd5c9f432948fe799c07af1cced
Author: Michal Koutný <mkoutny@suse.com>
Date:   Mon Sep 9 18:32:22 2024 +0200

    cgroup: Disallow mounting v1 hierarchies without controller implementation

    The configs that disable some v1 controllers would still allow mounting
    them but with no controller-specific files. (Making such hierarchies
    equivalent to named v1 hierarchies.) To achieve behavior consistent with
    actual out-compilation of a whole controller, the mounts should treat
    respective controllers as non-existent.

    Wrap implementation into a helper function, leverage legacy_files to
    detect compiled out controllers. The effect is that mounts on v1 would
    fail and produce a message like:
      [ 1543.999081] cgroup: Unknown subsys name 'memory'

    Signed-off-by: Michal Koutný <mkoutny@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 11:26:39 +01:00
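
Note on b16f7d3e66: the lookup behaviour described in the message above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `struct subsys_sketch`, `subsys_table`, `v1_subsys_absent()` and `v1_lookup()` are hypothetical names and the table contents are made up. The one idea taken from the commit message is that a controller with no legacy (v1) files is treated as if it did not exist.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a controller descriptor. */
struct subsys_sketch {
	const char *name;
	const char *const *legacy_files; /* NULL when the v1 code is compiled out */
};

static const char *const cpu_v1_files[] = { "cpu.shares", NULL };

/* Assumed configuration: "memory" is built without v1 support. */
static const struct subsys_sketch subsys_table[] = {
	{ "cpu",    cpu_v1_files },
	{ "memory", NULL },
};

/* A controller without v1 files is treated as absent on v1. */
static int v1_subsys_absent(const struct subsys_sketch *ss)
{
	return ss->legacy_files == NULL;
}

/* Returns the descriptor, or NULL (the "Unknown subsys name" case). */
static const struct subsys_sketch *v1_lookup(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(subsys_table) / sizeof(subsys_table[0]); i++) {
		const struct subsys_sketch *ss = &subsys_table[i];

		if (strcmp(ss->name, name) == 0)
			return v1_subsys_absent(ss) ? NULL : ss;
	}
	return NULL;
}
```

In this sketch a compiled-out controller and a misspelled one fail the same way, which is the consistency the commit message asks for.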
Radostin Stoyanov 1e0823a037 cgroup: Fix potential overflow issue when checking max_depth
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit 3cc4e13bb1617f6a13e5e6882465984148743cf4
Author: Xiu Jianfeng <xiujianfeng@huawei.com>
Date:   Sat Oct 12 07:22:46 2024 +0000

    cgroup: Fix potential overflow issue when checking max_depth

    cgroup.max.depth is the maximum allowed descent depth below the current
    cgroup. If the actual descent depth is equal or larger, an attempt to
    create a new child cgroup will fail. However, because cgroup->max_depth
    is of int type and has the default value INT_MAX, the condition
    'level > cgroup->max_depth' will never be satisfied, and the level
    will overflow once it reaches INT_MAX.

    Fix it by starting the level from 0 and using '>=' instead.

    It's worth mentioning that this issue is unlikely to occur in reality,
    as it's impossible to have a hierarchy of depth INT_MAX, but it should
    be avoided logically.

    Fixes: 1a926e0bba ("cgroup: implement hierarchy limits")
    Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 10:54:34 +01:00
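
Note on 1e0823a037: the corrected check (start the level at 0 and compare with '>=') can be sketched as below. This is a simplified user-space illustration, not the kernel function: `struct cg_sketch` and `depth_allows_child()` are hypothetical names, and only the depth limit is modelled.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cgroup node: only what the depth check needs. */
struct cg_sketch {
	struct cg_sketch *parent;
	int max_depth;	/* INT_MAX stands for "no limit" */
};

/*
 * Walk from the prospective parent up to the root. 'level' counts how many
 * levels lie between the parent and the ancestor being checked, starting
 * at 0. Because the comparison is '>=', the walk returns before 'level'
 * could be incremented past INT_MAX.
 */
static bool depth_allows_child(const struct cg_sketch *parent)
{
	const struct cg_sketch *cg;
	int level = 0;

	for (cg = parent; cg; cg = cg->parent) {
		if (level >= cg->max_depth)
			return false;
		level++;
	}
	return true;
}
```

With a limit of 1 on the root, a child of the root may be created but a grandchild may not; with a limit of 0 on a node, that node cannot get children at all.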
Radostin Stoyanov d795c506d2 cgroup/cpuset: Check for partition roots with overlapping CPUs
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit 99570300d3b4c8a1463491754d58e7a8d87cacef
Author: Waiman Long <longman@redhat.com>
Date:   Sun Aug 4 21:30:18 2024 -0400

    cgroup/cpuset: Check for partition roots with overlapping CPUs

    With the previous commit that eliminates the overlapping partition
    root corner cases in the hotplug code, the partition roots passed down
    to generate_sched_domains() should not have overlapping CPUs. Enable
    overlapping cpuset check for v2 and warn if that happens.

    This patch also has the benefit of increasing test coverage of the new
    Union-Find cpuset merging code to cgroup v2.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 10:54:34 +01:00
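
Note on d795c506d2: the kind of overlap check the message refers to can be illustrated with plain bitmasks. This sketch assumes CPU sets fit in a 64-bit word and uses a hypothetical helper name, `partitions_overlap()`. The kernel works on cpumasks and emits a warning rather than returning a value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if any two masks in the array share a CPU bit. */
static bool partitions_overlap(const uint64_t *masks, size_t n)
{
	uint64_t seen = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (seen & masks[i])
			return true;
		seen |= masks[i];
	}
	return false;
}
```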
Radostin Stoyanov a387509480 cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit 0e40cf2a8b2c847950e025d5aa594bd545118d26
Author: Kinsey Ho <kinseyho@google.com>
Date:   Thu Sep 5 00:30:50 2024 +0000

    cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU

    Patch series "Improve mem_cgroup_iter()", v4.

    Incremental cgroup iteration is being used again [1]. This patchset
    improves the reliability of mem_cgroup_iter(). It also improves
    simplicity and code readability.

    [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/

    This patch (of 5):

    Explicitly document that css sibling/descendant linkage is protected by
    cgroup_mutex or RCU.  Also, document in css_next_descendant_pre() and
    similar functions that it isn't necessary to hold a ref on @pos.

    The following changes in this patchset rely on this clarification for
    simplification in memcg iteration code.

    Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com
    Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com
    Suggested-by: Yosry Ahmed <yosryahmed@google.com>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Signed-off-by: Kinsey Ho <kinseyho@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: T.J. Mercier <tjmercier@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 10:54:33 +01:00
Radostin Stoyanov 702a637010 cgroup/cpuset: Remove cpuset_slab_spread_rotor
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit c149c4a48b19afbf0c383614e57b452d39b154de
Author: Xiu Jianfeng <xiujianfeng@huawei.com>
Date:   Sat Jul 13 08:59:16 2024 +0000

    cgroup/cpuset: Remove cpuset_slab_spread_rotor

    Since the SLAB implementation was removed in v6.8, the
    cpuset_slab_spread_rotor is no longer used and can be removed.

    Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 10:54:33 +01:00
Radostin Stoyanov 6d25e55fb0 cgroup: update some statememt about delegation
JIRA: https://issues.redhat.com/browse/RHEL-80382

commit d1a92d2d6c5dbeba9a87bfb57fa0142cdae7b206
Author: Chen Ridong <chenridong@huawei.com>
Date:   Thu Aug 15 13:14:08 2024 +0000

    cgroup: update some statememt about delegation

    The comment in cgroup_file_write is missing some interfaces, such as
    'cgroup.threads'. All delegatable files are listed in
    '/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write.
    Besides, add a statement that files outside the namespace shouldn't be
    visible from inside the delegated namespace.

    tj: Reflowed text for consistency.

    Signed-off-by: Chen Ridong <chenridong@huawei.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Radostin Stoyanov <rstoyano@redhat.com>
2025-04-28 10:54:33 +01:00
Augusto Caringi e2a8c62ac1 Merge: livepatch: selected fixes for rhel-9.7 v2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6654

JIRA: https://issues.redhat.com/browse/RHEL-85303

A small series of fixes for the RHEL9.7 livepatch subsystem.

Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>

Approved-by: Joe Lawrence <joe.lawrence@redhat.com>
Approved-by: Ryan Sullivan <rysulliv@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-24 12:23:31 -03:00
Augusto Caringi 63ccd7ece5 Merge: mm: backport of proactive fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6650

JIRA: https://issues.redhat.com/browse/RHEL-78989
JIRA: https://issues.redhat.com/browse/RHEL-80529
JIRA: https://issues.redhat.com/browse/RHEL-83249
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2025-21691
CVE: CVE-2025-21696
CVE: CVE-2025-21861

Proactively backport a set of selected follow-up Fixes for the
MM patches previously backported into RHEL-9 minor releases.
Dependencies and follow-up fixes for the selected commits
are also selectively backported.

Omitted-fix: e080a26725fb ("erofs: allow large folios for compressed files")
Omitted-fix: 3488af097044 ("mm/damon/core: handle zero {aggregation,ops_update} intervals")
Omitted-fix: 5e06ad590096 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero")
Omitted-fix: 25e8acbcf19c ("mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero")
Omitted-fix: 1390a3334a48 ("mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio")
Omitted-fix: 7ddeb91f5b03 ("mm: kmemleak: add support for dumping physical and __percpu object info")

Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Čestmír Kalina <ckalina@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-24 12:23:31 -03:00
Augusto Caringi 3f25b9462f Merge: sched: Fix stop_one_cpu_nowait() vs hotplug [rhel-9]
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6623

JIRA: https://issues.redhat.com/browse/RHEL-84526

Sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test
have been reported and fixed upstream. Notably, affine_move_task()
remains stuck in wait_for_completion(), leading to a hung-task detector
warning.

Both C10S and RHEL-10 already carry the fix from this upstream commit:

    f0498d2a54e79 sched: Fix stop_one_cpu_nowait() vs hotplug

Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>

Approved-by: Gabriele Monaco <gmonaco@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-24 12:23:25 -03:00
Augusto Caringi ac837b9b45 Merge: cgroup/cpuset: Fix issues in the cpuset partition code
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6722

JIRA: https://issues.redhat.com/browse/RHEL-83455

The Jira ticket reports a loss of isolated CPUs when isolated partitions
are being created, i.e. some of the isolated CPUs are missing and are
not in any of the existing cpusets. We are not able to reproduce this
problem in-house, but detailed analysis of the cpuset partition code
does reveal issues that need to be fixed.

This MR incorporates the latest cpuset fixes that were merged upstream
and hopefully will be able to address the issues seen by the customers.
To reduce conflicts and other complications, some other recent cpuset
commits are included as well.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Radostin Stoyanov <rstoyano@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-24 12:23:21 -03:00
Augusto Caringi b65573c720 Merge: rtla: Add timerlat BPF sample collection, Set all tracer options by default [rhel-9]
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6686

# Merge Request Required Information

JIRA: https://issues.redhat.com/browse/RHEL-77358
JIRA: https://issues.redhat.com/browse/RHEL-86051

## Summary of Changes

Two upstream patchsets are contained in this merge request:

* Collect timerlat samples using a BPF program instead of pulling them through a tracefs pipe. This helps with both performance and CPU usage, and fixes an issue where on systems with >100 CPUs, rtla cannot keep up with timerlat samples and drops most of them, making it useless.
* Always set default values of all tracer options (osnoise or timerlat) if not specified otherwise, as they might be set to unexpected values either by another user of the tracers or by a previous abnormally exited run of rtla.

Those are combined into a single MR, because the latter depends on a refactoring done in the former. A dependency (rtla test suite) is also pulled.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: John Kacur <jkacur@redhat.com>
Approved-by: Gabriele Monaco <gmonaco@redhat.com>
Approved-by: Derek Barbosa <debarbos@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-24 12:23:18 -03:00
Augusto Caringi 921043e372 Merge: watch_queue: fix pipe accounting mismatch
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6667

JIRA: https://issues.redhat.com/browse/RHEL-78249
commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Thu Feb 27 11:41:08 2025 -0600

    watch_queue: fix pipe accounting mismatch

    Currently, watch_queue_set_size() modifies the pipe buffers charged to
    user->pipe_bufs without updating the pipe->nr_accounted on the pipe
    itself, due to the if (!pipe_has_watch_queue()) test in
    pipe_resize_ring(). This means that when the pipe is ultimately freed,
    we decrement user->pipe_bufs by something other than what we had
    charged to it, potentially leading to an underflow. This in turn can
    cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM.

    To remedy this, explicitly account for the pipe usage in
    watch_queue_set_size() to match the number set via account_pipe_buffers()

    (It's unclear why watch_queue_set_size() does not update nr_accounted;
    it may be due to intentional overprovisioning in watch_queue_set_size()?)

    Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage")
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: David Howells <dhowells@redhat.com>
Approved-by: Pavel Reichl <preichl@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-24 12:23:17 -03:00
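
Note on 921043e372: the accounting mismatch can be reproduced in a small user-space model. Everything here is a hypothetical simplification (`struct user_sketch`, `struct pipe_sketch`, `charge_bufs()` and the two `set_size_*` variants are made-up names, and plain `long` counters replace the kernel's atomics). It only shows why the charge recorded on the pipe has to follow the charge made to the user.

```c
/* Hypothetical, simplified accounting state. */
struct user_sketch { long pipe_bufs; };
struct pipe_sketch {
	struct user_sketch *user;
	long nr_accounted;	/* what this pipe has charged to the user */
};

/* Replace a charge of 'old_nr' buffers with a charge of 'new_nr'. */
static void charge_bufs(struct user_sketch *u, long old_nr, long new_nr)
{
	u->pipe_bufs += new_nr - old_nr;
}

/* Mismatched variant: the user is re-charged but the pipe forgets it. */
static void set_size_mismatched(struct pipe_sketch *p, long nr)
{
	charge_bufs(p->user, p->nr_accounted, nr);
}

/* Matched variant: record the new charge on the pipe as well. */
static void set_size_matched(struct pipe_sketch *p, long nr)
{
	charge_bufs(p->user, p->nr_accounted, nr);
	p->nr_accounted = nr;
}

/* On free, give back exactly what the pipe believes it charged. */
static void release_pipe(struct pipe_sketch *p)
{
	charge_bufs(p->user, p->nr_accounted, 0);
	p->nr_accounted = 0;
}
```

Starting from a pipe that has charged 16 buffers and is resized to 1, the matched variant returns the user's counter to 0 on release, while the mismatched variant leaves it at -15. In this model that negative value stands in for the underflow the message describes.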
Rafael Aquini e4205ccf96 kernel: be more careful about dup_mmap() failures and uprobe registering
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2025-21709
Conflicts:
  * kernel/events/uprobes.c: a notable context difference in the 1st hunk
    due to RHEL-9 missing the following upstream commits: 87195a1ee332a,
    2bf8e5aceff89, and dd1a7567784e2; and a notable context difference in
    the 2nd hunk due to RHEL-9 missing the following upstream commits:
    84455e6923c7 and 8617408f7a01. None of the aforelisted commits are
    of any relevance for this backport work.

This patch is a backport of the following upstream commit:
commit 64c37e134b120fb462fb4a80694bfb8e7be77b14
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Mon Jan 27 12:02:21 2025 -0500

    kernel: be more careful about dup_mmap() failures and uprobe registering

    If a memory allocation fails during dup_mmap(), the maple tree can be left
    in an unsafe state for other iterators besides the exit path.  All the
    locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
    incomplete mm_struct can be reached through (at least) the rmap finding
    the vmas which have a pointer back to the mm_struct.

    Up to this point, there have been no issues with being able to find an
    mm_struct that was only partially initialised.  Syzbot was able to make
    the incomplete mm_struct fail with recent forking changes, so it has been
    proven unsafe to use the mm_struct that hasn't been initialised, as
    referenced in the link below.

    Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
    invalid mm") fixed the uprobe access, it does not completely remove the
    race.

    This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
    oom side (even though this is extremely unlikely to be selected as an oom
    victim in the race window), and sets MMF_UNSTABLE to avoid other potential
    users from using a partially initialised mm_struct.

    When registering vmas for uprobe, skip the vmas in an mm that is marked
    unstable.  Modifying a vma in an unstable mm may cause issues if the mm
    isn't fully initialised.

    Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
    Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Jann Horn <jannh@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:53 -04:00
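
Note on e4205ccf96: the flag handling described above (mark the mm on a failed duplication, skip marked mms when registering) can be sketched as follows. The flag values, `struct mm_sketch`, `mark_failed_dup()` and `may_register_probe()` are hypothetical; the kernel's real bit numbers and helpers are not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define SK_MMF_OOM_SKIP	(1UL << 0)
#define SK_MMF_UNSTABLE	(1UL << 1)

struct mm_sketch { unsigned long flags; };

/* Error path of a failed duplication: mark the mm so others leave it alone. */
static void mark_failed_dup(struct mm_sketch *mm)
{
	mm->flags |= SK_MMF_OOM_SKIP | SK_MMF_UNSTABLE;
}

/* Registration side: skip any mm that is marked unstable. */
static bool may_register_probe(const struct mm_sketch *mm)
{
	return !(mm->flags & SK_MMF_UNSTABLE);
}
```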
Rafael Aquini 6abde438ad fork: avoid inappropriate uprobe access to invalid mm
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
  * kernel/fork.c: minor difference from upstream due to an extra blank line
      that was left behind when commit d24062914837 ("fork: use __mt_dup()
      to duplicate maple tree in dup_mmap()") was backported into RHEL-9

This patch is a backport of the following upstream commit:
commit 8ac662f5da19f5873fdd94c48a5cdb45b2e1b58f
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Dec 10 17:24:12 2024 +0000

    fork: avoid inappropriate uprobe access to invalid mm

    If dup_mmap() encounters an issue, currently uprobe is able to access the
    relevant mm via the reverse mapping (in build_map_info()), and if we are
    very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
    we establish as part of the fork error path.

    This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
    in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
    invoking vma_iter_load() which uses the advanced maple tree API and thus
    is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
    d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
    dup_mmap()").

    This change was made on the assumption that only process tear-down code
    would actually observe (and make use of) these values.  However this very
    unlikely but still possible edge case with uprobes exists and
    unfortunately does make these observable.

    The uprobe operation prevents races against the dup_mmap() operation via
    the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
    and dropped via uprobe_end_dup_mmap(), and held across
    register_for_each_vma() prior to invoking build_map_info() which does the
    reverse mapping lookup.

    Currently these are acquired and dropped within dup_mmap(), which exposes
    the race window prior to error handling in the invoking dup_mm() which
    tears down the mm.

    We can avoid all this by just moving the invocation of
    uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
    and only release this lock once the dup_mmap() operation succeeds or clean
    up is done.

    This means that the uprobe code can never observe an incompletely
    constructed mm and resolves the issue in this case.

    Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Kan Liang <kan.liang@linux.intel.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:52 -04:00
Waiman Long f397624dbd cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
JIRA: https://issues.redhat.com/browse/RHEL-83455
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git

commit 86888c7bd117c29eab169c37e5f6bbbf583da983
Author: Waiman Long <longman@redhat.com>
Date:   Mon, 7 Apr 2025 17:21:05 -0400

    cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs

    Add WARN_ON_ONCE() statements whenever new exclusive CPUs are being
    added to a partition root to catch inconsistency in the way exclusive
    CPUs are being handled in the cpuset code.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:43 -04:00
Waiman Long 19950eb6cd cgroup/cpuset: Code cleanup and comment update
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: Some minor context diffs due to missing upstream commit
	   381b53c3b549 ("cgroup/cpuset: rename functions shared
	   between v1 and v2") and commit 2ff899e35164 ("sched/deadline:
	   Rebuild root domain accounting after every update").

commit f0a0bd3d23a44a2c5f628e8ca8ad882498ca5aae
Author: Waiman Long <longman@redhat.com>
Date:   Sun, 30 Mar 2025 17:52:44 -0400

    cgroup/cpuset: Code cleanup and comment update

    Rename partition_xcpus_newstate() to isolated_cpus_update(),
    update_partition_exclusive() to update_partition_exclusive_flag() and
    the new_xcpus_state variable to isolcpus_updated to make their meanings
    more explicit. Also add some comments to further clarify the code.
    No functional change is expected.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:41 -04:00
Waiman Long 710ee1a4b8 cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit f62a5d39368e34a966c8df63e1f05eed7fe9c5de
Author: Waiman Long <longman@redhat.com>
Date:   Sun, 30 Mar 2025 17:52:42 -0400

    cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition

    Currently, changes in exclusive CPUs are being handled in
    remote_partition_check() by disabling conflicting remote partitions.
    However, that may lead to results unexpected by the users. Fix
    this problem by removing remote_partition_check() and making
    update_cpumasks_hier() handle changes in descendant remote partitions
    properly.

    The compute_effective_exclusive_cpumask() function is enhanced to check
    the exclusive_cpus and effective_xcpus from siblings and exclude them
    in its effective exclusive CPUs computation and return a value to show
    whether there are any sibling conflicts.  This is somewhat like the cpu_exclusive
    flag check in validate_change(). This is the initial step to enable us
    to retire the use of cpu_exclusive flag in cgroup v2 in the future.

    One of the tests in the TEST_MATRIX of the test_cpuset_prs.sh
    script has to be updated due to changes in the way a child remote
    partition root is being handled (updated instead of invalidation)
    in update_cpumasks_hier().

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:41 -04:00
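
Note on 710ee1a4b8: the sibling exclusion described for compute_effective_exclusive_cpumask() can be illustrated with bitmasks. This is a sketch under assumptions, not the kernel function: CPU sets are modelled as 64-bit words, `exclude_sibling_cpus()` is a hypothetical name, and returning a conflict count (rather than whatever value the kernel returns) is a choice made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Remove every CPU claimed by a sibling from 'requested' and store the
 * result in *effective. Returns the number of siblings that conflicted.
 */
static int exclude_sibling_cpus(uint64_t requested, const uint64_t *siblings,
				size_t n, uint64_t *effective)
{
	int conflicts = 0;
	size_t i;

	*effective = requested;
	for (i = 0; i < n; i++) {
		if (requested & siblings[i]) {
			conflicts++;
			*effective &= ~siblings[i];
		}
	}
	return conflicts;
}
```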
Waiman Long b6ee781e20 cgroup/cpuset: Fix error handling in remote_partition_disable()
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit 8bf450f3aec3d1bbd725d179502c64b8992588e4
Author: Waiman Long <longman@redhat.com>
Date:   Sun, 30 Mar 2025 17:52:41 -0400

    cgroup/cpuset: Fix error handling in remote_partition_disable()

    When remote_partition_disable() is called to disable a remote partition,
    it always sets the partition to an invalid partition state. It should
    only do so if an error code (prs_err) has been set. Correct that and
    add proper error code in places where remote_partition_disable() is
    called due to error.

    Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:41 -04:00
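
Note on b6ee781e20: the corrected behaviour (invalidate only when an error code is set) might look like the sketch below. The state encoding, `struct part_sketch` and `partition_disable_sketch()` are hypothetical, and falling back to a plain member when no error is set is an assumption of this example, since the message only says the invalid state should depend on prs_err.

```c
/* Hypothetical states: positive = valid partition, 0 = plain member. */
enum { SK_MEMBER = 0, SK_ROOT = 1, SK_ISOLATED = 2 };

struct part_sketch {
	int state;	/* one of the values above, negated when invalid */
	int prs_err;	/* non-zero when the disable is caused by an error */
};

/*
 * Disable a partition: mark it invalid only if an error was recorded,
 * otherwise (assumed here) return it to an ordinary member.
 */
static void partition_disable_sketch(struct part_sketch *p)
{
	if (p->prs_err)
		p->state = -p->state;
	else
		p->state = SK_MEMBER;
}
```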
Waiman Long d6a8d6bd83 cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit 668e041662e92ab3ebcb9eb606d3ec01884546ab
Author: Waiman Long <longman@redhat.com>
Date:   Sun, 30 Mar 2025 17:52:40 -0400

    cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()

    Before commit f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to
    partition & cpus changes"), a cpuset partition cannot be enabled if not
    all the requested CPUs can be granted from the parent cpuset. After
    that commit, a cpuset partition can be created even if the requested
    exclusive CPUs contain CPUs not allowed by its parent.  The delmask
    containing exclusive CPUs to be removed from its parent wasn't
    adjusted accordingly.

    That is not a problem until the introduction of a new isolated_cpus
    mask in commit 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in
    isolated partitions") as the CPUs in the delmask may be added directly
    into isolated_cpus.

    As a result, isolated_cpus may incorrectly contain CPUs that are not
    isolated leading to incorrect data reporting. Fix this by adjusting
    the delmask to reflect the actual exclusive CPUs for the creation of
    the partition.

    Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:40 -04:00
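
Note on d6a8d6bd83: the adjustment of the delmask can be pictured with bitmasks. This is a rough sketch with hypothetical helpers (`compute_delmask()`, `add_isolated()`) and 64-bit words in place of cpumasks. Which parent mask the kernel intersects with is not stated in the message, so `parent_cpus` here is an assumed stand-in for "CPUs the parent can actually grant".

```c
#include <stdint.h>

/*
 * CPUs to take away from the parent when the partition is created:
 * only those requested exclusive CPUs the parent can actually grant.
 */
static uint64_t compute_delmask(uint64_t requested_xcpus, uint64_t parent_cpus)
{
	return requested_xcpus & parent_cpus;
}

/* Isolated-CPU bookkeeping then only gains CPUs that were really granted. */
static uint64_t add_isolated(uint64_t isolated, uint64_t delmask)
{
	return isolated | delmask;
}
```

For example, requesting CPUs 4-7 from a parent that holds CPUs 2-5 yields a delmask of CPUs 4-5 only, so CPUs 6-7 never reach the isolated set.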
Waiman Long 9022c81a98 cgroup/cpuset: Fix race between newly created partition and dying one
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: A merge conflict in the cpuset_css_offline() hunk due to
	   missing upstream commit c4c9cebe2fb9 ("cgroup/cpuset:
	   Further optimize code if CONFIG_CPUSETS_V1 not set").

commit a22b3d54de94f82ca057cc2ebf9496fa91ebf698
Author: Waiman Long <longman@redhat.com>
Date:   Sun, 30 Mar 2025 17:52:39 -0400

    cgroup/cpuset: Fix race between newly created partition and dying one

    There is a possible race between removing a cgroup directory that is
    a partition root and the creation of a new partition.  The partition
    to be removed can be dying but still online. It does not currently
    participate in checking for exclusive CPUs conflict, but the exclusive
    CPUs are still there in subpartitions_cpus and isolated_cpus. These
    two cpumasks are global states that affect the operation of cpuset
    partitions. The exclusive CPUs in dying cpusets will only be removed
    when cpuset_css_offline() function is called after an RCU delay.

    As a result, it is possible that a new partition can be created with
    exclusive CPUs that overlap with those of a dying one. When that dying
    partition is finally offlined, it removes those overlapping exclusive
    CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an
    incorrect CPU configuration.

    This bug was found when a warning was triggered in
    remote_partition_disable() during testing because the subpartitions_cpus
    mask was empty.

    One possible way to fix this is to iterate the dying cpusets as well and
    avoid using the exclusive CPUs in those dying cpusets. However, this
    can still cause random partition creation failures or other anomalies
    due to racing. A better way to fix this race is to reset the partition
    state at the moment when a cpuset is being killed.

    Introduce a new css_killed() CSS function pointer and call it, if
    defined, before setting CSS_DYING flag in kill_css(). Also update the
    css_is_dying() helper to use the CSS_DYING flag introduced by commit
    33c35aa481 ("cgroup: Prevent kill_css() from being called more than
    once") for proper synchronization.

    Add a new cpuset_css_killed() function to reset the partition state of
    a valid partition root if it is being killed.

    Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:40 -04:00
Waiman Long 16bd4a6994 cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit 9b496a8bbed9cc292b0dfd796f38ec58b6d0375f
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 5 Dec 2024 14:51:01 -0500

    cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains

    Isolated CPUs are not allowed to be used in a non-isolated partition.
    The only exception is the top cpuset which is allowed to contain boot
    time isolated CPUs.

    Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation
    problem") introduces a simplified scheme of including only partition
    roots in sched domain generation. However, it does not properly account
    for this exception case. This can result in leakage of isolated CPUs
    into a sched domain.

    Fix it by making sure that isolated CPUs are excluded from the top
    cpuset before generating sched domains.

    Also update the way the boot time isolated CPUs are handled in
    test_cpuset_prs.sh to make sure that those isolated CPUs are really
    isolated instead of just skipping them in the tests.

    Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:39 -04:00
Waiman Long 9a10873e27 cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: A context diff in the cpuset_update_flag() hunk due to
	   missing upstream commit 381b53c3b549 ("cgroup/cpuset: rename
	   functions shared between v1 and v2") and the removal of
	   IS_ENABLED(CONFIG_CPUSETS_V1) check.

commit a040c351283e3ac75422621ea205b1d8d687e108
Author: Waiman Long <longman@redhat.com>
Date:   Sat, 9 Nov 2024 21:50:22 -0500

    cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation

    Since commit ff0ce721ec21 ("cgroup/cpuset: Eliminate unncessary
    sched domains rebuilds in hotplug"), there is only one
    rebuild_sched_domains_locked() call per hotplug operation. However,
    writing to the various cpuset control files may still cause more than
    one rebuild_sched_domains_locked() call to happen in some cases.

    Juri had found that two rebuild_sched_domains_locked() calls in
    update_prstate(), one from update_cpumasks_hier() and another one from
    update_partition_sd_lb() could cause cpuset partition to be created
    with null total_bw for DL tasks. IOW, DL tasks may not be scheduled
    correctly in such a partition.

    A sample command sequence that can reproduce null total_bw is as
    follows.

      # echo Y >/sys/kernel/debug/sched/verbose
      # echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
      # mkdir /sys/fs/cgroup/test
      # echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus
      # echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive
      # echo root >/sys/fs/cgroup/test/cpuset.cpus.partition

    Fix this double rebuild_sched_domains_locked() calls problem
    by replacing existing calls with cpuset_force_rebuild() except
    the rebuild_sched_domains_cpuslocked() call at the end of
    cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is
    now done at the end of cpuset_write_resmask() and update_prstate()
    to determine if rebuild_sched_domains_locked() should be called or not.

    The cpuset v1 code can still call rebuild_sched_domains_locked()
    directly as double rebuild_sched_domains_locked() calls are not possible.
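
The deferred-rebuild idea described above can be sketched in plain C. This is a minimal user-space illustration, not the kernel code: `request_rebuild_sketch()`, `rebuild_sim()`, `write_operation_sketch()` and `rebuild_count` are hypothetical names, and only the flag name `force_sd_rebuild` comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
static bool force_sd_rebuild;   /* set when some helper wants a rebuild */
static int rebuild_count;       /* counts simulated rebuilds */

/* Helpers only record that a rebuild is needed. */
static void request_rebuild_sketch(void)
{
    force_sd_rebuild = true;
}

/* Stand-in for the expensive sched domain rebuild. */
static void rebuild_sim(void)
{
    rebuild_count++;
}

/* One control-file write that may touch several internal helpers. */
static void write_operation_sketch(int nr_changes)
{
    assert(nr_changes >= 0);

    for (int i = 0; i < nr_changes; i++)
        request_rebuild_sketch();

    /* The flag is checked once, at the end of the operation. */
    if (force_sd_rebuild) {
        force_sd_rebuild = false;
        rebuild_sim();
    }
}
```

However many helpers ask for a rebuild during one operation, the sketch performs at most one, which is the property the commit title names.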

    Reported-by: Juri Lelli <juri.lelli@redhat.com>
    Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/
    Signed-off-by: Waiman Long <longman@redhat.com>
    Tested-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:39 -04:00
Waiman Long d55b5f7e82 cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit bcd7012afd7bcd45fcd7a0e2f48e57b273702317
Author: Waiman Long <longman@redhat.com>
Date:   Sat, 9 Nov 2024 21:50:21 -0500

    cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"

    Revert commit 3ae0b773211e ("cgroup/cpuset: Allow suppression of sched
    domain rebuild in update_cpumasks_hier()") to allow for an alternative
    way to suppress unnecessary rebuild_sched_domains_locked() calls in
    update_cpumasks_hier() and elsewhere in a following commit.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:39 -04:00
Waiman Long 1a2685707f cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: Minor context diff in 3 hunks due to missing upstream
	   commit 381b53c3b549 ("cgroup/cpuset: rename functions shared
	   between v1 and v2").

commit 95a616d89ccd2d2af0bd26c13c50143b301d82e8
Author: everestkc <everestkc@everestkc.com.np>
Date:   Sun, 15 Sep 2024 02:29:21 -0600

    cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c

    Corrected the spelling errors reported by codespell as follows:
            temparary ==> temporary
            Proprogate ==> Propagate
            constrainted ==> constrained

    Signed-off-by: Everest K.C. <everestkc@everestkc.com.np>
    Acked-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:38 -04:00
Waiman Long 4916b3852c cgroup/cpuset: Account for boot time isolated CPUs
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit c188f33c864e3dba49a1ad0dc9fddf2f49ac42ae
Author: Waiman Long <longman@redhat.com>
Date:   Tue, 20 Aug 2024 15:55:35 -0400

    cgroup/cpuset: Account for boot time isolated CPUs

    With the "isolcpus" boot command line parameter, we are able to
    create isolated CPUs at boot time. These isolated CPUs aren't fully
    accounted for in the cpuset code. For instance, the root cgroup's
    "cpuset.cpus.isolated" control file does not include the boot time
    isolated CPUs. Fix that by looking for pre-isolated CPUs at init time.

    The prstate_housekeeping_conflict() function does check the
    HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it
    can only be used in isolated partition. Given the fact that we are going
    to make housekeeping cpumasks dynamic, the current check may not be right
    anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against
    it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:37 -04:00
Waiman Long cc3669da09 cgroup/cpuset: remove use_parent_ecpus of cpuset
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit 3c2acae88844e7423a50b5cbe0a2c9d430fcd20c
Author: Chen Ridong <chenridong@huawei.com>
Date:   Tue, 20 Aug 2024 03:01:26 +0000

    cgroup/cpuset: remove use_parent_ecpus of cpuset

    use_parent_ecpus is used to track whether the children are using the
    parent's effective_cpus. When a parent's effective_cpus is changed
    due to changes in a child partition's effective_xcpus, any child
    using parent's effective_cpus must call update_cpumasks_hier. However,
    if a child is not a valid partition, it is sufficient to determine
    whether to call update_cpumasks_hier based on whether the child's
    effective_cpus is going to change. To make the code more succinct,
    it is suggested to remove use_parent_ecpus.

    Signed-off-by: Chen Ridong <chenridong@huawei.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:37 -04:00
Waiman Long 09205586fe cgroup/cpuset: remove fetch_xcpus
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit 9414f68d454529ff7e68f0c2aefe0a007060c66a
Author: Chen Ridong <chenridong@huawei.com>
Date:   Tue, 20 Aug 2024 03:01:25 +0000

    cgroup/cpuset: remove fetch_xcpus

    Both fetch_xcpus and user_xcpus functions are used to retrieve the value
    of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the
    implicit value used as exclusive in a local partition. I can not imagine
    a scenario where effective_xcpus is not empty when exclusive_cpus is
    empty. Therefore, I suggest removing the fetch_xcpus function.

    Signed-off-by: Chen Ridong <chenridong@huawei.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:37 -04:00
Waiman Long 6981025ee0 cgroup/cpuset: remove child_ecpus_count
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit d6326047576266991d88639e1e9739a9a9a20ef4
Author: Chen Ridong <chenridong@huawei.com>
Date:   Wed, 24 Jul 2024 10:24:18 +0000

    cgroup/cpuset: remove child_ecpus_count

    The child_ecpus_count variable was previously used to update
    sibling cpumask when parent's effective_cpus is updated. However, it became
    obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add
    cpuset.cpus.exclusive for v2"). It should be removed.

    tj: Restored {} for style consistency.

    Signed-off-by: Chen Ridong <chenridong@huawei.com>
    Acked-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:36 -04:00
Waiman Long 73f605febe cpuset: use Union-Find to optimize the merging of cpumasks
JIRA: https://issues.redhat.com/browse/RHEL-83455

commit 8a895c2e6a7ed264a1b917616db205ed934e8306
Author: Xavier <xavier_qy@163.com>
Date:   Thu, 4 Jul 2024 14:24:44 +0800

    cpuset: use Union-Find to optimize the merging of cpumasks

    The process of constructing scheduling domains
    involves multiple loops and repeated evaluations, leading to numerous
    redundant and ineffective assessments that impact code efficiency.

    Here, we use union-find to optimize the merging of cpumasks. By employing
    path compression and union by rank, we effectively reduce the number of
    lookups and merge comparisons.
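
For readers unfamiliar with the technique, here is a minimal union-find sketch with path compression and union by rank. It is a generic illustration only: the type and function names (`uf_node_sketch`, `uf_init_sketch`, `uf_find_sketch`, `uf_merge_sketch`) are hypothetical and are not claimed to match the helpers added by this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration. */
struct uf_node_sketch {
    struct uf_node_sketch *parent;
    unsigned int rank;
};

/* Each node starts as the root of its own single-element set. */
static void uf_init_sketch(struct uf_node_sketch *n)
{
    assert(n != NULL);
    n->parent = n;
    n->rank = 0;
}

/* Find the set representative and compress the path behind us. */
static struct uf_node_sketch *uf_find_sketch(struct uf_node_sketch *n)
{
    struct uf_node_sketch *root = n;

    while (root->parent != root)
        root = root->parent;

    /* Path compression: point every visited node directly at the root. */
    while (n->parent != root) {
        struct uf_node_sketch *next = n->parent;

        n->parent = root;
        n = next;
    }
    return root;
}

/* Union by rank: attach the shallower tree under the deeper one. */
static void uf_merge_sketch(struct uf_node_sketch *a, struct uf_node_sketch *b)
{
    struct uf_node_sketch *ra = uf_find_sketch(a);
    struct uf_node_sketch *rb = uf_find_sketch(b);

    if (ra == rb)
        return;
    if (ra->rank < rb->rank) {
        struct uf_node_sketch *tmp = ra;

        ra = rb;
        rb = tmp;
    }
    rb->parent = ra;
    if (ra->rank == rb->rank)
        ra->rank++;
}
```

In the cpuset context one would presumably merge two nodes whenever their cpumasks overlap, so that each resulting set corresponds to one scheduling domain; that mapping is an assumption about usage, not something the sketch enforces.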

    Signed-off-by: Xavier <xavier_qy@163.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:35 -04:00
Augusto Caringi 74782eb600 Merge: block: update with v6.14
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6580

JIRA: https://issues.redhat.com/browse/RHEL-79409

we don't backport "block: Fix potential deadlock while freezing queue and acquiring sysfs_lock"

Omitted-Fix: 224749be6c23 ("block: Revert "block: Fix potential deadlock while freezing queue and acquiring sysfs_lock"")

Omitted-Fix: 2fa07d7a0f00 ("btrfs: pass write-hint for buffered IO")

Omitted-Fix: e559ee022658 ("btrfs: validate queue limits")

Omitted-Fix: 7467bc5959bf ("btrfs: zoned: calculate max_extent_size properly on non-zoned setup")

Omitted-Fix: c7c97ceff98c ("btrfs: handle bio_split() errors")

Signed-off-by: Ming Lei <ming.lei@redhat.com>

Approved-by: Ewan D. Milne <emilne@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-04 12:34:54 -03:00
Augusto Caringi c82044acec Merge: Sched: /proc/schedstat improvements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6405

JIRA: https://issues.redhat.com/browse/RHEL-23495

Update /proc/schedstat with fixes and improved information
from upstream.  AMD requested these and they don't carry
a large risk.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-04 12:34:51 -03:00
Augusto Caringi 652f8a293b Merge: CVE-2025-21726: padata: avoid UAF for reorder_work
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6499

JIRA: https://issues.redhat.com/browse/RHEL-81522
CVE: CVE-2025-21726

This MR backports the 3-patch series that includes the fix to CVE-2025-21726
as well as two other minor cleanup and fix patches.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-04-04 12:34:50 -03:00
Tomas Glozar 331cb7536e trace/osnoise: Add trace events for samples
JIRA: https://issues.redhat.com/browse/RHEL-77358

commit a065bbf776d32a71e748bd948861e6deca803d78
Author: Tomas Glozar <tglozar@redhat.com>
Date:   Mon Feb 3 10:04:18 2025 +0100

    trace/osnoise: Add trace events for samples

    Add trace events that fire at osnoise and timerlat sample generation, in
    addition to the already existing noise and threshold events.

    This allows processing the samples directly in the kernel, either with
    ftrace triggers or with BPF.

    Cc: John Kacur <jkacur@redhat.com>
    Cc: Luis Goncalves <lgoncalv@redhat.com>
    Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com
    Signed-off-by: Tomas Glozar <tglozar@redhat.com>
    Tested-by: Gabriele Monaco <gmonaco@redhat.com>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-04-04 11:14:44 +02:00
Denis Aleksandrov 7c3f326164 livepatch: Add stack_order sysfs attribute
JIRA: https://issues.redhat.com/browse/RHEL-85303

Add "stack_order" sysfs attribute which holds the order in which a live
patch module was loaded into the system. A user can then determine an
active live patched version of a function.

cat /sys/kernel/livepatch/livepatch_1/stack_order -> 1

means that livepatch_1 is the first live patch applied

cat /sys/kernel/livepatch/livepatch_module/stack_order -> N

means that livepatch_module is the Nth live patch applied

Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Wardenjohn <zhangwarden@gmail.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20241008014856.3729-2-zhangwarden@gmail.com
[pmladek@suse.com: Updated kernel version and date in the ABI documentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
(cherry picked from commit 3dae09de406167123449d9ece1f51855d5bac01a)
Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>
2025-04-03 13:23:15 -04:00
Denis Aleksandrov cab71a4b8d livepatch: Use kallsyms_on_each_match_symbol() to improve performance
JIRA: https://issues.redhat.com/browse/RHEL-85303

Based on the test results of kallsyms_on_each_match_symbol() and
kallsyms_on_each_symbol(), the average performance can be improved by
more than 1500 times.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
(cherry picked from commit 9cb37357dfce1b596041ad68a20407c8b4e76635)
Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>
2025-04-03 13:22:35 -04:00
Denis Aleksandrov 579531376b livepatch: Fix build failure on 32 bits processors
JIRA: https://issues.redhat.com/browse/RHEL-85303

Trying to build livepatch on powerpc/32 results in:

	kernel/livepatch/core.c: In function 'klp_resolve_symbols':
	kernel/livepatch/core.c:221:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
	  221 |                 sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
	      |                       ^
	kernel/livepatch/core.c:221:21: error: assignment to 'Elf32_Sym *' {aka 'struct elf32_sym *'} from incompatible pointer type 'Elf64_Sym *' {aka 'struct elf64_sym *'} [-Werror=incompatible-pointer-types]
	  221 |                 sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
	      |                     ^
	kernel/livepatch/core.c: In function 'klp_apply_section_relocs':
	kernel/livepatch/core.c:312:35: error: passing argument 1 of 'klp_resolve_symbols' from incompatible pointer type [-Werror=incompatible-pointer-types]
	  312 |         ret = klp_resolve_symbols(sechdrs, strtab, symndx, sec, sec_objname);
	      |                                   ^~~~~~~
	      |                                   |
	      |                                   Elf32_Shdr * {aka struct elf32_shdr *}
	kernel/livepatch/core.c:193:44: note: expected 'Elf64_Shdr *' {aka 'struct elf64_shdr *'} but argument is of type 'Elf32_Shdr *' {aka 'struct elf32_shdr *'}
	  193 | static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
	      |                                ~~~~~~~~~~~~^~~~~~~

Fix it by using the right types instead of forcing 64 bits types.

Fixes: 7c8e2bdd5f ("livepatch: Apply vmlinux-specific KLP relocations early")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5288e11b018a762ea3351cc8fb2d4f15093a4457.1640017960.git.christophe.leroy@csgroup.eu

(cherry picked from commit 2f293651eca3eacaeb56747dede31edace7329d2)
Signed-off-by: Denis Aleksandrov <daleksan@redhat.com>
2025-04-03 13:21:39 -04:00
Carlos Maiolino 972aed5f39 watch_queue: fix pipe accounting mismatch
JIRA: https://issues.redhat.com/browse/RHEL-78249

commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Thu Feb 27 11:41:08 2025 -0600

    watch_queue: fix pipe accounting mismatch

    Currently, watch_queue_set_size() modifies the pipe buffers charged to
    user->pipe_bufs without updating the pipe->nr_accounted on the pipe
    itself, due to the if (!pipe_has_watch_queue()) test in
    pipe_resize_ring(). This means that when the pipe is ultimately freed,
    we decrement user->pipe_bufs by something other than what we had
    charged to it, potentially leading to an underflow. This in turn can
    cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM.

    To remedy this, explicitly account for the pipe usage in
    watch_queue_set_size() to match the number set via account_pipe_buffers().

    (It's unclear why watch_queue_set_size() does not update nr_accounted;
    it may be due to intentional overprovisioning in watch_queue_set_size()?)
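
The mismatch can be modelled in a few lines of user-space C. This is a simplified sketch under assumptions: the structures and functions below (`user_sketch`, `pipe_sketch`, `account_sketch()` and friends) are hypothetical stand-ins, and the `fixed` parameter only selects between the behaviour before and after the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct user_sketch {
    unsigned long pipe_bufs;      /* buffers charged to the user */
};

struct pipe_sketch {
    struct user_sketch *user;
    unsigned long nr_accounted;   /* what this pipe believes it charged */
};

/* Move the per-user charge from old_nr buffers to new_nr buffers. */
static void account_sketch(struct user_sketch *u,
                           unsigned long old_nr, unsigned long new_nr)
{
    u->pipe_bufs = u->pipe_bufs - old_nr + new_nr;
}

static void pipe_create_sketch(struct pipe_sketch *p, struct user_sketch *u,
                               unsigned long nr)
{
    p->user = u;
    p->nr_accounted = nr;
    account_sketch(u, 0, nr);
}

/* Resize; without the fix, nr_accounted is left stale. */
static void set_size_sketch(struct pipe_sketch *p, unsigned long nr, bool fixed)
{
    assert(p->user != NULL);
    account_sketch(p->user, p->nr_accounted, nr);
    if (fixed)
        p->nr_accounted = nr;
}

/* On free, the pipe gives back what it believes it charged. */
static void pipe_free_sketch(struct pipe_sketch *p)
{
    account_sketch(p->user, p->nr_accounted, 0);
}
```

With a pipe created at 16 buffers and resized to 4, the unfixed path subtracts 16 from a charge of 4 on free, so the unsigned counter wraps around; the fixed path returns the counter to 0.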

    Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage")
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2025-04-02 10:32:41 +02:00
Augusto Caringi f4ca2d23b8 Merge: tracing: Add division and multiplication support for hist triggers [rhel-9]
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6586

# Merge Request Required Information

JIRA: https://issues.redhat.com/browse/RHEL-67679

## Summary of Changes

Backport support for division and multiplication in histogram triggers as well as fixes and optimizations for it. Support for creating hist trigger variables from literal is added, too, as it is a dependency of one of the fixes and is documented together with the former.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>

Approved-by: Joe Lawrence <joe.lawrence@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-03-31 16:55:04 -03:00
Augusto Caringi 201242e8d4 Merge: cgroup: Remove steal time from usage_usec
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6423

JIRA: https://issues.redhat.com/browse/RHEL-79933

commit db5fd3cf8bf41b84b577b8ad5234ea95f327c9be
Author: Muhammad Adeel <Muhammad.Adeel@ibm.com>
Date:   Fri, 7 Feb 2025 14:24:32 +0000

    cgroup: Remove steal time from usage_usec

    The CPU usage time is the time when user, system or both are using the CPU.
    Steal time is the time when CPU is waiting to be run by the Hypervisor. It
    should not be added to the CPU usage time, hence removing it from the
    usage_usec entry.

    Fixes: 936f2a70f2 ("cgroup: add cpu.stat file to root cgroup")
    Acked-by: Axel Busch <axel.busch@ibm.com>
    Acked-by: Michal Koutný <mkoutny@suse.com>
    Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Thomas Huth <thuth@redhat.com>
Approved-by: Radostin Stoyanov <rstoyano@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Augusto Caringi <acaringi@redhat.com>
2025-03-27 16:28:29 -03:00
Luis Claudio R. Goncalves 0857fd208a sched: Fix stop_one_cpu_nowait() vs hotplug
JIRA: https://issues.redhat.com/browse/RHEL-84526

commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 10 20:57:39 2023 +0200

    sched: Fix stop_one_cpu_nowait() vs hotplug

    Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
    hotplug stress-test -- notably affine_move_task() remains stuck in
    wait_for_completion(), leading to a hung-task detector warning.

    Specifically, it was reported that stop_one_cpu_nowait(.fn =
    migration_cpu_stop) returns false -- this stopper is responsible for
    the matching complete().

    The race scenario is:

            CPU0                                    CPU1

                                            // doing _cpu_down()

      __set_cpus_allowed_ptr()
        task_rq_lock();
                                            takedown_cpu()
                                              stop_machine_cpuslocked(take_cpu_down..)

                                            <PREEMPT: cpu_stopper_thread()
                                              MULTI_STOP_PREPARE
                                              ...
        __set_cpus_allowed_ptr_locked()
          affine_move_task()
            task_rq_unlock();

      <PREEMPT: cpu_stopper_thread()\>
        ack_state()
                                              MULTI_STOP_RUN
                                                take_cpu_down()
                                                  __cpu_disable();
                                                  stop_machine_park();
                                                    stopper->enabled = false;
                                             />
       />
            stop_one_cpu_nowait(.fn = migration_cpu_stop);
              if (stopper->enabled) // false!!!

    That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
    stopper thread gets a chance to preempt and allows the cpu-down for
    the target CPU to complete.

    OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
    issue a wakeup, it must not be run under the scheduler locks.

    Solve this apparent contradiction by keeping preemption disabled over
    the unlock + queue_stopper combination:

            preempt_disable();
            task_rq_unlock(...);
            if (!stop_pending)
              stop_one_cpu_nowait(...)
            preempt_enable();

    This respects the lock ordering constraints while still avoiding the
    above race. That is, if we find the CPU is online under rq-lock, the
    targeted stop_one_cpu_nowait() must succeed.

    Apply this pattern to all similar stop_one_cpu_nowait() invocations.

    Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
    Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net

Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
2025-03-21 18:50:20 -03:00
Tomas Glozar 2b200783ae tracing/histogram: Fix semicolon.cocci warnings
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit feea69ec121f067073868cebe0cb9d003e64ad80
Author: kernel test robot <lkp@intel.com>
Date:   Sat Oct 30 08:56:15 2021 +0800

    tracing/histogram: Fix semicolon.cocci warnings

    kernel/trace/trace_events_hist.c:6039:2-3: Unneeded semicolon

     Remove unneeded semicolon.

    Generated by: scripts/coccinelle/misc/semicolon.cocci

    Link: https://lkml.kernel.org/r/20211030005615.GA41257@3074f0d39c61

    Fixes: c5eac6ee8bc5 ("tracing/histogram: Simplify handling of .sym-offset in expressions")
    CC: Kalesh Singh <kaleshsingh@google.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar 68f93cbb54 tracing/histogram: Fix check for missing operands in an expression
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit 1cab6bce42e62bba2ff2c2370d139618c1828b42
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Fri Nov 12 11:13:24 2021 -0800

    tracing/histogram: Fix check for missing operands in an expression

    If a binary operation is detected while parsing an expression string,
    the operand strings are deduced by splitting the expression string at
    the position of the detected binary operator. Both operand strings are
    sub-strings (can be empty string) of the expression string but will
    never be NULL.

    Currently a NULL check is used for missing operands, fix this by
    checking for empty strings instead.
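
Why a NULL check cannot catch a missing operand is easy to see in a small sketch. The function below is a hypothetical illustration (`split_operands_sketch()` is not a kernel function, and the kernel's parser splits the string differently); it only shows that both halves of an in-place split are valid pointers, possibly to empty strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Split expr in place at the first occurrence of op.
 * Returns false if op is absent or either operand is missing.
 */
static bool split_operands_sketch(char *expr, char op, char **lhs, char **rhs)
{
    char *sep;

    assert(op != '\0');

    sep = strchr(expr, op);
    if (!sep)
        return false;

    *sep = '\0';
    *lhs = expr;
    *rhs = sep + 1;

    /*
     * Both pointers are non-NULL here, so testing them against NULL
     * would never fire. A missing operand shows up as an empty string.
     */
    if ((*lhs)[0] == '\0' || (*rhs)[0] == '\0')
        return false;

    return true;
}
```

For an input such as "a/" the right-hand pointer refers to the terminating NUL of the original string, which is exactly the case the empty-string check rejects.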

    Link: https://lkml.kernel.org/r/20211112191324.1302505-1-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Fixes: 9710b2f341a0 ("tracing: Fix operator precedence for hist triggers expression")
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar adca273068 tracing/histogram: Optimize division by a power of 2
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit 722eddaa4043acee8f031cf238ced5f7514ad638
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Mon Oct 25 13:08:38 2021 -0700

    tracing/histogram: Optimize division by a power of 2

    The division is a slow operation. If the divisor is a power of 2, use a
    shift instead.
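
A minimal sketch of the idea, assuming unsigned 64-bit operands. The helper names are hypothetical, and this version tests the divisor on every call for simplicity; it does not claim to reproduce how the patch selects the shift path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x has exactly one bit set. */
static bool is_pow2_sketch(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* log2 of a power of two, computed with a simple loop. */
static unsigned int log2_pow2_sketch(uint64_t x)
{
    unsigned int shift = 0;

    while (x > 1) {
        x >>= 1;
        shift++;
    }
    return shift;
}

/* Divide val by divisor, using a shift when the divisor allows it. */
static uint64_t div_sketch(uint64_t val, uint64_t divisor)
{
    assert(divisor != 0);   /* zero handling is out of scope here */

    if (is_pow2_sketch(divisor))
        return val >> log2_pow2_sketch(divisor);

    return val / divisor;
}
```

Computing the shift on every call would eat much of the gain, so a real implementation would more plausibly compute it once when the divisor is known to be constant.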

    Results were obtained using Android's version of perf (simpleperf[1]) as
    described below:

    1. hist_field_div() is modified to call 2 test functions:
       test_hist_field_div_[not]_optimized(); passing them the
       same args. Use noinline and volatile to ensure these are
       not optimized out by the compiler.
    2. Create a hist event trigger that uses division:
          events/kmem/rss_stat$ echo 'hist:keys=common_pid:x=size/<divisor>'
             >> trigger
          events/kmem/rss_stat$ echo 'hist:keys=common_pid:vals=$x'
             >> trigger
    3. Run Android's lmkd_test[2] to generate rss_stat events, and
       record CPU samples with Android's simpleperf:
          simpleperf record -a --exclude-perf --post-unwind=yes -m 16384 -g
             -f 2000 -o perf.data

    == Results ==

    Divisor is a power of 2 (divisor == 32):

       test_hist_field_div_not_optimized  | 8,717,091 cpu-cycles
       test_hist_field_div_optimized      | 1,643,137 cpu-cycles

    If the divisor is a power of 2, the optimized version is ~5.3x faster.

    Divisor is not a power of 2 (divisor == 33):

       test_hist_field_div_not_optimized  | 4,444,324 cpu-cycles
       test_hist_field_div_optimized      | 5,497,958 cpu-cycles

    If the divisor is not a power of 2, as expected, the optimized version is
    slightly slower (~24% slower).

    [1] https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md
    [2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp

    Link: https://lkml.kernel.org/r/20211025200852.3002369-7-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Suggested-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar f234ceb346 tracing/histogram: Covert expr to const if both operands are constants
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit f47716b7a955e40e2591b960d1eccb1fde967a70
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Mon Oct 25 13:08:37 2021 -0700

    tracing/histogram: Covert expr to const if both operands are constants

    If both operands of a hist trigger expression are constants, convert the
    expression to a constant. This optimization avoids having to perform the
    same calculation multiple times and also saves on memory since the
    merged constants are represented by a single struct hist_field instead
    of multiple.
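
The folding step can be illustrated with a tiny expression node. This is a sketch under assumptions: `expr_sketch` and `fold_constants_sketch()` are hypothetical and much simpler than struct hist_field; they only show a binary node collapsing into a constant when both operands are constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical expression node for illustration. */
struct expr_sketch {
    enum { EXPR_CONST, EXPR_FIELD, EXPR_BINOP } kind;
    char op;                        /* '+', '-', '*' or '/' for EXPR_BINOP */
    uint64_t value;                 /* valid for EXPR_CONST */
    struct expr_sketch *lhs, *rhs;  /* valid for EXPR_BINOP */
};

/* Collapse e into a constant if both operands are constants. */
static bool fold_constants_sketch(struct expr_sketch *e)
{
    uint64_t a, b, result;

    assert(e != NULL);

    if (e->kind != EXPR_BINOP ||
        e->lhs->kind != EXPR_CONST || e->rhs->kind != EXPR_CONST)
        return false;

    a = e->lhs->value;
    b = e->rhs->value;

    switch (e->op) {
    case '+': result = a + b; break;
    case '-': result = a - b; break;
    case '*': result = a * b; break;
    case '/':
        if (b == 0)
            return false;           /* leave undefined division unfolded */
        result = a / b;
        break;
    default:
        return false;
    }

    /* The node now stands for a single constant. */
    e->kind = EXPR_CONST;
    e->value = result;
    e->lhs = NULL;
    e->rhs = NULL;
    return true;
}
```

After folding, the expression is evaluated once at parse time instead of on every event, and the two operand nodes are no longer needed.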

    Link: https://lkml.kernel.org/r/20211025200852.3002369-6-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Suggested-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar 65b7fbbb4e tracing/histogram: Simplify handling of .sym-offset in expressions
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit c5eac6ee8bc5d32e48b3845472b547574061f49f
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Mon Oct 25 13:08:36 2021 -0700

    tracing/histogram: Simplify handling of .sym-offset in expressions

    The '-' in .sym-offset can confuse the hist trigger arithmetic
    expression parsing. Simplify the handling of this by replacing the
    'sym-offset' with 'symXoffset'. This allows us to correctly evaluate
    expressions where the user may have inadvertently added a .sym-offset
    modifier to one of the operands in an expression, instead of bailing
    out. In this case the .sym-offset has no effect on the evaluation of the
    expression. The only valid use of the .sym-offset is as a hist key
    modifier.
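
The substitution itself amounts to overwriting one character before the expression is parsed. The helper below is a hypothetical sketch (`mask_sym_offset_sketch()` is not a kernel function) showing how the '-' inside ".sym-offset" can be hidden from an operator scan while a real subtraction elsewhere in the string is left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every ".sym-offset" in expr with ".symXoffset", in place. */
static void mask_sym_offset_sketch(char *expr)
{
    static const char needle[] = ".sym-offset";
    char *p = expr;

    assert(expr != NULL);

    while ((p = strstr(p, needle)) != NULL) {
        p[4] = 'X';                 /* index 4 is the '-' in the needle */
        p += sizeof(needle) - 1;    /* continue after this occurrence */
    }
}
```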

    Link: https://lkml.kernel.org/r/20211025200852.3002369-5-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Suggested-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar c61b871f9d tracing: Fix operator precedence for hist triggers expression
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit 9710b2f341a0d96f35b911580639853cfda4677d
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Mon Oct 25 13:08:35 2021 -0700

    tracing: Fix operator precedence for hist triggers expression

    The current histogram expression evaluation logic evaluates the
    expression from right to left. This can lead to incorrect results
    if the operations are not associative (as is the case for subtraction
    and, the now added, division operators).
            e.g. 16-8-4-2 should be 2 not 10 --> 16-8-4-2 = ((16-8)-4)-2
                 64/8/4/2 should be 1 not 16 --> 64/8/4/2 = ((64/8)/4)/2

    Division and multiplication are currently limited to single-operation
    expressions because operator precedence support is not yet implemented.

    Rework the expression parsing to support the correct evaluation of
    expressions containing operators of different precedences; and fix
    the associativity error by evaluating expressions with operators of
    the same precedence from left to right.

    Examples:
            (1) echo 'hist:keys=common_pid:a=8,b=4,c=2,d=1,w=$a-$b-$c-$d' \
                      >> event/trigger
            (2) echo 'hist:keys=common_pid:x=$a/$b/3/2' >> event/trigger
            (3) echo 'hist:keys=common_pid:y=$a+10/$c*1024' >> event/trigger
            (4) echo 'hist:keys=common_pid:z=$a/$b+$c*$d' >> event/trigger
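
    The difference between the two evaluation orders can be reproduced with a
    small model. This is a sketch under stated assumptions, not the kernel
    implementation: it handles only non-negative integer literals and the four
    operators, and the function names are hypothetical.

```python
import re


def tokenize(expr):
    """Split an expression into integer literals and operator characters."""
    return re.findall(r"\d+|[-+*/]", expr)


def apply(op, a, b):
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a // b  # integer division, operands are non-negative here


def eval_right_to_left(expr):
    """Old behaviour: a op (b op (c ...)), with no precedence handling."""
    toks = tokenize(expr)
    acc = int(toks[-1])
    for i in range(len(toks) - 2, 0, -2):
        acc = apply(toks[i], int(toks[i - 1]), acc)
    return acc


def eval_with_precedence(expr):
    """'*' and '/' bind tighter; equal precedence folds left to right."""
    toks = tokenize(expr)
    terms, ops = [int(toks[0])], []
    for i in range(1, len(toks), 2):
        op, val = toks[i], int(toks[i + 1])
        if op in "*/":
            terms[-1] = apply(op, terms[-1], val)  # fold into current term
        else:
            ops.append(op)
            terms.append(val)
    acc = terms[0]
    for op, val in zip(ops, terms[1:]):
        acc = apply(op, acc, val)
    return acc


print(eval_right_to_left("16-8-4-2"), eval_with_precedence("16-8-4-2"))  # 10 2
print(eval_right_to_left("64/8/4/2"), eval_with_precedence("64/8/4/2"))  # 16 1
print(eval_with_precedence("8+10/2*1024"))                               # 5128
```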

    Link: https://lkml.kernel.org/r/20211025200852.3002369-4-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Reviewed-by: Namhyung Kim <namhyung@kernel.org>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar 41bc755fde tracing: Add division and multiplication support for hist triggers
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit bcef044150320217e2a00c65050114e509c222b8
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Mon Oct 25 13:08:34 2021 -0700

    tracing: Add division and multiplication support for hist triggers

    Adds basic support for division and multiplication operations for
    hist trigger variable expressions.

    For simplicity, this patch only supports division and multiplication
    in single-operation expressions (e.g. x=$a/$b), because expressions
    are currently always evaluated right to left. This can lead to some
    incorrect results:

            e.g. echo 'hist:keys=common_pid:x=8-4-2' >> event/trigger

                 8-4-2 should evaluate to 2 i.e. (8-4)-2
                 but currently x evaluates to 6 i.e. 8-(4-2)

    Multiplication and division in sub-expressions will work correctly, once
    correct operator precedence support is added (See next patch in this
    series).

    For the undefined case of division by 0, the histogram expression
    evaluates to (u64)(-1). Since this cannot be detected when the
    expression is created, it is the responsibility of the user to be
    aware and account for this possibility.
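
    The division-by-zero convention can be modelled as follows. This is a
    minimal sketch of the behaviour described above, not the kernel code; the
    helper name hist_div() is hypothetical, and Python integers are masked to
    64 bits to imitate u64 arithmetic.

```python
U64_MAX = (1 << 64) - 1  # the value of (u64)(-1)


def hist_div(a: int, b: int) -> int:
    """Unsigned 64-bit division that yields (u64)(-1) for a zero divisor."""
    a &= U64_MAX
    b &= U64_MAX
    if b == 0:
        return U64_MAX
    return a // b


print(hist_div(8, 4))  # 2
print(hist_div(8, 0))  # 18446744073709551615
```

    A consumer of the histogram therefore has to treat the all-ones value as
    a possible "undefined" marker rather than as a genuine quotient.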

    Examples:
            echo 'hist:keys=common_pid:a=8,b=4,x=$a/$b' \
                       >> event/trigger

            echo 'hist:keys=common_pid:y=5*$b' \
                       >> event/trigger

    Link: https://lkml.kernel.org/r/20211025200852.3002369-3-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar 1f1a764877 tracing: Add support for creating hist trigger variables from literal
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit 52cfb373536a7fb744b0ec4b748518e5dc874fb7
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Mon Oct 25 13:08:33 2021 -0700

    tracing: Add support for creating hist trigger variables from literal

    Currently hist trigger expressions don't support the use of numeric
    literals:
            e.g. echo 'hist:keys=common_pid:x=$y-1234'
                    --> is not valid expression syntax

    Having the ability to use numeric constants in hist triggers supports
    a wider range of expressions for creating variables.

    Add support for creating trace event histogram variables from numeric
    literals.

            e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger

    A negative numeric constant is created using the unary minus operator
    (parentheses are required).

            e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger

    Constants can be used with division/multiplication (added in the
    next patch in this series) to implement granularity filters for frequent
    trace events. For instance we can limit emitting the rss_stat
    trace event to when there is a 512KB cross over in the rss size:

      # Create a synthetic event to monitor instead of the high frequency
      # rss_stat event
      echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr; int member; long size' \
            >> tracing/synthetic_events

      # Create a hist trigger that emits the synthetic rss_stat_throttled
      # event only when the rss size crosses a 512KB boundary.
      echo 'hist:keys=mm_id,member:bucket=size/0x80000:onchange($bucket).rss_stat_throttled(mm_id,curr,member,size)' \
            >> events/kmem/rss_stat/trigger
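
    The effect of the bucket=size/0x80000 granularity filter can be sketched
    in a few lines. This is an illustrative model only: the function name
    throttled() and the sample sizes are hypothetical, and emitting the very
    first sample is an assumption of this sketch, not a statement about how
    onchange() treats its initial value.

```python
BUCKET = 0x80000  # 512 KB


def throttled(sizes):
    """Yield sizes whose 512 KB bucket differs from the previous sample's."""
    last = None  # assumption: the first sample always counts as a change
    for size in sizes:
        bucket = size // BUCKET
        if bucket != last:
            yield size
        last = bucket


samples = [100_000, 300_000, 600_000, 700_000, 1_100_000, 500_000]
print(list(throttled(samples)))  # [100000, 600000, 1100000, 500000]
```

    Six raw samples collapse to four emitted events here, because samples
    that stay inside the same 512 KB bucket are dropped.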

    A use case for constants with addition/subtraction is not yet known,
    but for completeness the use of constants is supported for all
    operators.

    Link: https://lkml.kernel.org/r/20211025200852.3002369-2-kaleshsingh@google.com

    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00
Tomas Glozar ffbbdfa334 tracing: Have histogram types be constant when possible
JIRA: https://issues.redhat.com/browse/RHEL-67679

commit 3347d80baa41c357cf263923f60aa8051a753d76
Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
Date:   Thu Jul 22 10:27:06 2021 -0400

    tracing: Have histogram types be constant when possible

    Instead of calling kstrdup("const", GFP_KERNEL), simply assign the string
    literal directly: hist_field->type = "const". When the value is passed in
    as a variable, use kstrdup_const(var, GFP_KERNEL), which returns the
    original pointer without allocating if the string is already constant
    (i.e. it resides in .rodata). This avoids allocating when it is not needed.

    All frees of the hist_field->type will need to use kfree_const().

    Link: https://lkml.kernel.org/r/20210722142837.280718447@goodmis.org

    Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
    Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
2025-03-21 08:09:00 +01:00