Commit Graph

1181 Commits

Author SHA1 Message Date
Rafael Aquini e4205ccf96 kernel: be more careful about dup_mmap() failures and uprobe registering
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2025-21709
Conflicts:
  * kernel/events/uprobes.c: a notable context difference in the 1st hunk
    due to RHEL-9 missing the following upstream commits: 87195a1ee332a,
    2bf8e5aceff89, and dd1a7567784e2; and a notable context difference in
    the 2nd hunk due to RHEL-9 missing the following upstream commits:
    84455e6923c7 and 8617408f7a01. None of the commits listed above is
    of any relevance to this backport work.

This patch is a backport of the following upstream commit:
commit 64c37e134b120fb462fb4a80694bfb8e7be77b14
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Mon Jan 27 12:02:21 2025 -0500

    kernel: be more careful about dup_mmap() failures and uprobe registering

    If a memory allocation fails during dup_mmap(), the maple tree can be left
    in an unsafe state for other iterators besides the exit path.  All the
    locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
    incomplete mm_struct can be reached through (at least) the rmap finding
    the vmas which have a pointer back to the mm_struct.

    Up to this point, there have been no issues with being able to find an
    mm_struct that was only partially initialised.  Syzbot was able to make
    the incomplete mm_struct fail with recent forking changes, so it has been
    proven unsafe to use the mm_struct that hasn't been initialised, as
    referenced in the link below.

    Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
    invalid mm") fixed the uprobe access, it does not completely remove the
    race.

    This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
    oom side (even though this is extremely unlikely to be selected as an oom
    victim in the race window), and sets MMF_UNSTABLE to avoid other potential
    users from using a partially initialised mm_struct.

    When registering vmas for uprobe, skip the vmas in an mm that is marked
    unstable.  Modifying a vma in an unstable mm may cause issues if the mm
    isn't fully initialised.

    Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
    Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Jann Horn <jannh@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:53 -04:00
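
As a rough illustration of the flag-and-skip pattern described in the message above, here is a minimal userspace sketch; the toy_* structures and helpers are invented stand-ins, and only the flag names mirror the kernel's MMF_OOM_SKIP/MMF_UNSTABLE bits:

```
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for mm_struct and its flag bits. */
#define MMF_OOM_SKIP  (1UL << 0)
#define MMF_UNSTABLE  (1UL << 1)

struct toy_mm {
	unsigned long flags;
	int nr_vmas;		/* how many VMAs were actually duplicated */
};

/* Duplication that can fail partway through, leaving a partial mm. */
static bool toy_dup_mmap(struct toy_mm *mm, int want, int fail_at)
{
	for (int i = 0; i < want; i++) {
		if (i == fail_at) {
			/* Mark the mm so other users skip it instead of
			 * walking a partially built VMA set. */
			mm->flags |= MMF_OOM_SKIP | MMF_UNSTABLE;
			return false;
		}
		mm->nr_vmas++;
	}
	return true;
}

/* Uprobe-style walker: refuse to touch an unstable mm. */
static void toy_register_for_each_vma(const struct toy_mm *mm)
{
	if (mm->flags & MMF_UNSTABLE) {
		puts("mm unstable, skipping registration");
		return;
	}
	printf("registering against %d vmas\n", mm->nr_vmas);
}

int main(void)
{
	struct toy_mm mm = { 0 };

	if (!toy_dup_mmap(&mm, 8, 3))
		puts("dup failed partway");
	toy_register_for_each_vma(&mm);
	return 0;
}
```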
Rafael Aquini 6abde438ad fork: avoid inappropriate uprobe access to invalid mm
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
  * kernel/fork.c: minor difference from upstream due to an extra blank line
      that was left behind when commit d24062914837 ("fork: use __mt_dup()
      to duplicate maple tree in dup_mmap()") was backported into RHEL-9

This patch is a backport of the following upstream commit:
commit 8ac662f5da19f5873fdd94c48a5cdb45b2e1b58f
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Dec 10 17:24:12 2024 +0000

    fork: avoid inappropriate uprobe access to invalid mm

    If dup_mmap() encounters an issue, currently uprobe is able to access the
    relevant mm via the reverse mapping (in build_map_info()), and if we are
    very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
    we establish as part of the fork error path.

    This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
    in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
    invoking vma_iter_load() which uses the advanced maple tree API and thus
    is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
    d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
    dup_mmap()").

    This change was made on the assumption that only process tear-down code
    would actually observe (and make use of) these values.  However this very
    unlikely but still possible edge case with uprobes exists and
    unfortunately does make these observable.

    The uprobe operation prevents races against the dup_mmap() operation via
    the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
    and dropped via uprobe_end_dup_mmap(), and held across
    register_for_each_vma() prior to invoking build_map_info() which does the
    reverse mapping lookup.

    Currently these are acquired and dropped within dup_mmap(), which exposes
    the race window prior to error handling in the invoking dup_mm() which
    tears down the mm.

    We can avoid all this by just moving the invocation of
    uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
    and only release this lock once the dup_mmap() operation succeeds or clean
    up is done.

    This means that the uprobe code can never observe an incompletely
    constructed mm and resolves the issue in this case.

    Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Kan Liang <kan.liang@linux.intel.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:52 -04:00
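
A toy model of the scope change described above: the exclusion provided by uprobe_start_dup_mmap()/uprobe_end_dup_mmap() is taken by the caller and only dropped once duplication has either succeeded or been fully torn down. The mutex and helpers below are simplified stand-ins, not the kernel's percpu rw_semaphore:

```
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for dup_mmap_sem: held while an address space is being copied. */
static pthread_mutex_t dup_mmap_sem = PTHREAD_MUTEX_INITIALIZER;

static bool toy_dup_mmap(bool inject_error)
{
	/* ... copy VMAs; may fail partway through ... */
	return !inject_error;
}

static void toy_cleanup_partial_mm(void)
{
	/* ... tear down whatever was copied before the failure ... */
}

/*
 * Old shape: the lock lived inside toy_dup_mmap(), so observers could run
 * between a failed copy and the cleanup done by the caller.  New shape,
 * as below: the caller holds it across both, so a partially built mm is
 * never visible to the uprobe side.
 */
static int toy_dup_mm(bool inject_error)
{
	int ret = 0;

	pthread_mutex_lock(&dup_mmap_sem);	/* uprobe_start_dup_mmap() */
	if (!toy_dup_mmap(inject_error)) {
		toy_cleanup_partial_mm();
		ret = -1;
	}
	pthread_mutex_unlock(&dup_mmap_sem);	/* uprobe_end_dup_mmap() */
	return ret;
}

int main(void)
{
	printf("fork ok:   %d\n", toy_dup_mm(false));
	printf("fork fail: %d\n", toy_dup_mm(true));
	return 0;
}
```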
Patrick Talbert 143d5ac2a9 Merge: CVE-2024-50271: ucounts: Split rlimit and ucount values and max values
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6027

JIRA: https://issues.redhat.com/browse/RHEL-68020

CVE: CVE-2024-50271

- 012f4d5d25e9ef92ee129bd5aa7aa60f692681e1 signal: restore the override_rlimit logic
- de399236e240743ad2dd10d719c37b97ddf31996 ucounts: Split rlimit and ucount values and max values

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2025-02-03 10:00:41 -05:00
Radostin Stoyanov 46364cd74c ucounts: Split rlimit and ucount values and max values
JIRA: https://issues.redhat.com/browse/RHEL-68020
CVE: CVE-2024-50271

commit de399236e240743ad2dd10d719c37b97ddf31996
Author: Alexey Gladkov <legion@kernel.org>
Date:   Wed May 18 19:17:30 2022 +0200

    ucounts: Split rlimit and ucount values and max values

    Since the semantics of maximum rlimit values are different, it would be
    better not to mix ucount and rlimit values. This will prevent the error
    of using inc_count/dec_ucount for rlimit parameters.

    This patch also renames the functions to emphasize the lack of
    connection between rlimit and ucount.

    v3:
    - Fix BUG:KASAN:use-after-free_in_dec_ucount.

    v2:
    - Fix the array-index-out-of-bounds that was found by the lkp project.

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Alexey Gladkov <legion@kernel.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2024-12-20 15:31:08 +00:00
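
A rough sketch of the split, assuming only what the message states: rlimit-style values get their own storage and their own, clearly named helpers, separate from plain ucounts. The enum and field names below are invented for illustration and are not the kernel's:

```
#include <stdio.h>

/* Invented enums for the sketch: plain ucounts vs rlimit-style counts. */
enum toy_ucount { TOY_UCOUNT_USER_NAMESPACES, TOY_UCOUNT_MAX };
enum toy_rlimit { TOY_RLIMIT_NPROC, TOY_RLIMIT_MAX };

struct toy_ucounts {
	long ucount[TOY_UCOUNT_MAX];	/* plain counters */
	long rlimit[TOY_RLIMIT_MAX];	/* rlimit values, different semantics */
};

/*
 * Distinct helpers, so an rlimit value can no longer be bumped by mistake
 * through the plain-ucount interface.
 */
static void toy_inc_ucount(struct toy_ucounts *u, enum toy_ucount t)
{
	u->ucount[t]++;
}

static void toy_inc_rlimit_ucounts(struct toy_ucounts *u, enum toy_rlimit t)
{
	u->rlimit[t]++;
}

int main(void)
{
	struct toy_ucounts u = { 0 };

	toy_inc_ucount(&u, TOY_UCOUNT_USER_NAMESPACES);
	toy_inc_rlimit_ucounts(&u, TOY_RLIMIT_NPROC);
	printf("userns=%ld nproc=%ld\n",
	       u.ucount[TOY_UCOUNT_USER_NAMESPACES], u.rlimit[TOY_RLIMIT_NPROC]);
	return 0;
}
```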
Rafael Aquini a0b0ebfbe2 fork: only invoke khugepaged, ksm hooks if no error
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 985da552a98e27096444508ce5d853244019111f
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Oct 15 18:56:06 2024 +0100

    fork: only invoke khugepaged, ksm hooks if no error

    There is no reason to invoke these hooks early against an mm that is in an
    incomplete state.

    The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
    maple tree in dup_mmap()") makes this more pertinent as we may be in a
    state where entries in the maple tree are not yet consistent.

    Their placement early in dup_mmap() only appears to have been meaningful
    for early error checking, and since functionally it'd require a very small
    allocation to fail (in practice 'too small to fail') that'd only occur in
    the most dire circumstances, meaning the fork would fail or be OOM'd in
    any case.

    Since both khugepaged and KSM tracking are there to provide optimisations
    to memory performance rather than critical functionality, it doesn't
    really matter all that much if, under such dire memory pressure, we fail
    to register an mm with these.

    As a result, we follow the example of commit d2081b2bf819 ("mm:
    khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
    void function also.

    We only expose the mm to these functions once we are done with them and
    only if no error occurred in the fork operation.

    Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: Jann Horn <jannh@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Jann Horn <jannh@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Linus Torvalds <torvalds@linuxfoundation.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:52 -05:00
Rafael Aquini 4ecda75778 fork: do not invoke uffd on fork if error occurs
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit f64e67e5d3a45a4a04286c47afade4b518acd47b
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Oct 15 18:56:05 2024 +0100

    fork: do not invoke uffd on fork if error occurs

    Patch series "fork: do not expose incomplete mm on fork".

    During fork we may place the virtual memory address space into an
    inconsistent state before the fork operation is complete.

    In addition, we may encounter an error during the fork operation that
    indicates that the virtual memory address space is invalidated.

    As a result, we should not be exposing it in any way to external machinery
    that might interact with the mm or VMAs, machinery that is not designed to
    deal with incomplete state.

    We specifically update the fork logic to defer khugepaged and ksm to the
    end of the operation and only to be invoked if no error arose, and
    disallow uffd from observing fork events should an error have occurred.

    This patch (of 2):

    Currently on fork we expose the virtual address space of a process to
    userland unconditionally if uffd is registered in VMAs, regardless of
    whether an error arose in the fork.

    This is performed in dup_userfaultfd_complete() which is invoked
    unconditionally, and performs two duties - invoking registered handlers
    for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
    userfaultfd_fork_ctx objects established in dup_userfaultfd().

    This is problematic, because the virtual address space may not yet be
    correctly initialised if an error arose.

    The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
    maple tree in dup_mmap()") makes this more pertinent as we may be in a
    state where entries in the maple tree are not yet consistent.

    We address this by, on fork error, ensuring that we roll back state that
    we would otherwise expect to clean up through the event being handled by
    userland and perform the memory freeing duty otherwise performed by
    dup_userfaultfd_complete().

    We do this by implementing a new function, dup_userfaultfd_fail(), which
    performs the same loop, only decrementing reference counts.

    Note that we perform mmgrab() on the parent and child mm's, however
    userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
    zero, so we will avoid memory leaks correctly here.

    Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
    Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: Jann Horn <jannh@google.com>
    Reviewed-by: Jann Horn <jannh@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Linus Torvalds <torvalds@linuxfoundation.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:51 -05:00
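
A compact model of the two exit paths described above, under the assumption that each pending fork context carries one reference that is either consumed by the completion path or simply dropped on failure; the types and helpers are stand-ins, not the userfaultfd code:

```
#include <stdio.h>

/* Stand-in for a userfaultfd fork context queued during dup_mmap(). */
struct toy_fork_ctx {
	struct toy_fork_ctx *next;
	int refs;
};

/* Success path: deliver the fork event to userland, then release. */
static void toy_dup_complete(struct toy_fork_ctx *list)
{
	for (struct toy_fork_ctx *c = list; c; c = c->next) {
		printf("deliver UFFD_EVENT_FORK, refs %d -> %d\n",
		       c->refs, c->refs - 1);
		c->refs--;
	}
}

/*
 * Failure path: walk the same list, but only drop the references; the
 * half-built address space is never exposed to userland.
 */
static void toy_dup_fail(struct toy_fork_ctx *list)
{
	for (struct toy_fork_ctx *c = list; c; c = c->next) {
		printf("drop ref, refs %d -> %d\n", c->refs, c->refs - 1);
		c->refs--;
	}
}

int main(void)
{
	struct toy_fork_ctx b = { NULL, 1 }, a = { &b, 1 };

	toy_dup_complete(&a);	/* the dup_userfaultfd_complete() shape */
	a.refs = b.refs = 1;
	toy_dup_fail(&a);	/* the new failure-path shape */
	return 0;
}
```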
Rafael Aquini 15c37b2ac9 fork: use __mt_dup() to duplicate maple tree in dup_mmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * kernel/fork.c: differences on the 3rd and 4th hunks are due to out-of-order
    backport of commit 35e351780fa9 ("fork: defer linking file vma until vma is
    fully initialized"), and we have one RHEL-only hunk here that reverts commit
    2b4f3b4987b5 ("fork: lock VMAs of the parent process when forking") which was
    a temporary measure upstream that ended up in a dead branch after v6.6;

This patch is a backport of the following upstream commit:
commit d2406291483775ecddaee929231a39c70c08fda2
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date:   Fri Oct 27 11:38:45 2023 +0800

    fork: use __mt_dup() to duplicate maple tree in dup_mmap()

    In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
    directly replacing the entries of VMAs in the new maple tree can result in
    better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
    tree, so it is efficient.

    The average time complexity of __mt_dup() is O(n), where n is the number
    of VMAs.  The proof of the time complexity is provided in the commit log
    that introduces __mt_dup().  After duplicating the maple tree, each
    element is traversed and replaced (ignoring the cases of deletion, which
    are rare).  Since it is only a replacement operation for each element,
    this process is also O(n).

    Analyzing the exact time complexity of the previous algorithm is
    challenging because each insertion can involve appending to a node,
    pushing data to adjacent nodes, or even splitting nodes.  The frequency of
    each action is difficult to calculate.  The worst-case scenario for a
    single insertion is when the tree undergoes splitting at every level.  If
    we consider each insertion as the worst-case scenario, we can determine
    that the upper bound of the time complexity is O(n*log(n)), although this
    is a loose upper bound.  However, based on the test data, it appears that
    the actual time complexity is likely to be O(n).

    As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
    fails, there will be a portion of VMAs that have not been duplicated in
    the maple tree.  To handle this, we mark the failure point with
    XA_ZERO_ENTRY.  In exit_mmap(), if this marker is encountered, stop
    releasing VMAs that have not been duplicated after this point.

    There is a "spawn" in byte-unixbench[1], which can be used to test the
    performance of fork().  I modified it slightly to make it work with
    different number of VMAs.

    Below are the test results.  The first row shows the number of VMAs.  The
    second and third rows show the number of fork() calls per ten seconds,
    corresponding to next-20231006 and this patchset, respectively.  The
    test results were obtained with CPU binding to avoid scheduler load
    balancing that could cause unstable results.  There are still some
    fluctuations in the test results, but at least they are better than the
    original performance.

    21     121   221    421    821    1621   3221   6421   12821  25621  51221
    112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
    114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
    2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%

    [1] https://github.com/kdlucas/byte-unixbench/tree/master

    Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Mateusz Guzik <mjguzik@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Mike Christie <michael.christie@oracle.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:34 -05:00
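
A userspace sketch of the duplicate-then-replace scheme, with a sentinel standing in for XA_ZERO_ENTRY: the whole index is copied first, each slot is then replaced with the child's own copy, and on failure the stop point is marked so teardown never walks past entries that were never duplicated. All names below are invented for the sketch:

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_VMAS		6
#define TOY_ZERO_ENTRY	((void *)1)	/* sentinel marking the failure point */

struct toy_vma { int id; };

/* Step 1: bulk-copy the whole index (the role __mt_dup() plays for the tree). */
static void toy_mt_dup(struct toy_vma **dst, struct toy_vma **src)
{
	memcpy(dst, src, NR_VMAS * sizeof(*dst));
}

/* Step 2: replace each slot with a child-owned copy; mark the failure point. */
static int toy_dup_mmap(struct toy_vma **child, struct toy_vma **parent, int fail_at)
{
	toy_mt_dup(child, parent);
	for (int i = 0; i < NR_VMAS; i++) {
		if (i == fail_at) {
			child[i] = TOY_ZERO_ENTRY;
			return -1;
		}
		child[i] = malloc(sizeof(**child));
		child[i]->id = parent[i]->id;
	}
	return 0;
}

/* Teardown: stop at the sentinel; later slots were never duplicated. */
static void toy_exit_mmap(struct toy_vma **child)
{
	for (int i = 0; i < NR_VMAS && child[i] != TOY_ZERO_ENTRY; i++)
		free(child[i]);
}

int main(void)
{
	struct toy_vma p[NR_VMAS], *parent[NR_VMAS], *child[NR_VMAS];

	for (int i = 0; i < NR_VMAS; i++) {
		p[i].id = i;
		parent[i] = &p[i];
	}
	if (toy_dup_mmap(child, parent, 4))
		puts("dup failed, tearing down only what was copied");
	toy_exit_mmap(child);
	return 0;
}
```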
Audra Mitchell bc133597f4 lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
JIRA: https://issues.redhat.com/browse/RHEL-55462
Conflicts:
  Minor extra RHEL-only hunk to create the required CONFIG_DEBUG_VM_SHOOT_LAZIES
  file under RHEL's config database.

This patch is a backport of the following upstream commit:
commit 2655421ae69fa479df1575cb2630af9131d28939
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Feb 3 17:18:36 2023 +1000

    lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

    On big systems, the mm refcount can become highly contended when doing a
    lot of context switching with threaded applications.  user<->idle switch
    is one of the important cases.  Abandoning lazy tlb entirely slows this
    switching down quite a bit in the common uncontended case, so that is not
    viable.

    Implement a scheme where lazy tlb mm references do not contribute to the
    refcount, instead they get explicitly removed when the refcount reaches
    zero.

    The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
    switch away from this mm to init_mm if it was being used as the lazy tlb
    mm.  Enabling the shoot lazies option therefore requires that the arch
    ensures that mm_cpumask contains all CPUs that could possibly be using mm.
    A DEBUG_VM option IPIs every CPU in the system after this to ensure there
    are no references remaining before the mm is freed.

    Shootdown IPIs cost could be an issue, but they have not been observed to
    be a serious problem with this scheme, because short-lived processes tend
    not to migrate CPUs much, therefore they don't get much chance to leave
    lazy tlb mm references on remote CPUs.  There are a lot of options to
    reduce them if necessary, described in comments.

    The near-worst-case can be benchmarked with will-it-scale:

      context_switch1_threads -t $(($(nproc) / 2))

    This will create nproc threads (nproc / 2 switching pairs) all sharing the
    same mm that spread over all CPUs so each CPU does thread->idle->thread
    switching.

    [ Rik came up with basically the same idea a few years ago, so credit
      to him for that. ]

    Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
    Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
    Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:17 -05:00
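
A minimal model of the non-refcounting scheme: lazy users take no reference, and the final drop walks the CPUs (standing in for the IPI broadcast to mm_cpumask) to redirect any remaining lazy references to init_mm. The toy_* types are invented for the sketch:

```
#include <stdio.h>

#define NR_CPUS 4

struct toy_mm { int refcount; };

static struct toy_mm *lazy_mm_on_cpu[NR_CPUS];	/* per-CPU lazy tlb reference */
static struct toy_mm init_mm = { .refcount = 1 };

/* Lazy use: the CPU keeps the mm loaded but takes no reference on it. */
static void toy_enter_lazy(int cpu, struct toy_mm *mm)
{
	lazy_mm_on_cpu[cpu] = mm;
}

/*
 * Final drop: instead of waiting for lazy users to release references,
 * shoot them down: every CPU still pointing at this mm is switched over
 * to init_mm before the mm is freed.
 */
static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->refcount)
		return;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)	/* models the IPI broadcast */
		if (lazy_mm_on_cpu[cpu] == mm)
			lazy_mm_on_cpu[cpu] = &init_mm;
	puts("mm freed, lazy users redirected to init_mm");
}

int main(void)
{
	struct toy_mm mm = { .refcount = 1 };

	toy_enter_lazy(1, &mm);
	toy_enter_lazy(3, &mm);
	toy_mmdrop(&mm);		/* last real reference */
	return 0;
}
```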
Waiman Long 49848304ee rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon, 5 Feb 2024 13:10:19 -0800

    rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks

    Holding a mutex across synchronize_rcu_tasks() and acquiring
    that same mutex in code called from do_exit() after its call to
    exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
    results in deadlock.  This is by design, because tasks that are far
    enough into do_exit() are no longer present on the tasks list, making
    it a bit difficult for RCU Tasks to find them, let alone wait on them
    to do a voluntary context switch.  However, such deadlocks are becoming
    more frequent.  In addition, lockdep currently does not detect such
    deadlocks and they can be difficult to reproduce.

    In addition, if a task voluntarily context switches during that time
    (for example, if it blocks acquiring a mutex), then this task is in an
    RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
    just as well take advantage of that fact.

    This commit therefore initializes the data structures that will be needed
    to rely on these quiescent states and to eliminate these deadlocks.

    Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

    Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
    Reported-by: Yang Jihong <yangjihong1@huawei.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Tested-by: Yang Jihong <yangjihong1@huawei.com>
    Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:18 -04:00
Rafael Aquini 81cf488de5 fork: lock VMAs of the parent process when forking
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit fb49c455323ff8319a123dd312be9082c49a23a5
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Sat Jul 8 12:12:12 2023 -0700

    fork: lock VMAs of the parent process when forking

    When forking a child process, the parent write-protects anonymous pages
    and COW-shares them with the child being forked using copy_present_pte().

    We must not take any concurrent page faults on the source vma's as they
    are being processed, as we expect both the vma and the pte's behind it
    to be stable.  For example, the anon_vma_fork() expects the parents
    vma->anon_vma to not change during the vma copy.

    A concurrent page fault on a page newly marked read-only by the page
    copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
    source vma, defeating the anon_vma_clone() that wasn't done because the
    parent vma originally didn't have an anon_vma, but we now might end up
    copying a pte entry for a page that has one.

    Before the per-vma lock based changes, the mmap_lock guaranteed
    exclusion with concurrent page faults.  But now we need to do a
    vma_start_write() to make sure no concurrent faults happen on this vma
    while it is being processed.

    This fix can potentially regress some fork-heavy workloads.  Kernel
    build time did not show noticeable regression on a 56-core machine while
    a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
    shows ~5% regression.  If such fork time regression is unacceptable,
    disabling CONFIG_PER_VMA_LOCK should restore its performance.  Further
    optimizations are possible if this regression proves to be problematic.

    Suggested-by: David Hildenbrand <david@redhat.com>
    Reported-by: Jiri Slaby <jirislaby@kernel.org>
    Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
    Reported-by: Jacob Young <jacobly.alt@gmail.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
    Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
    Cc: stable@vger.kernel.org
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:22 -04:00
Rafael Aquini 6f66ad6087 fork: lock VMAs of the parent process when forking
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit 2b4f3b4987b56365b981f44a7e843efa5b6619b9
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Wed Jul 5 18:13:59 2023 -0700

    fork: lock VMAs of the parent process when forking

    Patch series "Avoid memory corruption caused by per-VMA locks", v4.

    A memory corruption was reported in [1] with bisection pointing to the
    patch [2] enabling per-VMA locks for x86.  Based on the reproducer
    provided in [1] we suspect this is caused by the lack of VMA locking while
    forking a child process.

    Patch 1/2 in the series implements proper VMA locking during fork.  I
    tested the fix locally using the reproducer and was unable to reproduce
    the memory corruption problem.

    This fix can potentially regress some fork-heavy workloads.  Kernel build
    time did not show noticeable regression on a 56-core machine while a
    stress test mapping 10000 VMAs and forking 5000 times in a tight loop
    shows ~7% regression.  If such fork time regression is unacceptable,
    disabling CONFIG_PER_VMA_LOCK should restore its performance.  Further
    optimizations are possible if this regression proves to be problematic.

    Patch 2/2 disables per-VMA locks until the fix is tested and verified.

    This patch (of 2):

    When forking a child process, parent write-protects an anonymous page and
    COW-shares it with the child being forked using copy_present_pte().
    Parent's TLB is flushed right before we drop the parent's mmap_lock in
    dup_mmap().  If we get a write-fault before that TLB flush in the parent,
    and we end up replacing that anonymous page in the parent process in
    do_wp_page() (because, COW-shared with the child), this might lead to some
    stale writable TLB entries targeting the wrong (old) page.  Similar issue
    happened in the past with userfaultfd (see flush_tlb_page() call inside
    do_wp_page()).

    Lock VMAs of the parent process when forking a child, which prevents
    concurrent page faults during fork operation and avoids this issue.  This
    fix can potentially regress some fork-heavy workloads.  Kernel build time
    did not show noticeable regression on a 56-core machine while a stress
    test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7%
    regression.  If such fork time regression is unacceptable, disabling
    CONFIG_PER_VMA_LOCK should restore its performance.  Further optimizations
    are possible if this regression proves to be problematic.

    Link: https://lkml.kernel.org/r/20230706011400.2949242-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20230706011400.2949242-2-surenb@google.com
    Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Reported-by: Jiri Slaby <jirislaby@kernel.org>
    Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
    Reported-by: Jacob Young <jacobly.alt@gmail.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:18 -04:00
Lucas Zampieri bf8d4449d4 Merge: fork: defer linking file vma until vma is fully initialized
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4349

JIRA: https://issues.redhat.com/browse/RHEL-35022  
CVE: CVE-2024-27022  
  
Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-17 12:51:50 +00:00
Lucas Zampieri ef511f48cf Merge: cgroup: Reapply the cgroup commits merged in RHEL-34600
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4257

JIRA: https://issues.redhat.com/browse/RHEL-36683    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4257

CS9 MR #4128 ("cgroup: Backport upstream cgroup commits up to v6.8")
and MR #3975 ("Scheduler: rhel9.5 updates") have 2 commits in common.

 - commit 79462e8c879a ("sched: don't account throttle time for empty groups")
 - commit 677ea015f231 ("sched: add throttled time stat for throttled children")

However, the merging of MR #4128 on top of MR #3975 produced an
unexpected result that the 2 new functions (cgroup_local_stat_show &
cpu_local_stat_show) introduced by commit 677ea015f231 ("sched: add
throttled time stat for throttled children") were duplicated resulting
in build failure. This leads to the revert of MR #4128 by the maintainer
to address the build problem.

With the merge and revert commits in place, it is now no longer possible
to rebase the original MR on top of main. There are now two ways to
work around it. In either case, a new MR and a Jira issue will be needed.

 1) Reapplying all the patches except the 2 duplicated ones on top of main.
 2) Revert the revert commit and remove the duplicated functions.

The second alternative is chosen as it will be easier to review and 
we don't need to duplicate the commits with a different set of git
hashes. The first patch is just the revert of the revert as all the
code changes had been reviewed before. The second patch removes the
duplicated functions.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-05 20:05:37 +00:00
Lucas Zampieri 304e2a4e29 Merge: [RHEL-9.5.0] iommu and dma mapping api updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4151

# Merge Request Required Information

JIRA: https://issues.redhat.com/browse/RHEL-28780  
JIRA: https://issues.redhat.com/browse/RHEL-12083  
JIRA: https://issues.redhat.com/browse/RHEL-12322  
JIRA: https://issues.redhat.com/browse/RHEL-29105  
JIRA: https://issues.redhat.com/browse/RHEL-29357  
JIRA: https://issues.redhat.com/browse/RHEL-29359  

Omitted-fix: ed8b94f6e0ac ("powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add")  
     - Reverted by 1fba2bf8e9d5 ("Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add"")

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git  
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git  
                 branch: next   
Tested: In progress  

	- general cki coverage  

	- Nvidia testing arm-smmu-v3 and iommufd related changes they have requested.  

	- Multiple rounds testing of amd_iommu, intel_iommu, and arm-smmu-v3 with  
	  various iommu configurations with disk i/o using fio,  
	  covering lazy iotlb invalidation, strict iotlb invalidation,  
	  and passthrough. Also tested with forcedac set. Intel  
	  Scalable Mode capable systems tested with the iotlb invalidation  
	  policies, and passthrough with scalable mode enabled, and disabled.  
	  AMD systems tested with v1 page tables and v2.

	- Tested booting with various iommu configurations, and verifying system  
	  in correct state on AMD, Intel, and ARM.  

	- Limited test on ppc64le. The system I had access to was  
	  setting up a 64-bit bypass window, and using dma_direct  
	  calls.  It ran, but since I don't normally touch ppc64le  
	  iommu code, I need to investigate more or get IBM assistance  
	  to more thoroughly test it.  

	- Working on getting testing assistance from IBM for the s390x changes.  

## Summary of Changes


This brings iommu, iommufd, and the dma mapping api up to 6.9 with some additions from Joerg's  
next branch, minus some commits from a 6.9 SEV-SNP pull for AMD. Some highlights:  

- The removal of the amd_iommu_v2 code, and the addition of its replacement based on the    
  iommu core SVA api, along with a re-org of the amd_iommu code.  
- The migration of s390 to the iommu core dma-iommu dma ops implementation, joining Intel,  
  AMD, and ARM as users of the same code base.  
- The beginnings of a re-work of the arm-smmu-v3 driver by Jason, and others.  
- A number of changes to iommufd as it continues to get fleshed out.  
- IOPT memory usage observability (code that was basis for talk at LPC last year)  

  Example output in vmstat files:  

```
    # grep iommu /sys/devices/system/node/node*/vmstat  
    /sys/devices/system/node/node0/vmstat:nr_iommu_pages 342  
    /sys/devices/system/node/node1/vmstat:nr_iommu_pages 0  
```

- Continued work on shared virtual addressing and io page faulting (PRI).  
- Dynamic swiotlb memory pools. This is not enabled yet, as they still seem to be  
  shaking out issues upstream, but the code is in place now.  
- Re-working of iommu core domain allocation.  

Note: iommufd selftest is being enabled in separate work that has been delegated to    
      another engineer starting to help with iommu. So that will be enabled in the  
      next few weeks to add more coverage for iommufd.  

Conflict-wise, they should be noted in the individual commits, but  
not too bad overall. 13/30 were dropping unsupported bits, and another  
8 were context diffs. A couple were caused by out-of-order backports due  
to fixes, and a couple were upstream conflicts from colliding patchsets that  
had to be resolved in the merge commits.  

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Lenny Szubowicz <lszubowi@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-05 20:03:50 +00:00
Rafael Aquini 2c0f74959e fork: defer linking file vma until vma is fully initialized
JIRA: https://issues.redhat.com/browse/RHEL-35022
CVE: CVE-2024-27022
Conflicts:
    * context differences due to RHEL missing upstream commit d24062914837
      ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()"), which
      is part of a v6.8 series targeting performance enhancements for the
      maple tree VMA iterator (unrelated to the security fix in this port)

This patch is a backport of the following upstream commit:
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Wed Apr 10 17:14:41 2024 +0800

    fork: defer linking file vma until vma is fully initialized

    Thorvald reported a WARNING [1]. And the root cause is below race:

     CPU 1                                  CPU 2
     fork                                   hugetlbfs_fallocate
      dup_mmap                               hugetlbfs_punch_hole
       i_mmap_lock_write(mapping);
       vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
       i_mmap_unlock_write(mapping);
       hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                             i_mmap_lock_write(mapping);
                                             hugetlb_vmdelete_list
                                              vma_interval_tree_foreach
                                               hugetlb_vma_trylock_write -- Vma_lock is cleared.
       tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                               hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                             i_mmap_unlock_write(mapping);

    hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the
    i_mmap_rwsem lock while the vma lock can be used at the same time.  Fix this
    by deferring linking file vma until vma is fully initialized.  Those vmas
    should be initialized first before they can be used.

    Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
    Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reported-by: Thorvald Natvig <thorvald@google.com>
    Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
    Reviewed-by: Jane Chu <jane.chu@oracle.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Mateusz Guzik <mjguzik@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Tycho Andersen <tandersen@netflix.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-05-28 09:31:39 -04:00
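
The shape of the fix in miniature: complete every piece of per-vma initialisation first, and only then make the copy reachable through the shared mapping. The helpers below are stand-ins for the hugetlb/i_mmap code, not the actual functions:

```
#include <stdbool.h>
#include <stdio.h>

struct toy_vma {
	bool private_state_ready;	/* e.g. the hugetlb vma_lock */
	bool visible_in_i_mmap;		/* reachable by concurrent walkers */
};

static void toy_init_private_state(struct toy_vma *v)
{
	v->private_state_ready = true;
}

static void toy_vm_ops_open(struct toy_vma *v)
{
	(void)v;	/* further per-vma setup */
}

static void toy_link_into_i_mmap(struct toy_vma *v)
{
	/*
	 * Anything reachable from here may be walked concurrently (as in the
	 * hugetlbfs_punch_hole race above), so it must already be fully set up.
	 */
	v->visible_in_i_mmap = true;
}

int main(void)
{
	struct toy_vma child = { 0 };

	/* Fixed ordering: initialise first, then publish. */
	toy_init_private_state(&child);
	toy_vm_ops_open(&child);
	toy_link_into_i_mmap(&child);

	printf("ready=%d visible=%d\n",
	       child.private_state_ready, child.visible_in_i_mmap);
	return 0;
}
```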
Waiman Long 6d0328a7cf Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8""
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only

This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-18 21:38:20 -04:00
Lucas Zampieri 08637d76a2 Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
This reverts merge request !4128
2024-05-16 15:26:41 +00:00
Lucas Zampieri 1ce55b7cbb Merge: cgroup: Backport upstream cgroup commits up to v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

JIRA: https://issues.redhat.com/browse/RHEL-34600    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

This MR backports upstream cgroup commits up to v6.8 with related fixes,
if applicable. It also pulls in a number of scheduler and PSI related
commits due to their interaction with cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:28:22 +00:00
Jerry Snitselaar 2edf07cd91 iommu: Change kconfig around IOMMU_SVA
JIRA: https://issues.redhat.com/browse/RHEL-28780
JIRA: https://issues.redhat.com/browse/RHEL-29105
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff in mm/Kconfig, and RH_KABI macro used in include/linux/sched.h

commit 8f23f5dba6b4693448144bde4dd6f537543442c2
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Fri Oct 27 08:05:20 2023 +0800

    iommu: Change kconfig around IOMMU_SVA

    Linus suggested that the kconfig here is confusing:

    https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/

    Let's break it into three kconfigs controlling distinct things:

     - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional
       fields for the IOMMU. Currently only PASID, but later patches store
       a struct iommu_mm_data *

     - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit
       for keeping track of the ENQCMD instruction. x86 will select this if
       IOMMU_SVA is enabled

     - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code
       for iommu driver use and the IOMMU exported API

    This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID

    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

(cherry picked from commit 8f23f5dba6b4693448144bde4dd6f537543442c2)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-05-13 08:18:36 -07:00
Nico Pache 12c51977be mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
commit 24e41bf8a6b424c76c5902fb999e9eca61bdf83d
Author: Florent Revest <revest@chromium.org>
Date:   Mon Aug 28 17:08:57 2023 +0200

    mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl

    This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
    the process doesn't want MDWE protection to propagate to children.

    To implement this no-inherit mode, the tag in current->mm->flags must be
    absent from MMF_INIT_MASK.  This means that the encoding for "MDWE but
    without inherit" is different in the prctl than in the mm flags.  This
    leads to a bit of bit-mangling in the prctl implementation.

    Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org
    Signed-off-by: Florent Revest <revest@chromium.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Alexey Izbyshev <izbyshev@ispras.ru>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Ayush Jain <ayush.jain3@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Joey Gouly <joey.gouly@arm.com>
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: Mark Brown <broonie@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
    Cc: Topi Miettinen <toiwoton@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
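
A small userspace example of the new mode; the fallback defines mirror the prctl values added by this series and are only there so the example builds against older headers:

```
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#define PR_GET_MDWE 66
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN	(1UL << 0)
#endif
#ifndef PR_MDWE_NO_INHERIT
#define PR_MDWE_NO_INHERIT		(1UL << 1)	/* added by this series */
#endif

int main(void)
{
	/* Enable MDWE for this process only; children forked later start clean. */
	if (prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT,
		  0L, 0L, 0L) != 0) {
		perror("PR_SET_MDWE");
		return 1;
	}
	printf("MDWE bits: %d\n", (int)prctl(PR_GET_MDWE, 0L, 0L, 0L, 0L));
	return 0;
}
```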
Chris von Recklinghausen 81f271e685 sched/numa: apply the scan delay to every new vma
Conflicts: mm/mmap.c, include/linux/mm_types.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit ef6a22b70f6d90449a5c797b8968a682824e2011
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Wed Mar 1 17:49:00 2023 +0530

    sched/numa: apply the scan delay to every new vma

    Patch series "sched/numa: Enhance vma scanning", v3.

    The patchset proposes one of the enhancements to numa vma scanning
    suggested by Mel.  This is continuation of [3].

    Reposting the rebased patchset to akpm mm-unstable tree (March 1)

    Existing mechanism of scan period involves a scan period derived from
    per-thread stats.  Process Adaptive autoNUMA [1] proposed to gather NUMA
    fault stats at per-process level to capture application behaviour better.

    During that course of discussion, Mel proposed several ideas to enhance
    current numa balancing.  One of the suggestion was below

    Track what threads access a VMA.  The suggestion was to use an unsigned
    long pid_mask and use the lower bits to tag approximately what threads
    access a VMA.  Skip VMAs that did not trap a fault.  This would be
    approximate because of PID collisions but would reduce scanning of areas
    the thread is not interested in.  The above suggestion intends not to
    penalize threads that has no interest in the vma, thus reduce scanning
    overhead.

    V3 changes are mostly based on PeterZ comments (details below in changes)

    Summary of patchset:

    Current patchset implements:

    1. Delay the vma scanning logic for newly created VMA's so that
       additional overhead of scanning is not incurred for short lived tasks
       (implementation by Mel)

    2. Store the information of tasks accessing VMA in 2 windows.  It is
       regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval.
       The above time is derived from experimenting (Suggested by PeterZ) to
       balance between frequent clearing vs obsolete access data

    3. hash_32 used to encode task index accessing VMA information

    4. VMA's access information is used to skip scanning for the tasks
       which had not accessed VMA

    Changes since V2:
    patch1:
     - Renaming of structure, macro to function,
     - Add explanation to heuristics
     - Adding more details from result (PeterZ)
     Patch2:
     - Usage of test and set bit (PeterZ)
     - Move storing access PID info to numa_migrate_prep()
     - Add a note on fairness among tasks allowed to scan
       (PeterZ)
     Patch3:
     - Maintain two windows of access PID information
      (PeterZ supported implementation and Gave idea to extend
       to N if needed)
     Patch4:
     - Apply hash_32 function to track VMA accessing PIDs (PeterZ)

    Changes since RFC V1:
     - Include Mel's vma scan delay patch
     - Change the accessing pid store logic (Thanks Mel)
     - Fencing structure / code to NUMA_BALANCING (David, Mel)
     - Adding clearing access PID logic (Mel)
     - Descriptive change log ( Mike Rapoport)

    Things to ponder over:
    ==========================================

    - Improvement to clearing accessing PIDs logic (discussed in-detail in
      patch3 itself (Done in this patchset by implementing 2 window history)

    - Current scan period is not changed in the patchset, so we do see
      frequent tries to scan.  Relaxing scan period dynamically could improve
      results further.

    [1] sched/numa: Process Adaptive autoNUMA
     Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

    [2] RFC V1 Link:
      https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

    [3] V2 Link:
      https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/

    Results:
    Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement
    is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
    (dbench had huge std deviation to post)

    kernbench
    ===========
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
    Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
    Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*

    Duration User       66017.43    67959.84
    Duration System     30503.15    24657.03
    Duration Elapsed      504.61      493.12

                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Ops NUMA alloc hit                1738835089.00  1738780310.00
    Ops NUMA alloc local              1738834448.00  1738779711.00
    Ops NUMA base-page range updates      477310.00      392566.00
    Ops NUMA PTE updates                  477310.00      392566.00
    Ops NUMA hint faults                   96817.00       87555.00
    Ops NUMA hint local faults %           10150.00        2192.00
    Ops NUMA hint local percent               10.48           2.50
    Ops NUMA pages migrated                86660.00       85363.00
    Ops AutoNUMA cost                        489.07         442.14

    autonumabench
    ===============
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
    Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
    Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
    Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
    Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
    Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
    Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
    Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*

    Duration User      396433.47   324835.96
    Duration System      2808.70      376.66
    Duration Elapsed     2258.61     2258.12

                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Ops NUMA alloc hit                  59921806.00    49623489.00
    Ops NUMA alloc miss                        0.00           0.00
    Ops NUMA interleave hit                    0.00           0.00
    Ops NUMA alloc local                59920880.00    49622594.00
    Ops NUMA base-page range updates   152259275.00       50075.00
    Ops NUMA PTE updates               152259275.00       50075.00
    Ops NUMA PMD updates                       0.00           0.00
    Ops NUMA hint faults               154660352.00       39014.00
    Ops NUMA hint local faults %       138550501.00       23139.00
    Ops NUMA hint local percent               89.58          59.31
    Ops NUMA pages migrated              8179067.00       14147.00
    Ops AutoNUMA cost                     774522.98         195.69

    This patch (of 4):

    Currently whenever a new task is created we wait for
    sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead.
    Extend the same logic to new or very short-lived VMAs.

    [raghavendra.kt@amd.com: add initialization in vm_area_dup()]
    Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
    Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Disha Talreja <dishaa.talreja@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:45 -04:00
Chris von Recklinghausen 88cd4097c1 mm: separate vma->lock from vm_area_struct
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:32 2023 -0800

    mm: separate vma->lock from vm_area_struct

    vma->lock being part of the vm_area_struct causes performance regression
    during page faults because during contention its count and owner fields
    are constantly updated and having other parts of vm_area_struct used
    during page fault handling next to them causes constant cache line
    bouncing.  Fix that by moving the lock outside of the vm_area_struct.

    All attempts to keep vma->lock inside vm_area_struct in a separate cache
    line still produce performance regression especially on NUMA machines.
    Smallest regression was achieved when lock is placed in the fourth cache
    line but that bloats vm_area_struct to 256 bytes.

    Considering performance and memory impact, separate lock looks like the
    best option.  It increases memory footprint of each VMA but that can be
    optimized later if the new size causes issues.  Note that after this
    change vma_init() does not allocate or initialize vma->lock anymore.  A
    number of drivers allocate a pseudo VMA on the stack but they never use
    the VMA's lock, therefore it does not need to be allocated.  The future
    drivers which might need the VMA lock should use
    vm_area_alloc()/vm_area_free() to allocate the VMA.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:45 -04:00
Chris von Recklinghausen 624b7d40f8 mm/mmap: free vm_area_struct without call_rcu in exit_mmap
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:31 2023 -0800

    mm/mmap: free vm_area_struct without call_rcu in exit_mmap

    call_rcu() can take a long time when callback offloading is enabled.  Its
    use in the vm_area_free can cause regressions in the exit path when
    multiple VMAs are being freed.

    Because exit_mmap() is called only after the last mm user drops its
    refcount, the page fault handlers can't be racing with it.  Any other
    possible user like oom-reaper or process_mrelease are already synchronized
    using mmap_lock.  Therefore exit_mmap() can free VMAs directly, without
    the use of call_rcu().

    Expose __vm_area_free() and use it from exit_mmap() to avoid possible
    call_rcu() floods and performance regressions caused by it.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:45 -04:00
Chris von Recklinghausen 4faa49fc27 kernel/fork: assert no VMA readers during its destruction
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit f2e13784c16a98e269b3111ac02ae44446dd589c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:19 2023 -0800

    kernel/fork: assert no VMA readers during its destruction

    Assert there are no holders of VMA lock for reading when it is about to be
    destroyed.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-21-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:42 -04:00
Chris von Recklinghausen 06ea40f7dc mm: add per-VMA lock and helper functions to control it
Conflicts: include/linux/mm.h, include/linux/mmap_lock.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 5e31275cc997f8ec5d9e8d65fe9840ebed89db19
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:11 2023 -0800

    mm: add per-VMA lock and helper functions to control it

    Introduce per-VMA locking.  The lock implementation relies on per-vma
    and per-mm sequence counters to note exclusive locking:

      - read lock - (implemented by vma_start_read) requires the vma
        (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
        If they match then there must be a vma exclusive lock held somewhere.
      - read unlock - (implemented by vma_end_read) is a trivial vma->lock
        unlock.
      - write lock - (vma_start_write) requires the mmap_lock to be held
        exclusively and the current mm counter is assigned to the vma counter.
        This will allow multiple vmas to be locked under a single mmap_lock
        write lock (e.g. during vma merging). The vma counter is modified
        under exclusive vma lock.
      - write unlock - (vma_end_write_all) is a batch release of all vma
        locks held. It doesn't pair with a specific vma_start_write! It is
        done before exclusive mmap_lock is released by incrementing mm
        sequence counter (mm_lock_seq).
      - write downgrade - if the mmap_lock is downgraded to the read lock, all
        vma write locks are released as well (effectively the same as write
        unlock).

    Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:40 -04:00
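A rough sketch of the sequence-counter scheme spelled out in the list above (simplified: no lockdep or memory-ordering annotations, and the rw_semaphore is still embedded in the VMA at this point in the series):

```
static inline bool vma_start_read(struct vm_area_struct *vma)
{
	/* Matching counters mean the VMA is write-locked somewhere. */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	if (!down_read_trylock(&vma->lock))
		return false;

	/* Recheck under the lock in case a writer raced in. */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
		up_read(&vma->lock);
		return false;
	}
	return true;
}

static inline void vma_start_write(struct vm_area_struct *vma)
{
	mmap_assert_write_locked(vma->vm_mm);

	/* Already write-locked during this mmap_lock write cycle. */
	if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
		return;

	down_write(&vma->lock);
	vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
	up_write(&vma->lock);
}

/* "Write unlock" is batched: bumping mm_lock_seq releases every vma. */
static inline void vma_end_write_all(struct mm_struct *mm)
{
	mmap_assert_write_locked(mm);
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
```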
Chris von Recklinghausen 3f6e1d9cb8 mm: rcu safe VMA freeing
Conflicts: include/linux/mm_types.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 20cce633f4254cc0df39665449726e3172518f6c
Author: Michel Lespinasse <michel@lespinasse.org>
Date:   Mon Feb 27 09:36:09 2023 -0800

    mm: rcu safe VMA freeing

    This prepares for page faults handling under VMA lock, looking up VMAs
    under protection of an rcu read lock, instead of the usual mmap read lock.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
    Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:39 -04:00
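Roughly, the change routes the final free through an RCU callback once per-VMA locking is configured, so that rcu_read_lock() walkers never see a freed VMA. A hedged sketch, not the verbatim upstream code:

```
#ifdef CONFIG_PER_VMA_LOCK
static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	kmem_cache_free(vm_area_cachep, vma);
}
#endif

void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_PER_VMA_LOCK
	/* Readers may still be inspecting this VMA under rcu_read_lock(). */
	call_rcu(&vma->vm_rcu, __vm_area_free);
#else
	kmem_cache_free(vm_area_cachep, vma);
#endif
}
```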
Aristeu Rozanski 9280e1a4aa mm: enable maple tree RCU mode by default
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 3dd4432549415f3c65dd52d5c687629efbf4ece1
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Mon Feb 27 09:36:07 2023 -0800

    mm: enable maple tree RCU mode by default

    Use the maple tree in RCU mode for VMA tracking.

    The maple tree tracks the stack and is able to update the pivot
    (lower/upper boundary) in-place to allow the page fault handler to write
    to the tree while holding just the mmap read lock.  This is safe as the
    writes to the stack have a guard VMA which ensures there will always be a
    NULL in the direction of the growth and thus will only update a pivot.

    It is possible, but not recommended, to have VMAs that grow up/down
    without guard VMAs.  syzbot has constructed a testcase which sets up a VMA
    to grow and consume the empty space.  Overwriting the entire NULL entry
    causes the tree to be altered in a way that is not safe for concurrent
    readers; the readers may see a node being rewritten or one that does not
    match the maple state they are using.

    Enabling RCU mode allows the concurrent readers to see a stable node and
    will return the expected result.

    [Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
    Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
    Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
    Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:26 -04:00
Aristeu Rozanski fec82fff3c mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:48 2023 -0800

    mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

    To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace
    it with VM_LOCKED_MASK bitmask and convert all users.

    Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
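In effect the commit flips the sense of the mask so it can be handed straight to the vm_flags modifier helpers; roughly:

```
/* Before: callers masked with the complement. */
#define VM_LOCKED_CLEAR_MASK	(~(VM_LOCKED | VM_LOCKONFAULT))
	/* vma->vm_flags &= VM_LOCKED_CLEAR_MASK; */

/* After: a positive mask that vm_flags_clear() can take directly. */
#define VM_LOCKED_MASK		(VM_LOCKED | VM_LOCKONFAULT)
	/* vm_flags_clear(vma, VM_LOCKED_MASK); */
```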
Aristeu Rozanski 314860cbc3 kernel/fork: convert vma assignment to a memcpy
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 06e78b614e3780f9ac32056f2861159fd19d9702
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:46 2023 -0800

    kernel/fork: convert vma assignment to a memcpy

    Patch series "introduce vm_flags modifier functions", v4.

    This patchset was originally published as a part of per-VMA locking [1]
    and was split after suggestion that it's viable on its own and to
    facilitate the review process.  It is now a prerequisite for the next
    version of per-VMA lock patchset, which reuses vm_flags modifier functions
    to lock the VMA when vm_flags are being updated.

    VMA vm_flags modifications are usually done under exclusive mmap_lock
    protection because this attribute affects other decisions like VMA merging
    or splitting and races should be prevented.  Introduce vm_flags modifier
    functions to enforce correct locking.

    This patch (of 7):

    Convert vma assignment in vm_area_dup() to a memcpy() to prevent compiler
    errors when we add a const modifier to vma->vm_flags.

    Link: https://lkml.kernel.org/r/20230126193752.297968-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20230126193752.297968-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
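A hedged sketch of what the conversion looks like in vm_area_dup(): a plain struct assignment would stop compiling once vm_flags gains a const qualifier, so the copy goes through memcpy() instead:

```
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
	struct vm_area_struct *new;

	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
	if (!new)
		return NULL;

	/*
	 * memcpy() rather than "*new = *orig": the assignment form is
	 * rejected by the compiler once vm_flags is declared const.
	 */
	memcpy(new, orig, sizeof(*new));
	return new;
}
```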
Aristeu Rozanski e6172d44ab kernel/fork: convert forking to using the vmi iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 3b9dbd5e91b11911d21effbb80d1976fb21660df
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:10 2023 -0500

    kernel/fork: convert forking to using the vmi iterator

    Avoid using the maple tree interface directly.  This gains type safety.

    Link: https://lkml.kernel.org/r/20230120162650.984577-10-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:13 -04:00
Waiman Long 724656e7cf freezer,sched: Rewrite core freezer logic
JIRA: https://issues.redhat.com/browse/RHEL-34600
Conflicts:
 1) A merge conflict in the kernel/signal.c hunk due to the presence
    of RHEL-only commit 975d318867 ("signal: Don't disable preemption
    in ptrace_stop() on PREEMPT_RT.").
 2) A merge conflict in the kernel/time/hrtimer.c hunk due to the
    presence of RHEL-only commit 5f76194136 ("time/hrtimer: Embed
    hrtimer mode into hrtimer_sleeper").
 3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due
    to the presence of upstream commit 38c8a9a52082 ("smb: move client
    and server files to common directory fs/smb").
 4) Similarly, the fs/cifs/transport.c hunk was applied to
    fs/smb/client/transport.c manually due to the presence of
    a later upstream commit d527f51331ca ("cifs: Fix UAF in
    cifs_demultiplex_thread()").

Note that all the prerequisite patches in the same patch series
(https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/)
had already been merged into RHEL9.

commit f5d39b020809146cc28e6e73369bf8065e0310aa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon, 22 Aug 2022 13:18:22 +0200

    freezer,sched: Rewrite core freezer logic

    Rewrite the core freezer to behave better wrt thawing and be simpler
    in general.

    By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
    ensured frozen tasks stay frozen until thawed and don't randomly wake
    up early, as is currently possible.

    As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
    two PF_flags (yay!).

    Specifically; the current scheme works a little like:

            freezer_do_not_count();
            schedule();
            freezer_count();

    And either the task is blocked, or it lands in try_to_freeze()
    through freezer_count(). Now, when it is blocked, the freezer
    considers it frozen and continues.

    However, on thawing, once pm_freezing is cleared, freezer_count()
    stops working, and any random/spurious wakeup will let a task run
    before its time.

    That is, thawing tries to thaw things in explicit order; kernel
    threads and workqueues before bringing SMP back, before userspace,
    etc. However, due to the above-mentioned races it is entirely possible
    for userspace tasks to thaw (by accident) before SMP is back.

    This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
    where the userspace task requires a special CPU to run.

    As said; replace this with a special task state TASK_FROZEN and add
    the following state transitions:

            TASK_FREEZABLE  -> TASK_FROZEN
            __TASK_STOPPED  -> TASK_FROZEN
            __TASK_TRACED   -> TASK_FROZEN

    The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
    (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
    is already required to deal with spurious wakeups and the freezer
    causes one such when thawing the task (since the original state is
    lost).

    The special __TASK_{STOPPED,TRACED} states *can* be restored since
    their canonical state is in ->jobctl.

    With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
    free of undue (early / spurious) wakeups.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:06 -04:00
Chris von Recklinghausen 26437a89ef mm: remove the vma linked list
Conflicts:
	include/linux/mm.h - We already have
		21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
		so keep declaration for zap_page_range_single
	kernel/fork.c - We already have
		f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
		so keep declaration of i
	mm/mmap.c - We already have
		a1e8cb93bf ("mm: drop oom code from exit_mmap")
		and
		db3644c677 ("mm: delete unused MMF_OOM_VICTIM flag")
		so keep setting MMF_OOM_SKIP in mm->flags

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 763ecb035029f500d7e6dc99acd1ad299b7726a1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:06 2022 +0000

    mm: remove the vma linked list

    Replace any vm_next use with vma_find().

    Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
    maple tree.

    Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
    the same time, alter the loop to be more compact.

    Now that free_pgtables() and unmap_vmas() take a maple tree as an
    argument, rearrange do_mas_align_munmap() to use the new tree to hold the
    vmas to remove.

    Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
    used to update the linked list.

    Drop linked list update from __insert_vm_struct().

    Rework validation of tree as it was depending on the linked list.

    [yang.lee@linux.alibaba.com: fix one kernel-doc comment]
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
      Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:57 -04:00
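The iteration pattern that replaces the vm_next walks looks roughly like this (VMA_ITERATOR/for_each_vma are the helpers introduced earlier in the maple tree series):

```
/* Old pattern, removed by this series:
 *	for (vma = mm->mmap; vma; vma = vma->vm_next) { ... }
 */
void walk_all_vmas_sketch(struct mm_struct *mm)
{
	VMA_ITERATOR(vmi, mm, 0);
	struct vm_area_struct *vma;

	mmap_assert_locked(mm);
	for_each_vma(vmi, vma) {
		/* visit each VMA in address order */
	}
}
```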
Chris von Recklinghausen 11d9f41086 fork: use VMA iterator
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit fa5e587679f034530e8c14bc1c466490053b2ff2
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Sep 6 19:48:59 2022 +0000

    fork: use VMA iterator

    The VMA iterator is faster than the linked list and removing the linked
    list will shrink the vm_area_struct.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-50-Liam.Howlett@oracle.com
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:52 -04:00
Chris von Recklinghausen eb370ae179 mm: remove vmacache
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 7964cf8caa4dfa42c4149f3833d3878713cda3dc
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:51 2022 +0000

    mm: remove vmacache

    By using the maple tree and the maple tree state, the vmacache is no
    longer beneficial and is complicating the VMA code.  Remove the vmacache
    to reduce the work in keeping it up to date and code complexity.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:46 -04:00
Chris von Recklinghausen adeb5664bb mm: remove rb tree.
Conflicts: mm/mmap.c -
	We already have
	54a611b60590 ("Maple Tree: add new data structure")
	so mas_preallocate no longer takes a vma argument.
	We already have
	92b7399695a5 ("mmap: fix copy_vma() failure path")
	so keep check for new_vma->vm_file, the fput on it, and
	the unlink_anon_vmas call

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 524e00b36e8c547f5582eef3fb645a8d9fc5e3df
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:48 2022 +0000

    mm: remove rb tree.

    Remove the RB tree and start using the maple tree for vm_area_struct
    tracking.

    Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
    lock is not held.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:44 -04:00
Chris von Recklinghausen 21beb4ef93 kernel/fork: use maple tree for dup_mmap() during forking
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit c9dbe82cb99db5b6029c6bc43fcf7881d3f50268
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:47 2022 +0000

    kernel/fork: use maple tree for dup_mmap() during forking

    The maple tree was already tracking VMAs in this function by an earlier
    commit, but the rbtree iterator was being used to iterate the list.
    Change the iterator to use a maple tree native iterator and switch to the
    maple tree advanced API to avoid multiple walks of the tree during insert
    operations.  Unexport the now-unused vma_store() function.

    For performance reasons we bulk allocate the maple tree nodes.  The node
    calculations are done internally to the tree and use the VMA count and
    assume the worst-case node requirements.  The VM_DONTCOPY flag does not
    allow for the most efficient copy method of the tree and so a bulk loading
    algorithm is used.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-15-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:43 -04:00
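A simplified, hedged sketch of the bulk-loading pattern described above: preallocate maple nodes for the expected VMA count, store each duplicate with the advanced API, then release any unused preallocations. Page-table copying, VM_DONTCOPY handling and the full error unwinding are omitted:

```
static int dup_mmap_store_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
	MA_STATE(mas, &mm->mm_mt, 0, 0);
	struct vm_area_struct *mpnt, *tmp;
	int ret;

	/* Bulk-allocate nodes for the worst case up front. */
	ret = mas_expected_entries(&mas, oldmm->map_count);
	if (ret)
		return ret;

	mas_for_each(&old_mas, mpnt, ULONG_MAX) {
		tmp = vm_area_dup(mpnt);
		if (!tmp) {
			ret = -ENOMEM;
			break;
		}
		mas.index = tmp->vm_start;
		mas.last = tmp->vm_end - 1;
		mas_store(&mas, tmp);
	}

	mas_destroy(&mas);	/* free any unused preallocated nodes */
	return ret;
}
```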
Chris von Recklinghausen cde31a5d92 mm: start tracking VMAs with maple tree
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit d4af56c5c7c6781ca6ca8075e2cf5bc119ed33d1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:45 2022 +0000

    mm: start tracking VMAs with maple tree

    Start tracking the VMAs with the new maple tree structure in parallel with
    the rb_tree.  Add debug and trace events for maple tree operations and
    duplicate the rb_tree that is created on forks into the maple tree.

    The maple tree is added to the mm_struct including the mm_init struct,
    added support in required mm/mmap functions, added tracking in kernel/fork
    for process forking, and used to find the unmapped_area and checked
    against what the rbtree finds.

    This also moves the mmap_lock() in exit_mmap() since the oom reaper call
    does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.

    When splitting a vma fails due to allocations of the maple tree nodes,
    the error path in __split_vma() calls new->vm_ops->close(new).  The page
    accounting for hugetlb is actually in the close() operation,  so it
    accounts for the removal of 1/2 of the VMA which was not adjusted.  This
    results in a negative exit value.  To avoid the negative charge, set
    vm_start = vm_end and vm_pgoff = 0.

    There is also a potential accounting issue in special mappings from
    insert_vm_struct() failing to allocate, so reverse the charge there in
    the failure scenario.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:41 -04:00
Prarit Bhargava 3e4b9d3d7e stackprotector: move get_random_canary() into stackprotector.h
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Minor drift issues.

commit b3883a9a1f09e7b41f4dcb1bbd7262216a62d253
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun Oct 23 22:06:00 2022 +0200

    stackprotector: move get_random_canary() into stackprotector.h

    This has nothing to do with random.c and everything to do with stack
    protectors. Yes, it uses randomness. But many things use randomness.
    random.h and random.c are concerned with the generation of randomness,
    not with each and every use. So move this function into the more
    specific stackprotector.h file where it belongs.

    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:05 -04:00
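The helper itself is tiny; after the move it sits in include/linux/stackprotector.h and looks roughly like this (CANARY_MASK keeps a zero byte in the canary so string-based overflows cannot copy past it):

```
static inline unsigned long get_random_canary(void)
{
	return get_random_long() & CANARY_MASK;
}
```

Only the header providing the helper changes; callers such as the per-arch boot_init_stack_canary() keep using it as before.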
Prarit Bhargava 8f4ce17181 fork: Generalize PF_IO_WORKER handling
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Did not apply to some unsupported arches.  Those changes have
been dropped.  Add idle_dummy() function here instead of backporting
out-of-scope linux commit 36cb0e1cda64 ("fork: Explicity test for idle
tasks in copy_thread")

commit 5bd2e97c868a8a44470950ed01846cab6328e540
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Apr 12 10:18:48 2022 -0500

    fork: Generalize PF_IO_WORKER handling

    Add fn and fn_arg members into struct kernel_clone_args and test for
    them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER).
    This allows any task that wants to be a user space task that only runs
    in kernel mode to use this functionality.

    The code on x86 is an exception and still retains a PF_KTHREAD test
    because x86, unlike everything else, handles kthreads slightly
    differently than user space tasks that start with a function.

    The functions that created tasks that start with a function
    have been updated to set ".fn" and ".fn_arg" instead of
    ".stack" and ".stack_size".  These functions are fork_idle(),
    create_io_thread(), kernel_thread(), and user_mode_thread().

    Link: https://lkml.kernel.org/r/20220506141512.516114-4-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:04 -04:00
Prarit Bhargava 76f9c1e6a9 x86/split_lock: Make life miserable for split lockers
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Minor drift issues.

commit b041b525dab95352fbd666b14dc73ab898df465f
Author: Tony Luck <tony.luck@intel.com>
Date:   Thu Mar 10 12:48:53 2022 -0800

    x86/split_lock: Make life miserable for split lockers

    In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas
    said:

      Its's simply wishful thinking that stuff gets fixed because of a
      WARN_ONCE(). This has never worked. The only thing which works is to
      make stuff fail hard or slow it down in a way which makes it annoying
      enough to users to complain.

    He was talking about WBINVD. But it made me think about how we use the
    split lock detection feature in Linux.

    Existing code has three options for applications:

     1) Don't enable split lock detection (allow arbitrary split locks)
     2) Warn once when a process uses split lock, but let the process
        keep running with split lock detection disabled
     3) Kill process that use split locks

    Option 2 falls into the "wishful thinking" territory that Thomas warns does
    nothing. But option 3 might not be viable in a situation with legacy
    applications that need to run.

    Hence make option 2 much stricter to "slow it down in a way which makes
    it annoying".

    Primary reason for this change is to provide better quality of service to
    the rest of the applications running on the system. Internal testing shows
    that even with many processes splitting locks, performance for the rest of
    the system is much more responsive.

    The new "warn" mode operates like this.  When an application tries to
    execute a bus lock, the #AC handler:

     1) Delays (interruptibly) 10 ms before moving to next step.

     2) Blocks (interruptibly) until it can get the semaphore
            If interrupted, just return. Assume the signal will either
            kill the task, or direct execution away from the instruction
            that is trying to get the bus lock.
     3) Disables split lock detection for the current core
     4) Schedules a work queue to re-enable split lock detect in 2 jiffies
     5) Returns

    The work queue that re-enables split lock detection also releases the
    semaphore.

    There is a corner case where a CPU may be taken offline while split lock
    detection is disabled. A CPU hotplug handler handles this case.

    Old behaviour was to only print the split lock warning on the first
    occurrence of a split lock from a task. Preserve that by adding a flag to
    the task structure that suppresses subsequent split lock messages from that
    task.

    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:42:47 -04:00
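A heavily simplified, hypothetical sketch of the numbered flow above; the function and helper names (split_lock_warn_slowpath, split_lock_disable_on_this_cpu, split_lock_enable_on_this_cpu) are illustrative, not the upstream identifiers:

```
static DEFINE_SEMAPHORE(split_lock_sem);
static void split_lock_reenable_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(split_lock_reenable, split_lock_reenable_fn);

static void split_lock_warn_slowpath(void)
{
	/* 1) Make the offender pay: sleep 10 ms, bail out on a signal. */
	if (msleep_interruptible(10))
		return;

	/* 2) Serialize offenders system-wide behind one semaphore. */
	if (down_interruptible(&split_lock_sem))
		return;

	/* 3) Let this core make progress with detection turned off ... */
	split_lock_disable_on_this_cpu();

	/* 4) ... and re-enable it (releasing the semaphore) in 2 jiffies. */
	schedule_delayed_work_on(smp_processor_id(), &split_lock_reenable, 2);
}

static void split_lock_reenable_fn(struct work_struct *work)
{
	split_lock_enable_on_this_cpu();
	up(&split_lock_sem);
}
```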
Prarit Bhargava 5022b3efeb fork: Pass struct kernel_clone_args into copy_thread
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit c5febea0956fd3874e8fb59c6f84d68f128d68f8
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 8 18:07:50 2022 -0500

    fork: Pass struct kernel_clone_args into copy_thread

    With io_uring we have started supporting tasks that are for most
    purposes user space tasks that exclusively run code in kernel mode.

    The kernel task that exec's init and tasks that exec user mode
    helpers are also user mode tasks that just run kernel code
    until they call kernel execve.

    Pass kernel_clone_args into copy_thread so these oddball
    tasks can be supported more cleanly and easily.

    v2: Fix spelling of kenrel_clone_args on h8300
    Link: https://lkml.kernel.org/r/20220506141512.516114-2-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Omitted-fix: 0626e1c9f3e5 LoongArch: Fix copy_thread() build errors
	loongarch not supported in RHEL9

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:42:34 -04:00
Vitaly Kuznetsov be72381490 x86/mm: Use mm_alloc() in poking_init()
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit 3f4c8211d982099be693be9aa7d6fc4607dff290
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 25 21:38:21 2022 +0200

    x86/mm: Use mm_alloc() in poking_init()

    Instead of duplicating init_mm, allocate a fresh mm. The advantage is
    that mm_alloc() has much simpler dependencies. Additionally it makes
    more conceptual sense, init_mm has no (and must not have) user state
    to duplicate.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
2024-03-20 09:42:26 -04:00
Vitaly Kuznetsov 037738e296 mm: Move mm_cachep initialization to mm_init()
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit af80602799681c78f14fbe20b6185a56020dedee
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 25 21:38:18 2022 +0200

    mm: Move mm_cachep initialization to mm_init()

    In order to allow using mm_alloc() much earlier, move initializing
    mm_cachep into mm_init().

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221025201057.751153381@infradead.org

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
2024-03-20 09:42:26 -04:00
Jan Stancek 78042596b6 Merge: iommu: IOMMU and DMA-mapping API Updates for 9.4
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3180

# Merge Request Required Information

```
Bugzilla: https://bugzilla.redhat.com/2223717
JIRA: https://issues.redhat.com/browse/RHEL-10007
JIRA: https://issues.redhat.com/browse/RHEL-10026
JIRA: https://issues.redhat.com/browse/RHEL-10042
JIRA: https://issues.redhat.com/browse/RHEL-10094
JIRA: https://issues.redhat.com/browse/RHEL-3655
JIRA: https://issues.redhat.com/browse/RHEL-800

Depends: !3244
Depends: !3245

Omitted-fix: c7bd8a1f45ba ("iommu/apple-dart: Handle DMA_FQ domains in attach_dev()")
             - Apple Dart not supported

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Testing: A mix of fio jobs, and various stress-ng io stressors (--hdd, --readahead, --aio, --aiol, --seek,
         --sync_file) run with strict and lazy translation modes on amd, intel, and arm systems. pgtbl_v2
         tested on AMD Genoa host

Conflicts: Should be noted in individual commits. In particular one upstream merge in 6.4, 58390c8ce1bd, had a rather
           messy merge conflict resolution set, so a number of commits have those cleanups added in here.
```

## Summary of Changes

```
        Rebase through v6.5 with a good portion of v6.6 as well (minus the
	dynamic swiotlb mempool support, per numa dma cma support, and arm
	+ mm tlb invalidate changes). For iommufd changes there are
	backports of the underlying functionality in iommufd, but I have left
	the vfio commits that will eventually make use of it for Alex.

Highlights
	* AMD GA Log Overflow refactor and PPR Log support
	* AMD v2 page table support
	* AMD v2 5 level guest page table support
	* Various cleanups and fixes
	* Sync ipmmu-vmsa in preparation for Renesas support  (config not enabled)
	* Continuation of swiotlb rework
	* Continuation of the refactor of core iommu code as part of SVA, iommufd, and pasid support work
	* Continuation of the iommufd prep work (config still not enabled)
	* Support for bounce buffer usage with non cache-line aligned kmallocs on arm64
	* Clean up of in-kernel pasid use for vt-d
	* More cleanup of BUG_ON and warning use in vt-d

        This is based on top of MR !2843 and !3158.
```

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

## Approved Development Ticket
All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Eric Auger <eric.auger@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-19 15:53:34 +01:00
Scott Weaver 1e550aa9e1 Merge: kernel/fork: beware of __put_task_struct() calling context
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3002

Bugzilla: https://bugzilla.redhat.com/2060283

commit d243b34459cea30cfe5f3a9b2feb44e7daff9938

Author: Wander Lairson Costa <wander@redhat.com>

Date:   Wed Jun 14 09:23:21 2023 -0300

    kernel/fork: beware of __put_task_struct() calling context

    Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
    locks. Therefore, it can't be called from a non-preemptible context.

    One practical example is a splat inside inactive_task_timer(), which is
    called in an interrupt context:

      CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
       Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
       Call Trace:
       dump_stack_lvl+0x57/0x7d
       mark_lock_irq.cold+0x33/0xba
       mark_lock+0x1e7/0x400
       mark_usage+0x11d/0x140
       __lock_acquire+0x30d/0x930
       lock_acquire.part.0+0x9c/0x210
       rt_spin_lock+0x27/0xe0
       refill_obj_stock+0x3d/0x3a0
       kmem_cache_free+0x357/0x560
       inactive_task_timer+0x1ad/0x340
       __run_hrtimer+0x8a/0x1a0
       __hrtimer_run_queues+0x91/0x130
       hrtimer_interrupt+0x10f/0x220
       __sysvec_apic_timer_interrupt+0x7b/0xd0
       sysvec_apic_timer_interrupt+0x4f/0xd0
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       RIP: 0033:0x7fff196bf6f5

    Instead of calling __put_task_struct() directly, we defer it using
    call_rcu(). A more natural approach would use a workqueue, but since
    in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
    the code would become more complex because we would need to put the
    work_struct instance in the task_struct and initialize it when we
    allocate a new task_struct.

    The issue is reproducible with stress-ng:

      while true; do
          stress-ng --sched deadline --sched-period 1000000000 \
                  --sched-runtime 800000000 --sched-deadline \
                  1000000000 --mmapfork 23 -t 20
      done

    Reported-by: Hu Chunyu <chuhu@redhat.com>
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Wander Lairson Costa <wander@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com

Signed-off-by: Wander Lairson Costa <wander@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-06 14:54:53 -05:00
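The shape of the fix, roughly: the final reference drop checks whether it is running in a context that may sleep and, if not, defers the heavy teardown through call_rcu(). A hedged sketch of the helpers this commit adds:

```
void __put_task_struct_rcu_cb(struct rcu_head *rhp)
{
	struct task_struct *task = container_of(rhp, struct task_struct, rcu);

	__put_task_struct(task);
}

static inline void put_task_struct(struct task_struct *t)
{
	if (!refcount_dec_and_test(&t->usage))
		return;

	/*
	 * Under PREEMPT_RT __put_task_struct() can take sleeping locks, so
	 * it must not run from a non-preemptible context (e.g. the hrtimer
	 * interrupt in the splat above); defer it through RCU instead.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
	else
		__put_task_struct(t);
}
```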
Jerry Snitselaar 216afcb34d iommu/sva: Move PASID helpers to sva code
JIRA: https://issues.redhat.com/browse/RHEL-10094
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: 400b9b93441c ("iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()")
           changed pasid_valid to mm_valid_pasid.

commit cd3891158a77685aee6129f7374a018d13540b2c
Author: Jacob Pan <jacob.jun.pan@linux.intel.com>
Date:   Wed Mar 22 13:07:58 2023 -0700

    iommu/sva: Move PASID helpers to sva code

    Preparing to remove IOASID infrastructure, PASID management will be
    under SVA code. Decouple mm code from IOASID.

    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
    Reviewed-by: Kevin Tian <kevin.tian@intel.com>
    Link: https://lore.kernel.org/r/20230322200803.869130-3-jacob.jun.pan@linux.intel.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

(cherry picked from commit cd3891158a77685aee6129f7374a018d13540b2c)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2023-10-27 01:26:58 -07:00
Chris von Recklinghausen 492e3ef0e5 mm: fix memory leak on mm_init error handling
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b20b0368c614c609badfe16fbd113dfb4780acd9
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Mar 30 09:38:22 2023 -0400

    mm: fix memory leak on mm_init error handling

    commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
    introduces a memory leak by missing a call to destroy_context() when a
    percpu_counter fails to allocate.

    Before introducing the per-cpu counter allocations, init_new_context() was
    the last call that could fail in mm_init(), and thus there was no need to
    ever invoke destroy_context() in the error paths.  Adding the following
    percpu counter allocations adds error paths after init_new_context(),
    which means its associated destroy_context() needs to be called when
    percpu counters fail to allocate.

    Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
    Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:00 -04:00
Chris von Recklinghausen 2a6d71ed47 percpu_counter: add percpu_counter_sum_all interface
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f689054aace2ff13af2e9a44a74fbba650ca31ba
Author: Shakeel Butt <shakeelb@google.com>
Date:   Wed Nov 9 01:20:11 2022 +0000

    percpu_counter: add percpu_counter_sum_all interface

    The percpu_counter is used for scenarios where performance is more
    important than the accuracy.  For percpu_counter users, who want more
    accurate information in their slowpath, percpu_counter_sum is provided
    which traverses all the online CPUs to accumulate the data.  The reason it
    only needs to traverse online CPUs is because percpu_counter does
    implement CPU offline callback which syncs the local data of the offlined
    CPU.

    However there is a small race window between the online CPUs traversal of
    percpu_counter_sum and the CPU offline callback.  The offline callback has
    to traverse all the percpu_counters on the system to flush the CPU local
    data which can be a lot.  During that time, the CPU which is going offline
    has already been published as offline to all the readers.  So, as the
    offline callback is running, percpu_counter_sum can be called for one
    counter which has some state on the CPU going offline.  Since
    percpu_counter_sum only traverses online CPUs, it will skip that specific
    CPU and the offline callback might not have flushed the state for that
    specific percpu_counter on that offlined CPU.

    Normally this is not an issue because percpu_counter users can deal with
    some inaccuracy for small time window.  However a new user i.e.  mm_struct
    on the cleanup path wants to check the exact state of the percpu_counter
    through check_mm().  For such users, this patch introduces
    percpu_counter_sum_all() which traverses all possible CPUs and it is used
    in fork.c:check_mm() to avoid the potential race.

    This issue is exposed by the later patch "mm: convert mm's rss stats into
    percpu_counter".

    Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:21 -04:00
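Functionally the new helper is percpu_counter_sum() with the CPU mask widened from online to possible CPUs; a hedged sketch:

```
s64 percpu_counter_sum_all(struct percpu_counter *fbc)
{
	unsigned long flags;
	s64 ret;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	/* Possible CPUs, not just online ones, to close the offline race. */
	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);

	return ret;
}
```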
Chris von Recklinghausen 9cec47342a mm: convert mm's rss stats into percpu_counter
Conflicts:
	include/linux/sched.h - We don't have
		7964cf8caa4d ("mm: remove vmacache")
		so don't remove the declaration for vmacache
	kernel/fork.c - We don't have
		d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
		so don't add calls to mt_init_flags or mt_set_external_lock

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25
Author: Shakeel Butt <shakeelb@google.com>
Date:   Mon Oct 24 05:28:41 2022 +0000

    mm: convert mm's rss stats into percpu_counter

    Currently mm_struct maintains rss_stats which are updated on page fault
    and the unmapping codepaths.  For page fault codepath the updates are
    cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
    The reason for caching is performance for multithreaded applications
    otherwise the rss_stats updates may become hotspot for such applications.

    However this optimization comes with the cost of error margin in the rss
    stats.  The rss_stats for applications with large number of threads can be
    very skewed.  At worst the error margin is (nr_threads * 64) and we have a
    lot of applications with 100s of threads, so the error margin can be very
    high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.

    Recently we started seeing the unbounded errors for rss_stats for specific
    applications which use TCP rx0cp.  It seems like vm_insert_pages()
    codepath does not sync rss_stats at all.

    This patch converts the rss_stats into percpu_counter to convert the error
    margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).  However
    this conversion enable us to get the accurate stats for situations where
    accuracy is more important than the cpu cost.

    This patch does not make such tradeoffs - we can just use
    percpu_counter_add_local() for the updates and percpu_counter_sum() (or
    percpu_counter_sync() + percpu_counter_read) for the readers.  At the
    moment the readers are either procfs interface, oom_killer and memory
    reclaim which I think are not performance critical and should be ok with
    slow read.  However I think we can make that change in a separate patch.

    Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:21 -04:00
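After the conversion the per-mm counters go through the generic percpu_counter API; a hedged sketch of the accessors (the accurate-sum helper name below is illustrative, not an upstream identifier):

```
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	/* Fast, possibly slightly stale read for the common paths. */
	return percpu_counter_read_positive(&mm->rss_stat[member]);
}

static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	percpu_counter_add(&mm->rss_stat[member], value);
}

/* Slow but accurate sum for readers such as procfs or the oom killer. */
static inline unsigned long get_mm_counter_sum_sketch(struct mm_struct *mm,
						      int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}
```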
Chris von Recklinghausen 41acf1521c kmsan: handle task creation and exiting
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 50b5e49ca694a60f84a2a12d62b6cb6ec8e3649f
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:50 2022 +0200

    kmsan: handle task creation and exiting

    Tell KMSAN that a new task is created, so the tool creates a backing
    metadata structure for that task.

    Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:36 -04:00