Commit Graph

1181 Commits

Author SHA1 Message Date
Rafael Aquini e4205ccf96 kernel: be more careful about dup_mmap() failures and uprobe registering
JIRA: https://issues.redhat.com/browse/RHEL-84184
CVE: CVE-2025-21709
Conflicts:
  * kernel/events/uprobes.c: a notable context difference in the 1st hunk
    due to RHEL-9 missing the following upstream commits: 87195a1ee332a,
    2bf8e5aceff89, and dd1a7567784e2; and a notable context difference in
    the 2nd hunk due to RHEL-9 missing the following upstream commits:
    84455e6923c7 and 8617408f7a01. None of the commits listed above is
    of any relevance to this backport work.

This patch is a backport of the following upstream commit:
commit 64c37e134b120fb462fb4a80694bfb8e7be77b14
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Mon Jan 27 12:02:21 2025 -0500

    kernel: be more careful about dup_mmap() failures and uprobe registering

    If a memory allocation fails during dup_mmap(), the maple tree can be left
    in an unsafe state for other iterators besides the exit path.  All the
    locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
    incomplete mm_struct can be reached through (at least) the rmap finding
    the vmas which have a pointer back to the mm_struct.

    Up to this point, there have been no issues with being able to find an
    mm_struct that was only partially initialised.  Syzbot was able to make
    the incomplete mm_struct fail with recent forking changes, so it has been
    proven unsafe to use the mm_struct that hasn't been initialised, as
    referenced in the link below.

    Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
    invalid mm") fixed the uprobe access, it does not completely remove the
    race.

    This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
    oom side (even though this is extremely unlikely to be selected as an oom
    victim in the race window), and sets MMF_UNSTABLE to avoid other potential
    users from using a partially initialised mm_struct.

    When registering vmas for uprobe, skip the vmas in an mm that is marked
    unstable.  Modifying a vma in an unstable mm may cause issues if the mm
    isn't fully initialised.

    Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
    Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Jann Horn <jannh@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:53 -04:00
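
As a rough illustration of the flag-and-skip pattern described in the message above, here is a minimal userspace sketch; the toy_* structures and helpers are invented stand-ins, and only the flag names mirror the kernel's MMF_OOM_SKIP/MMF_UNSTABLE bits:

```
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for mm_struct and its flag bits. */
#define MMF_OOM_SKIP  (1UL << 0)
#define MMF_UNSTABLE  (1UL << 1)

struct toy_mm {
	unsigned long flags;
	int nr_vmas;		/* how many VMAs were actually duplicated */
};

/* Duplication that can fail partway through, leaving a partial mm. */
static bool toy_dup_mmap(struct toy_mm *mm, int want, int fail_at)
{
	for (int i = 0; i < want; i++) {
		if (i == fail_at) {
			/* Mark the mm so other users skip it instead of
			 * walking a partially built VMA set. */
			mm->flags |= MMF_OOM_SKIP | MMF_UNSTABLE;
			return false;
		}
		mm->nr_vmas++;
	}
	return true;
}

/* Uprobe-style walker: refuse to touch an unstable mm. */
static void toy_register_for_each_vma(const struct toy_mm *mm)
{
	if (mm->flags & MMF_UNSTABLE) {
		puts("mm unstable, skipping registration");
		return;
	}
	printf("registering against %d vmas\n", mm->nr_vmas);
}

int main(void)
{
	struct toy_mm mm = { 0 };

	if (!toy_dup_mmap(&mm, 8, 3))
		puts("dup failed partway");
	toy_register_for_each_vma(&mm);
	return 0;
}
```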
Rafael Aquini 6abde438ad fork: avoid inappropriate uprobe access to invalid mm
JIRA: https://issues.redhat.com/browse/RHEL-84184
Conflicts:
  * kernel/fork.c: minor difference from upstream due to an extra blank line
      that was left behind when commit d24062914837 ("fork: use __mt_dup()
      to duplicate maple tree in dup_mmap()") was backported into RHEL-9

This patch is a backport of the following upstream commit:
commit 8ac662f5da19f5873fdd94c48a5cdb45b2e1b58f
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Dec 10 17:24:12 2024 +0000

    fork: avoid inappropriate uprobe access to invalid mm

    If dup_mmap() encounters an issue, currently uprobe is able to access the
    relevant mm via the reverse mapping (in build_map_info()), and if we are
    very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
    we establish as part of the fork error path.

    This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
    in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
    invoking vma_iter_load() which uses the advanced maple tree API and thus
    is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
    d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
    dup_mmap()").

    This change was made on the assumption that only process tear-down code
    would actually observe (and make use of) these values.  However this very
    unlikely but still possible edge case with uprobes exists and
    unfortunately does make these observable.

    The uprobe operation prevents races against the dup_mmap() operation via
    the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
    and dropped via uprobe_end_dup_mmap(), and held across
    register_for_each_vma() prior to invoking build_map_info() which does the
    reverse mapping lookup.

    Currently these are acquired and dropped within dup_mmap(), which exposes
    the race window prior to error handling in the invoking dup_mm() which
    tears down the mm.

    We can avoid all this by just moving the invocation of
    uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
    and only release this lock once the dup_mmap() operation succeeds or clean
    up is done.

    This means that the uprobe code can never observe an incompletely
    constructed mm and resolves the issue in this case.

    Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Kan Liang <kan.liang@linux.intel.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:52 -04:00
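
A toy model of the scope change described above: the exclusion provided by uprobe_start_dup_mmap()/uprobe_end_dup_mmap() is taken by the caller and only dropped once duplication has either succeeded or been fully torn down. The mutex and helpers below are simplified stand-ins, not the kernel's percpu rw_semaphore:

```
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for dup_mmap_sem: held while an address space is being copied. */
static pthread_mutex_t dup_mmap_sem = PTHREAD_MUTEX_INITIALIZER;

static bool toy_dup_mmap(bool inject_error)
{
	/* ... copy VMAs; may fail partway through ... */
	return !inject_error;
}

static void toy_cleanup_partial_mm(void)
{
	/* ... tear down whatever was copied before the failure ... */
}

/*
 * Old shape: the lock lived inside toy_dup_mmap(), so observers could run
 * between a failed copy and the cleanup done by the caller.  New shape,
 * as below: the caller holds it across both, so a partially built mm is
 * never visible to the uprobe side.
 */
static int toy_dup_mm(bool inject_error)
{
	int ret = 0;

	pthread_mutex_lock(&dup_mmap_sem);	/* uprobe_start_dup_mmap() */
	if (!toy_dup_mmap(inject_error)) {
		toy_cleanup_partial_mm();
		ret = -1;
	}
	pthread_mutex_unlock(&dup_mmap_sem);	/* uprobe_end_dup_mmap() */
	return ret;
}

int main(void)
{
	printf("fork ok:   %d\n", toy_dup_mm(false));
	printf("fork fail: %d\n", toy_dup_mm(true));
	return 0;
}
```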
Patrick Talbert 143d5ac2a9 Merge: CVE-2024-50271: ucounts: Split rlimit and ucount values and max values
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6027

JIRA: https://issues.redhat.com/browse/RHEL-68020

CVE: CVE-2024-50271

- 012f4d5d25e9ef92ee129bd5aa7aa60f692681e1 signal: restore the override_rlimit logic
- de399236e240743ad2dd10d719c37b97ddf31996 ucounts: Split rlimit and ucount values and max values

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2025-02-03 10:00:41 -05:00
Radostin Stoyanov 46364cd74c ucounts: Split rlimit and ucount values and max values
JIRA: https://issues.redhat.com/browse/RHEL-68020
CVE: CVE-2024-50271

commit de399236e240743ad2dd10d719c37b97ddf31996
Author: Alexey Gladkov <legion@kernel.org>
Date:   Wed May 18 19:17:30 2022 +0200

    ucounts: Split rlimit and ucount values and max values

    Since the semantics of maximum rlimit values are different, it would be
    better not to mix ucount and rlimit values. This will prevent the error
    of using inc_count/dec_ucount for rlimit parameters.

    This patch also renames the functions to emphasize the lack of
    connection between rlimit and ucount.

    v3:
    - Fix BUG:KASAN:use-after-free_in_dec_ucount.

    v2:
    - Fix the array-index-out-of-bounds that was found by the lkp project.

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Alexey Gladkov <legion@kernel.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2024-12-20 15:31:08 +00:00
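
A rough sketch of the split, assuming only what the message states: rlimit-style values get their own storage and their own, clearly named helpers, separate from plain ucounts. The enum and field names below are invented for illustration and are not the kernel's:

```
#include <stdio.h>

/* Invented enums for the sketch: plain ucounts vs rlimit-style counts. */
enum toy_ucount { TOY_UCOUNT_USER_NAMESPACES, TOY_UCOUNT_MAX };
enum toy_rlimit { TOY_RLIMIT_NPROC, TOY_RLIMIT_MAX };

struct toy_ucounts {
	long ucount[TOY_UCOUNT_MAX];	/* plain counters */
	long rlimit[TOY_RLIMIT_MAX];	/* rlimit values, different semantics */
};

/*
 * Distinct helpers, so an rlimit value can no longer be bumped by mistake
 * through the plain-ucount interface.
 */
static void toy_inc_ucount(struct toy_ucounts *u, enum toy_ucount t)
{
	u->ucount[t]++;
}

static void toy_inc_rlimit_ucounts(struct toy_ucounts *u, enum toy_rlimit t)
{
	u->rlimit[t]++;
}

int main(void)
{
	struct toy_ucounts u = { 0 };

	toy_inc_ucount(&u, TOY_UCOUNT_USER_NAMESPACES);
	toy_inc_rlimit_ucounts(&u, TOY_RLIMIT_NPROC);
	printf("userns=%ld nproc=%ld\n",
	       u.ucount[TOY_UCOUNT_USER_NAMESPACES], u.rlimit[TOY_RLIMIT_NPROC]);
	return 0;
}
```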
Rafael Aquini a0b0ebfbe2 fork: only invoke khugepaged, ksm hooks if no error
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 985da552a98e27096444508ce5d853244019111f
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Oct 15 18:56:06 2024 +0100

    fork: only invoke khugepaged, ksm hooks if no error

    There is no reason to invoke these hooks early against an mm that is in an
    incomplete state.

    The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
    maple tree in dup_mmap()") makes this more pertinent as we may be in a
    state where entries in the maple tree are not yet consistent.

    Their placement early in dup_mmap() only appears to have been meaningful
    for early error checking, and since functionally it'd require a very small
    allocation to fail (in practice 'too small to fail') that'd only occur in
    the most dire circumstances, meaning the fork would fail or be OOM'd in
    any case.

    Since both khugepaged and KSM tracking are there to provide optimisations
    to memory performance rather than critical functionality, it doesn't
    really matter all that much if, under such dire memory pressure, we fail
    to register an mm with these.

    As a result, we follow the example of commit d2081b2bf819 ("mm:
    khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
    void function also.

    We only expose the mm to these functions once we are done with them and
    only if no error occurred in the fork operation.

    Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: Jann Horn <jannh@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Jann Horn <jannh@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Linus Torvalds <torvalds@linuxfoundation.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:52 -05:00
Rafael Aquini 4ecda75778 fork: do not invoke uffd on fork if error occurs
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit f64e67e5d3a45a4a04286c47afade4b518acd47b
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Tue Oct 15 18:56:05 2024 +0100

    fork: do not invoke uffd on fork if error occurs

    Patch series "fork: do not expose incomplete mm on fork".

    During fork we may place the virtual memory address space into an
    inconsistent state before the fork operation is complete.

    In addition, we may encounter an error during the fork operation that
    indicates that the virtual memory address space is invalidated.

    As a result, we should not be exposing it in any way to external machinery
    that might interact with the mm or VMAs, machinery that is not designed to
    deal with incomplete state.

    We specifically update the fork logic to defer khugepaged and ksm to the
    end of the operation and only to be invoked if no error arose, and
    disallow uffd from observing fork events should an error have occurred.

    This patch (of 2):

    Currently on fork we expose the virtual address space of a process to
    userland unconditionally if uffd is registered in VMAs, regardless of
    whether an error arose in the fork.

    This is performed in dup_userfaultfd_complete() which is invoked
    unconditionally, and performs two duties - invoking registered handlers
    for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
    userfaultfd_fork_ctx objects established in dup_userfaultfd().

    This is problematic, because the virtual address space may not yet be
    correctly initialised if an error arose.

    The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
    maple tree in dup_mmap()") makes this more pertinent as we may be in a
    state where entries in the maple tree are not yet consistent.

    We address this by, on fork error, ensuring that we roll back state that
    we would otherwise expect to clean up through the event being handled by
    userland and perform the memory freeing duty otherwise performed by
    dup_userfaultfd_complete().

    We do this by implementing a new function, dup_userfaultfd_fail(), which
    performs the same loop, only decrementing reference counts.

    Note that we perform mmgrab() on the parent and child mm's, however
    userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
    zero, so we will avoid memory leaks correctly here.

    Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
    Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
    Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
    Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reported-by: Jann Horn <jannh@google.com>
    Reviewed-by: Jann Horn <jannh@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Linus Torvalds <torvalds@linuxfoundation.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:51 -05:00
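
A compact model of the two exit paths described above, under the assumption that each pending fork context carries one reference that is either consumed by the completion path or simply dropped on failure; the types and helpers are stand-ins, not the userfaultfd code:

```
#include <stdio.h>

/* Stand-in for a userfaultfd fork context queued during dup_mmap(). */
struct toy_fork_ctx {
	struct toy_fork_ctx *next;
	int refs;
};

/* Success path: deliver the fork event to userland, then release. */
static void toy_dup_complete(struct toy_fork_ctx *list)
{
	for (struct toy_fork_ctx *c = list; c; c = c->next) {
		printf("deliver UFFD_EVENT_FORK, refs %d -> %d\n",
		       c->refs, c->refs - 1);
		c->refs--;
	}
}

/*
 * Failure path: walk the same list, but only drop the references; the
 * half-built address space is never exposed to userland.
 */
static void toy_dup_fail(struct toy_fork_ctx *list)
{
	for (struct toy_fork_ctx *c = list; c; c = c->next) {
		printf("drop ref, refs %d -> %d\n", c->refs, c->refs - 1);
		c->refs--;
	}
}

int main(void)
{
	struct toy_fork_ctx b = { NULL, 1 }, a = { &b, 1 };

	toy_dup_complete(&a);	/* the dup_userfaultfd_complete() shape */
	a.refs = b.refs = 1;
	toy_dup_fail(&a);	/* the new failure-path shape */
	return 0;
}
```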
Rafael Aquini 15c37b2ac9 fork: use __mt_dup() to duplicate maple tree in dup_mmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * kernel/fork.c: differences on the 3rd and 4th hunks are due to out-of-order
    backport of commit 35e351780fa9 ("fork: defer linking file vma until vma is
    fully initialized"), and we have one RHEL-only hunk here that reverts commit
    2b4f3b4987b5 ("fork: lock VMAs of the parent process when forking") which was
    a temporary measure upstream that ended up in a dead branch after v6.6;

This patch is a backport of the following upstream commit:
commit d2406291483775ecddaee929231a39c70c08fda2
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date:   Fri Oct 27 11:38:45 2023 +0800

    fork: use __mt_dup() to duplicate maple tree in dup_mmap()

    In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
    directly replacing the entries of VMAs in the new maple tree can result in
    better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
    tree, so it is efficient.

    The average time complexity of __mt_dup() is O(n), where n is the number
    of VMAs.  The proof of the time complexity is provided in the commit log
    that introduces __mt_dup().  After duplicating the maple tree, each
    element is traversed and replaced (ignoring the cases of deletion, which
    are rare).  Since it is only a replacement operation for each element,
    this process is also O(n).

    Analyzing the exact time complexity of the previous algorithm is
    challenging because each insertion can involve appending to a node,
    pushing data to adjacent nodes, or even splitting nodes.  The frequency of
    each action is difficult to calculate.  The worst-case scenario for a
    single insertion is when the tree undergoes splitting at every level.  If
    we consider each insertion as the worst-case scenario, we can determine
    that the upper bound of the time complexity is O(n*log(n)), although this
    is a loose upper bound.  However, based on the test data, it appears that
    the actual time complexity is likely to be O(n).

    As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
    fails, there will be a portion of VMAs that have not been duplicated in
    the maple tree.  To handle this, we mark the failure point with
    XA_ZERO_ENTRY.  In exit_mmap(), if this marker is encountered, stop
    releasing VMAs that have not been duplicated after this point.

    There is a "spawn" in byte-unixbench[1], which can be used to test the
    performance of fork().  I modified it slightly to make it work with
    different number of VMAs.

    Below are the test results.  The first row shows the number of VMAs.  The
    second and third rows show the number of fork() calls per ten seconds,
    corresponding to next-20231006 and this patchset, respectively.  The
    test results were obtained with CPU binding to avoid scheduler load
    balancing that could cause unstable results.  There are still some
    fluctuations in the test results, but at least they are better than the
    original performance.

    21     121   221    421    821    1621   3221   6421   12821  25621  51221
    112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
    114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
    2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%

    [1] https://github.com/kdlucas/byte-unixbench/tree/master

    Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Mateusz Guzik <mjguzik@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Mike Christie <michael.christie@oracle.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:34 -05:00
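
A userspace sketch of the duplicate-then-replace scheme, with a sentinel standing in for XA_ZERO_ENTRY: the whole index is copied first, each slot is then replaced with the child's own copy, and on failure the stop point is marked so teardown never walks past entries that were never duplicated. All names below are invented for the sketch:

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_VMAS		6
#define TOY_ZERO_ENTRY	((void *)1)	/* sentinel marking the failure point */

struct toy_vma { int id; };

/* Step 1: bulk-copy the whole index (the role __mt_dup() plays for the tree). */
static void toy_mt_dup(struct toy_vma **dst, struct toy_vma **src)
{
	memcpy(dst, src, NR_VMAS * sizeof(*dst));
}

/* Step 2: replace each slot with a child-owned copy; mark the failure point. */
static int toy_dup_mmap(struct toy_vma **child, struct toy_vma **parent, int fail_at)
{
	toy_mt_dup(child, parent);
	for (int i = 0; i < NR_VMAS; i++) {
		if (i == fail_at) {
			child[i] = TOY_ZERO_ENTRY;
			return -1;
		}
		child[i] = malloc(sizeof(**child));
		child[i]->id = parent[i]->id;
	}
	return 0;
}

/* Teardown: stop at the sentinel; later slots were never duplicated. */
static void toy_exit_mmap(struct toy_vma **child)
{
	for (int i = 0; i < NR_VMAS && child[i] != TOY_ZERO_ENTRY; i++)
		free(child[i]);
}

int main(void)
{
	struct toy_vma p[NR_VMAS], *parent[NR_VMAS], *child[NR_VMAS];

	for (int i = 0; i < NR_VMAS; i++) {
		p[i].id = i;
		parent[i] = &p[i];
	}
	if (toy_dup_mmap(child, parent, 4))
		puts("dup failed, tearing down only what was copied");
	toy_exit_mmap(child);
	return 0;
}
```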
Audra Mitchell bc133597f4 lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
JIRA: https://issues.redhat.com/browse/RHEL-55462
Conflicts:
  Minor extra RHEL-only hunk to create the required CONFIG_DEBUG_VM_SHOOT_LAZIES
  file under RHEL's config database.

This patch is a backport of the following upstream commit:
commit 2655421ae69fa479df1575cb2630af9131d28939
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Feb 3 17:18:36 2023 +1000

    lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

    On big systems, the mm refcount can become highly contended when doing a
    lot of context switching with threaded applications.  user<->idle switch
    is one of the important cases.  Abandoning lazy tlb entirely slows this
    switching down quite a bit in the common uncontended case, so that is not
    viable.

    Implement a scheme where lazy tlb mm references do not contribute to the
    refcount, instead they get explicitly removed when the refcount reaches
    zero.

    The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
    switch away from this mm to init_mm if it was being used as the lazy tlb
    mm.  Enabling the shoot lazies option therefore requires that the arch
    ensures that mm_cpumask contains all CPUs that could possibly be using mm.
    A DEBUG_VM option IPIs every CPU in the system after this to ensure there
    are no references remaining before the mm is freed.

    Shootdown IPIs cost could be an issue, but they have not been observed to
    be a serious problem with this scheme, because short-lived processes tend
    not to migrate CPUs much, therefore they don't get much chance to leave
    lazy tlb mm references on remote CPUs.  There are a lot of options to
    reduce them if necessary, described in comments.

    The near-worst-case can be benchmarked with will-it-scale:

      context_switch1_threads -t $(($(nproc) / 2))

    This will create nproc threads (nproc / 2 switching pairs) all sharing the
    same mm that spread over all CPUs so each CPU does thread->idle->thread
    switching.

    [ Rik came up with basically the same idea a few years ago, so credit
      to him for that. ]

    Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
    Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
    Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:17 -05:00
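
A minimal model of the non-refcounting scheme: lazy users take no reference, and the final drop walks the CPUs (standing in for the IPI broadcast to mm_cpumask) to redirect any remaining lazy references to init_mm. The toy_* types are invented for the sketch:

```
#include <stdio.h>

#define NR_CPUS 4

struct toy_mm { int refcount; };

static struct toy_mm *lazy_mm_on_cpu[NR_CPUS];	/* per-CPU lazy tlb reference */
static struct toy_mm init_mm = { .refcount = 1 };

/* Lazy use: the CPU keeps the mm loaded but takes no reference on it. */
static void toy_enter_lazy(int cpu, struct toy_mm *mm)
{
	lazy_mm_on_cpu[cpu] = mm;
}

/*
 * Final drop: instead of waiting for lazy users to release references,
 * shoot them down: every CPU still pointing at this mm is switched over
 * to init_mm before the mm is freed.
 */
static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->refcount)
		return;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)	/* models the IPI broadcast */
		if (lazy_mm_on_cpu[cpu] == mm)
			lazy_mm_on_cpu[cpu] = &init_mm;
	puts("mm freed, lazy users redirected to init_mm");
}

int main(void)
{
	struct toy_mm mm = { .refcount = 1 };

	toy_enter_lazy(1, &mm);
	toy_enter_lazy(3, &mm);
	toy_mmdrop(&mm);		/* last real reference */
	return 0;
}
```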
Waiman Long 49848304ee rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon, 5 Feb 2024 13:10:19 -0800

    rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks

    Holding a mutex across synchronize_rcu_tasks() and acquiring
    that same mutex in code called from do_exit() after its call to
    exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
    results in deadlock.  This is by design, because tasks that are far
    enough into do_exit() are no longer present on the tasks list, making
    it a bit difficult for RCU Tasks to find them, let alone wait on them
    to do a voluntary context switch.  However, such deadlocks are becoming
    more frequent.  In addition, lockdep currently does not detect such
    deadlocks and they can be difficult to reproduce.

    In addition, if a task voluntarily context switches during that time
    (for example, if it blocks acquiring a mutex), then this task is in an
    RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
    just as well take advantage of that fact.

    This commit therefore initializes the data structures that will be needed
    to rely on these quiescent states and to eliminate these deadlocks.

    Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

    Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
    Reported-by: Yang Jihong <yangjihong1@huawei.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Tested-by: Yang Jihong <yangjihong1@huawei.com>
    Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:18 -04:00
Rafael Aquini 81cf488de5 fork: lock VMAs of the parent process when forking
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit fb49c455323ff8319a123dd312be9082c49a23a5
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Sat Jul 8 12:12:12 2023 -0700

    fork: lock VMAs of the parent process when forking

    When forking a child process, the parent write-protects anonymous pages
    and COW-shares them with the child being forked using copy_present_pte().

    We must not take any concurrent page faults on the source vma's as they
    are being processed, as we expect both the vma and the pte's behind it
    to be stable.  For example, the anon_vma_fork() expects the parents
    vma->anon_vma to not change during the vma copy.

    A concurrent page fault on a page newly marked read-only by the page
    copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
    source vma, defeating the anon_vma_clone() that wasn't done because the
    parent vma originally didn't have an anon_vma, but we now might end up
    copying a pte entry for a page that has one.

    Before the per-vma lock based changes, the mmap_lock guaranteed
    exclusion with concurrent page faults.  But now we need to do a
    vma_start_write() to make sure no concurrent faults happen on this vma
    while it is being processed.

    This fix can potentially regress some fork-heavy workloads.  Kernel
    build time did not show noticeable regression on a 56-core machine while
    a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
    shows ~5% regression.  If such fork time regression is unacceptable,
    disabling CONFIG_PER_VMA_LOCK should restore its performance.  Further
    optimizations are possible if this regression proves to be problematic.

    Suggested-by: David Hildenbrand <david@redhat.com>
    Reported-by: Jiri Slaby <jirislaby@kernel.org>
    Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
    Reported-by: Jacob Young <jacobly.alt@gmail.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
    Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
    Cc: stable@vger.kernel.org
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:22 -04:00
Rafael Aquini 6f66ad6087 fork: lock VMAs of the parent process when forking
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit 2b4f3b4987b56365b981f44a7e843efa5b6619b9
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Wed Jul 5 18:13:59 2023 -0700

    fork: lock VMAs of the parent process when forking

    Patch series "Avoid memory corruption caused by per-VMA locks", v4.

    A memory corruption was reported in [1] with bisection pointing to the
    patch [2] enabling per-VMA locks for x86.  Based on the reproducer
    provided in [1] we suspect this is caused by the lack of VMA locking while
    forking a child process.

    Patch 1/2 in the series implements proper VMA locking during fork.  I
    tested the fix locally using the reproducer and was unable to reproduce
    the memory corruption problem.

    This fix can potentially regress some fork-heavy workloads.  Kernel build
    time did not show noticeable regression on a 56-core machine while a
    stress test mapping 10000 VMAs and forking 5000 times in a tight loop
    shows ~7% regression.  If such fork time regression is unacceptable,
    disabling CONFIG_PER_VMA_LOCK should restore its performance.  Further
    optimizations are possible if this regression proves to be problematic.

    Patch 2/2 disables per-VMA locks until the fix is tested and verified.

    This patch (of 2):

    When forking a child process, parent write-protects an anonymous page and
    COW-shares it with the child being forked using copy_present_pte().
    Parent's TLB is flushed right before we drop the parent's mmap_lock in
    dup_mmap().  If we get a write-fault before that TLB flush in the parent,
    and we end up replacing that anonymous page in the parent process in
    do_wp_page() (because, COW-shared with the child), this might lead to some
    stale writable TLB entries targeting the wrong (old) page.  Similar issue
    happened in the past with userfaultfd (see flush_tlb_page() call inside
    do_wp_page()).

    Lock VMAs of the parent process when forking a child, which prevents
    concurrent page faults during fork operation and avoids this issue.  This
    fix can potentially regress some fork-heavy workloads.  Kernel build time
    did not show noticeable regression on a 56-core machine while a stress
    test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7%
    regression.  If such fork time regression is unacceptable, disabling
    CONFIG_PER_VMA_LOCK should restore its performance.  Further optimizations
    are possible if this regression proves to be problematic.

    Link: https://lkml.kernel.org/r/20230706011400.2949242-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20230706011400.2949242-2-surenb@google.com
    Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Reported-by: Jiri Slaby <jirislaby@kernel.org>
    Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
    Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
    Reported-by: Jacob Young <jacobly.alt@gmail.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:18 -04:00
Lucas Zampieri bf8d4449d4 Merge: fork: defer linking file vma until vma is fully initialized
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4349

JIRA: https://issues.redhat.com/browse/RHEL-35022  
CVE: CVE-2024-27022  
  
Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-17 12:51:50 +00:00
Lucas Zampieri ef511f48cf Merge: cgroup: Reapply the cgroup commits merged in RHEL-34600
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4257

JIRA: https://issues.redhat.com/browse/RHEL-36683    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4257

CS9 MR #4128 ("cgroup: Backport upstream cgroup commits up to v6.8")
and MR #3975 ("Scheduler: rhel9.5 updates") have 2 commits in common.

 - commit 79462e8c879a ("sched: don't account throttle time for empty groups")
 - commit 677ea015f231 ("sched: add throttled time stat for throttled children")

However, the merging of MR #4128 on top of MR #3975 produced an
unexpected result that the 2 new functions (cgroup_local_stat_show &
cpu_local_stat_show) introduced by commit 677ea015f231 ("sched: add
throttled time stat for throttled children") were duplicated resulting
in build failure. This leads to the revert of MR #4128 by the maintainer
to address the build problem.

With the merge and revert commits in place, it is now no longer possible
to rebase the original MR on top of main. There are now two ways to
work around it. In either case, a new MR and a Jira issue will be needed.

 1) Reapplying all the patches except the 2 duplicated ones on top of main.
 2) Revert the revert commit and remove the duplicated functions.

The second alternative is chosen as it will be easier to review and 
we don't need to duplicate the commits with a different set of git
hashes. The first patch is just the revert of the revert as all the
code changes had been reviewed before. The second patch removes the
duplicated functions.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-05 20:05:37 +00:00
Lucas Zampieri 304e2a4e29 Merge: [RHEL-9.5.0] iommu and dma mapping api updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4151

# Merge Request Required Information

JIRA: https://issues.redhat.com/browse/RHEL-28780  
JIRA: https://issues.redhat.com/browse/RHEL-12083  
JIRA: https://issues.redhat.com/browse/RHEL-12322  
JIRA: https://issues.redhat.com/browse/RHEL-29105  
JIRA: https://issues.redhat.com/browse/RHEL-29357  
JIRA: https://issues.redhat.com/browse/RHEL-29359  

Omitted-fix: ed8b94f6e0ac ("powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add")  
     - Reverted by 1fba2bf8e9d5 ("Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add"")

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git  
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git  
                 branch: next   
Tested: In progress  

	- general cki coverage  

	- Nvidia testing arm-smmu-v3 and iommufd related changes they have requested.  

	- Multiple rounds testing of amd_iommu, intel_iommu, and arm-smmu-v3 with  
	  various iommu configurations with disk i/o using fio,  
	  covering lazy iotlb invalidation, strict iotlb invalidation,  
	  and passthrough. Also tested with forcedac set. Intel  
	  Scalable Mode capable systems tested with the iotlb invalidation  
	  policies, and passthrough with scalable mode enabled, and disabled.  
	  AMD systems tested with v1 page tables and v2.

	- Tested booting with various iommu configurations, and verifying system  
	  in correct state on AMD, Intel, and ARM.  

	- Limited test on ppc64le. The system I had access to was  
	  setting up a 64-bit bypass window, and using dma_direct  
	  calls.  It ran, but since I don't normally touch ppc64le  
	  iommu code, I need to investigate more or get IBM assistance  
	  to more thoroughly test it.  

	- Working on getting testing assistance from IBM for the s390x changes.  

## Summary of Changes


This brings iommu, iommufd, and the dma mapping api up to 6.9 with some additions from Joerg's  
next branch, minus some commits from a 6.9 SEV-SNP pull for AMD. Some highlights:  

- The removal of the amd_iommu_v2 code, and the addition of its replacement based on the    
  iommu core SVA api, along with a re-org of the amd_iommu code.  
- The migration of s390 to the iommu core dma-iommu dma ops implementation, joining Intel,  
  AMD, and ARM as users of the same code base.  
- The beginnings of a re-work of the arm-smmu-v3 driver by Jason, and others.  
- A number of changes to iommufd as it continues to get fleshed out.  
- IOPT memory usage observability (code that was basis for talk at LPC last year)  

  Example output in vmstat files:  

```
    # grep iommu /sys/devices/system/node/node*/vmstat  
    /sys/devices/system/node/node0/vmstat:nr_iommu_pages 342  
    /sys/devices/system/node/node1/vmstat:nr_iommu_pages 0  
```

- Continued work on shared virtual addressing and io page faulting (PRI).  
- Dynamic swiotlb memory pools. This is not enabled yet, as they still seem to be  
  shaking out issues upstream, but the code is in place now.  
- Re-working of iommu core domain allocation.  

Note: iommufd selftest is being enabled in separate work that has been delegated to    
      another engineer starting to help with iommu. So that will be enabled in the  
      next few weeks to add more coverage for iommufd.  

Conflict-wise, they should be noted in the individual commits, but  
not too bad overall. 13/30 were dropping unsupported bits, and another  
8 were context diffs. A couple were caused by out-of-order backports due  
to fixes, and a couple were upstream conflicts from colliding patchsets that  
had to be resolved in the merge commits.  

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Lenny Szubowicz <lszubowi@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-05 20:03:50 +00:00
Rafael Aquini 2c0f74959e fork: defer linking file vma until vma is fully initialized
JIRA: https://issues.redhat.com/browse/RHEL-35022
CVE: CVE-2024-27022
Conflicts:
    * context differences due to RHEL missing upstream commit d24062914837
      ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()"), which
      is part of a v6.8 series targeting performance enhancements for the
      maple tree VMA iterator (unrelated to the security fix in this port)

This patch is a backport of the following upstream commit:
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Wed Apr 10 17:14:41 2024 +0800

    fork: defer linking file vma until vma is fully initialized

    Thorvald reported a WARNING [1]. And the root cause is below race:

     CPU 1                                  CPU 2
     fork                                   hugetlbfs_fallocate
      dup_mmap                               hugetlbfs_punch_hole
       i_mmap_lock_write(mapping);
       vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
       i_mmap_unlock_write(mapping);
       hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                             i_mmap_lock_write(mapping);
                                             hugetlb_vmdelete_list
                                              vma_interval_tree_foreach
                                               hugetlb_vma_trylock_write -- Vma_lock is cleared.
       tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                               hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                             i_mmap_unlock_write(mapping);

    hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the
    i_mmap_rwsem lock while the vma lock can be used at the same time.  Fix this
    by deferring linking file vma until vma is fully initialized.  Those vmas
    should be initialized first before they can be used.

    Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
    Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reported-by: Thorvald Natvig <thorvald@google.com>
    Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
    Reviewed-by: Jane Chu <jane.chu@oracle.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Mateusz Guzik <mjguzik@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Tycho Andersen <tandersen@netflix.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-05-28 09:31:39 -04:00
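
The shape of the fix in miniature: complete every piece of per-vma initialisation first, and only then make the copy reachable through the shared mapping. The helpers below are stand-ins for the hugetlb/i_mmap code, not the actual functions:

```
#include <stdbool.h>
#include <stdio.h>

struct toy_vma {
	bool private_state_ready;	/* e.g. the hugetlb vma_lock */
	bool visible_in_i_mmap;		/* reachable by concurrent walkers */
};

static void toy_init_private_state(struct toy_vma *v)
{
	v->private_state_ready = true;
}

static void toy_vm_ops_open(struct toy_vma *v)
{
	(void)v;	/* further per-vma setup */
}

static void toy_link_into_i_mmap(struct toy_vma *v)
{
	/*
	 * Anything reachable from here may be walked concurrently (as in the
	 * hugetlbfs_punch_hole race above), so it must already be fully set up.
	 */
	v->visible_in_i_mmap = true;
}

int main(void)
{
	struct toy_vma child = { 0 };

	/* Fixed ordering: initialise first, then publish. */
	toy_init_private_state(&child);
	toy_vm_ops_open(&child);
	toy_link_into_i_mmap(&child);

	printf("ready=%d visible=%d\n",
	       child.private_state_ready, child.visible_in_i_mmap);
	return 0;
}
```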
Waiman Long 6d0328a7cf Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8""
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only

This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-18 21:38:20 -04:00
Lucas Zampieri 08637d76a2 Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
This reverts merge request !4128
2024-05-16 15:26:41 +00:00
Lucas Zampieri 1ce55b7cbb Merge: cgroup: Backport upstream cgroup commits up to v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

JIRA: https://issues.redhat.com/browse/RHEL-34600    
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

This MR backports upstream cgroup commits up to v6.8 with related fixes,
if applicable. It also pulls in a number of scheduler and PSI related
commits due to their interaction with cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:28:22 +00:00
Jerry Snitselaar 2edf07cd91 iommu: Change kconfig around IOMMU_SVA
JIRA: https://issues.redhat.com/browse/RHEL-28780
JIRA: https://issues.redhat.com/browse/RHEL-29105
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: Context diff in mm/Kconfig, and RH_KABI macro used in include/linux/sched.h

commit 8f23f5dba6b4693448144bde4dd6f537543442c2
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Fri Oct 27 08:05:20 2023 +0800

    iommu: Change kconfig around IOMMU_SVA

    Linus suggested that the kconfig here is confusing:

    https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/

    Let's break it into three kconfigs controlling distinct things:

     - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional
       fields for the IOMMU. Currently only PASID, but later patches store
       a struct iommu_mm_data *

     - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit
       for keeping track of the ENQCMD instruction. x86 will select this if
       IOMMU_SVA is enabled

     - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code
       for iommu driver use and the IOMMU exported API

    This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID

    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

(cherry picked from commit 8f23f5dba6b4693448144bde4dd6f537543442c2)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-05-13 08:18:36 -07:00
Nico Pache 12c51977be mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
commit 24e41bf8a6b424c76c5902fb999e9eca61bdf83d
Author: Florent Revest <revest@chromium.org>
Date:   Mon Aug 28 17:08:57 2023 +0200

    mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl

    This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
    the process doesn't want MDWE protection to propagate to children.

    To implement this no-inherit mode, the tag in current->mm->flags must be
    absent from MMF_INIT_MASK.  This means that the encoding for "MDWE but
    without inherit" is different in the prctl than in the mm flags.  This
    leads to a bit of bit-mangling in the prctl implementation.

    Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org
    Signed-off-by: Florent Revest <revest@chromium.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Alexey Izbyshev <izbyshev@ispras.ru>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Ayush Jain <ayush.jain3@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Joey Gouly <joey.gouly@arm.com>
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: Mark Brown <broonie@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
    Cc: Topi Miettinen <toiwoton@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
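
A small userspace example of the new mode; the fallback defines mirror the prctl values added by this series and are only there so the example builds against older headers:

```
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#define PR_GET_MDWE 66
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN	(1UL << 0)
#endif
#ifndef PR_MDWE_NO_INHERIT
#define PR_MDWE_NO_INHERIT		(1UL << 1)	/* added by this series */
#endif

int main(void)
{
	/* Enable MDWE for this process only; children forked later start clean. */
	if (prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT,
		  0L, 0L, 0L) != 0) {
		perror("PR_SET_MDWE");
		return 1;
	}
	printf("MDWE bits: %d\n", (int)prctl(PR_GET_MDWE, 0L, 0L, 0L, 0L));
	return 0;
}
```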
Chris von Recklinghausen 81f271e685 sched/numa: apply the scan delay to every new vma
Conflicts: mm/mmap.c, include/linux/mm_types.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit ef6a22b70f6d90449a5c797b8968a682824e2011
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Wed Mar 1 17:49:00 2023 +0530

    sched/numa: apply the scan delay to every new vma

    Patch series "sched/numa: Enhance vma scanning", v3.

    The patchset proposes one of the enhancements to numa vma scanning
    suggested by Mel.  This is continuation of [3].

    Reposting the rebased patchset to akpm mm-unstable tree (March 1)

    Existing mechanism of scan period involves a scan period derived from
    per-thread stats.  Process Adaptive autoNUMA [1] proposed to gather NUMA
    fault stats at per-process level to capture application behaviour better.

    During that course of discussion, Mel proposed several ideas to enhance
    current numa balancing.  One of the suggestion was below

    Track what threads access a VMA.  The suggestion was to use an unsigned
    long pid_mask and use the lower bits to tag approximately what threads
    access a VMA.  Skip VMAs that did not trap a fault.  This would be
    approximate because of PID collisions but would reduce scanning of areas
    the thread is not interested in.  The above suggestion intends not to
    penalize threads that has no interest in the vma, thus reduce scanning
    overhead.

    V3 changes are mostly based on PeterZ comments (details below in changes)

    Summary of patchset:

    Current patchset implements:

    1. Delay the vma scanning logic for newly created VMA's so that
       additional overhead of scanning is not incurred for short lived tasks
       (implementation by Mel)

    2. Store the information of tasks accessing VMA in 2 windows.  It is
       regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval.
       The above time is derived from experimenting (Suggested by PeterZ) to
       balance between frequent clearing vs obsolete access data

    3. hash_32 used to encode task index accessing VMA information

    4. VMA's access information is used to skip scanning for the tasks
       which had not accessed VMA

    Changes since V2:
    patch1:
     - Renaming of structure, macro to function,
     - Add explanation to heuristics
     - Adding more details from result (PeterZ)
     Patch2:
     - Usage of test and set bit (PeterZ)
     - Move storing access PID info to numa_migrate_prep()
     - Add a note on fairness among tasks allowed to scan
       (PeterZ)
     Patch3:
     - Maintain two windows of access PID information
      (PeterZ supported implementation and Gave idea to extend
       to N if needed)
     Patch4:
     - Apply hash_32 function to track VMA accessing PIDs (PeterZ)

    Changes since RFC V1:
     - Include Mel's vma scan delay patch
     - Change the accessing pid store logic (Thanks Mel)
     - Fencing structure / code to NUMA_BALANCING (David, Mel)
     - Adding clearing access PID logic (Mel)
     - Descriptive change log ( Mike Rapoport)

    Things to ponder over:
    ==========================================

    - Improvement to clearing accessing PIDs logic (discussed in-detail in
      patch3 itself (Done in this patchset by implementing 2 window history)

    - Current scan period is not changed in the patchset, so we do see
      frequent tries to scan.  Relaxing scan period dynamically could improve
      results further.

    [1] sched/numa: Process Adaptive autoNUMA
     Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

    [2] RFC V1 Link:
      https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

    [3] V2 Link:
      https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/

    Results:
    Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement
    is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
    (dbench had huge std deviation to post)

    kernbench
    ===========
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
    Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
    Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*

    Duration User       66017.43    67959.84
    Duration System     30503.15    24657.03
    Duration Elapsed      504.61      493.12

                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Ops NUMA alloc hit                1738835089.00  1738780310.00
    Ops NUMA alloc local              1738834448.00  1738779711.00
    Ops NUMA base-page range updates      477310.00      392566.00
    Ops NUMA PTE updates                  477310.00      392566.00
    Ops NUMA hint faults                   96817.00       87555.00
    Ops NUMA hint local faults %           10150.00        2192.00
    Ops NUMA hint local percent               10.48           2.50
    Ops NUMA pages migrated                86660.00       85363.00
    Ops AutoNUMA cost                        489.07         442.14

    autonumabench
    ===============
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
    Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
    Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
    Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
    Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
    Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
    Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
    Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*

    Duration User      396433.47   324835.96
    Duration System      2808.70      376.66
    Duration Elapsed     2258.61     2258.12

                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Ops NUMA alloc hit                  59921806.00    49623489.00
    Ops NUMA alloc miss                        0.00           0.00
    Ops NUMA interleave hit                    0.00           0.00
    Ops NUMA alloc local                59920880.00    49622594.00
    Ops NUMA base-page range updates   152259275.00       50075.00
    Ops NUMA PTE updates               152259275.00       50075.00
    Ops NUMA PMD updates                       0.00           0.00
    Ops NUMA hint faults               154660352.00       39014.00
    Ops NUMA hint local faults %       138550501.00       23139.00
    Ops NUMA hint local percent               89.58          59.31
    Ops NUMA pages migrated              8179067.00       14147.00
    Ops AutoNUMA cost                     774522.98         195.69

    This patch (of 4):

    Currently whenever a new task is created we wait for
    sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead.
    Extend the same logic to new or very short-lived VMAs.

    [raghavendra.kt@amd.com: add initialization in vm_area_dup()]
    Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
    Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Disha Talreja <dishaa.talreja@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:45 -04:00
Chris von Recklinghausen 88cd4097c1 mm: separate vma->lock from vm_area_struct
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:32 2023 -0800

    mm: separate vma->lock from vm_area_struct

    vma->lock being part of the vm_area_struct causes performance regression
    during page faults because during contention its count and owner fields
    are constantly updated and having other parts of vm_area_struct used
    during page fault handling next to them causes constant cache line
    bouncing.  Fix that by moving the lock outside of the vm_area_struct.

    All attempts to keep vma->lock inside vm_area_struct in a separate cache
    line still produce performance regression especially on NUMA machines.
    Smallest regression was achieved when lock is placed in the fourth cache
    line but that bloats vm_area_struct to 256 bytes.

    Considering performance and memory impact, separate lock looks like the
    best option.  It increases memory footprint of each VMA but that can be
    optimized later if the new size causes issues.  Note that after this
    change vma_init() does not allocate or initialize vma->lock anymore.  A
    number of drivers allocate a pseudo VMA on the stack but they never use
    the VMA's lock, therefore it does not need to be allocated.  The future
    drivers which might need the VMA lock should use
    vm_area_alloc()/vm_area_free() to allocate the VMA.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:45 -04:00
Chris von Recklinghausen 624b7d40f8 mm/mmap: free vm_area_struct without call_rcu in exit_mmap
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:31 2023 -0800

    mm/mmap: free vm_area_struct without call_rcu in exit_mmap

    call_rcu() can take a long time when callback offloading is enabled.  Its
    use in the vm_area_free can cause regressions in the exit path when
    multiple VMAs are being freed.

    Because exit_mmap() is called only after the last mm user drops its
    refcount, the page fault handlers can't be racing with it.  Any other
    possible user like oom-reaper or process_mrelease are already synchronized
    using mmap_lock.  Therefore exit_mmap() can free VMAs directly, without
    the use of call_rcu().

    Expose __vm_area_free() and use it from exit_mmap() to avoid possible
    call_rcu() floods and performance regressions caused by it.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:45 -04:00
Chris von Recklinghausen 4faa49fc27 kernel/fork: assert no VMA readers during its destruction
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit f2e13784c16a98e269b3111ac02ae44446dd589c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:19 2023 -0800

    kernel/fork: assert no VMA readers during its destruction

    Assert there are no holders of VMA lock for reading when it is about to be
    destroyed.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-21-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:42 -04:00
Chris von Recklinghausen 06ea40f7dc mm: add per-VMA lock and helper functions to control it
Conflicts: include/linux/mm.h, include/linux/mmap_lock.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 5e31275cc997f8ec5d9e8d65fe9840ebed89db19
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:11 2023 -0800

    mm: add per-VMA lock and helper functions to control it

    Introduce per-VMA locking.  The lock implementation relies on per-vma
    and per-mm sequence counters to note exclusive locking:

      - read lock - (implemented by vma_start_read) requires the vma
        (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
        If they match then there must be a vma exclusive lock held somewhere.
      - read unlock - (implemented by vma_end_read) is a trivial vma->lock
        unlock.
      - write lock - (vma_start_write) requires the mmap_lock to be held
        exclusively and the current mm counter is assigned to the vma counter.
        This will allow multiple vmas to be locked under a single mmap_lock
        write lock (e.g. during vma merging). The vma counter is modified
        under exclusive vma lock.
      - write unlock - (vma_end_write_all) is a batch release of all vma
        locks held. It doesn't pair with a specific vma_start_write! It is
        done before exclusive mmap_lock is released by incrementing mm
        sequence counter (mm_lock_seq).
      - write downgrade - if the mmap_lock is downgraded to the read lock, all
        vma write locks are released as well (effectively the same as write
        unlock).

    Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:40 -04:00
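A rough sketch of the sequence-counter scheme spelled out in the list above (simplified: no lockdep or memory-ordering annotations, and the rw_semaphore is still embedded in the VMA at this point in the series):

```
static inline bool vma_start_read(struct vm_area_struct *vma)
{
	/* Matching counters mean the VMA is write-locked somewhere. */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	if (!down_read_trylock(&vma->lock))
		return false;

	/* Recheck under the lock in case a writer raced in. */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
		up_read(&vma->lock);
		return false;
	}
	return true;
}

static inline void vma_start_write(struct vm_area_struct *vma)
{
	mmap_assert_write_locked(vma->vm_mm);

	/* Already write-locked during this mmap_lock write cycle. */
	if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
		return;

	down_write(&vma->lock);
	vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
	up_write(&vma->lock);
}

/* "Write unlock" is batched: bumping mm_lock_seq releases every vma. */
static inline void vma_end_write_all(struct mm_struct *mm)
{
	mmap_assert_write_locked(mm);
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
```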
Chris von Recklinghausen 3f6e1d9cb8 mm: rcu safe VMA freeing
Conflicts: include/linux/mm_types.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 20cce633f4254cc0df39665449726e3172518f6c
Author: Michel Lespinasse <michel@lespinasse.org>
Date:   Mon Feb 27 09:36:09 2023 -0800

    mm: rcu safe VMA freeing

    This prepares for page faults handling under VMA lock, looking up VMAs
    under protection of an rcu read lock, instead of the usual mmap read lock.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
    Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:39 -04:00
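Roughly, the change routes the final free through an RCU callback once per-VMA locking is configured, so that rcu_read_lock() walkers never see a freed VMA. A hedged sketch, not the verbatim upstream code:

```
#ifdef CONFIG_PER_VMA_LOCK
static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	kmem_cache_free(vm_area_cachep, vma);
}
#endif

void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_PER_VMA_LOCK
	/* Readers may still be inspecting this VMA under rcu_read_lock(). */
	call_rcu(&vma->vm_rcu, __vm_area_free);
#else
	kmem_cache_free(vm_area_cachep, vma);
#endif
}
```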
Aristeu Rozanski 9280e1a4aa mm: enable maple tree RCU mode by default
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 3dd4432549415f3c65dd52d5c687629efbf4ece1
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Mon Feb 27 09:36:07 2023 -0800

    mm: enable maple tree RCU mode by default

    Use the maple tree in RCU mode for VMA tracking.

    The maple tree tracks the stack and is able to update the pivot
    (lower/upper boundary) in-place to allow the page fault handler to write
    to the tree while holding just the mmap read lock.  This is safe as the
    writes to the stack have a guard VMA which ensures there will always be a
    NULL in the direction of the growth and thus will only update a pivot.

    It is possible, but not recommended, to have VMAs that grow up/down
    without guard VMAs.  syzbot has constructed a testcase which sets up a VMA
    to grow and consume the empty space.  Overwriting the entire NULL entry
    causes the tree to be altered in a way that is not safe for concurrent
    readers; the readers may see a node being rewritten or one that does not
    match the maple state they are using.

    Enabling RCU mode allows the concurrent readers to see a stable node and
    will return the expected result.

    [Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
    Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
    Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
    Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:26 -04:00
Aristeu Rozanski fec82fff3c mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:48 2023 -0800

    mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

    To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace
    it with VM_LOCKED_MASK bitmask and convert all users.

    Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
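In effect the commit flips the sense of the mask so it can be handed straight to the vm_flags modifier helpers; roughly:

```
/* Before: callers masked with the complement. */
#define VM_LOCKED_CLEAR_MASK	(~(VM_LOCKED | VM_LOCKONFAULT))
	/* vma->vm_flags &= VM_LOCKED_CLEAR_MASK; */

/* After: a positive mask that vm_flags_clear() can take directly. */
#define VM_LOCKED_MASK		(VM_LOCKED | VM_LOCKONFAULT)
	/* vm_flags_clear(vma, VM_LOCKED_MASK); */
```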
Aristeu Rozanski 314860cbc3 kernel/fork: convert vma assignment to a memcpy
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 06e78b614e3780f9ac32056f2861159fd19d9702
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:46 2023 -0800

    kernel/fork: convert vma assignment to a memcpy

    Patch series "introduce vm_flags modifier functions", v4.

    This patchset was originally published as a part of per-VMA locking [1]
    and was split after suggestion that it's viable on its own and to
    facilitate the review process.  It is now a prerequisite for the next
    version of per-VMA lock patchset, which reuses vm_flags modifier functions
    to lock the VMA when vm_flags are being updated.

    VMA vm_flags modifications are usually done under exclusive mmap_lock
    protection because this attribute affects other decisions like VMA merging
    or splitting and races should be prevented.  Introduce vm_flags modifier
    functions to enforce correct locking.

    This patch (of 7):

    Convert vma assignment in vm_area_dup() to a memcpy() to prevent compiler
    errors when we add a const modifier to vma->vm_flags.

    Link: https://lkml.kernel.org/r/20230126193752.297968-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20230126193752.297968-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
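A hedged sketch of what the conversion looks like in vm_area_dup(): a plain struct assignment would stop compiling once vm_flags gains a const qualifier, so the copy goes through memcpy() instead:

```
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
	struct vm_area_struct *new;

	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
	if (!new)
		return NULL;

	/*
	 * memcpy() rather than "*new = *orig": the assignment form is
	 * rejected by the compiler once vm_flags is declared const.
	 */
	memcpy(new, orig, sizeof(*new));
	return new;
}
```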
Aristeu Rozanski e6172d44ab kernel/fork: convert forking to using the vmi iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 3b9dbd5e91b11911d21effbb80d1976fb21660df
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:10 2023 -0500

    kernel/fork: convert forking to using the vmi iterator

    Avoid using the maple tree interface directly.  This gains type safety.

    Link: https://lkml.kernel.org/r/20230120162650.984577-10-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:13 -04:00
Waiman Long 724656e7cf freezer,sched: Rewrite core freezer logic
JIRA: https://issues.redhat.com/browse/RHEL-34600
Conflicts:
 1) A merge conflict in the kernel/signal.c hunk due to the presence
    of RHEL-only commit 975d318867 ("signal: Don't disable preemption
    in ptrace_stop() on PREEMPT_RT.").
 2) A merge conflict in the kernel/time/hrtimer.c hunk due to the
    presence of RHEL-only commit 5f76194136 ("time/hrtimer: Embed
    hrtimer mode into hrtimer_sleeper").
 3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due
    to the presence of upstream commit 38c8a9a52082 ("smb: move client
    and server files to common directory fs/smb").
 4) Similarly, the fs/cifs/transport.c hunk was applied to
    fs/smb/client/transport.c manually due to the presence of
    a later upstream commit d527f51331ca ("cifs: Fix UAF in
    cifs_demultiplex_thread()").

Note that all the prerequisite patches in the same patch series
(https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/)
had already been merged into RHEL9.

commit f5d39b020809146cc28e6e73369bf8065e0310aa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon, 22 Aug 2022 13:18:22 +0200

    freezer,sched: Rewrite core freezer logic

    Rewrite the core freezer to behave better wrt thawing and be simpler
    in general.

    By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
    ensured frozen tasks stay frozen until thawed and don't randomly wake
    up early, as is currently possible.

    As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
    two PF_flags (yay!).

    Specifically; the current scheme works a little like:

            freezer_do_not_count();
            schedule();
            freezer_count();

    And either the task is blocked, or it lands in try_to_freeze()
    through freezer_count(). Now, when it is blocked, the freezer
    considers it frozen and continues.

    However, on thawing, once pm_freezing is cleared, freezer_count()
    stops working, and any random/spurious wakeup will let a task run
    before its time.

    That is, thawing tries to thaw things in explicit order; kernel
    threads and workqueues before bringing SMP back, before userspace,
    etc. However, due to the above-mentioned races it is entirely possible
    for userspace tasks to thaw (by accident) before SMP is back.

    This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
    where the userspace task requires a special CPU to run.

    As said; replace this with a special task state TASK_FROZEN and add
    the following state transitions:

            TASK_FREEZABLE  -> TASK_FROZEN
            __TASK_STOPPED  -> TASK_FROZEN
            __TASK_TRACED   -> TASK_FROZEN

    The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
    (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
    is already required to deal with spurious wakeups and the freezer
    causes one such when thawing the task (since the original state is
    lost).

    The special __TASK_{STOPPED,TRACED} states *can* be restored since
    their canonical state is in ->jobctl.

    With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
    free of undue (early / spurious) wakeups.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:06 -04:00
Chris von Recklinghausen 26437a89ef mm: remove the vma linked list
Conflicts:
	include/linux/mm.h - We already have
		21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
		so keep declaration for zap_page_range_single
	kernel/fork.c - We already have
		f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
		so keep declaration of i
	mm/mmap.c - We already have
		a1e8cb93bf ("mm: drop oom code from exit_mmap")
		and
		db3644c677 ("mm: delete unused MMF_OOM_VICTIM flag")
		so keep setting MMF_OOM_SKIP in mm->flags

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 763ecb035029f500d7e6dc99acd1ad299b7726a1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:06 2022 +0000

    mm: remove the vma linked list

    Replace any vm_next use with vma_find().

    Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
    maple tree.

    Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
    the same time, alter the loop to be more compact.

    Now that free_pgtables() and unmap_vmas() take a maple tree as an
    argument, rearrange do_mas_align_munmap() to use the new tree to hold the
    vmas to remove.

    Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
    used to update the linked list.

    Drop linked list update from __insert_vm_struct().

    Rework validation of tree as it was depending on the linked list.

    [yang.lee@linux.alibaba.com: fix one kernel-doc comment]
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
      Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:57 -04:00
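The iteration pattern that replaces the vm_next walks looks roughly like this (VMA_ITERATOR/for_each_vma are the helpers introduced earlier in the maple tree series):

```
/* Old pattern, removed by this series:
 *	for (vma = mm->mmap; vma; vma = vma->vm_next) { ... }
 */
void walk_all_vmas_sketch(struct mm_struct *mm)
{
	VMA_ITERATOR(vmi, mm, 0);
	struct vm_area_struct *vma;

	mmap_assert_locked(mm);
	for_each_vma(vmi, vma) {
		/* visit each VMA in address order */
	}
}
```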
Chris von Recklinghausen 11d9f41086 fork: use VMA iterator
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit fa5e587679f034530e8c14bc1c466490053b2ff2
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Sep 6 19:48:59 2022 +0000

    fork: use VMA iterator

    The VMA iterator is faster than the linked list and removing the linked
    list will shrink the vm_area_struct.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-50-Liam.Howlett@oracle.com
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:52 -04:00
Chris von Recklinghausen eb370ae179 mm: remove vmacache
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 7964cf8caa4dfa42c4149f3833d3878713cda3dc
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:51 2022 +0000

    mm: remove vmacache

    By using the maple tree and the maple tree state, the vmacache is no
    longer beneficial and is complicating the VMA code.  Remove the vmacache
    to reduce the work in keeping it up to date and code complexity.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:46 -04:00
Chris von Recklinghausen adeb5664bb mm: remove rb tree.
Conflicts: mm/mmap.c -
	We already have
	54a611b60590 ("Maple Tree: add new data structure")
	so mas_preallocate no longer takes a vma argument.
	We already have
	92b7399695a5 ("mmap: fix copy_vma() failure path")
	so keep check for new_vma->vm_file, the fput on it, and
	the unlink_anon_vmas call

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 524e00b36e8c547f5582eef3fb645a8d9fc5e3df
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:48 2022 +0000

    mm: remove rb tree.

    Remove the RB tree and start using the maple tree for vm_area_struct
    tracking.

    Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
    lock is not held.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:44 -04:00
Chris von Recklinghausen 21beb4ef93 kernel/fork: use maple tree for dup_mmap() during forking
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit c9dbe82cb99db5b6029c6bc43fcf7881d3f50268
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:47 2022 +0000

    kernel/fork: use maple tree for dup_mmap() during forking

    The maple tree was already tracking VMAs in this function by an earlier
    commit, but the rbtree iterator was being used to iterate the list.
    Change the iterator to use a maple tree native iterator and switch to the
    maple tree advanced API to avoid multiple walks of the tree during insert
    operations.  Unexport the now-unused vma_store() function.

    For performance reasons we bulk allocate the maple tree nodes.  The node
    calculations are done internally to the tree and use the VMA count and
    assume the worst-case node requirements.  The VM_DONTCOPY flag does not
    allow for the most efficient copy method of the tree and so a bulk loading
    algorithm is used.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-15-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:43 -04:00
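A simplified, hedged sketch of the bulk-loading pattern described above: preallocate maple nodes for the expected VMA count, store each duplicate with the advanced API, then release any unused preallocations. Page-table copying, VM_DONTCOPY handling and the full error unwinding are omitted:

```
static int dup_mmap_store_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
	MA_STATE(mas, &mm->mm_mt, 0, 0);
	struct vm_area_struct *mpnt, *tmp;
	int ret;

	/* Bulk-allocate nodes for the worst case up front. */
	ret = mas_expected_entries(&mas, oldmm->map_count);
	if (ret)
		return ret;

	mas_for_each(&old_mas, mpnt, ULONG_MAX) {
		tmp = vm_area_dup(mpnt);
		if (!tmp) {
			ret = -ENOMEM;
			break;
		}
		mas.index = tmp->vm_start;
		mas.last = tmp->vm_end - 1;
		mas_store(&mas, tmp);
	}

	mas_destroy(&mas);	/* free any unused preallocated nodes */
	return ret;
}
```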
Chris von Recklinghausen cde31a5d92 mm: start tracking VMAs with maple tree
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit d4af56c5c7c6781ca6ca8075e2cf5bc119ed33d1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:45 2022 +0000

    mm: start tracking VMAs with maple tree

    Start tracking the VMAs with the new maple tree structure in parallel with
    the rb_tree.  Add debug and trace events for maple tree operations and
    duplicate the rb_tree that is created on forks into the maple tree.

    The maple tree is added to the mm_struct including the mm_init struct,
    added support in required mm/mmap functions, added tracking in kernel/fork
    for process forking, and used to find the unmapped_area and checked
    against what the rbtree finds.

    This also moves the mmap_lock() in exit_mmap() since the oom reaper call
    does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.

    When splitting a vma fails due to allocations of the maple tree nodes,
    the error path in __split_vma() calls new->vm_ops->close(new).  The page
    accounting for hugetlb is actually in the close() operation,  so it
    accounts for the removal of 1/2 of the VMA which was not adjusted.  This
    results in a negative exit value.  To avoid the negative charge, set
    vm_start = vm_end and vm_pgoff = 0.

    There is also a potential accounting issue in special mappings from
    insert_vm_struct() failing to allocate, so reverse the charge there in
    the failure scenario.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:41 -04:00
Prarit Bhargava 3e4b9d3d7e stackprotector: move get_random_canary() into stackprotector.h
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Minor drift issues.

commit b3883a9a1f09e7b41f4dcb1bbd7262216a62d253
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun Oct 23 22:06:00 2022 +0200

    stackprotector: move get_random_canary() into stackprotector.h

    This has nothing to do with random.c and everything to do with stack
    protectors. Yes, it uses randomness. But many things use randomness.
    random.h and random.c are concerned with the generation of randomness,
    not with each and every use. So move this function into the more
    specific stackprotector.h file where it belongs.

    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:05 -04:00
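The helper itself is tiny; after the move it sits in include/linux/stackprotector.h and looks roughly like this (CANARY_MASK keeps a zero byte in the canary so string-based overflows cannot copy past it):

```
static inline unsigned long get_random_canary(void)
{
	return get_random_long() & CANARY_MASK;
}
```

Only the header providing the helper changes; callers such as the per-arch boot_init_stack_canary() keep using it as before.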
Prarit Bhargava 8f4ce17181 fork: Generalize PF_IO_WORKER handling
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Did not apply to some unsupported arches.  Those changes have
been dropped.  Add idle_dummy() function here instead of backporting
out-of-scope linux commit 36cb0e1cda64 ("fork: Explicity test for idle
tasks in copy_thread")

commit 5bd2e97c868a8a44470950ed01846cab6328e540
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Apr 12 10:18:48 2022 -0500

    fork: Generalize PF_IO_WORKER handling

    Add fn and fn_arg members into struct kernel_clone_args and test for
    them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER).
    This allows any task that wants to be a user space task that only runs
    in kernel mode to use this functionality.

    The code on x86 is an exception and still retains a PF_KTHREAD test
    because x86, unlike everything else, handles kthreads slightly
    differently than user space tasks that start with a function.

    The functions that created tasks that start with a function
    have been updated to set ".fn" and ".fn_arg" instead of
    ".stack" and ".stack_size".  These functions are fork_idle(),
    create_io_thread(), kernel_thread(), and user_mode_thread().

    Link: https://lkml.kernel.org/r/20220506141512.516114-4-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:04 -04:00
Prarit Bhargava 76f9c1e6a9 x86/split_lock: Make life miserable for split lockers
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Minor drift issues.

commit b041b525dab95352fbd666b14dc73ab898df465f
Author: Tony Luck <tony.luck@intel.com>
Date:   Thu Mar 10 12:48:53 2022 -0800

    x86/split_lock: Make life miserable for split lockers

    In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas
    said:

      Its's simply wishful thinking that stuff gets fixed because of a
      WARN_ONCE(). This has never worked. The only thing which works is to
      make stuff fail hard or slow it down in a way which makes it annoying
      enough to users to complain.

    He was talking about WBINVD. But it made me think about how we use the
    split lock detection feature in Linux.

    Existing code has three options for applications:

     1) Don't enable split lock detection (allow arbitrary split locks)
     2) Warn once when a process uses split lock, but let the process
        keep running with split lock detection disabled
     3) Kill process that use split locks

    Option 2 falls into the "wishful thinking" territory that Thomas warns does
    nothing. But option 3 might not be viable in a situation with legacy
    applications that need to run.

    Hence make option 2 much stricter to "slow it down in a way which makes
    it annoying".

    Primary reason for this change is to provide better quality of service to
    the rest of the applications running on the system. Internal testing shows
    that even with many processes splitting locks, performance for the rest of
    the system is much more responsive.

    The new "warn" mode operates like this.  When an application tries to
    execute a bus lock, the #AC handler:

     1) Delays (interruptibly) 10 ms before moving to next step.

     2) Blocks (interruptibly) until it can get the semaphore
            If interrupted, just return. Assume the signal will either
            kill the task, or direct execution away from the instruction
            that is trying to get the bus lock.
     3) Disables split lock detection for the current core
     4) Schedules a work queue to re-enable split lock detect in 2 jiffies
     5) Returns

    The work queue that re-enables split lock detection also releases the
    semaphore.

    There is a corner case where a CPU may be taken offline while split lock
    detection is disabled. A CPU hotplug handler handles this case.

    Old behaviour was to only print the split lock warning on the first
    occurrence of a split lock from a task. Preserve that by adding a flag to
    the task structure that suppresses subsequent split lock messages from that
    task.

    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:42:47 -04:00
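A heavily simplified, hypothetical sketch of the numbered flow above; the function and helper names (split_lock_warn_slowpath, split_lock_disable_on_this_cpu, split_lock_enable_on_this_cpu) are illustrative, not the upstream identifiers:

```
static DEFINE_SEMAPHORE(split_lock_sem);
static void split_lock_reenable_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(split_lock_reenable, split_lock_reenable_fn);

static void split_lock_warn_slowpath(void)
{
	/* 1) Make the offender pay: sleep 10 ms, bail out on a signal. */
	if (msleep_interruptible(10))
		return;

	/* 2) Serialize offenders system-wide behind one semaphore. */
	if (down_interruptible(&split_lock_sem))
		return;

	/* 3) Let this core make progress with detection turned off ... */
	split_lock_disable_on_this_cpu();

	/* 4) ... and re-enable it (releasing the semaphore) in 2 jiffies. */
	schedule_delayed_work_on(smp_processor_id(), &split_lock_reenable, 2);
}

static void split_lock_reenable_fn(struct work_struct *work)
{
	split_lock_enable_on_this_cpu();
	up(&split_lock_sem);
}
```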
Prarit Bhargava 5022b3efeb fork: Pass struct kernel_clone_args into copy_thread
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit c5febea0956fd3874e8fb59c6f84d68f128d68f8
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 8 18:07:50 2022 -0500

    fork: Pass struct kernel_clone_args into copy_thread

    With io_uring we have started supporting tasks that are for most
    purposes user space tasks that exclusively run code in kernel mode.

    The kernel task that exec's init and tasks that exec user mode
    helpers are also user mode tasks that just run kernel code
    until they call kernel execve.

    Pass kernel_clone_args into copy_thread so these oddball
    tasks can be supported more cleanly and easily.

    v2: Fix spelling of kenrel_clone_args on h8300
    Link: https://lkml.kernel.org/r/20220506141512.516114-2-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Omitted-fix: 0626e1c9f3e5 LoongArch: Fix copy_thread() build errors
	loongarch not supported in RHEL9

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:42:34 -04:00
Vitaly Kuznetsov be72381490 x86/mm: Use mm_alloc() in poking_init()
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit 3f4c8211d982099be693be9aa7d6fc4607dff290
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 25 21:38:21 2022 +0200

    x86/mm: Use mm_alloc() in poking_init()

    Instead of duplicating init_mm, allocate a fresh mm. The advantage is
    that mm_alloc() has much simpler dependencies. Additionally it makes
    more conceptual sense, init_mm has no (and must not have) user state
    to duplicate.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
2024-03-20 09:42:26 -04:00
Vitaly Kuznetsov 037738e296 mm: Move mm_cachep initialization to mm_init()
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit af80602799681c78f14fbe20b6185a56020dedee
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Oct 25 21:38:18 2022 +0200

    mm: Move mm_cachep initialization to mm_init()

    In order to allow using mm_alloc() much earlier, move initializing
    mm_cachep into mm_init().

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221025201057.751153381@infradead.org

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
2024-03-20 09:42:26 -04:00
Jan Stancek 78042596b6 Merge: iommu: IOMMU and DMA-mapping API Updates for 9.4
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3180

# Merge Request Required Information

```
Bugzilla: https://bugzilla.redhat.com/2223717
JIRA: https://issues.redhat.com/browse/RHEL-10007
JIRA: https://issues.redhat.com/browse/RHEL-10026
JIRA: https://issues.redhat.com/browse/RHEL-10042
JIRA: https://issues.redhat.com/browse/RHEL-10094
JIRA: https://issues.redhat.com/browse/RHEL-3655
JIRA: https://issues.redhat.com/browse/RHEL-800

Depends: !3244
Depends: !3245

Omitted-fix: c7bd8a1f45ba ("iommu/apple-dart: Handle DMA_FQ domains in attach_dev()")
             - Apple Dart not supported

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Testing: A mix of fio jobs, and various stress-ng io stressors (--hdd, --readahead, --aio, --aiol, --seek,
         --sync_file) run with strict and lazy translation modes on amd, intel, and arm systems. pgtbl_v2
         tested on AMD Genoa host

Conflicts: Should be noted in individual commits. In particular one upstream merge in 6.4, 58390c8ce1bd, had a rather
           messy merge conflict resolution set, so a number of commits have those cleanups added in here.
```

## Summary of Changes

```
        Rebase through v6.5 with a good portion of v6.6 as well (minus the
	dynamic swiotlb mempool support, per numa dma cma support, and arm
	+ mm tlb invalidate changes). For iommufd changes there are
	backports of the underlying functionality in iommufd, but I have left
	the vfio commits that will eventually make use of it for Alex.

Highlights
	* AMD GA Log Overflow refactor and PPR Log support
	* AMD v2 page table support
	* AMD v2 5 level guest page table support
	* Various cleanups and fixes
	* Sync ipmmu-vmsa in preparation for Renesas support  (config not enabled)
	* Continuation of swiotlb rework
	* Continuation of the refactor of core iommu code as part of SVA, iommufd, and pasid support work
	* Continuation of the iommufd prep work (config still not enabled)
	* Support for bounce buffer usage with non cache-line aligned kmallocs on arm64
	* Clean up of in-kernel pasid use for vt-d
	* More cleanup of BUG_ON and warning use in vt-d

        This is based on top of MR !2843 and !3158.
```

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

## Approved Development Ticket
All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Eric Auger <eric.auger@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-19 15:53:34 +01:00
Scott Weaver 1e550aa9e1 Merge: kernel/fork: beware of __put_task_struct() calling context
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3002

Bugzilla: https://bugzilla.redhat.com/2060283

commit d243b34459cea30cfe5f3a9b2feb44e7daff9938

Author: Wander Lairson Costa <wander@redhat.com>

Date:   Wed Jun 14 09:23:21 2023 -0300

    kernel/fork: beware of __put_task_struct() calling context

    Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
    locks. Therefore, it can't be called from a non-preemptible context.

    One practical example is a splat inside inactive_task_timer(), which is
    called in an interrupt context:

      CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
       Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
       Call Trace:
       dump_stack_lvl+0x57/0x7d
       mark_lock_irq.cold+0x33/0xba
       mark_lock+0x1e7/0x400
       mark_usage+0x11d/0x140
       __lock_acquire+0x30d/0x930
       lock_acquire.part.0+0x9c/0x210
       rt_spin_lock+0x27/0xe0
       refill_obj_stock+0x3d/0x3a0
       kmem_cache_free+0x357/0x560
       inactive_task_timer+0x1ad/0x340
       __run_hrtimer+0x8a/0x1a0
       __hrtimer_run_queues+0x91/0x130
       hrtimer_interrupt+0x10f/0x220
       __sysvec_apic_timer_interrupt+0x7b/0xd0
       sysvec_apic_timer_interrupt+0x4f/0xd0
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       RIP: 0033:0x7fff196bf6f5

    Instead of calling __put_task_struct() directly, we defer it using
    call_rcu(). A more natural approach would use a workqueue, but since
    in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
    the code would become more complex because we would need to put the
    work_struct instance in the task_struct and initialize it when we
    allocate a new task_struct.

    The issue is reproducible with stress-ng:

      while true; do
          stress-ng --sched deadline --sched-period 1000000000 \
                  --sched-runtime 800000000 --sched-deadline \
                  1000000000 --mmapfork 23 -t 20
      done

    Reported-by: Hu Chunyu <chuhu@redhat.com>
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Wander Lairson Costa <wander@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com

Signed-off-by: Wander Lairson Costa <wander@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-06 14:54:53 -05:00
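The shape of the fix, roughly: the final reference drop checks whether it is running in a context that may sleep and, if not, defers the heavy teardown through call_rcu(). A hedged sketch of the helpers this commit adds:

```
void __put_task_struct_rcu_cb(struct rcu_head *rhp)
{
	struct task_struct *task = container_of(rhp, struct task_struct, rcu);

	__put_task_struct(task);
}

static inline void put_task_struct(struct task_struct *t)
{
	if (!refcount_dec_and_test(&t->usage))
		return;

	/*
	 * Under PREEMPT_RT __put_task_struct() can take sleeping locks, so
	 * it must not run from a non-preemptible context (e.g. the hrtimer
	 * interrupt in the splat above); defer it through RCU instead.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
	else
		__put_task_struct(t);
}
```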
Jerry Snitselaar 216afcb34d iommu/sva: Move PASID helpers to sva code
JIRA: https://issues.redhat.com/browse/RHEL-10094
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: 400b9b93441c ("iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()")
           changed pasid_valid to mm_valid_pasid.

commit cd3891158a77685aee6129f7374a018d13540b2c
Author: Jacob Pan <jacob.jun.pan@linux.intel.com>
Date:   Wed Mar 22 13:07:58 2023 -0700

    iommu/sva: Move PASID helpers to sva code

    Preparing to remove IOASID infrastructure, PASID management will be
    under SVA code. Decouple mm code from IOASID.

    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
    Reviewed-by: Kevin Tian <kevin.tian@intel.com>
    Link: https://lore.kernel.org/r/20230322200803.869130-3-jacob.jun.pan@linux.intel.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

(cherry picked from commit cd3891158a77685aee6129f7374a018d13540b2c)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2023-10-27 01:26:58 -07:00
Chris von Recklinghausen 492e3ef0e5 mm: fix memory leak on mm_init error handling
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b20b0368c614c609badfe16fbd113dfb4780acd9
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Mar 30 09:38:22 2023 -0400

    mm: fix memory leak on mm_init error handling

    commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
    introduces a memory leak by missing a call to destroy_context() when a
    percpu_counter fails to allocate.

    Before introducing the per-cpu counter allocations, init_new_context() was
    the last call that could fail in mm_init(), and thus there was no need to
    ever invoke destroy_context() in the error paths.  Adding the following
    percpu counter allocations adds error paths after init_new_context(),
    which means its associated destroy_context() needs to be called when
    percpu counters fail to allocate.

    Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
    Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:00 -04:00
Chris von Recklinghausen 2a6d71ed47 percpu_counter: add percpu_counter_sum_all interface
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f689054aace2ff13af2e9a44a74fbba650ca31ba
Author: Shakeel Butt <shakeelb@google.com>
Date:   Wed Nov 9 01:20:11 2022 +0000

    percpu_counter: add percpu_counter_sum_all interface

    The percpu_counter is used for scenarios where performance is more
    important than the accuracy.  For percpu_counter users, who want more
    accurate information in their slowpath, percpu_counter_sum is provided
    which traverses all the online CPUs to accumulate the data.  The reason it
    only needs to traverse online CPUs is because percpu_counter does
    implement CPU offline callback which syncs the local data of the offlined
    CPU.

    However there is a small race window between the online CPUs traversal of
    percpu_counter_sum and the CPU offline callback.  The offline callback has
    to traverse all the percpu_counters on the system to flush the CPU local
    data which can be a lot.  During that time, the CPU which is going offline
    has already been published as offline to all the readers.  So, as the
    offline callback is running, percpu_counter_sum can be called for one
    counter which has some state on the CPU going offline.  Since
    percpu_counter_sum only traverses online CPUs, it will skip that specific
    CPU and the offline callback might not have flushed the state for that
    specific percpu_counter on that offlined CPU.

    Normally this is not an issue because percpu_counter users can deal with
    some inaccuracy for small time window.  However a new user i.e.  mm_struct
    on the cleanup path wants to check the exact state of the percpu_counter
    through check_mm().  For such users, this patch introduces
    percpu_counter_sum_all() which traverses all possible CPUs and it is used
    in fork.c:check_mm() to avoid the potential race.

    This issue is exposed by the later patch "mm: convert mm's rss stats into
    percpu_counter".

    Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:21 -04:00
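Functionally the new helper is percpu_counter_sum() with the CPU mask widened from online to possible CPUs; a hedged sketch:

```
s64 percpu_counter_sum_all(struct percpu_counter *fbc)
{
	unsigned long flags;
	s64 ret;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	/* Possible CPUs, not just online ones, to close the offline race. */
	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);

	return ret;
}
```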
Chris von Recklinghausen 9cec47342a mm: convert mm's rss stats into percpu_counter
Conflicts:
	include/linux/sched.h - We don't have
		7964cf8caa4d ("mm: remove vmacache")
		so don't remove the declaration for vmacache
	kernel/fork.c - We don't have
		d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
		so don't add calls to mt_init_flags or mt_set_external_lock

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25
Author: Shakeel Butt <shakeelb@google.com>
Date:   Mon Oct 24 05:28:41 2022 +0000

    mm: convert mm's rss stats into percpu_counter

    Currently mm_struct maintains rss_stats which are updated on page fault
    and the unmapping codepaths.  For page fault codepath the updates are
    cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64.
    The reason for caching is performance for multithreaded applications
    otherwise the rss_stats updates may become hotspot for such applications.

    However this optimization comes with the cost of error margin in the rss
    stats.  The rss_stats for applications with large number of threads can be
    very skewed.  At worst the error margin is (nr_threads * 64) and we have a
    lot of applications with 100s of threads, so the error margin can be very
    high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.

    Recently we started seeing the unbounded errors for rss_stats for specific
    applications which use TCP rx0cp.  It seems like vm_insert_pages()
    codepath does not sync rss_stats at all.

    This patch converts the rss_stats into percpu_counter to convert the error
    margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).  However
    this conversion enable us to get the accurate stats for situations where
    accuracy is more important than the cpu cost.

    This patch does not make such tradeoffs - we can just use
    percpu_counter_add_local() for the updates and percpu_counter_sum() (or
    percpu_counter_sync() + percpu_counter_read) for the readers.  At the
    moment the readers are either procfs interface, oom_killer and memory
    reclaim which I think are not performance critical and should be ok with
    slow read.  However I think we can make that change in a separate patch.

    Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:21 -04:00
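After the conversion the per-mm counters go through the generic percpu_counter API; a hedged sketch of the accessors (the accurate-sum helper name below is illustrative, not an upstream identifier):

```
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	/* Fast, possibly slightly stale read for the common paths. */
	return percpu_counter_read_positive(&mm->rss_stat[member]);
}

static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	percpu_counter_add(&mm->rss_stat[member], value);
}

/* Slow but accurate sum for readers such as procfs or the oom killer. */
static inline unsigned long get_mm_counter_sum_sketch(struct mm_struct *mm,
						      int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}
```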
Chris von Recklinghausen 41acf1521c kmsan: handle task creation and exiting
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 50b5e49ca694a60f84a2a12d62b6cb6ec8e3649f
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:50 2022 +0200

    kmsan: handle task creation and exiting

    Tell KMSAN that a new task is created, so the tool creates a backing
    metadata structure for that task.

    Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:36 -04:00