Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Waiman Long	babedfece6	Union-Find: add a new module in kernel library JIRA: https://issues.redhat.com/browse/RHEL-83455 Conflicts: A merge conflict in lib/Makefile and context diff in MAINTAINERS. commit 93c8332c8373fee415bd79f08d5ba4ba7ca5ad15 Author: Xavier <xavier_qy@163.com> Date: Thu, 4 Jul 2024 14:24:43 +0800 Union-Find: add a new module in kernel library This patch implements a union-find data structure in the kernel library, which includes operations for allocating nodes, freeing nodes, finding the root of a node, and merging two nodes. Signed-off-by: Xavier <xavier_qy@163.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2025-04-09 21:58:35 -04:00
Baoquan He	308e9a3386	Document/kexec: generalize crash hotplug description JIRA: https://issues.redhat.com/browse/RHEL-58641 Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Conflicts: In Documentation/ABI/testing/sysfs-devices-system-cpu, there is conflict because of context fuzz. commit c91c6062d6cd1bc366efb04973ee449c30398a49 Author: Sourabh Jain <sourabhjain@linux.ibm.com> Date: Mon Aug 12 09:46:51 2024 +0530 Document/kexec: generalize crash hotplug description Commit 79365026f869 ("crash: add a new kexec flag for hotplug support") generalizes the crash hotplug support to allow architectures to update multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr. Therefore, update the relevant kernel documentation to reflect the same. No functional change. Link: https://lkml.kernel.org/r/20240812041651.703156-1-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Petr Tesarik <petr@tesarici.cz> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Baoquan He <bhe@redhat.com>	2024-12-23 09:35:36 +08:00
Rafael Aquini	8a32fd3b9a	maple_tree: update the documentation of maple tree JIRA: https://issues.redhat.com/browse/RHEL-27745 This patch is a backport of the following upstream commit: commit 9bc1d3cdb904170214456bca96c4924f28522ab8 Author: Peng Zhang <zhangpeng.00@bytedance.com> Date: Fri Oct 27 11:38:41 2023 +0800 maple_tree: update the documentation of maple tree Introduce the new interface mtree_dup() in the documentation. Link: https://lkml.kernel.org/r/20231027033845.90608-7-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-12-09 12:23:31 -05:00
Audra Mitchell	754309ec6b	Documentation/protection-keys: Clean up documentation for User Space pkeys JIRA: https://issues.redhat.com/browse/RHEL-55461 This patch is a backport of the following upstream commit: commit f8c1d4ca55177326adad1fdc6bf602423a507542 Author: Ira Weiny <ira.weiny@intel.com> Date: Tue Apr 19 10:06:06 2022 -0700 Documentation/protection-keys: Clean up documentation for User Space pkeys The documentation for user space pkeys was a bit dated including things such as Amazon and distribution testing information which is irrelevant now. Update the documentation. This also streamlines adding the Supervisor pkey documentation later on. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220419170649.1022246-2-ira.weiny@intel.com Signed-off-by: Audra Mitchell <audra@redhat.com>	2024-11-04 09:14:14 -05:00
Rado Vrbovsky	570a71d7db	Merge: mm: update core code to v6.6 upstream MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252 JIRA: https://issues.redhat.com/browse/RHEL-27743 JIRA: https://issues.redhat.com/browse/RHEL-59459 CVE: CVE-2024-46787 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961 This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level. This work follows up on the previous v6.5 update (RHEL-27742) and as such, the bulk of this changeset is comprised of refactoring and clean-ups of the internal implementation of several APIs as it further advances the conversion to FOLIOS, and follow up on the per-VMA locking changes. Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds, and we add a potential extra level of protection (assessment pending) to help on mitigating kernel heap exploits dubbed as "SlubStick". Follow-up fixes are omitted from this series either because they are irrelevant to the bits we support on RHEL or because they depend on bigger changesets introduced upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately. 
Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot") Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources") Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()") Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros") Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages") Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType") Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()") Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio") Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling") Signed-off-by: Rafael Aquini <raquini@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Mark Salter <msalter@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Approved-by: David Airlie <airlied@redhat.com> Approved-by: Michal Schmidt <mschmidt@redhat.com> Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-10-30 07:22:28 +00:00
Rado Vrbovsky	a5ea1cdd29	Merge: 9.6 IOMMU and DMA api updates MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5113 # Merge Request Required Information JIRA: https://issues.redhat.com/browse/RHEL-36247 JIRA: https://issues.redhat.com/browse/RHEL-54186 JIRA: https://issues.redhat.com/browse/RHEL-54189 JIRA: https://issues.redhat.com/browse/RHEL-55199 JIRA: https://issues.redhat.com/browse/RHEL-55200 JIRA: https://issues.redhat.com/browse/RHEL-55448 JIRA: https://issues.redhat.com/browse/RHEL-55450 JIRA: https://issues.redhat.com/browse/RHEL-55466 JIRA: https://issues.redhat.com/browse/RHEL-57229 Upstream: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git CVE: CVE-2024-44994 ## Summary of Changes This brings the IOMMU and DMA api subsystems in line with v6.11. Overall it is smaller than normal, but I expect there will be another MR later on with some things landing in v6.12. In addition to the usual fixes and cleanups, the major changes are: - iommufd io page fault support - smmuv3 dirty page tracking support - vt-d cache tagging support - iommu memory usage observability support - Some cleanups from Robin related to dma range calculation, arch setup_dma_ops, and iommu_fwspec_ops - Another batch of updates to smmuv3 from Jason's reworking of the driver (2b/3) There also is a rhel only cleanup to deal with a cleanup in the iommufd code by Linus in one of the merges that was related to some mm changes at the time, but wasn't dealt with when the iommufd or mm changes were merged to rhel. Testing: - kernel-tests iommu new-boot testing - kernel-tests iommu fio testing - kernel-tests dmatest and idxd tests - iommufd kernel selftest - general cki testing v5: Added 3 fixes that recently landed upstream. v6: Resolve issues raised by Don: - Conflicts note updates - Backport of acpi, and arm7 io-pgtable commit. 
- While doing this I went through to diff against upstream and cleaned up a couple more merge commit cleanups: * __GFP_ZERO use in amd ppr log alloc code. * Line break in swiotlb code. * dead config option in drivers/iommu/intel/Kconfig - Hyperv commit that was missed in previous backports. v7: Fixed typo from using Conflict instead of Conflicts for tag. v8: Fixed borked conflict resolution after adding acpi patch in v6. v10: Added dt bindings commit mentioned by Eric. v11: Rebase due to merge conflict after acpi MR merged. This also drops a couple acpi commits that were now empty. Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com> ## Approved Development Ticket All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved. Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Mark Salter <msalter@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Donald Dutile <ddutile@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Approved-by: David Arcari <darcari@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-10-19 08:07:00 +00:00
Rafael Aquini	4e136352a4	mm: add orphaned kernel-doc to the rst files. JIRA: https://issues.redhat.com/browse/RHEL-27743 This patch is a backport of the following upstream commit: commit 61ff748b5b7b0c32daddbfb92c3bc15d938754dc Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Fri Aug 18 21:06:30 2023 +0100 mm: add orphaned kernel-doc to the rst files. There are many files in mm/ that contain kernel-doc which is not currently published on kernel.org. Some of it is easily categorisable, but most of it is going into the miscellaneous documentation section to be organised later. Some files aren't ready to be included; they contain documentation with build errors. Or they're nommu.c which duplicates documentation from "real" MMU systems. Those files are noted with a # mark (although really anything which isn't a recognised directive would do to prevent inclusion) Link: https://lkml.kernel.org/r/20230818200630.2719595-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-10-01 11:21:59 -04:00
Rafael Aquini	bb486f4fbc	mm: remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO JIRA: https://issues.redhat.com/browse/RHEL-27743 This patch is a backport of the following upstream commit: commit 29d26f1215de14721188988a59b1426abb85b7be Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Wed Aug 2 16:13:33 2023 +0100 mm: remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO Current best practice is to reuse the name of the function as a define to indicate that the function is implemented by the architecture. Link: https://lkml.kernel.org/r/20230802151406.3735276-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-10-01 11:20:19 -04:00
Rafael Aquini	6443ba8155	mm: add generic flush_icache_pages() and documentation JIRA: https://issues.redhat.com/browse/RHEL-27743 This patch is a backport of the following upstream commit: commit 3a255267f6dff40e193501cf731f409ce9175503 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Wed Aug 2 16:13:31 2023 +0100 mm: add generic flush_icache_pages() and documentation flush_icache_page() is deprecated but not yet removed, so add a range version of it. Change the documentation to refer to update_mmu_cache_range() instead of update_mmu_cache(). Link: https://lkml.kernel.org/r/20230802151406.3735276-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-10-01 11:20:17 -04:00
Jerry Snitselaar	c880b92922	Documentation/core-api: correct reference to SWIOTLB_DYNAMIC JIRA: https://issues.redhat.com/browse/RHEL-55466 Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git commit 82d71b53d7e732ede6028591342bdc80fabfa29f Author: Lukas Bulwahn <lukas.bulwahn@redhat.com> Date: Mon May 27 15:13:14 2024 +0200 Documentation/core-api: correct reference to SWIOTLB_DYNAMIC Commit c93f261dfc39 ("Documentation/core-api: add swiotlb documentation") accidentally refers to CONFIG_DYNAMIC_SWIOTLB in one place, while the config is actually called CONFIG_SWIOTLB_DYNAMIC. Correct the reference to the intended config option. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> (cherry picked from commit 82d71b53d7e732ede6028591342bdc80fabfa29f) Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>	2024-09-20 12:29:02 -07:00
Jerry Snitselaar	b75202bae9	Documentation/core-api: add swiotlb documentation JIRA: https://issues.redhat.com/browse/RHEL-55466 Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git commit c93f261dfc395b1386320b2f0b160a6f34ed9ea5 Author: Michael Kelley <mhklinux@outlook.com> Date: Wed May 1 08:16:51 2024 -0700 Documentation/core-api: add swiotlb documentation There's currently no documentation for the swiotlb. Add documentation describing usage scenarios, the key APIs, and implementation details. Group the new documentation with other DMA-related documentation. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> (cherry picked from commit c93f261dfc395b1386320b2f0b160a6f34ed9ea5) Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>	2024-09-20 12:26:35 -07:00
Rafael Aquini	ff90cf56a9	mm: Don't pin ZERO_PAGE in pin_user_pages() JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit c8070b78751955e59b42457b974bea4a4fe00187 Author: David Howells <dhowells@redhat.com> Date: Fri May 26 22:41:40 2023 +0100 mm: Don't pin ZERO_PAGE in pin_user_pages() Make pin_user_pages() leave a ZERO_PAGE unpinned if it extracts a pointer to it from the page tables and make unpin_user_page() correspondingly ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a zero page's refcount as we're only allowed ~2 million pins on it - something that userspace can conceivably trigger. Add a pair of functions to test whether a page or a folio is a ZERO_PAGE. Signed-off-by: David Howells <dhowells@redhat.com> cc: Christoph Hellwig <hch@infradead.org> cc: David Hildenbrand <david@redhat.com> cc: Lorenzo Stoakes <lstoakes@gmail.com> cc: Andrew Morton <akpm@linux-foundation.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: Jason Gunthorpe <jgg@nvidia.com> cc: Logan Gunthorpe <logang@deltatee.com> cc: Hillf Danton <hdanton@sina.com> cc: Christian Brauner <brauner@kernel.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-kernel@vger.kernel.org cc: linux-mm@kvack.org Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20230526214142.958751-2-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:36:12 -04:00
Lucas Zampieri	804616b9d7	Merge: update drivers/base to match Linux v6.6 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3774 JIRA: https://issues.redhat.com/browse/RHEL-26183 Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Myron Stowe <mstowe@redhat.com> Approved-by: Pingfan Liu <piliu@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-07-12 14:11:54 +00:00
Lucas Zampieri	bce53f8035	Merge: CNB95: string: Allow 2-argument strscpy() and strscpy_pad() MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4435 JIRA: https://issues.redhat.com/browse/RHEL-40250 Tested: Just built Omitted-fix: 62776e4378ae9 ("mips: boot/compressed: use __NO_FORTIFY") - No need to include this as MIPS arch is not supported in RHEL Commits: ``` 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()") 311fb40aa056 ("fortify: Use SIZE_MAX instead of (size_t)-1") fa35198f3957 ("fortify: Explicitly check bounds are compile-time constants") 9f7d69c5cd23 ("fortify: Convert to struct vs member helpers") 03699f271de1 ("string: Rewrite and add more kern-doc for the str*() functions") 62e1cbfc5d79 ("fortify: Short-circuit known-safe calls to strscpy()") 439a1bcac648 ("fortify: Use __builtin_dynamic_object_size() when available") 26dd68d293fd ("overflow: add DEFINE_FLEX() for on-stack allocs") 21a2c74b0a2a ("fortify: Use const variables for __member_size tracking") ead62aa370a8 ("fortify: strscpy: Fix flipped q and p docstring typo") f0a6b5831cfb ("uml: Replace strlcpy with strscpy") b229baa374db ("kernel.h: split out COUNT_ARGS() and CONCATENATE() to args.h") e6584c3964f2 ("string: Allow 2-argument strscpy()") f478898e0aa7 ("string: Redefine strscpy_pad() as a macro") 8366d124ec93 ("string: Allow 2-argument strscpy_pad()") 0d043351e5ba ("ext4: fix fortify warning in fs/ext4/fast_commit.c:1551") b2ba00c2a517 ("rxrpc: replace zero-lenth array with DECLARE_FLEX_ARRAY() helper") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-06-25 13:33:33 +00:00
Donald Dutile	e41f7154cf	module: add debug stats to help identify memory pressure JIRA: https://issues.redhat.com/browse/RHEL-28063 Conflicts: Adding RHEL-only MODULE_STATS set to n for now. Possibly add in future -debug kernels, to be determined. Add rest of commit(s) to enable clean backport for further commits. commit df3e764d8e5cd416efee29e0de3c93917dff5d33 Author: Luis Chamberlain <mcgrof@kernel.org> Date: Tue Mar 28 20:03:19 2023 -0700 module: add debug stats to help identify memory pressure Loading modules with finit_module() can end up using vmalloc(), vmap() and vmalloc() again, for a total of up to 3 separate allocations in the worst case for a single module. We always kernel_read() the module, that's a vmalloc(). Then vmap() is used for the module decompression, and if so the last read buffer is freed as we use the now decompressed module buffer to stuff data into our copy module. The last allocation is specific to each architecture but pretty much that's generally a series of vmalloc() calls or a variation of vmalloc to handle ELF sections with special permissions. Evaluation with new stress-ng module support [1] with just 100 ops is proving that you can end up using GiBs of data easily even with all care we have in the kernel and userspace today in trying to not load modules which are already loaded. 100 ops seems to resemble the sort of pressure a system with about 400 CPUs can create on module loading. Although issues relating to duplicate module requests due to each CPU incurring a new module request are silly and some of these are being fixed, we currently lack proper tooling to help diagnose easily what happened, when it happened and who likely is to blame -- userspace or kernel module autoloading. Provide an initial set of stats which use debugfs to let us easily scrape post-boot information about failed loads. This sort of information can be used on production workloads to try to optimize avoiding redundant memory pressure using finit_module(). 
There's a few examples that can be provided: A 255 vCPU system without the next patch in this series applied: Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s graphical.target reached after 6.988s in userspace And 13.58 GiB of virtual memory space lost due to failed module loading: root@big ~ # cat /sys/kernel/debug/modules/stats Mods ever loaded 67 Mods failed on kread 0 Mods failed on decompress 0 Mods failed on becoming 0 Mods failed on load 1411 Total module size 11464704 Total mod text size 4194304 Failed kread bytes 0 Failed decompress bytes 0 Failed becoming bytes 0 Failed kmod bytes 14588526272 Virtual mem wasted bytes 14588526272 Average mod size 171115 Average mod text size 62602 Average fail load bytes 10339140 Duplicate failed modules: module-name How-many-times Reason kvm_intel 249 Load kvm 249 Load irqbypass 8 Load crct10dif_pclmul 128 Load ghash_clmulni_intel 27 Load sha512_ssse3 50 Load sha512_generic 200 Load aesni_intel 249 Load crypto_simd 41 Load cryptd 131 Load evdev 2 Load serio_raw 1 Load virtio_pci 3 Load nvme 3 Load nvme_core 3 Load virtio_pci_legacy_dev 3 Load virtio_pci_modern_dev 3 Load t10_pi 3 Load virtio 3 Load crc32_pclmul 6 Load crc64_rocksoft 3 Load crc32c_intel 40 Load virtio_ring 3 Load crc64 3 Load The following screen shot, of a simple 8vcpu 8 GiB KVM guest with the next patch in this series applied, shows 226.53 MiB are wasted in virtual memory allocations which are due to duplicate module requests during boot. It also shows an average module memory size of 167.10 KiB and an average module .text + .init.text size of 61.13 KiB. The end shows all modules which were detected as duplicate requests and whether or not they failed early after just the first kernel_read*() call or late after we've already allocated the private space for the module in layout_and_allocate(). A system with module decompression would reveal more wasted virtual memory space. 
We should put effort now into identifying the source of these duplicate module requests and trimming these down as much as possible. Larger systems will obviously show much more wasted virtual memory allocations. root@kmod ~ # cat /sys/kernel/debug/modules/stats Mods ever loaded 67 Mods failed on kread 0 Mods failed on decompress 0 Mods failed on becoming 83 Mods failed on load 16 Total module size 11464704 Total mod text size 4194304 Failed kread bytes 0 Failed decompress bytes 0 Failed becoming bytes 228959096 Failed kmod bytes 8578080 Virtual mem wasted bytes 237537176 Average mod size 171115 Average mod text size 62602 Avg fail becoming bytes 2758544 Average fail load bytes 536130 Duplicate failed modules: module-name How-many-times Reason kvm_intel 7 Becoming kvm 7 Becoming irqbypass 6 Becoming & Load crct10dif_pclmul 7 Becoming & Load ghash_clmulni_intel 7 Becoming & Load sha512_ssse3 6 Becoming & Load sha512_generic 7 Becoming & Load aesni_intel 7 Becoming crypto_simd 7 Becoming & Load cryptd 3 Becoming & Load evdev 1 Becoming serio_raw 1 Becoming nvme 3 Becoming nvme_core 3 Becoming t10_pi 3 Becoming virtio_pci 3 Becoming crc32_pclmul 6 Becoming & Load crc64_rocksoft 3 Becoming crc32c_intel 3 Becoming virtio_pci_modern_dev 2 Becoming virtio_pci_legacy_dev 1 Becoming crc64 2 Becoming virtio 2 Becoming virtio_ring 2 Becoming [0] https://github.com/ColinIanKing/stress-ng.git [1] echo 0 > /proc/sys/vm/oom_dump_tasks ./stress-ng --module 100 --module-name xfs Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Donald Dutile <ddutile@redhat.com>	2024-06-17 14:17:25 -04:00
Lucas Zampieri	f6029bf351	Merge: workqueue: Backport workqueue commits to v6.9 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3910 JIRA: https://issues.redhat.com/browse/RHEL-25103 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3847 The primary purpose of this MR is to backport those upstream workqueue commits which enable ordered workqueues and rescuers to follow changes in workqueue unbound cpumask which is necessary to make sure that isolated CPUs won't be disturbed due to unbound work items being handled by those CPUs. These upstream commits were merged into the v6.9 kernel which also contains some major changes in workqueue code. This makes the required commits dependent on some of the v6.9 workqueue commits. It is less risky to sync the workqueue code up to v6.9 instead of selective backports of some dependent commits. This MR also includes some miscellaneous commits in other subsystems due to changes in the underlying workqueue implementations. A follow-up proactive workqueue fixes MR will be created later on, if necessary. Signed-off-by: Waiman Long <longman@redhat.com> Approved-by: Tony Camuso <tcamuso@redhat.com> Approved-by: Steve Best <sbest@redhat.com> Approved-by: Vladis Dronov <vdronov@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Radu Rendec <rrendec@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Lucas Zampieri <lzampier@redhat.com>	2024-06-13 13:07:43 +00:00
Ivan Vecera	4f908e0d18	string: Rewrite and add more kern-doc for the str*() functions JIRA: https://issues.redhat.com/browse/RHEL-40250 commit 03699f271de1f4df6369cd379506539cd7d590d3 Author: Kees Cook <keescook@chromium.org> Date: Fri Sep 2 14:33:44 2022 -0700 string: Rewrite and add more kern-doc for the str*() functions While there were varying degrees of kern-doc for various str*()-family functions, many needed updating and clarification, or to just be entirely written. Update (and relocate) existing kern-doc and add missing functions, sadly shaking my head at how many times I have written "Do not use this function". Include the results in the core kernel API doc. Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Andy Shevchenko <andy@kernel.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-hardening@vger.kernel.org Tested-by: Akira Yokosawa <akiyks@gmail.com> Link: https://lore.kernel.org/lkml/9b0cf584-01b3-3013-b800-1ef59fe82476@gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2024-06-10 19:14:58 +02:00
Mark Langsdorf	013a5d02a3	crash: memory and CPU hotplug sysfs attributes JIRA: https://issues.redhat.com/browse/RHEL-26183 Conflicts: Documentation/ABI/testing/sysfs-devices-system-cpu - The crash_hotplug text was added to the end of the file in upstream, so I did the same here. include/linux/kexec.h - minor context differences commit 88a6f89944216b028d3872b0cec0f51a2f955460 Author: Eric DeVolder <eric.devolder@oracle.com> Date: Thu, 24 Aug 2023 16:25:14 +0000 Introduce the crash_hotplug attribute for memory and CPUs for use by userspace. These attributes directly facilitate the udev rule for managing userspace re-loading of the crash kernel upon hot un/plug changes. For memory, expose the crash_hotplug attribute to the /sys/devices/system/memory directory. For example: # udevadm info --attribute-walk /sys/devices/system/memory/memory81 looking at device '/devices/system/memory/memory81': KERNEL=="memory81" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="00000051" ATTR{removable}=="1" ATTR{state}=="online" ATTR{valid_zones}=="Movable" looking at parent device '/devices/system/memory': KERNELS=="memory" SUBSYSTEMS=="" DRIVERS=="" ATTRS{auto_online_blocks}=="offline" ATTRS{block_size_bytes}=="8000000" ATTRS{crash_hotplug}=="1" For CPUs, expose the crash_hotplug attribute to the /sys/devices/system/cpu directory. 
For example: # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0 looking at device '/devices/system/cpu/cpu0': KERNEL=="cpu0" SUBSYSTEM=="cpu" DRIVER=="processor" ATTR{crash_notes}=="277c38600" ATTR{crash_notes_size}=="368" ATTR{online}=="1" looking at parent device '/devices/system/cpu': KERNELS=="cpu" SUBSYSTEMS=="" DRIVERS=="" ATTRS{crash_hotplug}=="1" ATTRS{isolated}=="" ATTRS{kernel_max}=="8191" ATTRS{nohz_full}==" (null)" ATTRS{offline}=="4-7" ATTRS{online}=="0-3" ATTRS{possible}=="0-7" ATTRS{present}=="0-3" With these sysfs attributes in place, it is possible to efficiently instruct the udev rule to skip crash kernel reloading for kernels configured with crash hotplug support. For example, the following is the proposed udev rule change for RHEL system 98-kexec.rules (as the first lines of the rule file): # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" When examined in the context of 98-kexec.rules, the above rules test if crash_hotplug is set, and if so, the userspace initiated unload-then-reload of the crash kernel is skipped. CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options. If an architecture supports, for example, memory hotplug but not CPU hotplug, then the /sys/devices/system/memory/crash_hotplug attribute file is present, but the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be present. Thus the udev rule skips userspace processing of memory hot un/plug events, but the udev rule will evaluate false for CPU events, thus allowing userspace to process CPU hot un/plug events (ie the unload-then-reload of the kdump capture kernel). 
Link: https://lkml.kernel.org/r/20230814214446.6659-5-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Akhil Raj <lf32.dev@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>	2024-05-31 15:37:43 -04:00
Waiman Long	deebfc6ab4	workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 3bc1e711c26bff01d41ad71145ecb8dcb4412576 Author: Tejun Heo <tj@kernel.org> Date: Mon, 5 Feb 2024 14:19:10 -1000 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way to create ordered workqueues and the new NUMA support broke it. These problems can be subtle and the fact that they can only trigger on NUMA machines made them even more difficult to debug. However, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly get an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. There aren't that many UNBOUND w/ @max_active==1 users in the tree and the preceding patches audited all and converted them to alloc_ordered_workqueue() as appropriate. This patch removes the implicit promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones. v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in apply_workqueue_attrs_locked() which spuriously triggers WARNING and fails workqueue creation. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:31 -04:00
Waiman Long	da3eaa2838	workqueue: Implement BH workqueues to eventually replace tasklets JIRA: https://issues.redhat.com/browse/RHEL-25103 Conflicts: A minor context diff in kernel/workqueue.c due to missing upstream commit 68279f9c9f59 ("treewide: mark stuff as __ro_after_init"). commit 4cb1ef64609f9b0254184b2947824f4b46ccab22 Author: Tejun Heo <tj@kernel.org> Date: Sun, 4 Feb 2024 11:28:06 -1000 workqueue: Implement BH workqueues to eventually replace tasklets The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws such as the execution code accessing the tasklet item after the execution is complete which can lead to subtle use-after-free in certain usage scenarios and less-developed flush and cancel mechanisms. This patch implements BH workqueues which share the same semantics and features of regular workqueues but execute their work items in the softirq context. As there is always only one BH execution context per CPU, none of the concurrency management mechanisms applies and a BH workqueue can be thought of as a convenience wrapper around softirq. Except for the inability to sleep while executing and lack of max_active adjustments, BH workqueues and work items should behave the same as regular workqueues and work items. Currently, the execution is hooked to tasklet[_hi]. However, the goal is to convert all tasklet users over to BH workqueues. Once the conversion is complete, tasklet can be removed and BH workqueues can directly take over the tasklet softirqs. system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in tasklet, all existing tasklet users should be able to use the system BH workqueues without creating their own workqueues. v3: - Add missing interrupt.h include. v2: - Instead of using tasklets, hook directly into its softirq action functions - tasklet[_hi]_action(). 
This is slightly cheaper and closer to the eventual code structure we want to arrive at. Suggested by Lai. - Lai also pointed out several places which need NULL worker->task handling or can use clarification. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com Tested-by: Allen Pais <allen.lkml@gmail.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:31 -04:00
Waiman Long	f71ada9990	Documentation/core-api: fix spelling mistake in workqueue JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 22160b08d88898898b2bb99282663cdf3caa5c1c Author: attreyee-muk <tintinm2017@gmail.com> Date: Thu, 11 Jan 2024 00:27:47 +0530 Documentation/core-api: fix spelling mistake in workqueue Correct to "following" from "followings" in the sentence "The followings are the read bandwidths and CPU utilizations depending on different affinity scope settings on ``kcryptd`` measured over five runs." Signed-off-by: Attreyee Mukherjee <tintinm2017@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20240110185746.24974-1-tintinm2017@gmail.com Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:29 -04:00
Waiman Long	fd132bf1dd	Documentation/core-api : fix typo in workqueue JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 89405db5cd1ed641dfe91b4a3796bb6188ab05e3 Author: attreyee-muk <tintinm2017@gmail.com> Date: Sat, 23 Dec 2023 23:23:17 +0530 Documentation/core-api : fix typo in workqueue Correct to “boundaries” from “bounaries” Signed-off-by: Attreyee Mukherjee <tintinm2017@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20231223175316.24951-1-tintinm2017@gmail.com Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:28 -04:00
Waiman Long	66b663c9fd	workqueue: doc: Fix function and sysfs path errors JIRA: https://issues.redhat.com/browse/RHEL-25103 commit bd9e7326b8d512ee724006d4ec06dfbf3096ae9e Author: WangJinchao <wangjinchao@xfusion.com> Date: Thu, 12 Oct 2023 15:17:38 +0800 workqueue: doc: Fix function and sysfs path errors alloc_ordered_queue -> alloc_ordered_workqueue /sys/devices/virtual/WQ_NAME/ -> /sys/devices/virtual/workqueue/WQ_NAME/ Signed-off-by: WangJinchao <wangjinchao@xfusion.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:28 -04:00
Waiman Long	150f0fd88d	workqueue: Make default affinity_scope dynamically updatable JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 523a301e66afd1ea9856660bcf3cee3a7c84c6dd Author: Tejun Heo <tj@kernel.org> Date: Mon, 7 Aug 2023 15:57:25 -1000 workqueue: Make default affinity_scope dynamically updatable While workqueue.default_affinity_scope is writable, it only affects workqueues which are created afterwards and isn't very useful. Instead, let's introduce explicit "default" scope and update the effective scope dynamically when workqueue.default_affinity_scope is changed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:28 -04:00
Waiman Long	170fd51369	workqueue: Add "Affinity Scopes and Performance" section to documentation JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 7dbf15c5c05e835d488e0fee49a35b0f23452e45 Author: Tejun Heo <tj@kernel.org> Date: Mon, 7 Aug 2023 15:57:25 -1000 workqueue: Add "Affinity Scopes and Performance" section to documentation With affinity scopes and their strictness setting added, unbound workqueues should now be able to cover wide variety of configurations and use cases. Unfortunately, the performance picture is not entirely straight-forward due to a trade-off between efficiency and work-conservation in some situations necessitating manual configuration. This patch adds "Affinity Scopes and Performance" section to Documentation/core-api/workqueue.rst which illustrates the trade-off with a set of experiments and provides some guidelines. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:27 -04:00
Waiman Long	87b2fa1d0e	workqueue: Implement non-strict affinity scope for unbound workqueues JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 8639ecebc9b1796d7074751a350462f5e1c61cd4 Author: Tejun Heo <tj@kernel.org> Date: Mon, 7 Aug 2023 15:57:25 -1000 workqueue: Implement non-strict affinity scope for unbound workqueues An unbound workqueue can be served by multiple worker_pools to improve locality. The segmentation is achieved by grouping CPUs into pods. By default, the cache boundaries according to cpus_share_cache() define how the CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the system has two L3 caches. The workqueue would be mapped to two worker_pools each serving one L3 cache domain. While this improves locality, because the pod boundaries are strict, it limits the total bandwidth a given issuer can consume. For example, let's say there is a thread pinned to a CPU issuing enough work items to saturate the whole machine. With the machine segmented into two pods, no matter how many work items it issues, it can only use half of the CPUs on the system. While this limitation has existed for a very long time, it wasn't very pronounced because the affinity grouping used to be always by NUMA nodes. With cache boundaries as the default and support for even finer grained scopes (smt and cpu), it is now a much more pressing problem. This patch implements non-strict affinity scope where the pod boundaries aren't enforced strictly. Going back to the previous example, the workqueue would still be mapped to two worker_pools; however, the affinity enforcement would be soft. The workers in both pools would have their cpus_allowed set to the whole machine thus allowing the scheduler to migrate them anywhere on the machine. However, whenever an idle worker is woken up, the workqueue code asks the scheduler to bring back the task within the pod if the worker is outside. i.e.
work items start executing within its affinity scope but can be migrated outside as the scheduler sees fit. This removes the hard cap on utilization while maintaining the benefits of affinity scopes. After the earlier ->__pod_cpumask changes, the implementation is pretty simple. When non-strict which is the new default: * pool_allowed_cpus() returns @pool->attrs->cpumask instead of ->__pod_cpumask so that the workers are allowed to run on any CPU that the associated workqueues allow. * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets the field to a CPU within the pod. This would be the first use of task_struct->wake_cpu outside scheduler proper, so it isn't clear whether this would be acceptable. However, other methods of migrating tasks are significantly more expensive and are likely prohibitively so if we want to do this on every work item. This needs discussion with scheduler folks. There is also a race window where setting ->wake_cpu wouldn't be effective as the target task is still on CPU. However, the window is pretty small and this being a best-effort optimization, it doesn't seem to warrant more complexity at the moment. While the non-strict cache affinity scopes seem to be the best option, the performance picture interacts with the affinity scope and is a bit complicated to fully discuss in this patch, so the behavior is made easily selectable through wqattrs and sysfs and the next patch will add documentation to discuss performance implications. v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:27 -04:00
Waiman Long	89997d1af2	workqueue: Add multiple affinity scopes and interface to select them JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 63c5484e74952f60f5810256bd69814d167b8d22 Author: Tejun Heo <tj@kernel.org> Date: Mon, 7 Aug 2023 15:57:24 -1000 workqueue: Add multiple affinity scopes and interface to select them Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE the default. The code changes to actually add the additional scopes are trivial. Also add module parameter "workqueue.default_affinity_scope" to override the default scope and "affinity_scope" sysfs file to configure it per workqueue. wq_dump.py and documentations are updated accordingly. This enables significant flexibility in configuring how unbound workqueues behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu workqueue. On the other hand, "system" removes all locality boundaries. Many modern machines have multiple L3 caches often while being mostly uniform in terms of memory access. Thus, workqueue's previous behavior of spreading work items in each NUMA node had negative performance implications from unnecessarily crossing L3 boundaries between issue and execution. However, picking a finer grained affinity scope also has a downside in that an issuer in one group can't utilize CPUs in other groups. While dependent on the specifics of workload, there's usually a noticeable penalty in crossing L3 boundaries, so let's default to CACHE. This issue will be further addressed and documented with examples in future patches. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:27 -04:00
Waiman Long	8aa1b3540f	workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 7f7dc377a3b2bbaa8cf8941587c228eab4bd82ec Author: Tejun Heo <tj@kernel.org> Date: Mon, 7 Aug 2023 15:57:24 -1000 workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration Lack of visibility has always been a pain point for workqueues. While the recently added wq_monitor.py improved the situation, it's still difficult to understand what worker pools are active in the system, how workqueues map to them and why. The lack of visibility into how workqueues are configured is going to become more noticeable as workqueue improves locality awareness and provides more mechanisms to customize locality related behaviors. Now that the basic framework for more flexible locality support is in place, this is a good time to improve the situation. This patch adds tools/workqueue/wq_dump.py which prints out the topology configuration, worker pools and how workqueues are mapped to pools. Read the command's help message for more details. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:27 -04:00
Waiman Long	de03f9c951	workqueue: Make unbound workqueues to use per-cpu pool_workqueues JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 636b927eba5bc633753f8eb80f35e1d5be806e51 Author: Tejun Heo <tj@kernel.org> Date: Mon, 7 Aug 2023 15:57:23 -1000 workqueue: Make unbound workqueues to use per-cpu pool_workqueues A pwq (pool_workqueue) represents an association between a workqueue and a worker_pool. When a work item is queued, the workqueue selects the pwq to use, which in turn determines the pool, and queues the work item to the pool through the pwq. pwq is also what implements the maximum concurrency limit - @max_active. As a per-cpu workqueue should be associated with a different worker_pool on each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq. However, unbound workqueues were sharing a pwq within each NUMA node by default. The sharing has several downsides: * Because @max_active is per-pwq, the meaning of @max_active changes depending on the machine configuration and whether workqueue NUMA locality support is enabled. * Makes per-cpu and unbound code deviate. * Gets in the way of making workqueue CPU locality awareness more flexible. This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu workqueues do by making the following changes: * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound workqueues. * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs the specified pwq to the target CPU's wq->cpu_pwq. * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq. This makes the return value of wq_calc_node_cpumask() unnecessary. It now returns void. * @max_active now means the same thing for both per-cpu and unbound workqueues.
WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer used in workqueue implementation and will be removed later. * All unbound pwq operations which used to be per-numa-node are now per-cpu. For most unbound workqueue users, this shouldn't cause noticeable changes. Work item issue and completion will be a small bit faster, flush_workqueue() would become a bit more expensive, and the total concurrency limit would likely become higher. All @max_active==1 use cases are currently being audited for conversion into alloc_ordered_workqueue() and they shouldn't be affected once the audit and conversion is complete. One area where the behavior change may be more noticeable is workqueue_congested() as the reported congestion state is now per CPU instead of NUMA node. There are only two users of this interface - drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are cc'd. Inputs on the behavior change would be very much appreciated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Karsten Graul <kgraul@linux.ibm.com> Cc: Wenjia Zhang <wenjia@linux.ibm.com> Cc: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:26 -04:00
Waiman Long	de650632ad	workqueue: Track and monitor per-workqueue CPU time usage JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 8a1dd1e547c1a037692e7a6da6a76108108c72b1 Author: Tejun Heo <tj@kernel.org> Date: Wed, 17 May 2023 17:02:09 -1000 workqueue: Track and monitor per-workqueue CPU time usage Now that wq_worker_tick() is there, we can easily track the rough CPU time consumption of each workqueue by charging the whole tick whenever a tick hits an active workqueue. While not super accurate, it provides reasonable visibility into the workqueues that consume a lot of CPU cycles. wq_monitor.py is updated to report the per-workqueue CPU times. v2: wq_monitor.py was using "cputime" as the key when outputting in json format. Use "cpu_time" instead for consistency with other fields. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:24 -04:00
Waiman Long	1665f6ac9c	workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 616db8779b1e3f93075df691432cccc5ef3c3ba0 Author: Tejun Heo <tj@kernel.org> Date: Wed, 17 May 2023 17:02:08 -1000 workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE If a per-cpu work item hogs the CPU, it can prevent other work items from starting through concurrency management. A per-cpu workqueue which intends to host such CPU-hogging work items can choose to not participate in concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be error-prone and difficult to debug when missed. This patch adds an automatic CPU usage based detection. If a concurrency-managed work item consumes more CPU time than the threshold (10ms by default) continuously without intervening sleeps, wq_worker_tick() which is called from scheduler_tick() will detect the condition and automatically mark it CPU_INTENSIVE. The mechanism isn't foolproof: * Detection depends on tick hitting the work item. Getting preempted at the right timings may allow a violating work item to evade detection at least temporarily. * nohz_full CPUs may not be running ticks and thus can fail detection. * Even when detection is working, the 10ms detection delays can add up if many CPU-hogging work items are queued at the same time. However, in vast majority of cases, this should be able to detect violations reliably and provide reasonable protection with a small increase in code complexity. If some work items trigger this condition repeatedly, the bigger problem likely is the CPU being saturated with such per-cpu work items and the solution would be making them UNBOUND. The next patch will add a debug mechanism to help spot such cases. v4: Documentation for workqueue.cpu_intensive_thresh_us added to kernel-parameters.txt. v3: Switch to use wq_worker_tick() instead of hooking into preemptions as suggested by Peter. 
v2: Lai pointed out that wq_worker_stopping() also needs to be called from preemption and rtlock paths and an earlier patch was updated accordingly. This patch adds a comment describing the risk of infinite recursions and how they're avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:24 -04:00
Waiman Long	20a387c381	workqueue: Add pwq->stats[] and a monitoring script JIRA: https://issues.redhat.com/browse/RHEL-25103 commit 725e8ec59c56c65fb92e343c10a8842cd0d4f194 Author: Tejun Heo <tj@kernel.org> Date: Wed, 17 May 2023 17:02:08 -1000 workqueue: Add pwq->stats[] and a monitoring script Currently, the only way to peer into workqueue operations is through tracing. While possible, it isn't easy or convenient to monitor per-workqueue behaviors over time this way. Let's add pwq->stats[] that track relevant events and a drgn monitoring script - tools/workqueue/wq_monitor.py. It's arguable whether this needs to be configurable. However, it currently only has several counters and the runtime overhead shouldn't be noticeable given that they're on pwq's which are per-cpu on per-cpu workqueues and per-numa-node on unbound ones. Let's keep it simple for the time being. v2: Patch reordered to earlier with fewer fields. Field will be added back gradually. Help message improved. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:23 -04:00
Waiman Long	b47ae598c5	workqueue: doc: Call out the non-reentrance conditions JIRA: https://issues.redhat.com/browse/RHEL-25103 commit f9eaaa82b474350aa8436d15a7ae150a3c8b9d5c Author: Boqun Feng <boqun.feng@gmail.com> Date: Fri, 22 Oct 2021 08:42:08 +0800 workqueue: doc: Call out the non-reentrance conditions The current doc of workqueue API suggests that work items are non-reentrant: any work item is guaranteed to be executed by at most one worker system-wide at any given time. However, this is not true: the following case can cause a work item W to be executed by two workers at the same time: queue_work_on(0, WQ1, W); // after a worker picks up W and clears the pending bit queue_work_on(1, WQ2, W); // workers on CPU0 and CPU1 will execute W at the same time. This means the non-reentrance of a work item is conditional. Lai Jiangshan provided a nice summary[1] of the conditions, so use it to describe a work item instance and improve the doc. [1]: https://lore.kernel.org/lkml/CAJhGHyDudet_xyNk=8xnuO2==o-u06s0E0GZVP4Q67nmQ84Ceg@mail.gmail.com/ Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com>	2024-05-03 13:39:21 -04:00
Chris von Recklinghausen	3b26b723b6	mm/slab: document kfree() as allowed for kmem_cache_alloc() objects JIRA: https://issues.redhat.com/browse/RHEL-27741 commit ae65a5211d90e54ae604012ce9cf234c48780929 Author: Vlastimil Babka <vbabka@suse.cz> Date: Thu Mar 2 16:01:00 2023 +0100 mm/slab: document kfree() as allowed for kmem_cache_alloc() objects This will make it easier to free objects in situations when they can come from either kmalloc() or kmem_cache_alloc(), and also allow kfree_rcu() for freeing objects from kmem_cache_alloc(). For the SLAB and SLUB allocators this was always possible so with SLOB gone, we can document it as supported. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2024-04-30 07:00:18 -04:00
Chris von Recklinghausen	ea5a8c928e	mm, printk: introduce new format %pGt for page_type JIRA: https://issues.redhat.com/browse/RHEL-27741 commit 4c85c0be3d7a9a7ffe48bfe0954eacc0ba9d3c75 Author: Hyeonggon Yoo <42.hyeyoo@gmail.com> Date: Mon Jan 30 13:25:13 2023 +0900 mm, printk: introduce new format %pGt for page_type %pGp format is used to display 'flags' field of a struct page. However, some page flags (e.g. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note that the meaning of bits is different in page_type. If page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. If it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2024-04-30 07:00:05 -04:00
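The inverted encoding of page_type described in the message above can be checked with a few lines of arithmetic. This is a minimal Python sketch using only the two values quoted in the message (0xffffffff for "no flags set" and 0x00000080 for PG_buddy); the constant and function names are hypothetical, not the kernel's macros, and the page_type_has_type() check is not modelled.

```python
MASK32 = 0xFFFFFFFF
PAGE_TYPE_NONE = 0xFFFFFFFF  # value from the commit message: no flags set
PG_BUDDY = 0x00000080        # flag value from the commit message


def set_type_flag(page_type: int, flag: int) -> int:
    """Setting a flag means clearing its bit in the 32-bit field."""
    return page_type & ~flag & MASK32


def flags_set(page_type: int) -> int:
    """Invert the field so that flags which are set appear as 1 bits."""
    return ~page_type & MASK32


pt = set_type_flag(PAGE_TYPE_NONE, PG_BUDDY)
print(hex(pt))             # 0xffffff7f
print(hex(flags_set(pt)))  # 0x80
```

The second print corresponds to the inversion step the message says is applied before type names are displayed.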
Ivan Vecera	479b9ec9d0	doc/netlink: Update genetlink-legacy documentation JIRA: https://issues.redhat.com/browse/RHEL-30656 commit 294f37fc87728230cf81f387aa6a0a8438066751 Author: Donald Hunter <donald.hunter@gmail.com> Date: Fri Aug 25 13:27:46 2023 +0100 doc/netlink: Update genetlink-legacy documentation Add documentation for recently added genetlink-legacy schema attributes. Remove statements about 'work in progress' and 'todo'. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230825122756.7603-4-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2024-04-10 09:19:32 +02:00
Ivan Vecera	12a80d14ce	docs: add more netlink docs (incl. spec docs) JIRA: https://issues.redhat.com/browse/RHEL-30344 Conflicts: - small context conflict in MAINTAINERS commit 9d6a65079c98f55fa2249c50e517d133d137c251 Author: Jakub Kicinski <kuba@kernel.org> Date: Fri Jan 20 09:50:34 2023 -0800 docs: add more netlink docs (incl. spec docs) Add documentation about the upcoming Netlink protocol specs. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2024-04-02 11:15:38 +02:00
Nico Pache	3fd58a2958	selftests/vm: rename selftests/vm to selftests/mm commit baa489fabd01596d5426d6e112b34ba5fb59ab82 Author: SeongJae Park <sj@kernel.org> Date: Tue Jan 3 18:07:53 2023 +0000 selftests/vm: rename selftests/vm to selftests/mm Rename selftests/vm to selftests/mm to be more consistent with the code, documentation, and tools directories, and to avoid confusion with virtual machines. [sj@kernel.org: convert missing vm->mm changes] Link: https://lkml.kernel.org/r/20230107230643.252273-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230103180754.129637-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> JIRA: https://issues.redhat.com/browse/RHEL-5617 Signed-off-by: Nico Pache <npache@redhat.com>	2024-01-19 10:06:45 -07:00
Jaroslav Kysela	0ad76c02a7	Documentation: core-api: Drop :export: for int_log.h JIRA: https://issues.redhat.com/browse/RHEL-13724 commit 5bdeb6f5c7b94f653183f08eca8a08022b2efac6 Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Date: Tue Jul 25 13:49:56 2023 +0300 Documentation: core-api: Drop :export: for int_log.h The :export: keyword makes sense only for C-files, where EXPORT_SYMBOL() might appear. Otherwise kernel-doc may not produce anything out of this file. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: f97fa3dcb2db ("lib/math: Move dvb_math.c into lib/math/int_log.c") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20230725104956.47806-1-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jaroslav Kysela <jkysela@redhat.com>	2023-12-18 16:30:55 +01:00
Jaroslav Kysela	09e210346e	lib/math: Move dvb_math.c into lib/math/int_log.c JIRA: https://issues.redhat.com/browse/RHEL-13724 commit f97fa3dcb2db02013e6904c032a1d2d45707ee40 Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Date: Mon Jul 3 16:52:08 2023 +0300 lib/math: Move dvb_math.c into lib/math/int_log.c Some existing and new users may benefit from the intlog2() and intlog10() APIs, so make them widely available. Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org> Link: https://lore.kernel.org/r/20230619172019.21457-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20230703135211.87416-2-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jaroslav Kysela <jkysela@redhat.com>	2023-12-18 16:30:34 +01:00
David Arcari	4f7156ed12	x86/topology: Remove CPU0 hotplug option JIRA: https://issues.redhat.com/browse/RHEL-15512 commit e59e74dc48a309cb848ffc3d76a0d61aa6803c05 Author: Thomas Gleixner <tglx@linutronix.de> Date: Fri May 12 23:07:04 2023 +0200 x86/topology: Remove CPU0 hotplug option This was introduced together with commit `e1c467e690` ("x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI") to eventually support physical hotplug of CPU0: "We'll change this code in the future to wake up hard offlined CPU0 if real platform and request are available." 11 years later this has not happened and physical hotplug is not officially supported. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.715707999@linutronix.de Signed-off-by: David Arcari <darcari@redhat.com>	2023-12-05 11:56:51 -05:00
Chris von Recklinghausen	fe5f50def7	mm: remove folio_pincount_ptr() and head_compound_pincount() JIRA: https://issues.redhat.com/browse/RHEL-1848 commit 94688e8eb453e616098cb930e5f6fed4a6ea2dfa Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Wed Jan 11 14:28:47 2023 +0000 mm: remove folio_pincount_ptr() and head_compound_pincount() We can use folio->_pincount directly, since all users are guarded by tests of compound/large. Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:15:52 -04:00
Chris von Recklinghausen	35cd91f122	mm/page_alloc: remove obsolete gfpflags_normal_context() JIRA: https://issues.redhat.com/browse/RHEL-1848 commit def76fd549c513bb90278a8d6d0fe3ef3faa20a7 Author: Miaohe Lin <linmiaohe@huawei.com> Date: Fri Sep 16 15:22:56 2022 +0800 mm/page_alloc: remove obsolete gfpflags_normal_context() Since commit dacb5d8875cc ("tcp: fix page frag corruption on page fault"), there's no caller of gfpflags_normal_context(). Remove it as this helper is strictly tied to the sk page frag usage and there won't be other user in the future. [linmiaohe@huawei.com: fix htmldocs] Link: https://lkml.kernel.org/r/1bc55727-9b66-0e9e-c306-f10c4716ea89@huawei.com Link: https://lkml.kernel.org/r/20220916072257.9639-16-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:14:47 -04:00
Chris von Recklinghausen	914f4f8c1f	headers/deps: mm: align MANITAINERS and Docs with new gfp.h structure JIRA: https://issues.redhat.com/browse/RHEL-1848 commit 7343f2b0db4961d9f386e685e651c663dc763d0c Author: Yury Norov <yury.norov@gmail.com> Date: Wed Jul 6 08:52:24 2022 -0700 headers/deps: mm: align MANITAINERS and Docs with new gfp.h structure After moving gfp types out of gfp.h, we have to align MAINTAINERS and Docs, to avoid warnings like this: >> include/linux/gfp.h:1: warning: 'Page mobility and placement hints' not found >> include/linux/gfp.h:1: warning: 'Watermark modifiers' not found >> include/linux/gfp.h:1: warning: 'Reclaim modifiers' not found >> include/linux/gfp.h:1: warning: 'Useful GFP flag combinations' not found Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:12:56 -04:00
Jan Stancek	d50c1077fc	Merge: MM changes for RHEL 9.3 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2069 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160210 Patches 0001-0037: Generic preparation and code dependencies, mostly moving sysctls definitions to their related compilation units; Patches 0038-0058: Change sets targeting ARM; Patches 0059-0090: Change sets targeting PPC; Patches 0091-0144: Change sets targeting s390; Patches 0145-0154: Change sets targeting x86_64; Patches 0155-0915: Change sets for portable changes; Not necessary: Omitted-fix: b83699ea1e62 ("LoongArch: mm: Avoid unnecessary page fault retires on shared memory types") Omitted-fix: 6f5c672d17f5 ("s390/smp: enforce lowcore protection on CPU restart") Omitted-fix: 953503751a42 ("Revert "s390/smp: enforce lowcore protection on CPU restart"") To be picked up later (presumably under bz2168372): Omitted-fix: 4f4c549feb4e ("arm64: mte: Avoid the racy walk of the vma list during core dump") Omitted-fix: ae3a2a218821 ("powerpc/ftrace: Remove redundant create_branch() calls") Omitted-fix: af320fb7ddb0 ("selftests/bpf: Fix s390x vmlinux path") Omitted-fix: 16fd6b31dd9b ("Revert "mm: migration: fix the FOLL_GET failure on following huge page"") Omitted-fix: 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings") Change already in CS9 Omitted-fix: 4302abc628fc ("powerpc/64s: Prevent fallthrough to hash TLB flush when using radix") Omitted-fix: badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs") Omitted-fix: 36d4b36b6959 ("lib/nodemask: inline next_node_in() and node_random()") Omitted-fix: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case") Omitted-fix: dea18da45992 ("powerpc/64s: Fix stress_hpt memblock alloc alignment") Omitted-fix: 2e8cff0a0eee ("arm64: fix rodata=full") Omitted-fix: 2081b3bd0c11 ("arm64: fix rodata=full again") Omitted-fix: 736eedc974ea ("arm64: mte: Fix double-freeing of the temporary tag 
storage during coredump") Omitted-fix: 19e183b54528 ("elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}") Omitted-fix: 61d2d1808b20 ("arm64: mm: don't acquire mutex when rewriting swapper") Omitted-fix: a3d3163fbe69 ("x86/mm/32: Fix W^X detection when page tables do not support NX") Omitted-fix: a970174d7a10 ("x86/mm: Do not verify W^X at boot up") Omitted-fix: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Omitted-fix: 63cf584203f3 ("mm: teach mincore_hugetlb about pte markers") Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=51556952 KT1+mm regression: https://beaker.engineering.redhat.com/jobs/7663545 Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Mika Penttilä <mpenttil@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-03-29 12:09:11 +02:00
Jan Stancek	7854c2726c	Merge: bpf, xdp: update to 6.1 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2067 bpf, xdp: update to 6.1 Bugzilla: https://bugzilla.redhat.com/2166911 Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Felix Maurer <fmaurer@redhat.com> Approved-by: Viktor Malik <vmalik@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-03-27 07:05:16 +02:00
Chris von Recklinghausen	23ba6a23ee	arch//: remove CONFIG_VIRT_TO_BUS Conflicts: drop changes to arch/alpha/include/asm/io.h - unsupported config Bugzilla: https://bugzilla.redhat.com/2160210 commit 4313a24985f00340eeb591fd66aa2b257b9e0a69 Author: Arnd Bergmann <arnd@arndb.de> Date: Mon May 23 21:59:02 2022 +0200 arch//: remove CONFIG_VIRT_TO_BUS All architecture-independent users of virt_to_bus() and bus_to_virt() have been fixed to use the dma mapping interfaces or have been removed now. This means the definitions on most architectures, and the CONFIG_VIRT_TO_BUS symbol are now obsolete and can be removed. The only exceptions to this are a few network and scsi drivers for m68k Amiga and VME machines and ppc32 Macintosh. These drivers work correctly with the old interfaces and are probably not worth changing. On alpha and parisc, virt_to_bus() were still used in asm/floppy.h. alpha can use isa_virt_to_bus() like x86 does, and parisc can just open-code the virt_to_phys() here, as this is architecture specific code. I tried updating the bus-virt-phys-mapping.rst documentation, which started as an email from Linus to explain some details of the Linux-2.0 driver interfaces. The bits about virt_to_bus() were declared obsolete backin 2000, and the rest is not all that relevant any more, so in the end I just decided to remove the file completely. Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:15 -04:00
Chris von Recklinghausen	267a7a9b62	docs: rename Documentation/vm to Documentation/mm Conflicts: drop changes to arch/loongarch/Kconfig - unsupported config Bugzilla: https://bugzilla.redhat.com/2160210 commit ee65728e103bb7dd99d8604bf6c7aa89c7d7e446 Author: Mike Rapoport <rppt@kernel.org> Date: Mon Jun 27 09:00:26 2022 +0300 docs: rename Documentation/vm to Documentation/mm so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:19:15 -04:00
Nico Pache	d0b81c5b5a	Maple Tree: add new data structure Conflicts: lib/Makefile: slight makefile conflict commit 54a611b605901c7d5d05b6b8f5d04a6ceb0962aa Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Tue Sep 6 19:48:39 2022 +0000 Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. 
Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. 
The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2166668 Signed-off-by: Nico Pache <npache@redhat.com>	2023-03-07 01:23:54 -07:00
Artem Savkov	af356f677f	timekeeping: Introduce fast accessor to clock tai Bugzilla: https://bugzilla.redhat.com/2166911 Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git commit 3dc6ffae2da201284cb24af66af77ee0bbb2efaa Author: Kurt Kanzenbach <kurt@linutronix.de> Date: Thu Apr 14 11:18:03 2022 +0200 timekeeping: Introduce fast accessor to clock tai Introduce fast/NMI safe accessor to clock tai for tracing. The Linux kernel tracing infrastructure has support for using different clocks to generate timestamps for trace events. Especially in TSN networks it's useful to have TAI as trace clock, because the application scheduling is done in accordance to the network time, which is based on TAI. With a tai trace_clock in place, it becomes very convenient to correlate network activity with Linux kernel application traces. Use the same implementation as ktime_get_boot_fast_ns() does by reading the monotonic time and adding the TAI offset. The same limitations as for the fast boot implementation apply. The TAI offset may change at run time e.g., by setting the time or using adjtimex() with an offset. However, these kind of offset changes are rare events. Nevertheless, the user has to be aware and deal with it in post processing. An alternative approach would be to use the same implementation as ktime_get_real_fast_ns() does. However, this requires to add an additional u64 member to the tk_read_base struct. This struct together with a seqcount is designed to fit into a single cache line on 64 bit architectures. Adding a new member would violate this constraint. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20220414091805.89667-2-kurt@linutronix.de Signed-off-by: Artem Savkov <asavkov@redhat.com>	2023-03-06 14:54:28 +01:00

369 Commits