Commit Graph

369 Commits

Author SHA1 Message Date
Waiman Long babedfece6 Union-Find: add a new module in kernel library
JIRA: https://issues.redhat.com/browse/RHEL-83455
Conflicts: A merge conflict in lib/Makefile and context diff in MAINTAINERS.

commit 93c8332c8373fee415bd79f08d5ba4ba7ca5ad15
Author: Xavier <xavier_qy@163.com>
Date:   Thu, 4 Jul 2024 14:24:43 +0800

    Union-Find: add a new module in kernel library

    This patch implements a union-find data structure in the kernel library,
    which includes operations for allocating nodes, freeing nodes,
    finding the root of a node, and merging two nodes.
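
As a rough illustration of the operations named above (this is not the kernel's C implementation), a minimal union-find sketch might look like the following. The class and method names are hypothetical, and path halving and union by rank are common optimizations assumed here rather than details taken from the patch. Freeing nodes is left out because Python manages memory itself.

```python
class UnionFind:
    """Disjoint-set forest keyed by hashable items (illustrative sketch only)."""

    def __init__(self):
        self.parent = {}  # item -> parent item; a root is its own parent
        self.rank = {}    # item -> upper bound on the height of its tree

    def add(self, item):
        """Register item as a singleton set (the 'allocate node' step)."""
        if item not in self.parent:
            self.parent[item] = item
            self.rank[item] = 0

    def find(self, item):
        """Return the root of item's set, halving the path along the way."""
        while self.parent[item] != item:
            # Point item at its grandparent, then step up to it.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        """Merge the sets containing a and b and return the resulting root."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        # Attach the shallower tree under the deeper one.
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1
        return root_a


uf = UnionFind()
for node in range(4):
    uf.add(node)
uf.union(0, 1)
uf.union(2, 3)
same_01 = uf.find(0) == uf.find(1)  # 0 and 1 were merged
same_02 = uf.find(0) == uf.find(2)  # 0 and 2 are still in different sets
```

Two nodes belong to the same set exactly when `find()` returns the same root for both, which is the check a caller would typically perform after a series of merges.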

    Signed-off-by: Xavier <xavier_qy@163.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-04-09 21:58:35 -04:00
Baoquan He 308e9a3386 Document/kexec: generalize crash hotplug description
JIRA: https://issues.redhat.com/browse/RHEL-58641

Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Conflicts: In Documentation/ABI/testing/sysfs-devices-system-cpu, there
           is a conflict because of context fuzz.

commit c91c6062d6cd1bc366efb04973ee449c30398a49
Author: Sourabh Jain <sourabhjain@linux.ibm.com>
Date:   Mon Aug 12 09:46:51 2024 +0530

    Document/kexec: generalize crash hotplug description

    Commit 79365026f869 ("crash: add a new kexec flag for hotplug support")
    generalizes the crash hotplug support to allow architectures to update
    multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr.
    Therefore, update the relevant kernel documentation to reflect the same.

    No functional change.

    Link: https://lkml.kernel.org/r/20240812041651.703156-1-sourabhjain@linux.ibm.com
    Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
    Reviewed-by: Petr Tesarik <ptesarik@suse.com>
    Acked-by: Baoquan He <bhe@redhat.com>
    Cc: Hari Bathini <hbathini@linux.ibm.com>
    Cc: Petr Tesarik <petr@tesarici.cz>
    Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Baoquan He <bhe@redhat.com>
2024-12-23 09:35:36 +08:00
Rafael Aquini 8a32fd3b9a maple_tree: update the documentation of maple tree
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 9bc1d3cdb904170214456bca96c4924f28522ab8
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date:   Fri Oct 27 11:38:41 2023 +0800

    maple_tree: update the documentation of maple tree

    Introduce the new interface mtree_dup() in the documentation.

    Link: https://lkml.kernel.org/r/20231027033845.90608-7-zhangpeng.00@bytedance.com
    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Mateusz Guzik <mjguzik@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Mike Christie <michael.christie@oracle.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:31 -05:00
Audra Mitchell 754309ec6b Documentation/protection-keys: Clean up documentation for User Space pkeys
JIRA: https://issues.redhat.com/browse/RHEL-55461

This patch is a backport of the following upstream commit:
commit f8c1d4ca55177326adad1fdc6bf602423a507542
Author: Ira Weiny <ira.weiny@intel.com>
Date:   Tue Apr 19 10:06:06 2022 -0700

    Documentation/protection-keys: Clean up documentation for User Space pkeys

    The documentation for user space pkeys was a bit dated including things
    such as Amazon and distribution testing information which is irrelevant
    now.

    Update the documentation.  This also streamlines adding the Supervisor
    pkey documentation later on.

    Signed-off-by: Ira Weiny <ira.weiny@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Link: https://lkml.kernel.org/r/20220419170649.1022246-2-ira.weiny@intel.com

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:14 -05:00
Rado Vrbovsky 570a71d7db Merge: mm: update core code to v6.6 upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252

JIRA: https://issues.redhat.com/browse/RHEL-27743  
JIRA: https://issues.redhat.com/browse/RHEL-59459    
CVE: CVE-2024-46787    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961  
  
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.    
This work follows up on the previous v6.5 update (RHEL-27742) and as such,    
the bulk of this changeset is comprised of refactoring and clean-ups of     
the internal implementation of several APIs as it further advances the     
conversion to folios, and follow-ups to the per-VMA locking changes.

Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow    
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,    
and we add a potential extra level of protection (assessment pending) to help  
mitigate the kernel heap exploits dubbed "SlubStick".  
    
Follow-up fixes are omitted from this series either because they are irrelevant to     
the bits we support on RHEL or because they depend on bigger changesets introduced     
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.    

Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")    
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")   
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")    
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")    
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")    
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")    
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")    
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")    
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")    
    
Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-30 07:22:28 +00:00
Rado Vrbovsky a5ea1cdd29 Merge: 9.6 IOMMU and DMA api updates
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5113

# Merge Request Required Information

JIRA: https://issues.redhat.com/browse/RHEL-36247  
JIRA: https://issues.redhat.com/browse/RHEL-54186  
JIRA: https://issues.redhat.com/browse/RHEL-54189    
JIRA: https://issues.redhat.com/browse/RHEL-55199  
JIRA: https://issues.redhat.com/browse/RHEL-55200    
JIRA: https://issues.redhat.com/browse/RHEL-55448  
JIRA: https://issues.redhat.com/browse/RHEL-55450  
JIRA: https://issues.redhat.com/browse/RHEL-55466  
JIRA: https://issues.redhat.com/browse/RHEL-57229  
Upstream: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git  
CVE: CVE-2024-44994  

## Summary of Changes

This brings the IOMMU and DMA api subsystems in line with v6.11. Overall it is smaller than normal, but I expect there will be another MR later on with some things landing in v6.12. In addition to the usual fixes and cleanups, the major changes are:

- iommufd io page fault support
- smmuv3 dirty page tracking support
- vt-d cache tagging support
- iommu memory usage observability support
- Some cleanups from Robin related to dma range calculation, arch setup_dma_ops, and iommu_fwspec_ops
- Another batch of updates to smmuv3 from Jason's reworking of the driver (2b/3)

There is also a RHEL-only cleanup to account for a cleanup Linus made to the iommufd code in one of his merges. It was related to some mm changes at the time, but it wasn't dealt with when the iommufd or mm changes were merged to RHEL.

Testing:

- kernel-tests iommu new-boot testing
- kernel-tests iommu fio testing
- kernel-tests dmatest and idxd tests
- iommufd kernel selftest
- general cki testing


v5: Added 3 fixes that recently landed upstream.  
v6: Resolve issues raised by Don:  

    - Conflicts note updates  
    - Backport of acpi, and arm7 io-pgtable commit.  
    - While doing this I went through to diff against upstream and cleaned up a couple more merge commit cleanups:  

      * __GFP_ZERO use in amd ppr log alloc code.
      * Line break in swiotlb code.
      * dead config option in drivers/iommu/intel/Kconfig  

    - Hyperv commit that was missed in previous backports.  
v7: Fixed typo from using Conflict instead of Conflicts for tag.  
v8: Fixed borked conflict resolution after adding acpi patch in v6.  
v10: Added dt bindings commit mentioned by Eric.  
v11: Rebase due to merge conflict after acpi MR merged. This also drops a couple acpi commits that were now empty.  

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

## Approved Development Ticket
All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-19 08:07:00 +00:00
Rafael Aquini 4e136352a4 mm: add orphaned kernel-doc to the rst files.
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 61ff748b5b7b0c32daddbfb92c3bc15d938754dc
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Aug 18 21:06:30 2023 +0100

    mm: add orphaned kernel-doc to the rst files.

    There are many files in mm/ that contain kernel-doc which is not
    currently published on kernel.org.  Some of it is easily categorisable,
    but most of it is going into the miscellaneous documentation section to
    be organised later.

    Some files aren't ready to be included; they contain documentation with
    build errors.  Or they're nommu.c which duplicates documentation from
    "real" MMU systems.  Those files are noted with a # mark (although really
    anything which isn't a recognised directive would do to prevent inclusion)

    Link: https://lkml.kernel.org/r/20230818200630.2719595-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:59 -04:00
Rafael Aquini bb486f4fbc mm: remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 29d26f1215de14721188988a59b1426abb85b7be
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 2 16:13:33 2023 +0100

    mm: remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO

    Current best practice is to reuse the name of the function as a define to
    indicate that the function is implemented by the architecture.

    Link: https://lkml.kernel.org/r/20230802151406.3735276-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:19 -04:00
Rafael Aquini 6443ba8155 mm: add generic flush_icache_pages() and documentation
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 3a255267f6dff40e193501cf731f409ce9175503
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 2 16:13:31 2023 +0100

    mm: add generic flush_icache_pages() and documentation

    flush_icache_page() is deprecated but not yet removed, so add a range
    version of it.  Change the documentation to refer to
    update_mmu_cache_range() instead of update_mmu_cache().

    Link: https://lkml.kernel.org/r/20230802151406.3735276-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:17 -04:00
Jerry Snitselaar c880b92922 Documentation/core-api: correct reference to SWIOTLB_DYNAMIC
JIRA: https://issues.redhat.com/browse/RHEL-55466
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 82d71b53d7e732ede6028591342bdc80fabfa29f
Author: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Date:   Mon May 27 15:13:14 2024 +0200

    Documentation/core-api: correct reference to SWIOTLB_DYNAMIC

    Commit c93f261dfc39 ("Documentation/core-api: add swiotlb documentation")
    accidentally refers to CONFIG_DYNAMIC_SWIOTLB in one place, while the
    config is actually called CONFIG_SWIOTLB_DYNAMIC.

    Correct the reference to the intended config option.

    Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
    Reviewed-by: Petr Tesarik <petr@tesarici.cz>
    Signed-off-by: Christoph Hellwig <hch@lst.de>

(cherry picked from commit 82d71b53d7e732ede6028591342bdc80fabfa29f)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-09-20 12:29:02 -07:00
Jerry Snitselaar b75202bae9 Documentation/core-api: add swiotlb documentation
JIRA: https://issues.redhat.com/browse/RHEL-55466
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit c93f261dfc395b1386320b2f0b160a6f34ed9ea5
Author: Michael Kelley <mhklinux@outlook.com>
Date:   Wed May 1 08:16:51 2024 -0700

    Documentation/core-api: add swiotlb documentation

    There's currently no documentation for the swiotlb. Add documentation
    describing usage scenarios, the key APIs, and implementation details.
    Group the new documentation with other DMA-related documentation.

    Signed-off-by: Michael Kelley <mhklinux@outlook.com>
    Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Reviewed-by: Petr Tesarik <petr@tesarici.cz>
    Signed-off-by: Christoph Hellwig <hch@lst.de>

(cherry picked from commit c93f261dfc395b1386320b2f0b160a6f34ed9ea5)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-09-20 12:26:35 -07:00
Rafael Aquini ff90cf56a9 mm: Don't pin ZERO_PAGE in pin_user_pages()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit c8070b78751955e59b42457b974bea4a4fe00187
Author: David Howells <dhowells@redhat.com>
Date:   Fri May 26 22:41:40 2023 +0100

    mm: Don't pin ZERO_PAGE in pin_user_pages()

    Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
    to it from the page tables and make unpin_user_page*() correspondingly
    ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
    zero page's refcount as we're only allowed ~2 million pins on it -
    something that userspace can conceivably trigger.
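
The "~2 million pins" figure can be reproduced with a short calculation. This is a sketch under two assumptions that the commit message does not state: each pin adds a bias of 1024 to the page's reference count, and that count is a signed 32-bit integer.

```python
# Assumptions (not stated in the commit message): every pin raises the page
# refcount by a fixed bias of 1024, and the refcount is a signed 32-bit int.
PIN_BIAS = 1 << 10        # assumed refcount increment per pin
REFCOUNT_LIMIT = 1 << 31  # first value a signed 32-bit counter cannot hold

# Number of pins that fit before the counter would overflow.
max_pins = REFCOUNT_LIMIT // PIN_BIAS
print(max_pins)
```

Under those assumptions the limit is 2^21 pins, a little over two million, which matches the order of magnitude quoted above.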

    Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Christoph Hellwig <hch@infradead.org>
    cc: David Hildenbrand <david@redhat.com>
    cc: Lorenzo Stoakes <lstoakes@gmail.com>
    cc: Andrew Morton <akpm@linux-foundation.org>
    cc: Jens Axboe <axboe@kernel.dk>
    cc: Al Viro <viro@zeniv.linux.org.uk>
    cc: Matthew Wilcox <willy@infradead.org>
    cc: Jan Kara <jack@suse.cz>
    cc: Jeff Layton <jlayton@kernel.org>
    cc: Jason Gunthorpe <jgg@nvidia.com>
    cc: Logan Gunthorpe <logang@deltatee.com>
    cc: Hillf Danton <hdanton@sina.com>
    cc: Christian Brauner <brauner@kernel.org>
    cc: Linus Torvalds <torvalds@linux-foundation.org>
    cc: linux-fsdevel@vger.kernel.org
    cc: linux-block@vger.kernel.org
    cc: linux-kernel@vger.kernel.org
    cc: linux-mm@kvack.org
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/r/20230526214142.958751-2-dhowells@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:12 -04:00
Lucas Zampieri 804616b9d7 Merge: update drivers/base to match Linux v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3774

JIRA: https://issues.redhat.com/browse/RHEL-26183

Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Myron Stowe <mstowe@redhat.com>
Approved-by: Pingfan Liu <piliu@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-07-12 14:11:54 +00:00
Lucas Zampieri bce53f8035 Merge: CNB95: string: Allow 2-argument strscpy() and strscpy_pad()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4435

JIRA: https://issues.redhat.com/browse/RHEL-40250  
Tested: Just built

Omitted-fix: 62776e4378ae9 ("mips: boot/compressed: use __NO_FORTIFY")  
- No need to include this as MIPS arch is not supported in RHEL

Commits:
```
54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()")
311fb40aa056 ("fortify: Use SIZE_MAX instead of (size_t)-1")
fa35198f3957 ("fortify: Explicitly check bounds are compile-time constants")
9f7d69c5cd23 ("fortify: Convert to struct vs member helpers")
03699f271de1 ("string: Rewrite and add more kern-doc for the str*() functions")
62e1cbfc5d79 ("fortify: Short-circuit known-safe calls to strscpy()")
439a1bcac648 ("fortify: Use __builtin_dynamic_object_size() when available")
26dd68d293fd ("overflow: add DEFINE_FLEX() for on-stack allocs")
21a2c74b0a2a ("fortify: Use const variables for __member_size tracking")
ead62aa370a8 ("fortify: strscpy: Fix flipped q and p docstring typo")
f0a6b5831cfb ("uml: Replace strlcpy with strscpy")
b229baa374db ("kernel.h: split out COUNT_ARGS() and CONCATENATE() to args.h")
e6584c3964f2 ("string: Allow 2-argument strscpy()")
f478898e0aa7 ("string: Redefine strscpy_pad() as a macro")
8366d124ec93 ("string: Allow 2-argument strscpy_pad()")
0d043351e5ba ("ext4: fix fortify warning in fs/ext4/fast_commit.c:1551")
b2ba00c2a517 ("rxrpc: replace zero-lenth array with DECLARE_FLEX_ARRAY() helper")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-25 13:33:33 +00:00
Donald Dutile e41f7154cf module: add debug stats to help identify memory pressure
JIRA: https://issues.redhat.com/browse/RHEL-28063

Conflicts:
   Adding RHEL-only MODULE_STATS set to n for now. Possibly
   add in future -debug kernels, to be determined.
   Add rest of commit(s) to enable clean backport for further commits.

commit df3e764d8e5cd416efee29e0de3c93917dff5d33
Author: Luis Chamberlain <mcgrof@kernel.org>
Date:   Tue Mar 28 20:03:19 2023 -0700

    module: add debug stats to help identify memory pressure

    Loading modules with finit_module() can end up using vmalloc(), vmap()
    and vmalloc() again, for a total of up to 3 separate allocations in the
    worst case for a single module. We always kernel_read*() the module,
    that's a vmalloc(). Then vmap() is used for the module decompression,
    and if so the last read buffer is freed as we use the now decompressed
    module buffer to stuff data into our copy module. The last allocation is
    specific to each architecture, but it is generally a series
    of vmalloc() calls or a variation of vmalloc to handle ELF sections with
    special permissions.

    Evaluation with new stress-ng module support [1] with just 100 ops
    is proving that you can end up using GiBs of data easily even with all
    care we have in the kernel and userspace today in trying to not load modules
    which are already loaded. 100 ops seems to resemble the sort of pressure a
    system with about 400 CPUs can create on module loading. Although issues
    relating to duplicate module requests due to each CPU incurring a new
    module request are silly and some of these are being fixed, we currently lack
    proper tooling to help diagnose easily what happened, when it happened
    and who likely is to blame -- userspace or kernel module autoloading.

    Provide an initial set of stats which use debugfs to let us easily scrape
    post-boot information about failed loads. This sort of information can
    be used on production workloads to try to optimize *avoiding* redundant
    memory pressure using finit_module().

    A few examples can be provided:

    A 255 vCPU system without the next patch in this series applied:

    Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s
    graphical.target reached after 6.988s in userspace

    And 13.58 GiB of virtual memory space lost due to failed module loading:

    root@big ~ # cat /sys/kernel/debug/modules/stats
             Mods ever loaded       67
         Mods failed on kread       0
    Mods failed on decompress       0
      Mods failed on becoming       0
          Mods failed on load       1411
            Total module size       11464704
          Total mod text size       4194304
           Failed kread bytes       0
      Failed decompress bytes       0
        Failed becoming bytes       0
            Failed kmod bytes       14588526272
     Virtual mem wasted bytes       14588526272
             Average mod size       171115
        Average mod text size       62602
      Average fail load bytes       10339140
    Duplicate failed modules:
                  module-name        How-many-times                    Reason
                    kvm_intel                   249                      Load
                          kvm                   249                      Load
                    irqbypass                     8                      Load
             crct10dif_pclmul                   128                      Load
          ghash_clmulni_intel                    27                      Load
                 sha512_ssse3                    50                      Load
               sha512_generic                   200                      Load
                  aesni_intel                   249                      Load
                  crypto_simd                    41                      Load
                       cryptd                   131                      Load
                        evdev                     2                      Load
                    serio_raw                     1                      Load
                   virtio_pci                     3                      Load
                         nvme                     3                      Load
                    nvme_core                     3                      Load
        virtio_pci_legacy_dev                     3                      Load
        virtio_pci_modern_dev                     3                      Load
                       t10_pi                     3                      Load
                       virtio                     3                      Load
                 crc32_pclmul                     6                      Load
               crc64_rocksoft                     3                      Load
                 crc32c_intel                    40                      Load
                  virtio_ring                     3                      Load
                        crc64                     3                      Load

    The following screen shot, of a simple 8 vCPU, 8 GiB KVM guest with the
    next patch in this series applied, shows 226.53 MiB wasted in virtual
    memory allocations due to duplicate module requests during boot.
    It also shows an average module memory size of 167.10 KiB and an
    average module .text + .init.text size of 61.13 KiB. The end shows all
    modules which were detected as duplicate requests and whether or not
    they failed early after just the first kernel_read*() call or late after
    we've already allocated the private space for the module in
    layout_and_allocate(). A system with module decompression would reveal
    more wasted virtual memory space.

    We should put effort now into identifying the source of these duplicate
    module requests and trimming these down as much as possible. Larger systems
    will obviously show much more wasted virtual memory allocations.

    root@kmod ~ # cat /sys/kernel/debug/modules/stats
             Mods ever loaded       67
         Mods failed on kread       0
    Mods failed on decompress       0
      Mods failed on becoming       83
          Mods failed on load       16
            Total module size       11464704
          Total mod text size       4194304
           Failed kread bytes       0
      Failed decompress bytes       0
        Failed becoming bytes       228959096
            Failed kmod bytes       8578080
     Virtual mem wasted bytes       237537176
             Average mod size       171115
        Average mod text size       62602
      Avg fail becoming bytes       2758544
      Average fail load bytes       536130
    Duplicate failed modules:
                  module-name        How-many-times                    Reason
                    kvm_intel                     7                  Becoming
                          kvm                     7                  Becoming
                    irqbypass                     6           Becoming & Load
             crct10dif_pclmul                     7           Becoming & Load
          ghash_clmulni_intel                     7           Becoming & Load
                 sha512_ssse3                     6           Becoming & Load
               sha512_generic                     7           Becoming & Load
                  aesni_intel                     7                  Becoming
                  crypto_simd                     7           Becoming & Load
                       cryptd                     3           Becoming & Load
                        evdev                     1                  Becoming
                    serio_raw                     1                  Becoming
                         nvme                     3                  Becoming
                    nvme_core                     3                  Becoming
                       t10_pi                     3                  Becoming
                   virtio_pci                     3                  Becoming
                 crc32_pclmul                     6           Becoming & Load
               crc64_rocksoft                     3                  Becoming
                 crc32c_intel                     3                  Becoming
        virtio_pci_modern_dev                     2                  Becoming
        virtio_pci_legacy_dev                     1                  Becoming
                        crc64                     2                  Becoming
                       virtio                     2                  Becoming
                  virtio_ring                     2                  Becoming
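
The figures in the second listing can be cross-checked with a short calculation. This is only a sketch: the dictionary keys are hypothetical names for the fields shown above, and the rounding-up division is inferred from the printed averages rather than confirmed from the patch.

```python
def div_round_up(numerator, denominator):
    """Integer division that rounds up (assumed to match the printed averages)."""
    return -(-numerator // denominator)


# Values copied from the 8 vCPU guest listing; the key names are made up.
stats = {
    "mods_loaded": 67,
    "total_mod_size": 11464704,
    "total_text_size": 4194304,
    "failed_becoming_count": 83,
    "failed_load_count": 16,
    "failed_becoming_bytes": 228959096,
    "failed_kmod_bytes": 8578080,
}

# "Virtual mem wasted bytes" is the sum of the failed byte counters
# (failed kread and failed decompress are both 0 in this listing).
wasted_bytes = stats["failed_becoming_bytes"] + stats["failed_kmod_bytes"]
wasted_mib = wasted_bytes / 2**20

avg_mod_size = div_round_up(stats["total_mod_size"], stats["mods_loaded"])
avg_text_size = div_round_up(stats["total_text_size"], stats["mods_loaded"])
avg_fail_becoming = div_round_up(
    stats["failed_becoming_bytes"], stats["failed_becoming_count"]
)
avg_fail_load = div_round_up(stats["failed_kmod_bytes"], stats["failed_load_count"])

print(wasted_bytes, round(wasted_mib, 2))
print(avg_mod_size, avg_text_size, avg_fail_becoming, avg_fail_load)
```

The sum reproduces the "Virtual mem wasted bytes" line and the 226.53 MiB quoted in the text, and the four averages match the listing when the division rounds up.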

    [0] https://github.com/ColinIanKing/stress-ng.git
    [1] echo 0 > /proc/sys/vm/oom_dump_tasks
        ./stress-ng --module 100 --module-name xfs

    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:25 -04:00
Lucas Zampieri f6029bf351 Merge: workqueue: Backport workqueue commits to v6.9
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3910

JIRA: https://issues.redhat.com/browse/RHEL-25103    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3847    

The primary purpose of this MR is to backport the upstream workqueue
commits that enable ordered workqueues and rescuers to follow
changes in the workqueue unbound cpumask, which is necessary to make sure
that isolated CPUs won't be disturbed by unbound work items being
handled by those CPUs.

These upstream commits were merged into the v6.9 kernel which also
contains some major changes in workqueue code. This makes the required
commits dependent on some of the v6.9 workqueue commits. It is less risky
to sync the workqueue code up to v6.9 instead of selective backports
of some dependent commits. This MR also includes some miscellaneous
commits in other subsystems due to changes in the underlying workqueue
implementations.

A follow-up proactive workqueue fixes MR will be created later on,
if necessary.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Radu Rendec <rrendec@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-13 13:07:43 +00:00
Ivan Vecera 4f908e0d18 string: Rewrite and add more kern-doc for the str*() functions
JIRA: https://issues.redhat.com/browse/RHEL-40250

commit 03699f271de1f4df6369cd379506539cd7d590d3
Author: Kees Cook <keescook@chromium.org>
Date:   Fri Sep 2 14:33:44 2022 -0700

    string: Rewrite and add more kern-doc for the str*() functions

    While there were varying degrees of kern-doc for various str*()-family
    functions, many needed updating and clarification, or to just be
    entirely written. Update (and relocate) existing kern-doc and add missing
    functions, sadly shaking my head at how many times I have written "Do
    not use this function". Include the results in the core kernel API doc.

    Cc: Bagas Sanjaya <bagasdotme@gmail.com>
    Cc: Andy Shevchenko <andy@kernel.org>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: linux-hardening@vger.kernel.org
    Tested-by: Akira Yokosawa <akiyks@gmail.com>
    Link: https://lore.kernel.org/lkml/9b0cf584-01b3-3013-b800-1ef59fe82476@gmail.com
    Signed-off-by: Kees Cook <keescook@chromium.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-06-10 19:14:58 +02:00
Mark Langsdorf 013a5d02a3 crash: memory and CPU hotplug sysfs attributes
JIRA: https://issues.redhat.com/browse/RHEL-26183
Conflicts:
	Documentation/ABI/testing/sysfs-devices-system-cpu -
The crash_hotplug text was added to the end of the file
in upstream, so I did the same here.
	include/linux/kexec.h - minor context differences

commit 88a6f89944216b028d3872b0cec0f51a2f955460
Author: Eric DeVolder <eric.devolder@oracle.com>
Date: Thu, 24 Aug 2023 16:25:14 +0000

Introduce the crash_hotplug attribute for memory and CPUs for use by
userspace.  These attributes directly facilitate the udev rule for
managing userspace re-loading of the crash kernel upon hot un/plug
changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory.  For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="8000000"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}=="  (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.

For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules test if
crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU
and CONFIG_MEMORY_HOTPLUG kernel config options.  If an architecture
supports, for example, memory hotplug but not CPU hotplug, then the
/sys/devices/system/memory/crash_hotplug attribute file is present, but
the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be
present.  Thus the udev rule skips userspace processing of memory hot
un/plug events, but the udev rule will evaluate false for CPU events, thus
allowing userspace to process CPU hot un/plug events (i.e. the
unload-then-reload of the kdump capture kernel).
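
The decision the udev rules make can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the udev implementation: the function name `kernel_handles_crash_hotplug` and the `sysfs_root` parameter are hypothetical, and the only things taken from the commit message are the attribute name and the sysfs directories.

```python
from pathlib import Path


def kernel_handles_crash_hotplug(subsystem: str,
                                 sysfs_root: str = "/sys/devices/system") -> bool:
    """Return True if <sysfs_root>/<subsystem>/crash_hotplug exists and reads "1".

    Hypothetical helper: it mirrors the ATTRS{crash_hotplug}=="1" test in the
    udev rules. A missing attribute file is treated as "not supported".
    """
    attr = Path(sysfs_root) / subsystem / "crash_hotplug"
    try:
        return attr.read_text().strip() == "1"
    except OSError:
        # Attribute absent or unreadable: userspace must handle the event.
        return False


if __name__ == "__main__":
    for subsystem in ("cpu", "memory"):
        if kernel_handles_crash_hotplug(subsystem):
            print(f"{subsystem}: kernel updates kexec segments, skip reload")
        else:
            print(f"{subsystem}: userspace unload-then-reload required")
```

On a kernel with memory hotplug support but no CPU hotplug support, this sketch would report that only CPU events need the userspace reload, matching the behaviour described above.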

Link: https://lkml.kernel.org/r/20230814214446.6659-5-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2024-05-31 15:37:43 -04:00
Waiman Long deebfc6ab4 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 3bc1e711c26bff01d41ad71145ecb8dcb4412576
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 5 Feb 2024 14:19:10 -1000

    workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered

    5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
    automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
    workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
    to create ordered workqueues and the new NUMA support broke it. These
    problems can be subtle and the fact that they can only trigger on NUMA
    machines made them even more difficult to debug.

    However, overloading the UNBOUND allocation interface this way creates other
    issues. It's difficult to tell whether a given workqueue actually needs to
    be ordered and users that legitimately want a min concurrency level wq
    unexpectedly get an ordered one instead. With planned UNBOUND workqueue
    updates to improve execution locality and more prevalence of chiplet designs
    which can benefit from such improvements, this isn't a state we wanna be in
    forever.

    There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
    preceding patches audited all and converted them to
    alloc_ordered_workqueue() as appropriate. This patch removes the implicit
    promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.

    v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
        apply_workqueue_attrs_locked() which spuriously triggers WARNING and
        fails workqueue creation. Fix it.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long da3eaa2838 workqueue: Implement BH workqueues to eventually replace tasklets
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in kernel/workqueue.c due to missing
	   upstream commit 68279f9c9f59 ("treewide: mark stuff as
	   __ro_after_init").

commit 4cb1ef64609f9b0254184b2947824f4b46ccab22
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Implement BH workqueues to eventually replace tasklets

    The only generic interface to execute asynchronously in the BH context is
    tasklet; however, it's marked deprecated and has some design flaws such as
    the execution code accessing the tasklet item after the execution is
    complete which can lead to subtle use-after-free in certain usage scenarios
    and less-developed flush and cancel mechanisms.

    This patch implements BH workqueues which share the same semantics and
    features of regular workqueues but execute their work items in the softirq
    context. As there is always only one BH execution context per CPU, none of
    the concurrency management mechanisms applies and a BH workqueue can be
    thought of as a convenience wrapper around softirq.

    Except for the inability to sleep while executing and lack of max_active
    adjustments, BH workqueues and work items should behave the same as regular
    workqueues and work items.

    Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
    convert all tasklet users over to BH workqueues. Once the conversion is
    complete, tasklet can be removed and BH workqueues can directly take over
    the tasklet softirqs.

    system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
    tasklet, all existing tasklet users should be able to use the system BH
    workqueues without creating their own workqueues.

    v3: - Add missing interrupt.h include.

    v2: - Instead of using tasklets, hook directly into its softirq action
          functions - tasklet[_hi]_action(). This is slightly cheaper and closer
          to the eventual code structure we want to arrive at. Suggested by Lai.

        - Lai also pointed out several places which need NULL worker->task
          handling or can use clarification. Updated.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
    Tested-by: Allen Pais <allen.lkml@gmail.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long f71ada9990 Documentation/core-api: fix spelling mistake in workqueue
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 22160b08d88898898b2bb99282663cdf3caa5c1c
Author: attreyee-muk <tintinm2017@gmail.com>
Date:   Thu, 11 Jan 2024 00:27:47 +0530

    Documentation/core-api: fix spelling mistake in workqueue

    Correct to "following" from "followings" in the sentence "The followings
    are the read bandwidths and CPU utilizations depending on different affinity
    scope settings on ``kcryptd`` measured over five runs."

    Signed-off-by: Attreyee Mukherjee <tintinm2017@gmail.com>
    Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    Link: https://lore.kernel.org/r/20240110185746.24974-1-tintinm2017@gmail.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long fd132bf1dd Documentation/core-api : fix typo in workqueue
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 89405db5cd1ed641dfe91b4a3796bb6188ab05e3
Author: attreyee-muk <tintinm2017@gmail.com>
Date:   Sat, 23 Dec 2023 23:23:17 +0530

    Documentation/core-api : fix typo in workqueue

    Correct to “boundaries” from “bounaries”

    Signed-off-by: Attreyee Mukherjee <tintinm2017@gmail.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    Link: https://lore.kernel.org/r/20231223175316.24951-1-tintinm2017@gmail.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 66b663c9fd workqueue: doc: Fix function and sysfs path errors
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bd9e7326b8d512ee724006d4ec06dfbf3096ae9e
Author: WangJinchao <wangjinchao@xfusion.com>
Date:   Thu, 12 Oct 2023 15:17:38 +0800

    workqueue: doc: Fix function and sysfs path errors

    alloc_ordered_queue -> alloc_ordered_workqueue
    /sys/devices/virtual/WQ_NAME/
        -> /sys/devices/virtual/workqueue/WQ_NAME/

    Signed-off-by: WangJinchao <wangjinchao@xfusion.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 150f0fd88d workqueue: Make default affinity_scope dynamically updatable
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 523a301e66afd1ea9856660bcf3cee3a7c84c6dd
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Make default affinity_scope dynamically updatable

    While workqueue.default_affinity_scope is writable, it only affects
    workqueues which are created afterwards and isn't very useful. Instead,
    let's introduce explicit "default" scope and update the effective scope
    dynamically when workqueue.default_affinity_scope is changed.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 170fd51369 workqueue: Add "Affinity Scopes and Performance" section to documentation
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 7dbf15c5c05e835d488e0fee49a35b0f23452e45
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Add "Affinity Scopes and Performance" section to documentation

    With affinity scopes and their strictness setting added, unbound workqueues
    should now be able to cover a wide variety of configurations and use cases.
    Unfortunately, the performance picture is not entirely straight-forward due
    to a trade-off between efficiency and work-conservation in some situations
    necessitating manual configuration.

    This patch adds "Affinity Scopes and Performance" section to
    Documentation/core-api/workqueue.rst which illustrates the trade-off with a
    set of experiments and provides some guidelines.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 87b2fa1d0e workqueue: Implement non-strict affinity scope for unbound workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8639ecebc9b1796d7074751a350462f5e1c61cd4
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Implement non-strict affinity scope for unbound workqueues

    An unbound workqueue can be served by multiple worker_pools to improve
    locality. The segmentation is achieved by grouping CPUs into pods. By
    default, the cache boundaries according to cpus_share_cache() define how the
    CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
    system has two L3 caches. The workqueue would be mapped to two worker_pools,
    each serving one L3 cache domain.

    While this improves locality, because the pod boundaries are strict, it
    limits the total bandwidth a given issuer can consume. For example, let's
    say there is a thread pinned to a CPU issuing enough work items to saturate
    the whole machine. With the machine segmented into two pods, no matter how
    many work items it issues, it can only use half of the CPUs on the system.

    While this limitation has existed for a very long time, it wasn't very
    pronounced because the affinity grouping used to be always by NUMA nodes.
    With cache boundaries as the default and support for even finer grained
    scopes (smt and cpu), it is now a much more pressing problem.

    This patch implements non-strict affinity scope where the pod boundaries
    aren't enforced strictly. Going back to the previous example, the workqueue
    would still be mapped to two worker_pools; however, the affinity enforcement
    would be soft. The workers in both pools would have their cpus_allowed set
    to the whole machine thus allowing the scheduler to migrate them anywhere on
    the machine. However, whenever an idle worker is woken up, the workqueue
    code asks the scheduler to bring back the task within the pod if the worker
    is outside. I.e., work items start executing within their affinity scope but can
    be migrated outside as the scheduler sees fit. This removes the hard cap on
    utilization while maintaining the benefits of affinity scopes.

    After the earlier ->__pod_cpumask changes, the implementation is pretty
    simple. When non-strict, which is the new default:

    * pool_allowed_cpus() returns @pool->attrs->cpumask instead of
      ->__pod_cpumask so that the workers are allowed to run on any CPU that
      the associated workqueues allow.

    * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
      the field to a CPU within the pod.
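
The two bullet points above can be illustrated with a toy model in Python. It is a sketch under stated assumptions rather than the kernel code: `ToyPool` and `pick_wake_cpu` are hypothetical names, CPU sets stand in for cpumasks, and picking the lowest-numbered pod CPU is an arbitrary choice made for the example.

```python
from dataclasses import dataclass


@dataclass
class ToyPool:
    cpumask: frozenset      # CPUs the associated workqueues allow
    pod_cpumask: frozenset  # CPUs belonging to this pool's pod
    affn_strict: bool       # True: pod boundary is enforced strictly


def pool_allowed_cpus(pool: ToyPool) -> frozenset:
    """CPUs a worker of this pool may run on."""
    # Strict: confined to the pod. Non-strict: anywhere the workqueue allows.
    return pool.pod_cpumask if pool.affn_strict else pool.cpumask


def pick_wake_cpu(pool: ToyPool, wake_cpu: int) -> int:
    """CPU on which an idle worker should be woken (toy version)."""
    if pool.affn_strict or wake_cpu in pool.pod_cpumask:
        return wake_cpu
    # Non-strict and outside the pod: bring the worker back into the pod.
    # The choice of CPU within the pod is arbitrary in this sketch.
    return min(pool.pod_cpumask)
```

In this model a non-strict worker starts each work item inside its pod but keeps the whole-machine mask, so the scheduler remains free to migrate it afterwards.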

    This would be the first use of task_struct->wake_cpu outside scheduler
    proper, so it isn't clear whether this would be acceptable. However, other
    methods of migrating tasks are significantly more expensive and are likely
    prohibitively so if we want to do this on every work item. This needs
    discussion with scheduler folks.

    There is also a race window where setting ->wake_cpu wouldn't be effective
    as the target task is still on CPU. However, the window is pretty small and
    this being a best-effort optimization, it doesn't seem to warrant more
    complexity at the moment.

    While the non-strict cache affinity scopes seem to be the best option, the
    performance picture interacts with the affinity scope and is a bit
    complicated to fully discuss in this patch, so the behavior is made easily
    selectable through wqattrs and sysfs and the next patch will add
    documentation to discuss performance implications.

    v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 89997d1af2 workqueue: Add multiple affinity scopes and interface to select them
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 63c5484e74952f60f5810256bd69814d167b8d22
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Add multiple affinity scopes and interface to select them

    Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
    the default. The code changes to actually add the additional scopes are
    trivial.

    Also add module parameter "workqueue.default_affinity_scope" to override the
    default scope and "affinity_scope" sysfs file to configure it per workqueue.
    wq_dump.py and documentation are updated accordingly.

    This enables significant flexibility in configuring how unbound workqueues
    behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
    workqueue. On the other hand, "system" removes all locality boundaries.

    Many modern machines often have multiple L3 caches while being mostly
    uniform in terms of memory access. Thus, workqueue's previous behavior of
    spreading work items in each NUMA node had negative performance implications
    from unnecessarily crossing L3 boundaries between issue and execution.
    However, picking a finer grained affinity scope also has a downside in that
    an issuer in one group can't utilize CPUs in other groups.

    While dependent on the specifics of workload, there's usually a noticeable
    penalty in crossing L3 boundaries, so let's default to CACHE. This issue
    will be further addressed and documented with examples in future patches.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 8aa1b3540f workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 7f7dc377a3b2bbaa8cf8941587c228eab4bd82ec
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration

    Lack of visibility has always been a pain point for workqueues. While the
    recently added wq_monitor.py improved the situation, it's still difficult to
    understand what worker pools are active in the system, how workqueues map to
    them and why. The lack of visibility into how workqueues are configured is
    going to become more noticeable as workqueue improves locality awareness and
    provides more mechanisms to customize locality related behaviors.

    Now that the basic framework for more flexible locality support is in place,
    this is a good time to improve the situation. This patch adds
    tools/workqueue/wq_dump.py which prints out the topology configuration,
    worker pools and how workqueues are mapped to pools. Read the command's help
    message for more details.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long de03f9c951 workqueue: Make unbound workqueues to use per-cpu pool_workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 636b927eba5bc633753f8eb80f35e1d5be806e51
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Make unbound workqueues to use per-cpu pool_workqueues

    A pwq (pool_workqueue) represents an association between a workqueue and a
    worker_pool. When a work item is queued, the workqueue selects the pwq to
    use, which in turn determines the pool, and queues the work item to the pool
    through the pwq. pwq is also what implements the maximum concurrency limit -
    @max_active.

    As a per-cpu workqueue should be associated with a different worker_pool on
    each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
    However, unbound workqueues were sharing a pwq within each NUMA node by
    default. The sharing has several downsides:

    * Because @max_active is per-pwq, the meaning of @max_active changes
      depending on the machine configuration and whether workqueue NUMA locality
      support is enabled.

    * Makes per-cpu and unbound code deviate.

    * Gets in the way of making workqueue CPU locality awareness more flexible.

    This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
    workqueues do by making the following changes:

    * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
      just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
      workqueues.

    * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
      the specified pwq to the target CPU's wq->cpu_pwq.

    * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
      unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
      This makes the return value of wq_calc_node_cpumask() unnecessary. It now
      returns void.

    * @max_active now means the same thing for both per-cpu and unbound
      workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
      documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
      used in workqueue implementation and will be removed later.

    * All unbound pwq operations which used to be per-numa-node are now per-cpu.

    For most unbound workqueue users, this shouldn't cause noticeable changes.
    Work item issue and completion will be a small bit faster, flush_workqueue()
    would become a bit more expensive, and the total concurrency limit would
    likely become higher. All @max_active==1 use cases are currently being
    audited for conversion into alloc_ordered_workqueue() and they shouldn't be
    affected once the audit and conversion is complete.
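
Why the total concurrency limit would likely become higher can be seen from simple arithmetic. The Python sketch below assumes only what the message states, namely that @max_active is enforced per pwq; the function name and the machine sizes are made up for illustration, and the result is an upper bound, not a prediction of actual concurrency.

```python
def max_concurrency_bound(max_active: int, nr_pwqs: int) -> int:
    """Upper bound on concurrently active work items of one workqueue,
    assuming @max_active is enforced independently on each pwq."""
    return max_active * nr_pwqs


# Hypothetical machine: 2 NUMA nodes, 16 CPUs, workqueue with max_active=4.
before = max_concurrency_bound(4, nr_pwqs=2)    # one pwq per NUMA node
after = max_concurrency_bound(4, nr_pwqs=16)    # one pwq per CPU
print(before, after)
```

For an ordered workqueue all CPUs share wq->dfl_pwq, so the same formula with a single pwq leaves the bound unchanged.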

    One area where the behavior change may be more noticeable is
    workqueue_congested() as the reported congestion state is now per CPU
    instead of NUMA node. There are only two users of this interface -
    drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
    cc'd. Inputs on the behavior change would be very much appreciated.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Leon Romanovsky <leon@kernel.org>
    Cc: Karsten Graul <kgraul@linux.ibm.com>
    Cc: Wenjia Zhang <wenjia@linux.ibm.com>
    Cc: Jan Karcher <jaka@linux.ibm.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long de650632ad workqueue: Track and monitor per-workqueue CPU time usage
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8a1dd1e547c1a037692e7a6da6a76108108c72b1
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:09 -1000

    workqueue: Track and monitor per-workqueue CPU time usage

    Now that wq_worker_tick() is there, we can easily track the rough CPU time
    consumption of each workqueue by charging the whole tick whenever a tick
    hits an active workqueue. While not super accurate, it provides reasonable
    visibility into the workqueues that consume a lot of CPU cycles.
    wq_monitor.py is updated to report the per-workqueue CPU times.

    v2: wq_monitor.py was using "cputime" as the key when outputting in json
        format. Use "cpu_time" instead for consistency with other fields.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long 1665f6ac9c workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 616db8779b1e3f93075df691432cccc5ef3c3ba0
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

    If a per-cpu work item hogs the CPU, it can prevent other work items from
    starting through concurrency management. A per-cpu workqueue which intends
    to host such CPU-hogging work items can choose to not participate in
    concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
    error-prone and difficult to debug when missed.

    This patch adds an automatic CPU usage based detection. If a
    concurrency-managed work item consumes more CPU time than the threshold
    (10ms by default) continuously without intervening sleeps, wq_worker_tick()
    which is called from scheduler_tick() will detect the condition and
    automatically mark it CPU_INTENSIVE.

    The mechanism isn't foolproof:

    * Detection depends on tick hitting the work item. Getting preempted at the
      right timings may allow a violating work item to evade detection at least
      temporarily.

    * nohz_full CPUs may not be running ticks and thus can fail detection.

    * Even when detection is working, the 10ms detection delays can add up if
      many CPU-hogging work items are queued at the same time.
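
The tick-based detection and its first limitation can be modelled with a short Python sketch. This is a simplified illustration, not wq_worker_tick() itself: the function name, the per-tick event list and the tick lengths are assumptions, and only the 10ms default threshold comes from the text above.

```python
CPU_INTENSIVE_THRESH_US = 10_000  # 10ms default quoted above


def detect_cpu_intensive(events, tick_us, thresh_us=CPU_INTENSIVE_THRESH_US):
    """Return the index of the tick at which a work item would be flagged,
    or None if it never is.

    events: one entry per tick, "run" if the tick hit the work item while it
    was running, "sleep" if the work item slept in that interval.
    """
    continuous_us = 0
    for i, event in enumerate(events):
        if event == "sleep":
            # An intervening sleep resets the continuous CPU time.
            continuous_us = 0
            continue
        continuous_us += tick_us
        if continuous_us > thresh_us:
            return i
    return None
```

With a 4ms tick the item is flagged on the third consecutive "run" tick, while a work item that sleeps every few ticks is never flagged, which is the intended behaviour for well-behaved per-cpu work items.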

    However, in the vast majority of cases, this should be able to detect violations
    reliably and provide reasonable protection with a small increase in code
    complexity.

    If some work items trigger this condition repeatedly, the bigger problem
    likely is the CPU being saturated with such per-cpu work items and the
    solution would be making them UNBOUND. The next patch will add a debug
    mechanism to help spot such cases.

    v4: Documentation for workqueue.cpu_intensive_thresh_us added to
        kernel-parameters.txt.

    v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
        suggested by Peter.

    v2: Lai pointed out that wq_worker_stopping() also needs to be called from
        preemption and rtlock paths and an earlier patch was updated
        accordingly. This patch adds a comment describing the risk of infinite
        recursions and how they're avoided.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long 20a387c381 workqueue: Add pwq->stats[] and a monitoring script
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 725e8ec59c56c65fb92e343c10a8842cd0d4f194
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Add pwq->stats[] and a monitoring script

    Currently, the only way to peer into workqueue operations is through
    tracing. While possible, it isn't easy or convenient to monitor
    per-workqueue behaviors over time this way. Let's add pwq->stats[] that
    track relevant events and a drgn monitoring script -
    tools/workqueue/wq_monitor.py.

    It's arguable whether this needs to be configurable. However, it currently
    only has several counters and the runtime overhead shouldn't be noticeable
    given that they're on pwq's which are per-cpu on per-cpu workqueues and
    per-numa-node on unbound ones. Let's keep it simple for the time being.

    v2: Patch reordered to earlier with fewer fields. Fields will be added back
        gradually. Help message improved.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long b47ae598c5 workqueue: doc: Call out the non-reentrance conditions
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit f9eaaa82b474350aa8436d15a7ae150a3c8b9d5c
Author: Boqun Feng <boqun.feng@gmail.com>
Date:   Fri, 22 Oct 2021 08:42:08 +0800

    workqueue: doc: Call out the non-reentrance conditions

    The current doc of workqueue API suggests that work items are
    non-reentrant: any work item is guaranteed to be executed by at most one
    worker system-wide at any given time. However, this is not true: the
    following case can cause a work item W to be executed by two workers at
    the same time:

            queue_work_on(0, WQ1, W);
            // after a worker picks up W and clears the pending bit
            queue_work_on(1, WQ2, W);
            // workers on CPU0 and CPU1 will execute W at the same time.

    , which means the non-reentrance of a work item is conditional, and
    Lai Jiangshan provided a nice summary[1] of the conditions, therefore
    use it to describe a work item instance and improve the doc.

    [1]: https://lore.kernel.org/lkml/CAJhGHyDudet_xyNk=8xnuO2==o-u06s0E0GZVP4Q67nmQ84Ceg@mail.gmail.com/

    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Suggested-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:21 -04:00
Chris von Recklinghausen 3b26b723b6 mm/slab: document kfree() as allowed for kmem_cache_alloc() objects
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit ae65a5211d90e54ae604012ce9cf234c48780929
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Mar 2 16:01:00 2023 +0100

    mm/slab: document kfree() as allowed for kmem_cache_alloc() objects

    This will make it easier to free objects in situations when they can
    come from either kmalloc() or kmem_cache_alloc(), and also allow
    kfree_rcu() for freeing objects from kmem_cache_alloc().

    For the SLAB and SLUB allocators this was always possible so with SLOB
    gone, we can document it as supported.

    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:18 -04:00
Chris von Recklinghausen ea5a8c928e mm, printk: introduce new format %pGt for page_type
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4c85c0be3d7a9a7ffe48bfe0954eacc0ba9d3c75
Author: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Date:   Mon Jan 30 13:25:13 2023 +0900

    mm, printk: introduce new format %pGt for page_type

    The %pGp format is used to display the 'flags' field of a struct page.
    However, some page flags (e.g. PG_buddy, see page-flags.h for more
    details) are stored in the page_type field.  To display human-readable
    output of page_type, introduce the %pGt format.

    It is important to note that the meaning of bits is different in
    page_type.  If page_type is 0xffffffff, no flags are set.  Setting the
    PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f.  Clearing
    a bit actually means setting a flag.  Bits in page_type are inverted when
    displaying type names.

    Only values for which page_type_has_type() returns true are considered as
    page_type, to avoid confusion with mapcount values.  If it returns false,
    only raw values are displayed and not page type names.
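
The inverted encoding described in this commit message can be checked with a few lines of userspace C. This is only a sketch, not the kernel's implementation: the helper names (page_type_set, page_type_test, page_type_display_bits, page_type_selfcheck) and macro names are hypothetical, and the only values taken from the commit message are 0xffffffff, 0x00000080 and 0xffffff7f.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the commit message. */
#define PAGE_TYPE_NO_FLAGS 0xffffffffu /* page_type with no flags set */
#define PG_BUDDY_BIT       0x00000080u /* PG_buddy flag value */

/* Hypothetical helper: setting a flag clears its bit in page_type. */
static uint32_t page_type_set(uint32_t page_type, uint32_t flag)
{
	return page_type & ~flag;
}

/* Hypothetical helper: a flag counts as set when its bit is clear. */
static int page_type_test(uint32_t page_type, uint32_t flag)
{
	return (page_type & flag) == 0;
}

/* Hypothetical helper: invert the bits so that set flags read as 1s. */
static uint32_t page_type_display_bits(uint32_t page_type)
{
	return ~page_type;
}

/* Reproduce the numbers from the commit message. */
static void page_type_selfcheck(void)
{
	uint32_t t = page_type_set(PAGE_TYPE_NO_FLAGS, PG_BUDDY_BIT);

	assert(t == 0xffffff7fu);
	assert(page_type_test(t, PG_BUDDY_BIT));
	assert(page_type_display_bits(t) == PG_BUDDY_BIT);
}
```

Calling page_type_selfcheck() walks through the example from the text: starting from 0xffffffff, setting PG_buddy yields 0xffffff7f, and inverting that value leaves only the PG_buddy bit.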

    Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com
    Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Reviewed-by: Petr Mladek <pmladek@suse.com>     [vsprintf part]
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Joe Perches <joe@perches.com>
    Cc: John Ogness <john.ogness@linutronix.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:05 -04:00
Ivan Vecera 479b9ec9d0 doc/netlink: Update genetlink-legacy documentation
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 294f37fc87728230cf81f387aa6a0a8438066751
Author: Donald Hunter <donald.hunter@gmail.com>
Date:   Fri Aug 25 13:27:46 2023 +0100

    doc/netlink: Update genetlink-legacy documentation

    Add documentation for recently added genetlink-legacy schema attributes.
    Remove statements about 'work in progress' and 'todo'.

    Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
    Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
    Link: https://lore.kernel.org/r/20230825122756.7603-4-donald.hunter@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:32 +02:00
Ivan Vecera 12a80d14ce docs: add more netlink docs (incl. spec docs)
JIRA: https://issues.redhat.com/browse/RHEL-30344

Conflicts:
- small context conflict in MAINTAINERS

commit 9d6a65079c98f55fa2249c50e517d133d137c251
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Fri Jan 20 09:50:34 2023 -0800

    docs: add more netlink docs (incl. spec docs)

    Add documentation about the upcoming Netlink protocol specs.

    Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-02 11:15:38 +02:00
Nico Pache 3fd58a2958 selftests/vm: rename selftests/vm to selftests/mm
commit baa489fabd01596d5426d6e112b34ba5fb59ab82
Author: SeongJae Park <sj@kernel.org>
Date:   Tue Jan 3 18:07:53 2023 +0000

    selftests/vm: rename selftests/vm to selftests/mm

    Rename selftests/vm to selftests/mm to be more consistent with the
    code, documentation, and tools directories, and to avoid confusion with
    virtual machines.

    [sj@kernel.org: convert missing vm->mm changes]
      Link: https://lkml.kernel.org/r/20230107230643.252273-1-sj@kernel.org
    Link: https://lkml.kernel.org/r/20230103180754.129637-5-sj@kernel.org
    Signed-off-by: SeongJae Park <sj@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5617
Signed-off-by: Nico Pache <npache@redhat.com>
2024-01-19 10:06:45 -07:00
Jaroslav Kysela 0ad76c02a7 Documentation: core-api: Drop :export: for int_log.h
JIRA: https://issues.redhat.com/browse/RHEL-13724

commit 5bdeb6f5c7b94f653183f08eca8a08022b2efac6
Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Tue Jul 25 13:49:56 2023 +0300

    Documentation: core-api: Drop :export: for int_log.h

    The :export: keyword makes sense only for C-files, where EXPORT_SYMBOL()
    might appear. Otherwise kernel-doc may not produce anything out of this
    file.

    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Fixes: f97fa3dcb2db ("lib/math: Move dvb_math.c into lib/math/int_log.c")
    Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
    Acked-by: Jonathan Corbet <corbet@lwn.net>
    Link: https://lore.kernel.org/r/20230725104956.47806-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Mark Brown <broonie@kernel.org>

Signed-off-by: Jaroslav Kysela <jkysela@redhat.com>
2023-12-18 16:30:55 +01:00
Jaroslav Kysela 09e210346e lib/math: Move dvb_math.c into lib/math/int_log.c
JIRA: https://issues.redhat.com/browse/RHEL-13724

commit f97fa3dcb2db02013e6904c032a1d2d45707ee40
Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Mon Jul 3 16:52:08 2023 +0300

    lib/math: Move dvb_math.c into lib/math/int_log.c

    Some existing and new users may benefit from the intlog2() and
    intlog10() APIs; make them widely available.

    Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
    Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org>
    Link: https://lore.kernel.org/r/20230619172019.21457-2-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
    Link: https://lore.kernel.org/r/20230703135211.87416-2-andriy.shevchenko@linux.intel.com
    Signed-off-by: Mark Brown <broonie@kernel.org>

Signed-off-by: Jaroslav Kysela <jkysela@redhat.com>
2023-12-18 16:30:34 +01:00
David Arcari 4f7156ed12 x86/topology: Remove CPU0 hotplug option
JIRA: https://issues.redhat.com/browse/RHEL-15512

commit e59e74dc48a309cb848ffc3d76a0d61aa6803c05
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri May 12 23:07:04 2023 +0200

    x86/topology: Remove CPU0 hotplug option

    This was introduced together with commit e1c467e690 ("x86, hotplug: Wake
    up CPU0 via NMI instead of INIT, SIPI, SIPI") to eventually support
    physical hotplug of CPU0:

     "We'll change this code in the future to wake up hard offlined CPU0 if
      real platform and request are available."

    11 years later this has not happened and physical hotplug is not officially
    supported. Remove the cruft.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Michael Kelley <mikelley@microsoft.com>
    Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Tested-by: Helge Deller <deller@gmx.de> # parisc
    Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
    Link: https://lore.kernel.org/r/20230512205255.715707999@linutronix.de

Signed-off-by: David Arcari <darcari@redhat.com>
2023-12-05 11:56:51 -05:00
Chris von Recklinghausen fe5f50def7 mm: remove folio_pincount_ptr() and head_compound_pincount()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 94688e8eb453e616098cb930e5f6fed4a6ea2dfa
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:47 2023 +0000

    mm: remove folio_pincount_ptr() and head_compound_pincount()

    We can use folio->_pincount directly, since all users are guarded by tests
    of compound/large.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen 35cd91f122 mm/page_alloc: remove obsolete gfpflags_normal_context()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit def76fd549c513bb90278a8d6d0fe3ef3faa20a7
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:56 2022 +0800

    mm/page_alloc: remove obsolete gfpflags_normal_context()

    Since commit dacb5d8875cc ("tcp: fix page frag corruption on page fault"),
    there's no caller of gfpflags_normal_context().  Remove it, as this helper
    is strictly tied to the sk page frag usage and there won't be other users
    in the future.

    [linmiaohe@huawei.com: fix htmldocs]
      Link: https://lkml.kernel.org/r/1bc55727-9b66-0e9e-c306-f10c4716ea89@huawei.com
    Link: https://lkml.kernel.org/r/20220916072257.9639-16-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:47 -04:00
Chris von Recklinghausen 914f4f8c1f headers/deps: mm: align MANITAINERS and Docs with new gfp.h structure
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7343f2b0db4961d9f386e685e651c663dc763d0c
Author: Yury Norov <yury.norov@gmail.com>
Date:   Wed Jul 6 08:52:24 2022 -0700

    headers/deps: mm: align MANITAINERS and Docs with new gfp.h structure

    After moving gfp types out of gfp.h, we have to align MAINTAINERS
    and Docs, to avoid warnings like this:

    >> include/linux/gfp.h:1: warning: 'Page mobility and placement hints' not found
    >> include/linux/gfp.h:1: warning: 'Watermark modifiers' not found
    >> include/linux/gfp.h:1: warning: 'Reclaim modifiers' not found
    >> include/linux/gfp.h:1: warning: 'Useful GFP flag combinations' not found

    Signed-off-by: Yury Norov <yury.norov@gmail.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:56 -04:00
Jan Stancek d50c1077fc Merge: MM changes for RHEL 9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2069

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160210

Patches 0001-0037: Generic preparation and code dependencies, mostly moving sysctls definitions to their related compilation units;
Patches 0038-0058: Change sets targeting ARM;
Patches 0059-0090: Change sets targeting PPC;
Patches 0091-0144: Change sets targeting s390;
Patches 0145-0154: Change sets targeting x86_64;
Patches 0155-0915: Change sets for portable changes;

Not necessary:
Omitted-fix: b83699ea1e62 ("LoongArch: mm: Avoid unnecessary page fault retires on shared memory types")
Omitted-fix: 6f5c672d17f5 ("s390/smp: enforce lowcore protection on CPU restart")
Omitted-fix: 953503751a42 ("Revert "s390/smp: enforce lowcore protection on CPU restart"")

To be picked up later (presumably under bz2168372):
Omitted-fix: 4f4c549feb4e ("arm64: mte: Avoid the racy walk of the vma list during core dump")
Omitted-fix: ae3a2a218821 ("powerpc/ftrace: Remove redundant create_branch() calls")
Omitted-fix: af320fb7ddb0 ("selftests/bpf: Fix s390x vmlinux path")
Omitted-fix: 16fd6b31dd9b ("Revert "mm: migration: fix the FOLL_GET failure on following huge page"")

Omitted-fix: 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings") Change already in CS9

Omitted-fix: 4302abc628fc ("powerpc/64s: Prevent fallthrough to hash TLB flush when using radix")
Omitted-fix: badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")

Omitted-fix: 36d4b36b6959 ("lib/nodemask: inline next_node_in() and node_random()")

Omitted-fix: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case")

Omitted-fix: dea18da45992 ("powerpc/64s: Fix stress_hpt memblock alloc alignment")

Omitted-fix: 2e8cff0a0eee ("arm64: fix rodata=full")
Omitted-fix: 2081b3bd0c11 ("arm64: fix rodata=full again")
Omitted-fix: 736eedc974ea ("arm64: mte: Fix double-freeing of the temporary tag storage during coredump")
Omitted-fix: 19e183b54528 ("elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}")
Omitted-fix: 61d2d1808b20 ("arm64: mm: don't acquire mutex when rewriting swapper")
Omitted-fix: a3d3163fbe69 ("x86/mm/32: Fix W^X detection when page tables do not support NX")
Omitted-fix: a970174d7a10 ("x86/mm: Do not verify W^X at boot up")
Omitted-fix: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON")
Omitted-fix: 63cf584203f3 ("mm: teach mincore_hugetlb about pte markers")

Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=51556952
KT1+mm regression: https://beaker.engineering.redhat.com/jobs/7663545

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-03-29 12:09:11 +02:00
Jan Stancek 7854c2726c Merge: bpf, xdp: update to 6.1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2067

bpf, xdp: update to 6.1

Bugzilla: https://bugzilla.redhat.com/2166911

Signed-off-by: Artem Savkov <asavkov@redhat.com>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Viktor Malik <vmalik@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-03-27 07:05:16 +02:00
Chris von Recklinghausen 23ba6a23ee arch/*/: remove CONFIG_VIRT_TO_BUS
Conflicts: drop changes to arch/alpha/include/asm/io.h - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit 4313a24985f00340eeb591fd66aa2b257b9e0a69
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Mon May 23 21:59:02 2022 +0200

    arch/*/: remove CONFIG_VIRT_TO_BUS

    All architecture-independent users of virt_to_bus() and bus_to_virt()
    have been fixed to use the dma mapping interfaces or have been
    removed now.  This means the definitions on most architectures, and the
    CONFIG_VIRT_TO_BUS symbol are now obsolete and can be removed.

    The only exceptions to this are a few network and scsi drivers for m68k
    Amiga and VME machines and ppc32 Macintosh. These drivers work correctly
    with the old interfaces and are probably not worth changing.

    On alpha and parisc, virt_to_bus() was still used in asm/floppy.h.
    alpha can use isa_virt_to_bus() like x86 does, and parisc can just
    open-code the virt_to_phys() here, as this is architecture-specific
    code.

    I tried updating the bus-virt-phys-mapping.rst documentation, which
    started as an email from Linus to explain some details of the Linux-2.0
    driver interfaces. The bits about virt_to_bus() were declared obsolete
    back in 2000, and the rest is not all that relevant any more, so in the
    end I just decided to remove the file completely.

    Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
    Acked-by: Helge Deller <deller@gmx.de> # parisc
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen 267a7a9b62 docs: rename Documentation/vm to Documentation/mm
Conflicts: drop changes to arch/loongarch/Kconfig - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit ee65728e103bb7dd99d8604bf6c7aa89c7d7e446
Author: Mike Rapoport <rppt@kernel.org>
Date:   Mon Jun 27 09:00:26 2022 +0300

    docs: rename Documentation/vm to Documentation/mm

    Rename it so it will be consistent with the code's mm directory and with
    Documentation/admin-guide/mm and won't be confused with virtual machines.

    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Tested-by: Ira Weiny <ira.weiny@intel.com>
    Acked-by: Jonathan Corbet <corbet@lwn.net>
    Acked-by: Wu XiangCheng <bobwxc@email.cn>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Nico Pache d0b81c5b5a Maple Tree: add new data structure
Conflicts:
       lib/Makefile: slight makefile conflict

commit 54a611b605901c7d5d05b6b8f5d04a6ceb0962aa
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:39 2022 +0000

    Maple Tree: add new data structure

    Patch series "Introducing the Maple Tree"

    The maple tree is an RCU-safe range based B-tree designed to use modern
    processor cache efficiently.  There are a number of places in the kernel
    that a non-overlapping range-based tree would be beneficial, especially
    one with a simple interface.  If you use an rbtree with other data
    structures to improve performance or an interval tree to track
    non-overlapping ranges, then this is for you.

    The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
    nodes.  With the increased branching factor, it is significantly shorter
    than the rbtree so it has fewer cache misses.  The removal of the linked
    list between subsequent entries also reduces the cache misses and the need
    to pull in the previous and next VMA during many tree alterations.
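
The effect of the larger branching factor on tree height can be estimated with some integer arithmetic. The sketch below is not taken from the patch: it assumes every node is completely full (real trees are not), and the function names min_levels and min_binary_levels are hypothetical. Only the fan-outs of 16 (leaf) and 10 (non-leaf) come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fewest levels of a tree whose leaves hold leaf_fanout entries each and
 * whose internal nodes have inner_fanout children each, assuming every
 * node is full.
 */
static unsigned int min_levels(uint64_t entries, unsigned int leaf_fanout,
			       unsigned int inner_fanout)
{
	uint64_t nodes;
	unsigned int levels = 1;

	assert(entries > 0 && leaf_fanout >= 1 && inner_fanout >= 2);

	/* Number of leaves needed to hold all entries. */
	nodes = (entries + leaf_fanout - 1) / leaf_fanout;

	/* Add one level of internal nodes until a single root remains. */
	while (nodes > 1) {
		nodes = (nodes + inner_fanout - 1) / inner_fanout;
		levels++;
	}
	return levels;
}

/*
 * Fewest levels of a binary tree storing one entry per node, i.e.
 * ceil(log2(entries + 1)), computed as the bit length of entries.
 */
static unsigned int min_binary_levels(uint64_t entries)
{
	unsigned int levels = 0;

	while (entries > 0) {
		entries >>= 1;
		levels++;
	}
	return levels;
}
```

Under these assumptions, min_levels(1000, 16, 10) gives 3 levels, while min_binary_levels(1000) gives 10; a red-black tree may be up to about twice as tall as that binary minimum.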

    The first user that is covered in this patch set is the vm_area_struct,
    where three data structures are replaced by the maple tree: the augmented
    rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
    long term goal is to reduce or remove the mmap_lock contention.

    The plan is to get to the point where we use the maple tree in RCU mode.
    Readers will not block for writers.  A single write operation will be
    allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
    would be RCU enabled and this mode would be entered once multiple tasks
    are using the mm_struct.

    Davidlohr said

    : Yes I like the maple tree, and at this stage I don't think we can ask for
    : more from this series wrt the MM - albeit there seems to still be some
    : folks reporting breakage.  Fundamentally I see Liam's work to (re)move
    : complexity out of the MM (not to say that the actual maple tree is not
    : complex) by consolidating the three complementary data structures very
    : much worth it considering performance does not take a hit.  This was very
    : much a turn off with the range locking approach, which worst case scenario
    : incurred prohibitive overhead.  Also as Liam and Matthew have
    : mentioned, RCU opens up a lot of nice performance opportunities, and in
    : addition academia[1] has shown outstanding scalability of address spaces
    : with the foundation of replacing the locked rbtree with RCU aware trees.

    A similar work has been discovered in the academic press

            https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

    Sheer coincidence.  We designed our tree with the intention of solving the
    hardest problem first.  Upon settling on a b-tree variant and a rough
    outline, we researched ranged based b-trees and RCU b-trees and did find
    that article.  So it was nice to find reassurances that we were on the
    right path, but our design choice of using ranges made that paper unusable
    for us.

    This patch (of 70):

    The maple tree is an RCU-safe range based B-tree designed to use modern
    processor cache efficiently.  There are a number of places in the kernel
    that a non-overlapping range-based tree would be beneficial, especially
    one with a simple interface.  If you use an rbtree with other data
    structures to improve performance or an interval tree to track
    non-overlapping ranges, then this is for you.

    The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
    nodes.  With the increased branching factor, it is significantly shorter
    than the rbtree so it has fewer cache misses.  The removal of the linked
    list between subsequent entries also reduces the cache misses and the need
    to pull in the previous and next VMA during many tree alterations.

    The first user that is covered in this patch set is the vm_area_struct,
    where three data structures are replaced by the maple tree: the augmented
    rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
    long term goal is to reduce or remove the mmap_lock contention.

    The plan is to get to the point where we use the maple tree in RCU mode.
    Readers will not block for writers.  A single write operation will be
    allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
    would be RCU enabled and this mode would be entered once multiple tasks
    are using the mm_struct.

    There are additional BUG_ON() calls added within the tree, most of which
    are in debug code.  These will be replaced with WARN_ON() calls in the
    future.  There are also additional BUG_ON() calls within the code, which
    will also be reduced in number at a later date.  These exist to catch
    things such as out-of-range accesses, which would crash anyway.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Sven Schnelle <svens@linux.ibm.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2166668
Signed-off-by: Nico Pache <npache@redhat.com>
2023-03-07 01:23:54 -07:00
Artem Savkov af356f677f timekeeping: Introduce fast accessor to clock tai
Bugzilla: https://bugzilla.redhat.com/2166911

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 3dc6ffae2da201284cb24af66af77ee0bbb2efaa
Author: Kurt Kanzenbach <kurt@linutronix.de>
Date:   Thu Apr 14 11:18:03 2022 +0200

    timekeeping: Introduce fast accessor to clock tai

    Introduce a fast/NMI-safe accessor to clock TAI for tracing. The Linux
    kernel tracing infrastructure has support for using different clocks to
    generate timestamps for trace events. Especially in TSN networks it is
    useful to have TAI as the trace clock, because application scheduling is
    done in accordance with the network time, which is based on TAI. With a
    tai trace_clock in place, it becomes very convenient to correlate network
    activity with Linux kernel application traces.

    Use the same implementation as ktime_get_boot_fast_ns() does by reading the
    monotonic time and adding the TAI offset. The same limitations as for the fast
    boot implementation apply. The TAI offset may change at run time, e.g. by
    setting the time or using adjtimex() with an offset. However, these kinds of
    offset changes are rare events. Nevertheless, the user has to be aware of them
    and deal with them in post-processing.
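
The approach described above, reading the monotonic time and adding the TAI offset, can be modelled in userspace as follows. This is a simplified sketch and not the kernel code: struct model_timekeeper, its fields and model_tai_fast_ns() are hypothetical names, and the real accessor also has to cope with concurrent updates, which the model ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the timekeeper state; not the kernel's struct. */
struct model_timekeeper {
	uint64_t mono_ns;     /* monotonic clock reading, in nanoseconds */
	int64_t  tai_offs_ns; /* offset from monotonic to TAI, in nanoseconds */
};

/* TAI reading = monotonic reading + current TAI offset. */
static uint64_t model_tai_fast_ns(const struct model_timekeeper *tk)
{
	assert(tk != NULL);
	return tk->mono_ns + (uint64_t)tk->tai_offs_ns;
}
```

If tai_offs_ns changes between two calls, the returned value jumps by the same amount even when mono_ns has not advanced. That is the rare discontinuity the commit message says has to be handled in post-processing.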

    An alternative approach would be to use the same implementation as
    ktime_get_real_fast_ns() does. However, this requires adding an additional u64
    member to the tk_read_base struct. This struct, together with a seqcount, is
    designed to fit into a single cache line on 64-bit architectures. Adding a new
    member would violate this constraint.

    Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Link: https://lore.kernel.org/r/20220414091805.89667-2-kurt@linutronix.de

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:28 +01:00