Commit Graph

1243 Commits

Author SHA1 Message Date
Alex Williamson afe3cf413a mm: Provide address mask in struct follow_pfnmap_args
JIRA: https://issues.redhat.com/browse/RHEL-85593

commit 62fb8adc43afad5fa1c9cadc6f3a8e9fb72af194
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Tue Feb 18 15:22:05 2025 -0700

    mm: Provide address mask in struct follow_pfnmap_args

    follow_pfnmap_start() walks the page table for a given address and
    fills out the struct follow_pfnmap_args in pfnmap_args_setup().
    The address mask of the page table level is already provided to this
    latter function for calculating the pfn.  This address mask can also
    be useful for the caller to determine the extent of the contiguous
    mapping.

    For example, vfio-pci now supports huge_fault for pfnmaps and is able
    to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
    PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
    for a contiguous pfn range.  Providing the mapping address mask allows
    us to skip the extent of the mapping level.  Assuming a 1GB pud level
    and 4KB page size, iterations are reduced by a factor of 256K.  In wall
    clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.

    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: linux-mm@kvack.org
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
    Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Link: https://lore.kernel.org/r/20250218222209.1382449-6-alex.williamson@redhat.com
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-04-08 12:33:52 -06:00
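
A minimal kernel-style sketch of the use case the commit above describes, assuming the follow_pfnmap_start()/follow_pfnmap_end() interface and the struct follow_pfnmap_args fields named in the message (vma, address, pfn, and the new addr_mask), with the caller holding the mmap read lock; the helper itself is hypothetical, not the vfio-pci code:

#include <linux/mm.h>

/*
 * Walk [addr, end) of a pfn mapping and verify the pfns are contiguous,
 * advancing by a whole pte/pmd/pud block per iteration instead of one
 * PAGE_SIZE step.
 */
static int check_pfnmap_contiguous(struct vm_area_struct *vma,
				   unsigned long addr, unsigned long end)
{
	struct follow_pfnmap_args args = { .vma = vma };
	unsigned long next, pfn, mask, expected_pfn = 0;
	int ret;

	for (; addr < end; addr = next) {
		args.address = addr;
		ret = follow_pfnmap_start(&args);
		if (ret)
			return ret;
		pfn = args.pfn;
		mask = args.addr_mask;	/* new output added by this commit */
		follow_pfnmap_end(&args);

		if (expected_pfn && pfn != expected_pfn)
			return -EFAULT;		/* not physically contiguous */

		/* Jump straight to the start of the next mapping-level block. */
		next = (addr & mask) + (~mask + 1);
		expected_pfn = pfn + ((next - addr) >> PAGE_SHIFT);
	}
	return 0;
}

With a 1GB pud mapping and 4KB pages the loop body runs once per gigabyte instead of 262,144 times, which is where the factor-of-256K reduction quoted above comes from.
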
Donald Dutile 50863d4ebe mm: remove follow_pte()
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit b0a1c0d0edcd75a0f8ec5fd19dbd64b8d097f534
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Aug 26 16:43:50 2024 -0400

    mm: remove follow_pte()

    follow_pte() users have been converted to follow_pfnmap*().  Remove the
    API.

    Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:46 -04:00
Donald Dutile 695754d739 mm: follow_pte() improvements
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit c5541ba378e3d36ea88bf5839d5b23e33e7d1627
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Apr 10 17:55:27 2024 +0200

    mm: follow_pte() improvements

    follow_pte() is now our main function to look up PTEs in VM_PFNMAP/VM_IO
    VMAs.  Let's perform some more sanity checks to make this exported
    function harder to abuse.

    Further, extend the doc a bit, it still focuses on the KVM use case with
    MMU notifiers.  Drop the KVM+follow_pfn() comment, follow_pfn() is no
    more, and we have other users nowadays.

    Also extend the doc regarding refcounted pages and the interaction with
    MMU notifiers.

    KVM is one example that uses MMU notifiers and can deal with refcounted
    pages properly.  VFIO is one example that doesn't use MMU notifiers, and
    to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) &&
    !PageReserved(pfn_to_page(pfn)).  Protection changes are less of a concern
    for users like VFIO: the behavior is similar to longterm-pinning a page,
    and getting the PTE protection changed afterwards.

    The primary concern with refcounted pages is use-after-free, which callers
    should be aware of.

    Link: https://lkml.kernel.org/r/20240410155527.474777-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Fei Li <fei1.li@intel.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Yonghua Huang <yonghua.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:46 -04:00
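
The VFIO-style check quoted in the commit message, written out as a tiny hedged helper (the wrapper name is illustrative; the expression itself is taken verbatim from the text above):

#include <linux/mm.h>

/* True if a pfn is backed by a refcounted "struct page" (reject it when no
 * MMU notifier protects against use-after-free), as opposed to a
 * reserved/special pfn such as PCI MMIO (accept it). */
static bool pfn_is_refcounted(unsigned long pfn)
{
	return pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn));
}
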
Donald Dutile 71ea9e5fa0 mm/access_process_vm: use the new follow_pfnmap API
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit b17269a51cc7f046a6f2cf9a6c314a0de885e5a5
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Aug 26 16:43:49 2024 -0400

    mm/access_process_vm: use the new follow_pfnmap API

    Use the new API that can understand huge pfn mappings.

    Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:46 -04:00
Donald Dutile ea381e4a2e mm: pass VMA instead of MM to follow_pte()
Conflicts: Drop acrn hunk since not supported in RHEL9.

JIRA: https://issues.redhat.com/browse/RHEL-73613

commit 29ae7d96d166fa08c7232daf8a314ef5ba1efd20
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Apr 10 17:55:26 2024 +0200

    mm: pass VMA instead of MM to follow_pte()

    ... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll
    now also perform these sanity checks for direct follow_pte()
    invocations.

    For generic_access_phys(), we might now check multiple times: nothing to
    worry about, really.

    Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Sean Christopherson <seanjc@google.com>       [KVM]
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Fei Li <fei1.li@intel.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Yonghua Huang <yonghua.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:45 -04:00
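
A hedged sketch of a caller after this change, assuming the post-change signature follow_pte(vma, address, &ptep, &ptl) implied above; the helper and its use of ptep_get()/pte_write() are illustrative, not kernel code:

#include <linux/mm.h>

/* Report whether the PTE mapping addr in a VM_IO/VM_PFNMAP vma is writable. */
static int pte_is_writable(struct vm_area_struct *vma, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *ptep;
	int ret;

	/* Previously: follow_pte(vma->vm_mm, addr, &ptep, &ptl) */
	ret = follow_pte(vma, addr, &ptep, &ptl);
	if (ret)
		return ret;	/* e.g. a non-VM_IO/VM_PFNMAP vma is now rejected here */
	ret = pte_write(ptep_get(ptep)) ? 1 : 0;
	pte_unmap_unlock(ptep, ptl);
	return ret;
}
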
Donald Dutile d8240a9acc mm: move follow_phys to arch/x86/mm/pat/memtype.c
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit 5b34b76cb0cd8a21dee5c7677eae98480b0d05cc
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Mar 25 07:45:42 2024 +0800

    mm: move follow_phys to arch/x86/mm/pat/memtype.c

    follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c.
    Move it there and hardcode the two arguments that get the same values
    passed by both callers.

    [david@redhat.com: conflict resolutions]
    Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com
    Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Fei Li <fei1.li@intel.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:45 -04:00
Donald Dutile 1bcf287508 mm: fix follow_pfnmap API lockdep assert
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit b1b46751671be5a426982f037a47ae05f37ff80b
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Oct 18 09:50:05 2024 -0700

    mm: fix follow_pfnmap API lockdep assert

    The lockdep asserts for the new follow_pfnmap() API "know" that a
    pfnmap always has a vma->vm_file, since that's the only way to create
    such a mapping.

    And that's actually true for all the normal cases.  But not for the mmap
    failure case, where the incomplete mapping is torn down and we have
    cleared vma->vm_file because the failure occurred before the file was
    linked to the vma.

    So this codepath does actually need to check for vm_file being NULL.

    Reported-by: Jann Horn <jannh@google.com>
    Fixes: 6da8e9634bb7 ("mm: new follow_pfnmap API")
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:45 -04:00
Donald Dutile 60750497ff mm: new follow_pfnmap API
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit 6da8e9634bb7e3fdad9ae0e4db873a05036c4343
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Aug 26 16:43:43 2024 -0400

    mm: new follow_pfnmap API

    Introduce a pair of APIs to follow pfn mappings to get entry information.
    It's very similar to what follow_pte() did before, but different in that
    it recognizes huge pfn mappings.

    Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:45 -04:00
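
A hedged sketch of the start/end usage pattern, with struct field names (vma, address, pfn, writable) taken from the API description above; the wrapper itself is hypothetical:

#include <linux/mm.h>

/* Look up the pfn and writability backing one address of a VM_PFNMAP vma.
 * Unlike follow_pte(), the same call works whether the address is covered
 * by a pte, pmd or pud pfn mapping. */
static int lookup_pfnmap(struct vm_area_struct *vma, unsigned long addr,
			 unsigned long *pfn, bool *writable)
{
	struct follow_pfnmap_args args = {
		.vma = vma,
		.address = addr,
	};
	int ret = follow_pfnmap_start(&args);

	if (ret)
		return ret;
	*pfn = args.pfn;
	*writable = args.writable;
	/* The page table lock is held here: no sleeping before _end(). */
	follow_pfnmap_end(&args);
	return 0;
}
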
Donald Dutile 474991c0e4 mm: remove follow_pfn
JIRA: https://issues.redhat.com/browse/RHEL-73613

commit cb10c28ac82c9b7a5e9b3b1dc7157036c20c36dd
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Mar 25 07:45:41 2024 +0800

    mm: remove follow_pfn

    Remove follow_pfn now that the last user is gone.

    Link: https://lkml.kernel.org/r/20240324234542.2038726-3-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Fei Li <fei1.li@intel.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:45 -04:00
Donald Dutile 65d55d0e19 mm/pagewalk: check pfnmap for folio_walk_start()
JIRA: https://issues.redhat.com/browse/RHEL-73613

Conflicts: Dropped hunk to folio_walk_start(), as it's not in RHEL9.
           Note: contrary to the Subject, there is a change to vm_normal_page_pmd()
           that is kept.

commit 10d83d7781a8a6ff02bafd172c1ab183b27f8d5a
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Aug 26 16:43:40 2024 -0400

    mm/pagewalk: check pfnmap for folio_walk_start()

    Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
    them properly as it means there's no folio backing them.

    [peterx@redhat.com: remove some stale comments, per David]
      Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2025-03-26 22:00:44 -04:00
Luiz Capitulino 7535c3fec0 Revert "mm: add vma_has_recency()"
This reverts commit d908e3177a.

JIRA: https://issues.redhat.com/browse/RHEL-80655
Upstream Status: RHEL-only

It was found that the introduction of POSIX_FADV_NOREUSE in 9.6 is
causing a 10-20x slowdown in OCP's etcd compaction, which we believe
is due to increased OCP API latency.

In particular, we believe that this commit is regressing MADV_RANDOM
in a way that causes performance degradation for applications
using this hint, as after this commit the pages backing the VMAs that
are marked for random access will not receive a second chance to be
re-activated once they are in the LRU inactive list.

The conflict is due to downstream a85223eeb8 ("mm: ptep_get()
conversion") in mm/rmap.c::folio_referenced_one().

Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
2025-03-06 16:13:44 -05:00
Rafael Aquini 82c7711710 mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 2b0f922323ccfa76219bcaacd35cd50aeaa13592
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 11 12:24:45 2024 +0200

    mm: don't install PMD mappings when THPs are disabled by the hw/process/vma

    We (or rather, readahead logic :) ) might be allocating a THP in the
    pagecache and then try mapping it into a process that explicitly disabled
    THP: we might end up installing PMD mappings.

    This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
    THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
    starting the VM.

    For example, starting a VM backed on a file system with large folios
    supported makes the VM crash when the VM tries accessing such a mapping
    using KVM.

    Is it also a problem when the HW disabled THP using
    TRANSPARENT_HUGEPAGE_UNSUPPORTED?  At least on x86 this would be the case
    without X86_FEATURE_PSE.

    In the future, we might be able to do better on s390x and only disallow
    PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
    really wants.  For now, fix it by essentially performing the same check as
    would be done in __thp_vma_allowable_orders() or in shmem code, where this
    works as expected, and disallow PMD mappings, making us fallback to PTE
    mappings.

    Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
    Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: Leo Fu <bfu@redhat.com>
    Tested-by: Thomas Huth <thuth@redhat.com>
    Cc: Thomas Huth <thuth@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Janosch Frank <frankja@linux.ibm.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:47 -05:00
Rafael Aquini bd953d39d3 mm: avoid leaving partial pfn mappings around in error case
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Sep 11 17:11:23 2024 -0700

    mm: avoid leaving partial pfn mappings around in error case

    As Jann points out, PFN mappings are special, because unlike normal
    memory mappings, there is no lifetime information associated with the
    mapping - it is just a raw mapping of PFNs with no reference counting of
    a 'struct page'.

    That's all very much intentional, but it does mean that it's easy to
    mess up the cleanup in case of errors.  Yes, a failed mmap() will always
    eventually clean up any partial mappings, but without any explicit
    lifetime in the page table mapping itself, it's very easy to do the
    error handling in the wrong order.

    In particular, it's easy to mistakenly free the physical backing store
    before the page tables are actually cleaned up and (temporarily) have
    stale dangling PTE entries.

    To make this situation less error-prone, just make sure that any partial
    pfn mapping is torn down early, before any other error handling.

    Reported-and-tested-by: Jann Horn <jannh@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Simona Vetter <simona.vetter@ffwll.ch>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:41 -05:00
Rafael Aquini dddceb5b5f mm/numa: no task_numa_fault() call if PTE is changed
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * this is a direct port from v6.6 LTS branch backport commit 19b4397c4a15
    ("mm/numa: no task_numa_fault() call if PTE is changed"), due to RHEL9
    missing upstream commit d2136d749d76 ("mm: support multi-size THP numa
    balancing") along with its accompanying series.

This patch is a backport of the following upstream commit:
commit 40b760cfd44566bca791c80e0720d70d75382b84
Author: Zi Yan <ziy@nvidia.com>
Date:   Fri Aug 9 10:59:04 2024 -0400

    mm/numa: no task_numa_fault() call if PTE is changed

    When handling a numa page fault, task_numa_fault() should be called by a
    process that restores the page table of the faulted folio to avoid
    duplicated stats counting.  Commit b99a342d4f ("NUMA balancing: reduce
    TLB flush via delaying mapping on hint page fault") restructured
    do_numa_page() and did not avoid the task_numa_fault() call in the second
    page table check after a numa migration failure.  Fix it by making all
    !pte_same() checks return immediately.

    This issue can cause task_numa_fault() being called more than necessary
    and lead to unexpected numa balancing results (It is hard to tell whether
    the issue will cause positive or negative performance impact due to
    duplicated numa fault counting).

    Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com
    Fixes: b99a342d4f ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: "Huang, Ying" <ying.huang@intel.com>
    Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:32 -05:00
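
A hedged, illustrative rendering of the control flow the fix enforces in do_numa_page() (not the literal diff): once the re-checked PTE no longer matches orig_pte, return before any task_numa_fault() accounting, so the fault is counted only by whoever actually restored the page table entry.

#include <linux/mm.h>

/* Re-check the PTE under its lock after a failed NUMA migration. */
static vm_fault_t numa_fault_recheck(struct vm_fault *vmf)
{
	spin_lock(vmf->ptl);
	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
		/* Someone else changed the PTE and owns the accounting. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;	/* return without calling task_numa_fault() */
	}
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	/*
	 * Only a path that actually restored the PTE should go on to call
	 * task_numa_fault() and update the NUMA balancing statistics.
	 */
	return 0;
}
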
Rafael Aquini d0e1e96ac0 mm: memory: fix shift-out-of-bounds in fault_around_bytes_set
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 5aa598a72eafcf05239519646ec88638c8894dba
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Sat Mar 2 14:43:12 2024 +0800

    mm: memory: fix shift-out-of-bounds in fault_around_bytes_set

    rounddown_pow_of_two(0) is undefined, so val = 0 is not allowed in
    fault_around_bytes_set(), and it leads to a shift-out-of-bounds:

    UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13
    shift exponent 4294967295 is too large for 64-bit type 'long unsigned int'
    CPU: 7 PID: 107 Comm: sh Not tainted 6.8.0-rc6-next-20240301 #294
    Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
    Call trace:
     dump_backtrace+0x94/0xec
     show_stack+0x18/0x24
     dump_stack_lvl+0x78/0x90
     dump_stack+0x18/0x24
     ubsan_epilogue+0x10/0x44
     __ubsan_handle_shift_out_of_bounds+0x98/0x134
     fault_around_bytes_set+0xa4/0xb0
     simple_attr_write_xsigned.isra.0+0xe4/0x1ac
     simple_attr_write+0x18/0x24
     debugfs_attr_write+0x4c/0x98
     vfs_write+0xd0/0x4b0
     ksys_write+0x6c/0xfc
     __arm64_sys_write+0x1c/0x28
     invoke_syscall+0x44/0x104
     el0_svc_common.constprop.0+0x40/0xe0
     do_el0_svc+0x1c/0x28
     el0_svc+0x34/0xdc
     el0t_64_sync_handler+0xc0/0xc4
     el0t_64_sync+0x190/0x194
    ---[ end trace ]---

    Fix it by setting the minimum val to PAGE_SIZE.

    Link: https://lkml.kernel.org/r/20240302064312.2358924-1-wangkefeng.wang@huawei.com
    Fixes: 53d36a56d8c4 ("mm: prefer fault_around_pages to fault_around_bytes")
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reported-by: Yue Sun <samsun1006219@gmail.com>
    Closes: https://lore.kernel.org/all/CAEkJfYPim6DQqW1GqCiHLdh2-eweqk1fGyXqs3JM+8e1qGge8w@mail.gmail.com/
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:35 -05:00
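
A stand-alone userspace model (not kernel code) of why val = 0 had to be rejected: in the kernel, rounddown_pow_of_two(n) boils down to 1UL << (fls_long(n) - 1), so n == 0 produces the huge shift exponent seen in the UBSAN splat above. The clamp mirrors the fix of setting the minimum val to PAGE_SIZE.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Toy model of the kernel helpers, for illustration only. */
static unsigned long fls_long_model(unsigned long v)
{
	unsigned long bits = 0;

	while (v) {
		v >>= 1;
		bits++;
	}
	return bits;		/* 0 for v == 0, just like fls_long(0) */
}

static unsigned long rounddown_pow_of_two_model(unsigned long v)
{
	return 1UL << (fls_long_model(v) - 1);	/* undefined when v == 0 */
}

int main(void)
{
	unsigned long val = 0;

	/* The fix described above: never round a value below PAGE_SIZE. */
	if (val < PAGE_SIZE)
		val = PAGE_SIZE;
	printf("fault_around_bytes -> %lu\n", rounddown_pow_of_two_model(val));
	return 0;
}

Compiled and run, this prints 4096: the debugfs write succeeds with the smallest sane value instead of tripping UBSAN.
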
Rafael Aquini 070f8b6fd5 mm: remove unnecessary ia64 code and comment
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit e99fb98d478a0480d50e334df21bef12fb74e17f
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Dec 22 15:02:03 2023 +0800

    mm: remove unnecessary ia64 code and comment

    IA64 has gone with commit cf8e8658100d ("arch: Remove Itanium (IA-64)
    architecture"), remove unnecessary ia64 special mm code and comment too.

    Link: https://lkml.kernel.org/r/20231222070203.2966980-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:11 -05:00
Rafael Aquini 0a546fc1e9 mm: convert swap_readpage() to swap_read_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c9bdf768dd9319d2d80a334646e2c8116af9e430
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Dec 13 21:58:39 2023 +0000

    mm: convert swap_readpage() to swap_read_folio()

    All callers have a folio, so pass it in, saving two calls to
    compound_head().

    Link: https://lkml.kernel.org/r/20231213215842.671461-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:06 -05:00
Rafael Aquini 614fbccf12 mm: convert __do_fault() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 01d1e0e6b7d99ebaf2e42d2205595080b7d0c271
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Nov 8 18:28:05 2023 +0000

    mm: convert __do_fault() to use a folio

    Convert vmf->page to a folio as soon as we're going to use it.  This fixes
    a bug if the fault handler returns a tail page with hardware poison; tail
    pages have an invalid page->index, so we would fail to unmap the page from
    the page tables.  We actually have to unmap the entire folio (or
    mapping_evict_folio() will fail), so use unmap_mapping_folio() instead.

    This also saves various calls to compound_head() hidden in lock_page(),
    put_page(), etc.

    Link: https://lkml.kernel.org/r/20231108182809.602073-3-willy@infradead.org
    Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:35 -05:00
Rafael Aquini 15c37b2ac9 fork: use __mt_dup() to duplicate maple tree in dup_mmap()
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * kernel/fork.c: differences on the 3rd and 4th hunks are due to out-of-order
    backport of commit 35e351780fa9 ("fork: defer linking file vma until vma is
    fully initialized"), and we have one RHEL-only hunk here that reverts commit
    2b4f3b4987b5 ("fork: lock VMAs of the parent process when forking") which was
    a temporary measure upstream that ended up in a dead branch after v6.6;

This patch is a backport of the following upstream commit:
commit d2406291483775ecddaee929231a39c70c08fda2
Author: Peng Zhang <zhangpeng.00@bytedance.com>
Date:   Fri Oct 27 11:38:45 2023 +0800

    fork: use __mt_dup() to duplicate maple tree in dup_mmap()

    In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
    directly replacing the entries of VMAs in the new maple tree can result in
    better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
    tree, so it is efficient.

    The average time complexity of __mt_dup() is O(n), where n is the number
    of VMAs.  The proof of the time complexity is provided in the commit log
    that introduces __mt_dup().  After duplicating the maple tree, each
    element is traversed and replaced (ignoring the cases of deletion, which
    are rare).  Since it is only a replacement operation for each element,
    this process is also O(n).

    Analyzing the exact time complexity of the previous algorithm is
    challenging because each insertion can involve appending to a node,
    pushing data to adjacent nodes, or even splitting nodes.  The frequency of
    each action is difficult to calculate.  The worst-case scenario for a
    single insertion is when the tree undergoes splitting at every level.  If
    we consider each insertion as the worst-case scenario, we can determine
    that the upper bound of the time complexity is O(n*log(n)), although this
    is a loose upper bound.  However, based on the test data, it appears that
    the actual time complexity is likely to be O(n).

    As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
    fails, there will be a portion of VMAs that have not been duplicated in
    the maple tree.  To handle this, we mark the failure point with
    XA_ZERO_ENTRY.  In exit_mmap(), if this marker is encountered, stop
    releasing VMAs that have not been duplicated after this point.

    There is a "spawn" in byte-unixbench[1], which can be used to test the
    performance of fork().  I modified it slightly to make it work with
    different numbers of VMAs.

    Below are the test results.  The first row shows the number of VMAs.  The
    second and third rows show the number of fork() calls per ten seconds,
    corresponding to next-20231006 and this patchset, respectively.  The
    test results were obtained with CPU binding to avoid scheduler load
    balancing that could cause unstable results.  There are still some
    fluctuations in the test results, but at least they are better than the
    original performance.

    VMAs           21     121   221    421    821    1621   3221   6421   12821  25621  51221
    next-20231006  112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
    this patchset  114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
    improvement    2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%

    [1] https://github.com/kdlucas/byte-unixbench/tree/master

    Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Mateusz Guzik <mjguzik@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Mike Christie <michael.christie@oracle.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:34 -05:00
Rafael Aquini e34266c05e mm/gup: adapt get_user_page_vma_remote() to never return NULL
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * mm/memory.c: 2nd hunk dropped for this backport because the extra line
    originally added by upstream commit ca5e863233e8 ("mm/gup: remove vmas
    parameter from get_user_pages_remote()") was not backported into RHEL-9
    (see commit e24b3ade32)

This patch is a backport of the following upstream commit:
commit 6a1960b8a8773324d870fa32ba68ff3106523a95
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Tue Oct 3 00:14:54 2023 +0100

    mm/gup: adapt get_user_page_vma_remote() to never return NULL

    get_user_pages_remote() will never return 0 except in the case of
    FOLL_NOWAIT being specified, which we explicitly disallow.

    This simplifies error handling for the caller and avoids the awkwardness
    of dealing with both errors and failing to pin.  Failing to pin here is an
    error.

    Link: https://lkml.kernel.org/r/00319ce292d27b3aae76a0eb220ce3f528187508.1696288092.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Suggested-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:48 -05:00
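
A hedged sketch of the simplified error handling the commit describes, assuming the get_user_page_vma_remote(mm, addr, gup_flags, &vma) signature and that the caller holds the mmap read lock; the helper is illustrative only:

#include <linux/mm.h>
#include <linux/highmem.h>

/* Copy one byte from another process' address space. */
static long read_remote_byte(struct mm_struct *mm, unsigned long addr, u8 *out)
{
	struct vm_area_struct *vma;
	struct page *page = get_user_page_vma_remote(mm, addr, FOLL_FORCE, &vma);
	void *kaddr;

	if (IS_ERR(page))
		return PTR_ERR(page);	/* no separate NULL check needed any more */

	kaddr = kmap_local_page(page);
	*out = *((u8 *)kaddr + offset_in_page(addr));
	kunmap_local(kaddr);
	put_page(page);
	return 0;
}
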
Rafael Aquini 2472278b18 mm: make __access_remote_vm() static
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c43cfa42541c04a3a94312e39ab81c41ba431277
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Tue Oct 3 00:14:51 2023 +0100

    mm: make __access_remote_vm() static

    Patch series "various improvements to the GUP interface", v2.

    A series of fixes to simplify and improve the GUP interface with an eye to
    providing groundwork to future improvements:-

    * __access_remote_vm() and access_remote_vm() are functionally identical,
      so make the former static such that in future we can potentially change
      the external-facing implementation details of this function.

    * Extend is_valid_gup_args() to cover the missing FOLL_TOUCH case, and
      simplify things by defining INTERNAL_GUP_FLAGS to check against.

    * Adjust __get_user_pages_locked() to explicitly treat a failure to pin any
      pages as an error in all circumstances other than FOLL_NOWAIT being
      specified, bringing it in line with the nommu implementation of this
      function.

    * (With many thanks to Arnd who suggested this in the first instance)
      Update get_user_page_vma_remote() to explicitly only return a page or an
      error, simplifying the interface and avoiding the questionable
      IS_ERR_OR_NULL() pattern.

    This patch (of 4):

    access_remote_vm() passes through parameters to __access_remote_vm()
    directly, so remove the __access_remote_vm() function from mm.h and use
    access_remote_vm() in the one caller that needs it (ptrace_access_vm()).

    This allows future adjustments to the GUP-internal __access_remote_vm()
    function while keeping the access_remote_vm() function stable.

    Link: https://lkml.kernel.org/r/cover.1696288092.git.lstoakes@gmail.com
    Link: https://lkml.kernel.org/r/f7877c5039ce1c202a514a8aeeefc5cdd5e32d19.1696288092.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:45 -05:00
Rafael Aquini 71a32e7d3d mm: mempolicy: make mpol_misplaced() to take a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 75c70128a67311070115b90d826a229d4bbbb2b5
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:16 2023 +0800

    mm: mempolicy: make mpol_misplaced() to take a folio

    In preparation for large folio numa balancing, make mpol_misplaced()
    take a folio; no functional change intended.

    Link: https://lkml.kernel.org/r/20230921074417.24004-6-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:35 -05:00
Rafael Aquini 866dcb67b1 mm: memory: make numa_migrate_prep() to take a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit cda6d93672ac5dd8af778a3f3e6082e12233b65b
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:15 2023 +0800

    mm: memory: make numa_migrate_prep() to take a folio

    In preparation for large folio numa balancing, make numa_migrate_prep()
    take a folio; no functional change intended.

    Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:34 -05:00
Rafael Aquini 8209cd877c mm: memory: use a folio in do_numa_page()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:14 2023 +0800

    mm: memory: use a folio in do_numa_page()

    Numa balancing only tries to migrate non-compound pages in do_numa_page();
    use a folio in it to save several compound_head() calls.  Note that we use
    folio_estimated_sharers(): it is enough to check the folio sharers since
    only normal pages are handled.  If large folio numa balancing is supported,
    a precise folio sharers check would be used.  No functional change intended.

    Link: https://lkml.kernel.org/r/20230921074417.24004-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:33 -05:00
Rafael Aquini de619ae047 mm: memory: add vm_normal_folio_pmd()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 65610453459f9048678a0daef89d592e412ec00a
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:12 2023 +0800

    mm: memory: add vm_normal_folio_pmd()

    Patch series "mm: convert numa balancing functions to use a folio", v2.

    do_numa_page() only handles non-compound pages, and only PMD-mapped THPs
    are handled in do_huge_pmd_numa_page().  But large, PTE-mapped folios
    will be supported, so let's convert more numa balancing functions to
    use/take a folio in preparation for that; no functional change intended
    for now.

    This patch (of 6):

    The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio(),
    which allows callers to completely replace the struct page variables with
    struct folio variables.

    Link: https://lkml.kernel.org/r/20230921074417.24004-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20230921074417.24004-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:32 -05:00
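
A hedged sketch of the shape of the new wrapper as the message describes it, mirroring vm_normal_folio(); treat it as an illustration, not the verbatim upstream implementation:

#include <linux/mm.h>

/* Return the folio backing a PMD entry, or NULL for special/pfn mappings. */
struct folio *vm_normal_folio_pmd_sketch(struct vm_area_struct *vma,
					 unsigned long addr, pmd_t pmd)
{
	struct page *page = vm_normal_page_pmd(vma, addr, pmd);

	return page ? page_folio(page) : NULL;
}
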
Rafael Aquini 27ca54790a mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 73eab3ca481e5be0f1fd8140365d604482f84ee1
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 13 17:51:27 2023 +0800

    mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()

    At present, numa balancing only supports base pages and PMD-mapped THP,
    but we will expand to support migrating large folios/pte-mapped THP in
    the future.  It is better to make migrate_misplaced_page() take a folio
    instead of a page and rename it to migrate_misplaced_folio(); this is a
    preparation, and it also removes several compound_head() calls.

    Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:25 -05:00
Audra Mitchell 7e3c51874e mm: Warn on shadow stack memory in wrong vma
JIRA: https://issues.redhat.com/browse/RHEL-55461

This patch is a backport of the following upstream commit:
commit e5136e876581ba5b63220378e25fec9dcec7bad1
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date:   Mon Jun 12 17:10:43 2023 -0700

    mm: Warn on shadow stack memory in wrong vma

    The x86 Control-flow Enforcement Technology (CET) feature includes a new
    type of memory called shadow stack. This shadow stack memory has some
    unusual properties, which requires some core mm changes to function
    properly.

    One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
    treated as shadow by the CPU, but this combination used to be created by
    the kernel on x86. Previous patches have changed the kernel to now avoid
    creating these PTEs unless they are for shadow stack memory. In case any
    missed corners of the kernel are still creating PTEs like this for
    non-shadow stack memory, and to catch any re-introductions of the logic,
    warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
    stack VMAs when they are being zapped. This won't catch transient cases
    but should have decent coverage.

    In order to check if a PTE is shadow stack in core mm code, add two arch
    breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack
    specific code to be kept in arch/x86.

    Only do the check if shadow stack is supported by the CPU and configured
    because in rare cases older CPUs may write Dirty=1 to a Write=0 PTE.
    This check is handled in pte_shstk()/pmd_shstk().

    Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Mark Brown <broonie@kernel.org>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Tested-by: Pengfei Xu <pengfei.xu@intel.com>
    Tested-by: John Allen <john.allen@amd.com>
    Tested-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/all/20230613001108.3040476-18-rick.p.edgecombe%40intel.com

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:14 -05:00
Rafael Aquini c3be6088b1 mm: fix old/young bit handling in the faulting path
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * minor context difference due to out-of-order backport of commit
      2bad466cc9d9 ("mm/uffd: UFFD_FEATURE_WP_UNPOPULATED")

This patch is a backport of the following upstream commit:
commit 4cd7ba16a0afb36550eed7690e73d3e7a743fa96
Author: Ram Tummala <rtummala@nvidia.com>
Date:   Tue Jul 9 18:45:39 2024 -0700

    mm: fix old/young bit handling in the faulting path

    Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
    replaced do_set_pte() with set_pte_range() and that introduced a
    regression in the following faulting path of non-anonymous vmas which
    caused the PTE for the faulting address to be marked as old instead of
    young.

    handle_pte_fault()
      do_pte_missing()
        do_fault()
          do_read_fault() || do_cow_fault() || do_shared_fault()
            finish_fault()
              set_pte_range()

    The polarity of prefault calculation is incorrect.  This leads to prefault
    being incorrectly set for the faulting address.  The following check will
    incorrectly mark the PTE old rather than young.  On some architectures
    this will cause a double fault to mark it young when the access is
    retried.

        if (prefault && arch_wants_old_prefaulted_pte())
            entry = pte_mkold(entry);

    On a subsequent fault on the same address, the faulting path will see a
    non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE
    will then be correctly marked young in handle_pte_fault() itself.

    Due to this bug, performance degradation in the fault handling path will
    be observed due to unnecessary double faulting.

    Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
    Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
    Signed-off-by: Ram Tummala <rtummala@nvidia.com>
    Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Yin Fengwei <fengwei.yin@intel.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:32 -04:00
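
A stand-alone userspace model of the corrected polarity (not the kernel code): a batch of PTEs counts as prefaulted only when the faulting address is not inside the range that batch covers, and only the non-prefaulted batch must stay "young".

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* A batch is a "prefault" only when the faulting address is NOT inside
 * the range of pages being installed by this batch. */
static bool is_prefault(unsigned long fault_addr, unsigned long start,
			unsigned long nr_pages)
{
	return !(fault_addr >= start &&
		 fault_addr < start + nr_pages * PAGE_SIZE);
}

int main(void)
{
	unsigned long fault = 0x201000, start = 0x200000;

	/* The batch that covers the faulting address must stay "young"... */
	printf("batch at 0x%lx: prefault=%d\n", start,
	       is_prefault(fault, start, 16));
	/* ...while surrounding fault-around batches may be inserted "old". */
	printf("batch at 0x%lx: prefault=%d\n", start + 16 * PAGE_SIZE,
	       is_prefault(fault, start + 16 * PAGE_SIZE, 16));
	return 0;
}
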
Rafael Aquini cd1cd44bf9 mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 3d2c908768877714a354ee6d7bf93e801400d5e2
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Aug 21 18:08:48 2023 +0200

    mm/swap: inline folio_set_swap_entry() and folio_swap_entry()

    Let's simply work on the folio directly and remove the helpers.

    Link: https://lkml.kernel.org/r/20230821160849.531668-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Reviewed-by: Chris Li <chrisl@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Seth Jennings <sjenning@redhat.com>
    Cc: Vitaly Wool <vitaly.wool@konsulko.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:06 -04:00
Rafael Aquini 33f1751df5 mm/swap: stop using page->private on tail pages for THP_SWAP
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit cfeed8ffe55b37fa10286aaaa1369da00cb88440
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Aug 21 18:08:46 2023 +0200

    mm/swap: stop using page->private on tail pages for THP_SWAP

    Patch series "mm/swap: stop using page->private on tail pages for THP_SWAP
    + cleanups".

    This series stops using page->private on tail pages for THP_SWAP, replaces
    folio->private by folio->swap for swapcache folios, and starts using
    "new_folio" for tail pages that we are splitting to remove the usage of
    page->private for swapcache handling completely.

    This patch (of 4):

    Let's stop using page->private on tail pages, making it possible to just
    unconditionally reuse that field in the tail pages of large folios.

    The remaining usage of the private field for THP_SWAP is in the THP
    splitting code (mm/huge_memory.c), that we'll handle separately later.

    Update the THP_SWAP documentation and sanity checks in mm_types.h and
    __split_huge_page_tail().

    [david@redhat.com: stop using page->private on tail pages for THP_SWAP]
      Link: https://lkml.kernel.org/r/6f0a82a3-6948-20d9-580b-be1dbf415701@redhat.com
    Link: https://lkml.kernel.org/r/20230821160849.531668-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20230821160849.531668-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>     [arm64]
    Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Seth Jennings <sjenning@redhat.com>
    Cc: Vitaly Wool <vitaly.wool@konsulko.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:05 -04:00
Rafael Aquini aaa814d7e1 mm: remove checks for pte_index
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit bb7dbaafff3f582d18028a5b99a8faa789842678
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Aug 19 04:18:37 2023 +0100

    mm: remove checks for pte_index

    Since pte_index is always defined, we don't need to check whether it's
    defined or not.  Delete the slow version that doesn't depend on it and
    remove the #define since nobody needs to test for it.

    Link: https://lkml.kernel.org/r/20230819031837.3160096-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Christian Dietrich <stettberger@dokucode.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:03 -04:00
Rafael Aquini 930d4bbabf mm: remove enum page_entry_size
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * fs/erofs/data.c: hunks dropped as RHEL is missing commit 06252e9ce05b
      ("erofs: dax support for non-tailpacking regular file")
  * fs/ext2/file.c: minor contex difference as RHEL is missing commit
      70f3bad8c315 ("ext2: Convert to using invalidate_lock")
  * fs/fuse/dax.c: minor contex difference as RHEL is missing commit
      8bcbbe9c7c8e ("fuse: Convert to using invalidate_lock")

This patch is a backport of the following upstream commit:
commit 1d024e7a8dabcc3c84d77532a88c774c32cf8245
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Aug 18 21:23:35 2023 +0100

    mm: remove enum page_entry_size

    Remove the unnecessary encoding of page order into an enum and pass the
    page order directly.  That lets us get rid of pe_order().

    The switch constructs have to be changed to if/else constructs to prevent
    GCC from warning on builds with 3-level page tables where PMD_ORDER and
    PUD_ORDER have the same value.

    If you are looking at this commit because your driver stopped compiling,
    look at the previous commit as well and audit your driver to be sure it
    doesn't depend on mmap_lock being held in its ->huge_fault method.

    [willy@infradead.org: use "order %u" to match the (non dev_t) style]
      Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
    Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:01 -04:00
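
A hedged sketch of what a ->huge_fault implementation looks like after this change, assuming the new vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order) signature and the PMD_ORDER/PUD_ORDER macros; names are illustrative, not a real driver:

#include <linux/mm.h>

/* The handler now receives the page order directly instead of an enum. */
static vm_fault_t demo_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	/* if/else rather than switch: PMD_ORDER may equal PUD_ORDER on
	 * 3-level page tables, which would make duplicate case labels. */
	if (order == PMD_ORDER)
		return VM_FAULT_FALLBACK;	/* fall back to PTE mappings */
	else if (order == PUD_ORDER)
		return VM_FAULT_FALLBACK;
	return VM_FAULT_SIGBUS;
}

static const struct vm_operations_struct demo_vm_ops = {
	.huge_fault	= demo_huge_fault,
};
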
Rafael Aquini a2d8b7832f mm: allow ->huge_fault() to be called without the mmap_lock held
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 40d49a3c9e4a0e5cf7a6fcebc8d4d7d63d1f3f1b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Aug 18 21:23:34 2023 +0100

    mm: allow ->huge_fault() to be called without the mmap_lock held

    Remove the checks for the VMA lock being held, allowing the page fault
    path to call into the filesystem instead of retrying with the mmap_lock
    held.  This will improve scalability for DAX page faults.  Also update the
    documentation to match (and fix some other changes that have happened
    recently).

    Link: https://lkml.kernel.org/r/20230818202335.2739663-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:00 -04:00
Rafael Aquini 2142680857 mm: convert ptlock_free() to use ptdescs
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6ed1b8a09deb0b99fd3b54e11535c80284689555
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Aug 7 16:04:52 2023 -0700

    mm: convert ptlock_free() to use ptdescs

    This removes some direct accesses to struct page, working towards
    splitting out struct ptdesc from struct page.

    Link: https://lkml.kernel.org/r/20230807230513.102486-11-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Guo Ren <guoren@kernel.org>
    Cc: Huacai Chen <chenhuacai@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Jonas Bonn <jonas@southpole.se>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Palmer Dabbelt <palmer@rivosinc.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:22 -04:00
Rafael Aquini e666d241bf mm: convert ptlock_alloc() to use ptdescs
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit f5ecca06b3a5d0371ee27ee08aa06c686407a8af
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Aug 7 16:04:47 2023 -0700

    mm: convert ptlock_alloc() to use ptdescs

    This removes some direct accesses to struct page, working towards
    splitting out struct ptdesc from struct page.

    Link: https://lkml.kernel.org/r/20230807230513.102486-6-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Guo Ren <guoren@kernel.org>
    Cc: Huacai Chen <chenhuacai@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Jonas Bonn <jonas@southpole.se>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Palmer Dabbelt <palmer@rivosinc.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:17 -04:00
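
Likewise for the allocation side; roughly the post-conversion form, again only
built under ALLOC_SPLIT_PTLOCKS:

    bool ptlock_alloc(struct ptdesc *ptdesc)
    {
            spinlock_t *ptl;

            ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
            if (!ptl)
                    return false;
            ptdesc->ptl = ptl;      /* the lock now hangs off the ptdesc */
            return true;
    }
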
Rafael Aquini b58609c4fa mm: call update_mmu_cache_range() in more page fault handling paths
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * minor context conflict on the 6th hunk due to out-of-order backport of
      upstream commit 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")

This patch is a backport of the following upstream commit:
commit 5003a2bdf6880dc9c301f555bece1154081158fe
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 2 16:14:06 2023 +0100

    mm: call update_mmu_cache_range() in more page fault handling paths

    Pass the vm_fault to the architecture to help it make smarter decisions
    about which PTEs to insert into the TLB.

    Link: https://lkml.kernel.org/r/20230802151406.3735276-39-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:34 -04:00
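
A minimal sketch of the calling pattern this enables, assuming the range API's
update_mmu_cache_range(vmf, vma, addr, ptep, nr) signature; the helper below is
illustrative rather than a specific upstream hunk:

    static void sketch_finish_fault_pte(struct vm_fault *vmf, pte_t entry)
    {
            struct vm_area_struct *vma = vmf->vma;

            set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
            /*
             * The vm_fault is passed through so the architecture can decide
             * which of the just-installed PTEs are worth preloading into its
             * TLB or caches.
             */
            update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
    }
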
Rafael Aquini af8796d9b7 mm: convert do_set_pte() to set_pte_range()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 3bd786f76de2e01745f462844fd1a206052ee8b8
Author: Yin Fengwei <fengwei.yin@intel.com>
Date:   Wed Aug 2 16:14:04 2023 +0100

    mm: convert do_set_pte() to set_pte_range()

    set_pte_range() allows page table entries to be set up for a specific
    range.  It takes advantage of batched rmap updates for large folios.
    It now takes care of calling update_mmu_cache_range().

    Link: https://lkml.kernel.org/r/20230802151406.3735276-37-willy@infradead.org
    Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:33 -04:00
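
A sketch of how a caller batches a whole folio through the new entry point,
assuming the upstream signature set_pte_range(vmf, folio, page, nr, addr); the
wrapper function here is illustrative:

    static void sketch_map_folio(struct vm_fault *vmf, struct folio *folio,
                                 unsigned long addr)
    {
            unsigned int nr = folio_nr_pages(folio);

            /*
             * One call installs nr consecutive PTEs, batches the rmap
             * accounting for the folio, and calls update_mmu_cache_range()
             * itself, replacing a per-page do_set_pte() loop.
             */
            set_pte_range(vmf, folio, folio_page(folio, 0), nr, addr);
    }
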
Rafael Aquini fca7a7db19 mm: use flush_icache_pages() in do_set_pmd()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 9f1f5b60e76d44fa85fef6970b7477f72d3999eb
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 2 16:14:01 2023 +0100

    mm: use flush_icache_pages() in do_set_pmd()

    Push the iteration over each page down to the architectures (many can
    flush the entire THP without iteration).

    Link: https://lkml.kernel.org/r/20230802151406.3735276-34-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:29 -04:00
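
The change in do_set_pmd() is essentially the following, paraphrased rather
than quoted from the diff:

    /* before: one call per base page of the THP */
    for (i = 0; i < HPAGE_PMD_NR; i++)
            flush_icache_page(vma, page + i);

    /* after: a single ranged call; architectures that can flush the whole
     * THP at once no longer need to iterate */
    flush_icache_pages(vma, page, HPAGE_PMD_NR);
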
Rafael Aquini 1014b8c004 mm/memory.c: fix some kernel-doc comments
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6e412203eeae68b599fb0a0722961e68f90322df
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Thu Jul 27 09:55:58 2023 +0800

    mm/memory.c: fix some kernel-doc comments

    Add descriptions of @mas and @tree_end and remove @mt in unmap_vmas() to
    silence the warnings:

    mm/memory.c:1837: warning: Function parameter or member 'mas' not described in 'unmap_vmas'
    mm/memory.c:1837: warning: Function parameter or member 'tree_end' not described in 'unmap_vmas'
    mm/memory.c:1837: warning: Excess function parameter 'mt' description in 'unmap_vmas'

    Link: https://lkml.kernel.org/r/20230727015558.69554-1-yang.lee@linux.alibaba.com
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5996
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:03 -04:00
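
The fix is confined to the kernel-doc block above unmap_vmas(); the parameter
lines end up along these lines (wording approximate, not the verbatim comment):

    /**
     * unmap_vmas - unmap a range of memory covered by a list of vma's
     * @tlb: address of the caller's struct mmu_gather
     * @mas: the maple state holding the VMAs to unmap
     * @vma: the starting vma
     * @start_addr: virtual address at which to start unmapping
     * @end_addr: virtual address at which to end unmapping
     * @tree_end: the maximum index to walk in the maple tree
     * @mm_wr_locked: whether the mmap_lock is held for writing
     */
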
Rafael Aquini 7318a0936b mm: handle faults that merely update the accessed bit under the VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 063e60d806151f3733acabccb62a463d55fac469
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:10 2023 +0100

    mm: handle faults that merely update the accessed bit under the VMA lock

    Move FAULT_FLAG_VMA_LOCK check out of handle_pte_fault().  This should
    have a significant performance improvement for mmaped files.  Write faults
    (on read-only shared pages) still take the mmap lock as we do not want to
    audit all the implementations of ->pfn_mkwrite() and ->page_mkwrite().
    However write-faults on private mappings are handled under the VMA lock.

    [willy@infradead.org: address "suspicious RCU usage" warning]
      Link: https://lkml.kernel.org/r/ZMK7jwpI4uD6tKrF@casper.infradead.org
    Link: https://lkml.kernel.org/r/20230724185410.1124082-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:41 -04:00
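
The class of fault that can now complete under the VMA lock is the one that
only needs to mark the PTE accessed (and possibly dirty); schematically, and
condensed from the tail of handle_pte_fault() rather than copied from it:

    static void sketch_touch_pte(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            pte_t entry = pte_mkyoung(vmf->orig_pte);

            /*
             * Nothing here calls ->fault(), ->page_mkwrite() or
             * ->pfn_mkwrite(), which is why holding only the per-VMA lock
             * is sufficient for this case.
             */
            if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry,
                                      vmf->flags & FAULT_FLAG_WRITE))
                    update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
    }
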
Rafael Aquini b49e98668c mm: handle swap and NUMA PTE faults under the VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 4c2f803abb1797e571579adcaf134a727b3ffc48
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:09 2023 +0100

    mm: handle swap and NUMA PTE faults under the VMA lock

    Move the FAULT_FLAG_VMA_LOCK check down in handle_pte_fault().  This is
    probably not a huge win in its own right, but is a nicely separable bit
    from the next patch.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-10-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:40 -04:00
Rafael Aquini fc444a09a8 mm: run the fault-around code under the VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit f5617ffeb450f84c57f7eba1a3524a29955d42b7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:08 2023 +0100

    mm: run the fault-around code under the VMA lock

    The map_pages fs method should be safe to run under the VMA lock instead
    of the mmap lock.  This should measurably reduce contention on the mmap
    lock.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:39 -04:00
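
What actually runs under the VMA lock here is the ->map_pages method.  For
page-cache-backed files that is filemap_map_pages(), which only maps folios
already present in the cache and never blocks on I/O.  A hypothetical
vm_operations_struct wiring it up, mirroring the generic file one (the struct
name is invented):

    static const struct vm_operations_struct example_file_vm_ops = {
            .fault          = filemap_fault,        /* may sleep; retried under mmap_lock when needed */
            .map_pages      = filemap_map_pages,    /* fault-around; safe under the per-VMA lock */
            .page_mkwrite   = filemap_page_mkwrite,
    };
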
Rafael Aquini 98e1e3a844 mm: move FAULT_FLAG_VMA_LOCK check down from do_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 61a4b8d32025dcabcd78994f887a4b9dff912cf0
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:07 2023 +0100

    mm: move FAULT_FLAG_VMA_LOCK check down from do_fault()

    Perform the check at the start of do_read_fault(), do_cow_fault() and
    do_shared_fault() instead.  There should be no performance change from the
    last commit.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:39 -04:00
Rafael Aquini bb23dd56a9 mm: move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 0c2e394ab23017303f676e6206a54c54bb0e3681
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:06 2023 +0100

    mm: move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()

    Call do_pte_missing() under the VMA lock ...  then immediately retry in
    do_fault().

    Link: https://lkml.kernel.org/r/20230724185410.1124082-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:38 -04:00
Rafael Aquini 80b2a84a74 mm: handle some PMD faults under the VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 8f5fd0e1a02020062c52063f15d4e5c426ee3547
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:05 2023 +0100

    mm: handle some PMD faults under the VMA lock

    Push the VMA_LOCK check down from __handle_mm_fault() to
    handle_pte_fault().  Once again, we refuse to call ->huge_fault() with the
    VMA lock held, but we will wait for a PMD migration entry with the VMA
    lock held, handle NUMA migration and set the accessed bit.  We were
    already doing this for anonymous VMAs, so it should be safe.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:37 -04:00
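
One of the PMD cases now handled without dropping back to the mmap_lock is
waiting on a migration entry; condensed and simplified from the PMD branch of
__handle_mm_fault():

    /* vmf->orig_pmd was sampled earlier in the fault path */
    if (unlikely(is_pmd_migration_entry(vmf->orig_pmd))) {
            pmd_migration_entry_wait(vma->vm_mm, vmf->pmd);
            return 0;       /* the fault is retried once migration finishes */
    }
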
Rafael Aquini 03c15bf28d mm: handle PUD faults under the VMA lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit c4fd825e188471d4d2796e02729dd029b3b23210
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:04 2023 +0100

    mm: handle PUD faults under the VMA lock

    Postpone checking the VMA_LOCK flag until we've attempted to handle faults
    on PUDs.  There's a mild upside to this patch in that we'll allocate the
    page tables while under the VMA lock rather than the mmap lock, reducing
    the hold time on the mmap lock, since the retry will find the page tables
    already populated.  The real purpose here is to make a commit that shows
    we don't call ->huge_fault under the VMA lock.  We do now handle setting
    the accessed bit on a PUD fault under the VMA lock, but that doesn't seem
    likely to be a measurable difference.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:36 -04:00
Rafael Aquini 37d87ab0c0 mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 4ec31152a80d83d74d231d964703a721236244ef
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:03 2023 +0100

    mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()

    Handle a little more of the page fault path outside the mmap sem.  The
    hugetlb path doesn't need to check whether the VMA is anonymous; the
    VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
    performance change from the previous commit; this is simply a step to ease
    bisection of any problems.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:36 -04:00
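
The dispatch in handle_mm_fault() relies on VM_HUGETLB alone to route hugetlb
faults, which is why no extra vma_is_anonymous() test is needed on that branch;
simplified from the upstream function:

    if (unlikely(is_vm_hugetlb_page(vma)))
            ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
    else
            ret = __handle_mm_fault(vma, address, flags);
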
Rafael Aquini d755df6daa mm: allow per-VMA locks on file-backed VMAs
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * MAINTAINERS: minor context difference due to backport of upstream commit
      14006f1d8fa2 ("Documentations: Analyze heavily used Networking related structs")

This patch is a backport of the following upstream commit:
commit 350f6bbca1de515cd7519a33661cefc93ea06054
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:02 2023 +0100

    mm: allow per-VMA locks on file-backed VMAs

    Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
    The fault path will immediately fail in handle_mm_fault().  There may be a
    small performance reduction from this patch as a little unnecessary work
    will be done on each page fault.  See later patches for the improvement.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:35 -04:00
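
For context, roughly how an architecture fault handler tries the per-VMA lock
before falling back to the mmap_lock; once file-backed VMAs can be found this
way, handle_mm_fault() simply returns VM_FAULT_RETRY for the cases it cannot
yet complete.  This is a simplified sketch of the common arch pattern, not any
particular architecture's code:

    static void sketch_arch_fault(struct mm_struct *mm, unsigned long address,
                                  unsigned int flags, struct pt_regs *regs)
    {
            struct vm_area_struct *vma;
            vm_fault_t fault;

            vma = lock_vma_under_rcu(mm, address);
            if (vma) {
                    fault = handle_mm_fault(vma, address,
                                            flags | FAULT_FLAG_VMA_LOCK, regs);
                    if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                            vma_end_read(vma);
                    if (!(fault & VM_FAULT_RETRY))
                            return; /* handled under the per-VMA lock */
            }
            /* otherwise take the mmap_lock and run the full fault path */
    }
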
Rafael Aquini 236fad9df4 mm: change do_vmi_align_munmap() tracking of VMAs to remove
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * mm/memory.c: minor context difference due to the backport of upstream
      commit 2820b0f09be9 ("hugetlbfs: close race between MADV_DONTNEED and page fault")

This patch is a backport of the following upstream commit:
commit fd892593d44d8b649caf30a67f0c7696d976d901
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Mon Jul 24 14:31:45 2023 -0400

    mm: change do_vmi_align_munmap() tracking of VMAs to remove

    The majority of the calls to munmap a vm range fall within a single vma.
    The maple tree is able to store a single entry at index 0, with a size of 1,
    as a direct pointer and avoid any allocations.  Change do_vmi_align_munmap() to
    store the VMAs being munmap()'ed into a tree indexed by the count.  This
    will leverage the ability to store the first entry without a node
    allocation.

    Storing the entries in a tree indexed by the count, rather than by vma start
    and end, means changing the functions which iterate over the entries.  Update
    unmap_vmas() and free_pgtables() to take a maple state and a tree end
    address to support this functionality.

    Passing through the same maple state to unmap_vmas() and free_pgtables()
    means the state needs to be reset between calls.  This happens in the
    static unmap_region() and exit_mmap().

    Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:24 -04:00
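
The gathering step described above boils down to indexing the side tree by how
many VMAs have been collected so far; a sketch with hypothetical local names,
not the exact hunk:

    static int sketch_gather_vma(struct ma_state *mas_detach,
                                 struct vm_area_struct *next, int count)
    {
            /*
             * The common single-VMA munmap stores at index 0, which the maple
             * tree keeps as a direct pointer without allocating a node.
             */
            mas_set(mas_detach, count);
            return mas_store_gfp(mas_detach, next, GFP_KERNEL);
    }

unmap_vmas() and free_pgtables() then walk this maple state up to the given
tree end index instead of being handed the tree itself.
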
Rafael Aquini b76e8e2fac mm/memory: pass folio into do_page_mkwrite()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 86aa6998ad00af823de81d12d41d7063c14298a0
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Mon Jul 10 22:35:44 2023 -0700

    mm/memory: pass folio into do_page_mkwrite()

    Saves one implicit call to compound_head().

    I'm not sure if I should change the name of the function to
    do_folio_mkwrite() and update the description comment to reference a folio,
    as the vm_op is still called page_mkwrite.

    Link: https://lkml.kernel.org/r/20230711053544.156617-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:16 -04:00
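
The change amounts to the caller resolving the folio once and handing it down;
a sketch of the call pattern, not the exact hunk:

    vm_fault_t tmp;
    struct folio *folio = page_folio(vmf->page);

    /*
     * The caller already holds the folio, so do_page_mkwrite() no longer
     * needs an implicit compound_head() on vmf->page.
     */
    tmp = do_page_mkwrite(vmf, folio);
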