Commit Graph

172 Commits

Author SHA1 Message Date
Rafael Aquini 7472fae475 mm/mlock: set the correct prev on failure
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit faa242b1d2a97143150bdc50d5b61fd70fcd17cd
Author: Wei Yang <richard.weiyang@gmail.com>
Date:   Sun Oct 27 12:33:21 2024 +0000

    mm/mlock: set the correct prev on failure

    After commit 94d7d9233951 ("mm: abstract the vma_merge()/split_vma()
    pattern for mprotect() et al."), if vma_modify_flags() returns an error,
    the vma is set to an error code.  This will lead to an invalid prev being
    returned.

    Generally this shouldn't matter as the caller should treat an error as
    indicating state is now invalidated, however unfortunately
    apply_mlockall_flags() does not check for errors and assumes that
    mlock_fixup() correctly maintains prev even if an error were to occur.

    This patch fixes that assumption.

    [lorenzo.stoakes@oracle.com: provide a better fix and rephrase the log]
    Link: https://lkml.kernel.org/r/20241027123321.19511-1-richard.weiyang@gmail.com
    Fixes: 94d7d9233951 ("mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.")
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:57 -05:00
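
A minimal illustration of the bug class fixed above (self-contained C, names
invented; this is not the kernel code): an output parameter such as "prev"
must be kept valid before any early error return, because a caller like
apply_mlockall_flags() may ignore the error and keep using it.
----
    #include <errno.h>

    struct region { int flags; struct region *next; };

    /* Update the caller's cursor (*prev) first, so a caller that ignores
     * the return value still holds a pointer to a real region rather than
     * a stale or error value. */
    static int region_fixup(struct region *r, struct region **prev, int newflags)
    {
        *prev = r;                      /* keep *prev valid ... */

        if (newflags < 0)
            return -EINVAL;             /* ... even on this failure path */

        r->flags = newflags;
        return 0;
    }
----
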
Rafael Aquini 26dc006376 mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * fs/userfaultfd.c: context difference on the 1st hunk due to out-of-order
    backport of commit c88033efe9a3 ("mm/userfaultfd: reset ptes when close()
    for wr-protected ones"), and context differences on the 2nd and 4th hunks
    due to RHEL9 missing upstream commit d61ea1cb0095 ("userfaultfd:
    UFFD_FEATURE_WP_ASYNC") and its series;

This patch is a backport of the following upstream commit:
commit 94d7d923395129b9248777e575c877e40007f9dc
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Wed Oct 11 18:04:28 2023 +0100

    mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.

    mprotect() and other functions which change VMA parameters over a range
    each employ a pattern of:-

    1. Attempt to merge the range with adjacent VMAs.
    2. If this fails, and the range spans a subset of the VMA, split it
       accordingly.

    This is open-coded and duplicated in each case. Also in each case most of
    the parameters passed to vma_merge() remain the same.

    Create a new function, vma_modify(), which abstracts this operation,
    accepting only those parameters which can be changed.

    To avoid the mess of invoking each function call with unnecessary
    parameters, create inline wrapper functions for each of the modify
    operations, parameterised only by what is required to perform the action.

    We can also significantly simplify the logic - by returning the VMA if we
    split (or merged VMA if we do not) we no longer need specific handling for
    merge/split cases in any of the call sites.

    Note that the userfaultfd_release() case works even though it does not
    split VMAs - since start is set to vma->vm_start and end is set to
    vma->vm_end, the split logic does not trigger.

    In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start
    - vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this
    instance, this invocation will remain unchanged.

    We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
    vma_merge() correctly ensures that flags remain the same, something that is
    already checked in is_mergeable_vma() and elsewhere, and in any case is not
    specific to mprotect().

    Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:51 -05:00
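
A hedged sketch of the call-site change described above, loosely modelled on
mprotect_fixup(); exact signatures and labels vary by tree, and "merged",
"error", "fail"/"success" are illustrative:
----
    /* Before: each caller open-coded merge-then-split with ~10 arguments. */
    pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
    merged = vma_merge(vmi, mm, *pprev, start, end, newflags, vma->anon_vma,
                       vma->vm_file, pgoff, vma_policy(vma),
                       vma->vm_userfaultfd_ctx, anon_vma_name(vma));
    if (merged) {
        vma = *pprev = merged;
        goto success;
    }
    *pprev = vma;
    if (start != vma->vm_start && (error = split_vma(vmi, vma, start, 1)))
        goto fail;
    if (end != vma->vm_end && (error = split_vma(vmi, vma, end, 0)))
        goto fail;

    /* After: one wrapper, parameterised only by what actually changes;
     * it returns the split (or merged) vma, or an ERR_PTR() on failure. */
    vma = vma_modify_flags(vmi, *pprev, vma, start, end, newflags);
    if (IS_ERR(vma)) {
        error = PTR_ERR(vma);
        goto fail;
    }
    *pprev = vma;
----
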
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
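
The shape of the conversion, as a hedged sketch (handle_dirty() is an
invented stand-in for whatever the caller does with the entry):
----
    /* Before: every use dereferenced the pte pointer directly. */
    if (pte_present(*ptep) && pte_dirty(*ptep))
        handle_dirty(ptep);

    /* After: a single ptep_get() - a READ_ONCE() by default, overridable
     * by the arch - with the result cached in a local variable. */
    pte_t ptent = ptep_get(ptep);

    if (pte_present(ptent) && pte_dirty(ptent))
        handle_dirty(ptep);
----
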
Nico Pache 0f34ae86f8 mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once
commit 60081bf19b0ec8fa40c589bd361fa2bc763f1050
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:22 2023 -0700

    mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

    Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
    not obvious and makes it hard to understand where vma locking is happening.
    Also in some cases (like in dup_userfaultfd()) vma should be locked earlier
    than vma_flags modification. To make locking more visible, change these
    functions to assert that the vma write lock is taken and explicitly lock
    the vma beforehand. Fix userfaultfd functions which should lock the vma
    earlier.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-5-surenb@google.com
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:30 -06:00
Nico Pache 0b91dbac20 mm: enable page walking API to lock vmas during the walk
commit 49b0638502da097c15d46cd4e871dbaa022caf7c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:19 2023 -0700

    mm: enable page walking API to lock vmas during the walk

    walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas.  Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks.  With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them.  The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration.  Without this change a concurrent page
    can be faulted into the area and be left out of migration.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Suggested-by: Jann Horn <jannh@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:27 -06:00
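
A hedged sketch of how a walker opts in to the new locking member (the
callback name my_pte_range and the call site are illustrative):
----
    static const struct mm_walk_ops my_walk_ops = {
        .pmd_entry = my_pte_range,      /* hypothetical callback */
        .walk_lock = PGWALK_WRLOCK,     /* write-lock each vma as it is walked */
    };

    /* ... */
    mmap_write_lock(mm);
    err = walk_page_range(mm, start, end, &my_walk_ops, private);
    mmap_write_unlock(mm);
----
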
Nico Pache 8e7697e229 mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()
commit 2658f94d679243209889cdfa8de3743cde1abea9
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Tue Jul 11 13:50:20 2023 -0400

    mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()

    apply_vma_lock_flags() calls mlock_fixup(), which could merge the VMA
    after where the vma iterator is located.  Although this is not an issue,
    the next iteration of the loop will check the start of the vma to be equal
    to the locally saved 'tmp' variable and cause an incorrect failure
    scenario.  Fix the error by setting tmp to the end of the vma iterator
    value before restarting the loop.

    There is also a potential of the error code being overwritten when the
    loop terminates early.  Fix the return issue by directly returning when an
    error is encountered since there is nothing to undo after the loop.

    Link: https://lkml.kernel.org/r/20230711175020.4091336-1-Liam.Howlett@oracle.com
    Fixes: 37598f5a9d8b ("mlock: convert mlock to vma iterator")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: Ryan Roberts <ryan.roberts@arm.com>
      Link: https://lore.kernel.org/linux-mm/50341ca1-d582-b33a-e3d0-acb08a65166f@arm.com/
    Tested-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
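
A hedged sketch of the fixed loop (simplified, not the whole function):
return the error directly, and refresh the cursor from the iterator because
mlock_fixup() may have merged this vma with the next one.
----
    for_each_vma_range(vmi, vma, end) {
        vm_flags_t newflags;

        if (vma->vm_start != tmp)
            return -ENOMEM;

        newflags = (vma->vm_flags & ~VM_LOCKED_MASK) | flags;
        tmp = min(vma->vm_end, end);
        error = mlock_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
        if (error)
            return error;               /* nothing to undo after the loop */

        tmp = vma_iter_end(&vmi);       /* the end may have moved on a merge */
        nstart = tmp;
    }
----
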
Chris von Recklinghausen ea45f65a4f mm: mlock: use folios_put() in mlock_folio_batch()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 2bd7f621130b47cab8bed82234cac1f9f105efb7
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Thu Apr 6 00:18:54 2023 +0800

    mm: mlock: use folios_put() in mlock_folio_batch()

    Since we have updated mlock to use folios, it's better to call
    folios_put() instead of calling release_pages() directly.

    Link: https://lkml.kernel.org/r/20230405161854.6931-2-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:55 -04:00
Aristeu Rozanski 33b59dcee4 mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 601c3c29dbeb049862faa00917f2daf094a71028
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Tue Jan 31 16:01:16 2023 -0800

    mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates

    Provide vm_flags_reset_once() and replace the vm_flags updates which used
    WRITE_ONCE() to prevent compiler optimizations.

    Link: https://lkml.kernel.org/r/20230201000116.1333160-1-surenb@google.com
    Fixes: 0cce31a0aa0e ("mm: replace vma->vm_flags direct modifications with modifier calls")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:18 -04:00
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
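
The flavour of the conversion, sketched below; the vm_flags_*() helpers are
the ones the series adds, while the surrounding lines are illustrative:
----
    /* Direct modification, now disallowed because it bypasses vma locking
     * and flag-change tracking: */
    vma->vm_flags |= VM_LOCKED;
    vma->vm_flags &= ~VM_LOCKED;

    /* Modifier calls that take the vma write lock and keep tracking intact: */
    vm_flags_set(vma, VM_LOCKED);            /* set bits                */
    vm_flags_clear(vma, VM_LOCKED);          /* clear bits              */
    vm_flags_reset(vma, newflags);           /* replace the whole word  */
    vm_flags_init(vma, VM_READ | VM_WRITE);  /* init a new, detached vma */
----
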
Aristeu Rozanski fec82fff3c mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:48 2023 -0800

    mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

    To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace
    it with VM_LOCKED_MASK bitmask and convert all users.

    Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
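
In sketch form, the one-line change at each user:
----
    /* Before: a "bits to keep" mask, awkward with the new helpers */
    vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

    /* After: a "bits to clear" mask (VM_LOCKED | VM_LOCKONFAULT) */
    vm_flags_clear(vma, VM_LOCKED_MASK);
----
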
Aristeu Rozanski 7b2b9fac55 mm: switch vma_merge(), split_vma(), and __split_vma to vma iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9760ebffbf5507320e0de41f5b80089bdef996a0
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:30 2023 -0500

    mm: switch vma_merge(), split_vma(), and __split_vma to vma iterator

    Drop the vmi_* functions and transition all users to use the vma iterator
    directly.

    Link: https://lkml.kernel.org/r/20230120162650.984577-30-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:15 -04:00
Aristeu Rozanski 812fbb3101 mlock: convert mlock to vma iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 37598f5a9d8b63b91cce0cb6bac5f6374ed1bb80
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:19 2023 -0500

    mlock: convert mlock to vma iterator

    Use the vma iterator so that the iterator can be invalidated or updated to
    avoid each caller doing so.

    Link: https://lkml.kernel.org/r/20230120162650.984577-19-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:14 -04:00
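
A hedged sketch of the iterator-based loop shape mlock moves to (variables
and bounds handling simplified):
----
    struct vm_area_struct *vma, *prev;
    VMA_ITERATOR(vmi, current->mm, start);

    for_each_vma_range(vmi, vma, end) {
        /* compute newflags for the [nstart, tmp) slice of this vma ... */
        error = mlock_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
        if (error)
            break;
        /* the iterator is updated or invalidated inside mlock_fixup() */
    }
----
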
Aristeu Rozanski 96cb17f8b1 mm: remove mlock_vma_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7efecffb8e7968c4a6c53177b0053ca4765fe233
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:25 2023 +0000

    mm: remove mlock_vma_page()

    All callers now have a folio and can call mlock_vma_folio().  Update the
    documentation to refer to mlock_vma_folio().

    Link: https://lkml.kernel.org/r/20230116192827.2146732-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 067fb10657 mm: mlock: update the interface to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 96f97c438f61ddba94117dcd1a1eb0aaafa22309
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Thu Jan 12 12:39:31 2023 +0000

    mm: mlock: update the interface to use folios

    Update the mlock interface to accept folios rather than pages, bringing
    the interface in line with the internal implementation.

    munlock_vma_page() still requires a page_folio() conversion, however this
    is consistent with the existent mlock_vma_page() implementation and a
    product of rmap still dealing in pages rather than folios.

    Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
Aristeu Rozanski 4d7f95c903 mm: mlock: use folios and a folio batch internally
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 90d07210ab55e458c87048e1ad55582ecff0a3d5
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Thu Jan 12 12:39:29 2023 +0000

    mm: mlock: use folios and a folio batch internally

    This brings mlock in line with the folio batches declared in mm/swap.c and
    makes the code more consistent across the two.

    The existing mechanism for identifying which operation each folio in the
    batch is undergoing is maintained, i.e.  using the lower 2 bits of the
    struct folio address (previously struct page address).  This should
    continue to function correctly as folios remain at least system
    word-aligned.

    All invocations of mlock() pass either a non-compound page or the head of
    a THP-compound page and no tail pages need updating so this functionality
    works with struct folios being used internally rather than struct pages.

    In this patch the external interface is kept identical to before in order
    to maintain separation between patches in the series, using a rather
    awkward conversion from struct page to struct folio in relevant functions.

    However, this maintenance of the existing interface is intended to be
    temporary - the next patch in the series will update the interfaces to
    accept folios directly.

    Link: https://lkml.kernel.org/r/9f894d54d568773f4ed3cb0eef5f8932f62c95f4.1673526881.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
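
The low-bit tagging mentioned above, shown as a generic self-contained
sketch (names invented): because folios are at least word-aligned, the two
low bits of each batch entry are free to record which operation it is
queued for.
----
    #include <assert.h>
    #include <stdint.h>

    #define OP_LRU   0x1UL              /* move entry onto the LRU        */
    #define OP_NEW   0x2UL              /* entry is a newly mlocked folio */
    #define OP_MASK  (OP_LRU | OP_NEW)

    struct folio;                       /* opaque; word-aligned in practice */

    static inline struct folio *tag_folio(struct folio *f, unsigned long op)
    {
        assert(((uintptr_t)f & OP_MASK) == 0);   /* relies on alignment */
        return (struct folio *)((uintptr_t)f | op);
    }

    static inline unsigned long folio_op(struct folio *f)
    {
        return (uintptr_t)f & OP_MASK;
    }

    static inline struct folio *untag_folio(struct folio *f)
    {
        return (struct folio *)((uintptr_t)f & ~OP_MASK);
    }
----
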
Chris von Recklinghausen 734d96299d mm/mlock: drop dead code in count_mm_mlocked_page_nr()
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 66071896cdfe096fcd4aef55a5efbd5216fa15de
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Wed Jun 15 17:40:58 2022 +0000

    mm/mlock: drop dead code in count_mm_mlocked_page_nr()

    The check for mm being null has never been needed since the only caller
    has always passed in current->mm.  Remove the check from
    count_mm_mlocked_page_nr().

    Link: https://lkml.kernel.org/r/20220615174050.738523-1-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:58 -04:00
Chris von Recklinghausen 807ff5f197 mm/mlock: use vma iterator and maple state instead of vma linked list
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 33108b05f39b78137c38c677b7a2d0fb7defed14
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Sep 6 19:49:02 2022 +0000

    mm/mlock: use vma iterator and maple state instead of vma linked list

    Handle overflow checking in count_mm_mlocked_page_nr() differently.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-58-Liam.Howlett@oracle.com
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:54 -04:00
Chris von Recklinghausen 0e4b299323 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7780d04046a2288ab85d88bedacc60fa4fad9971
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:17:26 2023 -0700

    mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails

    Simple walk_page_range() users should set ACTION_AGAIN to retry when
    pte_offset_map_lock() fails.

    No need to check pmd_trans_unstable(): that was precisely to avoid the
    possibility of calling pte_offset_map() on a racily removed or inserted THP
    entry, but such cases are now safely handled inside it.  Likewise there is
    no need to check pmd_none() or pmd_bad() before calling it.

    Link: https://lkml.kernel.org/r/c77d9d10-3aad-e3ce-4896-99e91c7947f3@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: SeongJae Park <sj@kernel.org> for mm/damon part
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:14 -04:00
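
A hedged sketch of the pattern this commit installs in simple walkers (the
callback name is illustrative):
----
    static int my_pte_range(pmd_t *pmd, unsigned long addr,
                            unsigned long end, struct mm_walk *walk)
    {
        pte_t *start_pte, *pte;
        spinlock_t *ptl;

        start_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        if (!pte) {
            /* the pmd changed under us (e.g. THP collapse): retry it */
            walk->action = ACTION_AGAIN;
            return 0;
        }

        for (; addr != end; pte++, addr += PAGE_SIZE) {
            pte_t ptent = ptep_get(pte);
            /* ... act on ptent ... */
        }

        pte_unmap_unlock(start_pte, ptl);
        return 0;
    }
----
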
Chris von Recklinghausen 6697b528b0 mm: handling Non-LRU pages returned by vm_normal_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3218f8712d6bba1812efd5e0d66c1e15134f2a91
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:11 2022 -0500

    mm: handling Non-LRU pages returned by vm_normal_pages

    With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
    device-managed anonymous pages that are not LRU pages.  Although they
    behave like normal pages for purposes of mapping in CPU page tables and
    for COW, they do not support LRU lists, NUMA migration or THP.

    Callers to follow_page() currently don't expect ZONE_DEVICE pages,
    however, with DEVICE_COHERENT we might now return ZONE_DEVICE.  Check for
    ZONE_DEVICE pages in applicable users of follow_page() as well.

    Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>       [v2]
    Reviewed-by: Alistair Popple <apopple@nvidia.com>       [v6]
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
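
The check added at vm_normal_page()/follow_page() call sites, sketched as a
fragment from inside a pte-walking loop (simplified):
----
    page = vm_normal_page(vma, addr, ptent);
    if (!page || is_zone_device_page(page))
        continue;   /* device-coherent pages are not on the LRU: skip them
                       for mlock, NUMA balancing and THP collapse */
----
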
Chris von Recklinghausen 78ce396a88 mm/mlock: fix two bugs in user_shm_lock()
Bugzilla: https://bugzilla.redhat.com/2120352

commit e97824ff663ce3509fe040431c713182c2f058b1
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 16:09:18 2022 +0800

    mm/mlock: fix two bugs in user_shm_lock()

    user_shm_lock forgets to set allowed to 0 when get_ucounts fails, so the
    later user_shm_unlock might do an extra dec_rlimit_ucounts. Also in the
    RLIM_INFINITY case, user_shm_lock will succeed regardless of the value of
    memlock, where memlock == LONG_MAX && !capable(CAP_IPC_LOCK) should fail.
    Fix all of these by changing the code to leave lock_limit at ULONG_MAX aka
    RLIM_INFINITY, leave "allowed" initialized to 0 and remove the special case
    of RLIM_INFINITY as nothing can be greater than ULONG_MAX.

    Credit goes to Eric W. Biederman for proposing simplifying the code and
    thus catching the later bug.

    Fixes: d7c9e99aee ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: stable@vger.kernel.org
    v1: https://lkml.kernel.org/r/20220310132417.41189-1-linmiaohe@huawei.com
    v2: https://lkml.kernel.org/r/20220314064039.62972-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220322080918.59861-1-linmiaohe@huawei.com
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:57 -04:00
Chris von Recklinghausen 71958349db mm: refactor vm_area_struct::anon_vma_name usage code
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5c26f6ac9416b63d093e29c30e79b3297e425472
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Mar 4 20:28:51 2022 -0800

    mm: refactor vm_area_struct::anon_vma_name usage code

    Avoid mixing strings and their anon_vma_name referenced pointers by
    using struct anon_vma_name whenever possible.  This simplifies the code
    and allows easier sharing of anon_vma_name structures when they
    represent the same name.

    [surenb@google.com: fix comment]

    Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Colin Cross <ccross@google.com>
    Cc: Sumit Semwal <sumit.semwal@linaro.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Alexey Gladkov <legion@kernel.org>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Chris Hyser <chris.hyser@oracle.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Cyrill Gorcunov <gorcunov@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00
Chris von Recklinghausen 70649ff1fb mm: add a field to store names for private anonymous memory
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9a10064f5625d5572c3626c1516e0bebc6c9fe9b
Author: Colin Cross <ccross@google.com>
Date:   Fri Jan 14 14:05:59 2022 -0800

    mm: add a field to store names for private anonymous memory

    In many userspace applications, and especially in VM based applications
    like Android uses heavily, there are multiple different allocators in
    use.  At a minimum there is libc malloc and the stack, and in many cases
    there are libc malloc, the stack, direct syscalls to mmap anonymous
    memory, and multiple VM heaps (one for small objects, one for big
    objects, etc.).  Each of these layers usually has its own tools to
    inspect its usage; malloc by compiling a debug version, the VM through
    heap inspection tools, and for direct syscalls there is usually no way
    to track them.

    On Android we heavily use a set of tools that use an extended version of
    the logic covered in Documentation/vm/pagemap.txt to walk all pages
    mapped in userspace and slice their usage by process, shared (COW) vs.
    unique mappings, backing, etc.  This can account for real physical
    memory usage even in cases like fork without exec (which Android uses
    heavily to share as many private COW pages as possible between
    processes), Kernel SamePage Merging, and clean zero pages.  It produces
    a measurement of the pages that only exist in that process (USS, for
    unique), and a measurement of the physical memory usage of that process
    with the cost of shared pages being evenly split between processes that
    share them (PSS).

    If all anonymous memory is indistinguishable then figuring out the real
    physical memory usage (PSS) of each heap requires either a pagemap
    walking tool that can understand the heap debugging of every layer, or
    for every layer's heap debugging tools to implement the pagemap walking
    logic, in which case it is hard to get a consistent view of memory
    across the whole system.

    Tracking the information in userspace leads to all sorts of problems.
    It either needs to be stored inside the process, which means every
    process has to have an API to export its current heap information upon
    request, or it has to be stored externally in a filesystem that somebody
    needs to clean up on crashes.  It needs to be readable while the process
    is still running, so it has to have some sort of synchronization with
    every layer of userspace.  Efficiently tracking the ranges requires
    reimplementing something like the kernel vma trees, and linking to it
    from every layer of userspace.  It requires more memory, more syscalls,
    more runtime cost, and more complexity to separately track regions that
    the kernel is already tracking.

    This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
    userspace-provided name for anonymous vmas.  The names of named
    anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
    [anon:<name>].

    Userspace can set the name for a region of memory by calling

       prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)

    Setting the name to NULL clears it.  The name length limit is 80 bytes
    including NUL-terminator and is checked to contain only printable ascii
    characters (including space), except '[',']','\','$' and '`'.

    Ascii strings are being used to have descriptive identifiers for vmas,
    which can be understood by the users reading /proc/pid/maps or
    /proc/pid/smaps.  Names can be standardized for a given system and they
    can include some variable parts such as the name of the allocator or a
    library, tid of the thread using it, etc.

    The name is stored in a pointer in the shared union in vm_area_struct
    that points to a null terminated string.  Anonymous vmas with the same
    name (equivalent strings) that are otherwise mergeable will be merged.
    The name pointers are not shared between vmas even if they contain the
    same name.  The name pointer is stored in a union with fields that are
    only used on file-backed mappings, so it does not increase memory usage.

    CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
    feature.  It keeps the feature disabled by default to prevent any
    additional memory overhead and to avoid confusing procfs parsers on
    systems which are not ready to support named anonymous vmas.

    The patch is based on the original patch developed by Colin Cross, more
    specifically on its latest version [1] posted upstream by Sumit Semwal.
    It used a userspace pointer to store vma names.  In that design, name
    pointers could be shared between vmas.  However during the last
    upstreaming attempt, Kees Cook raised concerns [2] about this approach
    and suggested to copy the name into kernel memory space, perform
    validity checks [3] and store as a string referenced from
    vm_area_struct.

    One big concern is about fork() performance which would need to strdup
    anonymous vma names.  Dave Hansen suggested experimenting with
    worst-case scenario of forking a process with 64k vmas having longest
    possible names [4].  I ran this experiment on an ARM64 Android device
    and recorded a worst-case regression of almost 40% when forking such a
    process.

    This regression is addressed in the followup patch which replaces the
    pointer to a name with a refcounted structure that allows sharing the
    name pointer between vmas of the same name.  Instead of duplicating the
    string during fork() or when splitting a vma it increments the refcount.

    [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
    [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
    [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
    [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/

    Changes for prctl(2) manual page (in the options section):

    PR_SET_VMA
            Sets an attribute specified in arg2 for virtual memory areas
            starting from the address specified in arg3 and spanning the
            size specified  in arg4. arg5 specifies the value of the attribute
            to be set. Note that assigning an attribute to a virtual memory
            area might prevent it from being merged with adjacent virtual
            memory areas due to the difference in that attribute's value.

            Currently, arg2 must be one of:

            PR_SET_VMA_ANON_NAME
                    Set a name for anonymous virtual memory areas. arg5 should
                    be a pointer to a null-terminated string containing the
                    name. The name length including null byte cannot exceed
                    80 bytes. If arg5 is NULL, the name of the appropriate
                    anonymous virtual memory areas will be reset. The name
                    can contain only printable ascii characters (including
                    space), except '[',']','\','$' and '`'.

                    This feature is available only if the kernel is built with
                    the CONFIG_ANON_VMA_NAME option enabled.

    [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
      Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
    [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
     added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
     work here was done by Colin Cross, therefore, with his permission, keeping
     him as the author]

    Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
    Signed-off-by: Colin Cross <ccross@google.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Jan Glauber <jan.glauber@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Stultz <john.stultz@linaro.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rob Landley <rob@landley.net>
    Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
    Cc: Shaohua Li <shli@fusionio.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
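
A runnable userspace sketch of the interface described above. It needs a
kernel built with CONFIG_ANON_VMA_NAME; the prctl constants may be missing
from older headers, hence the fallback defines.
----
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA 0x53564d41
    #define PR_SET_VMA_ANON_NAME 0
    #endif

    int main(void)
    {
        size_t len = 4 * 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* name: printable ascii, at most 80 bytes including the NUL */
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, "my heap"))
            perror("PR_SET_VMA_ANON_NAME");  /* EINVAL if unsupported */

        /* the region now appears as "[anon:my heap]" in /proc/self/maps */
        char line[256];
        FILE *f = fopen("/proc/self/maps", "r");
        while (f && fgets(line, sizeof(line), f))
            if (strstr(line, "[anon:"))
                fputs(line, stdout);
        if (f)
            fclose(f);
        return 0;
    }
----
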
Waiman Long e462accf60 mm/munlock: protect the per-CPU pagevec by a local_lock_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
Conflicts: A minor fuzz in mm/migrate.c due to missing upstream commit
	   1eba86c096e3 ("mm: change page type prior to adding page
	   table entry"). Pulling it, however, will require taking in
	   a number of additional patches. So it is not done here.

commit adb11e78c5dc5e26774acb05f983da36447f7911
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri, 1 Apr 2022 11:28:33 -0700

    mm/munlock: protect the per-CPU pagevec by a local_lock_t

    The access to mlock_pvec is protected by disabling preemption via
    get_cpu_var() or implicitly by having preemption disabled by the caller
    (in mlock_page_drain() case).  This breaks on PREEMPT_RT since
    folio_lruvec_lock_irq() acquires a sleeping lock in this section.

    Create struct mlock_pvec which consists of the local_lock_t and the
    pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
    Replace mlock_page_drain() with a _local() version which is invoked on
    the local CPU and acquires the local_lock_t and a _remote() version
    which uses the pagevec from a remote CPU which is offline.

    Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-21 14:50:55 -04:00
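
An abridged, hedged sketch of the structure this commit introduces; the
drain helper shown is simplified:
----
    struct mlock_pvec {
        local_lock_t lock;
        struct pagevec vec;
    };

    static DEFINE_PER_CPU(struct mlock_pvec, mlock_pvec) = {
        .lock = INIT_LOCAL_LOCK(lock),
    };

    void mlock_page_drain_local(void)
    {
        struct pagevec *pvec;

        local_lock(&mlock_pvec.lock);        /* a sleeping lock on PREEMPT_RT */
        pvec = this_cpu_ptr(&mlock_pvec.vec);
        if (pagevec_count(pvec))
            mlock_pagevec(pvec);             /* drain under the local lock */
        local_unlock(&mlock_pvec.lock);
    }
----
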
Aristeu Rozanski 99cfd73d88 mm/mlock: Add mlock_vma_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit dcc5d337c5e62761ee71f2e25c7aa890b1aa41a2
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 15 13:33:59 2022 -0500

    mm/mlock: Add mlock_vma_folio()

    Convert mlock_page() into mlock_folio() and convert the callers.  Keep
    mlock_vma_page() as a wrapper.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski cd396ce107 mm/munlock: mlock_page() munlock_page() batch by pagevec
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2fbb0c10d1e8222604132b3a3f81bfd8345a44b6
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:37:29 2022 -0800

    mm/munlock: mlock_page() munlock_page() batch by pagevec

    A weakness of the page->mlock_count approach is the need for lruvec lock
    while holding page table lock.  That is not an overhead we would allow on
    normal pages, but I think acceptable just for pages in an mlocked area.
    But let's try to amortize the extra cost by gathering on per-cpu pagevec
    before acquiring the lruvec lock.

    I have an unverified conjecture that the mlock pagevec might work out
    well for delaying the mlock processing of new file pages until they have
    got off lru_cache_add()'s pagevec and on to LRU.

    The initialization of page->mlock_count is subject to races and awkward:
    0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
    this commit, which just widens the window?  I haven't gone back to think
    it through.  Maybe someone can point out a better way to initialize it.

    Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
    into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
    rather than lru_cache_add()'s pagevec.

    Experimented with various orderings: the right thing seems to be for
    mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
    pagevec, but munlock_page() to leave TestClearPageMlocked to the later
    pagevec processing.

    Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
    their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS()
    for that.

    This still leaves acquiring lruvec locks under page table lock each time
    the pagevec fills (or a THP is added): which I suppose is rather silly,
    since they sit on pagevec waiting to be processed long after page table
    lock has been dropped; but I'm disinclined to uglify the calling sequence
    until some load shows an actual problem with it (nothing wrong with
    taking lruvec lock under page table lock, just "nicer" to do it less).

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 7d43d2ba0b mm/munlock: mlock_pte_range() when mlocking or munlocking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 34b6792380ce4f4b41018351cd67c9c26f4a7a0d
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:31:48 2022 -0800

    mm/munlock: mlock_pte_range() when mlocking or munlocking

    Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
    required to lower the mlock_counts when munlocking without munmapping;
    and its complement, implementation of mlock_vma_pages_range(), required
    to raise the mlock_counts on pages already there when a range is mlocked.

    Combine them into just the one function mlock_vma_pages_range(), using
    walk_page_range() to run mlock_pte_range().  This approach fixes the
    "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
    https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/

    Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
    a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
    pte first, it will not try to munlock the page: leaving release_pages()
    to correct it when the last reference to the page is gone - that's okay,
    a page is not evictable anyway while it is held by an extra reference.

    Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
    a racing remove_migration_pte() or try_to_unmap_one() (depending on
    i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
    then mlock_pte_range() mlocks it a second time.  This is harder to
    reproduce, but a more serious race because it could leave the page
    unevictable indefinitely though the area is munlocked afterwards.
    Guard against it by setting the (inappropriate) VM_IO flag,
    and modifying mlock_vma_page() to decline such vmas.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 17d5d29247 mm/munlock: maintain page->mlock_count while unevictable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 07ca760673088f262da57ff42c15558688565aa2
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:29:54 2022 -0800

    mm/munlock: maintain page->mlock_count while unevictable

    Previous patches have been preparatory: now implement page->mlock_count.
    The ordering of the "Unevictable LRU" is of no significance, and there is
    no point holding unevictable pages on a list: place page->mlock_count to
    overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
    which needs to be even so as not to satisfy PageTail - though 2 could be
    added instead of 1 for each mlock, if that's ever an improvement).

    But it's only safe to rely on or modify page->mlock_count while lruvec
    lock is held and page is on unevictable "LRU" - we can save lots of edits
    by continuing to pretend that there's an imaginary LRU here (there is an
    unevictable count which still needs to be maintained, but not a list).

    The mlock_count technique suffers from an unreliability much like with
    page_mlock(): while someone else has the page off LRU, not much can
    be done.  As before, err on the safe side (behave as if mlock_count 0),
    and let try_to_unlock_one() move the page to unevictable if reclaim finds
    out later on - a few misplaced pages don't matter, what we want to avoid
    is imbalancing reclaim by flooding evictable lists with unevictable pages.

    I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
    if we have taken lruvec lock to get the page off its present list, then
    we save everyone trouble (and however many extra atomic ops) by putting
    it on its destination list immediately.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 3856990130 mm/munlock: replace clear_page_mlock() by final clearance
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit b109b87050df5438ee745b2bddfa3587970025bb
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:28:05 2022 -0800

    mm/munlock: replace clear_page_mlock() by final clearance

    Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
    of the munlocking to clear_page_mlock(), since PageMlocked is typically
    still set when mapcount has fallen to 0.  That is not what we want: we
    want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
    on the integrity of the mlock/munlock protocol - small numbers are
    not surprising, but big numbers mean the protocol is not working.

    That could be easily fixed by placing munlock_vma_page() at the start of
    page_remove_rmap(); but later in the series we shall want to batch the
    munlocking, and that too would tend to leave PageMlocked still set at
    the point when it is checked.

    So delete clear_page_mlock() now: leave it instead to release_pages()
    (and __page_cache_release()) to do this backstop clearing of Mlocked,
    when page refcount has fallen to 0.  If a pinned page occasionally gets
    counted as Mlocked and Unevictable until it is unpinned, that's okay.

    A slightly regrettable side-effect of this change is that, since
    release_pages() and __page_cache_release() may be called at interrupt
    time, those places which update NR_MLOCK with interrupts enabled
    had better use mod_zone_page_state() than __mod_zone_page_state()
    (but holding the lruvec lock always has interrupts disabled).
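
    The backstop then looks roughly like this in release_pages() and
    __page_cache_release() (a sketch; note mod_zone_page_state() rather
    than the __ variant, per the interrupt-context point above):

        if (unlikely(PageMlocked(page))) {
                int nr_pages = thp_nr_pages(page);

                __ClearPageMlocked(page);
                mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
                count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
        }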

    This change, forcing Mlocked off when refcount 0 instead of earlier
    when mapcount 0, is not fundamental: it can be reversed if performance
    or something else is found to suffer; but this is the easiest way to
    separate the stats - let's not complicate that without good reason.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4b2aa38f6e mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due to the lack of f4c4a3f484, and differences due to RHEL-only 44740bc20b

commit cea86fe246b694a191804b47378eb9d77aefabec
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:26:39 2022 -0800

    mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

    Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
    inline functions which check (vma->vm_flags & VM_LOCKED) before calling
    mlock_page() and munlock_page() in mm/mlock.c.

    Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
    because we have understandable difficulty in accounting pte maps of THPs,
    and if passed a PageHead page, mlock_page() and munlock_page() cannot
    tell whether it's a pmd map to be counted or a pte map to be ignored.
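
    A minimal sketch of the shape of these wrappers (simplified from what
    this commit adds to mm/internal.h):

        void mlock_page(struct page *page);
        static inline void mlock_vma_page(struct page *page,
                        struct vm_area_struct *vma, bool compound)
        {
                /* Only VM_LOCKED vmas mlock; count pmd maps of THPs, skip pte maps */
                if (unlikely(vma->vm_flags & VM_LOCKED) &&
                    (compound || !PageTransCompound(page)))
                        mlock_page(page);
        }

        void munlock_page(struct page *page);
        static inline void munlock_vma_page(struct page *page,
                        struct vm_area_struct *vma, bool compound)
        {
                if (unlikely(vma->vm_flags & VM_LOCKED) &&
                    (compound || !PageTransCompound(page)))
                        munlock_page(page);
        }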

    Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
    others, and use that to call mlock_vma_page() at the end of the page
    adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
    beginning? unimportant, but end was easier for assertions in testing).

    No page lock is required (although almost all adds happen to hold it):
    delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
    Certainly page lock did serialize with page migration, but I'm having
    difficulty explaining why that was ever important.

    Mlock accounting on THPs has been hard to define, differed between anon
    and file, involved PageDoubleMap in some places and not others, required
    clear_page_mlock() at some points.  Keep it simple now: just count the
    pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

    page_add_new_anon_rmap() callers unchanged: they have long been calling
    lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
    handling (it also checks for not VM_SPECIAL: I think that's overcautious,
    and inconsistent with other checks, that mmap_region() already prevents
    VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4ba8fd7ec7 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due to the missing prototype for pmd_install()

commit a213e5cf71cbcea4b23caedcb8fe6629a333b275
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:23:29 2022 -0800

    mm/munlock: delete munlock_vma_pages_all(), allow oomreap

    munlock_vma_pages_range() will still be required, when munlocking but
    not munmapping a set of pages; but when unmapping a pte, the mlock count
    will be maintained in much the same way as it will be maintained when
    mapping in the pte.  Which removes the need for munlock_vma_pages_all()
    on mlocked vmas when munmapping or exiting: eliminating the catastrophic
    contention on i_mmap_rwsem, and the need for page lock on the pages.

    There is still a need to update locked_vm accounting according to the
    munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
    exit_mmap() does not need locked_vm updates, so delete unlock_range().
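
    The locked_vm update in detach_vmas_to_be_unmapped() is then a simple
    per-vma subtraction, along these lines (sketch):

        /* for each vma being detached for munmap */
        if (vma->vm_flags & VM_LOCKED)
                mm->locked_vm -= vma_pages(vma);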

    And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
    because of the uncertainty in blocking on all those page locks?
    No fear of that now, so permit the OOM reaper on mlocked vmas.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski f7bbad4076 mm/munlock: delete page_mlock() and all its works
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit ebcbc6ea7d8a604ad8504dae70a6ac1b1e64a0b7
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:20:24 2022 -0800

    mm/munlock: delete page_mlock() and all its works

    We have recommended some applications to mlock their userspace, but that
    turns out to be counter-productive: when many processes mlock the same
    file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
    is needed for write, to remove any vma mapping that file from rmap's tree;
    but hogged for read by those with mlocks calling page_mlock() (formerly
    known as try_to_munlock()) on *each* page mapped from the file (the
    purpose being to find out whether another process has the page mlocked,
    so therefore it should not be unmlocked yet).

    Several optimizations have been made in the past: one is to skip
    page_mlock() when mapcount tells that nothing else has this page
    mapped; but that doesn't help at all when others do have it mapped.
    This time around, I initially intended to add a preliminary search
    of the rmap tree for overlapping VM_LOCKED ranges; but that gets
    messy with locking order, when in doubt whether a page is actually
    present; and risks adding even more contention on the i_mmap_rwsem.

    A solution would be much easier, if only there were space in struct page
    for an mlock_count... but actually, most of the time, there is space for
    it - an mlocked page spends most of its life on an unevictable LRU, but
    since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
    been redundant.  Let's try to reuse its page->lru.

    But leave that until a later patch: in this patch, clear the ground by
    removing page_mlock(), and all the infrastructure that has gathered
    around it - which mostly hinders understanding, and will make reviewing
    new additions harder.  Don't mind those old comments about THPs, they
    date from before 4.5's refcounting rework: splitting is not a risk here.

    Just keep a minimal version of munlock_vma_page(), as reminder of what it
    should attend to (in particular, the odd way PGSTRANDED is counted out of
    PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
    unchanged __mlock_posix_error_return() out of the way, down to above its
    caller: this series then makes no further change after mlock_fixup().

    After this and each following commit, the kernel builds, boots and runs;
    but with deficiencies which may show up in testing of mlock and munlock.
    The system calls succeed or fail as before, and mlock remains effective
    in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
    may be shown too low after mlock, grow, then stay too high after munlock:
    with previously mlocked pages remaining unevictable for too long, until
    finally unmapped and freed and counts corrected. Normal service will be
    resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski dc03a02b92 mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 0de340cbed3359423e38ed49242ac9d6986b5cfd
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Jun 29 22:27:31 2021 -0400

    mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave()

    These are the folio equivalents of relock_page_lruvec_irq() and
    folio_lruvec_relock_irqsave().  Also convert page_matches_lruvec()
    to folio_matches_lruvec().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Rafael Aquini 080e1a43f7 mm/mlock: fix potential imbalanced rlimit ucounts adjustment
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 5c2a956c3eea173b2bc89f632507c0eeaebf6c4a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:56 2022 -0700

    mm/mlock: fix potential imbalanced rlimit ucounts adjustment

    user_shm_lock forgets to set allowed to 0 when get_ucounts fails.  So
    the later user_shm_unlock might do the extra dec_rlimit_ucounts.  Fix
    this by resetting allowed to 0.
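
    The fixed error path in user_shm_lock() looks roughly like this
    (sketch):

        if (!get_ucounts(ucounts)) {
                dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
                allowed = 0;    /* the fix: do not report success to the caller */
                goto out;
        }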

    Link: https://lkml.kernel.org/r/20220310132417.41189-1-linmiaohe@huawei.com
    Fixes: d7c9e99aee ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Cc: Chris Mason <chris.mason@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:45 -04:00
Mike Rapoport 1507f51255 mm: introduce memfd_secret system call to create "secret" memory areas
Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly
enable it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call.  The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

Secretmem is designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
  attack prevention systems) against ROP attacks.  Secretmem makes
  "simple" ROP insufficient to perform exfiltration, which increases the
  required complexity of the attack.  Along with other protections like
  the kernel stack size limit and address space layout randomization which
  make finding gadgets really hard, the absence of any in-kernel primitive
  for accessing secret memory means the one-gadget ROP attack can't work.
  Since the only way to access secret memory is to reconstruct the missing
  mapping entry, the attacker has to recover the physical page and insert
  a PTE pointing to it in the kernel and then retrieve the contents.  That
  takes at least three gadgets which is a level of difficulty beyond most
  standard attacks.

* Prevent cross-process secret userspace memory exposures.  Once the
  secret memory is allocated, the user can't accidentally pass it into the
  kernel to be transmitted somewhere.  The secretmem pages cannot be
  accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws.  In order to access secretmem,
  a kernel-side attack would need to either walk the page tables and
  create new ones, or spawn a new privileged userspace process to perform
  secrets exfiltration using ptrace.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise().  The
file descriptor approach allows explicit and controlled sharing of the
memory areas, and it allows the operations to be sealed.  Besides, file
descriptor based memory paves the way for VMMs to remove the secret memory
range from the userspace hypervisor process, for instance QEMU.  Andy
Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major
  work in the kernel, but getting vma-backed memory into a guest without
  mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because its purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to
the memory.  Nowadays the cost of a new system call is negligible, while it
is far simpler for userspace to deal with a clear-cut system call than with
a multiplexer or an overloaded syscall.  Moreover, the initial
implementation of memfd_secret() is completely distinct from
memfd_create(), so there is not much sense in overloading memfd_create() to
begin with.  If there is ever a need for code sharing between these
implementations, it can easily be achieved without a need to adjust user
visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the kernel,
it will be an opt-in request in the system call flags, so that the user
would have to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e057
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK.  Since these mappings are already locked independently
from mlock(), an attempt to mlock()/munlock() secretmem range would fail
and mlockall()/munlockall() will ignore secretmem mappings.

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space.  With default limits, there is no excessive use of
secretmem and it poses no real problem in combination with
ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
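
A slightly more complete sketch, assuming a kernel with secretmem enabled
and headers new enough to define SYS_memfd_secret (there was no libc
wrapper when this was merged, so the raw syscall is used):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define MAP_SIZE (1UL << 20)

	int main(void)
	{
		int fd = syscall(SYS_memfd_secret, 0);
		if (fd < 0) {
			perror("memfd_secret");	/* e.g. ENOSYS if secretmem is disabled */
			return 1;
		}
		if (ftruncate(fd, MAP_SIZE) < 0) {
			perror("ftruncate");
			return 1;
		}
		char *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		if (ptr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		strcpy(ptr, "not visible via the kernel direct map");
		munmap(ptr, MAP_SIZE);
		close(fd);
		return 0;
	}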

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

[akpm@linux-foundation.org: suppress Kconfig whine]

Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:21 -07:00
Linus Torvalds 71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Alistair Popple cd62734ca6 mm/rmap: split try_to_munlock from try_to_unmap
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used in
different combinations.

TTU_MUNLOCK is one such flag.  However it is exclusively used by
try_to_munlock() which specifies no other flags.  Therefore rather than
overload try_to_unmap_one() with unrelated behaviour, split this out into
its own function and remove the flag.

Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Linus Torvalds c54b245d01 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
2021-06-28 20:39:26 -07:00
Zhiyuan Dai 68d68ff6eb mm/mempool: minor coding style tweaks
Various coding style tweaks to various files under mm/

[daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:27 -07:00
Alexey Gladkov d7c9e99aee Reimplement RLIMIT_MEMLOCK on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:02 -05:00
Miaohe Lin 48b03eea32 mm/mlock: stop counting mlocked pages when none vma is found
There will be no vma satisfying addr < vm_end when find_vma() returns NULL.
Thus it's meaningless to traverse the vma list below because we can't
find any vma to count mlocked pages.  Stop counting mlocked pages in this
case to save some vma list traversal cycles.
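
The early return in count_mm_mlocked_page_nr() then looks roughly like
this (sketch):

	vma = find_vma(mm, start);
	if (vma == NULL)	/* no vma with start < vm_end: nothing mlocked to count */
		return 0;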

Link: https://lkml.kernel.org/r/20210204110705.17586-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:01 -08:00
Yu Zhao 46ae6b2cc2 mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list()
The parameter is redundant in the sense that it can be potentially
extracted from the "struct page" parameter by page_lru(). We need to
make sure that existing PageActive() or PageUnevictable() remains
until the function returns. A few places don't conform, and simple
reordering fixes them.

This patch may have left page_off_lru() seemingly odd, and we'll take
care of it in the next patch.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:33 -08:00
Alexander Duyck 2a5e4e340b mm/lru: introduce relock_page_lruvec()
Add relock_page_lruvec() to replace repeated identical code; no functional
change.

When testing for relock we can avoid the need for RCU locking if we simply
compare the page pgdat and memcg pointers versus those that the lruvec is
holding.  By doing this we can avoid the extra pointer walks and accesses
of the memory cgroup.

In addition we can avoid the checks entirely if lruvec is currently NULL.
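
A simplified sketch of the idea (the real helper uses equivalent lruvec
accessors; details may differ):

	static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
			struct lruvec *locked_lruvec)
	{
		if (locked_lruvec) {
			/* Same pgdat and memcg: the lock already held is the right one */
			if (page_pgdat(page) == lruvec_pgdat(locked_lruvec) &&
			    page_memcg(page) == lruvec_memcg(locked_lruvec))
				return locked_lruvec;

			unlock_page_lruvec_irq(locked_lruvec);
		}
		return lock_page_lruvec_irq(page);	/* slow path: look up and lock */
	}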

[alex.shi@linux.alibaba.com: use page_memcg()]
  Link: https://lkml.kernel.org/r/66d8e79d-7ec6-bfbc-1c82-bf32db3ae5b7@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-19-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Alex Shi 6168d0da2b mm/lru: replace pgdat lru_lock with lruvec lock
This patch moves the per-node lru_lock into the lruvec, thus bringing a
lru_lock for each memcg per node.  So on a large machine, memcgs no longer
have to suffer from per-node pgdat->lru_lock contention; each can go fast
with its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and can
replace the per-node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function in the locking code, which may give some clues
if something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on the modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Hugh Dickins helped on the patch polish, thanks!

[alex.shi@linux.alibaba.com: fix comment typo]
  Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
[alex.shi@linux.alibaba.com: use page_memcg()]
  Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Alex Shi d25b5bd8a8 mm/lru: introduce TestClearPageLRU()
Currently lru_lock still guards both the lru list and the page's lru bit;
that's ok.  But if we want to use a specific lruvec lock for the page, we
need to pin down the page's lruvec/memcg during locking.  Just taking the
lruvec lock first may be undermined by the page's memcg charge/migration.
To fix this problem, we clear the lru bit outside of locking and use it as
a pin-down action to block page isolation during memcg change.

So now the standard steps of page isolation are as follows (see the sketch
after this list):
	1, get_page(); 	       #pin the page so it cannot be freed
	2, TestClearPageLRU(); #block other isolation like memcg change
	3, spin_lock on lru_lock; #serialize lru list access
	4, delete page from lru list;
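
A minimal sketch of that order in code (helper names from this series;
exact signatures vary across these patches):

	get_page(page);				/* 1: pin, so the page cannot be freed */
	if (TestClearPageLRU(page)) {		/* 2: block other isolation / memcg change */
		struct lruvec *lruvec;

		lruvec = lock_page_lruvec_irq(page);		/* 3: serialize lru list access */
		del_page_from_lru_list(page, lruvec, page_lru(page));	/* 4: off the list */
		unlock_page_lruvec_irq(lruvec);
	}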

This patch starts with the first part: TestClearPageLRU, which combines
the PageLRU check and ClearPageLRU into a macro func, TestClearPageLRU.
This function will be used as a page isolation precondition to prevent
other isolations elsewhere.  There may then be !PageLRU pages on the lru
list, so the BUG() checks need to be removed accordingly.

There are 2 rules for the lru bit now:
1, the lru bit still indicates whether a page is on the lru list; just for
   some temporary moment (isolating), the page may have no lru bit while
   it is on the lru list.  But the page must still be on the lru list when
   the lru bit is set.
2, the lru bit has to be cleared before the page is deleted from the lru
   list.

As Andrew Morton mentioned this change would dirty cacheline for a page
which isn't on the LRU.  But the loss would be acceptable in Rong Chen
<rong.a.chen@intel.com> report:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

Link: https://lkml.kernel.org/r/1604566549-62481-15-git-send-email-alex.shi@linux.alibaba.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Alex Shi 13805a88a9 mm/mlock: remove __munlock_isolate_lru_page()
__munlock_isolate_lru_page() has only one caller; remove it to clean up
and simplify the code.

Link: https://lkml.kernel.org/r/1604566549-62481-14-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Alex Shi 3db19aa39b mm/mlock: remove lru_lock on TestClearPageMlocked
In the func munlock_vma_page, comments mentioned that lru_lock is needed
for serialization with split_huge_page.  But the page must be PageLocked,
as must pages in the split_huge_page series of funcs.  Thus PageLocked is
enough to serialize both funcs.

Furthermore, Hugh Dickins pointed out: before splitting in
split_huge_page_to_list, the page is unmap_page()ed to remove the pmd/ptes,
which protects the page from munlock.  Thus there is no need to guard
__split_huge_page_tail for mlock clearing; just keep the lru_lock there
for isolation purposes.

LKP found a preempt issue on __mod_zone_page_state, which needs a change
to mod_zone_page_state.  Thanks!

Link: https://lkml.kernel.org/r/1604566549-62481-13-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Hugh Dickins 0964730bf4 mlock: fix unevictable_pgs event counts on THP
5.8 commit 5d91f31faf ("mm: swap: fix vmstats for huge page") has
established that vm_events should count every subpage of a THP, including
unevictable_pgs_culled and unevictable_pgs_rescued; but
lru_cache_add_inactive_or_unevictable() was not doing so for
unevictable_pgs_mlocked, and mm/mlock.c was not doing so for
unevictable_pgs mlocked, munlocked, cleared and stranded.

Fix them; but THPs don't go the pagevec way in mlock.c, so no fixes needed
on that path.
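
The shape of the fix, schematically (in lru_cache_add_inactive_or_unevictable(),
counting every subpage of a THP rather than a single event):

	if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
		int nr_pages = thp_nr_pages(page);

		__mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
	}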

Fixes: 5d91f31faf ("mm: swap: fix vmstats for huge page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301408230.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19 13:13:38 -07:00
Matthew Wilcox (Oracle) 6c357848b4 mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:56 -07:00
Michel Lespinasse c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00