Commit Graph

538 Commits

Author SHA1 Message Date
CKI Backport Bot 8126f96c73 mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MM
JIRA: https://issues.redhat.com/browse/RHEL-75840
CVE: CVE-2024-56611

commit 091c1dd2d4df6edd1beebe0e5863d4034ade9572
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 20 21:11:51 2024 +0100

    mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MM

    We currently assume that there is at least one VMA in a MM, which isn't
    true.

    So we might end up with find_vma() returning NULL, and then
    dereference NULL.  Properly handle find_vma() returning NULL.
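
    In shape, the fix is a simple NULL check in migrate_to_node() (a
    simplified sketch, not the verbatim diff; the unlikely() is the
    annotation noted by akpm below):

        mmap_read_lock(mm);
        vma = find_vma(mm, 0);
        if (unlikely(!vma)) {           /* an MM may have no VMAs at all */
                mmap_read_unlock(mm);
                return 0;               /* nothing to migrate */
        }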

    This fixes the report:

    Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
    KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
    CPU: 1 UID: 0 PID: 6021 Comm: syz-executor284 Not tainted 6.12.0-rc7-syzkaller-00187-gf868cd251776 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
    RIP: 0010:migrate_to_node mm/mempolicy.c:1090 [inline]
    RIP: 0010:do_migrate_pages+0x403/0x6f0 mm/mempolicy.c:1194
    Code: ...
    RSP: 0018:ffffc9000375fd08 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffffc9000375fd78 RCX: 0000000000000000
    RDX: ffff88807e171300 RSI: dffffc0000000000 RDI: ffff88803390c044
    RBP: ffff88807e171428 R08: 0000000000000014 R09: fffffbfff2039ef1
    R10: ffffffff901cf78f R11: 0000000000000000 R12: 0000000000000003
    R13: ffffc9000375fe90 R14: ffffc9000375fe98 R15: ffffc9000375fdf8
    FS:  00005555919e1380(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00005555919e1ca8 CR3: 000000007f12a000 CR4: 00000000003526f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     kernel_migrate_pages+0x5b2/0x750 mm/mempolicy.c:1709
     __do_sys_migrate_pages mm/mempolicy.c:1727 [inline]
     __se_sys_migrate_pages mm/mempolicy.c:1723 [inline]
     __x64_sys_migrate_pages+0x96/0x100 mm/mempolicy.c:1723
     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
     do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f

    [akpm@linux-foundation.org: add unlikely()]
    Link: https://lkml.kernel.org/r/20241120201151.9518-1-david@redhat.com
    Fixes: 39743889aa ("[PATCH] Swap Migration V5: sys_migrate_pages interface")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: syzbot+3511625422f7aa637f0d@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/lkml/673d2696.050a0220.3c9d61.012f.GAE@google.com/T/
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Christoph Lameter <cl@linux.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2025-01-28 13:46:13 +00:00
Rafael Aquini beff3f5e32 mm/numa_balancing: teach mpol_to_str about the balancing mode
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit af649773fb25250cd22625af021fb6275c56a3ee
Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Date:   Mon Jul 8 08:56:32 2024 +0100

    mm/numa_balancing: teach mpol_to_str about the balancing mode

    Since balancing mode was added in bda420b985 ("numa balancing: migrate
    on fault among multiple bound nodes"), it was possible to set this mode
    but it wouldn't be shown in /proc/<pid>/numa_maps since there was no
    support for it in the mpol_to_str() helper.

    Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it
    would be displayed as 'default' due to a workaround introduced a few years
    earlier in 8790c71a18 ("mm/mempolicy.c: fix mempolicy printing in
    numa_maps").

    To tidy this up we implement two changes:

    Replace the MPOL_F_MORON check by pointer comparison against the
    preferred_node_policy array.  By doing this we generalise the current
    special casing and replace the incorrect 'default' with the correct 'bind'
    for the mode.

    Secondly, we add a string representation and corresponding handling for
    the MPOL_F_NUMA_BALANCING flag.

    With the two changes together we start showing the balancing flag when it
    is set and therefore complete the fix.

    The representation format chosen separates multiple flags with vertical
    bars, following what existed long ago in kernel 2.6.25.  But since
    between then and now there was no way to display multiple flags, this
    patch does not change the format in practice.

    Some /proc/<pid>/numa_maps output examples:

     555559580000 bind=balancing:0-1,3 file=...
     555585800000 bind=balancing|static:0,2 file=...
     555635240000 prefer=relative:0 file=
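
    Illustratively, the flag printing takes roughly this shape (a sketch
    only, written to match the examples above; not the verbatim upstream
    diff):

        const char *sep = "=";

        if (flags & MPOL_F_NUMA_BALANCING) {
                p += snprintf(p, buffer + maxlen - p, "%sbalancing", sep);
                sep = "|";
        }
        /* static and relative are mutually exclusive */
        if (flags & MPOL_F_STATIC_NODES)
                p += snprintf(p, buffer + maxlen - p, "%sstatic", sep);
        else if (flags & MPOL_F_RELATIVE_NODES)
                p += snprintf(p, buffer + maxlen - p, "%srelative", sep);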

    Link: https://lkml.kernel.org/r/20240708075632.95857-1-tursulin@igalia.com
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
    Fixes: bda420b985 ("numa balancing: migrate on fault among multiple bound nodes")
    References: 8790c71a18 ("mm/mempolicy.c: fix mempolicy printing in numa_maps")
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: <stable@vger.kernel.org>    [5.12+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:28 -05:00
Rafael Aquini f9e926534b mempolicy: alloc_pages_mpol() for NUMA policy without vma
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * mm/swap.h, mm/swap_state.c, and mm/zswap.c: minor context differences due to
    out-of-order backport of commit a65b0e7607cc ("zswap: make shrinking memcg-aware")

This patch is a backport of the following upstream commit:
commit ddc1a5cbc05dc62743a2f409b96faa5cf95ba064
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Oct 19 13:39:08 2023 -0700

    mempolicy: alloc_pages_mpol() for NUMA policy without vma

    Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
    allocation.  alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
    principal actor for passing mempolicy choice down to __alloc_pages(),
    rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).

    vma_alloc_folio() and alloc_pages() remain, but as wrappers around
    alloc_pages_mpol().  alloc_pages_bulk_*() untouched, except to provide the
    additional args to policy_nodemask(), which subsumes policy_node().
    Cleanup throughout, cutting out some unhelpful "helpers".
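
    As a sketch (simplified; details may differ from the upstream diff),
    vma_alloc_folio() reduces to a policy lookup plus delegation:

        struct folio *vma_alloc_folio(gfp_t gfp, int order,
                        struct vm_area_struct *vma, unsigned long addr,
                        bool hugepage)
        {
                struct mempolicy *pol;
                pgoff_t ilx;
                struct page *page;

                pol = get_vma_policy(vma, addr, order, &ilx);
                page = alloc_pages_mpol(gfp | __GFP_COMP, order,
                                        pol, ilx, numa_node_id());
                mpol_cond_put(pol);
                return page_rmappable_folio(page);
        }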

    It would all be much simpler without MPOL_INTERLEAVE, but that adds a
    dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8b
    ("tmpfs: distribute interleave better across nodes"), which added ino bias
    to the interleave, hidden from mm/mempolicy.c until this commit.

    Hence "ilx" throughout, the "interleave index".  Originally I thought it
    could be done just with nid, but that's wrong: the nodemask may come from
    the shared policy layer below a shmem vma, or it may come from the task
    layer above a shmem vma; and without the final nodemask then nodeid cannot
    be decided.  And how ilx is applied depends also on page order.

    The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
    with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
    passed down from vma-less alloc_pages() is also used as a hint not to use
    THP-style hugepage allocation - to avoid the overhead of a hugepage arg
    (though I don't understand why we never just added a GFP bit for THP - if
    it actually needs a different allocation strategy from other pages of the
    same order).  vma_alloc_folio() still carries its hugepage arg here, but
    it is not used, and should be removed when agreed.

    get_vma_policy() no longer allows a NULL vma: over time I believe we've
    eradicated all the places which used to need it e.g.  swapoff and madvise
    used to pass NULL vma to read_swap_cache_async(), but now know the vma.

    [hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()]
      Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com
    Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com
    Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:16 -05:00
Rafael Aquini 15d43a67db mm: add page_rmappable_folio() wrapper
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 23e4883248f0472d806c8b3422ba6257e67bf1a5
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:25:33 2023 -0700

    mm: add page_rmappable_folio() wrapper

    folio_prep_large_rmappable() is being used repeatedly along with a
    conversion from page to folio, a check for non-NULL, and a check for
    order > 1: wrap it all up into struct folio *page_rmappable_folio(struct page *).
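
    A sketch of the wrapper being described (believed close to the merged
    version):

        static inline struct folio *page_rmappable_folio(struct page *page)
        {
                struct folio *folio = (struct folio *)page;

                if (folio && folio_order(folio) > 1)
                        folio_prep_large_rmappable(folio);
                return folio;
        }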

    Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:15 -05:00
Rafael Aquini febda83ac6 mempolicy: clean up minor dead code in queue_pages_test_walk()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 3efbe13e361acfd163fd1a2466e4fb9bed7dc1b0
Author: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Date:   Mon Jan 22 10:25:04 2024 +0100

    mempolicy: clean up minor dead code in queue_pages_test_walk()

    Commit 2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code")
    removes MPOL_MF_LAZY handling in queue_pages_test_walk(), and with that,
    there is no effective use of the local variable endvma in that function
    remaining.

    Remove the local variable endvma and its dead code. No functional change.

    This issue was identified with clang-analyzer's dead stores analysis.

    Link: https://lkml.kernel.org/r/20240122092504.18377-1-lukas.bulwahn@gmail.com
    Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:14 -05:00
Rafael Aquini 52c9b42203 mempolicy: remove confusing MPOL_MF_LAZY dead code
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 2cafb582173f3870240af90de3f31d18b0728882
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:24:18 2023 -0700

    mempolicy: remove confusing MPOL_MF_LAZY dead code

    v3.8 commit b24f53a0be ("mm: mempolicy: Add MPOL_MF_LAZY") introduced
    MPOL_MF_LAZY, and included it in the MPOL_MF_VALID flags; but a720094ded
    ("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now")
    immediately removed it from MPOL_MF_VALID flags, pending further review.
    "This will need to be revisited", but it has not been reinstated.

    The present state is confusing: there is dead code in mm/mempolicy.c to
    handle MPOL_MF_LAZY cases which can never occur.  Remove that: it can be
    resurrected later if necessary.  But keep the definition of MPOL_MF_LAZY,
    which must remain in the UAPI, even though it always fails with EINVAL.

    https://lore.kernel.org/linux-mm/1553041659-46787-1-git-send-email-yang.shi@linux.alibaba.com/
    links to a previous request to remove MPOL_MF_LAZY.

    Link: https://lkml.kernel.org/r/80c9665c-1c3f-17ba-21a3-f6115cebf7d@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:13 -05:00
Rafael Aquini 3ceff7df7e mempolicy: mpol_shared_policy_init() without pseudo-vma
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 35ec8fa0207b3c7f7c3c22337c9a507d7b291626
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:22:59 2023 -0700

    mempolicy: mpol_shared_policy_init() without pseudo-vma

    mpol_shared_policy_init() does not need to use a pseudo-vma: it can use
    sp_alloc() and sp_insert() directly, since the object's shared policy tree
    is empty and inaccessible (needing no lock) at get_inode() time.

    Link: https://lkml.kernel.org/r/3bef62d8-ae78-4c2-533-56a44ae425c@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:13 -05:00
Rafael Aquini a9adddee75 mempolicy trivia: use pgoff_t in shared mempolicy tree
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 93397c3b7684555b7cec726cd13eef6742d191fe
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:21:34 2023 -0700

    mempolicy trivia: use pgoff_t in shared mempolicy tree

    Prefer the more explicit "pgoff_t" to "unsigned long" when dealing with a
    shared mempolicy tree.  Delete confusing comment about pseudo mm vmas.

    Link: https://lkml.kernel.org/r/5451157-3818-4af5-fd2c-5d26a5d1dc53@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:12 -05:00
Rafael Aquini b1c997eb0b mempolicy trivia: slightly more consistent naming
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c36f6e6dff4d32ec8b6da8f553933727a57a7a4a
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:20:14 2023 -0700

    mempolicy trivia: slightly more consistent naming

    Before getting down to work, do a little cleanup, mainly of inconsistent
    variable naming.  I gave up trying to rationalize mpol versus pol versus
    policy, and node versus nid, but let's avoid p and nd.  Remove a few
    superfluous blank lines, but add one; and here prefer vma->vm_policy to
    vma_policy(vma) - the latter being appropriate in other sources, which
    have to allow for !CONFIG_NUMA.  That intriguing line about KERNEL_DS?
    should have gone in v2.6.15, when numa_policy_init() stopped using
    set_mempolicy(2)'s system call handler.

    Link: https://lkml.kernel.org/r/68287974-b6ae-7df-4ba-d19ddd69cbf@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:11 -05:00
Rafael Aquini 6069e43d12 mempolicy trivia: delete those ancient pr_debug()s
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 7f1ee4e2070883d18a431c761db8bb30a958b654
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:19:00 2023 -0700

    mempolicy trivia: delete those ancient pr_debug()s

    Delete those ancient pr_debug()s - PDprintk()s in Andi Kleen's original
    submission of core NUMA API, and useful when debugging shared mempolicy
    lifetime back then, but not used recently.

    Link: https://lkml.kernel.org/r/f25135-ffb2-40d8-9577-720772b333@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:11 -05:00
Rafael Aquini 93566adae4 mempolicy: fix migrate_pages(2) syscall return nr_failed
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 1cb5d11a370f661c5d0d888bb0cfc2cdc5791382
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:17:43 2023 -0700

    mempolicy: fix migrate_pages(2) syscall return nr_failed

    "man 2 migrate_pages" says "On success migrate_pages() returns the number
    of pages that could not be moved".  Although 5.3 and 5.4 commits fixed
    mbind(MPOL_MF_STRICT|MPOL_MF_MOVE*) to fail with EIO when not all pages
    could be moved (because some could not be isolated for migration),
    migrate_pages(2) was left still reporting only those pages failing at the
    migration stage, forgetting those failing at the earlier isolation stage.

    Fix that by accumulating a long nr_failed count in struct queue_pages,
    returned by queue_pages_range() when it's not returning an error, for
    adding on to the nr_failed count from migrate_pages() in mm/migrate.c.  A
    count of pages?  It's more a count of folios, but changing it to pages
    would entail more work (also in mm/migrate.c): does not seem justified.

    queue_pages_range() itself should only return -EIO in the "strictly
    unmovable" case (STRICT without any MOVEs): in that case it's best to
    break out as soon as nr_failed gets set; but otherwise it should continue
    to isolate pages for MOVing even when nr_failed - as the mbind(2) manpage
    promises.

    There's a case when nr_failed should be incremented when it was missed:
    queue_folios_pte_range() and queue_folios_hugetlb() count the transient
    migration entries, like queue_folios_pmd() already did.  And there's a
    case when nr_failed should not be incremented when it would have been: in
    meeting later PTEs of the same large folio, which can only be isolated
    once: fixed by recording the current large folio in struct queue_pages.
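
    The bookkeeping lands in the walk's private state, roughly (a sketch;
    exact field layout may differ from the upstream diff):

        struct queue_pages {
                struct list_head *pagelist;
                unsigned long flags;
                nodemask_t *nmask;
                unsigned long start;
                unsigned long end;
                struct vm_area_struct *first;
                struct folio *large;    /* note last large folio encountered */
                long nr_failed;         /* could not be isolated at this time */
        };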

    Clean up the affected functions, fixing or updating many comments.  Bool
    migrate_folio_add(), without -EIO: true if adding, or if skipping shared
    (but its arguable folio_estimated_sharers() heuristic left unchanged).
    Use MPOL_MF_WRLOCK flag to queue_pages_range(), instead of bool lock_vma.
    Use explicit STRICT|MOVE* flags where queue_pages_test_walk() checks for
    skipping, instead of hiding them behind MPOL_MF_VALID.

    Link: https://lkml.kernel.org/r/9a6b0b9-3bb-dbef-8adf-efab4397b8d@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:10 -05:00
Rafael Aquini 26dc006376 mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * fs/userfaultfd.c: context difference on the 1st hunk due to out-of-order
    backport of commit c88033efe9a3 ("mm/userfaultfd: reset ptes when close()
    for wr-protected ones"), and context differences on the 2nd and 4th hunks
    due to RHEL9 missing upstream commit d61ea1cb0095 ("userfaultfd:
    UFFD_FEATURE_WP_ASYNC") and its series;

This patch is a backport of the following upstream commit:
commit 94d7d923395129b9248777e575c877e40007f9dc
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Wed Oct 11 18:04:28 2023 +0100

    mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.

    mprotect() and other functions which change VMA parameters over a range
    each employ a pattern of:-

    1. Attempt to merge the range with adjacent VMAs.
    2. If this fails, and the range spans a subset of the VMA, split it
       accordingly.

    This is open-coded and duplicated in each case. Also in each case most of
    the parameters passed to vma_merge() remain the same.

    Create a new function, vma_modify(), which abstracts this operation,
    accepting only those parameters which can be changed.

    To avoid the mess of invoking each function call with unnecessary
    parameters, create inline wrapper functions for each of the modify
    operations, parameterised only by what is required to perform the action.

    We can also significantly simplify the logic - by returning the VMA if we
    split (or merged VMA if we do not) we no longer need specific handling for
    merge/split cases in any of the call sites.
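
    In outline, vma_modify() is merge-then-split, returning the VMA the
    caller should continue with (a sketch: try_vma_merge() stands in for
    the real vma_merge() call and its long attribute list):

        static struct vm_area_struct *vma_modify(struct vma_iterator *vmi,
                        struct vm_area_struct *prev,
                        struct vm_area_struct *vma,
                        unsigned long start, unsigned long end)
        {
                /* 1. Attempt to merge the range with adjacent VMAs. */
                struct vm_area_struct *merged =
                        try_vma_merge(vmi, prev, vma, start, end);

                if (merged)
                        return merged;

                /* 2. No merge: split so [start, end) is exactly covered. */
                if (vma->vm_start < start && split_vma(vmi, vma, start, 1))
                        return ERR_PTR(-ENOMEM);
                if (vma->vm_end > end && split_vma(vmi, vma, end, 0))
                        return ERR_PTR(-ENOMEM);

                return vma;
        }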

    Note that the userfaultfd_release() case works even though it does not
    split VMAs - since start is set to vma->vm_start and end is set to
    vma->vm_end, the split logic does not trigger.

    In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start
    - vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this
    instance, this invocation will remain unchanged.

    We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
    vma_merge() correctly ensures that flags remain the same, something that is
    already checked in is_mergeable_vma() and elsewhere, and in any case is not
    specific to mprotect().

    Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:51 -05:00
Rafael Aquini 49c4b4943f sched/numa, mm: make numa migrate functions to take a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 8c9ae56dc73b5ae48a14000b96292bd4f2aeb710
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:17 2023 +0800

    sched/numa, mm: make numa migrate functions to take a folio

    The cpupid (or access time) is stored in the head page for THP, so it is
    safe to make should_numa_migrate_memory() and numa_hint_fault_latency()
    take a folio.  This is in preparation for large folio numa balancing.

    Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:36 -05:00
Rafael Aquini 71a32e7d3d mm: mempolicy: make mpol_misplaced() to take a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 75c70128a67311070115b90d826a229d4bbbb2b5
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:16 2023 +0800

    mm: mempolicy: make mpol_misplaced() to take a folio

    In preparation for large folio numa balancing, make mpol_misplaced()
    take a folio; no functional change intended.

    Link: https://lkml.kernel.org/r/20230921074417.24004-6-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:35 -05:00
Rafael Aquini 6e53c42dda mm: convert prep_transhuge_page() to folio_prep_large_rmappable()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit da6e7bf3a0315025e4199d599bd31763f0df3b4a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:53 2023 +0100

    mm: convert prep_transhuge_page() to folio_prep_large_rmappable()

    Match folio_undo_large_rmappable(), and move the casting from page to
    folio into the callers (which they were largely doing anyway).

    Link: https://lkml.kernel.org/r/20230816151201.3655946-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:48 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
Rafael Aquini 553573f4b1 mm: convert migrate_pages() to work on folios
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * dropped hunk for Documentation/translations/zh_CN/mm/page_migration.rst.
    This doc file was introduced upstream via pre-v6.0 (v6.0-rc1) merge
    commit 6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") which was never
    picked by previous backport attempts.

This patch is a backport of the following upstream commit:
commit 4e096ae1801e24b338e02715c65c3ffa8883ba5d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat May 13 01:11:01 2023 +0100

    mm: convert migrate_pages() to work on folios

    Almost all of the callers & implementors of migrate_pages() were already
    converted to use folios.  compaction_alloc() & compaction_free() are
    trivial to convert as part of this patch and not worth splitting out.

    Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:25 -04:00
Nico Pache 64e5270670 mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
commit 24526268f4e38c9ec0c4a30de4f37ad2a2a84e47
Author: Yang Shi <yang@os.amperecomputing.com>
Date:   Wed Sep 20 15:32:42 2023 -0700

    mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified

    When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
    should attempt to migrate all existing pages, and return -EIO if there is
    misplaced or unmovable page.  Then commit 6f4576e368 ("mempolicy: apply
    page table walker on queue_pages_range()") messed up the return value and
    didn't break the VMA scan early anymore when MPOL_MF_STRICT alone was
    specified.  The return value problem was fixed by commit a7f40cfe3b
    ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is
    specified"), but that made the walk break out early when an unmovable
    page is met, which may cause some pages not to be migrated as expected.

    The code should conceptually do:

     if (MPOL_MF_MOVE|MOVEALL)
         scan all vmas
         try to migrate the existing pages
         return success
     else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
         scan all vmas
         try to migrate the existing pages
         return -EIO if unmovable or migration failed
     else /* MPOL_MF_STRICT alone */
         break early if an unmovable page is met and don't call mbind_range() at all
     else /* none of those flags */
         check the ranges in test_walk, EFAULT without mbind_range() if discontig.

    Fixed the behavior.

    Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com
    Fixes: a7f40cfe3b ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Rafael Aquini <aquini@redhat.com>
    Cc: Kirill A. Shutemov <kirill@shutemov.name>
    Cc: David Rientjes <rientjes@google.com>
    Cc: <stable@vger.kernel.org>    [4.9+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:28 -06:00
Nico Pache 0b91dbac20 mm: enable page walking API to lock vmas during the walk
commit 49b0638502da097c15d46cd4e871dbaa022caf7c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:19 2023 -0700

    mm: enable page walking API to lock vmas during the walk

    walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas.  Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.
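
    Concretely, the new member looks like this (a simplified sketch):

        enum page_walk_lock {
                /* mmap_lock is held for read: just stabilize the VMA tree */
                PGWALK_RDLOCK = 0,
                /* write-lock each VMA as the walk enters it */
                PGWALK_WRLOCK = 1,
                /* expect each VMA to be write-locked already */
                PGWALK_WRLOCK_VERIFY = 2,
        };

        struct mm_walk_ops {
                /* ... existing entry callbacks (pmd_entry, pte_entry, ...) */
                enum page_walk_lock walk_lock;  /* locking requirement */
        };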

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks.  With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them.  The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration.  Without this change a page can be
    concurrently faulted into the area and left out of migration.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Suggested-by: Jann Horn <jannh@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:27 -06:00
Nico Pache f9d450bdb2 mm/mempolicy: Take VMA lock before replacing policy
commit 6c21e066f9256ea1df6f88768f6ae1080b7cf509
Author: Jann Horn <jannh@google.com>
Date:   Fri Jul 28 06:13:21 2023 +0200

    mm/mempolicy: Take VMA lock before replacing policy

    mbind() calls down into vma_replace_policy() without taking the per-VMA
    locks, replaces the VMA's vma->vm_policy pointer, and frees the old
    policy.  That's bad; a concurrent page fault might still be using the
    old policy (in vma_alloc_folio()), resulting in use-after-free.

    Normally this will manifest as a use-after-free read first, but it can
    result in memory corruption, including because vma_alloc_folio() can
    call mpol_cond_put() on the freed policy, which conditionally changes
    the policy's refcount member.
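
    A sketch of the fix shape in vma_replace_policy() (simplified):

        vma_start_write(vma);   /* blocks concurrent per-VMA-lock faults */
        old = vma->vm_policy;
        vma->vm_policy = new;   /* safe: faulting now waits for the lock */
        mpol_put(old);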

    This bug is specific to CONFIG_NUMA, but it does also affect non-NUMA
    systems as long as the kernel was built with CONFIG_NUMA.

    Signed-off-by: Jann Horn <jannh@google.com>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:26 -06:00
Nico Pache 3d51551f5f mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
commit 51f625377561e5b167da2db5aafb7ee268f691c5
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Thu Sep 28 13:24:32 2023 -0400

    mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer

    The two users of mbind_range() are expecting that mbind_range() will
    update the pointer to the previous VMA, or return an error.  However,
    set_mempolicy_home_node() does not call mbind_range() if there is no VMA
    policy.  The fix is to update the pointer to the previous VMA prior to
    continuing iterating the VMAs when there is no policy.
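
    In shape, the fix is one added line in the set_mempolicy_home_node()
    loop (a sketch; surrounding code simplified):

        old = vma_policy(vma);
        if (!old) {
                prev = vma;     /* the previously missing update */
                continue;
        }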

    Users may experience a WARN_ON() during VMA policy updates when updating
    a range of VMAs on the home node.

    Link: https://lkml.kernel.org/r/20230928172432.2246534-1-Liam.Howlett@oracle.com
    Link: https://lore.kernel.org/linux-mm/CALcu4rbT+fMVNaO_F2izaCT+e7jzcAciFkOvk21HGJsmLcUuwQ@mail.gmail.com/
    Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com>
    Closes: https://lore.kernel.org/linux-mm/CALcu4rbT+fMVNaO_F2izaCT+e7jzcAciFkOvk21HGJsmLcUuwQ@mail.gmail.com/
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
Chris von Recklinghausen f52957e911 mm/mempolicy: correctly update prev when policy is equal on mbind
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 00ca0f2e86bf40b016a646e6323a8941a09cf106
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Sun Apr 30 16:07:07 2023 +0100

    mm/mempolicy: correctly update prev when policy is equal on mbind

    The refactoring in commit f4e9e0e69468 ("mm/mempolicy: fix use-after-free
    of VMA iterator") introduces a subtle bug which arises when attempting to
    apply a new NUMA policy across a range of VMAs in mbind_range().

    The refactoring passes a **prev pointer to keep track of the previous VMA
    in order to reduce duplication, and in all but one case it keeps this
    correctly updated.

    The bug arises when a VMA within the specified range has an equivalent
    policy as determined by mpol_equal() - which unlike other cases, does not
    update prev.

    This can result in a situation where, later in the iteration, a VMA is
    found whose policy does need to change.  At this point, vma_merge() is
    invoked with prev pointing to a VMA which is before the previous VMA.

    Since vma_merge() discovers the curr VMA by looking for the one
    immediately after prev, it will now be in a situation where this VMA is
    incorrect and the merge will not proceed correctly.

    This is checked in the VM_WARN_ON() invariant case with end >
    curr->vm_end, which, if a merge is possible, results in a warning (if
    CONFIG_DEBUG_VM is specified).

    I note that vma_merge() performs these invariant checks only after
    merge_prev/merge_next are checked, which is debatable as it hides this
    issue if no merge is possible even though a buggy situation has arisen.

    The solution is simply to update the prev pointer even when policies are
    equal.
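
    That is (a sketch of the one-line fix in mbind_range(), simplified):

        if (mpol_equal(vma_policy(vma), new_pol)) {
                *prev = vma;    /* previously missing: keep prev in step */
                return 0;
        }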

    This caused a bug to arise in the 6.2.y stable tree, and this patch
    resolves this bug.

    Link: https://lkml.kernel.org/r/83f1d612acb519d777bebf7f3359317c4e7f4265.1682866629.git.lstoakes@gmail.com
    Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/oe-lkp/202304292203.44ddeff6-oliver.sang@intel.com
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:09 -04:00
Aristeu Rozanski 0a678d3d82 mm/mempolicy: fix use-after-free of VMA iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f4e9e0e69468583c2c6d9d5c7bfc975e292bf188
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Mon Apr 10 11:22:05 2023 -0400

    mm/mempolicy: fix use-after-free of VMA iterator

    set_mempolicy_home_node() iterates over a list of VMAs and calls
    mbind_range() on each VMA, which also iterates over the singular list of
    the VMA passed in and potentially splits the VMA.  Since the VMA iterator
    is not passed through, set_mempolicy_home_node() may now point to a stale
    node in the VMA tree.  This can result in a UAF as reported by syzbot.

    Avoid the stale maple tree node by passing the VMA iterator through to the
    underlying call to split_vma().

    mbind_range() is also overly complicated, since there are two calling
    functions and one already handles iterating over the VMAs.  Simplify
    mbind_range() to only handle merging and splitting of the VMAs.

    Align the new loop in do_mbind() and existing loop in
    set_mempolicy_home_node() to use the reduced mbind_range() function.  This
    allows for a single location of the range calculation and avoids
    constantly looking up the previous VMA (since this is a loop over the
    VMAs).

    Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
    Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/20230410152205.2294819-1-Liam.Howlett@oracle.com
    Tested-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:26 -04:00
Aristeu Rozanski 39263f3448 mm: hugetlb: change to return bool for isolate_hugetlb()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9747b9e92418b61c2281561e0651803f1fad0159
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:36 2023 +0800

    mm: hugetlb: change to return bool for isolate_hugetlb()

    Now isolate_hugetlb() only returns 0 or -EBUSY, and most users do not
    care about the negative value, thus we can convert isolate_hugetlb()
    to return a boolean value to make the code clearer when checking the
    hugetlb isolation state.  Moreover, convert the 2 users which do consider
    the negative value returned by isolate_hugetlb().

    No functional changes intended.

    [akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
    Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski d1230addeb mm: change to return bool for folio_isolate_lru()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit be2d57563822b7e00b2b16d9354637c4b6d6d5cc
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:34 2023 +0800

    mm: change to return bool for folio_isolate_lru()

    Patch series "Change the return value for page isolation functions", v3.

    Currently the page isolation functions do not return a boolean to
    indicate success or not; instead they return a negative error when they
    fail to isolate a page.  So the code below, used in most places, looks
    like a boolean success/failure check, which can confuse people about
    whether the isolation succeeded.

    if (folio_isolate_lru(folio))
            continue;

    Moreover, the page isolation functions only return 0 or -EBUSY, and
    most users do not care about the negative error except for a few,
    thus we can convert all page isolation functions to return a boolean
    value, which removes the confusion and makes the code clearer.

    No functional changes intended in this patch series.

    This patch (of 4):

    Currently folio_isolate_lru() does not return a boolean value to indicate
    isolation success or not; however, the code below checking the return
    value can make people think it is a boolean success/failure check, which
    makes mistakes easy (see the fix patch[1]).

    if (folio_isolate_lru(folio))
            continue;

    Thus it's better to explicitly check the negative error value returned by
    folio_isolate_lru(), which makes the code clearer per Linus's
    suggestion[2].  Moreover Matthew suggested we can convert the isolation
    functions to return a boolean[3], since most users did not care about the
    negative error value, and can also remove the confusing of checking return
    value.

    So this patch converts folio_isolate_lru() to return a boolean value,
    where 'true' indicates the folio isolation succeeded and 'false'
    indicates it failed, while changing all users' logic for checking the
    isolation state.
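
    After the conversion the common idiom reads naturally as a boolean
    (sketch):

        if (!folio_isolate_lru(folio))
                continue;       /* isolation failed, skip this folio */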

    No functional changes intended.

    [1] https://lore.kernel.org/all/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com/T/#u
    [2] https://lore.kernel.org/all/CAHk-=wiBrY+O-4=2mrbVyxR+hOqfdJ=Do6xoucfJ9_5az01L4Q@mail.gmail.com/
    [3] https://lore.kernel.org/all/Y+sTFqwMNAjDvxw3@casper.infradead.org/

    Link: https://lkml.kernel.org/r/cover.1676424378.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/8a4e3679ed4196168efadf7ea36c038f2f7d5aa9.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski 5958849d56 mm/mempolicy: convert migrate_page_add() to migrate_folio_add()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 4a64981dfee9119aa2c1f243b48f34cbbd67779c
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Jan 30 12:18:33 2023 -0800

    mm/mempolicy: convert migrate_page_add() to migrate_folio_add()

    Replace migrate_page_add() with migrate_folio_add().  migrate_folio_add()
    does the same as migrate_page_add() but takes in a folio instead of a page.
    This removes a couple of calls to compound_head().
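
    An illustrative sketch of the caller-side change (variable names
    hypothetical):

        /* before: the helper resolved the head page via compound_head() */
        migrate_page_add(page, pagelist, flags);

        /* after: the caller hands over the folio it already holds */
        migrate_folio_add(folio, pagelist, flags);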

    Link: https://lkml.kernel.org/r/20230130201833.27042-7-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:22 -04:00
Aristeu Rozanski c5a9a9b2fb mm/mempolicy: convert queue_pages_required() to queue_folio_required()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit d451b89dcd183da725eda84dfb8a46c0b32a4234
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Jan 30 12:18:32 2023 -0800

    mm/mempolicy: convert queue_pages_required() to queue_folio_required()

    Replace queue_pages_required() with queue_folio_required().
    queue_folio_required() does the same as queue_pages_required(), except
    it takes in a folio instead of a page.

    Link: https://lkml.kernel.org/r/20230130201833.27042-6-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:21 -04:00
Aristeu Rozanski 9ed58f804d mm/mempolicy: convert queue_pages_hugetlb() to queue_folios_hugetlb()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 0a2c1e8183163a31fe8c9838f3108aacf9c05c4a
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Jan 30 12:18:31 2023 -0800

    mm/mempolicy: convert queue_pages_hugetlb() to queue_folios_hugetlb()

    This change is in preparation for the conversion of queue_pages_required()
    to queue_folio_required() and migrate_page_add() to migrate_folio_add().

    Link: https://lkml.kernel.org/r/20230130201833.27042-5-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:21 -04:00
Aristeu Rozanski a687c6f4d6 mm/mempolicy: convert queue_pages_pte_range() to queue_folios_pte_range()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 3dae02bbd07f40e37bbfec2d77119628db461eaa
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Jan 30 12:18:30 2023 -0800

    mm/mempolicy: convert queue_pages_pte_range() to queue_folios_pte_range()

    This function now operates on folios associated with ptes instead of
    pages.

    This change is in preparation for the conversion of queue_pages_required()
    to queue_folio_required() and migrate_page_add() to migrate_folio_add().

    Link: https://lkml.kernel.org/r/20230130201833.27042-4-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:21 -04:00
Aristeu Rozanski 1f07f548fb mm/mempolicy: convert queue_pages_pmd() to queue_folios_pmd()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit de1f5055523e9a035b38533f25a56df03d45034a
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Jan 30 12:18:29 2023 -0800

    mm/mempolicy: convert queue_pages_pmd() to queue_folios_pmd()

    The function now operates on a folio instead of the page associated with a
    pmd.

    This change is in preparation for the conversion of queue_pages_required()
    to queue_folio_required() and migrate_page_add() to migrate_folio_add().

    Link: https://lkml.kernel.org/r/20230130201833.27042-3-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:21 -04:00
Aristeu Rozanski 456efc9e7d mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: reverting 830fb0c1df, which had backported da9a298f5fa a second time by mistake

commit d0ce0e47b323a8d7fb5dc3314ce56afa650ade2d
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Jan 25 09:05:33 2023 -0800

    mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()

    Change alloc_huge_page() to alloc_hugetlb_folio() by changing all callers
    to handle the function's new folio return type.  In this conversion,
    alloc_huge_page_vma() is also changed to alloc_hugetlb_folio_vma() and
    hugepage_add_new_anon_rmap() is changed to take in a folio directly.  Many
    additions of '&folio->page' are cleaned up in subsequent patches.

    hugetlbfs_fallocate() is also refactored to use the RCU +
    page_cache_next_miss() API.
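
    An illustrative caller-side sketch; per the upstream behavior at the
    time, alloc_hugetlb_folio() returns an ERR_PTR-encoded folio on
    failure (error handling abridged):

        struct folio *folio;

        folio = alloc_hugetlb_folio(vma, haddr, 0);
        if (IS_ERR(folio))
                return VM_FAULT_OOM;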

    Link: https://lkml.kernel.org/r/20230125170537.96973-5-sidhartha.kumar@oracle.com
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:21 -04:00
Aristeu Rozanski edf79d9715 mm/hugetlb: convert isolate_hugetlb to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 6aa3a920125e9f58891e2b5dc2efd4d0c1ff05a6
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Fri Jan 13 16:30:50 2023 -0600

    mm/hugetlb: convert isolate_hugetlb to folios

    Patch series "continue hugetlb folio conversion", v3.

    This series continues the conversion of core hugetlb functions to use
    folios.  This series converts many helper functions in the hugetlb fault
    path.  This is in preparation for another series to convert the hugetlb
    fault code paths to operate on folios.

    This patch (of 8):

    Convert isolate_hugetlb() to take in a folio and convert its callers to
    pass a folio.  Using page_folio() to convert the callers to a folio is
    safe, as isolate_hugetlb() operates on a head page.
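
    A sketch of the caller-side conversion (illustrative):

        /* before */
        isolate_hugetlb(page, &pagelist);

        /* after: page_folio() is safe here because 'page' is a head page */
        isolate_hugetlb(page_folio(page), &pagelist);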

    Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:20 -04:00
Aristeu Rozanski 7b2b9fac55 mm: switch vma_merge(), split_vma(), and __split_vma to vma iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9760ebffbf5507320e0de41f5b80089bdef996a0
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:30 2023 -0500

    mm: switch vma_merge(), split_vma(), and __split_vma to vma iterator

    Drop the vmi_* functions and transition all users to use the vma iterator
    directly.

    Link: https://lkml.kernel.org/r/20230120162650.984577-30-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:15 -04:00
Aristeu Rozanski 803b183ad0 mempolicy: convert to vma iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f10c2abcdac4a44795fae9118eaedfe56204afda
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:21 2023 -0500

    mempolicy: convert to vma iterator

    Use the vma iterator so that the iterator can be invalidated or updated to
    avoid each caller doing so.
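
    A minimal sketch of the iterator-based walk this converts to
    (illustrative; mbind_range() details elided):

        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, start);

        for_each_vma_range(vmi, vma, end) {
                /* apply the new policy to each VMA in [start, end) */
        }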

    Link: https://lkml.kernel.org/r/20230120162650.984577-21-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:14 -04:00
Aristeu Rozanski b099a56a2d mm/mempolicy: do not duplicate policy if it is not applicable for set_mempolicy_home_node
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit e976936cfc66376fc740a3a476365273384ce1ce
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Dec 16 14:45:37 2022 -0500

    mm/mempolicy: do not duplicate policy if it is not applicable for set_mempolicy_home_node

    set_mempolicy_home_node tries to duplicate a memory policy before checking
    whether it is applicable for the operation.  There is no real reason
    for doing that and it might actually be a pointless memory allocation and
    deallocation exercise for MPOL_INTERLEAVE.

    Not a big problem but we can do better. Simply check the policy before
    acting on it.
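
    A sketch of the reordering (illustrative; the error value follows the
    existing code):

        old = vma_policy(vma);
        if (!old || (old->mode != MPOL_BIND &&
                     old->mode != MPOL_PREFERRED_MANY))
                return -EOPNOTSUPP;     /* reject before duplicating */

        new = mpol_dup(old);            /* only now pay for the copy */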

    Link: https://lkml.kernel.org/r/20221216194537.238047-2-mathieu.desnoyers@efficios.com
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:00 -04:00
Chris von Recklinghausen 87ff2bba80 mm/mempolicy: fix mbind_range() arguments to vma_merge()
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 7329e3ebe3594b425955ab591ecea335e85842c2
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Sat Oct 15 02:12:33 2022 +0000

    mm/mempolicy: fix mbind_range() arguments to vma_merge()

    Fuzzing produced an invalid argument to vma_merge() which was caught by
    the newly added verification of the number of VMAs being removed on
    process exit.  Analyzing the failure eventually resulted in finding an
    issue with the search of a VMA that started at address 0, which caused an
    underflow and thus the loss of many VMAs being tracked in the tree.  Fix
    the underflow by changing the search of the maple tree to use the start
    address directly.

    Link: https://lkml.kernel.org/r/20221015021135.2816178-1-Liam.Howlett@oracle.com
    Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/r/202210052318.5ad10912-oliver.sang@intel.com
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:01 -04:00
Chris von Recklinghausen eb8d8cad04 mm/mempolicy: use vma iterator & maple state instead of vma linked list
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 66850be55e8e5f371db2c091751a932a656c5f4d
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:02 2022 +0000

    mm/mempolicy: use vma iterator & maple state instead of vma linked list

    Rework the way mbind_range() finds the first VMA so that it reuses the
    maple state and limits the number of tree walks needed.

    Note, this drops the VM_BUG_ON(!vma) call, which would catch a start
    address higher than the last VMA.  The code was written in a way that
    allowed no VMA updates to occur and still return success.  There should be
    no functional change to this scenario with the new code.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-57-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:54 -04:00
Phil Auld 57cbb13a8f numa: Generalize numa_map_to_online_node()
JIRA: https://issues.redhat.com/browse/RHEL-17580

commit b1f099b1cf51d553c510c6c8141c27d9ba7ea1fe
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 19 07:12:33 2023 -0700

    numa: Generalize numa_map_to_online_node()

    The function in fact searches for the nearest node to a given one, based
    on the N_ONLINE state.  This is a common pattern when searching for a
    nearest node.

    This patch converts numa_map_to_online_node() to numa_nearest_node()
    so that others won't need to opencode the logic.
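
    The conversion is of this illustrative shape, with the node state to
    search made explicit:

        /* before: hardwired to N_ONLINE */
        nid = numa_map_to_online_node(node);

        /* after */
        nid = numa_nearest_node(node, N_ONLINE);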

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20230819141239.287290-2-yury.norov@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-11-29 10:41:09 -05:00
Chris von Recklinghausen 0e4b299323 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7780d04046a2288ab85d88bedacc60fa4fad9971
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:17:26 2023 -0700

    mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails

    Simple walk_page_range() users should set ACTION_AGAIN to retry when
    pte_offset_map_lock() fails.

    No need to check pmd_trans_unstable(): that was precisely to avoid the
    possibility of calling pte_offset_map() on a racily removed or inserted THP
    entry, but such cases are now safely handled inside it.  Likewise there is
    no need to check pmd_none() or pmd_bad() before calling it.
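
    A sketch of the retry pattern for a walk_page_range() callback (the
    callback name is hypothetical):

        static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                                     unsigned long end, struct mm_walk *walk)
        {
                spinlock_t *ptl;
                pte_t *pte;

                pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                if (!pte) {
                        walk->action = ACTION_AGAIN;
                        return 0;
                }
                /* ... process the pte range ... */
                pte_unmap_unlock(pte, ptl);
                return 0;
        }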

    Link: https://lkml.kernel.org/r/c77d9d10-3aad-e3ce-4896-99e91c7947f3@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: SeongJae Park <sj@kernel.org> for mm/damon part
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:14 -04:00
Chris von Recklinghausen 8c7fab83e8 mm/mempolicy: use PAGE_ALIGN instead of open-coding it
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit aaa31e058dd82453c89302c9331945894ff555a6
Author: ze zuo <zuoze1@huawei.com>
Date:   Tue Sep 13 01:55:05 2022 +0000

    mm/mempolicy: use PAGE_ALIGN instead of open-coding it

    Replace the simple calculation with PAGE_ALIGN.
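
    The replacement is of this shape (illustrative):

        /* before: open-coded page alignment */
        len = (len + PAGE_SIZE - 1) & PAGE_MASK;

        /* after */
        len = PAGE_ALIGN(len);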

    Link: https://lkml.kernel.org/r/20220913015505.1998958-1-zuoze1@huawei.com
    Signed-off-by: ze zuo <zuoze1@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:31 -04:00
Nico Pache 9cc71ee2bd migrate: hugetlb: check for hugetlb shared PMD in node migration
commit 73bdf65ea74857d7fb2ec3067a3cec0e261b1462
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Thu Jan 26 14:27:21 2023 -0800

    migrate: hugetlb: check for hugetlb shared PMD in node migration

    migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
    move pages shared with another process to a different node.  page_mapcount
    > 1 is being used to determine if a hugetlb page is shared.  However, a
    hugetlb page will have a mapcount of 1 if mapped by multiple processes via
    a shared PMD.  As a result, hugetlb pages shared by multiple processes and
    mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.

    To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
    consider the page shared.
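
    A sketch of the strengthened sharedness test (illustrative;
    hugetlb_pmd_shared() is the helper used for the PMD check):

        /* a mapcount of 1 can still hide sharing via a shared PMD */
        bool shared = page_mapcount(page) > 1 ||
                      hugetlb_pmd_shared(pte);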

    Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
    Fixes: e2d8cf4055 ("migrate: add hugepage migration code to migrate_pages()")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:03 -06:00
Nico Pache e032a150e3 mm/mempolicy: fix memory leak in set_mempolicy_home_node system call
commit 38ce7c9bdfc228c14d7621ba36d3eebedd9d4f76
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Dec 15 14:46:21 2022 -0500

    mm/mempolicy: fix memory leak in set_mempolicy_home_node system call

    When encountering any vma in the range with policy other than MPOL_BIND or
    MPOL_PREFERRED_MANY, an error is returned without issuing a mpol_put on
    the policy just allocated with mpol_dup().

    This allows arbitrary users to leak kernel memory.
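
    A sketch of the fix (illustrative; the error path previously returned
    without dropping the reference):

        if (new->mode != MPOL_BIND && new->mode != MPOL_PREFERRED_MANY) {
                err = -EOPNOTSUPP;
                mpol_put(new);          /* the previously missing put */
                break;
        }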

    Link: https://lkml.kernel.org/r/20221215194621.202816-1-mathieu.desnoyers@efficios.com
    Fixes: c6018b4b2549 ("mm/mempolicy: add set_mempolicy_home_node syscall")
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: <stable@vger.kernel.org>    [5.17+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Chris von Recklinghausen d392132e1e mm/uffd: detect pgtable allocation failures
Bugzilla: https://bugzilla.redhat.com/2160210

commit d1751118c88673fe5a948ad82277898e9e284c55
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jan 4 17:52:07 2023 -0500

    mm/uffd: detect pgtable allocation failures

    Before this patch, when a pgtable allocation failure happened during
    change_protection(), the error was ignored by the syscall.  For shmem,
    an error was dumped into the host dmesg.  Two issues with that:

      (1) Doing a trace dump when allocation fails is nothing close to
          graceful.

      (2) The user should be notified of any such error, so the user can
          trap it and decide what to do next, either by retrying, stopping
          the process properly, or anything else.

    For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
    a pgtable allocation failure happens.  It should not normally break
    anyone, though; if it does, it breaks in good ways.

    One man-page update will be on the way to introduce the new -ENOMEM for
    UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on
    the 5.19-till-now kernels.
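
    An illustrative userspace sketch of trapping the new error (setup of
    the uffd descriptor elided):

        struct uffdio_writeprotect wp = {
                .range = { .start = addr, .len = len },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) && errno == ENOMEM) {
                /* pgtable allocation failed: retry, or stop cleanly */
        }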

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/20230104225207.1066932-4-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: James Houghton <jthoughton@google.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen 403d9558c1 mm/mprotect: use long for page accountings and retval
Bugzilla: https://bugzilla.redhat.com/2160210

commit a79390f5d6a78647fd70856bd42b22d994de0ba2
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Jan 4 17:52:06 2023 -0500

    mm/mprotect: use long for page accountings and retval

    Switch to use type "long" for page accountings and retval across the whole
    procedure of change_protection().

    The change should have shrunk the possible maximum page count to half of
    what it was before (ULONG_MAX / 2), but it shouldn't overflow on any
    system either, because the maximum possible number of pages touched by
    change_protection() is ULONG_MAX / PAGE_SIZE.

    Two reasons to switch from "unsigned long" to "long":

      1. It suits count_vm_numa_events() better, whose 2nd parameter takes
         a long type.

      2. It paves way for returning negative (error) values in the future.

    Currently the only caller that consumes this retval is change_prot_numa(),
    where the unsigned long was converted to an int.  While at it, touch up
    the numa code to also take a long, so it'll avoid any possible overflow
    during the int-size conversion.
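
    An illustrative sketch of the NUMA caller after the change, keeping
    the count in a long end to end:

        long nr_updated;

        nr_updated = change_prot_numa(vma, start, end);
        if (nr_updated > 0)
                count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);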

    Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen d0b4183cf0 mm/mprotect: drop pgprot_t parameter from change_protection()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 1ef488edd6c4d447784710974f049628c2890481
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Dec 23 16:56:16 2022 +0100

    mm/mprotect: drop pgprot_t parameter from change_protection()

    Being able to provide a custom protection opens the door for
    inconsistencies and BUGs: for example, accidentally allowing for more
    permissions than desired by other mechanisms (e.g., softdirty tracking).
    vma->vm_page_prot should be the single source of truth.

    Only PROT_NUMA is special: there is no way we can erroneously allow
    for more permissions when removing all permissions. Special-case using
    the MM_CP_PROT_NUMA flag.
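
    A sketch of the special-casing inside change_protection()
    (illustrative):

        if (cp_flags & MM_CP_PROT_NUMA)
                newprot = PAGE_NONE;            /* strip all permissions */
        else
                newprot = vma->vm_page_prot;    /* single source of truth */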

    [david@redhat.com: PAGE_NONE might not be defined without CONFIG_NUMA_BALANCING]
      Link: https://lkml.kernel.org/r/5084ff1c-ebb3-f918-6a60-bacabf550a88@redhat.com
    Link: https://lkml.kernel.org/r/20221223155616.297723-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen c3d530408f mm/mempolicy: remove unneeded out label
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d97cf88ddde9c976d04b886b10b464ec8006c85
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 19 19:52:33 2022 +0800

    mm/mempolicy: remove unneeded out label

    We can use the unlock label to unlock ptl and return ret directly,
    removing the unneeded out label and reducing the size of mempolicy.o.
    No functional change intended.

    [Before]
       text    data     bss     dec     hex filename
      26702    3972    6168   36842    8fea mm/mempolicy.o

    [After]
       text    data     bss     dec     hex filename
      26662    3972    6168   36802    8fc2 mm/mempolicy.o

    Link: https://lkml.kernel.org/r/20220719115233.6706-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
Chris von Recklinghausen 2234c38cea mm/mempolicy: fix uninit-value in mpol_rebind_policy()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 018160ad314d75b1409129b2247b614a9f35894c
Author: Wang Cheng <wanngchenng@gmail.com>
Date:   Thu May 19 14:08:54 2022 -0700

    mm/mempolicy: fix uninit-value in mpol_rebind_policy()

    mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when
    pol->mode is MPOL_LOCAL.  Check pol->mode before accessing
    pol->w.cpuset_mems_allowed in mpol_rebind_policy() (mm/mempolicy.c).

    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
     mpol_rebind_policy mm/mempolicy.c:352 [inline]
     mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
     cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline]
     cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278
     cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515
     cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline]
     cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804
     __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520
     cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539
     cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852
     kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296
     call_write_iter include/linux/fs.h:2162 [inline]
     new_sync_write fs/read_write.c:503 [inline]
     vfs_write+0x1318/0x2030 fs/read_write.c:590
     ksys_write+0x28b/0x510 fs/read_write.c:643
     __do_sys_write fs/read_write.c:655 [inline]
     __se_sys_write fs/read_write.c:652 [inline]
     __x64_sys_write+0xdb/0x120 fs/read_write.c:652
     do_syscall_x64 arch/x86/entry/common.c:51 [inline]
     do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Uninit was created at:
     slab_post_alloc_hook mm/slab.h:524 [inline]
     slab_alloc_node mm/slub.c:3251 [inline]
     slab_alloc mm/slub.c:3259 [inline]
     kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264
     mpol_new mm/mempolicy.c:293 [inline]
     do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853
     kernel_set_mempolicy mm/mempolicy.c:1504 [inline]
     __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline]
     __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507
     __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507
     do_syscall_x64 arch/x86/entry/common.c:51 [inline]
     do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    KMSAN: uninit-value in mpol_rebind_task (2)
    https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc

    This patch seems to fix below bug too.
    KMSAN: uninit-value in mpol_rebind_mm (2)
    https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b

    The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy().
    When syzkaller reproducer runs to the beginning of mpol_new(),

                mpol_new() mm/mempolicy.c
              do_mbind() mm/mempolicy.c
            kernel_mbind() mm/mempolicy.c

    `mode` is 1 (MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags`
    is 0.  Then

            mode = MPOL_LOCAL;
            ...
            policy->mode = mode;
            policy->flags = flags;

    will be executed. So in mpol_set_nodemask(),

                mpol_set_nodemask() mm/mempolicy.c
              do_mbind()
            kernel_mbind()

    pol->mode is 4 (MPOL_LOCAL), so the `nodemask` in `pol` is not
    initialized, yet it will be accessed in mpol_rebind_policy().
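
    The fix is of this illustrative shape, bailing out before the
    uninitialized union field can be read:

        static void mpol_rebind_policy(struct mempolicy *pol,
                                       const nodemask_t *newmask)
        {
                if (!pol || pol->mode == MPOL_LOCAL)
                        return;
                /* ... safe to consult pol->w.cpuset_mems_allowed ... */
        }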

    Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain
    Signed-off-by: Wang Cheng <wanngchenng@gmail.com>
    Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
    Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen af7f14b54b mm: remove alloc_pages_vma()
Bugzilla: https://bugzilla.redhat.com/2160210

commit adf88aa8ea7ff143825a2a8a7193f92e0e6fc79b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu May 12 20:23:01 2022 -0700

    mm: remove alloc_pages_vma()

    All callers have now been converted to use vma_alloc_folio(), so convert
    the body of alloc_pages_vma() to allocate folios instead.
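
    An illustrative caller-side sketch (per the vma_alloc_folio()
    signature of that era, with the trailing 'hugepage' bool):

        struct folio *folio;

        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
        if (!folio)
                return VM_FAULT_OOM;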

    Link: https://lkml.kernel.org/r/20220504182857.4013401-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:06 -04:00
Nico Pache 6dc345f13e mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process
commit d2226ebd5484afcf9f9b71b394ec1567a7730eb1
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Aug 5 08:59:03 2022 +0800

    mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process

    Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in
    commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple
    preferred nodes"), the policy_nodemask_current()'s semantics for this new
    policy has been changed, which returns 'preferred' nodes instead of
    'allowed' nodes.

    With the changed semantics of policy_nodemask_current(), a task with
    MPOL_PREFERRED_MANY policy could fail to get its reservation even though
    it can fall back to other nodes (either defined by cpusets or all online
    nodes) for that reservation, failing mmap calls unnecessarily early.

    The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
    because it, unlike MPOL_BIND, does not pose any actual hard constraint.

    Michal suggested the policy_nodemask_current() is only used by hugetlb,
    and could be moved to hugetlb code with more explicit name to enforce the
    'allowed' semantics for which only MPOL_BIND policy matters.

    apply_policy_zone() is made extern to be called in hugetlb code and its
    return value is changed to bool.

    [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/
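
    A sketch of the dedicated hugetlb-side helper (illustrative; the
    helper name here, policy_mbind_nodemask(), follows the upstream
    patch):

        static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
        {
        #ifdef CONFIG_NUMA
                struct mempolicy *mpol = get_task_policy(current);

                /* only MPOL_BIND poses a hard 'allowed' constraint */
                if (mpol->mode == MPOL_BIND &&
                    apply_policy_zone(mpol, gfp_zone(gfp)))
                        return &mpol->nodes;
        #endif
                return NULL;
        }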

    Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com
    Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Reported-by: Muchun Song <songmuchun@bytedance.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Ben Widawsky <bwidawsk@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00
Nico Pache 06c339403c mm/mempolicy: fix lock contention on mems_allowed
commit 12c1dc8e7441773c74dc62fab76553c24015f6e1
Author: Abel Wu <wuyun.abel@bytedance.com>
Date:   Thu Aug 11 20:41:57 2022 +0800

    mm/mempolicy: fix lock contention on mems_allowed

    The mems_allowed field can be modified by other tasks, so it isn't safe to
    access it with alloc_lock unlocked even in the current process context.

    Say there are two tasks: A from cpusetA is performing set_mempolicy(2),
    and B is changing cpusetA's cpuset.mems:

      A (set_mempolicy)             B (echo xx > cpuset.mems)
      -------------------------------------------------------
      pol = mpol_new();
                                    update_tasks_nodemask(cpusetA) {
                                      foreach t in cpusetA {
                                        cpuset_change_task_nodemask(t) {
      mpol_set_nodemask(pol) {
                                          task_lock(t); // t could be A
        new = f(A->mems_allowed);
                                          update t->mems_allowed;
        pol.create(pol, new);
                                          task_unlock(t);
      }
                                        }
                                      }
                                    }
      task_lock(A);
      A->mempolicy = pol;
      task_unlock(A);

    In this case A's pol->nodes is computed from the old mems_allowed, and
    could be inconsistent with A's new mems_allowed.

    The case of replacing vmas' policies is different: pol->nodes only goes
    stale when current_cpuset_is_being_rebound():

      A (mbind)                     B (echo xx > cpuset.mems)
      -------------------------------------------------------
      pol = mpol_new();
      mmap_write_lock(A->mm);
                                    cpuset_being_rebound = cpusetA;
                                    update_tasks_nodemask(cpusetA) {
                                      foreach t in cpusetA {
                                        cpuset_change_task_nodemask(t) {
      mpol_set_nodemask(pol) {
                                          task_lock(t); // t could be A
        mask = f(A->mems_allowed);
                                          update t->mems_allowed;
        pol.create(pol, mask);
                                          task_unlock(t);
      }
                                        }
      foreach v in A->mm {
        if (cpuset_being_rebound == cpusetA)
          pol.rebind(pol, cpuset.mems);
        v->vma_policy = pol;
      }
      mmap_write_unlock(A->mm);
                                        mmap_write_lock(t->mm);
                                        mpol_rebind_mm(t->mm);
                                        mmap_write_unlock(t->mm);
                                      }
                                    }
                                    cpuset_being_rebound = NULL;

    In this case, the cpuset.mems, which has already been updated, is
    ultimately used for calculating pol->nodes, rather than A->mems_allowed.
    So it is OK to call mpol_set_nodemask() with alloc_lock unlocked when
    doing mbind(2).
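
    A sketch of the set_mempolicy(2) fix, computing pol->nodes under the
    same task_lock that guards mems_allowed (illustrative):

        task_lock(current);
        ret = mpol_set_nodemask(new, nodes, scratch);
        if (!ret) {
                old = current->mempolicy;
                current->mempolicy = new;
        }
        task_unlock(current);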

    Link: https://lkml.kernel.org/r/20220811124157.74888-1-wuyun.abel@bytedance.com
    Fixes: 78b132e9ba ("mm/mempolicy: remove or narrow the lock on current")
    Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00