Commit Graph

2073 Commits

Author SHA1 Message Date
Paolo Bonzini e05d25b2e8 mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks
JIRA: https://issues.redhat.com/browse/RHEL-10059

The normal page init path frees pages during boot in MAX_ORDER chunks, but
the deferred page init path does it in pageblock-sized blocks.

Change deferred page init path to work in MAX_ORDER blocks.

For cases when MAX_ORDER is larger than a pageblock, set the migrate type to
MIGRATE_MOVABLE for all pageblocks covered by the page.

Link: https://lkml.kernel.org/r/20230321002415.20843-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 3f6dac0fd1b83178137e7b4e722d8f29612cbec1)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[RHEL-only: code is in mm/page_alloc.c, not mm/mm_init.c]
2023-10-30 12:46:46 +01:00
Paolo Bonzini 13262962e2 mm: Add support for unaccepted memory
JIRA: https://issues.redhat.com/browse/RHEL-10059

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have a runtime cost once the system is booted. The downside
    is a very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on first use. It requires more
    infrastructure and changes in the page allocator to make it work,
    but it provides a good boot time.

    On-demand memory acceptance means latency spikes every time the
    kernel steps onto a new memory block. The spikes will go away once
    the workload's data set size stabilizes or all memory gets accepted.

 3. Accept all memory in the background. Introduce a thread (or several)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires a functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager on the kernel command line. #3 can be
implemented later based on user demand.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to the pre-accepted pool of memory.

  - the page allocator accepts memory on the first allocation of the page.
    When the kernel runs out of accepted memory, it accepts memory until
    the high watermark is reached, which helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.
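
A minimal sketch of how a caller might combine the two helpers (the
function and its name are illustrative, not part of the patch):

    /* Illustrative only: accept the backing memory of a page range if
     * any part of it is still unaccepted. */
    static void accept_page_range(struct page *page, unsigned int order)
    {
            phys_addr_t start = page_to_phys(page);
            phys_addr_t end = start + (PAGE_SIZE << order);

            if (range_contains_unaccepted_memory(start, end))
                    accept_memory(start, end);
    }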

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
(cherry picked from commit dcdfdd40fa82b6704d2841938e5c8ec3051eb0d6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[RHEL: upstream has mm/mm_init.c split out of mm/page_alloc.c]
2023-10-30 09:14:17 +01:00
Paolo Bonzini 538bf6f332 mm, treewide: redefine MAX_ORDER sanely
JIRA: https://issues.redhat.com/browse/RHEL-10059

MAX_ORDER is currently defined as the number of orders the page allocator
supports: a user can ask the buddy allocator for page orders between 0 and
MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all
over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders a
user can ask the buddy allocator for is now 0..MAX_ORDER.
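
An illustrative before/after of a typical loop bound under the old and new
definition (not a specific hunk from the patch; process_order() is a
placeholder):

    /* before: valid orders were 0 .. MAX_ORDER - 1 */
    for (order = 0; order < MAX_ORDER; order++)
            process_order(order);

    /* after: MAX_ORDER itself is now a valid order */
    for (order = 0; order <= MAX_ORDER; order++)
            process_order(order);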

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 23baf831a32c04f9a968812511540b1b3e648bf5)

[RHEL: Fix conflicts by changing MAX_ORDER - 1 to MAX_ORDER,
       ">= MAX_ORDER" to "> MAX_ORDER", etc.]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-30 09:12:37 +01:00
Chris von Recklinghausen cb1425e0ef kasan: reset page tags properly with sampling
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 420ef683b5217338bc679c33fd9361b52f53a526
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Tue Jan 24 21:35:26 2023 +0100

    kasan: reset page tags properly with sampling

    The implementation of page_alloc poisoning sampling assumed that
    tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations.
    However, this is no longer the case since commit 70c248aca9e7 ("mm: kasan:
    Skip unpoisoning of user pages").

    This leads to kernel crashes when MTE-enabled userspace mappings are used
    with Hardware Tag-Based KASAN enabled.

    Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook().

    Also clarify and fix related comments.

    [andreyknvl@google.com: update comment]
     Link: https://lkml.kernel.org/r/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com
    Link: https://lkml.kernel.org/r/24ea20c1b19c2b4b56cf9f5b354915f8dbccfc77.1674592496.git.andreyknvl@google.com
    Fixes: 44383cef54c0 ("kasan: allow sampling page_alloc allocations for HW_TAGS")
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reported-by: Peter Collingbourne <pcc@google.com>
    Tested-by: Peter Collingbourne <pcc@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:58 -04:00
Chris von Recklinghausen fe5f50def7 mm: remove folio_pincount_ptr() and head_compound_pincount()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 94688e8eb453e616098cb930e5f6fed4a6ea2dfa
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:47 2023 +0000

    mm: remove folio_pincount_ptr() and head_compound_pincount()

    We can use folio->_pincount directly, since all users are guarded by tests
    of compound/large.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen 62d9864b57 mm: multi-gen LRU: per-node lru_gen_folio lists
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e4dde56cd208674ce899b47589f263499e5b8cdc
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:04 2022 -0700

    mm: multi-gen LRU: per-node lru_gen_folio lists

    For each node, memcgs are divided into two generations: the old and
    the young. For each generation, memcgs are randomly sharded into
    multiple bins to improve scalability. For each bin, an RCU hlist_nulls
    is virtually divided into three segments: the head, the tail and the
    default.

    An onlining memcg is added to the tail of a random bin in the old
    generation. The eviction starts at the head of a random bin in the old
    generation. The per-node memcg generation counter, whose remainder (mod
    2) indexes the old generation, is incremented when all its bins become
    empty.
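
    A minimal sketch of the mod-2 indexing described above (the helper is
    illustrative, not the upstream code):

        /* The remainder of the per-node counter selects which of the two
         * generations is "old"; the other index is the young generation. */
        static inline int memcg_old_gen_idx(unsigned long gen_seq)
        {
                return gen_seq % 2;
        }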

    There are four operations:
    1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
       its current generation (old or young) and updates its "seg" to
       "head";
    2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
       its current generation (old or young) and updates its "seg" to
       "tail";
    3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
       the old generation, updates its "gen" to "old" and resets its "seg"
       to "default";
    4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
       in the young generation, updates its "gen" to "young" and resets
       its "seg" to "default".

    The events that trigger the above operations are:
    1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
    2. The first attempt to reclaim an memcg below low, which triggers
       MEMCG_LRU_TAIL;
    3. The first attempt to reclaim an memcg below reclaimable size
       threshold, which triggers MEMCG_LRU_TAIL;
    4. The second attempt to reclaim an memcg below reclaimable size
       threshold, which triggers MEMCG_LRU_YOUNG;
    5. Attempting to reclaim an memcg below min, which triggers
       MEMCG_LRU_YOUNG;
    6. Finishing the aging on the eviction path, which triggers
       MEMCG_LRU_YOUNG;
    7. Offlining an memcg, which triggers MEMCG_LRU_OLD.

    Note that memcg LRU only applies to global reclaim, and the
    round-robin incrementing of their max_seq counters ensures the
    eventual fairness to all eligible memcgs. For memcg reclaim, it still
    relies on mem_cgroup_iter().

    Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:50 -04:00
Chris von Recklinghausen dbad0dfa61 kasan: allow sampling page_alloc allocations for HW_TAGS
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 44383cef54c0ce1201f884d83cc2b367bc5aa4f7
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Mon Dec 19 19:09:18 2022 +0100

    kasan: allow sampling page_alloc allocations for HW_TAGS

    As Hardware Tag-Based KASAN is intended to be used in production, its
    performance impact is crucial.  As page_alloc allocations tend to be big,
    tagging and checking all such allocations can introduce a significant
    slowdown.

    Add two new boot parameters that help alleviate that slowdown:

    - kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
      every Nth page_alloc allocation with the order configured by the second
      added parameter (default: tag every such allocation).

    - kasan.page_alloc.sample.order, which makes the sampling enabled by the
      first parameter only affect page_alloc allocations with an order equal
      to or greater than the specified value (default: 3, see below).
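
    For example, booting with the following samples every 10th page_alloc
    allocation of order 3 or higher:

        kasan.page_alloc.sample=10 kasan.page_alloc.sample.order=3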

    The exact performance improvement caused by using the new parameters
    depends on their values and the applied workload.

    The chosen default value for kasan.page_alloc.sample.order is 3, which
    matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER.  This is
    done for two reasons:

    1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
       costly to service", which corresponds to the idea that only large and
       thus costly allocations are supposed to be sampled.

    2. One of the workloads targeted by this patch is a benchmark that sends
       a large amount of data over a local loopback connection. Most multi-page
       data allocations in the networking subsystem have the order of
       SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).

    When running a local loopback test on a testing MTE-enabled device in sync
    mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
    Applying this patch and setting kasan.page_alloc.sample to a value
    higher than 1 allows lowering the slowdown.  The performance improvement
    saturates around the sampling interval value of 10 with the default
    sampling page order of 3.  This lowers the slowdown to ~20%.  The slowdown
    in real scenarios involving the network will likely be better.

    Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
    a page_alloc allocation that has not been tagged.  This lowers the value
    of KASAN as a security mitigation.

    However, based on measuring the number of page_alloc allocations of
    different orders during boot in a test build, sampling with the default
    kasan.page_alloc.sample.order value affects only ~7% of allocations.  The
    remaining ~93% of allocations are still checked deterministically.

    Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Mark Brand <markbrand@google.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:47 -04:00
Chris von Recklinghausen 6d31a562d5 mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4b51634cd16a01b2be0f6b69cc0dae63de4751f2
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Nov 22 01:49:36 2022 -0800

    mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped

    Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
    Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
    but if we slightly abuse subpages_mapcount by additionally demanding that
    one bit be set there when the compound page is PMD-mapped, then a cascade
    of two atomic ops is able to maintain the stats without bit_spin_lock.
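
    A hedged sketch of that idea (the constant's value and the accessor are
    assumptions, not necessarily the upstream definitions):

        #define COMPOUND_MAPPED  0x800000                /* "PMD-mapped" flag bit */
        #define SUBPAGES_MAPPED  (COMPOUND_MAPPED - 1)   /* low bits: pte-mapped count */

        static inline bool subpages_show_pmd_mapped(const atomic_t *subpages_mapcount)
        {
                return atomic_read(subpages_mapcount) & COMPOUND_MAPPED;
        }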

    This is harder to reason about than when bit_spin_locked, but I believe
    safe; and no drift in stats detected when testing.  When there are racing
    removes and adds, of course the sequence of operations is less well-
    defined; but each operation on subpages_mapcount is atomically good.  What
    might be disastrous, is if subpages_mapcount could ever fleetingly appear
    negative: but the pte lock (or pmd lock) these rmap functions are called
    under, ensures that a last remove cannot race ahead of a first add.

    Continue to make an exception for hugetlb (PageHuge) pages, though that
    exception can be easily removed by a further commit if necessary: leave
    subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
    carry on checking compound_mapcount too in folio_mapped(), page_mapped().

    Evidence is that this way goes slightly faster than the previous
    implementation in all cases (pmds after ptes now taking around 103ms); and
    relieves us of worrying about contention on the bit_spin_lock.

    Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Dan Carpenter <error27@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:27 -04:00
Chris von Recklinghausen e1c02a97f1 mm,thp,rmap: simplify compound page mapcount handling
Conflicts:
	include/linux/mm.h - We already have
		a1554c002699 ("include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h")
		so keep declaration of nr_free_buffer_pages
	mm/huge_memory.c - We already have RHEL-only commit
		0837bdd68b ("Revert "mm: thp: stabilize the THP mapcount in page_remove_anon_compound_rmap"")
		so there is a difference in deleted code.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit cb67f4282bf9693658dbda934a441ddbbb1446df
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Nov 2 18:51:38 2022 -0700

    mm,thp,rmap: simplify compound page mapcount handling

    Compound page (folio) mapcount calculations have been different for anon
    and file (or shmem) THPs, and involved the obscure PageDoubleMap flag.
    And each huge mapping and unmapping of a file (or shmem) THP involved
    atomically incrementing and decrementing the mapcount of every subpage of
    that huge page, dirtying many struct page cachelines.

    Add subpages_mapcount field to the struct folio and first tail page, so
    that the total of subpage mapcounts is available in one place near the
    head: then page_mapcount() and total_mapcount() and page_mapped(), and
    their folio equivalents, are so quick that anon and file and hugetlb don't
    need to be optimized differently.  Delete the unloved PageDoubleMap.

    page_add and page_remove rmap functions must now maintain the
    subpages_mapcount as well as the subpage _mapcount, when dealing with pte
    mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
    NR_FILE_MAPPED statistics still needs reading through the subpages, using
    nr_subpages_unmapped() - but only when first or last pmd mapping finds
    subpages_mapcount raised (double-map case, not the common case).

    But are those counts (used to decide when to split an anon THP, and in
    vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
    quite: since page_remove_rmap() (and also split_huge_pmd()) is often
    called without page lock, there can be races when a subpage pte mapcount
    0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
    previous implementation had prevented.  The statistics might become
    inaccurate, and even drift down until they underflow through 0.  That is
    not good enough, but is better dealt with in a followup patch.

    Update a few comments on first and second tail page overlaid fields.
    hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
    subpages_mapcount and compound_pincount are already correctly at 0, so
    delete its reinitialization of compound_pincount.

    A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
    18 seconds on small pages, and used to take 1 second on huge pages, but
    now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
    used to take 860ms and now takes 92ms; mapping by pmds after mapping by
    ptes (when the scan is needed) used to take 870ms and now takes 495ms.
    But there might be some benchmarks which would show a slowdown, because
    tail struct pages now fall out of cache until final freeing checks them.

    Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:26 -04:00
Chris von Recklinghausen 94a962cb51 mm/page_alloc: reduce potential fragmentation in make_alloc_exact()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit df48a5f7a3bbac6a700026b554922943ecee1fb0
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue May 31 09:20:51 2022 -0400

    mm/page_alloc: reduce potential fragmentation in make_alloc_exact()

    Try to avoid using the left over split page on the next request for a page
    by calling __free_pages_ok() with FPI_TO_TAIL.  This increases the
    potential of defragmenting memory when it's used for a short period of
    time.
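
    The key call, illustrated (not the exact hunk; the leftover page pointer
    is named here only for illustration):

        /* send the unused part of the split to the tail of the free list */
        __free_pages_ok(leftover_page, 0, FPI_TO_TAIL);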

    Link: https://lkml.kernel.org/r/20220531185626.yvlmymbxyoe5vags@revolver
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:04 -04:00
Chris von Recklinghausen 632d28e4fb mm/page_alloc: update comments for rmqueue()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a57ae9ef9e1a20b68ae841a8cab7aff3f000ed9d
Author: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Date:   Sun Sep 18 02:56:40 2022 +0000

    mm/page_alloc: update comments for rmqueue()

    Since commit 44042b4498 ("mm/page_alloc: allow high-order pages to be
    stored on the per-cpu lists"), the per-cpu page allocator (PCP) is no
    longer only for order-0 pages.  Update the comments.

    Link: https://lkml.kernel.org/r/20220918025640.208586-1-ran.xiaokai@zte.com.cn
    Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:48 -04:00
Chris von Recklinghausen 89622e8025 mm/page_alloc: fix obsolete comment in deferred_pfn_valid()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c9b3637f8a5a4c869f78c26773c559669796212f
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:57 2022 +0800

    mm/page_alloc: fix obsolete comment in deferred_pfn_valid()

    There are no architectures that can have holes in the memory map within a
    pageblock since commit 859a85ddf90e ("mm: remove pfn_valid_within() and
    CONFIG_HOLES_IN_ZONE").  Update the corresponding comment.

    Link: https://lkml.kernel.org/r/20220916072257.9639-17-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:47 -04:00
Chris von Recklinghausen 8782d950dc mm/page_alloc: use costly_order in WARN_ON_ONCE_GFP()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 896c4d52538df231c3847491acc4f2c23891fe6a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:55 2022 +0800

    mm/page_alloc: use costly_order in WARN_ON_ONCE_GFP()

    There's no need to check whether order > PAGE_ALLOC_COSTLY_ORDER again.
    Minor readability improvement.

    Link: https://lkml.kernel.org/r/20220916072257.9639-15-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:47 -04:00
Chris von Recklinghausen da0e9cb6cd mm/page_alloc: init local variable buddy_pfn
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit dae37a5dccd104fc465241c42d9e17756ddebbc1
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:54 2022 +0800

    mm/page_alloc: init local variable buddy_pfn

    The local variable buddy_pfn could be passed to buddy_merge_likely()
    without initialization if the passed-in order is MAX_ORDER - 1.  This
    looks buggy, but buddy_pfn won't be used in this case as there's an
    order >= MAX_ORDER - 2 check.  Init buddy_pfn to 0 anyway to avoid
    possible future misuse.

    Link: https://lkml.kernel.org/r/20220916072257.9639-14-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:46 -04:00
Chris von Recklinghausen 221941763d mm/page_alloc: use helper macro SZ_1{K,M}
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c940e0207a1c307fdab92b32d0522271036fc3ef
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:53 2022 +0800

    mm/page_alloc: use helper macro SZ_1{K,M}

    Use helper macro SZ_1K and SZ_1M to do the size conversion.  Minor
    readability improvement.

    Link: https://lkml.kernel.org/r/20220916072257.9639-13-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:46 -04:00
Chris von Recklinghausen 3f65ff56dd mm/page_alloc: make boot_nodestats static
Conflicts: mm/internal.h - We already have
	27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
	so we already have the declaration for free_zone_device_page
	We already have
	b05a79d4377f ("mm/gup: migrate device coherent pages when pinning instead of failing")
	so we have the declaration of migrate_device_coherent_page
	We already have
	76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()")
	which makes patch think we don't have a declaration for
	mirrored_kernelcore

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6dc2c87a5a8878b657d08e34ca0e757d31273e12
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:52 2022 +0800

    mm/page_alloc: make boot_nodestats static

    It's only used in mm/page_alloc.c now.  Make it static.

    Link: https://lkml.kernel.org/r/20220916072257.9639-12-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:46 -04:00
Chris von Recklinghausen ef68aa3621 security: kmsan: fix interoperability with auto-initialization
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 42eaa27d9e7aafb4049fc3a5b02005a917013e65
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:04:04 2022 +0200

    security: kmsan: fix interoperability with auto-initialization

    Heap and stack initialization is great, but not when we are trying to
    catch uses of uninitialized memory.  When the kernel is built with KMSAN,
    having kernel memory initialization enabled may introduce false negatives.

    We disable CONFIG_INIT_STACK_ALL_PATTERN and CONFIG_INIT_STACK_ALL_ZERO
    under CONFIG_KMSAN, making it impossible to auto-initialize stack
    variables in KMSAN builds.  We also disable
    CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON to
    prevent accidental use of heap auto-initialization.

    We however still let the users enable heap auto-initialization at
    boot-time (by setting init_on_alloc=1 or init_on_free=1), in which case a
    warning is printed.

    Link: https://lkml.kernel.org/r/20220915150417.722975-31-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:40 -04:00
Chris von Recklinghausen cdd02ac72c init: kmsan: call KMSAN initialization routines
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3c206509826094e85ead0b056f484db96829248d
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:51 2022 +0200

    init: kmsan: call KMSAN initialization routines

    kmsan_init_shadow() scans the mappings created at boot time and creates
    metadata pages for those mappings.

    When the memblock allocator returns pages to pagealloc, we reserve 2/3 of
    those pages and use them as metadata for the remaining 1/3.  Once KMSAN
    starts, every page allocated by pagealloc has its associated shadow and
    origin pages.

    kmsan_initialize() initializes the bookkeeping for init_task and enables
    KMSAN.

    Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:36 -04:00
Chris von Recklinghausen 271a98f55e mm: kmsan: maintain KMSAN metadata for page operations
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b073d7f8aee4ebf05d10e3380df377b73120cf16
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:48 2022 +0200

    mm: kmsan: maintain KMSAN metadata for page operations

    Insert KMSAN hooks that make the necessary bookkeeping changes:
     - poison page shadow and origins in alloc_pages()/free_page();
     - clear page shadow and origins in clear_page(), copy_user_highpage();
     - copy page metadata in copy_highpage(), wp_page_copy();
     - handle vmap()/vunmap()/iounmap();
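
    A brief sketch of the alloc/free hooks named above (placement within the
    allocator paths is illustrative; the hook names are the KMSAN API ones):

        kmsan_alloc_page(page, order, gfp_flags);  /* poison shadow/origin  */
        kmsan_free_page(page, order);              /* on the free path      */
        kmsan_copy_page_meta(dst, src);            /* copy_highpage() et al */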

    Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:35 -04:00
Chris von Recklinghausen 1db8227f74 mm/page_alloc.c: document bulkfree_pcp_prepare() return value
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d452289fcd68f13f4067f0ddd78a5d948cb7d9ea
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Tue Sep 13 15:30:38 2022 -0700

    mm/page_alloc.c: document bulkfree_pcp_prepare() return value

    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: ke.wang <ke.wang@unisoc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:31 -04:00
Chris von Recklinghausen 23431950f8 mm/page_alloc.c: rename check_free_page() to free_page_is_bad()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a8368cd8e22531b3b248a2c869d71b668aeeb789
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Tue Sep 13 15:20:48 2022 -0700

    mm/page_alloc.c: rename check_free_page() to free_page_is_bad()

    The name "check_free_page()" provides no information regarding its return
    value when the page is indeed found to be bad.

    Renaming it to "free_page_is_bad()" makes it clear that a `true' return
    value means the page was bad.

    And make it return a bool, not an int.
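
    An illustrative call site after the rename:

        if (free_page_is_bad(page))
                bad++;    /* a true return now plainly means "this page is bad" */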

    [akpm@linux-foundation.org: don't use bool as int]
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: ke.wang <ke.wang@unisoc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:31 -04:00
Chris von Recklinghausen 8d6289d4f0 mm: add pageblock_aligned() macro
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit ee0913c4719610204315a0d8a35122c6233249e0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 7 14:08:44 2022 +0800

    mm: add pageblock_aligned() macro

    Add pageblock_aligned() and use it to simplify code.
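
    The helper is presumably along these lines (a sketch, assuming the usual
    pageblock_nr_pages granularity):

        #define pageblock_aligned(pfn)  IS_ALIGNED((pfn), pageblock_nr_pages)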

    Link: https://lkml.kernel.org/r/20220907060844.126891-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:19 -04:00
Chris von Recklinghausen 556f683f8e mm: reuse pageblock_start/end_pfn() macro
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4f9bc69ac5ce34071a9a51343bc81ca76cb2e3f1
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Sep 7 14:08:42 2022 +0800

    mm: reuse pageblock_start/end_pfn() macro

    Move pageblock_start_pfn/pageblock_end_pfn() into pageblock-flags.h, then
    they could be used somewhere else, not only in compaction, also use
    ALIGN_DOWN() instead of round_down() to be pair with ALIGN(), which should
    be same for pageblock usage.
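
    Sketch of the moved helpers, per the description (ALIGN_DOWN paired with
    ALIGN; not a verbatim quote of the patch):

        #define pageblock_start_pfn(pfn)  ALIGN_DOWN((pfn), pageblock_nr_pages)
        #define pageblock_end_pfn(pfn)    ALIGN((pfn) + 1, pageblock_nr_pages)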

    Link: https://lkml.kernel.org/r/20220907060844.126891-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:18 -04:00
Chris von Recklinghausen 048844959a mm: remove BUG_ON() in __isolate_free_page()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9a157dd8fe5ac32304eff8a7722e30352acaa7f0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 1 09:50:43 2022 +0800

    mm: remove BUG_ON() in __isolate_free_page()

    Drop an unneeded comment and blank line, adjust a variable, and, most
    importantly, delete the BUG_ON().  The page passed into
    __isolate_free_page() from compaction, page_isolation and page_reporting
    is always a buddy page, and the callers also check the return value;
    BUG_ON() is too drastic a measure, so remove it.

    Link: https://lkml.kernel.org/r/20220901015043.189276-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:42 -04:00
Chris von Recklinghausen 0d0c5f8872 mm/page_alloc.c: delete a redundant parameter of rmqueue_pcplist
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 663d0cfd2e77aa3ed170df76d441c8f07efa3cf6
Author: zezuo <zuoze1@huawei.com>
Date:   Wed Aug 31 01:34:04 2022 +0000

    mm/page_alloc.c: delete a redundant parameter of rmqueue_pcplist

    The gfp_flags parameter is not used in rmqueue_pcplist, so directly delete
    this parameter.

    Link: https://lkml.kernel.org/r/20220831013404.3360714-1-zuoze1@huawei.com
    Signed-off-by: zezuo <zuoze1@huawei.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:39 -04:00
Chris von Recklinghausen 39091324ad mm: fix null-ptr-deref in kswapd_is_running()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b4a0215e11dcfe23a48c65c6d6c82c0c2c551a48
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Sat Aug 27 19:19:59 2022 +0800

    mm: fix null-ptr-deref in kswapd_is_running()

    kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with
    kswapd_is_running() in kcompactd(),

    kswapd_run/stop()                       kcompactd()
                                              kswapd_is_running()
      pgdat->kswapd // error or normal ptr
                                              verify pgdat->kswapd
                                                // load non-NULL
                                                //   pgdat->kswapd
      pgdat->kswapd = NULL
                                              task_is_running(pgdat->kswapd)
                                                // Null pointer dereference

    KASAN reports the null-ptr-deref shown below,

      vmscan: Failed to start kswapd on node 0
      ...
      BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
      Read of size 8 at addr 0000000000000024 by task kcompactd0/37

      CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G           OE     5.10.60 #1
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      Call trace:
       dump_backtrace+0x0/0x394
       show_stack+0x34/0x4c
       dump_stack+0x158/0x1e4
       __kasan_report+0x138/0x140
       kasan_report+0x44/0xdc
       __asan_load8+0x94/0xd0
       kcompactd+0x440/0x504
       kthread+0x1a4/0x1f0
       ret_from_fork+0x10/0x18

    At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected
    by mem_hotplug_begin/done(), but without kcompactd(). There is no need to
    involve memory hotplug lock in kcompactd(), so let's add a new mutex to
    protect pgdat->kswapd accesses.

    Also, because the kcompactd task will check the state of kswapd task, it's
    better to call kcompactd_stop() before kswapd_stop() to reduce lock
    conflicts.
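
    A hedged sketch of the serialization described above (the lock and helper
    names are hypothetical, not necessarily what the patch introduces):

        static DEFINE_MUTEX(kswapd_ptr_lock);    /* hypothetical name */

        static void stop_node_kswapd(pg_data_t *pgdat)
        {
                mutex_lock(&kswapd_ptr_lock);
                if (pgdat->kswapd) {
                        kthread_stop(pgdat->kswapd);
                        pgdat->kswapd = NULL;
                }
                mutex_unlock(&kswapd_ptr_lock);
        }

        /* readers such as kswapd_is_running() take the same mutex before
         * dereferencing pgdat->kswapd */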

    [akpm@linux-foundation.org: add comments]
    Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:38 -04:00
Chris von Recklinghausen 46c061d33c page_ext: introduce boot parameter 'early_page_ext'
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c4f20f1479c456d9dd1c1e6d8bf956a25de742dc
Author: Li Zhe <lizhe.67@bytedance.com>
Date:   Thu Aug 25 18:27:14 2022 +0800

    page_ext: introduce boot parameter 'early_page_ext'

    In commit 2f1ee0913c ("Revert "mm: use early_pfn_to_nid in
    page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
    avoid some panic problem.  It seems that we cannot track early page
    allocations in the current kernel even if the page structures have been
    initialized early.

    This patch introduces a new boot parameter 'early_page_ext' to resolve
    this problem.  If we pass it to the kernel, page_ext_init() will be moved
    up and the feature 'deferred initialization of struct pages' will be
    disabled to initialize the page allocator early and prevent the panic
    problem above.  It can help us catch early page allocations.  This is
    especially useful when we find that the free memory value is not the same
    right after different kernel boots.
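
    For example, booting with (page_owner shown only as a typical page_ext
    consumer; the pairing is illustrative):

        early_page_ext page_owner=on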

    [akpm@linux-foundation.org: fix section issue by removing __meminitdata]
    Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
    Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jason A. Donenfeld <Jason@zx2c4.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
    Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:36 -04:00
Chris von Recklinghausen d1f841b840 mm: kill find_min_pfn_with_active_regions()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit fb70c4878d6b3001ef40fa39432a38d8cabdcbf7
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Aug 15 19:10:17 2022 +0800

    mm: kill find_min_pfn_with_active_regions()

    find_min_pfn_with_active_regions() is only called from free_area_init().
    Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
    and kill find_min_pfn_with_active_regions().
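
    The open-coded replacement inside free_area_init(), as described (the
    variable name is illustrative):

        start_pfn = PHYS_PFN(memblock_start_of_DRAM());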

    Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:31 -04:00
Chris von Recklinghausen 7334d36756 mm/page_alloc: only search higher order when fallback
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e933dc4a07b36b835f8ad7085e17cc21ed869051
Author: Abel Wu <wuyun.abel@bytedance.com>
Date:   Wed Aug 3 10:51:21 2022 +0800

    mm/page_alloc: only search higher order when fallback

    It seems unnecessary to search pages with order < alloc_order in
    fallback allocation.

    This can currently happen with ALLOC_NOFRAGMENT and alloc_order >
    pageblock_order, so add a test to prevent it.

    [vbabka@suse.cz: changelog addition]
    Link: https://lkml.kernel.org/r/20220803025121.47018-1-wuyun.abel@bytedance.com
    Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:27 -04:00
Chris von Recklinghausen 51bd718695 page_alloc: remove inactive initialization
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 97bab178e8e4035e0f3a8b1362eec3e86fdcb9ce
Author: Li kunyu <kunyu@nfschina.com>
Date:   Wed Aug 3 14:41:18 2022 +0800

    page_alloc: remove inactive initialization

    The table pointer variable is assigned its allocated address before any
    use in the function, so no initialization assignment is required and no
    invalid pointer can appear.

    Link: https://lkml.kernel.org/r/20220803064118.3664-1-kunyu@nfschina.com
    Signed-off-by: Li kunyu <kunyu@nfschina.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:27 -04:00
Jeff Moyer 8ad32283af mm/vmemmap/devdax: fix kernel crash when probing devdax devices
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217652

Conflicts: RHEL does not have commit 9420f89db2dd ("mm: move most of
  core MM initialization to mm/mm_init.c"), which moves
  compound_nr_pages() and mm_init_zone_device(), so this patch applies
  the changes to the functions in page_alloc.c.

commit 87a7ae75d7383afa998f57656d1d14e2a730cc47
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Tue Apr 11 19:52:13 2023 +0530

    mm/vmemmap/devdax: fix kernel crash when probing devdax devices
    
    commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for
    compound devmaps") added support for using optimized vmmemap for devdax
    devices.  But how vmemmap mappings are created is architecture specific.
    For example, powerpc with hash translation doesn't have vmemmap mappings
    in the init_mm page table; instead they are bolted entries in the
    hardware page table.

    vmemmap_populate_compound_pages(), used by the vmemmap optimization code,
    is not aware of these architecture-specific mappings.  Hence allow
    architectures to opt in to this feature.  I selected architectures
    supporting the HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting
    this feature.
    
    This patch fixes the below crash on ppc64.
    
    BUG: Unable to handle kernel data access on write at 0xc00c000100400038
    Faulting instruction address: 0xc000000001269d90
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in:
    CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
    Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
    NIP:  c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
    REGS: c000000003632c30 TRAP: 0300   Not tainted  (6.3.0-rc5-150500.34-default+)
    MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24842228  XER: 00000000
    CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
    ....
    NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
    LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
    Call Trace:
    [c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
    [c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
    [c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
    [c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
    [c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
    [c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
    [c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
    [c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
    [c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
    [c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
    [c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
    [c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
    [c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
    [c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
    [c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
    [c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
    [c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
    [c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
    [c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
    [c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
    [c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
    [c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
    [c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
    [c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
    [c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
    [c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
    [c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
    [c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
    [c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
    [c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
    [c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
    [c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
    
    Link: https://lkml.kernel.org/r/20230411142214.64464-1-aneesh.kumar@linux.ibm.com
    Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reported-by: Tarun Sahu <tsahu@linux.ibm.com>
    Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-07-20 13:18:35 -04:00
Nico Pache f4370a46cb mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
commit 4d73ba5fa710fe7d432e0b271e6fecd252aef66e
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Apr 14 15:14:29 2023 +0100

    mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages

    A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
    taking an excessive amount of time for large amounts of memory.  Further
    testing of huge page allocation showed that the cost is linear, i.e.  if
    allocating 1G pages in batches of 10 then the time to allocate
    nr_hugepages from 10->20->30->etc increases linearly even though 10 pages
    are allocated at each step.  Profiles indicated that much of the time is
    spent checking the
    validity within already existing huge pages and then attempting a
    migration that fails after isolating the range, draining pages and a whole
    lot of other useless work.

    Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
    pfn_range_valid_contig") removed two checks, one which ignored huge pages
    for contiguous allocations as huge pages can sometimes migrate.  While
    there may be value on migrating a 2M page to satisfy a 1G allocation, it's
    potentially expensive if the 1G allocation fails and it's pointless to try
    moving a 1G page for a new 1G allocation or scan the tail pages for valid
    PFNs.

    Reintroduce the PageHuge check and assume any contiguous region with
    hugetlbfs pages is unsuitable for a new 1G allocation.
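
    The reintroduced check, sketched (exact placement inside
    pfn_range_valid_contig() may differ):

        if (PageHuge(page))
                return false;    /* region contains hugetlbfs pages, skip it */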

    The hpagealloc test allocates huge pages in batches and reports the
    average latency per page over time.  This test happens just after boot
    when fragmentation is not an issue.  Units are in milliseconds.

    hpagealloc
                                   6.3.0-rc6              6.3.0-rc6              6.3.0-rc6
                                     vanilla   hugeallocrevert-v1r1   hugeallocsimple-v1r2
    Min       Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
    1st-qrtle Latency      356.61 (   0.00%)        5.34 (  98.50%)       19.85 (  94.43%)
    2nd-qrtle Latency      697.26 (   0.00%)        5.47 (  99.22%)       20.44 (  97.07%)
    3rd-qrtle Latency      972.94 (   0.00%)        5.50 (  99.43%)       20.81 (  97.86%)
    Max-1     Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
    Max-5     Latency       82.14 (   0.00%)        5.11 (  93.78%)       19.31 (  76.49%)
    Max-10    Latency      150.54 (   0.00%)        5.20 (  96.55%)       19.43 (  87.09%)
    Max-90    Latency     1164.45 (   0.00%)        5.53 (  99.52%)       20.97 (  98.20%)
    Max-95    Latency     1223.06 (   0.00%)        5.55 (  99.55%)       21.06 (  98.28%)
    Max-99    Latency     1278.67 (   0.00%)        5.57 (  99.56%)       22.56 (  98.24%)
    Max       Latency     1310.90 (   0.00%)        8.06 (  99.39%)       26.62 (  97.97%)
    Amean     Latency      678.36 (   0.00%)        5.44 *  99.20%*       20.44 *  96.99%*

                       6.3.0-rc6   6.3.0-rc6   6.3.0-rc6
                         vanilla   revert-v1   hugeallocfix-v2
    Duration User           0.28        0.27        0.30
    Duration System       808.66       17.77       35.99
    Duration Elapsed      830.87       18.08       36.33

    The vanilla kernel is poor, taking up to 1.3 seconds to allocate a huge
    page and almost 10 minutes in total to run the test.  Reverting the
    problematic commit reduces the worst case to 8ms, and with this patch it
    is 26ms.  This patch fixes the main issue by skipping regions with huge
    pages but leaves the page_count() check out because a page with an
    elevated count can potentially migrate.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
    Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
    Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Yuanxi Liu <y.liu@naruida.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:04 -06:00
Nico Pache 4a9fd558e7 mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
commit 1007843a91909a4995ee78a538f62d8665705b66
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Tue Apr 4 23:31:58 2023 +0900

    mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock

    syzbot is reporting a circular locking dependency which involves the
    zonelist_update_seq seqlock [1], because this lock is checked even by
    memory allocation requests which do not need to be retried.

    One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.

      CPU0
      ----
      __build_all_zonelists() {
        write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
        // e.g. timer interrupt handler runs at this moment
          some_timer_func() {
            kmalloc(GFP_ATOMIC) {
              __alloc_pages_slowpath() {
                read_seqbegin(&zonelist_update_seq) {
                  // spins forever because zonelist_update_seq.seqcount is odd
                }
              }
            }
          }
        // e.g. timer interrupt handler finishes
        write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
      }

    This deadlock scenario could easily be eliminated by not calling
    read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
    requests, since retrying applies only to __GFP_DIRECT_RECLAIM allocation
    requests.  But Michal Hocko is not sure whether we should go with this
    approach.

    Another deadlock scenario which syzbot is reporting is a race between
    kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
    port->lock held and printk() from __build_all_zonelists() with
    zonelist_update_seq held.

      CPU0                                   CPU1
      ----                                   ----
      pty_write() {
        tty_insert_flip_string_and_push_buffer() {
                                             __build_all_zonelists() {
                                               write_seqlock(&zonelist_update_seq);
                                               build_zonelists() {
                                                 printk() {
                                                   vprintk() {
                                                     vprintk_default() {
                                                       vprintk_emit() {
                                                         console_unlock() {
                                                           console_flush_all() {
                                                             console_emit_next_record() {
                                                               con->write() = serial8250_console_write() {
          spin_lock_irqsave(&port->lock, flags);
          tty_insert_flip_string() {
            tty_insert_flip_string_fixed_flag() {
              __tty_buffer_request_room() {
                tty_buffer_alloc() {
                  kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                    __alloc_pages_slowpath() {
                      zonelist_iter_begin() {
                        read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
                                                                 spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
                        }
                      }
                    }
                  }
                }
              }
            }
          }
          spin_unlock_irqrestore(&port->lock, flags);
                                                                 // message is printed to console
                                                                 spin_unlock_irqrestore(&port->lock, flags);
                                                               }
                                                             }
                                                           }
                                                         }
                                                       }
                                                     }
                                                   }
                                                 }
                                               }
                                               write_sequnlock(&zonelist_update_seq);
                                             }
        }
      }

    This deadlock scenario can be eliminated by

      preventing interrupt context from calling kmalloc(GFP_ATOMIC)

    and

      preventing printk() from calling console_flush_all()

    while zonelist_update_seq.seqcount is odd.

    Since Petr Mladek thinks that __build_all_zonelists() can become a
    candidate for deferring printk() [2], let's address this problem by

      disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)

    and

      disabling synchronous printk() in order to avoid console_flush_all().

    As a side effect of minimizing the duration for which
    zonelist_update_seq.seqcount is odd by disabling synchronous printk(),
    latency at read_seqbegin(&zonelist_update_seq) will be reduced for both
    !__GFP_DIRECT_RECLAIM and __GFP_DIRECT_RECLAIM allocation requests.
    Although, from a lockdep perspective, not calling
    read_seqbegin(&zonelist_update_seq) from interrupt context (i.e. not
    recording an unnecessary locking dependency) is still preferable, even if
    we don't allow calling kmalloc(GFP_ATOMIC) inside the
    write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
    section...
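    For illustration, a minimal sketch of the resulting ordering in
    __build_all_zonelists(), assuming the printk_deferred_enter() /
    printk_deferred_exit() helpers; this shows the idea rather than the exact
    upstream diff:

        unsigned long flags;

        /* Keep the seqcount-odd window free of GFP_ATOMIC kmalloc and printk. */
        local_irq_save(flags);          /* no interrupt-context allocations */
        printk_deferred_enter();        /* no synchronous console flushing  */
        write_seqlock(&zonelist_update_seq);

        /* ... rebuild the zonelists ... */

        write_sequnlock(&zonelist_update_seq);
        printk_deferred_exit();
        local_irq_restore(flags);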

    Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
    Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
    Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
    Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
      Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
    Cc: John Ogness <john.ogness@linutronix.de>
    Cc: Patrick Daly <quic_pdaly@quicinc.com>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:04 -06:00
Nico Pache 6afbbe2783 Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
commit f446883d12b8bfa486f7c98d403054d61d38c989
Author: Peter Collingbourne <pcc@google.com>
Date:   Thu Mar 9 20:29:13 2023 -0800

    Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"

    This reverts commit 487a32ec24be819e747af8c2ab0d5c515508086a.

    should_skip_kasan_poison() reads the PG_skip_kasan_poison flag from
    page->flags.  However, this line of code in free_pages_prepare():

            page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

    clears most of page->flags, including PG_skip_kasan_poison, before calling
    should_skip_kasan_poison(), which meant that it would never return true as
    a result of the page flag being set.  Therefore, fix the code to call
    should_skip_kasan_poison() before clearing the flags, as we were doing
    before the reverted patch.
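    For illustration, a simplified sketch of the restored ordering in
    free_pages_prepare(), assuming that function's existing local variables
    (page, order, init, fpi_flags); not the exact diff:

        /* Read the skip hint while PG_skip_kasan_poison is still set. */
        bool skip_kasan_poison = should_skip_kasan_poison(page, fpi_flags);

        /* ... other checks and teardown ... */

        /* Clearing the prep flags also clears PG_skip_kasan_poison. */
        page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

        if (!skip_kasan_poison)
                kasan_poison_pages(page, order, init);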

    This fixes a measurable performance regression introduced in the reverted
    commit, where munmap() takes longer than intended if HW tags KASAN is
    supported and enabled at runtime.  Without this patch, we see a
    single-digit percentage performance regression in a particular
    mmap()-heavy benchmark when enabling HW tags KASAN, and with the patch,
    there is no statistically significant performance impact when enabling HW
    tags KASAN.

    Link: https://lkml.kernel.org/r/20230310042914.3805818-2-pcc@google.com
    Fixes: 487a32ec24be ("kasan: drop skip_kasan_poison variable in free_pages_prepare")
      Link: https://linux-review.googlesource.com/id/Ic4f13affeebd20548758438bb9ed9ca40e312b79
    Signed-off-by: Peter Collingbourne <pcc@google.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com> [arm64]
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <stable@vger.kernel.org>    [6.1]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:03 -06:00
Nico Pache 7a909edc93 Fix page corruption caused by racy check in __free_pages
commit 462a8e08e0e6287e5ce13187257edbf24213ed03
Author: David Chen <david.chen@nutanix.com>
Date:   Thu Feb 9 17:48:28 2023 +0000

    Fix page corruption caused by racy check in __free_pages

    When we upgraded our kernel, we started seeing some page corruption like
    the following consistently:

      BUG: Bad page state in process ganesha.nfsd  pfn:1304ca
      page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
      flags: 0x17ffffc0000000()
      raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
      raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
      page dumped because: nonzero mapcount
      CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P    B      O      5.10.158-1.nutanix.20221209.el7.x86_64 #1
      Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
      Call Trace:
       dump_stack+0x74/0x96
       bad_page.cold+0x63/0x94
       check_new_page_bad+0x6d/0x80
       rmqueue+0x46e/0x970
       get_page_from_freelist+0xcb/0x3f0
       ? _cond_resched+0x19/0x40
       __alloc_pages_nodemask+0x164/0x300
       alloc_pages_current+0x87/0xf0
       skb_page_frag_refill+0x84/0x110
       ...

    Sometimes, it would also show up as corruption in the free list pointer
    and cause crashes.

    After bisecting the issue, we found the issue started from commit
    e320d3012d ("mm/page_alloc.c: fix freeing non-compound pages"):

            if (put_page_testzero(page))
                    free_the_page(page, order);
            else if (!PageHead(page))
                    while (order-- > 0)
                            free_the_page(page + (1 << order), order);

    So the problem is that the PageHead check is racy, because at this point
    we have already dropped our reference to the page.  Even if we came in
    with a compound page, the page can already be freed, PageHead can return
    false, and we will end up freeing all the tail pages, causing a double
    free.
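    The fix is to sample PageHead before the reference is dropped; a minimal
    sketch, simplified from the upstream change:

        void __free_pages(struct page *page, unsigned int order)
        {
                /* Sample PageHead while we still hold our reference. */
                int head = PageHead(page);

                if (put_page_testzero(page))
                        free_the_page(page, order);
                else if (!head)
                        /* Non-compound high-order page: free the tail halves. */
                        while (order-- > 0)
                                free_the_page(page + (1 << order), order);
        }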

    Fixes: e320d3012d ("mm/page_alloc.c: fix freeing non-compound pages")
    Link: https://lore.kernel.org/lkml/BYAPR02MB448855960A9656EEA81141FC94D99@BYAPR02MB4488.namprd02.prod.outlook.com/
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:03 -06:00
Nico Pache 5a39f0ca11 mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
commit 15cd90049d595e592d8860ee15a3f23491d54d17
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Thu Oct 6 10:15:40 2022 +0000

    mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page

    PGFREE and PGALLOC represent the number of freed and allocated pages.  So
    the page order must be considered.
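    For illustration, the shape of the fix in the PCP free and allocation
    paths (simplified):

        /* Free path: count all pages in the high-order block, not just one. */
        __count_vm_events(PGFREE, 1 << order);

        /* Allocation path, after a successful PCP allocation: */
        __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);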

    Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com
    Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Jan Stancek a986f224c2 Merge: KVM: aarch64: Rebase (first round towards v6.3)
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2256

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175143

This is the first round of backports for KVM/arch. It mostly goes up to v6.2, with the exception of a few fixes which are post v6.2.
Testing: non-regression tests, kselftests, kvm unit tests.

This first round takes patches/series which are rather independent of mm updates and of the generic/x86 kvm pieces.
The rest will be picked up when their dependencies get resolved.

Among others, the following were omitted on purpose:
- [PATCH v3 0/7] KVM: x86: never write to memory from kvm_vcpu_check_block
- [PATCH v10 0/7] KVM: arm64: Enable ring-based dirty memory tracking
- [PATCH v3 0/4] dirty_log_perf_test cpu pinning and some goodies

v7 -> v8:
- Dropped e1be43d9b5d0 ("overflow: Implement size_t saturating arithmetic helpers") which was pulled downstream
  by Nico Pache (downstream 0c70f0b178)
- Handled a new contextual conflict while cherry-picking 8675c6f22698 ("KVM: selftests: memslot_perf_test: Support variable guest page size") due to 197ebb713ad0 ("KVM: selftests: move common startup logic to kvm_util.c"), recently applied out-of-order.

v6 -> v7:
- rebase since Mark's MR #2025 has been merged. Also removed the depends tag.
- Improved 2 conflict resolution explanations according to Gavin's suggestion
- Full cherry-pick of 1a6182033f2d ("KVM: arm64: selftests: Use FIELD_GET() to
  extract ID register fields") including modifications in aarch32_id_regs.c as
  reported by Connie and Gavin.

v4 -> v5:
- added 797b84517c19 KVM: selftests: Add test for AArch32 ID registers (Gavin)

v3 -> v4:
Added 5 missing fixes (reported by Connie, Prarit, Rafael):
- f850c84948ef ("proc/meminfo: fix spacing in SecPageTables")
- 9aec606c1609 ("tools: include: sync include/api/linux/kvm.h")
- 7a2726ec3290 ("KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them")
- 0cab5b4964c7 ("arm64/sme: Fix context switch for SME only systems")
- d61a12cb9af5 ("KVM: selftests: Fix divide-by-zero bug in memslot_perf_test")

v2 -> v3:
- rebase after mm MR merge
- Took d38ba8ccd9c2 KVM: arm64/mmu: count KVM s2 mmu usage in secondary pagetable stats. So now [PATCH v7 0/4] KVM: mm: count KVM mmu usage in memory stats is fully downstream. This removes a conflict when cherry-picking aa6948f8 ("KVM: arm64: Add per-cpu fixmap infrastructure at EL2")
- "[PATCH v5 0/8] KVM: arm64: permit MAP_SHARED mappings with MTE enabled" fully incorporated
- "[PATCH v5 0/8] arm64/sve: Clean up KVM integration and optimise syscalls" fully incorporated
- I did not take b8f8d190fa8f KVM: arm64: Document the behaviour of S1PTW faults on RO memslots because of a conflict that
  will be resolved after the kvm generic rebase
- added [PATCH v2 0/3] arm64/sysreg: ISR register conversions
- added [PATCH v7 7/7] KVM: arm64: Normalize cache configuration

v1 -> v2:
- Removed aarch64/page_fault_test kselftests because we couldn't backport some of their fixes and they were skipped due to some failure
- Take some fixes post v6.2
- Moved to 'ready' state

Signed-off-by: Eric Auger <eric.auger@redhat.com>

Approved-by: Gavin Shan <gshan@redhat.com>
Approved-by: Cornelia Huck <cohuck@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-10 10:51:45 +02:00
Eric Auger 431fb0900c mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175143
We keep track of several kernel memory stats (total kernel memory, page
tables, stack, vmalloc, etc) on multiple levels (global, per-node,
per-memcg, etc). These stats give users insight into how much memory
is used by the kernel and for what purposes.

Currently, memory used by the KVM mmu is not accounted for in any of those
kernel memory stats. This patch series accounts the memory pages
used by KVM for page tables in those stats, in a new
NR_SECONDARY_PAGETABLE stat. This stat can later be extended to account
for other types of secondary page tables (e.g. iommu page tables).

KVM has a decent number of large allocations that aren't for page
tables, but for most of them, the number/size of those allocations
scales linearly with either the number of vCPUs or the amount of memory
assigned to the VM. KVM's secondary page table allocations do not scale
linearly, especially when nested virtualization is in use.

From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's
per-VM pages_{4k,2m,1g} stats unless the guest is doing something
bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is
forced to allocate a large number of page tables even though the guest
isn't accessing that much memory). However, someone would need to either
understand how KVM works to make that connection, or know (or be told) to
go look at KVM's stats if they're running VMs to better decipher the stats.

Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE
is informative. For example, when backing a VM with THP vs. HugeTLB,
NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order
of magnitude higher with THP. So having this stat will at the very least
prove to be useful for understanding tradeoffs between VM backing types,
and likely even steer folks towards potential optimizations.

The original discussion with more details about the rationale:
https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org

This stat will be used by subsequent patches to count KVM mmu
memory usage.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit ebc97a52b5d6cd5fb0c15a3fc9cdd6eb924646a1)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-05-04 18:25:10 +02:00
Donald Dutile f35c38369a mm: free device private pages have zero refcount
Bugzilla: http://bugzilla.redhat.com/2159905

commit ef233450898f8893dafa193a9f3211fa077a3d05
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Sep 28 22:01:16 2022 +1000

    mm: free device private pages have zero refcount

    Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
    refcount") device private pages have no longer had an extra reference
    count when the page is in use.  However before handing them back to the
    owning device driver we add an extra reference count such that free pages
    have a reference count of one.

    This makes it difficult to tell if a page is free or not, because both
    free and in-use pages will have a non-zero refcount.  Instead we should
    return pages to the driver's page allocator with a zero reference count.
    Kernel code can then safely use kernel functions such as
    get_page_unless_zero().
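    For illustration, a sketch of the pattern this enables in generic code;
    do_something() below is a hypothetical consumer, not an upstream
    function:

        if (get_page_unless_zero(page)) {
                /* refcount was non-zero: the page is in use, we now hold a ref */
                do_something(page);     /* hypothetical user of the page */
                put_page(page);
        } else {
                /* refcount was zero: the page is free, owned by the driver */
        }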

    Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2023-05-02 16:47:26 -04:00
Chris von Recklinghausen 8d4b718382 mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
Bugzilla: https://bugzilla.redhat.com/2160210

commit ea4452de2ae987342fadbdd2c044034e6480daad
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Nov 18 18:00:11 2022 +0800

    mm: fix unexpected changes to {failslab|fail_page_alloc}.attr

    When we specify __GFP_NOWARN, we only expect that no warnings will be
    issued for the current caller.  But in __should_failslab() and
    __should_fail_alloc_page(), the local GFP flags alter the global
    {failslab|fail_page_alloc}.attr, which is persistent and shared by all
    tasks.  This is not what we expected; let's fix it.
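    For illustration, a simplified sketch of the approach: pass the caller's
    no-warn preference per call via should_fail_ex() instead of writing to
    the shared attr (details abbreviated, not the exact upstream diff):

        bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags)
        {
                int flags = 0;

                /*
                 * Honour __GFP_NOWARN only for this call; do not touch the
                 * global, shared failslab.attr.
                 */
                if (gfpflags & __GFP_NOWARN)
                        flags |= FAULT_NOWARN;

                return should_fail_ex(&failslab.attr, s->object_size, flags);
        }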

    [akpm@linux-foundation.org: unexport should_fail_ex()]
    Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
    Fixes: 3f913fc5f974 ("mm: fix missing handler for __GFP_NOWARN")
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Akinobu Mita <akinobu.mita@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:33 -04:00
Chris von Recklinghausen c041aa420c mm/page_alloc: correct the wrong cpuset file path in comment
Bugzilla: https://bugzilla.redhat.com/2160210

commit 189cdcfeeff31a285313c5132b81ae0b998dcad5
Author: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Date:   Mon Jul 18 20:03:35 2022 +0800

    mm/page_alloc: correct the wrong cpuset file path in comment

    cpuset.c was moved to kernel/cgroup/ in the commit below:
    201af4c0fa ("cgroup: move cgroup files under kernel/cgroup/")
    Correct the wrong path in the comment.

    Link: https://lkml.kernel.org/r/20220718120336.5145-1-mark-pk.tsai@mediatek.com
    Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
Chris von Recklinghausen 5063ee2178 mm/page_alloc: use try_cmpxchg in set_pfnblock_flags_mask
Bugzilla: https://bugzilla.redhat.com/2160210

commit 04ec006171badc73f749dc1cd9d66e05f8575a81
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Fri Jul 8 16:07:36 2022 +0200

    mm/page_alloc: use try_cmpxchg in set_pfnblock_flags_mask

    Use try_cmpxchg instead of cmpxchg in set_pfnblock_flags_mask.  The x86
    CMPXCHG instruction returns success in the ZF flag, so this change saves
    a compare after cmpxchg (and a related move instruction in front of
    cmpxchg).  The main loop improves from:

        1c5d:       48 89 c2                mov    %rax,%rdx
        1c60:       48 89 c1                mov    %rax,%rcx
        1c63:       48 21 fa                and    %rdi,%rdx
        1c66:       4c 09 c2                or     %r8,%rdx
        1c69:       f0 48 0f b1 16          lock cmpxchg %rdx,(%rsi)
        1c6e:       48 39 c1                cmp    %rax,%rcx
        1c71:       75 ea                   jne    1c5d <...>

    to:

        1c60:       48 89 ca                mov    %rcx,%rdx
        1c63:       48 21 c2                and    %rax,%rdx
        1c66:       4c 09 c2                or     %r8,%rdx
        1c69:       f0 48 0f b1 16          lock cmpxchg %rdx,(%rsi)
        1c6e:       75 f0                   jne    1c60 <...>
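    The corresponding C pattern, in simplified form (word, mask and flags
    are placeholders for the function's locals, not the exact body):

        unsigned long old, new;

        /* try_cmpxchg() updates 'old' on failure, so no explicit compare. */
        old = READ_ONCE(*word);
        do {
                new = (old & ~mask) | flags;
        } while (!try_cmpxchg(word, &old, new));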

    Link: https://lkml.kernel.org/r/20220708140736.8737-1-ubizjak@gmail.com
    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:26 -04:00
Chris von Recklinghausen e2c13e3883 mm, hugetlb: skip irrelevant nodes in show_free_areas()
Bugzilla: https://bugzilla.redhat.com/2160210

commit dcadcf1c30619ead2f3280bfb7f74de8304be2bb
Author: Gang Li <ligang.bdlg@bytedance.com>
Date:   Wed Jul 6 11:46:54 2022 +0800

    mm, hugetlb: skip irrelevant nodes in show_free_areas()

    show_free_areas() allows to filter out node specific data which is
    irrelevant to the allocation request.  But hugetlb_show_meminfo() still
    shows hugetlb on all nodes, which is redundant and unnecessary.

    Use show_mem_node_skip() to skip irrelevant nodes.  And replace
    hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid).

    before-and-after sample output of OOM:

    before:
    ```
    [  214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB
    [  214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
    [  214.388334] lowmem_reserve[]: 0 0 0 0 0
    [  214.390251] Node 1 Normal: 423*4kB (UE) 320*8kB (UME) 187*16kB (UE) 117*32kB (UE) 57*64kB (UME) 20
    [  214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    [  214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    ```

    after:
    ```
    [  145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u
    [  145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
    [  145.152315] lowmem_reserve[]: 0 0 0 0 0
    [  145.155244] Node 1 Normal: 470*4kB (UME) 373*8kB (UME) 247*16kB (UME) 168*32kB (UE) 86*64kB (UME)
    [  145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    ```

    Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com
    Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:26 -04:00
Chris von Recklinghausen ec2a2090de mm/page_alloc: replace local_lock with normal spinlock
Bugzilla: https://bugzilla.redhat.com/2160210

commit 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:23 2022 +0100

    mm/page_alloc: replace local_lock with normal spinlock

    struct per_cpu_pages is no longer strictly local as PCP lists can be
    drained remotely using a lock for protection.  While the use of local_lock
    works, it goes against the intent of local_lock which is for "pure CPU
    local concurrency control mechanisms and not suited for inter-CPU
    concurrency control" (Documentation/locking/locktypes.rst)

    local_lock protects against migration between when the percpu pointer is
    accessed and the pcp->lock acquired.  The lock acquisition is a preemption
    point so in the worst case, a task could migrate to another NUMA node and
    accidentally allocate remote memory.  The main requirement is to pin the
    task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.

    Replace local_lock with helpers that pin a task to a CPU, lookup the
    per-cpu structure and acquire the embedded lock.  It's similar to
    local_lock without breaking the intent behind the API.  It is not a
    complete API as only the parts needed for PCP-alloc are implemented but in
    theory, the generic helpers could be promoted to a general API if there
    was demand for an embedded lock within a per-cpu struct with a guarantee
    that the per-cpu structure locked matches the running CPU and cannot use
    get_cpu_var due to RT concerns.  PCP requires these semantics to avoid
    accidentally allocating remote memory.
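    For illustration, a simplified sketch of the kind of helper this
    introduces (names abbreviated; the real helpers also handle IRQ flags):

        /* Pin the task, look up this CPU's per_cpu_pages, try its lock. */
        #define pcp_spin_trylock(ptr)                                   \
        ({                                                              \
                struct per_cpu_pages *_pcp;                             \
                pcpu_task_pin();        /* migrate/preempt disable */   \
                _pcp = this_cpu_ptr(ptr);                               \
                if (!spin_trylock(&_pcp->lock)) {                       \
                        pcpu_task_unpin();                              \
                        _pcp = NULL;                                    \
                }                                                       \
                _pcp;                                                   \
        })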

    [mgorman@techsingularity.net: use pcp_spin_trylock_irqsave instead of pcpu_spin_trylock_irqsave]
      Link: https://lkml.kernel.org/r/20220627084645.GA27531@techsingularity.net
    Link: https://lkml.kernel.org/r/20220624125423.6126-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:25 -04:00
Chris von Recklinghausen ad8acdd646 mm/page_alloc: remotely drain per-cpu lists
Bugzilla: https://bugzilla.redhat.com/2160210

commit 443c2accd1b6679a1320167f8f56eed6536b806e
Author: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Date:   Fri Jun 24 13:54:22 2022 +0100

    mm/page_alloc: remotely drain per-cpu lists

    Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
    drain work queued by __drain_all_pages().  So introduce a new mechanism to
    remotely drain the per-cpu lists.  It is made possible by remotely locking
    'struct per_cpu_pages' new per-cpu spinlocks.  A benefit of this new
    scheme is that drain operations are now migration safe.

    There was no observed performance degradation vs.  the previous scheme.
    Both netperf and hackbench were run in parallel with triggering the
    __drain_all_pages(NULL, true) code path around 100 times per second.  The
    new scheme performs a bit better (~5%), although the important point here
    is there are no performance regressions vs.  the previous mechanism.
    Per-cpu lists draining happens only in slow paths.
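    For illustration, a simplified sketch of the remote-drain idea enabled by
    the per-cpu spinlocks (not the exact upstream code):

        int cpu;
        unsigned long flags;

        for_each_online_cpu(cpu) {
                struct per_cpu_pages *pcp =
                        per_cpu_ptr(zone->per_cpu_pageset, cpu);

                if (!pcp->count)
                        continue;

                /* Lock the remote CPU's list directly; no workqueue needed. */
                spin_lock_irqsave(&pcp->lock, flags);
                free_pcppages_bulk(zone, pcp->count, pcp, 0);
                spin_unlock_irqrestore(&pcp->lock, flags);
        }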

    Minchan Kim tested an earlier version and reported:

            My workload is not NOHZ CPUs but run apps under heavy memory
            pressure so they goes to direct reclaim and be stuck on
            drain_all_pages until work on workqueue run.

            unit: nanosecond
            max(dur)        avg(dur)                count(dur)
            166713013       487511.77786438033      1283

            From traces, system encountered the drain_all_pages 1283 times and
            worst case was 166ms and avg was 487us.

            The other problem was alloc_contig_range in CMA. The PCP draining
            takes several hundred millisecond sometimes though there is no
            memory pressure or a few of pages to be migrated out but CPU were
            fully booked.

            Your patch perfectly removed those wasted time.

    Link: https://lkml.kernel.org/r/20220624125423.6126-7-mgorman@techsingularity.net
    Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen 6bd76def3e mm/page_alloc: protect PCP lists with a spinlock
Bugzilla: https://bugzilla.redhat.com/2160210

commit 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:21 2022 +0100

    mm/page_alloc: protect PCP lists with a spinlock

    Currently the PCP lists are protected by using local_lock_irqsave to
    prevent migration and IRQ reentrancy, but this is inconvenient.  Remote
    draining of the lists is impossible, a workqueue is required, and every
    task allocation/free must disable and then enable interrupts, which is
    expensive.

    As preparation for dealing with both of those problems, protect the
    lists with a spinlock.  The IRQ-unsafe version of the lock is used
    because IRQs are already disabled by local_lock_irqsave.  spin_trylock
    is used in combination with local_lock_irqsave() but later will be
    replaced with a spin_trylock_irqsave when the local_lock is removed.

    The per_cpu_pages still fits within the same number of cache lines after
    this patch relative to before the series.

    struct per_cpu_pages {
            spinlock_t                 lock;                 /*     0     4 */
            int                        count;                /*     4     4 */
            int                        high;                 /*     8     4 */
            int                        batch;                /*    12     4 */
            short int                  free_factor;          /*    16     2 */
            short int                  expire;               /*    18     2 */

            /* XXX 4 bytes hole, try to pack */

            struct list_head           lists[13];            /*    24   208 */

            /* size: 256, cachelines: 4, members: 7 */
            /* sum members: 228, holes: 1, sum holes: 4 */
            /* padding: 24 */
    } __attribute__((__aligned__(64)));

    There is overhead in the fast path due to acquiring the spinlock even
    though the spinlock is per-cpu and uncontended in the common case.  The
    Page Fault Test (PFT) reported the following results on a 1-socket
    machine.

                                         5.19.0-rc3               5.19.0-rc3
                                            vanilla      mm-pcpspinirq-v5r16
    Hmean     faults/sec-1   869275.7381 (   0.00%)   874597.5167 *   0.61%*
    Hmean     faults/sec-3  2370266.6681 (   0.00%)  2379802.0362 *   0.40%*
    Hmean     faults/sec-5  2701099.7019 (   0.00%)  2664889.7003 *  -1.34%*
    Hmean     faults/sec-7  3517170.9157 (   0.00%)  3491122.8242 *  -0.74%*
    Hmean     faults/sec-8  3965729.6187 (   0.00%)  3939727.0243 *  -0.66%*

    There is a small hit in the number of faults per second but given that the
    results are more stable, it's borderline noise.

    [akpm@linux-foundation.org: add missing local_unlock_irqrestore() on contention path]
    Link: https://lkml.kernel.org/r/20220624125423.6126-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen 9a94e94bf3 mm/page_alloc: remove mistaken page == NULL check in rmqueue
Bugzilla: https://bugzilla.redhat.com/2160210

commit e2a66c21b774a4e8d0079089fafdc30a31414d40
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:20 2022 +0100

    mm/page_alloc: remove mistaken page == NULL check in rmqueue

    If a page allocation fails, the ZONE_BOOSTED_WATERMARK should be tested,
    cleared and kswapd woken whether the allocation attempt was via the PCP or
    directly via the buddy list.

    Remove the page == NULL so the ZONE_BOOSTED_WATERMARK bit is checked
    unconditionally.  As it is unlikely that ZONE_BOOSTED_WATERMARK is set,
    mark the branch accordingly.

    Link: https://lkml.kernel.org/r/20220624125423.6126-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen a96d6e26bb mm/page_alloc: split out buddy removal code from rmqueue into separate helper
Bugzilla: https://bugzilla.redhat.com/2160210

commit 589d9973c1d2c3344a94a57441071340b0c71097
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:19 2022 +0100

    mm/page_alloc: split out buddy removal code from rmqueue into separate helper

    This is a preparation patch to allow the buddy removal code to be reused in
    a later patch.

    No functional change.

    Link: https://lkml.kernel.org/r/20220624125423.6126-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Tested-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen abf92e2d9f mm/page_alloc: use only one PCP list for THP-sized allocations
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5d0a661d808fc8ddc26940b1a12b82ae356f3ae2
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:18 2022 +0100

    mm/page_alloc: use only one PCP list for THP-sized allocations

    The per_cpu_pages is cache-aligned on a standard x86-64 distribution
    configuration but a later patch will add a new field which would push the
    structure into the next cache line.  Use only one list to store THP-sized
    pages on the per-cpu list.  This assumes that the vast majority of
    THP-sized allocations are GFP_MOVABLE but even if it was another type, it
    would not contribute to serious fragmentation that potentially causes a
    later THP allocation failure.  Align per_cpu_pages on the cacheline
    boundary to ensure there is no false cache sharing.

    After this patch, the structure sizing is;

    struct per_cpu_pages {
            int                        count;                /*     0     4 */
            int                        high;                 /*     4     4 */
            int                        batch;                /*     8     4 */
            short int                  free_factor;          /*    12     2 */
            short int                  expire;               /*    14     2 */
            struct list_head           lists[13];            /*    16   208 */

            /* size: 256, cachelines: 4, members: 6 */
            /* padding: 32 */
    } __attribute__((__aligned__(64)));

    Link: https://lkml.kernel.org/r/20220624125423.6126-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Tested-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen 479efc8006 mm/page_alloc: add page->buddy_list and page->pcp_list
Bugzilla: https://bugzilla.redhat.com/2160210

commit bf75f200569dd05ac2112797f44548beb6b4be26
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jun 24 13:54:17 2022 +0100

    mm/page_alloc: add page->buddy_list and page->pcp_list

    Patch series "Drain remote per-cpu directly", v5.

    Some setups, notably NOHZ_FULL CPUs, may be running realtime or
    latency-sensitive applications that cannot tolerate interference due to
    per-cpu drain work queued by __drain_all_pages().  Introduce a new
    mechanism to remotely drain the per-cpu lists.  It is made possible by
    remotely locking 'struct per_cpu_pages' new per-cpu spinlocks.  This has
    two advantages, the time to drain is more predictable and other unrelated
    tasks are not interrupted.

    This series has the same intent as Nicolas' series "mm/page_alloc: Remote
    per-cpu lists drain support" -- avoid interference of a high priority task
    due to a workqueue item draining per-cpu page lists.  While many workloads
    can tolerate a brief interruption, it may cause a real-time task running
    on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
    non-deterministic.

    Currently an IRQ-safe local_lock protects the page allocator per-cpu
    lists.  The local_lock on its own prevents migration and the IRQ disabling
    protects from corruption due to an interrupt arriving while a page
    allocation is in progress.

    This series adjusts the locking.  A spinlock is added to struct
    per_cpu_pages to protect the list contents while local_lock_irq is
    ultimately replaced by just the spinlock in the final patch.  This allows
    a remote CPU to safely drain a remote per-cpu list.  Follow-on work
    should allow the spin_lock_irqsave
    to be converted to spin_lock to avoid IRQs being disabled/enabled in most
    cases.  The follow-on patch will be one kernel release later as it is
    relatively high risk and it'll make bisections more clear if there are any
    problems.

    Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
            and when it is storing per-cpu pages.

    Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
            this is not necessary but it avoids per_cpu_pages consuming another
            cache line.

    Patch 3 is a preparation patch to avoid code duplication.

    Patch 4 is a minor correction.

    Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
            relying on local_lock to prevent migration, stabilise the pcp
            lookup and prevent IRQ reentrancy.

    Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

    Patch 7 uses a normal spinlock instead of local_lock for remote draining

    This patch (of 7):

    The page allocator uses page->lru for storing pages on either buddy or PCP
    lists.  Create page->buddy_list and page->pcp_list as a union with
    page->lru.  This is simply to clarify what type of list a page is on in
    the page allocator.

    No functional change intended.
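    For illustration, the shape of the change in struct page (simplified):

        union {
                struct list_head lru;        /* generic list usage           */
                struct list_head buddy_list; /* page is on a buddy freelist  */
                struct list_head pcp_list;   /* page is on a per-cpu list    */
        };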

    [minchan@kernel.org: fix page lru fields in macros]
    Link: https://lkml.kernel.org/r/20220624125423.6126-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Tested-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:24 -04:00
Chris von Recklinghausen 05bae72e4f mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d05141a393071e104bf5be5ad4d0c79c6dff343
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Fri Jun 10 16:21:40 2022 +0100

    mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON

    Currently post_alloc_hook() skips the kasan unpoisoning if the tags will
    be zeroed (__GFP_ZEROTAGS) or __GFP_SKIP_KASAN_UNPOISON is passed. Since
    __GFP_ZEROTAGS is now accompanied by __GFP_SKIP_KASAN_UNPOISON, remove
    the extra check.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Link: https://lore.kernel.org/r/20220610152141.2148929-4-catalin.marinas@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:22 -04:00
Chris von Recklinghausen 0bcda21835 mm: kasan: Skip unpoisoning of user pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 70c248aca9e7efa85a6664d5ab56c17c326c958f
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Fri Jun 10 16:21:39 2022 +0100

    mm: kasan: Skip unpoisoning of user pages

    Commit c275c5c6d5 ("kasan: disable freed user page poisoning with HW
    tags") added __GFP_SKIP_KASAN_POISON to GFP_HIGHUSER_MOVABLE. A similar
    argument can be made about unpoisoning, so also add
    __GFP_SKIP_KASAN_UNPOISON to user pages. To ensure the user page is
    still accessible via page_address() without a kasan fault, reset the
    page->flags tag.

    With the above changes, there is no need for the arm64
    tag_clear_highpage() to reset the page->flags tag.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Link: https://lore.kernel.org/r/20220610152141.2148929-3-catalin.marinas@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:22 -04:00
Chris von Recklinghausen 668121b9be mm/page_alloc: make the annotations of available memory more accurate
Bugzilla: https://bugzilla.redhat.com/2160210

commit ade63b419c4e8d27f0642804b6c8c7a76ffc18ac
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Thu Jun 23 02:08:34 2022 +0000

    mm/page_alloc: make the annotations of available memory more accurate

    Not all systems use swap, so estimating available memory would help to
    prevent swapping or OOM on systems that do not use swap.

    And we need to reserve some page cache to prevent swapping or thrashing.
    If somebody is accessing the pages in the page cache, and if too many of
    them were freed, most accesses might mean reading data from disk, i.e.
    thrashing.

    Link: https://lkml.kernel.org/r/20220623020833.972979-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:21 -04:00
Chris von Recklinghausen eed5a2e492 mm: convert destroy_compound_page() to destroy_large_folio()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5375336c8c42a343c3b440b6f1e21c65e7b174b9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:17 2022 +0100

    mm: convert destroy_compound_page() to destroy_large_folio()

    All callers now have a folio, so push the folio->page conversion
    down to this function.

    [akpm@linux-foundation.org: uninline destroy_large_folio() to fix build issue]
    Link: https://lkml.kernel.org/r/20220617175020.717127-20-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen dc32bddd4e mm: introduce clear_highpage_kasan_tagged
Bugzilla: https://bugzilla.redhat.com/2160210

commit d9da8f6cf55eeca642c021912af1890002464c64
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Jun 9 20:18:46 2022 +0200

    mm: introduce clear_highpage_kasan_tagged

    Add a clear_highpage_kasan_tagged() helper that does clear_highpage() on a
    page potentially tagged by KASAN.

    This helper is used by the following patch.
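    For illustration, a sketch of what such a helper looks like (simplified):

        static inline void clear_highpage_kasan_tagged(struct page *page)
        {
                u8 tag;

                /* Temporarily reset the KASAN tag so the access matches. */
                tag = page_kasan_tag(page);
                page_kasan_tag_reset(page);
                clear_highpage(page);
                page_kasan_tag_set(page, tag);
        }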

    Link: https://lkml.kernel.org/r/4471979b46b2c487787ddcd08b9dc5fedd1b6ffd.1654798516.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen a59713cb0c mm: rename kernel_init_free_pages to kernel_init_pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit aeaec8e27eddc147b96fe32df2671980ce7ca87c
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Jun 9 20:18:45 2022 +0200

    mm: rename kernel_init_free_pages to kernel_init_pages

    Rename kernel_init_free_pages() to kernel_init_pages().  This function is
    not only used for free pages but also for pages that were just allocated.

    Link: https://lkml.kernel.org/r/1ecaffc0a9c1404d4d7cf52efe0b2dc8a0c681d8.1654798516.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen 41ec620cfc mm/page_alloc: use might_alloc()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 446ec83805ddaab5b8734d30ba4ae8c56739a9b4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun Jun 5 17:25:37 2022 +0200

    mm/page_alloc: use might_alloc()

    ...  instead of open coding it.  Completely equivalent code, just a notch
    more meaningful when reading.
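    For reference, a sketch of the checks that might_alloc() bundles up and
    that were previously open coded (simplified):

        static inline void might_alloc(gfp_t gfp_mask)
        {
                /* lockdep tracking of reclaim recursion for this gfp mask */
                fs_reclaim_acquire(gfp_mask);
                fs_reclaim_release(gfp_mask);

                /* may sleep if the flags allow blocking */
                might_sleep_if(gfpflags_allow_blocking(gfp_mask));
        }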

    Link: https://lkml.kernel.org/r/20220605152539.3196045-1-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
Chris von Recklinghausen 3fe654e9dd memblock: Disable mirror feature if kernelcore is not specified
Bugzilla: https://bugzilla.redhat.com/2160210

commit 902c2d91582c7ff0cb5f57ffb3766656f9b910c6
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Tue Jun 14 17:21:56 2022 +0800

    memblock: Disable mirror feature if kernelcore is not specified

    If the system has some mirrored memory and the mirror feature is not
    specified in the boot parameters, the basic mirrored feature will be
    enabled and this will lead to the following situations:

    - memblock memory allocation prefers mirrored regions. This may have some
      unexpected influence on numa affinity.

    - contiguous memory will be split into several parts if parts of it
      are mirrored memory via memblock_mark_mirror().

    To fix this, the variable mirrored_kernelcore will be checked in
    memblock_mark_mirror(). Mark mirrored memory with flag MEMBLOCK_MIRROR
    only if kernelcore=mirror is added to the kernel parameters.
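    For illustration, a simplified sketch of the gate added to
    memblock_mark_mirror():

        int memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
        {
                /* Only honour mirrored memory if kernelcore=mirror was given. */
                if (!mirrored_kernelcore)
                        return 0;

                system_has_some_mirror = true;

                return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
        }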

    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Link: https://lore.kernel.org/r/20220614092156.1972846-6-mawupeng1@huawei.com
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
Chris von Recklinghausen a04ba46849 mm: fix is_pinnable_page against a cma page
Bugzilla: https://bugzilla.redhat.com/2160210

commit 1c563432588dbffa71e67ca6e37c826f9fa86e04
Author: Minchan Kim <minchan@kernel.org>
Date:   Tue May 24 10:15:25 2022 -0700

    mm: fix is_pinnable_page against a cma page

    Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so
    the current is_pinnable_page() could miss CMA pages which have
    MIGRATE_ISOLATE.  It ends up pinning CMA pages as longterm for the
    pin_user_pages() API so CMA allocations keep failing until the pin is
    released.

         CPU 0                                   CPU 1 - Task B

    cma_alloc
    alloc_contig_range
                                            pin_user_pages_fast(FOLL_LONGTERM)
    change pageblock as MIGRATE_ISOLATE
                                            internal_get_user_pages_fast
                                            lockless_pages_from_mm
                                            gup_pte_range
                                            try_grab_folio
                                            is_pinnable_page
                                              return true;
                                            So, pinned the page successfully.
    page migration failure with pinned page
                                            ..
                                            .. After 30 sec
                                            unpin_user_page(page)

    CMA allocation succeeded after 30 sec.

    The CMA allocation path protects against the migratetype change race using
    zone->lock, but all the GUP path needs to know is whether the page is in a
    CMA area, not its exact migratetype.  Thus, we don't need zone->lock here;
    it is enough to check whether the migratetype is either MIGRATE_ISOLATE or
    MIGRATE_CMA.

    Adding the MIGRATE_ISOLATE check to is_pinnable_page() could cause pinning
    to be rejected for pages in MIGRATE_ISOLATE pageblocks even when they are
    in neither a CMA area nor the movable zone, if the page is temporarily
    unmovable.  However, migration failures caused by unexpected temporary
    refcount holders are a general issue, not specific to MIGRATE_ISOLATE, and
    MIGRATE_ISOLATE is a transient state just like other temporarily elevated
    refcount problems.
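
    A rough user-space model of the intended check (hypothetical names; the
    real is_pinnable_page() predicate also considers the movable zone and the
    zero page):

        #include <stdbool.h>
        #include <stdio.h>

        enum migratetype { MIGRATE_MOVABLE, MIGRATE_CMA, MIGRATE_ISOLATE };

        /* Treat both CMA and isolated pageblocks as unsuitable for long-term
         * pinning, since an isolated block may be a CMA block mid-allocation. */
        static bool pinnable_model(enum migratetype mt)
        {
            return mt != MIGRATE_CMA && mt != MIGRATE_ISOLATE;
        }

        int main(void)
        {
            printf("movable:  %d\n", pinnable_model(MIGRATE_MOVABLE));  /* 1 */
            printf("cma:      %d\n", pinnable_model(MIGRATE_CMA));      /* 0 */
            printf("isolated: %d\n", pinnable_model(MIGRATE_ISOLATE));  /* 0 */
            return 0;
        }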

    Link: https://lkml.kernel.org/r/20220524171525.976723-1-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen f0971a2aaf mm: split free page with properly free memory accounting and without race
Bugzilla: https://bugzilla.redhat.com/2160210

commit 86d28b0709279ccc636ef9ba267b7f3bcef79a4b
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 26 19:15:31 2022 -0400

    mm: split free page with properly free memory accounting and without race

    In isolate_single_pageblock(), free pages are checked without holding zone
    lock, but they can go away in split_free_page() when zone lock is held.
    Check the free page and its order again in split_free_page() when zone lock
    is held. Recheck the page if the free page is gone under zone lock.

    In addition, in split_free_page(), the free page was deleted from the page
    list without changing free page accounting. Add the missing free page
    accounting code.

    Fix the type of order parameter in split_free_page().

    Link: https://lore.kernel.org/lkml/20220525103621.987185e2ca0079f7b97b856d@linux-foundation.org/
    Link: https://lkml.kernel.org/r/20220526231531.2404977-2-zi.yan@sent.com
    Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: Doug Berger <opendmb@gmail.com>
      Link: https://lore.kernel.org/linux-mm/c3932a6f-77fe-29f7-0c29-fe6b1c67ab7b@gmail.com/
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Qian Cai <quic_qiancai@quicinc.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michael Walle <michael@walle.cc>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen c4ca97baf9 mm: fix a potential infinite loop in start_isolate_page_range()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 88ee134320b8311ca7a00630e5ba013cd0239350
Author: Zi Yan <ziy@nvidia.com>
Date:   Tue May 24 15:47:56 2022 -0400

    mm: fix a potential infinite loop in start_isolate_page_range()

    In isolate_single_pageblock() called by start_isolate_page_range(), there
    are some pageblock isolation issues causing a potential infinite loop when
    isolating a page range.  This is reported by Qian Cai.

    1. the pageblock was isolated by just changing pageblock migratetype
       without checking unmovable pages.  Call set_migratetype_isolate() to
       isolate the pageblock properly.
    2. an off-by-one error caused migrating pages unnecessarily, since the page
       is not crossing pageblock boundary.
    3. migrating a compound page across pageblock boundary then splitting the
       free page later has a small race window that the free page might be
       allocated again, so that the code will try again, causing a potential
       infinite loop. Temporarily set the to-be-migrated page's pageblock to
       MIGRATE_ISOLATE to prevent that and bail out early if no free page is
       found after page migration.

    An additional fix to split_free_page() aims to avoid crashing in
    __free_one_page().  When the free page is split at the specified
    split_pfn_offset, free_page_order should check both the first bit of
    free_page_pfn and the last bit of split_pfn_offset and use the smaller
    one.  For example, if free_page_pfn=0x10000, split_pfn_offset=0xc000,
    free_page_order should first be 0x8000 then 0x4000, instead of 0x4000 then
    0x8000, which the original algorithm did.
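
    The selection can be modelled in user space (illustrative helpers, not the
    kernel code); it reproduces the 0x8000-then-0x4000 sequence from the
    example above:

        #include <stdio.h>

        /* Largest power-of-two chunk allowed by the current pfn alignment. */
        static unsigned long lowest_bit(unsigned long x)  { return x & -x; }

        /* Largest power-of-two chunk that still fits in the remaining offset. */
        static unsigned long highest_bit(unsigned long x)
        {
            unsigned long b = 1;
            while (x >>= 1)
                b <<= 1;
            return b;
        }

        int main(void)
        {
            unsigned long pfn = 0x10000, offset = 0xc000;

            while (offset) {
                unsigned long chunk = lowest_bit(pfn) < highest_bit(offset) ?
                                      lowest_bit(pfn) : highest_bit(offset);

                printf("free chunk of 0x%lx pages at pfn 0x%lx\n", chunk, pfn);
                pfn += chunk;
                offset -= chunk;
            }
            return 0;
        }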

    [akpm@linux-foundation.org: suppress min() warning]
    Link: https://lkml.kernel.org/r/20220524194756.1698351-1-zi.yan@sent.com
    Fixes: b2c9e2fbba3253 ("mm: make alloc_contig_range work at pageblock granularity")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: Qian Cai <quic_qiancai@quicinc.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen a4674660fa mm: fix missing handler for __GFP_NOWARN
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3f913fc5f9745613088d3c569778c9813ab9c129
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Thu May 19 14:08:55 2022 -0700

    mm: fix missing handler for __GFP_NOWARN

    We expect no warnings to be issued when we specify __GFP_NOWARN, but
    currently in paths like alloc_pages() and kmalloc(), there are still some
    warnings printed; fix it.

    But for some warnings that report usage problems, we don't deal with them.
    If such warnings are printed, then we should fix the usage problems
    themselves, as in the following case:

            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
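
    A small user-space model of the intended behaviour (stand-in flag value,
    not the kernel's gfp definitions): allocation-failure style warnings honour
    the flag, while usage-problem warnings stay unconditional.

        #include <stdio.h>

        #define GFP_NOWARN 0x1u            /* stand-in for __GFP_NOWARN */

        static void warn_alloc_model(unsigned int gfp_flags, const char *msg)
        {
            if (gfp_flags & GFP_NOWARN)
                return;                    /* caller asked for silence */
            fprintf(stderr, "warning: %s\n", msg);
        }

        int main(void)
        {
            warn_alloc_model(0, "allocation failed");           /* printed */
            warn_alloc_model(GFP_NOWARN, "allocation failed");  /* suppressed */
            return 0;
        }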

    [zhengqi.arch@bytedance.com: v2]
     Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Akinobu Mita <akinobu.mita@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 5c438547f0 mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 10e0f7530205799e7e971aba699a7cb3a47456de
Author: Wonhyuk Yang <vvghjk1234@gmail.com>
Date:   Thu May 19 14:08:54 2022 -0700

    mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()

    Currently, trace point mm_page_alloc_zone_locked() doesn't show correct
    information.

    First, when alloc_flags has ALLOC_HARDER/ALLOC_CMA, the page can be
    allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA.  Nevertheless, the
    tracepoint uses the requested migration type, not MIGRATE_HIGHATOMIC or
    MIGRATE_CMA.

    Second, after commit 44042b4498 ("mm/page_alloc: allow high-order pages
    to be stored on the per-cpu lists") the percpu list can store high-order
    pages.  But the tracepoint determines whether it is a refill of the percpu
    list by comparing the requested order with 0.

    To handle these problems, make mm_page_alloc_zone_locked() only be called
    by __rmqueue_smallest with correct migration type.  With a new argument
    called percpu_refill, it can show roughly whether it is a refill of
    percpu-list.

    Link: https://lkml.kernel.org/r/20220512025307.57924-1-vvghjk1234@gmail.com
    Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Baik Song An <bsahn@etri.re.kr>
    Cc: Hong Yeon Kim <kimhy@etri.re.kr>
    Cc: Taeung Song <taeung@reallinux.co.kr>
    Cc: <linuxgeek@linuxgeek.io>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 32443a7ae5 mm/memory-failure.c: simplify num_poisoned_pages_dec
Bugzilla: https://bugzilla.redhat.com/2160210

commit c8bd84f73fd6215d5b8d0b3cfc914a3671b16d1c
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: simplify num_poisoned_pages_dec

    Don't decrease the number of poisoned pages in page_alloc.c; let
    memory-failure.c alone do the inc/dec of poisoned pages.

    Also simplify unpoison_memory(), only decrease the number of
    poisoned pages when:
     - TestClearPageHWPoison() succeeds
     - put_page_back_buddy() succeeds

    After decreasing, print necessary log.

    Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

    Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 1369e752bb mm: cma: use pageblock_order as the single alignment
Bugzilla: https://bugzilla.redhat.com/2160210

commit 11ac3e87ce09c27f4587a8c4fe0829d814021a82
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: cma: use pageblock_order as the single alignment

    Now alloc_contig_range() works at pageblock granularity.  Change CMA
    allocation, which uses alloc_contig_range(), to use pageblock_nr_pages
    alignment.

    Link: https://lkml.kernel.org/r/20220425143118.2850746-6-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 9ac05d330c mm: page_isolation: enable arbitrary range page isolation.
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6e263fff1de48fcd97b680b54cd8d1695fc3c776
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: page_isolation: enable arbitrary range page isolation.

    Now start_isolate_page_range() is ready to handle arbitrary range
    isolation, so move the alignment check/adjustment into the function body.
    Do the same for its counterpart undo_isolate_page_range().
    alloc_contig_range(), its caller, can pass an arbitrary range instead of a
    MAX_ORDER_NR_PAGES aligned one.

    Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 69bfe65709 mm: make alloc_contig_range work at pageblock granularity
Bugzilla: https://bugzilla.redhat.com/2160210

commit b2c9e2fbba32539626522b6aed30d1dde7b7e971
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: make alloc_contig_range work at pageblock granularity

    alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
    merging pageblocks with different migratetypes.  It might unnecessarily
    convert extra pageblocks at the beginning and at the end of the range.
    Change alloc_contig_range() to work at pageblock granularity.

    Special handling is needed for free pages and in-use pages across the
    boundaries of the range specified by alloc_contig_range(), because these
    partially isolated pages cause free page accounting issues.  The free
    pages will be split and freed into separate migratetype lists; the in-use
    pages will be migrated, then the freed pages will be handled in the
    aforementioned way.

    [ziy@nvidia.com: fix deadlock/crash]
      Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
    Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 7f32c40d39 mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
Bugzilla: https://bugzilla.redhat.com/2160210

commit b48d8a8e5ce53e3114a1ffe96563e3555b51d40b
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:57 2022 -0700

    mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c

    Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11.

    This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
    and alloc_contig_range(). It prepares for my upcoming changes to make
    MAX_ORDER adjustable at boot time[1].

    The MAX_ORDER - 1 alignment requirement comes from that
    alloc_contig_range() isolates pageblocks to remove free memory from buddy
    allocator but isolating only a subset of pageblocks within a page spanning
    across multiple pageblocks causes free page accounting issues.  An isolated
    page might not be put into the right free list, since the code assumes the
    migratetype of the first pageblock is the migratetype of the whole free page.
    This is based on the discussion at [2].

    To remove the requirement, this patchset:
    1. isolates pages at pageblock granularity instead of
       max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
    2. splits free pages across the specified range or migrates in-use pages
       across the specified range then splits the freed page to avoid free page
       accounting issues (it happens when multiple pageblocks within a single page
       have different migratetypes);
    3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
       range during isolation to avoid alloc_contig_range() failure when pageblocks
       within a MAX_ORDER - 1 aligned range are allocated separately.
    4. returns pages not in the range as it did before.

    One optimization might come later:
    1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
       migratetypes when isolation fails in the middle of the range.

    [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
    [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/

    This patch (of 6):

    has_unmovable_pages() is only used in mm/page_isolation.c.  Move it from
    mm/page_alloc.c and make it static.

    Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: kernel test robot <lkp@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 3f1b5303a4 mm/page_alloc: cache the result of node_dirty_ok()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8a87d6959f0d81108d95b0dbd3d7dc2cecea853d
Author: Wonhyuk Yang <vvghjk1234@gmail.com>
Date:   Thu May 12 20:22:51 2022 -0700

    mm/page_alloc: cache the result of node_dirty_ok()

    To spread dirty pages, nodes are checked whether they have reached the
    dirty limit using the expensive node_dirty_ok().  To reduce the frequency
    of calling node_dirty_ok(), the last node that hit the dirty limit can be
    cached.

    Instead of caching only the node, caching both the node and its
    node_dirty_ok() status can reduce the number of calls to node_dirty_ok().
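
    A user-space model of the caching (hypothetical names): remember the last
    node checked and its result, so consecutive zones on the same node skip
    the expensive call.

        #include <stdbool.h>
        #include <stdio.h>

        static int expensive_checks;

        static bool node_dirty_ok_model(int nid)
        {
            expensive_checks++;
            return nid != 1;               /* arbitrary stand-in result */
        }

        int main(void)
        {
            int zone_nodes[] = { 0, 0, 0, 1, 1, 2 };
            int last_nid = -1;
            bool last_ok = false;

            for (int i = 0; i < 6; i++) {
                if (zone_nodes[i] != last_nid) {
                    last_nid = zone_nodes[i];
                    last_ok = node_dirty_ok_model(last_nid);
                }
                if (!last_ok)
                    continue;
                /* ... zone on a node below its dirty limit: usable ... */
            }
            printf("expensive checks: %d (instead of 6)\n", expensive_checks);
            return 0;
        }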

    [akpm@linux-foundation.org: rename last_pgdat_dirty_limit to last_pgdat_dirty_ok]
    Link: https://lkml.kernel.org/r/20220430011032.64071-1-vvghjk1234@gmail.com
    Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Donghyeok Kim <dthex5d@gmail.com>
    Cc: JaeSang Yoo <jsyoo5b@gmail.com>
    Cc: Jiyoup Kim <lakroforce@gmail.com>
    Cc: Ohhoon Kwon <ohkwon1043@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:03 -04:00
Chris von Recklinghausen f942ace7a2 mm: create new mm/swap.h header file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 014bb1de4fc17d54907d54418126a9a9736f4aff
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:47 2022 -0700

    mm: create new mm/swap.h header file

    Patch series "MM changes to improve swap-over-NFS support".

    Assorted improvements for swap-via-filesystem.

    This is a resend of these patches, rebased on current HEAD.  The only
    substantial change is that swap_dirty_folio has replaced
    swap_set_page_dirty.

    Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
    has previously worked for NFS but that broke a few releases back.  This
    series changes to use a new ->swap_rw rather than ->readpage and
    ->direct_IO.  It also makes other improvements.

    There is a companion series already in linux-next which fixes various
    issues with NFS.  Once both series land, a final patch is needed which
    changes NFS over to use ->swap_rw.

    This patch (of 10):

    Many functions declared in include/linux/swap.h are only used within mm/

    Create a new "mm/swap.h" and move some of these declarations there.
    Remove the redundant 'extern' from the function declarations.

    [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
    Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
    Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen 9529ba417b mm/page_alloc: simplify update of pgdat in wake_all_kswapds
Bugzilla: https://bugzilla.redhat.com/2160210

commit d137a7cb9b2ab8155184b2da9a304afff8f84d36
Author: Chen Wandun <chenwandun@huawei.com>
Date:   Fri Apr 29 14:36:59 2022 -0700

    mm/page_alloc: simplify update of pgdat in wake_all_kswapds

    There is no need to update last_pgdat for each zone, only update
    last_pgdat when iterating the first zone of a node.

    Link: https://lkml.kernel.org/r/20220322115635.2708989-1-chenwandun@huawei.com
    Signed-off-by: Chen Wandun <chenwandun@huawei.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:57 -04:00
Chris von Recklinghausen 399ea3c9ec mm/page_alloc: reuse tail struct pages for compound devmaps
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6fd3620b342861de9547ea01d28f664892ef51a1
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Thu Apr 28 23:16:16 2022 -0700

    mm/page_alloc: reuse tail struct pages for compound devmaps

    Currently memmap_init_zone_device() ends up initializing 32768 pages when
    it only needs to initialize 128 given tail page reuse.  That number is
    worse with 1GB compound pages, 262144 instead of 128.  Update
    memmap_init_zone_device() to skip redundant initialization, detailed
    below.

    When a pgmap @vmemmap_shift is set, all pages are mapped at a given huge
    page alignment and use compound pages to describe them as opposed to a
    struct per 4K.

    With @vmemmap_shift > 0 and when struct pages are stored in ram (!altmap)
    most tail pages are reused.  Consequently, the amount of unique struct
    pages is a lot smaller than the total amount of struct pages being mapped.

    The altmap path is left alone since it does not support memory savings
    based on compound pages devmap.

    Link: https://lkml.kernel.org/r/20220420155310.9712-6-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 5f728eb268 mm/page_alloc.c: calc the right pfn if page size is not 4K
Bugzilla: https://bugzilla.redhat.com/2160210

commit aa282a157bf8ff79bed9164dc5e0e99f0d9e9755
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Thu Apr 28 23:16:14 2022 -0700

    mm/page_alloc.c: calc the right pfn if page size is not 4K

    Previously, 0x100000 was used to check the 4G limit in
    find_zone_movable_pfns_for_nodes().  This is correct on x86 because the
    page size can only be 4K.  But 16K and 64K page sizes are available on
    arm64, so replace it with PHYS_PFN(SZ_4G).
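
    The arithmetic, modelled in user space (PHYS_PFN(x) is x >> PAGE_SHIFT in
    the kernel): only a 4K page size makes the 4G boundary land on pfn 0x100000.

        #include <stdio.h>

        #define SZ_4G (4ULL << 30)

        int main(void)
        {
            unsigned int page_shift[] = { 12, 14, 16 };    /* 4K, 16K, 64K */

            for (int i = 0; i < 3; i++)
                printf("PAGE_SIZE=%2luK -> pfn of 4G = 0x%llx\n",
                       1UL << (page_shift[i] - 10), SZ_4G >> page_shift[i]);
            return 0;
        }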

    Link: https://lkml.kernel.org/r/20220414101314.1250667-8-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 5cea471a90 mm/vmscan: make sure wakeup_kswapd with managed zone
Bugzilla: https://bugzilla.redhat.com/2160210

commit bc53008eea55330f485c956338d3c59f96c70c08
Author: Wei Yang <richard.weiyang@gmail.com>
Date:   Thu Apr 28 23:16:03 2022 -0700

    mm/vmscan: make sure wakeup_kswapd with managed zone

    wakeup_kswapd() only wakes up kswapd when the zone is managed.

    Two callers of wakeup_kswapd() operate from a node perspective:

      * wake_all_kswapds
      * numamigrate_isolate_page

    If we pick up a !managed zone, this is not what we expect.

    This patch makes sure we pick up a managed zone for wakeup_kswapd().  It
    also uses managed_zone in migrate_balanced_pgdat() to get the proper
    zone.

    [richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()]
      Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com
    Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 6cf53831b0 mm: wrap __find_buddy_pfn() with a necessary buddy page validation
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8170ac4700d26f65a9a4ebc8ae488539158dc5f7
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm: wrap __find_buddy_pfn() with a necessary buddy page validation

    Whenever the buddy of a page is found from __find_buddy_pfn(),
    page_is_buddy() should be used to check its validity.  Add a helper
    function find_buddy_page_pfn() to find the buddy page and do the check
    together.
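
    The underlying index arithmetic (not the validity check itself) can be
    shown in a few lines: the buddy of a 2^order block is found by flipping
    bit 'order' of the pfn.

        #include <stdio.h>

        static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
        {
            return pfn ^ (1UL << order);   /* same formula as __find_buddy_pfn() */
        }

        int main(void)
        {
            printf("order 0 buddy of 0x1000: 0x%lx\n", buddy_pfn(0x1000, 0));
            printf("order 3 buddy of 0x1000: 0x%lx\n", buddy_pfn(0x1000, 3));
            printf("order 3 buddy of 0x1008: 0x%lx\n", buddy_pfn(0x1008, 3));
            return 0;
        }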

    [ziy@nvidia.com: updates per David]
    Link: https://lkml.kernel.org/r/20220401230804.1658207-2-zi.yan@sent.com
    Link: https://lore.kernel.org/linux-mm/CAHk-=wji_AmYygZMTsPMdJ7XksMt7kOur8oDfDdniBRMjm4VkQ@mail.gmail.com/
    Link: https://lkml.kernel.org/r/7236E7CA-B5F1-4C04-AB85-E86FA3E9A54B@nvidia.com
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen a99a43f58b mm: page_alloc: simplify pageblock migratetype check in __free_one_page()
Bugzilla: https://bugzilla.redhat.com/2160210

commit bb0e28eb5bc2b3a22e47861ca59bccca566023e8
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm: page_alloc: simplify pageblock migratetype check in __free_one_page()

    Move the pageblock migratetype check code into the while loop to simplify
    the logic.  It also avoids redundant buddy page checking code.

    Link: https://lkml.kernel.org/r/20220401230804.1658207-1-zi.yan@sent.com
    Link: https://lore.kernel.org/linux-mm/27ff69f9-60c5-9e59-feb2-295250077551@suse.cz/
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Suggested-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 57def8f6ec mm/page_alloc: adding same penalty is enough to get round-robin order
Bugzilla: https://bugzilla.redhat.com/2160210

commit 379313241e77abc18258da1afd49d111c72c5a3d
Author: Wei Yang <richard.weiyang@gmail.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm/page_alloc: adding same penalty is enough to get round-robin order

    To get a round-robin node order within the same distance group, we add a
    penalty to the first node we pick in each round.

    To get a round-robin order in the same distance group, we don't need to
    decrease the penalty since:

      * find_next_best_node() always iterates node in the same order
      * distance matters more than penalty in find_next_best_node()
      * in nodes with the same distance, the first one would be picked up

    So it is fine to apply the same penalty when we get the first node in the
    same distance group.  Since we just add a constant of 1 to the node
    penalty, it is not necessary to multiply by MAX_NODE_LOAD for preference.

    [richard.weiyang@gmail.com: remove remove MAX_NODE_LOAD, per Vlastimil]
      Link: https://lkml.kernel.org/r/20220412001319.7462-1-richard.weiyang@gmail.com
    Link: https://lkml.kernel.org/r/20220123013537.20491-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Nico Pache bd508dfa58 mm: prep_compound_tail() clear page->private
commit 5aae9265ee1a30cf716d6caf6b29fe99b9d55130
Author: Hugh Dickins <hughd@google.com>
Date:   Sat Oct 22 00:51:06 2022 -0700

    mm: prep_compound_tail() clear page->private

    Although page allocation always clears page->private in the first page or
    head page of an allocation, it has never made a point of clearing
    page->private in the tails (though 0 is often what is already there).

    But now commit 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t
    during THP split") issues a warning when page_tail->private is found to be
    non-0 (unless it's swapcache).

    Change that warning to dump page_tail (which also dumps head), instead of
    just the head: so far we have seen dead000000000122, dead000000000003,
    dead000000000001 or 0000000000000002 in the raw output for tail private.

    We could just delete the warning, but today's consensus appears to want
    page->private to be 0, unless there's a good reason for it to be set: so
    now clear it in prep_compound_tail() (more general than just for THP; but
    not for high order allocation, which makes no pass down the tails).

    Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com
    Fixes: 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:44 -07:00
Nico Pache 45f65b1711 mm/page_alloc: use local variable zone_idx directly
commit c035290424a9b7b64477752058b460d0ecc21987
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:50 2022 +0800

    mm/page_alloc: use local variable zone_idx directly

    Use local variable zone_idx directly since it holds the exact value of
    zone_idx().  No functional change intended.

    Link: https://lkml.kernel.org/r/20220916072257.9639-10-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache acc09c7bfe mm/page_alloc: add missing is_migrate_isolate() check in set_page_guard()
commit b36184553d41c59e6712f9d4699aca24577fbd4a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:49 2022 +0800

    mm/page_alloc: add missing is_migrate_isolate() check in set_page_guard()

    In MIGRATE_ISOLATE case, zone freepage state shouldn't be modified as
    caller will take care of it.  Add missing is_migrate_isolate() here to
    avoid possible unbalanced freepage state.  This would happen if someone
    isolates the block, and then we face an MCE failure/soft-offline on a page
    within that block.  __mod_zone_freepage_state() will be triggered via the
    call trace below, which had already been triggered back when the block was
    isolated:

    take_page_off_buddy
      break_down_buddy_pages
        set_page_guard
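
    A user-space model of the accounting rule (illustrative, not the kernel
    code): pages in an isolated pageblock are already excluded from the
    freepage counters, so guarding them must not subtract again.

        #include <stdio.h>

        enum migratetype { MIGRATE_MOVABLE, MIGRATE_ISOLATE };

        static long nr_free = 1024;        /* models zone freepage state */

        static void set_page_guard_model(unsigned int order, enum migratetype mt)
        {
            if (mt != MIGRATE_ISOLATE)
                nr_free -= 1L << order;
        }

        int main(void)
        {
            set_page_guard_model(4, MIGRATE_MOVABLE);   /* accounted */
            set_page_guard_model(4, MIGRATE_ISOLATE);   /* skipped */
            printf("nr_free = %ld\n", nr_free);         /* 1008, not 992 */
            return 0;
        }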

    Link: https://lkml.kernel.org/r/20220916072257.9639-9-linmiaohe@huawei.com
    Fixes: 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 350a1bc7ca mm/page_alloc: fix freeing static percpu memory
commit 022e7fa0f73d7c90cf3d6bea3d4e4cc5df1e1087
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:47 2022 +0800

    mm/page_alloc: fix freeing static percpu memory

    The size of struct per_cpu_zonestat can be 0 on !SMP && !NUMA.  In that
    case, zone->per_cpu_zonestats will always be equal to boot_zonestats.  But in
    zone_pcp_reset(), zone->per_cpu_zonestats is freed via free_percpu()
    directly without checking against boot_zonestats first.  boot_zonestats
    will be released by free_percpu() unexpectedly.
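
    A user-space model of the fix (hypothetical types): only free the pointer
    when it no longer refers to the static boot-time fallback.

        #include <stdlib.h>

        static int boot_zonestats_model;               /* static fallback */

        struct zone_model { int *per_cpu_zonestats; };

        static void zone_pcp_reset_model(struct zone_model *z)
        {
            if (z->per_cpu_zonestats != &boot_zonestats_model) {
                free(z->per_cpu_zonestats);            /* real allocation */
                z->per_cpu_zonestats = &boot_zonestats_model;
            }
        }

        int main(void)
        {
            struct zone_model z = { .per_cpu_zonestats = &boot_zonestats_model };

            zone_pcp_reset_model(&z);                  /* safe: nothing to free */
            z.per_cpu_zonestats = malloc(sizeof(int));
            zone_pcp_reset_model(&z);                  /* frees the allocation */
            return 0;
        }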

    Link: https://lkml.kernel.org/r/20220916072257.9639-7-linmiaohe@huawei.com
    Fixes: 28f836b677 ("mm/page_alloc: split per cpu page lists and zone stats")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 7655dc1ad4 mm/page_alloc: add __init annotations to init_mem_debugging_and_hardening()
commit 5749fcc5f04cef4091dea0c2ba6b5c5f5e05a549
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:46 2022 +0800

    mm/page_alloc: add __init annotations to init_mem_debugging_and_hardening()

    It's only called by mm_init(). Add __init annotations to it.

    Link: https://lkml.kernel.org/r/20220916072257.9639-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 067e773fe8 mm/page_alloc: remove obsolete comment in zone_statistics()
commit 709924bc7555db4867403f1f6e51cac4250bca87
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:45 2022 +0800

    mm/page_alloc: remove obsolete comment in zone_statistics()

    Since commit 43c95bcc51 ("mm/page_alloc: reduce duration that IRQs are
    disabled for VM counters"), zone_statistics() is not called with
    interrupts disabled.  Update the corresponding comment.

    Link: https://lkml.kernel.org/r/20220916072257.9639-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache e34506a89f mm: remove obsolete macro NR_PCP_ORDER_MASK and NR_PCP_ORDER_WIDTH
commit 638a9ae97ab596f1f7b7522dad709e69cb5b4e9d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:44 2022 +0800

    mm: remove obsolete macro NR_PCP_ORDER_MASK and NR_PCP_ORDER_WIDTH

    Since commit 8b10b465d0e1 ("mm/page_alloc: free pages in a single pass
    during bulk free"), they're not used anymore.  Remove them.

    Link: https://lkml.kernel.org/r/20220916072257.9639-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 5c785b9d4f mm/page_alloc: make zone_pcp_update() static
commit b89f1735169b8ab54b6a03bf4823657ee4e30073
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:43 2022 +0800

    mm/page_alloc: make zone_pcp_update() static

    Since commit b92ca18e8c ("mm/page_alloc: disassociate the pcp->high from
    pcp->batch"), zone_pcp_update() is only used in mm/page_alloc.c.  Move
    zone_pcp_update() up to avoid forward declaration and then make it static.
    No functional change intended.

    Link: https://lkml.kernel.org/r/20220916072257.9639-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 7b4db7d995 mm/page_alloc: ensure kswapd doesn't accidentally go to sleep
commit ce96fa6223ee851cb83118678f6e75f260852a80
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:42 2022 +0800

    mm/page_alloc: ensure kswapd doesn't accidentally go to sleep

    Patch series "A few cleanup patches for mm", v2.

    This series contains a few cleanup patches to remove the obsolete comments
    and functions, use helper macro to improve readability and so on.  More
    details can be found in the respective changelogs.

    This patch (of 16):

    If ALLOC_KSWAPD is set, wake_all_kswapds() will be called to ensure kswapd
    doesn't accidentally go to sleep.  But when reserve_flags is set,
    alloc_flags will be overwritten and ALLOC_KSWAPD is thus lost.  Preserve
    the ALLOC_KSWAPD flag in alloc_flags to ensure kswapd won't go to sleep
    accidentally.
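
    The flag-preserving pattern, modelled with stand-in values (not the real
    ALLOC_* definitions):

        #include <stdio.h>

        #define ALLOC_KSWAPD 0x8u              /* stand-in bit */

        static unsigned int recompute_flags(unsigned int reserve_flags)
        {
            return reserve_flags;              /* placeholder recomputation */
        }

        int main(void)
        {
            unsigned int alloc_flags = ALLOC_KSWAPD;
            unsigned int reserve_flags = 0x2u;

            /* buggy form: alloc_flags = recompute_flags(reserve_flags); */
            alloc_flags = recompute_flags(reserve_flags) |
                          (alloc_flags & ALLOC_KSWAPD);

            printf("kswapd wakeup bit kept: %s\n",
                   (alloc_flags & ALLOC_KSWAPD) ? "yes" : "no");
            return 0;
        }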

    Link: https://lkml.kernel.org/r/20220916072257.9639-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220916072257.9639-2-linmiaohe@huawei.com
    Fixes: 0a79cdad5e ("mm: use alloc_flags to record if kswapd can wake")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 9e5684f3de mm/page_alloc: fix race condition between build_all_zonelists and page allocation
commit 3d36424b3b5850bd92f3e89b953a430d7cfc88ef
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Wed Aug 24 12:14:50 2022 +0100

    mm/page_alloc: fix race condition between build_all_zonelists and page allocation

    Patrick Daly reported the following problem;

            NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
            [0] - ZONE_MOVABLE
            [1] - ZONE_NORMAL
            [2] - NULL

            For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
            offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
            memory_offline operation removes the last page from ZONE_MOVABLE,
            build_all_zonelists() & build_zonerefs_node() will update
            node_zonelists as shown below. Only populated zones are added.

            NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
            [0] - ZONE_NORMAL
            [1] - NULL
            [2] - NULL

    The race is simple -- page allocation could be in progress when a memory
    hot-remove operation triggers a zonelist rebuild that removes zones.  The
    allocation request will still have a valid ac->preferred_zoneref that is
    now pointing to NULL and triggers an OOM kill.

    This problem probably always existed but may be slightly easier to trigger
    due to 6aa303defb ("mm, vmscan: only allocate and reclaim from zones
    with pages managed by the buddy allocator") which distinguishes between
    zones that are completely unpopulated versus zones that have valid pages
    not managed by the buddy allocator (e.g.  reserved, memblock, ballooning
    etc).  Memory hotplug had multiple stages with timing considerations
    around managed/present page updates, the zonelist rebuild and the zone
    span updates.  As David Hildenbrand puts it

            memory offlining adjusts managed+present pages of the zone
            essentially in one go. If after the adjustments, the zone is no
            longer populated (present==0), we rebuild the zone lists.

            Once that's done, we try shrinking the zone (start+spanned
            pages) -- which results in zone_start_pfn == 0 if there are no
            more pages. That happens *after* rebuilding the zonelists via
            remove_pfn_range_from_zone().

    The only requirement to fix the race is that a page allocation request
    identifies when a zonelist rebuild has happened since the allocation
    request started and no page has yet been allocated.  Use a seqlock_t to
    track zonelist updates with a lockless read-side of the zonelist and
    protecting the rebuild and update of the counter with a spinlock.
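
    A single-threaded user-space model of the retry scheme (hypothetical
    helper names; the kernel uses a seqlock_t rather than a bare counter):

        #include <stdio.h>

        static unsigned int zonelist_gen;      /* bumped by the rebuilder */

        static unsigned int iter_begin(void)       { return zonelist_gen; }
        static int need_retry(unsigned int cookie) { return cookie != zonelist_gen; }

        static int try_allocate(int rebuild_races)
        {
            unsigned int cookie = iter_begin();

            /* ... walk the zonelist via ac->preferred_zoneref here ... */
            if (rebuild_races)
                zonelist_gen++;                /* concurrent zonelist rebuild */

            if (need_retry(cookie))
                return -1;                     /* restart instead of OOM-killing */
            return 0;
        }

        int main(void)
        {
            printf("quiet: %d\n", try_allocate(0));
            printf("raced: %d\n", try_allocate(1));
            return 0;
        }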

    [akpm@linux-foundation.org: make zonelist_update_seq static]
    Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
    Fixes: 6aa303defb ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: <stable@vger.kernel.org>    [4.9+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache ec2c7f4c5b page_alloc: fix invalid watermark check on a negative value
commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13
Author: Jaewon Kim <jaewon31.kim@samsung.com>
Date:   Mon Jul 25 18:52:12 2022 +0900

    page_alloc: fix invalid watermark check on a negative value

    There was a report that a task is waiting at the
    throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was
    increasing.

    This is a bug where zone_watermark_fast returns true even when the free
    count is very low.  Commit f27ce0e140 ("page_alloc: consider highatomic
    reserve in watermark fast") changed the watermark fast path to consider the
    highatomic reserve, but it did not handle the negative value case, which
    can happen when the reserved_highatomic pageblock count is bigger than the
    actual free count.

    If the watermark is considered OK for the negative value, order-0
    allocating contexts will consume all free pages without direct reclaim, and
    eventually free pages may become depleted except for the highatomic free
    pages.

    Then allocating contexts may fall into throttle_direct_reclaim.  This
    symptom can easily happen on a system where wmark min is low and other
    reclaimers like kswapd do not make free pages quickly.

    Handle the negative case by using MIN.
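
    The clamping can be shown with plain integers (illustrative values, not
    the kernel's watermark code): with MIN the usable count can never go
    negative when the highatomic reserve exceeds the actual free pages.

        #include <stdio.h>

        #define MIN(a, b) ((a) < (b) ? (a) : (b))

        static long usable_free(long free_pages, long highatomic_reserve)
        {
            return free_pages - MIN(free_pages, highatomic_reserve);
        }

        int main(void)
        {
            /* reserve (512) larger than free (100): clamped to 0, not -412 */
            printf("usable = %ld\n", usable_free(100, 512));
            printf("usable = %ld\n", usable_free(1000, 512));
            return 0;
        }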

    Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
    Fixes: f27ce0e140 ("page_alloc: consider highatomic reserve in watermark fast")
    Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
    Reported-by: GyeongHwan Hong <gh21.hong@samsung.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Yong-Taek Lee <ytk.lee@samsung.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:40 -07:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE

RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Chris von Recklinghausen 943e17aaec page_alloc: use vmalloc_huge for large system hash
Bugzilla: https://bugzilla.redhat.com/2120352

commit f2edd118d02dd11449b126f786f09749ca152ba5
Author: Song Liu <song@kernel.org>
Date:   Fri Apr 15 09:44:11 2022 -0700

    page_alloc: use vmalloc_huge for large system hash

    Use vmalloc_huge() in alloc_large_system_hash() so that large system
    hash (>= PMD_SIZE) could benefit from huge pages.

    Note that vmalloc_huge only allocates huge pages for systems with
    HAVE_ARCH_HUGE_VMALLOC.
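
    A minimal sketch of the resulting call pattern (the wrapper name
    alloc_hash_area() and the gfp flags are illustrative, not taken from the
    actual alloc_large_system_hash() code; vmalloc_huge() is the upstream
    helper being switched to):

        #include <linux/vmalloc.h>

        /* may be backed by huge pages on HAVE_ARCH_HUGE_VMALLOC arches */
        static void *alloc_hash_area(unsigned long size)
        {
                return vmalloc_huge(size, GFP_KERNEL | __GFP_ZERO);
        }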

    Signed-off-by: Song Liu <song@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen 96f852f113 mm, page_alloc: fix build_zonerefs_node()
Bugzilla: https://bugzilla.redhat.com/2120352

commit e553f62f10d93551eb883eca227ac54d1a4fad84
Author: Juergen Gross <jgross@suse.com>
Date:   Thu Apr 14 19:13:43 2022 -0700

    mm, page_alloc: fix build_zonerefs_node()

    Since commit 6aa303defb ("mm, vmscan: only allocate and reclaim from
    zones with pages managed by the buddy allocator") only zones with free
    memory are included in a built zonelist.  This is problematic when e.g.
    all memory of a zone has been ballooned out when zonelists are being
    rebuilt.

    The decision whether to rebuild the zonelists when onlining new memory
    is done based on populated_zone() returning 0 for the zone the memory
    will be added to.  The new zone is added to the zonelists only, if it
    has free memory pages (managed_zone() returns a non-zero value) after
    the memory has been onlined.  This implies, that onlining memory will
    always free the added pages to the allocator immediately, but this is
    not true in all cases: when e.g. running as a Xen guest the onlined new
    memory will be added only to the ballooned memory list, it will be freed
    only when the guest is being ballooned up afterwards.

    Another problem with using managed_zone() for the decision whether a
    zone is being added to the zonelists is, that a zone with all memory
    used will in fact be removed from all zonelists in case the zonelists
    happen to be rebuilt.

    Use populated_zone() when building a zonelist as it has been done before
    that commit.

    There was a report that QubesOS (based on Xen) is hitting this problem.
    Xen has switched to use the zone device functionality in kernel 5.9 and
    QubesOS wants to use memory hotplugging for guests in order to be able
    to start a guest with minimal memory and expand it as needed.  This was
    the report leading to the patch.
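
    A simplified sketch of the restored check in the zonelist builder
    (condensed from mm/page_alloc.c, not a verbatim copy):

        do {
                zone_type--;
                zone = pgdat->node_zones + zone_type;
                if (populated_zone(zone)) {     /* was: managed_zone(zone) */
                        zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                        check_highest_zone(zone_type);
                }
        } while (zone_type);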

    Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
    Fixes: 6aa303defb ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
    Signed-off-by: Juergen Gross <jgross@suse.com>
    Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
    Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen a104807957 Revert "mm/page_alloc: mark pagesets as __maybe_unused"
Bugzilla: https://bugzilla.redhat.com/2120352

commit 273ba85b5e8b971ed28eb5c17e1638543be9237d
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Mar 28 16:58:10 2022 +0200

    Revert "mm/page_alloc: mark pagesets as __maybe_unused"

    The local_lock() is now using a proper static inline function which is
    enough for llvm to accept that the variable is used.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220328145810.86783-4-bigeasy@linutronix.de

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:04 -04:00
Chris von Recklinghausen 21608e466e mm: page_alloc: validate buddy before check its migratetype.
Bugzilla: https://bugzilla.redhat.com/2120352

commit 787af64d05cd528aac9ad16752d11bb1c6061bb9
Author: Zi Yan <ziy@nvidia.com>
Date:   Wed Mar 30 15:45:43 2022 -0700

    mm: page_alloc: validate buddy before check its migratetype.

    Whenever a buddy page is found, page_is_buddy() should be called to
    check its validity.  Add the missing check during pageblock merge check.
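
    A sketch of the added guard, following the description above (variable
    names approximate the free path's merge loop; not a verbatim diff):

        buddy = page + (buddy_pfn - pfn);

        /* validate the buddy before reading its pageblock migratetype */
        if (!page_is_buddy(page, buddy, order))
                goto done_merging;

        buddy_mt = get_pageblock_migratetype(buddy);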

    Fixes: 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable pageblocks with others")
    Link: https://lore.kernel.org/all/20220330154208.71aca532@gandalf.local.home/
    Reported-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:03 -04:00
Chris von Recklinghausen 397b77192d kasan, page_alloc: allow skipping memory init for HW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9353ffa6e9e90d2b6348209cf2b95a8ffee18711
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:29 2022 -0700

    kasan, page_alloc: allow skipping memory init for HW_TAGS

    Add a new GFP flag __GFP_SKIP_ZERO that allows to skip memory
    initialization.  The flag is only effective with HW_TAGS KASAN.

    This flag will be used by vmalloc code for page_alloc allocations backing
    vmalloc() mappings in a following patch.  The reason to skip memory
    initialization for these pages in page_alloc is because vmalloc code will
    be initializing them instead.

    With the current implementation, when __GFP_SKIP_ZERO is provided,
    __GFP_ZEROTAGS is ignored.  This doesn't matter, as these two flags are
    never provided at the same time.  However, if this is changed in the
    future, this particular implementation detail can be changed as well.
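
    A hedged illustration of a caller that does its own initialization and
    therefore asks page_alloc to skip it (the surrounding code is
    illustrative; with anything other than HW_TAGS KASAN the flag is a no-op):

        struct page *page;

        /* skip page_alloc's memory init; the caller initializes the pages */
        page = alloc_pages(GFP_KERNEL | __GFP_SKIP_ZERO, order);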

    Link: https://lkml.kernel.org/r/0d53efeff345de7d708e0baa0d8829167772521e.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:00 -04:00
Chris von Recklinghausen 17be80da62 kasan, page_alloc: allow skipping unpoisoning for HW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 53ae233c30a623ff44ff2f83854e92530c5d9fc2
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:26 2022 -0700

    kasan, page_alloc: allow skipping unpoisoning for HW_TAGS

    Add a new GFP flag __GFP_SKIP_KASAN_UNPOISON that allows skipping KASAN
    poisoning for page_alloc allocations.  The flag is only effective with
    HW_TAGS KASAN.

    This flag will be used by vmalloc code for page_alloc allocations backing
    vmalloc() mappings in a following patch.  The reason to skip KASAN
    poisoning for these pages in page_alloc is because vmalloc code will be
    poisoning them instead.

    Also reword the comment for __GFP_SKIP_KASAN_POISON.

    Link: https://lkml.kernel.org/r/35c97d77a704f6ff971dd3bfe4be95855744108e.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:00 -04:00
Chris von Recklinghausen cc6b8ef6d0 kasan, page_alloc: rework kasan_unpoison_pages call site
Bugzilla: https://bugzilla.redhat.com/2120352

commit e9d0ca9228162f5442b751edf8c9721b15dcfa1e
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:43 2022 -0700

    kasan, page_alloc: rework kasan_unpoison_pages call site

    Rework the checks around kasan_unpoison_pages() call in post_alloc_hook().

    The logical condition for calling this function is:

     - If a software KASAN mode is enabled, we need to mark shadow memory.

     - Otherwise, HW_TAGS KASAN is enabled, and it only makes sense to set
       tags if they haven't already been cleared by tag_clear_highpage(),
       which is indicated by init_tags.

    This patch concludes the changes for post_alloc_hook().

    Link: https://lkml.kernel.org/r/0ecebd0d7ccd79150e3620ea4185a32d3dfe912f.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen e7f379f2b5 kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7e3cbba65de22f20ad18a2de09f65238bfe84c5b
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:40 2022 -0700

    kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook

    Pull the kernel_init_free_pages() call in post_alloc_hook() out of the big
    if clause for better code readability.  This also allows for more
    simplifications in the following patch.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/a7a76456501eb37ddf9fca6529cee9555e59cdb1.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen d1e56915b0 kasan, page_alloc: move SetPageSkipKASanPoison in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit 89b2711633281b3d712b1df96c5065a82ccbfb9c
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:37 2022 -0700

    kasan, page_alloc: move SetPageSkipKASanPoison in post_alloc_hook

    Pull the SetPageSkipKASanPoison() call in post_alloc_hook() out of the big
    if clause for better code readability.  This also allows for more
    simplifications in the following patches.

    Also turn the kasan_has_integrated_init() check into the proper
    kasan_hw_tags_enabled() one.  These checks evaluate to the same value, but
    logically skipping kasan poisoning has nothing to do with integrated init.

    Link: https://lkml.kernel.org/r/7214c1698b754ccfaa44a792113c95cc1f807c48.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 96a3439b72 kasan, page_alloc: combine tag_clear_highpage calls in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9294b1281d0a212ef775a175b98ce71e6ac27b90
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:34 2022 -0700

    kasan, page_alloc: combine tag_clear_highpage calls in post_alloc_hook

    Move tag_clear_highpage() loops out of the kasan_has_integrated_init()
    clause as a code simplification.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/587e3fc36358b88049320a89cc8dc6deaecb0cda.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen acc738b6b9 kasan, page_alloc: merge kasan_alloc_pages into post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit b42090ae6f3aa07b0a39403545d688489548a6a8
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:31 2022 -0700

    kasan, page_alloc: merge kasan_alloc_pages into post_alloc_hook

    Currently, the code responsible for initializing and poisoning memory in
    post_alloc_hook() is scattered across two locations: kasan_alloc_pages()
    hook for HW_TAGS KASAN and post_alloc_hook() itself.  This is confusing.

    This and a few following patches combine the code from these two
    locations.  Along the way, these patches restructure the performed checks
    step by step to make them easier to follow.

    Replace the only caller of kasan_alloc_pages() with its implementation.

    As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
    enabled, moving the code does no functional changes.

    Also move init and init_tags variables definitions out of
    kasan_has_integrated_init() clause in post_alloc_hook(), as they have the
    same values regardless of what the if condition evaluates to.

    This patch is not useful by itself but makes the simplifications in the
    following patches easier to follow.

    Link: https://lkml.kernel.org/r/5ac7e0b30f5cbb177ec363ddd7878a3141289592.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 49736c675e kasan, page_alloc: refactor init checks in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit b8491b9052fef036aac0ca3afc18ef223aef6f61
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:28 2022 -0700

    kasan, page_alloc: refactor init checks in post_alloc_hook

    Separate code for zeroing memory from the code clearing tags in
    post_alloc_hook().

    This patch is not useful by itself but makes the simplifications in the
    following patches easier to follow.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/2283fde963adfd8a2b29a92066f106cc16661a3c.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen cd5acd85e1 kasan: drop skip_kasan_poison variable in free_pages_prepare
Bugzilla: https://bugzilla.redhat.com/2120352

commit 487a32ec24be819e747af8c2ab0d5c515508086a
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:19 2022 -0700

    kasan: drop skip_kasan_poison variable in free_pages_prepare

    skip_kasan_poison is only used in a single place.  Call
    should_skip_kasan_poison() directly for simplicity.

    Link: https://lkml.kernel.org/r/1d33212e79bc9ef0b4d3863f903875823e89046f.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Suggested-by: Marco Elver <elver@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 37a79115b8 kasan, page_alloc: init memory of skipped pages on free
Bugzilla: https://bugzilla.redhat.com/2120352

commit db8a04774a8195c529f1e87cd1df87f116559b52
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:16 2022 -0700

    kasan, page_alloc: init memory of skipped pages on free

    Since commit 7a3b835371 ("kasan: use separate (un)poison implementation
    for integrated init"), when all init, kasan_has_integrated_init(), and
    skip_kasan_poison are true, free_pages_prepare() doesn't initialize the
    page.  This is wrong.

    Fix it by remembering whether kasan_poison_pages() performed
    initialization, and call kernel_init_free_pages() if it didn't.

    Reordering kasan_poison_pages() and kernel_init_free_pages() is OK, since
    kernel_init_free_pages() can handle poisoned memory.
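
    A sketch of the fixed ordering in free_pages_prepare(), condensed from
    the description above (not a verbatim diff):

        if (!should_skip_kasan_poison(page, fpi_flags)) {
                kasan_poison_pages(page, order, init);

                /* memory is already initialized if KASAN did it internally */
                if (kasan_has_integrated_init())
                        init = false;
        }
        if (init)
                kernel_init_free_pages(page, 1 << order);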

    Link: https://lkml.kernel.org/r/1d97df75955e52727a3dc1c4e33b3b50506fc3fd.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 1c9fec123a kasan, page_alloc: simplify kasan_poison_pages call site
Bugzilla: https://bugzilla.redhat.com/2120352

commit c3525330a04d0a47b4e11f5cf6d44e21a6520885
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:13 2022 -0700

    kasan, page_alloc: simplify kasan_poison_pages call site

    Simplify the code around calling kasan_poison_pages() in
    free_pages_prepare().

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/ae4f9bcf071577258e786bcec4798c145d718c46.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 722c33889c kasan, page_alloc: merge kasan_free_pages into free_pages_prepare
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7c13c163e036c646b77753deacfe2f5478b654bc
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:10 2022 -0700

    kasan, page_alloc: merge kasan_free_pages into free_pages_prepare

    Currently, the code responsible for initializing and poisoning memory in
    free_pages_prepare() is scattered across two locations: kasan_free_pages()
    for HW_TAGS KASAN and free_pages_prepare() itself.  This is confusing.

    This and a few following patches combine the code from these two
    locations.  Along the way, these patches also simplify the performed
    checks to make them easier to follow.

    Replaces the only caller of kasan_free_pages() with its implementation.

    As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
    enabled, moving the code does no functional changes.

    This patch is not useful by itself but makes the simplifications in the
    following patches easier to follow.

    Link: https://lkml.kernel.org/r/303498d15840bb71905852955c6e2390ecc87139.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 3bd6fc41ae kasan, page_alloc: move tag_clear_highpage out of kernel_init_free_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5b2c07138cbd8c0c415c6d3ff5b8040532024814
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:07 2022 -0700

    kasan, page_alloc: move tag_clear_highpage out of kernel_init_free_pages

    Currently, kernel_init_free_pages() serves two purposes: it either only
    zeroes memory or zeroes both memory and memory tags via a different code
    path.  As this function has only two callers, each using only one code
    path, this behaviour is confusing.

    Pull the code that zeroes both memory and tags out of
    kernel_init_free_pages().

    As a result of this change, the code in free_pages_prepare() starts to
    look complicated, but this is improved in the few following patches.
    Those improvements are not integrated into this patch to make diffs easier
    to read.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/7719874e68b23902629c7cf19f966c4fd5f57979.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 58c7b65b06 kasan, page_alloc: deduplicate should_skip_kasan_poison
Bugzilla: https://bugzilla.redhat.com/2120352

commit 94ae8b83fefcdaf281e0bcfb76a19f5ed5019c8d
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:04 2022 -0700

    kasan, page_alloc: deduplicate should_skip_kasan_poison

    Patch series "kasan, vmalloc, arm64: add vmalloc tagging support for SW/HW_TAGS", v6.

    This patchset adds vmalloc tagging support for SW_TAGS and HW_TAGS
    KASAN modes.

    About half of patches are cleanups I went for along the way.  None of them
    seem to be important enough to go through stable, so I decided not to
    split them out into separate patches/series.

    The patchset is partially based on an early version of the HW_TAGS
    patchset by Vincenzo that had vmalloc support.  Thus, I added a
    Co-developed-by tag into a few patches.

    SW_TAGS vmalloc tagging support is straightforward.  It reuses all of the
    generic KASAN machinery, but uses shadow memory to store tags instead of
    magic values.  Naturally, vmalloc tagging requires adding a few
    kasan_reset_tag() annotations to the vmalloc code.

    HW_TAGS vmalloc tagging support stands out.  HW_TAGS KASAN is based on Arm
    MTE, which can only assign tags to physical memory.  As a result, HW_TAGS
    KASAN only tags vmalloc() allocations, which are backed by page_alloc
    memory.  It ignores vmap() and others.

    This patch (of 39):

    Currently, should_skip_kasan_poison() has two definitions: one for when
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, one for when it's not.

    Instead of duplicating the checks, add a deferred_pages_enabled() helper
    and use it in a single should_skip_kasan_poison() definition.

    Also move should_skip_kasan_poison() closer to its caller and clarify all
    conditions in the comment.
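
    A sketch of the helper this patch introduces (assuming the existing
    deferred_pages static key under CONFIG_DEFERRED_STRUCT_PAGE_INIT; the
    upstream code defines the two variants in separate #ifdef branches):

        static inline bool deferred_pages_enabled(void)
        {
        #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                return static_branch_unlikely(&deferred_pages);
        #else
                return false;
        #endif
        }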

    Link: https://lkml.kernel.org/r/cover.1643047180.git.andreyknvl@google.com
    Link: https://lkml.kernel.org/r/658b79f5fb305edaf7dc16bc52ea870d3220d4a8.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 63534db797 NUMA balancing: optimize page placement for memory tiering system
Bugzilla: https://bugzilla.redhat.com/2120352

commit c574bbe917036c8968b984c82c7b13194fe5ce98
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Mar 22 14:46:23 2022 -0700

    NUMA balancing: optimize page placement for memory tiering system

    With the advent of various new memory types, some machines will have
    multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
    memory subsystem of these machines can be called memory tiering system,
    because the performance of the different types of memory are usually
    different.

    In such a system, because the memory access pattern changes over time,
    some pages in the slow memory may become globally hot.  So in this
    patch, the NUMA balancing mechanism is enhanced to optimize the page
    placement among the different memory types according to hot/cold
    status dynamically.

    In a typical memory tiering system, there are CPUs, fast memory and slow
    memory in each physical NUMA node.  The CPUs and the fast memory will be
    put in one logical node (called fast memory node), while the slow memory
    will be put in another (faked) logical node (called slow memory node).
    That is, the fast memory is regarded as local while the slow memory is
    regarded as remote.  So it's possible for the recently accessed pages in
    the slow memory node to be promoted to the fast memory node via the
    existing NUMA balancing mechanism.

    The original NUMA balancing mechanism will stop migrating pages if the
    free memory of the target node falls below the high watermark.  This
    is a reasonable policy if there's only one memory type, but it makes
    the original NUMA balancing mechanism almost useless for optimizing
    page placement among different memory types.  Details are as follows.

    It is common for the working-set size of the workload to be larger
    than the size of the fast memory nodes; otherwise, there would be no
    need to use the slow memory at all.  So there are almost never enough
    free pages in the fast memory nodes, which means the globally hot
    pages in the slow memory node cannot be promoted to the fast memory
    node.  To solve the issue, we have two choices:

    a. Ignore the free pages watermark checking when promoting hot pages
       from the slow memory node to the fast memory node.  This will
       create some memory pressure in the fast memory node, thus trigger
       the memory reclaiming.  So that, the cold pages in the fast memory
       node will be demoted to the slow memory node.

    b. Define a new watermark called wmark_promo which is higher than
       wmark_high, and have kswapd reclaim pages until free pages reach
       that watermark.  The scenario is as follows: when we want to promote
       hot pages from a slow memory node to a fast memory node, but the fast
       memory's free pages would drop below the high watermark with such a
       promotion, we wake up kswapd with the wmark_promo watermark in order
       to demote cold pages and free up some space.  So, the next time we
       want to promote hot pages we might have a chance of doing so.

    The choice "a" may create high memory pressure in the fast memory node.
    If the memory pressure of the workload is high, the memory pressure
    may become so high that the memory allocation latency of the workload
    is influenced, e.g.  the direct reclaiming may be triggered.

    The choice "b" works much better at this aspect.  If the memory
    pressure of the workload is high, the hot pages promotion will stop
    earlier because its allocation watermark is higher than that of the
    normal memory allocation.  So in this patch, choice "b" is implemented.
    A new zone watermark (WMARK_PROMO) is added.  Which is larger than the
    high watermark and can be controlled via watermark_scale_factor.

    In addition to the original page placement optimization among sockets,
    the NUMA balancing mechanism is extended to be used to optimize page
    placement according to hot/cold among different memory types.  So the
    sysctl user space interface (numa_balancing) is extended in a backward
    compatible way as follows, so that users can enable/disable these
    functionalities individually.

    The sysctl is converted from a Boolean value to a bit field.  The
    definitions of the flags are:

    - 0: NUMA_BALANCING_DISABLED
    - 1: NUMA_BALANCING_NORMAL
    - 2: NUMA_BALANCING_MEMORY_TIERING
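
    Sketched as C definitions for reference (names from the list above; the
    exact header location is not spelled out here):

        #define NUMA_BALANCING_DISABLED         0x0
        #define NUMA_BALANCING_NORMAL           0x1
        #define NUMA_BALANCING_MEMORY_TIERING   0x2

        /* e.g. writing 3 (NORMAL | MEMORY_TIERING) to
         * /proc/sys/kernel/numa_balancing enables both behaviours */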

    We have tested the patch with the pmbench memory accessing benchmark
    with the 80:20 read/write ratio and the Gauss access address
    distribution on a 2 socket Intel server with Optane DC Persistent
    Memory Model.  The test results show that the pmbench score can
    improve by up to 95.9%.

    Thanks Andrew Morton to help fix the document format error.

    Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
Chris von Recklinghausen 92747f6c92 mm/hwpoison-inject: support injecting hwpoison to free page
Bugzilla: https://bugzilla.redhat.com/2120352

commit a581865ecd0a5a0b8464d6f1e668ae6681c1572f
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:35 2022 -0700

    mm/hwpoison-inject: support injecting hwpoison to free page

    memory_failure() can handle free buddy page.  Support injecting hwpoison
    to free page by adding is_free_buddy_page check when hwpoison filter is
    disabled.

    [akpm@linux-foundation.org: export is_free_buddy_page() to modules]

    Link: https://lkml.kernel.org/r/20220218092052.3853-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 768e07c448 mm/page_alloc: call check_new_pages() while zone spinlock is not held
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3313204c8ad553cf93f1ee8cc89456c73a7df938
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Mar 22 14:43:57 2022 -0700

    mm/page_alloc: call check_new_pages() while zone spinlock is not held

    For high order pages not using pcp, rmqueue() is currently calling the
    costly check_new_pages() while zone spinlock is held, and hard irqs
    masked.

    This is not needed, we can release the spinlock sooner to reduce zone
    spinlock contention.

    Note that after this patch, we call __mod_zone_freepage_state() before
    deciding to leak the page because it is in bad state.
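
    A condensed sketch of the reordered slow path (the real rmqueue() loops
    and retries the allocation when the check fails; this is illustrative
    only):

        spin_lock_irqsave(&zone->lock, flags);
        page = __rmqueue(zone, order, migratetype, alloc_flags);
        if (page)
                __mod_zone_freepage_state(zone, -(1 << order),
                                          get_pcppage_migratetype(page));
        spin_unlock_irqrestore(&zone->lock, flags);

        /* the expensive checks now run without the zone lock held */
        if (page && check_new_pages(page, order))
                page = NULL;    /* bad page: leak it instead of handing it out */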

    Link: https://lkml.kernel.org/r/20220304170215.1868106-1-eric.dumazet@gmail.com
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen abe1eb721b mm: count time in drain_all_pages during direct reclaim as memory pressure
Bugzilla: https://bugzilla.redhat.com/2120352

commit fa7fc75f6319dcd044e332ad309a86126a610bdf
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Tue Mar 22 14:43:54 2022 -0700

    mm: count time in drain_all_pages during direct reclaim as memory pressure

    When page allocation in direct reclaim path fails, the system will make
    one attempt to shrink per-cpu page lists and free pages from high alloc
    reserves.  Draining per-cpu pages into buddy allocator can be a very
    slow operation because it's done using workqueues and the task in direct
    reclaim waits for all of them to finish before proceeding.  Currently
    this time is not accounted as psi memory stall.

    While testing mobile devices under extreme memory pressure, when
    allocations are failing during direct reclaim, we noticed that psi
    events which would be expected in such conditions were not triggered.
    After profiling these cases it was determined that the reason for
    missing psi events was that a big chunk of time spent in direct reclaim
    is not accounted as memory stall, therefore psi would not reach the
    levels at which an event is generated.  Further investigation revealed
    that the bulk of that unaccounted time was spent inside drain_all_pages
    call.

    A typical captured case when drain_all_pages path gets activated:

    __alloc_pages_slowpath  took 44.644.613ns
        __perform_reclaim   took    751.668ns (1.7%)
        drain_all_pages     took 43.887.167ns (98.3%)

    PSI in this case records the time spent in __perform_reclaim but ignores
    drain_all_pages, IOW it misses 98.3% of the time spent in
    __alloc_pages_slowpath.

    Annotate __alloc_pages_direct_reclaim in its entirety so that delays
    from handling page allocation failure in the direct reclaim path are
    accounted as memory stall.
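
    The shape of the annotation, sketched with the existing psi helpers
    (condensed; the reclaim and retry logic is elided):

        unsigned long pflags;

        psi_memstall_enter(&pflags);
        /* __perform_reclaim(), get_page_from_freelist() and the
         * drain_all_pages(NULL) retry all run inside this section now */
        psi_memstall_leave(&pflags);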

    Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: Tim Murray <timmurray@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 011c300849 mm: enforce pageblock_order < MAX_ORDER
Bugzilla: https://bugzilla.redhat.com/2120352

commit b3d40a2b6d10c9d0424d2b398bf962fb6adad87e
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:43:20 2022 -0700

    mm: enforce pageblock_order < MAX_ORDER

    Some places in the kernel don't really expect pageblock_order >=
    MAX_ORDER, and it looks like this is only possible in corner cases:

    1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
       pages via __free_pages_core(), which cannot possibly work.

    2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE
       start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
       pageblock_order, we could have a single pageblock partially managed by
       two zones.

    3) compaction code runs into __fragmentation_index() with order
       >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]

    4) mm/page_reporting.c won't be reporting any pages with default
       page_reporting_order == pageblock_order, as we'll be skipping the
       reporting loop inside page_reporting_process_zone().

    5) __rmqueue_fallback() will never be able to steal with
       ALLOC_NOFRAGMENT.

    pageblock_order >= MAX_ORDER is weird either way: it's a pure
    optimization for making alloc_contig_range(), as used for allocation of
    gigantic pages, a little more reliable to succeed.  However, if there is
    demand for somewhat reliable allocation of gigantic pages, affected
    setups should be using CMA or boottime allocations instead.

    So let's make sure that pageblock_order < MAX_ORDER and simplify.

    [1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com
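
    Conceptually the change amounts to a clamp along these lines for the
    CONFIG_HUGETLB_PAGE_SIZE_VARIABLE case, where pageblock_order is a
    boot-time variable (illustrative only, not the literal diff):

        pageblock_order = min_t(unsigned int, pageblock_order, MAX_ORDER - 1);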

    Link: https://lkml.kernel.org/r/20220214174132.219303-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Frank Rowand <frowand.list@gmail.com>
    Cc: John Garry via iommu <iommu@lists.linux-foundation.org>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Rob Herring <robh+dt@kernel.org>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen ab6763a437 mm/page_alloc: don't pass pfn to free_unref_page_commit()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 566513775dca7f0d4ba15da4bc8394cdb2c98829
Author: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Date:   Tue Mar 22 14:43:14 2022 -0700

    mm/page_alloc: don't pass pfn to free_unref_page_commit()

    free_unref_page_commit() doesn't make use of its pfn argument, so get
    rid of it.

    Link: https://lkml.kernel.org/r/20220202140451.415928-1-nsaenzju@redhat.com
    Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 002a87236c mm: page_alloc: avoid merging non-fallbackable pageblocks with others
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1dd214b8f21ca46d5431be9b2db8513c59e07a26
Author: Zi Yan <ziy@nvidia.com>
Date:   Tue Mar 22 14:43:05 2022 -0700

    mm: page_alloc: avoid merging non-fallbackable pageblocks with others

    This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
    It prepares for the upcoming removal of the MAX_ORDER-1 alignment
    requirement for CMA and alloc_contig_range().

    MIGRATE_HIGHATOMIC should not merge with other migratetypes like
    MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.

    Remove MIGRATE_CMA and MIGRATE_ISOLATE from fallbacks list, since they
    are never used.

    [1] https://lore.kernel.org/linux-mm/20211130100853.GP3366@techsingularity.net/
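
    A sketch of the kind of helper this change introduces: only the ordinary
    migratetypes below MIGRATE_PCPTYPES may merge, since MIGRATE_HIGHATOMIC,
    MIGRATE_CMA and MIGRATE_ISOLATE all come at or after that boundary in the
    migratetype enum (shown for orientation, not as the exact upstream code):

        static inline bool migratetype_is_mergeable(int mt)
        {
                return mt < MIGRATE_PCPTYPES;
        }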

    Link: https://lkml.kernel.org/r/20220124175957.1261961-1-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Mike Rapoport <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 9c8f59f0d2 delayacct: track delays from memory compact
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5bf18281534451bf1ad56a45a3085cd7ad46860d
Author: wangyong <wang.yong12@zte.com.cn>
Date:   Wed Jan 19 18:10:15 2022 -0800

    delayacct: track delays from memory compact

    Delay accounting does not track the delay caused by memory compaction.
    When there is not enough free memory, tasks can spend part of their time
    waiting for compaction.

    To capture the impact of direct memory compaction on tasks, measure the
    delay incurred when memory is allocated through compaction.

    Also update tools/accounting/getdelays.c:

        / # ./getdelays_next  -di -p 304
        print delayacct stats ON
        printing IO accounting
        PID     304

        CPU             count     real total  virtual total    delay total  delay average
                          277      780000000      849039485       18877296          0.068ms
        IO              count    delay total  delay average
                            0              0              0ms
        SWAP            count    delay total  delay average
                            0              0              0ms
        RECLAIM         count    delay total  delay average
                            5    11088812685           2217ms
        THRASHING       count    delay total  delay average
                            0              0              0ms
        COMPACT         count    delay total  delay average
                            3          72758              0ms
        watch: read=0, write=0, cancelled_write=0
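
    The instrumentation itself, sketched and condensed from
    __alloc_pages_direct_compact() (the delayacct_compact_start()/end()
    helpers are what this patch adds):

        delayacct_compact_start();
        *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags,
                                               ac, prio, &page);
        delayacct_compact_end();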

    Link: https://lkml.kernel.org/r/1638619795-71451-1-git-send-email-wang.yong12@zte.com.cn
    Signed-off-by: wangyong <wang.yong12@zte.com.cn>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Zhang Wenya <zhang.wenya1@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: Balbir Singh <bsingharora@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:42 -04:00
Chris von Recklinghausen 508db56386 mm/page_alloc.c: modify the comment section for alloc_contig_pages()
Bugzilla: https://bugzilla.redhat.com/2120352

commit eaab8e753632b8e961701d02a5bb398c820f309c
Author: Anshuman Khandual <anshuman.khandual@arm.com>
Date:   Fri Jan 14 14:07:33 2022 -0800

    mm/page_alloc.c: modify the comment section for alloc_contig_pages()

    Clarify that the alloc_contig_pages() allocated range will always be
    aligned to the requested nr_pages.

    Link: https://lkml.kernel.org/r/1639545478-12160-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen ee38b98d63 mm: page_alloc: fix building error on -Werror=array-compare
Bugzilla: https://bugzilla.redhat.com/2120352

commit ca831f29f8f25c97182e726429b38c0802200c8f
Author: Xiongwei Song <sxwjean@gmail.com>
Date:   Fri Jan 14 14:07:24 2022 -0800

    mm: page_alloc: fix building error on -Werror=array-compare

    Arthur Marsh reported that we would hit the error below when building the
    kernel with gcc-12:

      CC      mm/page_alloc.o
      mm/page_alloc.c: In function `mem_init_print_info':
      mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
       8173 |                 if (start <= pos && pos < end && size > adj) \
            |

    In C++20, comparison between two arrays is deprecated, and gcc-12 now
    warns about it in C as well.
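
    A minimal illustration of the warning class and the usual way to silence
    it (the symbol names here are just examples):

        extern char __init_begin[], __init_end[], _sinittext[];
        bool in_init;

        /* gcc-12: comparison between two arrays [-Werror=array-compare] */
        in_init = __init_begin <= _sinittext && _sinittext < __init_end;

        /* same meaning, no warning: compare element addresses explicitly */
        in_init = &__init_begin[0] <= &_sinittext[0] &&
                  &_sinittext[0] < &__init_end[0];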

    Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
    Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
    Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen 73067bf28e mm/memremap: add ZONE_DEVICE support for compound pages
Conflicts:
	include/linux/memremap.h - The presence of
		536939ff5163 ("mm: Add three folio wrappers")
		and
		dc90f0846df4 ("mm: don't include <linux/memremap.h> in <linux/mm.h>")
		causes a merge conflict. make sure all 4 functions are defined.
	mm/memremap.c - The backport of
		b80892ca022e ("memremap: remove support for external pgmap refcounts")
		changed percpu_ref_get_many to take the address of the ref.
		This patch wants to pass the ref to percpu_ref_get_many by
		value but later merge commit
		f56caedaf94f ("Merge branch 'akpm' (patches from Andrew)")
		changed it back to passing the ref by address. squash that
		change in with this one.

Bugzilla: https://bugzilla.redhat.com/2120352

commit c4386bd8ee3a921c3c799b7197dc898ade76a453
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Fri Jan 14 14:04:22 2022 -0800

    mm/memremap: add ZONE_DEVICE support for compound pages

    Add a new @vmemmap_shift property for struct dev_pagemap which specifies
    that a devmap is composed of a set of compound pages of order
    @vmemmap_shift, instead of base pages.  When a compound page devmap is
    requested, all but the first page are initialised as tail pages instead
    of order-0 pages.

    For certain ZONE_DEVICE users like device-dax which have a fixed page
    size, this creates an opportunity to optimize GUP and GUP-fast walkers,
    treating it the same way as THP or hugetlb pages.

    Additionally, commit 7118fc2906 ("hugetlb: address ref count racing in
    prep_compound_gigantic_page") removed set_page_count() because the
    setting of the page ref count to zero was redundant.  devmap pages don't
    come from the page allocator, though, and only the head page refcount is
    used for compound pages, hence initialize the tail page count to zero.
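
    A hedged sketch of how a ZONE_DEVICE user opts in to compound pages via
    the new field (the type and shift values here are illustrative):

        struct dev_pagemap pgmap = {
                .type          = MEMORY_DEVICE_GENERIC,
                /* order of the compound pages backing the devmap,
                 * e.g. PMD-sized: 2M on x86-64 */
                .vmemmap_shift = PMD_SHIFT - PAGE_SHIFT,
        };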

    Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
Chris von Recklinghausen f3d31b7be9 mm/page_alloc: refactor memmap_init_zone_device() page init
Bugzilla: https://bugzilla.redhat.com/2120352

commit 46487e0095f895c25da9feae27dc06d2aa76793d
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Fri Jan 14 14:04:18 2022 -0800

    mm/page_alloc: refactor memmap_init_zone_device() page init

    Move struct page init to an helper function __init_zone_device_page().

    This is in preparation for sharing the storage for compound page
    metadata.

    Link: https://lkml.kernel.org/r/20211202204422.26777-4-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
Chris von Recklinghausen d6a800636f mm/page_alloc: split prep_compound_page into head and tail subparts
Conflicts: mm/page_alloc.c - We already have
	5232c63f46fd ("mm: Make compound_pincount always available")
	which removed the hpage_pincount_available check before calling
	atomic_set(compound_pincount_ptr(page), 0), leading to a difference
	in deleted code. The upstream version of this patch adds a
	call to hpage_pincount_available in prep_compound_head. Remove that
	too.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 5b24eeef06701cca6852f1bf768248ccc912819b
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Fri Jan 14 14:04:15 2022 -0800

    mm/page_alloc: split prep_compound_page into head and tail subparts

    Patch series "mm, device-dax: Introduce compound pages in devmap", v7.

    This series converts device-dax to use compound pages, and moves away
    from the 'struct page per basepage on PMD/PUD' that is done today.

    Doing so
     1) unlocks a few noticeable improvements on unpin_user_pages() and
        makes the device-dax+altmap case 4x faster in pinning (numbers
        below and in last patch)
     2) as mentioned in various other threads it's one important step
        towards cleaning up ZONE_DEVICE refcounting.

    I've split the compound pages on devmap part from the rest based on
    recent discussions on devmap pending and future work planned[5][6].
    There is consensus that device-dax should be using compound pages to
    represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
    specialization of the dax parts.  I will pursue the rest of the work in
    parallel once this part is merged, particular the GUP-{slow,fast}
    improvements [7] and the tail struct page deduplication memory savings
    part[8].

    To summarize what the series does:

    Patch 1: Prepare hwpoisoning to work with dax compound pages.

    Patches 2-3: Split the current utility function of prep_compound_page()
    into head and tail and use those two helpers where appropriate to take
    advantage of caches being warm after __init_single_page().  This is used
    when initializing zone device when we bring up device-dax namespaces.

    Patches 4-10: Add devmap support for compound pages in device-dax.
    memmap_init_zone_device() initialize its metadata as compound pages, and
    it introduces a new devmap property known as vmemmap_shift which
    outlines how the vmemmap is structured (defaults to base pages as done
    today).  The property essentially describes the page order of the
    metadata.  While at it, do a few cleanups in device-dax in patches
    5-9.  Finally enable device-dax usage of devmap @vmemmap_shift to a
    value based on its own @align property.  @vmemmap_shift returns 0 by
    default (which is today's case of base pages in devmap, like fsdax or
    the others) and the usage of compound devmap is optional.  Starting with
    device-dax (*not* fsdax) we enable it by default.  There are a few
    pinning improvements particular on the unpinning case and altmap, as
    well as unpin_user_page_range_dirty_lock() being just as effective as
    THP/hugetlb[0] pages.

        $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
        (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
        [altmap]
        (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms

         $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
        (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
        [altmap with -m 127004]
        (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms

    Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with
    and without altmap), alongside gup_test selftests with dynamic dax
    regions and static dax regions.  Coupled with ndctl unit tests for
    dynamic dax devices that exercise all of this.  Note, for dynamic dax
    regions I had to revert commit 8aa83e6395 ("x86/setup: Call
    early_reserve_memory() earlier"), it is a known issue that this commit
    broke efi_fake_mem=.

    This patch (of 11):

    Split the utility function prep_compound_page() into head and tail
    counterparts, and use them accordingly.

    This is in preparation for sharing the storage for compound page
    metadata.

    Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
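
A minimal userspace sketch of the head/tail split described in the entry above, assuming stand-in types (struct fake_page and its fields are illustrative, not the kernel's struct page):

#include <stdio.h>

struct fake_page {
	unsigned long flags;
	unsigned long compound_head;	/* head pointer | 1, as in the tail-page encoding */
	unsigned int  compound_order;	/* meaningful on the head page only */
};

/* Initialize the metadata that lives only in the head page. */
static void prep_compound_head(struct fake_page *head, unsigned int order)
{
	head->flags |= 1UL;		/* stand-in for PG_head */
	head->compound_order = order;
}

/* Initialize one tail page so it points back at its head. */
static void prep_compound_tail(struct fake_page *head, struct fake_page *tail)
{
	tail->compound_head = (unsigned long)head | 1UL;
}

/* The original monolithic helper becomes a composition of the two parts. */
static void prep_compound_page(struct fake_page *pages, unsigned int order)
{
	unsigned long i, nr = 1UL << order;

	for (i = 1; i < nr; i++)
		prep_compound_tail(&pages[0], &pages[i]);
	prep_compound_head(&pages[0], order);
}

int main(void)
{
	struct fake_page pages[8] = { 0 };

	prep_compound_page(pages, 3);
	printf("head order=%u, tail 5 points at head: %s\n",
	       pages[0].compound_order,
	       (pages[5].compound_head & ~1UL) == (unsigned long)&pages[0] ?
	       "yes" : "no");
	return 0;
}

The value of the split is reuse: a caller that initializes its pages one at a time (such as the zone-device path in the previous entry) can run the tail helper right after its own per-page init, while the data is still cache-warm.
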
Chris von Recklinghausen 16b2c55cfa mm/page_alloc: remove the throttling logic from the page allocator
Bugzilla: https://bugzilla.redhat.com/2120352

commit 132b0d21d21f14f74fbe44dd5b8b1848215fff09
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:38 2021 -0700

    mm/page_alloc: remove the throttling logic from the page allocator

    The page allocator stalls based on the number of pages that are waiting
    for writeback to start but this should now be redundant.
    shrink_inactive_list() will wake flusher threads if the LRU tail are
    unqueued dirty pages so the flusher should be active.  If it fails to
    make progress due to pages under writeback not being completed quickly
    then it should stall on VMSCAN_THROTTLE_WRITEBACK.

    Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen 809d37d23f mm/vmscan: throttle reclaim until some writeback completes if congested
Conflicts:
	mm/filemap.c - We already have
		4268b48077e5 ("mm/filemap: Add folio_end_writeback()")
		so put the acct_reclaim_writeback call between the
		folio_wake call and the folio_put call and pass it a
		folio
	mm/internal.h - We already have
		646010009d35 ("mm: Add folio_raw_mapping()")
		so keep definition of folio_raw_mapping.
		Squash in changes from merge commit
		512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
		to be compatible with existing folio changes.
	mm/vmscan.c - Squash in changes from merge commit
                512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
                to be compatible with existing folio changes.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 8cd7c588decf470bf7e14f2be93b709f839a965e
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:25 2021 -0700

    mm/vmscan: throttle reclaim until some writeback completes if congested

    Patch series "Remove dependency on congestion_wait in mm/", v5.

    This series removes all calls to congestion_wait in mm/ and deletes
    wait_iff_congested.  It's not a clever implementation but
    congestion_wait has been broken for a long time [1].

    Even if congestion throttling worked, it was never a great idea.  While
    excessive dirty/writeback pages at the tail of the LRU is one
    possibility that reclaim may be slow, there is also the problem of too
    many pages being isolated and reclaim failing for other reasons
    (elevated references, too many pages isolated, excessive LRU contention
    etc).

    This series replaces the "congestion" throttling with 3 different types.

     - If there are too many dirty/writeback pages, sleep until a timeout or
       enough pages get cleaned

     - If too many pages are isolated, sleep until enough isolated pages are
       either reclaimed or put back on the LRU

     - If no progress is being made, direct reclaim tasks sleep until
       another task makes progress with acceptable efficiency.

    This was initially tested with a mix of workloads that used to trigger
    corner cases that no longer work.  A new test case was created called
    "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
    created XFS filesystem.  Note that it may be necessary to increase the
    timeout of ssh if executing remotely as ssh itself can get throttled and
    the connection may timeout.

    stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
    to check the impact as the number of direct reclaimers increase.  It has
    four types of worker.

     - One "anon latency" worker creates small mappings with mmap() and
       times how long it takes to fault the mapping reading it 4K at a time

     - X file writers which is fio randomly writing X files where the total
       size of the files add up to the allowed dirty_ratio. fio is allowed
       to run for a warmup period to allow some file-backed pages to
       accumulate. The duration of the warmup is based on the best-case
       linear write speed of the storage.

     - Y file readers which is fio randomly reading small files

     - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

     - Total estimated WSS = (100+dirty_ratio) percentage of memory

    X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

    The intent is to maximise the total WSS with a mix of file and anon
    memory where some anonymous memory must be swapped and there is a high
    likelihood of dirty/writeback pages reaching the end of the LRU.

    The test can be configured to have no background readers to stress
    dirty/writeback pages.  The results below are based on having zero
    readers.

    The short summary of the results is that the series works and stalls
    until some event occurs but the timeouts may need adjustment.

    The test results are not broken down by patch as the series should be
    treated as one block that replaces a broken throttling mechanism with a
    working one.

    Finally, three machines were tested but I'm reporting the worst set of
    results.  The other two machines had much better latencies for example.

    First the results of the "anon latency" latency

      stutterp
                                    5.15.0-rc1             5.15.0-rc1
                                       vanilla mm-reclaimcongest-v5r4
      Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
      Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
      Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
      Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
      Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
      Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
      Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
      Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
      Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
      Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
      Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
      Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
      Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
      Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
      Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
      Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
      Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
      Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
      Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
      Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
      Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
      Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
      Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
      Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

    For most thread counts, the time to mmap() is unfortunately increased.
    In earlier versions of the series, this was lower but a large number of
    throttling events were reaching their timeout increasing the amount of
    inefficient scanning of the LRU.  There is no prioritisation of reclaim
    tasks making progress based on each task's rate of page allocation versus
    progress of reclaim.  The variance is also impacted for high worker
    counts but in all cases, the differences in latency are not
    statistically significant due to very large maximum outliers.  Max-90
    shows that 90% of the stalls are comparable but the Max results show the
    massive outliers which are increased due to stalling.

    It is expected that this will be very machine dependent.  Due to the
    test design, reclaim is difficult so allocations stall and there are
    variances depending on whether THPs can be allocated or not.  The amount
    of memory will affect exactly how bad the corner cases are and how often
    they trigger.  The warmup period calculation is not ideal as it's based
    on linear writes whereas fio is randomly writing multiple files from
    multiple tasks so the start state of the test is variable.  For example,
    these are the latencies on a single-socket machine that had more memory

      Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
      Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
      Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
      Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
      Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

    The overall system CPU usage and elapsed time is as follows

                        5.15.0-rc3  5.15.0-rc3
                           vanilla mm-reclaimcongest-v5r4
      Duration User        6989.03      983.42
      Duration System      7308.12      799.68
      Duration Elapsed     2277.67     2092.98

    The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
    stalling.

    The high-level /proc/vmstats show

                                           5.15.0-rc1     5.15.0-rc1
                                              vanilla mm-reclaimcongest-v5r2
      Ops Direct pages scanned          1056608451.00   503594991.00
      Ops Kswapd pages scanned           109795048.00   147289810.00
      Ops Kswapd pages reclaimed          63269243.00    31036005.00
      Ops Direct pages reclaimed          10803973.00     6328887.00
      Ops Kswapd efficiency %                   57.62          21.07
      Ops Kswapd velocity                    48204.98       57572.86
      Ops Direct efficiency %                    1.02           1.26
      Ops Direct velocity                   463898.83      196845.97

    Kswapd scanned less pages but the detailed pattern is different.  The
    vanilla kernel scans slowly over time whereas the patches exhibit
    burst patterns of scan activity.  Direct reclaim scanning is reduced by
    52% due to stalling.

    The pattern for stealing pages is also slightly different.  Both kernels
    exhibit spikes but the vanilla kernel when reclaiming shows pages being
    reclaimed over a period of time whereas the patches tend to reclaim in
    spikes.  The difference is that vanilla is not throttling and instead
    scanning constantly, finding some pages over time, whereas the patched
    kernel throttles and reclaims in spikes.

      Ops Percentage direct scans               90.59          77.37

    For direct reclaim, vanilla scanned 90.59% of pages whereas with the
    patches, 77.37% were direct reclaim due to throttling.

      Ops Page writes by reclaim           2613590.00     1687131.00

    Page writes from reclaim context are reduced.

      Ops Page writes anon                 2932752.00     1917048.00

    And there is less swapping.

      Ops Page reclaim immediate         996248528.00   107664764.00

    The number of pages encountered at the tail of the LRU tagged for
    immediate reclaim but still dirty/writeback is reduced by 89%.

      Ops Slabs scanned                     164284.00      153608.00

    Slab scan activity is similar.

    ftrace was used to gather stall activity

      Vanilla
      -------
          1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
          2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
          8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
         29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
      82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

    The vast majority of wait_iff_congested calls do not stall at all.  What
    is likely happening is that cond_resched() reschedules the task for a
    short period when the BDI is not registering congestion (which it never
    will in this test setup).

          1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
          2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
          4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
        380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
        778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

    congestion_wait, if called, always exceeds the timeout as there is no
    trigger to wake it up.

    Bottom line: Vanilla will throttle but it's not effective.

    Patch series
    ------------

    Kswapd throttle activity was always due to scanning pages tagged for
    immediate reclaim at the tail of the LRU

          1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority of events did not stall or stalled for a short period.
    Roughly 16% of stalls reached the timeout before expiry.  For direct
    reclaim, the number of times stalled for each reason were

       6624 reason=VMSCAN_THROTTLE_ISOLATED
      93246 reason=VMSCAN_THROTTLE_NOPROGRESS
      96934 reason=VMSCAN_THROTTLE_WRITEBACK

    The most common reason to stall was due to excessive pages tagged for
    immediate reclaim at the tail of the LRU followed by a failure to make
    forward progress.  A relatively small number were due to too many pages
    isolated from the LRU by parallel threads.

    For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

          9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
         12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
         83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
       6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

    Most did not stall at all.  A small number reached the timeout.

    For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
    the map

          1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
          6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
         11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
         16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
         18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
         21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
         26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
         27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
         28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
         29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
         31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
         32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
         33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
         35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
         35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
         36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
         36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
         37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
         38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
         40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
         43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
         55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
         56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
         58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
         59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
         61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
         71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
         71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
         79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
         82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
         82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
         85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
         85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
         88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
         90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
         90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
         94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
        118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
        119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
        126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
        146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
        148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
        148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
        159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
        178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
        183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
        237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
        266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
        313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
        347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
        470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
        559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
        964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
       2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
       2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
       7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
      22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
      51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

    The full timeout is often hit but a large number also do not stall at
    all.  The remainder slept a little allowing other reclaim tasks to make
    progress.

    While this timeout could be further increased, it could also negatively
    impact worst-case behaviour when there is no prioritisation of what task
    should make progress.

    For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

          1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
          2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
          3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
         12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
         16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
         24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
         28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
         32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
         42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
         77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
         99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
        137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
        190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
        339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
        518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
        852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
       3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
       7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
      83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority hit the timeout in direct reclaim context although a
    sizable number did not stall at all.  This is very different to kswapd
    where only a tiny percentage of stalls due to writeback reached the
    timeout.

    Bottom line, the throttling appears to work and the wakeup events may
    limit worst case stalls.  There might be some grounds for adjusting
    timeouts but it's likely futile as the worst-case scenarios depend on
    the workload, memory size and the speed of the storage.  A better
    approach to improve the series further would be to prioritise tasks
    based on their rate of allocation with the caveat that it may be very
    expensive to track.

    This patch (of 5):

    Page reclaim throttles on wait_iff_congested under the following
    conditions:

     - kswapd is encountering pages under writeback and marked for immediate
       reclaim implying that pages are cycling through the LRU faster than
       pages can be cleaned.

     - Direct reclaim will stall if all dirty pages are backed by congested
       inodes.

    wait_iff_congested is almost completely broken with few exceptions.
    This patch adds a new node-based workqueue and tracks the number of
    throttled tasks and pages written back since throttling started.  If
    enough pages belonging to the node are written back then the throttled
    tasks will wake early.  If not, the throttled tasks sleeps until the
    timeout expires.

    [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
    [hdanton@sina.com: Avoid race when reclaim starts]
    [vbabka@suse.cz: vmstat irq-safe api, clarifications]

    Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
    Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: NeilBrown <neilb@suse.de>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
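
As a rough illustration of the throttling shape described above (sleep until enough writeback completes or a timeout expires, with an early wakeup), here is a self-contained pthread sketch. It models only the wait/wake structure; the names, counters and numbers are made up and stand in for the kernel's per-node wait queue and vmstat accounting:

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wb_done = PTHREAD_COND_INITIALIZER;
static unsigned long nr_written_back;

/* Wait until at least 'want' pages were cleaned, or 'timeout_ms' elapsed. */
static void throttle_on_writeback(unsigned long want, int timeout_ms)
{
	struct timespec deadline;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec  += timeout_ms / 1000;
	deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&lock);
	while (nr_written_back < want) {
		if (pthread_cond_timedwait(&wb_done, &lock, &deadline))
			break;	/* ETIMEDOUT: give up and let reclaim retry */
	}
	pthread_mutex_unlock(&lock);
}

/* The "flusher": cleans pages and wakes throttled reclaimers early. */
static void *flusher(void *arg)
{
	(void)arg;
	usleep(100 * 1000);		/* pretend writeback takes 100ms */
	pthread_mutex_lock(&lock);
	nr_written_back += 512;		/* pages cleaned */
	pthread_cond_broadcast(&wb_done);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, flusher, NULL);
	throttle_on_writeback(256, 1000);	/* wakes early, well before 1s */
	pthread_join(t, NULL);
	printf("woke with %lu pages written back\n", nr_written_back);
	return 0;
}
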
Chris von Recklinghausen 4e48ab5bc4 mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged
Bugzilla: https://bugzilla.redhat.com/2120352

commit bd3400ea173fb611cdf2030d03620185ff6c0b0e
Author: Liangcai Fan <liangcaifan19@gmail.com>
Date:   Fri Nov 5 13:41:36 2021 -0700

    mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged

    When initializing transparent huge pages, min_free_kbytes would be
    calculated according to what khugepaged expected.

    So when transparent huge pages get disabled, min_free_kbytes should be
    recalculated instead of the higher value set by khugepaged.

    Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
    Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
    Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
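
A small standalone model of the behaviour fixed above; the numbers are invented and the two functions are simplified stand-ins, not the kernel implementations of the same names:

#include <stdio.h>

static unsigned long min_free_kbytes;
static unsigned long base_min_free_kbytes = 11264;	/* pretend sqrt-based value */
static unsigned long khugepaged_recommended = 22528;	/* pretend boosted value */

/* Recalculate the base watermark, as if THP/khugepaged were not running. */
static void init_per_zone_wmark_min(void)
{
	min_free_kbytes = base_min_free_kbytes;
}

static void set_khugepaged_enabled(int enabled)
{
	if (enabled) {
		if (khugepaged_recommended > min_free_kbytes)
			min_free_kbytes = khugepaged_recommended;
	} else {
		/* The fix: recalculate instead of keeping the boosted value. */
		init_per_zone_wmark_min();
	}
}

int main(void)
{
	init_per_zone_wmark_min();
	set_khugepaged_enabled(1);
	printf("THP on : min_free_kbytes=%lu\n", min_free_kbytes);
	set_khugepaged_enabled(0);
	printf("THP off: min_free_kbytes=%lu\n", min_free_kbytes);
	return 0;
}
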
Chris von Recklinghausen 87f4fface3 mm/page_alloc: use clamp() to simplify code
Bugzilla: https://bugzilla.redhat.com/2120352

commit 59d336bdf6931a6a8c2ba41e533267d1cc799fc9
Author: Wang ShaoBo <bobo.shaobowang@huawei.com>
Date:   Fri Nov 5 13:40:55 2021 -0700

    mm/page_alloc: use clamp() to simplify code

    This patch uses clamp() to simplify code in init_per_zone_wmark_min().

    Link: https://lkml.kernel.org/r/20211021034830.1049150-1-bobo.shaobowang@huawei.com
    Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Wei Yongjun <weiyongjun1@huawei.com>
    Cc: Li Bin <huawei.libin@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
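
The simplification amounts to replacing an open-coded pair of range checks with a single clamp. A hedged userspace sketch; the macro below is a plain ternary rather than the kernel's type-checked clamp(), and the bounds and input value are illustrative:

#include <stdio.h>

#define clamp(val, lo, hi) ((val) < (lo) ? (lo) : (val) > (hi) ? (hi) : (val))

int main(void)
{
	unsigned long new_min_free_kbytes = 45000;	/* pretend sqrt-based estimate */

	/* Before: if (x < lo) x = lo; if (x > hi) x = hi; */
	unsigned long min_free_kbytes = clamp(new_min_free_kbytes, 128UL, 65536UL);

	printf("min_free_kbytes = %lu\n", min_free_kbytes);
	return 0;
}
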
Chris von Recklinghausen 18db1af2f9 mm: page_alloc: use migrate_disable() in drain_local_pages_wq()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9c25cbfcb38462803a3d68f5d88e66a587f5f045
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri Nov 5 13:40:52 2021 -0700

    mm: page_alloc: use migrate_disable() in drain_local_pages_wq()

    drain_local_pages_wq() disables preemption to avoid CPU migration during
    CPU hotplug and can't use cpus_read_lock().

    Using migrate_disable() works here, too.  The scheduler won't take the
    CPU offline until the task left the migrate-disable section.  The
    problem with disabled preemption here is that drain_local_pages()
    acquires locks which are turned into sleeping locks on PREEMPT_RT and
    can't be acquired with disabled preemption.

    Use migrate_disable() in drain_local_pages_wq().

    Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 1419281dae mm/page_alloc.c: show watermark_boost of zone in zoneinfo
Bugzilla: https://bugzilla.redhat.com/2120352

commit a6ea8b5b9f1ce3403a1c8516035d653006741e80
Author: Liangcai Fan <liangcaifan19@gmail.com>
Date:   Fri Nov 5 13:40:37 2021 -0700

    mm/page_alloc.c: show watermark_boost of zone in zoneinfo

    min/low/high_wmark_pages(z) is defined as

      (z->_watermark[WMARK_MIN/LOW/HIGH] + z->watermark_boost)

    If kswapd is frequently woken up due to the increase of
    min/low/high_wmark_pages, printing watermark_boost can quickly locate
    whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
    min/low/high_wmark_pages to increase.

    Link: https://lkml.kernel.org/r/1632472566-12246-1-git-send-email-liangcaifan19@gmail.com
    Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
    Cc: Chunyan Zhang <zhang.lyra@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
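
A tiny model of the formula quoted in the changelog: the effective min/low/high watermarks are the base watermark plus watermark_boost, so printing the boost shows which term grew. The struct below is a stand-in, not the kernel's struct zone:

#include <stdio.h>

enum wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone_model {
	unsigned long _watermark[NR_WMARK];
	unsigned long watermark_boost;
};

static unsigned long wmark_pages(const struct zone_model *z, enum wmark w)
{
	return z->_watermark[w] + z->watermark_boost;
}

int main(void)
{
	struct zone_model z = {
		._watermark = { 1024, 1280, 1536 },
		.watermark_boost = 512,	/* e.g. boosted after fragmentation */
	};

	printf("min=%lu low=%lu high=%lu boost=%lu\n",
	       wmark_pages(&z, WMARK_MIN), wmark_pages(&z, WMARK_LOW),
	       wmark_pages(&z, WMARK_HIGH), z.watermark_boost);
	return 0;
}
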
Chris von Recklinghausen 7815cb2138 mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 8446b59baaf45e83e1187cdb174ac78ac5d7d0ae
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 5 13:40:31 2021 -0700

    mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()

    Grabbing zone lock in is_free_buddy_page() gives a wrong sense of
    safety, and has potential performance implications when zone is
    experiencing lock contention.

    In any case, if a caller needs a stable result, it should grab zone lock
    before calling this function.

    Link: https://lkml.kernel.org/r/20210922152833.4023972-1-eric.dumazet@gmail.com
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 8350d36a42 mm/page_alloc: use accumulated load when building node fallback list
Bugzilla: https://bugzilla.redhat.com/2120352

commit 54d032ced98378bcb9d32dd5e378b7e402b36ad8
Author: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Date:   Fri Nov 5 13:40:21 2021 -0700

    mm/page_alloc: use accumulated load when building node fallback list

    In build_zonelists(), when the fallback list is built for the nodes, the
    node load gets reinitialized during each iteration.  This results in
    nodes with same distances occupying the same slot in different node
    fallback lists rather than appearing in the intended round- robin
    manner.  This results in one node getting picked for allocation more
    compared to other nodes with the same distance.

    As an example, consider a 4 node system with the following distance
    matrix.

      Node 0  1  2  3
      ----------------
      0    10 12 32 32
      1    12 10 32 32
      2    32 32 10 12
      3    32 32 12 10

    For this case, the node fallback list gets built like this:

      Node  Fallback list
      ---------------------
      0     0 1 2 3
      1     1 0 3 2
      2     2 3 0 1
      3     3 2 0 1 <-- Unexpected fallback order

    In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
    same order which results in more allocations getting satisfied from node
    0 compared to node 1.

    The effect of this on remote memory bandwidth as seen by stream
    benchmark is shown below:

      Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
            (numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
      Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
            (numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)

      ----------------------------------------
                    BANDWIDTH (MB/s)
          TEST      Case 1          Case 2
      ----------------------------------------
          COPY      57479.6         110791.8
         SCALE      55372.9         105685.9
           ADD      50460.6         96734.2
        TRIADD      50397.6         97119.1
      ----------------------------------------

    The bandwidth drop in Case 1 occurs because most of the allocations get
    satisfied by node 0 as it appears first in the fallback order for both
    nodes 2 and 3.

    This can be fixed by accumulating the node load in build_zonelists()
    rather than reinitializing it during each iteration.  With this the
    nodes with the same distance rightly get assigned in the round robin
    manner.

    In fact this was how it was originally until commit f0c0b2b808
    ("change zonelist order: zonelist order selection logic") dropped the
    load accumulation and resorted to initializing the load during each
    iteration.

    While zonelist ordering was removed by commit c9bff3eebc ("mm,
    page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
    accumulation in build_zonelists() remained.  So essentially this patch
    reverts back to the accumulated node load logic.

    After this fix, the fallback order gets built like this:

      Node Fallback list
      ------------------
      0    0 1 2 3
      1    1 0 3 2
      2    2 3 0 1
      3    3 2 1 0 <-- Note the change here

    The bandwidth in Case 1 improves and matches Case 2 as shown below.

      ----------------------------------------
                    BANDWIDTH (MB/s)
          TEST      Case 1          Case 2
      ----------------------------------------
          COPY      110438.9        110107.2
         SCALE      105930.5        105817.5
           ADD      97005.1         96159.8
        TRIADD      97441.5         96757.1
      ----------------------------------------

    The correctness of the fallback list generation has been verified for
    the above node configuration where the node 3 starts as memory-less node
    and comes up online only during memory hotplug.

    [bharata@amd.com: Added changelog, review, test validation]

    Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
    Fixes: f0c0b2b808 ("change zonelist order: zonelist order selection logic")
    Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Bharata B Rao <bharata@amd.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
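
The tie-breaking described above can be reproduced with a short standalone program. This is a deliberate simplification of build_zonelists()/find_next_best_node() (no CPU or node-index penalties), but it is enough to show the two tables from the changelog: when the load penalty is reassigned on every build, node 3 falls back to 0 before 1; when it is accumulated, nodes 0 and 1 rotate:

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4
#define SCALE    64	/* keep distance dominant over accumulated load */

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 32, 32 },
	{ 12, 10, 32, 32 },
	{ 32, 32, 10, 12 },
	{ 32, 32, 12, 10 },
};

static int node_load[NR_NODES];	/* persists across list builds */

static int find_next_best_node(int from, const bool *used)
{
	int best = -1, best_val = 0;

	for (int n = 0; n < NR_NODES; n++) {
		int val;

		if (used[n])
			continue;
		val = dist[from][n] * SCALE + node_load[n];	/* load breaks distance ties */
		if (best < 0 || val < best_val) {
			best = n;
			best_val = val;
		}
	}
	return best;
}

static void build_all(bool accumulate)
{
	for (int n = 0; n < NR_NODES; n++)
		node_load[n] = 0;

	for (int from = 0; from < NR_NODES; from++) {
		bool used[NR_NODES] = { false };
		int prev = from, load = NR_NODES, node;

		printf("node %d:", from);
		while ((node = find_next_best_node(from, used)) >= 0) {
			/* Penalize the first node of each new distance group. */
			if (dist[from][node] != dist[from][prev]) {
				if (accumulate)
					node_load[node] += load;	/* the fix */
				else
					node_load[node] = load;		/* old behaviour */
			}
			printf(" %d", node);
			used[node] = true;
			prev = node;
			load--;
		}
		printf("\n");
	}
}

int main(void)
{
	printf("old (node_load reassigned):\n");
	build_all(false);
	printf("fixed (node_load accumulated):\n");
	build_all(true);
	return 0;
}

With this simplification the printed lists match the two tables in the changelog: node 3 gets "3 2 0 1" with reassignment and "3 2 1 0" with accumulation, while the other nodes' lists are unchanged.
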
Chris von Recklinghausen 978a8fb0f0 mm/page_alloc: print node fallback order
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6cf253925df72e522c06dac09ede7e81a6e38121
Author: Bharata B Rao <bharata@amd.com>
Date:   Fri Nov 5 13:40:18 2021 -0700

    mm/page_alloc: print node fallback order

    Patch series "Fix NUMA nodes fallback list ordering".

    For a NUMA system that has multiple nodes at same distance from other
    nodes, the fallback list generation prefers same node order for them
    instead of round-robin thereby penalizing one node over others.  This
    series fixes it.

    More description of the problem and the fix is present in the patch
    description.

    This patch (of 2):

    Print information message about the allocation fallback order for each
    NUMA node during boot.

    No functional changes here.  This makes it easier to illustrate the
    problem in the node fallback list generation, which the next patch
    fixes.

    Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
    Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
    Signed-off-by: Bharata B Rao <bharata@amd.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
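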
Chris von Recklinghausen a7216c2dfb mm/page_alloc.c: use helper function zone_spans_pfn()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 86fb05b9cc1ac7cdcf37e5408b927dd3ad95db96
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:11 2021 -0700

    mm/page_alloc.c: use helper function zone_spans_pfn()

    Use helper function zone_spans_pfn() to check whether pfn is within a
    zone to simplify the code slightly.

    Link: https://lkml.kernel.org/r/20210902121242.41607-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
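
For reference, the helper boils down to a half-open range check. A userspace sketch with a stand-in struct rather than the kernel's struct zone:

#include <stdbool.h>
#include <stdio.h>

struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

static unsigned long zone_end_pfn(const struct zone_model *z)
{
	return z->zone_start_pfn + z->spanned_pages;
}

/* true iff pfn lies inside [zone_start_pfn, zone_end_pfn) */
static bool zone_spans_pfn(const struct zone_model *z, unsigned long pfn)
{
	return z->zone_start_pfn <= pfn && pfn < zone_end_pfn(z);
}

int main(void)
{
	struct zone_model z = { .zone_start_pfn = 0x1000, .spanned_pages = 0x4000 };

	printf("0x2000 in zone: %d, 0x8000 in zone: %d\n",
	       zone_spans_pfn(&z, 0x2000), zone_spans_pfn(&z, 0x8000));
	return 0;
}
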
Chris von Recklinghausen 9d23d1b42b mm/page_alloc.c: simplify the code by using macro K()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ff7ed9e4532d14e0478d192548e77d78d72387e9
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:05 2021 -0700

    mm/page_alloc.c: simplify the code by using macro K()

    Use helper macro K() to convert the pages to the corresponding size.
    Minor readability improvement.

    Link: https://lkml.kernel.org/r/20210902121242.41607-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
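
The K() helper referenced above is just a pages-to-KiB shift; a minimal sketch assuming a 4 KiB page size (PAGE_SHIFT hard-coded here for illustration):

#include <stdio.h>

#define PAGE_SHIFT 12
#define K(x) ((x) << (PAGE_SHIFT - 10))	/* pages -> KiB */

int main(void)
{
	unsigned long nr_free_pages = 123456;

	printf("free: %lu pages = %lu kB\n", nr_free_pages, K(nr_free_pages));
	return 0;
}
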
Chris von Recklinghausen b603fd9497 mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ea808b4efd15f6f019e9779617a166c9708856c1
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:02 2021 -0700

    mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()

    Patch series "Cleanups and fixup for page_alloc", v2.

    This series contains cleanups to remove meaningless VM_BUG_ON(), use
    helpers to simplify the code and remove obsolete comment.  Also we avoid
    allocating highmem pages via alloc_pages_exact[_nid].  More details can be
    found in the respective changelogs.

    This patch (of 5):

    It's meaningless to VM_BUG_ON() order != pageblock_order just after
    setting order to pageblock_order.  Remove it.

    Link: https://lkml.kernel.org/r/20210902121242.41607-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210902121242.41607-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen 6bf0954126 arch/x86/mm/numa: Do not initialize nodes twice
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1ca75fa7f19d694c58af681fa023295072b03120
Author: Oscar Salvador <osalvador@suse.de>
Date:   Tue Mar 22 14:43:51 2022 -0700

    arch/x86/mm/numa: Do not initialize nodes twice

    On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA
    nodes could be allocated at three different places.

     - numa_register_memblks
     - init_cpu_to_node
     - init_gi_nodes

    All these calls happen at setup_arch, and have the following order:

    setup_arch
      ...
      x86_numa_init
       numa_init
        numa_register_memblks
      ...
      init_cpu_to_node
       init_memory_less_node
        alloc_node_data
        free_area_init_memoryless_node
      init_gi_nodes
       init_memory_less_node
        alloc_node_data
        free_area_init_memoryless_node

    numa_register_memblks() is only interested in those nodes which have
    memory, so it skips over any memoryless node it finds.  Later on, when
    we have read ACPI's SRAT table, we call init_cpu_to_node() and
    init_gi_nodes(), which initialize any memoryless node we might have that
    have either CPU or Initiator affinity, meaning we allocate pg_data_t
    struct for them and we mark them as ONLINE.

    So far so good, but the thing is that after ("mm: handle uninitialized
    numa nodes gracefully"), we allocate all possible NUMA nodes in
    free_area_init(), meaning we have a picture like the following:

    setup_arch
      x86_numa_init
       numa_init
        numa_register_memblks  <-- allocate non-memoryless node
      x86_init.paging.pagetable_init
       ...
        free_area_init
         free_area_init_memoryless <-- allocate memoryless node
      init_cpu_to_node
       alloc_node_data             <-- allocate memoryless node with CPU
       free_area_init_memoryless_node
      init_gi_nodes
       alloc_node_data             <-- allocate memoryless node with Initiator
       free_area_init_memoryless_node

    free_area_init() already allocates all possible NUMA nodes, but
    init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
    go ahead and allocate a new pg_data_t struct without checking anything,
    meaning we end up allocating twice.

    It should be made clear that this only happens in the case where a
    memoryless NUMA node happens to have a CPU/Initiator affinity.

    So get rid of init_memory_less_node() and just set the node online.

    Note that setting the node online is needed, otherwise we choke down the
    chain when bringup_nonboot_cpus() ends up calling
    __try_online_node()->register_one_node()->...  and we blow up in
    bus_add_device().  As can be seen here:

      BUG: kernel NULL pointer dereference, address: 0000000000000060
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
      RIP: 0010:bus_add_device+0x5a/0x140
      Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
      RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
      RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
      RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
      R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
      R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
      Call Trace:
       device_add+0x4c0/0x910
       __register_one_node+0x97/0x2d0
       __try_online_node+0x85/0xc0
       try_online_node+0x25/0x40
       cpu_up+0x4f/0x100
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x79
       kernel_init_freeable+0x130/0x2f1
       kernel_init+0x17/0x150
       ret_from_fork+0x22/0x30

    The reason is simple: by the time bringup_nonboot_cpus() gets called, we
    did not register the node_subsys bus yet, so we crash when
    bus_add_device() tries to dereference bus()->p.

    The following shows the order of the calls:

    kernel_init_freeable
     smp_init
      bringup_nonboot_cpus
       ...
         bus_add_device()      <- we did not register node_subsys yet
     do_basic_setup
      do_initcalls
       postcore_initcall(register_node_type);
        register_node_type
         subsys_system_register
          subsys_register
           bus_register         <- register node_subsys bus

    Why does setting the node online save us then? Well, simply because
    __try_online_node() backs off when the node is online, meaning we do not
    end up calling register_one_node() in the first place.

    This is subtle, broken and deserves a deep analysis and thought about
    how to put this into shape, but for now let us have this easy fix for
    the leaking memory issue.

    [osalvador@suse.de: add comments]
      Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de

    Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
    Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
    Signed-off-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Rafael Aquini <raquini@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Alexey Makhalov <amakhalov@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:16 -04:00
Chris von Recklinghausen 63d42d92cd mm: page table check
Bugzilla: https://bugzilla.redhat.com/2120352

commit df4e817b710809425d899340dbfa8504a3ca4ba5
Author: Pasha Tatashin <pasha.tatashin@soleen.com>
Date:   Fri Jan 14 14:06:37 2022 -0800

    mm: page table check

    Check user page table entries at the time they are added and removed.

    This allows memory corruption issues related to double mapping to be
    caught synchronously.

    When a pte for an anonymous page is added into a page table, we verify
    that this pte does not already point to a file-backed page; conversely,
    when a file-backed page is being added, we verify that this page does not
    have an anonymous mapping.

    We also enforce that only read-only sharing is allowed for anonymous
    pages (i.e. CoW after fork); all other sharing must be for file pages.

    Page table check makes it possible to protect against, and debug, cases
    where "struct page" metadata has become corrupted for some reason, for
    example when the refcount or mapcount becomes invalid.
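
    As an illustration only - the upstream implementation tracks per-page
    anon/file map counters in page_ext, and the helper names below are
    hypothetical for this sketch - the enforced invariant boils down to:

        static void page_table_check_add(struct page *page, bool anon, bool rw)
        {
                if (anon) {
                        /* an anon pte must never target a page that already
                         * has file mappings, and writable anon mappings must
                         * be exclusive (only read-only sharing, i.e. CoW
                         * after fork, is allowed) */
                        BUG_ON(file_map_count(page));
                        BUG_ON(rw && anon_map_count(page));
                        inc_anon_map_count(page);
                } else {
                        /* a file pte must never target an anonymous page */
                        BUG_ON(anon_map_count(page));
                        inc_file_map_count(page);
                }
        }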

    Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
    Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Masahiro Yamada <masahiroy@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Sami Tolvanen <samitolvanen@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:14 -04:00
Izabela Bakollari 49b9685e02 mm: prevent page_frag_alloc() from corrupting the memory
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2104445

A number of drivers call page_frag_alloc() with a fragment's size >
PAGE_SIZE.

In low memory conditions, __page_frag_cache_refill() may fail the order-3
cache allocation and fall back to order 0; in this case, the cache will be
smaller than the fragment, causing memory corruption.

Prevent this from happening by checking if the newly allocated cache is
large enough for the fragment; if not, the allocation will fail and
page_frag_alloc() will return NULL.
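
A rough sketch of the added guard, in the spirit of the fix (variable names
follow the page_frag cache refill path; the exact upstream diff may differ):

    /* after __page_frag_cache_refill(), "size" is the actual size of the
     * cache, which may be just PAGE_SIZE if the order-3 allocation failed */
    offset = size - fragsz;
    if (unlikely(offset < 0)) {
            /* the refilled cache is smaller than the requested fragment:
             * fail and return NULL instead of handing out memory past the
             * end of the cache */
            return NULL;
    }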

Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
Fixes: b63ae8ca09 ("mm/net: Rename and move page fragment handling from net/ to mm/")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Cc: Chen Lin <chen45464546@163.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit dac22531bbd4af2426c4e29e05594415ccfa365d)
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
2022-10-04 15:31:49 +02:00
Patrick Talbert 45f9f33cc3 Merge: mm/munlock: Fix sleeping function called from invalid context bug
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1168

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1168

The second patch fixes the "sleeping function called from invalid context"
bug reported in the BZ. The first patch is added to minimize the context
diff with the upstream patch.

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (2):
  mm/migration: add trace events for base page and HugeTLB migrations
  mm/munlock: protect the per-CPU pagevec by a local_lock_t

 arch/x86/mm/init.c             |  1 -
 include/trace/events/migrate.h | 31 +++++++++++++++++++++++
 mm/internal.h                  |  6 +++--
 mm/migrate.c                   |  6 +++--
 mm/mlock.c                     | 46 ++++++++++++++++++++++++++--------
 mm/page_alloc.c                |  1 +
 mm/rmap.c                      | 10 ++++++--
 mm/swap.c                      |  4 ++-
 8 files changed, 87 insertions(+), 18 deletions(-)

Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-08-03 11:54:33 -04:00
Patrick Talbert 840d62781b Merge: cgroup: Miscellaneous bug fixes and enhancements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/609

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2060150
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/609

This patchset pulls in miscellaneous cgroup fixes and enhancements.

[v3: Drop commit a06247c6804f as it has been merged, and also drop commit b1e2c8df0f00 ("cgroup: use irqsave in cgroup_rstat_flush_locked().") as it may cause a performance regression.]
[v4: Drop bpf commits as they have been merged. Drop commit 0061270307f2 ("cgroup: cgroup-v1: do not exclude cgrp_dfl_root") as it causes a network performance regression, and add back commit b1e2c8df0f00 ("cgroup: use irqsave in cgroup_rstat_flush_locked().")]

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (12):
  cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem
  cgroup: reduce dependency on cgroup_mutex
  cgroup: remove cgroup_mutex from cgroupstats_build
  cgroup: no need for cgroup_mutex for /proc/cgroups
  cgroup: Fix rootcg cpu.stat guest double counting
  mm/page_alloc: detect allocation forbidden by cpuset and bail out
    early
  cgroup/cpuset: Don't let child cpusets restrict parent in default
    hierarchy
  cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
  psi: Fix uaf issue when psi trigger is destroyed while being polled
  cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
  cgroup-v1: Correct privileges check in release_agent writes
  cgroup: use irqsave in cgroup_rstat_flush_locked().

 Documentation/accounting/psi.rst |   3 +-
 include/linux/cpuset.h           |  17 ++++
 include/linux/mmzone.h           |  22 ++++++
 include/linux/psi.h              |   2 +-
 include/linux/psi_types.h        |   3 -
 kernel/cgroup/cgroup-v1.c        |  20 ++---
 kernel/cgroup/cgroup.c           |  62 +++++++++------
 kernel/cgroup/cpuset.c           | 131 +++++++++++++++++++++----------
 kernel/cgroup/rstat.c            |  15 +++-
 kernel/sched/psi.c               |  66 +++++++---------
 mm/page_alloc.c                  |  13 +++
 11 files changed, 229 insertions(+), 125 deletions(-)

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-08-01 08:02:11 -04:00
Waiman Long e462accf60 mm/munlock: protect the per-CPU pagevec by a local_lock_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
Conflicts: A minor fuzz in mm/migrate.c due to missing upstream commit
           1eba86c096e3 ("mm: change page type prior to adding page
           table entry"). Pulling it in, however, would require taking in
           a number of additional patches, so it is not done here.

commit adb11e78c5dc5e26774acb05f983da36447f7911
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri, 1 Apr 2022 11:28:33 -0700

    mm/munlock: protect the per-CPU pagevec by a local_lock_t

    The access to mlock_pvec is protected by disabling preemption via
    get_cpu_var() or implicit by having preemption disabled by the caller
    (in mlock_page_drain() case).  This breaks on PREEMPT_RT since
    folio_lruvec_lock_irq() acquires a sleeping lock in this section.

    Create struct mlock_pvec, which consists of the local_lock_t and the
    pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
    Replace mlock_page_drain() with a _local() version which is invoked on
    the local CPU and acquires the local_lock_t, and a _remote() version
    which uses the pagevec from a remote CPU which has gone offline, as
    sketched below.
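
    A condensed sketch of the resulting structure (simplified from
    mm/mlock.c; the drain helper is trimmed):

        struct mlock_pvec {
                local_lock_t lock;
                struct pagevec vec;
        };

        static DEFINE_PER_CPU(struct mlock_pvec, mlock_pvec) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };

        void mlock_page_drain_local(void)
        {
                struct pagevec *pvec;

                local_lock(&mlock_pvec.lock);
                pvec = this_cpu_ptr(&mlock_pvec.vec);
                if (pagevec_count(pvec))
                        mlock_pagevec(pvec);    /* drain under the local lock */
                local_unlock(&mlock_pvec.lock);
        }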

    Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-21 14:50:55 -04:00
Patrick Talbert 379ca607c0 Merge: mm: folio backports part 2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1097

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Omitted-fix: a04cd1600b831a16625b45226b90a292c8f6e8d9

This is the second part of folio backports for 9.1. Like the first part, I tried
to avoid touching other subsystems as much as possible. Since folio conversions
leave the original functions as a compatibility layer, other teams can bring their
subsystem changes whenever they want.

These are not all of the folio changes for 9.1; the work will continue in 9.2.

adb11e78c5dc5 was not backported due to b74355078b not being present.
a04cd1600b831 fixes an issue already fixed by ec4858e07ed62eceb, which is strange
because ec4858e07ed62eceb was committed earlier.

v2:
- added missing fixes and dependencies
- fixed a backport error on "mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()"
- added Conflicts for everything to keep scripts happy

v3:
- included 3ed4bb77156d patchset as requested

v4:
- fixed bisect build problems

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Conflicts:
- drivers/gpu/drm/drm_cache.c: context differs due to !717.
- drivers/gpu/drm/nouveau/nouveau_dmem.c: context differs due to !717.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-15 10:00:05 +02:00
Aristeu Rozanski 02c4025a8d mm: Make compound_pincount always available
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: Notice we have RHEL-only 44740bc20b applied, but that shouldn't be a problem space-wise since we don't ship 32-bit kernels anymore and we're well under the 40-byte limit

commit 5232c63f46fdd779303527ec36c518cc1e9c6b4e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Jan 6 16:46:43 2022 -0500

    mm: Make compound_pincount always available

    Move compound_pincount from the third page to the second page, which
    means it's available for all compound pages.  That lets us delete
    hpage_pincount_available().

    On 32-bit systems, there isn't enough space for both compound_pincount
    and compound_nr in the second page (it would collide with page->private,
    which is in use for pages in the swap cache), so revert the optimisation
    of storing both compound_order and compound_nr on 32-bit systems.
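
    For illustration, after the move the accessor becomes essentially
    (simplified sketch):

        static inline atomic_t *compound_pincount_ptr(struct page *page)
        {
                /* the first tail page (page[1]) now holds compound_pincount,
                 * so every compound page can use it */
                return &page[1].compound_pincount;
        }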

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Joel Savitz 4c8ad89c62 mm/page_alloc: always attempt to allocate at least one page during bulk allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2094045

commit c572e4888ad1be123c1516ec577ad30a700bbec4
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Thu May 26 10:12:10 2022 +0100

    mm/page_alloc: always attempt to allocate at least one page during bulk allocation

    Peter Pavlisko reported the following problem on kernel bugzilla 216007.

            When I try to extract an uncompressed tar archive (2.6 million
            files, 760.3 GiB in size) on a newly created (empty) XFS file system,
            after the first low tens of gigabytes are extracted the process hangs
            in iowait indefinitely. One CPU core is 100% occupied with iowait,
            the other CPU core is idle (on a 2-core Intel Celeron G1610T).

    It was bisected to c9fa563072 ("xfs: use alloc_pages_bulk_array() for
    buffers") but XFS is only the messenger.  The problem is that nothing is
    waking kswapd to reclaim some pages at a time when the PCP lists cannot
    be refilled until some reclaim happens.  The bulk allocator checks
    whether there are some pages in the array, and the original intent was
    that a bulk allocation did not necessarily need all the requested pages
    and it was best to return as quickly as possible.

    This was fine for the first user of the API, but both NFS and XFS require
    the requested number of pages to be available before making progress.
    Both could be adjusted to call the page allocator directly if a bulk
    allocation fails, but that puts a burden on users of the API.  Adjust the
    semantics to attempt at least one allocation via __alloc_pages() before
    returning, so kswapd is woken if necessary, as sketched below.
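
    The adjusted tail of the bulk allocator looks roughly like this
    (simplified from __alloc_pages_bulk(); locking and statistics trimmed):

        failed:
                /* attempt at least one page through the normal allocation
                 * path so kswapd is woken when the free lists are low */
                page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
                if (page) {
                        if (page_list)
                                list_add(&page->lru, page_list);
                        else
                                page_array[nr_populated] = page;
                        nr_populated++;
                }
                return nr_populated;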

    It was reported via bugzilla that the patch addressed the problem and that
    the tar extraction completed successfully.  This may also address bug
    215975 but has yet to be confirmed.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216007
    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215975
    Link: https://lkml.kernel.org/r/20220526091210.GC3441@techsingularity.net
    Fixes: 387ba26fb1 ("mm/page_alloc: add a bulk page allocator")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Chuck Lever <chuck.lever@oracle.com>
    Cc: <stable@vger.kernel.org>    [5.13+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
2022-06-15 10:35:41 -04:00
Waiman Long 83fb75916e mm/page_alloc: detect allocation forbidden by cpuset and bail out early
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2060150

commit 8ca1b5a49885f0c0c486544da46a9e0ac790831d
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri, 5 Nov 2021 13:40:34 -0700

    mm/page_alloc: detect allocation forbidden by cpuset and bail out early

    There was a report that starting an Ubuntu container in Docker while
    using cpuset to bind it to movable nodes (a node that only has a movable
    zone, like a node for hotplug or a Persistent Memory node in normal
    usage) will fail due to memory allocation failure, after which the OOM
    killer gets involved and many other innocent processes get killed.

    It can be reproduced with command:

        $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

    (where node 4 is a movable node)

      runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
      CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
      Call Trace:
       dump_stack+0x6b/0x88
       dump_header+0x4a/0x1e2
       oom_kill_process.cold+0xb/0x10
       out_of_memory.part.0+0xaf/0x230
       out_of_memory+0x3d/0x80
       __alloc_pages_slowpath.constprop.0+0x954/0xa20
       __alloc_pages_nodemask+0x2d3/0x300
       pipe_write+0x322/0x590
       new_sync_write+0x196/0x1b0
       vfs_write+0x1c3/0x1f0
       ksys_write+0xa7/0xe0
       do_syscall_64+0x52/0xd0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

      Mem-Info:
      active_anon:392832 inactive_anon:182 isolated_anon:0
       active_file:68130 inactive_file:151527 isolated_file:0
       unevictable:2701 dirty:0 writeback:7
       slab_reclaimable:51418 slab_unreclaimable:116300
       mapped:45825 shmem:735 pagetables:2540 bounce:0
       free:159849484 free_pcp:73 free_cma:0
      Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
      Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
      lowmem_reserve[]: 0 0 0 0 0
      Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

      oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
      Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
      oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
      oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    The reason is that in this case the target cpuset nodes only have a
    movable zone, while the creation of an OS in Docker sometimes needs to
    allocate memory in non-movable zones (dma/dma32/normal), e.g. with
    GFP_HIGHUSER, and the cpuset limit forbids the allocation; out-of-memory
    killing then gets involved even when the normal nodes and movable nodes
    both have plenty of free memory.

    The OOM killer cannot help resolve the situation as there is no usable
    memory for the request within the cpuset scope.  The only reasonable
    measure to take is to fail the allocation right away and have the caller
    deal with it.

    So add a check for cases like this in the slowpath of allocation, and
    bail out early, returning NULL for the allocation.

    As page allocation is one of the hottest paths in the kernel, this check
    would hurt all users with a sane cpuset configuration, so add a static
    branch check and detect the abnormal config in the cpuset memory binding
    setup so that the extra check cost in page allocation is not paid by
    everyone; see the sketch below.
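
    Roughly sketched, the bail-out added to __alloc_pages_slowpath() looks
    like this (simplified; the real code also confirms that no allowed zone
    can satisfy the request before giving up):

        if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
                /* the cpuset is bound to movable-only nodes but the request
                 * needs a non-movable zone: reclaim and OOM killing cannot
                 * help, so fail fast and let the caller deal with it */
                goto nopage;
        }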

    [thanks to Michal Hocko and David Rientjes for suggesting not handling
     it inside OOM code, adding the cpuset check, refining comments]

    Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-06-13 09:47:53 -04:00
Patrick Talbert 407ad35116 Merge: mm: backport folio support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/678

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests with a stock kernel test run for comparison

This backport includes the base folio patches *without* touching any subsystems.
The patches are mostly straightforward conversions of functions to use folios.

v2: merge conflict; dropped 78525c74d9e7d1a6ce69bd4388f045f6e474a20b as it contradicts the fact that we're trying not to do subsystem conversions in this MR

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Carlos Maiolino <cmaiolino@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-03 10:59:25 +02:00
Patrick Talbert dfb49ebc4b Merge: Preallocate pgdat struct for all nodes during boot
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/630

```
The page allocator blows up when an allocation from a possible node is requested.
The underlying reason is that NODE_DATA for the specific node is not allocated.

Preallocate the pgdat struct for all possible nodes rather than just the online nodes to handle these cases more gracefully.
```
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054

Upstream-Status: Linus

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-04-19 12:23:39 +02:00
Aristeu Rozanski 3b6cedb421 mm/page_alloc: Add folio allocation functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok
Conflicts: context differs due to c00b6b9610991c042ff4c3153daaa3ea8522c210 being backported already

commit cc09cb134124a42fbe3bdcebefdc54e286d8f3e5
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Dec 15 22:55:54 2020 -0500

    mm/page_alloc: Add folio allocation functions

    The __folio_alloc(), __folio_alloc_node() and folio_alloc() functions
    are mostly for type safety, but they also ensure that the page allocator
    allocates a compound page and initialises the deferred list if the page
    is large enough to have one.
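
    Conceptually, the new helpers are thin wrappers (simplified sketch; the
    node-aware variants follow the same pattern):

        struct folio *folio_alloc(gfp_t gfp, unsigned int order)
        {
                /* __GFP_COMP guarantees a compound page, so the result is
                 * always safe to treat as a single folio */
                struct page *page = alloc_pages(gfp | __GFP_COMP, order);

                if (page && order > 1)
                        prep_transhuge_page(page);  /* set up deferred list */
                return (struct folio *)page;
        }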

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:31 -04:00
Aristeu Rozanski 98caaaf947 mm/memcg: Convert mem_cgroup_uncharge() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit bbc6b703b21963e909f633cf7718903ed5094319
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat May 1 20:42:23 2021 -0400

    mm/memcg: Convert mem_cgroup_uncharge() to take a folio

    Convert all the callers to call page_folio().  Most of them were already
    using a head page, but a few of them I can't prove were, so this may
    actually fix a bug.
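
    A typical caller-side change is just the following (illustrative; "page"
    stands for whatever page the caller already holds):

        - mem_cgroup_uncharge(page);
        + mem_cgroup_uncharge(page_folio(page));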

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Nico Pache 5a2c4c0b3c mm: make free_area_init_node aware of memory less nodes
commit 7c30daac20698cb035255089c896f230982b085e
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:47:03 2022 -0700

    mm: make free_area_init_node aware of memory less nodes

    free_area_init_node is also called from the memoryless node
    initialization path (free_area_init_memoryless_node).  It doesn't really
    make much sense to display the physical memory range for those nodes:
    Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]

    Instead be explicit that the node is memoryless: Initmem setup node XX as
    memoryless
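
    Roughly, the message selection in free_area_init_node() becomes
    (simplified sketch):

        if (start_pfn != end_pfn)
                pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n",
                        nid, (u64)start_pfn << PAGE_SHIFT,
                        ((u64)end_pfn << PAGE_SHIFT) - 1);
        else
                pr_info("Initmem setup node %d as memoryless\n", nid);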

    Link: https://lkml.kernel.org/r/20220127085305.20890-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Alexey Makhalov <amakhalov@vmware.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Nico Pache <npache@redhat.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:39 -06:00
Nico Pache 078bc11654 mm, memory_hotplug: reorganize new pgdat initialization
commit 70b5b46a754245d383811b4d2f2c76c34bb7e145
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:47:00 2022 -0700

    mm, memory_hotplug: reorganize new pgdat initialization

    When a !node_online node is brought up, it needs hotplug-specific
    initialization because the node could either still be uninitialized or it
    could have been recycled after a previous hotremove.  hotadd_init_pgdat
    is responsible for that.

    Internal pgdat state is initialized at two places currently
            - hotadd_init_pgdat
            - free_area_init_core_hotplug

    There is no real clear cut as to what should go where, but this patch
    chooses to move the whole internal state initialization into
    free_area_init_core_hotplug.  hotadd_init_pgdat is still responsible for
    pulling all the parts together - most notably for initializing zonelists,
    because those depend on the overall topology.

    This patch doesn't introduce any functional change.

    Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Alexey Makhalov <amakhalov@vmware.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nico Pache <npache@redhat.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:39 -06:00
Nico Pache 8e3254a841 mm: handle uninitialized numa nodes gracefully
commit 09f49dca570a917a8c6bccd7e8c61f5141534e3a
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:46:54 2022 -0700

    mm: handle uninitialized numa nodes gracefully

    We have had several reports [1][2][3] that page allocator blows up when an
    allocation from a possible node is requested.  The underlying reason is
    that NODE_DATA for the specific node is not allocated.

    NUMA-specific initialization is arch-specific and can vary a lot.  E.g.
    x86 tries to initialize all nodes that have some cpu affinity (see
    init_cpu_to_node), but this can be insufficient because the node might
    be cpuless, for example.

    One way to address this problem would be to check for !node_online nodes
    when trying to get a zonelist and silently fall back to another node.
    Unfortunately, that adds a branch to the allocator hot path, and it
    doesn't handle any other potential NODE_DATA users.

    This patch takes a different approach (following the lead of [3]) and
    preallocates pgdat for all possible nodes in arch-independent code -
    free_area_init.  All uninitialized nodes are treated as memoryless nodes.
    The node_state of the node is not changed, because that would lead to
    other side effects - e.g. a sysfs representation of such a node - and
    from past discussions [4] it is known that some tools might have problems
    digesting that.

    Newly allocated pgdat only gets a minimal initialization and the rest of
    the work is expected to be done by the memory hotplug - hotadd_new_pgdat
    (renamed to hotadd_init_pgdat).

    generic_alloc_nodedata is changed to use the memblock allocator because
    neither the page nor the slab allocator is available at the stage when
    all pgdats are allocated.  Hotplug doesn't allocate pgdat anymore, so we
    can use the early boot allocator.  The only arch-specific implementation
    is ia64, and that is changed to use the early allocator as well.
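
    In outline, free_area_init() now does something like this for every
    possible node (simplified sketch; error handling and comments trimmed):

        for_each_node(nid) {
                pg_data_t *pgdat;

                if (!node_online(nid)) {
                        /* possible but uninitialized node: give it a minimal,
                         * memblock-allocated pgdat so NODE_DATA(nid) is never
                         * NULL; the node is deliberately left !online */
                        pgdat = arch_alloc_nodedata(nid);
                        arch_refresh_nodedata(nid, pgdat);
                        free_area_init_memoryless_node(nid);
                        continue;
                }

                pgdat = NODE_DATA(nid);
                free_area_init_node(nid);
                /* ... normal zone and node state setup ... */
        }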

    [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
    [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
    [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
    [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com

    [akpm@linux-foundation.org: replace comment, per Mike]

    Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
    Reported-by: Alexey Makhalov <amakhalov@vmware.com>
    Tested-by: Alexey Makhalov <amakhalov@vmware.com>
    Reported-by: Nico Pache <npache@redhat.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Tested-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:38 -06:00
Rafael Aquini b501affe62 mm/page_alloc: check high-order pages for corruption during PCP operations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 77fe7f136a7312954b1b8b7eeb4bc91fc3c14a3f
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:44:00 2022 -0700

    mm/page_alloc: check high-order pages for corruption during PCP operations

    Eric Dumazet pointed out that commit 44042b4498 ("mm/page_alloc: allow
    high-order pages to be stored on the per-cpu lists") only checks the
    head page during PCP refill and allocation operations.  This was an
    oversight and all pages should be checked.  This will incur a small
    performance penalty but it's necessary for correctness.
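
    The shape of the fix, roughly (simplified sketch):

        /* check every page of a high-order pcp allocation, not only the head */
        static inline bool check_new_pages(struct page *page, unsigned int order)
        {
                int i;

                for (i = 0; i < (1 << order); i++) {
                        struct page *p = page + i;

                        if (unlikely(check_new_page(p)))
                                return true;
                }
                return false;
        }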

    Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net
    Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:43 -04:00