Commit Graph

2073 Commits

Author SHA1 Message Date
Chris von Recklinghausen 05bae72e4f mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d05141a393071e104bf5be5ad4d0c79c6dff343
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Fri Jun 10 16:21:40 2022 +0100

    mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON

    Currently post_alloc_hook() skips the kasan unpoisoning if the tags will
    be zeroed (__GFP_ZEROTAGS) or __GFP_SKIP_KASAN_UNPOISON is passed. Since
    __GFP_ZEROTAGS is now accompanied by __GFP_SKIP_KASAN_UNPOISON, remove
    the extra check.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Link: https://lore.kernel.org/r/20220610152141.2148929-4-catalin.marinas@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:22 -04:00
Chris von Recklinghausen 0bcda21835 mm: kasan: Skip unpoisoning of user pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 70c248aca9e7efa85a6664d5ab56c17c326c958f
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Fri Jun 10 16:21:39 2022 +0100

    mm: kasan: Skip unpoisoning of user pages

    Commit c275c5c6d5 ("kasan: disable freed user page poisoning with HW
    tags") added __GFP_SKIP_KASAN_POISON to GFP_HIGHUSER_MOVABLE. A similar
    argument can be made about unpoisoning, so also add
    __GFP_SKIP_KASAN_UNPOISON to user pages. To ensure the user page is
    still accessible via page_address() without a kasan fault, reset the
    page->flags tag.

    With the above changes, there is no need for the arm64
    tag_clear_highpage() to reset the page->flags tag.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Link: https://lore.kernel.org/r/20220610152141.2148929-3-catalin.marinas@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:22 -04:00
Chris von Recklinghausen 668121b9be mm/page_alloc: make the annotations of available memory more accurate
Bugzilla: https://bugzilla.redhat.com/2160210

commit ade63b419c4e8d27f0642804b6c8c7a76ffc18ac
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Thu Jun 23 02:08:34 2022 +0000

    mm/page_alloc: make the annotations of available memory more accurate

    Not all systems use swap, so estimating available memory would help to
    prevent swapping or OOM on systems that do not use swap.

    And we need to reserve some page cache to prevent swapping or thrashing.
    If somebody is accessing the pages in pagecache, and if too much would be
    freed, most accesses might mean reading data from disk, i.e.  thrashing.

    Link: https://lkml.kernel.org/r/20220623020833.972979-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:21 -04:00
Chris von Recklinghausen eed5a2e492 mm: convert destroy_compound_page() to destroy_large_folio()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5375336c8c42a343c3b440b6f1e21c65e7b174b9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:17 2022 +0100

    mm: convert destroy_compound_page() to destroy_large_folio()

    All callers now have a folio, so push the folio->page conversion
    down to this function.

    [akpm@linux-foundation.org: uninline destroy_large_folio() to fix build issue]
    Link: https://lkml.kernel.org/r/20220617175020.717127-20-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen dc32bddd4e mm: introduce clear_highpage_kasan_tagged
Bugzilla: https://bugzilla.redhat.com/2160210

commit d9da8f6cf55eeca642c021912af1890002464c64
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Jun 9 20:18:46 2022 +0200

    mm: introduce clear_highpage_kasan_tagged

    Add a clear_highpage_kasan_tagged() helper that does clear_highpage() on a
    page potentially tagged by KASAN.

    This helper is used by the following patch.

    Link: https://lkml.kernel.org/r/4471979b46b2c487787ddcd08b9dc5fedd1b6ffd.1654798516.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen a59713cb0c mm: rename kernel_init_free_pages to kernel_init_pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit aeaec8e27eddc147b96fe32df2671980ce7ca87c
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Jun 9 20:18:45 2022 +0200

    mm: rename kernel_init_free_pages to kernel_init_pages

    Rename kernel_init_free_pages() to kernel_init_pages().  This function is
    not only used for free pages but also for pages that were just allocated.

    Link: https://lkml.kernel.org/r/1ecaffc0a9c1404d4d7cf52efe0b2dc8a0c681d8.1654798516.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen 41ec620cfc mm/page_alloc: use might_alloc()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 446ec83805ddaab5b8734d30ba4ae8c56739a9b4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun Jun 5 17:25:37 2022 +0200

    mm/page_alloc: use might_alloc()

    ...  instead of open coding it.  Completely equivalent code, just a notch
    more meaningful when reading.

    Link: https://lkml.kernel.org/r/20220605152539.3196045-1-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
Chris von Recklinghausen 3fe654e9dd memblock: Disable mirror feature if kernelcore is not specified
Bugzilla: https://bugzilla.redhat.com/2160210

commit 902c2d91582c7ff0cb5f57ffb3766656f9b910c6
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Tue Jun 14 17:21:56 2022 +0800

    memblock: Disable mirror feature if kernelcore is not specified

    If the system has some mirrored memory and the mirror feature is not
    specified in the boot parameters, the basic mirror feature will still be
    enabled, and this will lead to the following situations:

    - memblock memory allocation prefers mirrored regions. This may have some
      unexpected influence on NUMA affinity.

    - contiguous memory will be split into several parts if parts of it
      are mirrored memory, via memblock_mark_mirror().

    To fix this, variable mirrored_kernelcore will be checked in
    memblock_mark_mirror(). Mark mirrored memory with flag MEMBLOCK_MIRROR iff
    kernelcore=mirror is added in the kernel parameters.
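
    A minimal sketch of the resulting check (simplified; memblock_setclr_flag()
    is assumed as the underlying flag helper, not quoted from the patch):

        /* Only mark memory as mirrored when kernelcore=mirror was requested. */
        int memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
        {
                if (!mirrored_kernelcore)
                        return 0;

                return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
        }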

    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Link: https://lore.kernel.org/r/20220614092156.1972846-6-mawupeng1@huawei.com
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
Chris von Recklinghausen a04ba46849 mm: fix is_pinnable_page against a cma page
Bugzilla: https://bugzilla.redhat.com/2160210

commit 1c563432588dbffa71e67ca6e37c826f9fa86e04
Author: Minchan Kim <minchan@kernel.org>
Date:   Tue May 24 10:15:25 2022 -0700

    mm: fix is_pinnable_page against a cma page

    Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so
    the current is_pinnable_page() could miss CMA pages which have
    MIGRATE_ISOLATE.  It ends up pinning CMA pages as longterm for the
    pin_user_pages() API so CMA allocations keep failing until the pin is
    released.

         CPU 0                                   CPU 1 - Task B

    cma_alloc
    alloc_contig_range
                                            pin_user_pages_fast(FOLL_LONGTERM)
    change pageblock as MIGRATE_ISOLATE
                                            internal_get_user_pages_fast
                                            lockless_pages_from_mm
                                            gup_pte_range
                                            try_grab_folio
                                            is_pinnable_page
                                              return true;
                                            So, pinned the page successfully.
    page migration failure with pinned page
                                            ..
                                            .. After 30 sec
                                            unpin_user_page(page)

    CMA allocation succeeded after 30 sec.

    The CMA allocation path protects against the migration type change race
    using zone->lock, but all the GUP path needs to know is whether the page
    is in the CMA area, not its exact migration type.  Thus, we don't need
    zone->lock; just check whether the migration type is MIGRATE_ISOLATE or
    MIGRATE_CMA.
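
    A minimal sketch of that check (simplified; the rest of is_pinnable_page()
    is omitted):

        /* Inside is_pinnable_page(), under CONFIG_CMA (sketch): */
        int mt = get_pageblock_migratetype(page);

        /* Reject longterm pinning for both CMA and isolated pageblocks,
         * without taking zone->lock. */
        if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
                return false;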

    Adding the MIGRATE_ISOLATE check in is_pinnable_page() could cause pinning
    to be rejected for pages on MIGRATE_ISOLATE pageblocks even when the page
    is in neither the CMA area nor a movable zone, if the page is temporarily
    unmovable.  However, such a migration failure caused by an unexpected
    temporary refcount is a general issue, not one specific to MIGRATE_ISOLATE,
    and MIGRATE_ISOLATE is itself a transient state, like other temporarily
    elevated refcount problems.

    Link: https://lkml.kernel.org/r/20220524171525.976723-1-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen f0971a2aaf mm: split free page with properly free memory accounting and without race
Bugzilla: https://bugzilla.redhat.com/2160210

commit 86d28b0709279ccc636ef9ba267b7f3bcef79a4b
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 26 19:15:31 2022 -0400

    mm: split free page with properly free memory accounting and without race

    In isolate_single_pageblock(), free pages are checked without holding zone
    lock, but they can go away in split_free_page() when zone lock is held.
    Check the free page and its order again in split_free_page() when zone lock
    is held. Recheck the page if the free page is gone under zone lock.

    In addition, in split_free_page(), the free page was deleted from the page
    list without changing free page accounting. Add the missing free page
    accounting code.

    Fix the type of order parameter in split_free_page().

    Link: https://lore.kernel.org/lkml/20220525103621.987185e2ca0079f7b97b856d@linux-foundation.org/
    Link: https://lkml.kernel.org/r/20220526231531.2404977-2-zi.yan@sent.com
    Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: Doug Berger <opendmb@gmail.com>
      Link: https://lore.kernel.org/linux-mm/c3932a6f-77fe-29f7-0c29-fe6b1c67ab7b@gmail.com/
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Qian Cai <quic_qiancai@quicinc.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michael Walle <michael@walle.cc>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen c4ca97baf9 mm: fix a potential infinite loop in start_isolate_page_range()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 88ee134320b8311ca7a00630e5ba013cd0239350
Author: Zi Yan <ziy@nvidia.com>
Date:   Tue May 24 15:47:56 2022 -0400

    mm: fix a potential infinite loop in start_isolate_page_range()

    In isolate_single_pageblock() called by start_isolate_page_range(), there
    are some pageblock isolation issues causing a potential infinite loop when
    isolating a page range.  This is reported by Qian Cai.

    1. the pageblock was isolated by just changing pageblock migratetype
       without checking unmovable pages. Calling set_migratetype_isolate() to
       isolate pageblock properly.
    2. an off-by-one error caused migrating pages unnecessarily, since the page
       is not crossing pageblock boundary.
    3. migrating a compound page across pageblock boundary then splitting the
       free page later has a small race window that the free page might be
       allocated again, so that the code will try again, causing a potential
       infinite loop. Temporarily set the to-be-migrated page's pageblock to
       MIGRATE_ISOLATE to prevent that and bail out early if no free page is
       found after page migration.

    An additional fix to split_free_page() aims to avoid crashing in
    __free_one_page().  When the free page is split at the specified
    split_pfn_offset, free_page_order should check both the first bit of
    free_page_pfn and the last bit of split_pfn_offset and use the smaller
    one.  For example, if free_page_pfn=0x10000, split_pfn_offset=0xc000,
    free_page_order should first be 0x8000 then 0x4000, instead of 0x4000 then
    0x8000, which the original algorithm did.
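
    A small standalone demo of that calculation (illustrative userspace C, not
    kernel code), using the example values above:

        #include <stdio.h>

        int main(void)
        {
                unsigned long pfn = 0x10000, remaining = 0xc000;

                while (remaining) {
                        /* chunk = min(lowest set bit of pfn,
                         *             highest set bit of the remaining offset) */
                        unsigned long pfn_align = pfn & -pfn;
                        unsigned long off_align =
                                1UL << (63 - __builtin_clzl(remaining));
                        unsigned long chunk =
                                pfn_align < off_align ? pfn_align : off_align;

                        printf("split at pfn 0x%lx, chunk 0x%lx pages\n", pfn, chunk);
                        pfn += chunk;
                        remaining -= chunk;
                }
                return 0;       /* prints chunks 0x8000 then 0x4000 */
        }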

    [akpm@linux-foundation.org: suppress min() warning]
    Link: https://lkml.kernel.org/r/20220524194756.1698351-1-zi.yan@sent.com
    Fixes: b2c9e2fbba3253 ("mm: make alloc_contig_range work at pageblock granularity")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: Qian Cai <quic_qiancai@quicinc.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen a4674660fa mm: fix missing handler for __GFP_NOWARN
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3f913fc5f9745613088d3c569778c9813ab9c129
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Thu May 19 14:08:55 2022 -0700

    mm: fix missing handler for __GFP_NOWARN

    We expect no warnings to be issued when we specify __GFP_NOWARN, but
    currently in paths like alloc_pages() and kmalloc() some warnings are
    still printed. Fix that.

    However, warnings that report usage problems are left alone: if such a
    warning is printed, the usage problem itself should be fixed, as in the
    following case:

            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

    [zhengqi.arch@bytedance.com: v2]
     Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Akinobu Mita <akinobu.mita@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 5c438547f0 mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 10e0f7530205799e7e971aba699a7cb3a47456de
Author: Wonhyuk Yang <vvghjk1234@gmail.com>
Date:   Thu May 19 14:08:54 2022 -0700

    mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()

    Currently, the tracepoint mm_page_alloc_zone_locked() doesn't show correct
    information.

    First, when alloc_flags has ALLOC_HARDER/ALLOC_CMA, the page can be
    allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA.  Nevertheless, the
    tracepoint reports the requested migration type, not MIGRATE_HIGHATOMIC
    or MIGRATE_CMA.

    Second, after commit 44042b4498 ("mm/page_alloc: allow high-order pages
    to be stored on the per-cpu lists") the percpu list can store high-order
    pages.  But the tracepoint determines whether it is a refill of the
    percpu list by comparing the requested order with 0.

    To handle these problems, make mm_page_alloc_zone_locked() only be called
    by __rmqueue_smallest with correct migration type.  With a new argument
    called percpu_refill, it can show roughly whether it is a refill of
    percpu-list.

    Link: https://lkml.kernel.org/r/20220512025307.57924-1-vvghjk1234@gmail.com
    Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Baik Song An <bsahn@etri.re.kr>
    Cc: Hong Yeon Kim <kimhy@etri.re.kr>
    Cc: Taeung Song <taeung@reallinux.co.kr>
    Cc: <linuxgeek@linuxgeek.io>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 32443a7ae5 mm/memory-failure.c: simplify num_poisoned_pages_dec
Bugzilla: https://bugzilla.redhat.com/2160210

commit c8bd84f73fd6215d5b8d0b3cfc914a3671b16d1c
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: simplify num_poisoned_pages_dec

    Don't decrease the number of poisoned pages in page_alloc.c; let
    memory-failure.c alone inc/dec the poisoned page count.

    Also simplify unpoison_memory() to only decrease the number of
    poisoned pages when:
     - TestClearPageHWPoison() succeeds
     - put_page_back_buddy() succeeds

    After decreasing, print the necessary log message.

    Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

    Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 1369e752bb mm: cma: use pageblock_order as the single alignment
Bugzilla: https://bugzilla.redhat.com/2160210

commit 11ac3e87ce09c27f4587a8c4fe0829d814021a82
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: cma: use pageblock_order as the single alignment

    Now alloc_contig_range() works at pageblock granularity.  Change CMA
    allocation, which uses alloc_contig_range(), to use pageblock_nr_pages
    alignment.

    Link: https://lkml.kernel.org/r/20220425143118.2850746-6-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 9ac05d330c mm: page_isolation: enable arbitrary range page isolation.
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6e263fff1de48fcd97b680b54cd8d1695fc3c776
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: page_isolation: enable arbitrary range page isolation.

    Now start_isolate_page_range() is ready to handle arbitrary range
    isolation, so move the alignment check/adjustment into the function body.
    Do the same for its counterpart undo_isolate_page_range().
    alloc_contig_range(), its caller, can pass an arbitrary range instead of a
    MAX_ORDER_NR_PAGES aligned one.

    Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 69bfe65709 mm: make alloc_contig_range work at pageblock granularity
Bugzilla: https://bugzilla.redhat.com/2160210

commit b2c9e2fbba32539626522b6aed30d1dde7b7e971
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: make alloc_contig_range work at pageblock granularity

    alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
    merging pageblocks with different migratetypes.  It might unnecessarily
    convert extra pageblocks at the beginning and at the end of the range.
    Change alloc_contig_range() to work at pageblock granularity.

    Special handling is needed for free pages and in-use pages across the
    boundaries of the range specified by alloc_contig_range(), because these
    partially isolated pages cause free page accounting issues.  The free
    pages will be split and freed into separate migratetype lists; the in-use
    pages will be migrated and then the freed pages will be handled in the
    aforementioned way.

    [ziy@nvidia.com: fix deadlock/crash]
      Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
    Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 7f32c40d39 mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
Bugzilla: https://bugzilla.redhat.com/2160210

commit b48d8a8e5ce53e3114a1ffe96563e3555b51d40b
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:57 2022 -0700

    mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c

    Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11.

    This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
    and alloc_contig_range(). It prepares for my upcoming changes to make
    MAX_ORDER adjustable at boot time[1].

    The MAX_ORDER - 1 alignment requirement comes from the fact that
    alloc_contig_range() isolates pageblocks to remove free memory from the buddy
    allocator but isolating only a subset of pageblocks within a page spanning
    across multiple pageblocks causes free page accounting issues.  Isolated
    page might not be put into the right free list, since the code assumes the
    migratetype of the first pageblock as the whole free page migratetype.
    This is based on the discussion at [2].

    To remove the requirement, this patchset:
    1. isolates pages at pageblock granularity instead of
       max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
    2. splits free pages across the specified range or migrates in-use pages
       across the specified range then splits the freed page to avoid free page
       accounting issues (it happens when multiple pageblocks within a single page
       have different migratetypes);
    3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
       range during isolation to avoid alloc_contig_range() failure when pageblocks
       within a MAX_ORDER - 1 aligned range are allocated separately.
    4. returns pages not in the range as it did before.

    One optimization might come later:
    1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
       migratetypes when isolation fails in the middle of the range.

    [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
    [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/

    This patch (of 6):

    has_unmovable_pages() is only used in mm/page_isolation.c.  Move it from
    mm/page_alloc.c and make it static.

    Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: kernel test robot <lkp@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 3f1b5303a4 mm/page_alloc: cache the result of node_dirty_ok()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8a87d6959f0d81108d95b0dbd3d7dc2cecea853d
Author: Wonhyuk Yang <vvghjk1234@gmail.com>
Date:   Thu May 12 20:22:51 2022 -0700

    mm/page_alloc: cache the result of node_dirty_ok()

    To spread dirty pages, nodes are checked to see whether they have reached
    the dirty limit using the expensive node_dirty_ok().  To reduce the
    frequency of calling node_dirty_ok(), the last node that hit the dirty
    limit can be cached.

    Instead of caching only the node, caching both the node and its
    node_dirty_ok() status reduces the number of calls to node_dirty_ok().
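
    A sketch of the caching pattern inside the zonelist walk (simplified; the
    surrounding get_page_from_freelist() loop is omitted):

        /* Per-allocation cache of the last node checked and its verdict. */
        if (ac->spread_dirty_pages) {
                if (last_pgdat != zone->zone_pgdat) {
                        last_pgdat = zone->zone_pgdat;
                        last_pgdat_dirty_ok = node_dirty_ok(zone->zone_pgdat);
                }
                /* Skip zones on a node that already hit its dirty limit. */
                if (!last_pgdat_dirty_ok)
                        continue;
        }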

    [akpm@linux-foundation.org: rename last_pgdat_dirty_limit to last_pgdat_dirty_ok]
    Link: https://lkml.kernel.org/r/20220430011032.64071-1-vvghjk1234@gmail.com
    Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Donghyeok Kim <dthex5d@gmail.com>
    Cc: JaeSang Yoo <jsyoo5b@gmail.com>
    Cc: Jiyoup Kim <lakroforce@gmail.com>
    Cc: Ohhoon Kwon <ohkwon1043@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:03 -04:00
Chris von Recklinghausen f942ace7a2 mm: create new mm/swap.h header file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 014bb1de4fc17d54907d54418126a9a9736f4aff
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:47 2022 -0700

    mm: create new mm/swap.h header file

    Patch series "MM changes to improve swap-over-NFS support".

    Assorted improvements for swap-via-filesystem.

    This is a resend of these patches, rebased on current HEAD.  The only
    substantial changes is that swap_dirty_folio has replaced
    swap_set_page_dirty.

    Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
    has previously worked for NFS but that broke a few releases back.  This
    series changes to use a new ->swap_rw rather than ->readpage and
    ->direct_IO.  It also makes other improvements.

    There is a companion series already in linux-next which fixes various
    issues with NFS.  Once both series land, a final patch is needed which
    changes NFS over to use ->swap_rw.

    This patch (of 10):

    Many functions declared in include/linux/swap.h are only used within mm/

    Create a new "mm/swap.h" and move some of these declarations there.
    Remove the redundant 'extern' from the function declarations.

    [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
    Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
    Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen 9529ba417b mm/page_alloc: simplify update of pgdat in wake_all_kswapds
Bugzilla: https://bugzilla.redhat.com/2160210

commit d137a7cb9b2ab8155184b2da9a304afff8f84d36
Author: Chen Wandun <chenwandun@huawei.com>
Date:   Fri Apr 29 14:36:59 2022 -0700

    mm/page_alloc: simplify update of pgdat in wake_all_kswapds

    There is no need to update last_pgdat for each zone, only update
    last_pgdat when iterating the first zone of a node.

    Link: https://lkml.kernel.org/r/20220322115635.2708989-1-chenwandun@huawei.com
    Signed-off-by: Chen Wandun <chenwandun@huawei.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:57 -04:00
Chris von Recklinghausen 399ea3c9ec mm/page_alloc: reuse tail struct pages for compound devmaps
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6fd3620b342861de9547ea01d28f664892ef51a1
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Thu Apr 28 23:16:16 2022 -0700

    mm/page_alloc: reuse tail struct pages for compound devmaps

    Currently memmap_init_zone_device() ends up initializing 32768 pages when
    it only needs to initialize 128 given tail page reuse.  That number is
    worse with 1GB compound pages, 262144 instead of 128.  Update
    memmap_init_zone_device() to skip redundant initialization, detailed
    below.

    When a pgmap @vmemmap_shift is set, all pages are mapped at a given huge
    page alignment and use compound pages to describe them as opposed to a
    struct per 4K.

    With @vmemmap_shift > 0 and when struct pages are stored in ram (!altmap)
    most tail pages are reused.  Consequently, the amount of unique struct
    pages is a lot smaller than the total amount of struct pages being mapped.

    The altmap path is left alone since it does not support memory savings
    based on compound pages devmap.

    Link: https://lkml.kernel.org/r/20220420155310.9712-6-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 5f728eb268 mm/page_alloc.c: calc the right pfn if page size is not 4K
Bugzilla: https://bugzilla.redhat.com/2160210

commit aa282a157bf8ff79bed9164dc5e0e99f0d9e9755
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Thu Apr 28 23:16:14 2022 -0700

    mm/page_alloc.c: calc the right pfn if page size is not 4K

    Previously, 0x100000 was used to check the 4G limit in
    find_zone_movable_pfns_for_nodes().  This is correct on x86 because the
    page size can only be 4K, but 16K and 64K page sizes are available on
    arm64.  So replace it with PHYS_PFN(SZ_4G).

    Link: https://lkml.kernel.org/r/20220414101314.1250667-8-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 5cea471a90 mm/vmscan: make sure wakeup_kswapd with managed zone
Bugzilla: https://bugzilla.redhat.com/2160210

commit bc53008eea55330f485c956338d3c59f96c70c08
Author: Wei Yang <richard.weiyang@gmail.com>
Date:   Thu Apr 28 23:16:03 2022 -0700

    mm/vmscan: make sure wakeup_kswapd with managed zone

    wakeup_kswapd() only wakes up kswapd when the zone is managed.

    Its two callers, however, work from a node perspective:

      * wake_all_kswapds
      * numamigrate_isolate_page

    If we picked up a !managed zone, this is not what we expected.

    This patch makes sure we pick up a managed zone for wakeup_kswapd().  It
    also uses managed_zone in migrate_balanced_pgdat() to get the proper
    zone.

    [richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()]
      Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com
    Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 6cf53831b0 mm: wrap __find_buddy_pfn() with a necessary buddy page validation
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8170ac4700d26f65a9a4ebc8ae488539158dc5f7
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm: wrap __find_buddy_pfn() with a necessary buddy page validation

    Whenever the buddy of a page is found from __find_buddy_pfn(),
    page_is_buddy() should be used to check its validity.  Add a helper
    function find_buddy_page_pfn() to find the buddy page and do the check
    together.

    [ziy@nvidia.com: updates per David]
    Link: https://lkml.kernel.org/r/20220401230804.1658207-2-zi.yan@sent.com
    Link: https://lore.kernel.org/linux-mm/CAHk-=wji_AmYygZMTsPMdJ7XksMt7kOur8oDfDdniBRMjm4VkQ@mail.gmail.com/
    Link: https://lkml.kernel.org/r/7236E7CA-B5F1-4C04-AB85-E86FA3E9A54B@nvidia.com
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen a99a43f58b mm: page_alloc: simplify pageblock migratetype check in __free_one_page()
Bugzilla: https://bugzilla.redhat.com/2160210

commit bb0e28eb5bc2b3a22e47861ca59bccca566023e8
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm: page_alloc: simplify pageblock migratetype check in __free_one_page()

    Move pageblock migratetype check code in the while loop to simplify the
    logic. It also saves redundant buddy page checking code.

    Link: https://lkml.kernel.org/r/20220401230804.1658207-1-zi.yan@sent.com
    Link: https://lore.kernel.org/linux-mm/27ff69f9-60c5-9e59-feb2-295250077551@suse.cz/
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Suggested-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 57def8f6ec mm/page_alloc: adding same penalty is enough to get round-robin order
Bugzilla: https://bugzilla.redhat.com/2160210

commit 379313241e77abc18258da1afd49d111c72c5a3d
Author: Wei Yang <richard.weiyang@gmail.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm/page_alloc: adding same penalty is enough to get round-robin order

    To get a round-robin node order within the same distance group, we add a
    penalty to the first node we pick in each round.

    To get a round-robin order in the same distance group, we don't need to
    decrease the penalty since:

      * find_next_best_node() always iterates node in the same order
      * distance matters more than penalty in find_next_best_node()
      * in nodes with the same distance, the first one would be picked up

    So it is fine to add the same penalty when we get the first node in the
    same distance group.  Since we just add a constant of 1 to the node
    penalty, it is not necessary to multiply by MAX_NODE_LOAD for preference.

    [richard.weiyang@gmail.com: remove remove MAX_NODE_LOAD, per Vlastimil]
      Link: https://lkml.kernel.org/r/20220412001319.7462-1-richard.weiyang@gmail.com
    Link: https://lkml.kernel.org/r/20220123013537.20491-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Nico Pache bd508dfa58 mm: prep_compound_tail() clear page->private
commit 5aae9265ee1a30cf716d6caf6b29fe99b9d55130
Author: Hugh Dickins <hughd@google.com>
Date:   Sat Oct 22 00:51:06 2022 -0700

    mm: prep_compound_tail() clear page->private

    Although page allocation always clears page->private in the first page or
    head page of an allocation, it has never made a point of clearing
    page->private in the tails (though 0 is often what is already there).

    But now commit 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t
    during THP split") issues a warning when page_tail->private is found to be
    non-0 (unless it's swapcache).

    Change that warning to dump page_tail (which also dumps head), instead of
    just the head: so far we have seen dead000000000122, dead000000000003,
    dead000000000001 or 0000000000000002 in the raw output for tail private.

    We could just delete the warning, but today's consensus appears to want
    page->private to be 0, unless there's a good reason for it to be set: so
    now clear it in prep_compound_tail() (more general than just for THP; but
    not for high order allocation, which makes no pass down the tails).

    Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com
    Fixes: 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:44 -07:00
Nico Pache 45f65b1711 mm/page_alloc: use local variable zone_idx directly
commit c035290424a9b7b64477752058b460d0ecc21987
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:50 2022 +0800

    mm/page_alloc: use local variable zone_idx directly

    Use local variable zone_idx directly since it holds the exact value of
    zone_idx().  No functional change intended.

    Link: https://lkml.kernel.org/r/20220916072257.9639-10-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache acc09c7bfe mm/page_alloc: add missing is_migrate_isolate() check in set_page_guard()
commit b36184553d41c59e6712f9d4699aca24577fbd4a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:49 2022 +0800

    mm/page_alloc: add missing is_migrate_isolate() check in set_page_guard()

    In the MIGRATE_ISOLATE case, the zone freepage state shouldn't be modified,
    as the caller will take care of it.  Add the missing is_migrate_isolate()
    check here to avoid a possible unbalanced freepage state.  This would
    happen if someone isolates the block and we then face an MCE
    failure/soft-offline on a page within that block.
    __mod_zone_freepage_state() would be triggered via the call trace below,
    even though it had already been triggered when the block was isolated:

    take_page_off_buddy
      break_down_buddy_pages
        set_page_guard
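
    A sketch of the added check in set_page_guard() (simplified):

        /* Isolated pageblocks were already removed from the freepage counters
         * when the block was isolated, so only adjust them here for
         * non-isolated migratetypes. */
        if (!is_migrate_isolate(migratetype))
                __mod_zone_freepage_state(zone, -(1 << order), migratetype);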

    Link: https://lkml.kernel.org/r/20220916072257.9639-9-linmiaohe@huawei.com
    Fixes: 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 350a1bc7ca mm/page_alloc: fix freeing static percpu memory
commit 022e7fa0f73d7c90cf3d6bea3d4e4cc5df1e1087
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:47 2022 +0800

    mm/page_alloc: fix freeing static percpu memory

    The size of struct per_cpu_zonestat can be 0 on !SMP && !NUMA.  In that
    case, zone->per_cpu_zonestats will always be equal to boot_zonestats.  But
    in zone_pcp_reset(), zone->per_cpu_zonestats is freed via free_percpu()
    directly without checking it against boot_zonestats first, so
    boot_zonestats would be released by free_percpu() unexpectedly.
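
    A sketch of the guarded teardown described above (simplified; locking and
    the per-CPU stat draining in zone_pcp_reset() are omitted):

        /* Never hand the static boot-time fallbacks to free_percpu(). */
        if (zone->per_cpu_pageset != &boot_pageset) {
                free_percpu(zone->per_cpu_pageset);
                free_percpu(zone->per_cpu_zonestats);
                zone->per_cpu_pageset = &boot_pageset;
                zone->per_cpu_zonestats = &boot_zonestats;
        }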

    Link: https://lkml.kernel.org/r/20220916072257.9639-7-linmiaohe@huawei.com
    Fixes: 28f836b677 ("mm/page_alloc: split per cpu page lists and zone stats")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 7655dc1ad4 mm/page_alloc: add __init annotations to init_mem_debugging_and_hardening()
commit 5749fcc5f04cef4091dea0c2ba6b5c5f5e05a549
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:46 2022 +0800

    mm/page_alloc: add __init annotations to init_mem_debugging_and_hardening()

    It's only called by mm_init(). Add __init annotations to it.

    Link: https://lkml.kernel.org/r/20220916072257.9639-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 067e773fe8 mm/page_alloc: remove obsolete comment in zone_statistics()
commit 709924bc7555db4867403f1f6e51cac4250bca87
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:45 2022 +0800

    mm/page_alloc: remove obsolete comment in zone_statistics()

    Since commit 43c95bcc51 ("mm/page_alloc: reduce duration that IRQs are
    disabled for VM counters"), zone_statistics() is not called with
    interrupts disabled.  Update the corresponding comment.

    Link: https://lkml.kernel.org/r/20220916072257.9639-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache e34506a89f mm: remove obsolete macro NR_PCP_ORDER_MASK and NR_PCP_ORDER_WIDTH
commit 638a9ae97ab596f1f7b7522dad709e69cb5b4e9d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:44 2022 +0800

    mm: remove obsolete macro NR_PCP_ORDER_MASK and NR_PCP_ORDER_WIDTH

    Since commit 8b10b465d0e1 ("mm/page_alloc: free pages in a single pass
    during bulk free"), they're not used anymore.  Remove them.

    Link: https://lkml.kernel.org/r/20220916072257.9639-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 5c785b9d4f mm/page_alloc: make zone_pcp_update() static
commit b89f1735169b8ab54b6a03bf4823657ee4e30073
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:43 2022 +0800

    mm/page_alloc: make zone_pcp_update() static

    Since commit b92ca18e8c ("mm/page_alloc: disassociate the pcp->high from
    pcp->batch"), zone_pcp_update() is only used in mm/page_alloc.c.  Move
    zone_pcp_update() up to avoid forward declaration and then make it static.
    No functional change intended.

    Link: https://lkml.kernel.org/r/20220916072257.9639-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 7b4db7d995 mm/page_alloc: ensure kswapd doesn't accidentally go to sleep
commit ce96fa6223ee851cb83118678f6e75f260852a80
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:42 2022 +0800

    mm/page_alloc: ensure kswapd doesn't accidentally go to sleep

    Patch series "A few cleanup patches for mm", v2.

    This series contains a few cleanup patches to remove the obsolete comments
    and functions, use helper macro to improve readability and so on.  More
    details can be found in the respective changelogs.

    This patch (of 16):

    If ALLOC_KSWAPD is set, wake_all_kswapds() will be called to ensure kswapd
    doesn't accidentally go to sleep.  But when reserve_flags is set,
    alloc_flags will be overwritten and ALLOC_KSWAPD is thus lost.  Preserve
    the ALLOC_KSWAPD flag in alloc_flags to ensure kswapd won't go to sleep
    accidentally.
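
    A sketch of how the flag is preserved when alloc_flags is rebuilt from
    reserve_flags (the helper name reflects the code of that era and is shown
    for illustration):

        if (reserve_flags)
                /* Keep the kswapd-wakeup bit when rebuilding alloc_flags. */
                alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
                              (alloc_flags & ALLOC_KSWAPD);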

    Link: https://lkml.kernel.org/r/20220916072257.9639-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220916072257.9639-2-linmiaohe@huawei.com
    Fixes: 0a79cdad5e ("mm: use alloc_flags to record if kswapd can wake")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache 9e5684f3de mm/page_alloc: fix race condition between build_all_zonelists and page allocation
commit 3d36424b3b5850bd92f3e89b953a430d7cfc88ef
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Wed Aug 24 12:14:50 2022 +0100

    mm/page_alloc: fix race condition between build_all_zonelists and page allocation

    Patrick Daly reported the following problem;

            NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
            [0] - ZONE_MOVABLE
            [1] - ZONE_NORMAL
            [2] - NULL

            For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
            offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
            memory_offline operation removes the last page from ZONE_MOVABLE,
            build_all_zonelists() & build_zonerefs_node() will update
            node_zonelists as shown below. Only populated zones are added.

            NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
            [0] - ZONE_NORMAL
            [1] - NULL
            [2] - NULL

    The race is simple -- page allocation could be in progress when a memory
    hot-remove operation triggers a zonelist rebuild that removes zones.  The
    allocation request will still have a valid ac->preferred_zoneref that is
    now pointing to NULL and triggers an OOM kill.

    This problem probably always existed but may be slightly easier to trigger
    due to 6aa303defb ("mm, vmscan: only allocate and reclaim from zones
    with pages managed by the buddy allocator") which distinguishes between
    zones that are completely unpopulated versus zones that have valid pages
    not managed by the buddy allocator (e.g.  reserved, memblock, ballooning
    etc).  Memory hotplug had multiple stages with timing considerations
    around managed/present page updates, the zonelist rebuild and the zone
    span updates.  As David Hildenbrand puts it

            memory offlining adjusts managed+present pages of the zone
            essentially in one go. If after the adjustments, the zone is no
            longer populated (present==0), we rebuild the zone lists.

            Once that's done, we try shrinking the zone (start+spanned
            pages) -- which results in zone_start_pfn == 0 if there are no
            more pages. That happens *after* rebuilding the zonelists via
            remove_pfn_range_from_zone().

    The only requirement to fix the race is that a page allocation request
    identifies when a zonelist rebuild has happened since the allocation
    request started and no page has yet been allocated.  Use a seqlock_t to
    track zonelist updates, with a lockless read side for the zonelist and a
    spinlock protecting the rebuild and the update of the counter.
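
    A sketch of the seqlock pattern (simplified; attempt_allocation() and the
    exact placement are illustrative, upstream wraps this in small helpers):

        static DEFINE_SEQLOCK(zonelist_update_seq);

        /* Allocation slow path (sketch): */
        restart:
        cookie = read_seqbegin(&zonelist_update_seq);
        page = attempt_allocation(gfp_mask, order, ac);
        if (!page && read_seqretry(&zonelist_update_seq, cookie))
                goto restart;   /* a zonelist rebuild raced with us: retry, don't OOM */

        /* Writer side, in build_all_zonelists() (sketch): */
        write_seqlock(&zonelist_update_seq);
        __build_all_zonelists(data);
        write_sequnlock(&zonelist_update_seq);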

    [akpm@linux-foundation.org: make zonelist_update_seq static]
    Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
    Fixes: 6aa303defb ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: <stable@vger.kernel.org>    [4.9+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache ec2c7f4c5b page_alloc: fix invalid watermark check on a negative value
commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13
Author: Jaewon Kim <jaewon31.kim@samsung.com>
Date:   Mon Jul 25 18:52:12 2022 +0900

    page_alloc: fix invalid watermark check on a negative value

    There was a report that a task is waiting at the
    throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was
    increasing.

    This is a bug where zone_watermark_fast returns true even when the free
    count is very low. Commit f27ce0e140 ("page_alloc: consider highatomic
    reserve in watermark fast") changed the fast watermark check to consider
    the highatomic reserve, but it did not handle the negative value case,
    which can happen when the reserved_highatomic pageblock is bigger than
    the actual free count.

    If the watermark is considered ok for a negative value, order-0 allocating
    contexts will consume all free pages without direct reclaim, and free
    pages may eventually become depleted except for the highatomic reserve.

    Allocating contexts may then fall into throttle_direct_reclaim. This
    symptom can easily happen on a system where wmark min is low and other
    reclaimers like kswapd do not produce free pages quickly.

    Handle the negative case by using MIN.
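
    A sketch of the order-0 fast path with the fix applied (condensed from
    zone_watermark_fast()):

        if (!order) {
                long usable_free = free_pages;
                long reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);

                /*
                 * reserved may over-estimate the highatomic reserve, so
                 * never subtract more than what is actually free.
                 */
                usable_free -= min(usable_free, reserved);

                if (usable_free > mark + z->lowmem_reserve[highest_zoneidx])
                        return true;
        }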

    Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
    Fixes: f27ce0e140 ("page_alloc: consider highatomic reserve in watermark fast")
    Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
    Reported-by: GyeongHwan Hong <gh21.hong@samsung.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Yong-Taek Lee <ytk.lee@samsung.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:40 -07:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE

RHEL commit b23c298982 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA")
is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86, which was not in RHEL until this series. 6692c98c7df5 is
re-added after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Chris von Recklinghausen 943e17aaec page_alloc: use vmalloc_huge for large system hash
Bugzilla: https://bugzilla.redhat.com/2120352

commit f2edd118d02dd11449b126f786f09749ca152ba5
Author: Song Liu <song@kernel.org>
Date:   Fri Apr 15 09:44:11 2022 -0700

    page_alloc: use vmalloc_huge for large system hash

    Use vmalloc_huge() in alloc_large_system_hash() so that large system
    hash (>= PMD_SIZE) could benefit from huge pages.

    Note that vmalloc_huge only allocates huge pages for systems with
    HAVE_ARCH_HUGE_VMALLOC.

    Signed-off-by: Song Liu <song@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen 96f852f113 mm, page_alloc: fix build_zonerefs_node()
Bugzilla: https://bugzilla.redhat.com/2120352

commit e553f62f10d93551eb883eca227ac54d1a4fad84
Author: Juergen Gross <jgross@suse.com>
Date:   Thu Apr 14 19:13:43 2022 -0700

    mm, page_alloc: fix build_zonerefs_node()

    Since commit 6aa303defb ("mm, vmscan: only allocate and reclaim from
    zones with pages managed by the buddy allocator") only zones with free
    memory are included in a built zonelist.  This is problematic when e.g.
    all memory of a zone has been ballooned out when zonelists are being
    rebuilt.

    The decision whether to rebuild the zonelists when onlining new memory
    is done based on populated_zone() returning 0 for the zone the memory
    will be added to.  The new zone is added to the zonelists only, if it
    has free memory pages (managed_zone() returns a non-zero value) after
    the memory has been onlined.  This implies, that onlining memory will
    always free the added pages to the allocator immediately, but this is
    not true in all cases: when e.g. running as a Xen guest, the onlined new
    memory will be added only to the ballooned memory list; it will be freed
    only when the guest is ballooned up afterwards.

    Another problem with using managed_zone() for the decision whether a
    zone is being added to the zonelists is, that a zone with all memory
    used will in fact be removed from all zonelists in case the zonelists
    happen to be rebuilt.

    Use populated_zone() when building a zonelist as it has been done before
    that commit.
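
    A sketch of the resulting zonelist construction, back to using
    populated_zone() (condensed from build_zonerefs_node()):

        static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
        {
                struct zone *zone;
                enum zone_type zone_type = MAX_NR_ZONES;
                int nr_zones = 0;

                do {
                        zone_type--;
                        zone = pgdat->node_zones + zone_type;
                        /* was managed_zone(zone), which drops fully ballooned zones */
                        if (populated_zone(zone)) {
                                zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                                check_highest_zone(zone_type);
                        }
                } while (zone_type);

                return nr_zones;
        }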

    There was a report that QubesOS (based on Xen) is hitting this problem.
    Xen has switched to use the zone device functionality in kernel 5.9 and
    QubesOS wants to use memory hotplugging for guests in order to be able
    to start a guest with minimal memory and expand it as needed.  This was
    the report leading to the patch.

    Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
    Fixes: 6aa303defb ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
    Signed-off-by: Juergen Gross <jgross@suse.com>
    Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
    Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen a104807957 Revert "mm/page_alloc: mark pagesets as __maybe_unused"
Bugzilla: https://bugzilla.redhat.com/2120352

commit 273ba85b5e8b971ed28eb5c17e1638543be9237d
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Mar 28 16:58:10 2022 +0200

    Revert "mm/page_alloc: mark pagesets as __maybe_unused"

    The local_lock() is now using a proper static inline function which is
    enough for llvm to accept that the variable is used.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220328145810.86783-4-bigeasy@linutronix.de

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:04 -04:00
Chris von Recklinghausen 21608e466e mm: page_alloc: validate buddy before check its migratetype.
Bugzilla: https://bugzilla.redhat.com/2120352

commit 787af64d05cd528aac9ad16752d11bb1c6061bb9
Author: Zi Yan <ziy@nvidia.com>
Date:   Wed Mar 30 15:45:43 2022 -0700

    mm: page_alloc: validate buddy before check its migratetype.

    Whenever a buddy page is found, page_is_buddy() should be called to
    check its validity.  Add the missing check during pageblock merge check.
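
    A sketch of the added check in the merge path of __free_one_page()
    (surrounding merge logic omitted):

        buddy_pfn = __find_buddy_pfn(pfn, order);
        buddy = page + (buddy_pfn - pfn);

        /* validate the buddy before looking at its migratetype */
        if (!page_is_buddy(page, buddy, order))
                goto done_merging;

        buddy_mt = get_pageblock_migratetype(buddy);
        /* only now is it safe to compare migratetypes for the merge decision */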

    Fixes: 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable pageblocks with others")
    Link: https://lore.kernel.org/all/20220330154208.71aca532@gandalf.local.home/
    Reported-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:03 -04:00
Chris von Recklinghausen 397b77192d kasan, page_alloc: allow skipping memory init for HW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9353ffa6e9e90d2b6348209cf2b95a8ffee18711
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:29 2022 -0700

    kasan, page_alloc: allow skipping memory init for HW_TAGS

    Add a new GFP flag __GFP_SKIP_ZERO that allows to skip memory
    initialization.  The flag is only effective with HW_TAGS KASAN.

    This flag will be used by vmalloc code for page_alloc allocations backing
    vmalloc() mappings in a following patch.  The reason to skip memory
    initialization for these pages in page_alloc is because vmalloc code will
    be initializing them instead.

    With the current implementation, when __GFP_SKIP_ZERO is provided,
    __GFP_ZEROTAGS is ignored.  This doesn't matter, as these two flags are
    never provided at the same time.  However, if this is changed in the
    future, this particular implementation detail can be changed as well.
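
    A sketch of how the flag is meant to be consumed (approximate: the helper
    that folds in the KASAN check has a different name upstream, and
    __GFP_SKIP_ZERO only exists with CONFIG_KASAN_HW_TAGS):

        /* in post_alloc_hook(), roughly: zero on allocation unless opted out */
        bool init = want_init_on_alloc(gfp_flags) &&
                    !(kasan_hw_tags_enabled() && (gfp_flags & __GFP_SKIP_ZERO));

        if (init)
                kernel_init_free_pages(page, 1 << order);

        /* caller side: vmalloc will initialize the pages itself */
        page = alloc_pages(GFP_KERNEL | __GFP_SKIP_ZERO, order);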

    Link: https://lkml.kernel.org/r/0d53efeff345de7d708e0baa0d8829167772521e.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:00 -04:00
Chris von Recklinghausen 17be80da62 kasan, page_alloc: allow skipping unpoisoning for HW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 53ae233c30a623ff44ff2f83854e92530c5d9fc2
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:26 2022 -0700

    kasan, page_alloc: allow skipping unpoisoning for HW_TAGS

    Add a new GFP flag __GFP_SKIP_KASAN_UNPOISON that allows skipping KASAN
    poisoning for page_alloc allocations.  The flag is only effective with
    HW_TAGS KASAN.

    This flag will be used by vmalloc code for page_alloc allocations backing
    vmalloc() mappings in a following patch.  The reason to skip KASAN
    poisoning for these pages in page_alloc is because vmalloc code will be
    poisoning them instead.

    Also reword the comment for __GFP_SKIP_KASAN_POISON.

    Link: https://lkml.kernel.org/r/35c97d77a704f6ff971dd3bfe4be95855744108e.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:00 -04:00
Chris von Recklinghausen cc6b8ef6d0 kasan, page_alloc: rework kasan_unpoison_pages call site
Bugzilla: https://bugzilla.redhat.com/2120352

commit e9d0ca9228162f5442b751edf8c9721b15dcfa1e
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:43 2022 -0700

    kasan, page_alloc: rework kasan_unpoison_pages call site

    Rework the checks around kasan_unpoison_pages() call in post_alloc_hook().

    The logical condition for calling this function is:

     - If a software KASAN mode is enabled, we need to mark shadow memory.

     - Otherwise, HW_TAGS KASAN is enabled, and it only makes sense to set
       tags if they haven't already been cleared by tag_clear_highpage(),
       which is indicated by init_tags.

    This patch concludes the changes for post_alloc_hook().
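
    The resulting check, roughly (condensed; init_tags reflects whether
    tag_clear_highpage() already set the tags via __GFP_ZEROTAGS):

        /*
         * Unpoison shadow memory / set memory tags if:
         *  - a software KASAN mode is enabled (shadow must be marked), or
         *  - HW_TAGS is enabled and the tags were not already set by
         *    tag_clear_highpage().
         */
        if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
            IS_ENABLED(CONFIG_KASAN_SW_TAGS) ||
            (kasan_hw_tags_enabled() && !init_tags))
                kasan_unpoison_pages(page, order, init);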

    Link: https://lkml.kernel.org/r/0ecebd0d7ccd79150e3620ea4185a32d3dfe912f.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen e7f379f2b5 kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7e3cbba65de22f20ad18a2de09f65238bfe84c5b
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:40 2022 -0700

    kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook

    Pull the kernel_init_free_pages() call in post_alloc_hook() out of the big
    if clause for better code readability.  This also allows for more
    simplifications in the following patch.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/a7a76456501eb37ddf9fca6529cee9555e59cdb1.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen d1e56915b0 kasan, page_alloc: move SetPageSkipKASanPoison in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit 89b2711633281b3d712b1df96c5065a82ccbfb9c
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:37 2022 -0700

    kasan, page_alloc: move SetPageSkipKASanPoison in post_alloc_hook

    Pull the SetPageSkipKASanPoison() call in post_alloc_hook() out of the big
    if clause for better code readability.  This also allows for more
    simplifications in the following patches.

    Also turn the kasan_has_integrated_init() check into the proper
    kasan_hw_tags_enabled() one.  These checks evaluate to the same value, but
    logically skipping kasan poisoning has nothing to do with integrated init.

    Link: https://lkml.kernel.org/r/7214c1698b754ccfaa44a792113c95cc1f807c48.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 96a3439b72 kasan, page_alloc: combine tag_clear_highpage calls in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9294b1281d0a212ef775a175b98ce71e6ac27b90
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:34 2022 -0700

    kasan, page_alloc: combine tag_clear_highpage calls in post_alloc_hook

    Move tag_clear_highpage() loops out of the kasan_has_integrated_init()
    clause as a code simplification.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/587e3fc36358b88049320a89cc8dc6deaecb0cda.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen acc738b6b9 kasan, page_alloc: merge kasan_alloc_pages into post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit b42090ae6f3aa07b0a39403545d688489548a6a8
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:31 2022 -0700

    kasan, page_alloc: merge kasan_alloc_pages into post_alloc_hook

    Currently, the code responsible for initializing and poisoning memory in
    post_alloc_hook() is scattered across two locations: kasan_alloc_pages()
    hook for HW_TAGS KASAN and post_alloc_hook() itself.  This is confusing.

    This and a few following patches combine the code from these two
    locations.  Along the way, these patches restructure the many performed
    checks step by step to make them easier to follow.

    Replace the only caller of kasan_alloc_pages() with its implementation.

    As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
    enabled, moving the code does no functional changes.

    Also move init and init_tags variables definitions out of
    kasan_has_integrated_init() clause in post_alloc_hook(), as they have the
    same values regardless of what the if condition evaluates to.

    This patch is not useful by itself but makes the simplifications in the
    following patches easier to follow.

    Link: https://lkml.kernel.org/r/5ac7e0b30f5cbb177ec363ddd7878a3141289592.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 49736c675e kasan, page_alloc: refactor init checks in post_alloc_hook
Bugzilla: https://bugzilla.redhat.com/2120352

commit b8491b9052fef036aac0ca3afc18ef223aef6f61
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:28 2022 -0700

    kasan, page_alloc: refactor init checks in post_alloc_hook

    Separate code for zeroing memory from the code clearing tags in
    post_alloc_hook().

    This patch is not useful by itself but makes the simplifications in the
    following patches easier to follow.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/2283fde963adfd8a2b29a92066f106cc16661a3c.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen cd5acd85e1 kasan: drop skip_kasan_poison variable in free_pages_prepare
Bugzilla: https://bugzilla.redhat.com/2120352

commit 487a32ec24be819e747af8c2ab0d5c515508086a
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:19 2022 -0700

    kasan: drop skip_kasan_poison variable in free_pages_prepare

    skip_kasan_poison is only used in a single place.  Call
    should_skip_kasan_poison() directly for simplicity.

    Link: https://lkml.kernel.org/r/1d33212e79bc9ef0b4d3863f903875823e89046f.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Suggested-by: Marco Elver <elver@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 37a79115b8 kasan, page_alloc: init memory of skipped pages on free
Bugzilla: https://bugzilla.redhat.com/2120352

commit db8a04774a8195c529f1e87cd1df87f116559b52
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:16 2022 -0700

    kasan, page_alloc: init memory of skipped pages on free

    Since commit 7a3b835371 ("kasan: use separate (un)poison implementation
    for integrated init"), when all init, kasan_has_integrated_init(), and
    skip_kasan_poison are true, free_pages_prepare() doesn't initialize the
    page.  This is wrong.

    Fix it by remembering whether kasan_poison_pages() performed
    initialization, and call kernel_init_free_pages() if it didn't.

    Reordering kasan_poison_pages() and kernel_init_free_pages() is OK, since
    kernel_init_free_pages() can handle poisoned memory.
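
    A sketch of the fixed free path (condensed from free_pages_prepare() as it
    looks after this change):

        bool init = want_init_on_free();

        if (!should_skip_kasan_poison(page, fpi_flags)) {
                kasan_poison_pages(page, order, init);

                /* memory is already initialized if KASAN did it internally */
                if (kasan_has_integrated_init())
                        init = false;
        }
        if (init)
                kernel_init_free_pages(page, 1 << order);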

    Link: https://lkml.kernel.org/r/1d97df75955e52727a3dc1c4e33b3b50506fc3fd.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 1c9fec123a kasan, page_alloc: simplify kasan_poison_pages call site
Bugzilla: https://bugzilla.redhat.com/2120352

commit c3525330a04d0a47b4e11f5cf6d44e21a6520885
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:13 2022 -0700

    kasan, page_alloc: simplify kasan_poison_pages call site

    Simplify the code around calling kasan_poison_pages() in
    free_pages_prepare().

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/ae4f9bcf071577258e786bcec4798c145d718c46.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 722c33889c kasan, page_alloc: merge kasan_free_pages into free_pages_prepare
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7c13c163e036c646b77753deacfe2f5478b654bc
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:10 2022 -0700

    kasan, page_alloc: merge kasan_free_pages into free_pages_prepare

    Currently, the code responsible for initializing and poisoning memory in
    free_pages_prepare() is scattered across two locations: kasan_free_pages()
    for HW_TAGS KASAN and free_pages_prepare() itself.  This is confusing.

    This and a few following patches combine the code from these two
    locations.  Along the way, these patches also simplify the performed
    checks to make them easier to follow.

    Replace the only caller of kasan_free_pages() with its implementation.

    As kasan_has_integrated_init() is only true when CONFIG_KASAN_HW_TAGS is
    enabled, moving the code does no functional changes.

    This patch is not useful by itself but makes the simplifications in the
    following patches easier to follow.

    Link: https://lkml.kernel.org/r/303498d15840bb71905852955c6e2390ecc87139.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 3bd6fc41ae kasan, page_alloc: move tag_clear_highpage out of kernel_init_free_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5b2c07138cbd8c0c415c6d3ff5b8040532024814
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:07 2022 -0700

    kasan, page_alloc: move tag_clear_highpage out of kernel_init_free_pages

    Currently, kernel_init_free_pages() serves two purposes: it either only
    zeroes memory or zeroes both memory and memory tags via a different code
    path.  As this function has only two callers, each using only one code
    path, this behaviour is confusing.

    Pull the code that zeroes both memory and tags out of
    kernel_init_free_pages().

    As a result of this change, the code in free_pages_prepare() starts to
    look complicated, but this is improved in the few following patches.
    Those improvements are not integrated into this patch to make diffs easier
    to read.

    This patch does no functional changes.

    Link: https://lkml.kernel.org/r/7719874e68b23902629c7cf19f966c4fd5f57979.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 58c7b65b06 kasan, page_alloc: deduplicate should_skip_kasan_poison
Bugzilla: https://bugzilla.redhat.com/2120352

commit 94ae8b83fefcdaf281e0bcfb76a19f5ed5019c8d
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:04 2022 -0700

    kasan, page_alloc: deduplicate should_skip_kasan_poison

    Patch series "kasan, vmalloc, arm64: add vmalloc tagging support for SW/HW_TAGS", v6.

    This patchset adds vmalloc tagging support for SW_TAGS and HW_TAGS
    KASAN modes.

    About half of patches are cleanups I went for along the way.  None of them
    seem to be important enough to go through stable, so I decided not to
    split them out into separate patches/series.

    The patchset is partially based on an early version of the HW_TAGS
    patchset by Vincenzo that had vmalloc support.  Thus, I added a
    Co-developed-by tag into a few patches.

    SW_TAGS vmalloc tagging support is straightforward.  It reuses all of the
    generic KASAN machinery, but uses shadow memory to store tags instead of
    magic values.  Naturally, vmalloc tagging requires adding a few
    kasan_reset_tag() annotations to the vmalloc code.

    HW_TAGS vmalloc tagging support stands out.  HW_TAGS KASAN is based on Arm
    MTE, which can only assign tags to physical memory.  As a result, HW_TAGS
    KASAN only tags vmalloc() allocations, which are backed by page_alloc
    memory.  It ignores vmap() and others.

    This patch (of 39):

    Currently, should_skip_kasan_poison() has two definitions: one for when
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, one for when it's not.

    Instead of duplicating the checks, add a deferred_pages_enabled() helper
    and use it in a single should_skip_kasan_poison() definition.

    Also move should_skip_kasan_poison() closer to its caller and clarify all
    conditions in the comment.
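
    A sketch of the deduplicated helpers (the upstream patch keeps two
    #ifdef'ed definitions of deferred_pages_enabled(); they are folded
    together here for brevity):

        /* true while early struct-page init is still deferred to kthreads */
        static inline bool deferred_pages_enabled(void)
        {
        #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                return static_branch_unlikely(&deferred_pages);
        #else
                return false;
        #endif
        }

        static bool __ref should_skip_kasan_poison(struct page *page, fpi_t fpi_flags)
        {
                return deferred_pages_enabled() ||
                       (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
                        (fpi_flags & FPI_SKIP_KASAN_POISON)) ||
                       PageSkipKASanPoison(page);
        }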

    Link: https://lkml.kernel.org/r/cover.1643047180.git.andreyknvl@google.com
    Link: https://lkml.kernel.org/r/658b79f5fb305edaf7dc16bc52ea870d3220d4a8.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:58 -04:00
Chris von Recklinghausen 63534db797 NUMA balancing: optimize page placement for memory tiering system
Bugzilla: https://bugzilla.redhat.com/2120352

commit c574bbe917036c8968b984c82c7b13194fe5ce98
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Mar 22 14:46:23 2022 -0700

    NUMA balancing: optimize page placement for memory tiering system

    With the advent of various new memory types, some machines will have
    multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
    memory subsystem of these machines can be called a memory tiering system,
    because the performance of the different types of memory is usually
    different.

    In such a system, because the memory access pattern changes over time,
    some pages in the slow memory may become globally hot.  So in this
    patch, the NUMA balancing mechanism is enhanced to dynamically optimize
    the page placement among the different memory types according to
    hot/cold.

    In a typical memory tiering system, there are CPUs, fast memory and slow
    memory in each physical NUMA node.  The CPUs and the fast memory will be
    put in one logical node (called fast memory node), while the slow memory
    will be put in another (faked) logical node (called slow memory node).
    That is, the fast memory is regarded as local while the slow memory is
    regarded as remote.  So it's possible for the recently accessed pages in
    the slow memory node to be promoted to the fast memory node via the
    existing NUMA balancing mechanism.

    The original NUMA balancing mechanism will stop migrating pages if the
    free memory of the target node falls below the high watermark.  This
    is a reasonable policy if there's only one memory type.  But it makes
    the original NUMA balancing mechanism almost useless for optimizing
    page placement among different memory types.  Details are as follows.

    It is common for the working-set size of the workload to be larger
    than the size of the fast memory nodes; otherwise, it's unnecessary
    to use the slow memory at all.  So, there are almost never enough free
    pages in the fast memory nodes, and the globally hot pages in the slow
    memory node cannot be promoted to the fast memory node.  To solve the
    issue, we have 2 choices as follows,

    a. Ignore the free pages watermark checking when promoting hot pages
       from the slow memory node to the fast memory node.  This will
       create some memory pressure in the fast memory node, thus trigger
       the memory reclaiming.  So that, the cold pages in the fast memory
       node will be demoted to the slow memory node.

    b. Define a new watermark called wmark_promo which is higher than
       wmark_high, and have kswapd reclaiming pages until free pages reach
       such watermark.  The scenario is as follows: when we want to promote
       hot-pages from a slow memory to a fast memory, but fast memory's free
       pages would go lower than high watermark with such promotion, we wake
       up kswapd with wmark_promo watermark in order to demote cold pages and
       free us up some space.  So, next time we want to promote hot-pages we
       might have a chance of doing so.

    The choice "a" may create high memory pressure in the fast memory node.
    If the memory pressure of the workload is high, the memory pressure
    may become so high that the memory allocation latency of the workload
    is influenced, e.g.  the direct reclaiming may be triggered.

    The choice "b" works much better at this aspect.  If the memory
    pressure of the workload is high, the hot pages promotion will stop
    earlier because its allocation watermark is higher than that of the
    normal memory allocation.  So in this patch, choice "b" is implemented.
    A new zone watermark (WMARK_PROMO) is added.  Which is larger than the
    high watermark and can be controlled via watermark_scale_factor.
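
    A sketch of the new watermark and how the kswapd balance check is meant
    to consult it (enum values per the upstream patch; the call site is
    condensed and simplified):

        /* include/linux/mmzone.h */
        enum zone_watermarks {
                WMARK_MIN,
                WMARK_LOW,
                WMARK_HIGH,
                WMARK_PROMO,    /* new: above high, sized via watermark_scale_factor */
                NR_WMARK
        };

        /* kswapd balance check, roughly: aim for the promo watermark when
         * tiering-aware NUMA balancing is enabled, so cold pages keep being
         * demoted and room is left for promotions.
         */
        unsigned long mark = high_wmark_pages(zone);

        if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
                mark = wmark_pages(zone, WMARK_PROMO);

        if (!zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
                return false;   /* not balanced yet; keep reclaiming */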

    In addition to the original page placement optimization among sockets,
    the NUMA balancing mechanism is extended to be used to optimize page
    placement according to hot/cold among different memory types.  So the
    sysctl user space interface (numa_balancing) is extended in a backward
    compatible way as follow, so that the users can enable/disable these
    functionality individually.

    The sysctl is converted from a Boolean value to a bits field.  The
    definition of the flags is,

    - 0: NUMA_BALANCING_DISABLED
    - 1: NUMA_BALANCING_NORMAL
    - 2: NUMA_BALANCING_MEMORY_TIERING
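
    For reference, the mode bits (as defined upstream in
    include/linux/sched/sysctl.h) and an illustrative usage:

        #define NUMA_BALANCING_DISABLED         0x0
        #define NUMA_BALANCING_NORMAL           0x1
        #define NUMA_BALANCING_MEMORY_TIERING   0x2

        /* enable both regular NUMA balancing and tiering-aware placement:
         *      echo 3 > /proc/sys/kernel/numa_balancing
         * kernel code then tests the individual bits, e.g.:
         */
        if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
                /* promotion of hot slow-memory pages is allowed */
        }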

    We have tested the patch with the pmbench memory accessing benchmark
    with the 80:20 read/write ratio and the Gauss access address
    distribution on a 2 socket Intel server with Optane DC Persistent
    Memory Model.  The test results shows that the pmbench score can
    improve up to 95.9%.

    Thanks Andrew Morton to help fix the document format error.

    Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
Chris von Recklinghausen 92747f6c92 mm/hwpoison-inject: support injecting hwpoison to free page
Bugzilla: https://bugzilla.redhat.com/2120352

commit a581865ecd0a5a0b8464d6f1e668ae6681c1572f
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:35 2022 -0700

    mm/hwpoison-inject: support injecting hwpoison to free page

    memory_failure() can handle a free buddy page.  Support injecting
    hwpoison into a free page by adding an is_free_buddy_page() check for
    the case when the hwpoison filter is disabled.

    [akpm@linux-foundation.org: export is_free_buddy_page() to modules]

    Link: https://lkml.kernel.org/r/20220218092052.3853-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 768e07c448 mm/page_alloc: call check_new_pages() while zone spinlock is not held
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3313204c8ad553cf93f1ee8cc89456c73a7df938
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Mar 22 14:43:57 2022 -0700

    mm/page_alloc: call check_new_pages() while zone spinlock is not held

    For high order pages not using pcp, rmqueue() is currently calling the
    costly check_new_pages() while zone spinlock is held, and hard irqs
    masked.

    This is not needed, we can release the spinlock sooner to reduce zone
    spinlock contention.

    Note that after this patch, we call __mod_zone_freepage_state() before
    deciding to leak the page because it is in bad state.
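
    A sketch of the reworked rmqueue() loop (condensed; the highatomic
    fallback inside the locked section is omitted):

        do {
                page = NULL;
                spin_lock_irqsave(&zone->lock, flags);
                page = __rmqueue(zone, order, migratetype, alloc_flags);
                if (!page) {
                        spin_unlock_irqrestore(&zone->lock, flags);
                        return NULL;
                }
                /* account the page before knowing whether it passes the checks */
                __mod_zone_freepage_state(zone, -(1 << order),
                                          get_pcppage_migratetype(page));
                spin_unlock_irqrestore(&zone->lock, flags);
        } while (check_new_pages(page, order));  /* costly checks with the lock dropped */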

    Link: https://lkml.kernel.org/r/20220304170215.1868106-1-eric.dumazet@gmail.com
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen abe1eb721b mm: count time in drain_all_pages during direct reclaim as memory pressure
Bugzilla: https://bugzilla.redhat.com/2120352

commit fa7fc75f6319dcd044e332ad309a86126a610bdf
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Tue Mar 22 14:43:54 2022 -0700

    mm: count time in drain_all_pages during direct reclaim as memory pressure

    When page allocation in direct reclaim path fails, the system will make
    one attempt to shrink per-cpu page lists and free pages from high alloc
    reserves.  Draining per-cpu pages into buddy allocator can be a very
    slow operation because it's done using workqueues and the task in direct
    reclaim waits for all of them to finish before proceeding.  Currently
    this time is not accounted as psi memory stall.

    While testing mobile devices under extreme memory pressure, when
    allocations were failing during direct reclaim, we noticed that psi
    events which would be expected in such conditions were not triggered.
    After profiling these cases it was determined that the reason for
    missing psi events was that a big chunk of time spent in direct reclaim
    is not accounted as memory stall, therefore psi would not reach the
    levels at which an event is generated.  Further investigation revealed
    that the bulk of that unaccounted time was spent inside drain_all_pages
    call.

    A typical captured case when drain_all_pages path gets activated:

    __alloc_pages_slowpath  took 44.644.613ns
        __perform_reclaim   took    751.668ns (1.7%)
        drain_all_pages     took 43.887.167ns (98.3%)

    PSI in this case records the time spent in __perform_reclaim but ignores
    drain_all_pages, IOW it misses 98.3% of the time spent in
    __alloc_pages_slowpath.

    Annotate __alloc_pages_direct_reclaim in its entirety so that delays
    from handling page allocation failure in the direct reclaim path are
    accounted as memory stall.
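
    A sketch of the widened annotation (condensed from
    __alloc_pages_direct_reclaim()):

        static inline struct page *
        __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
                                     unsigned int alloc_flags,
                                     const struct alloc_context *ac,
                                     unsigned long *did_some_progress)
        {
                struct page *page = NULL;
                unsigned long pflags;
                bool drained = false;

                psi_memstall_enter(&pflags);    /* was around __perform_reclaim() only */
                *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
                if (unlikely(!(*did_some_progress)))
                        goto out;
        retry:
                page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
                if (!page && !drained) {
                        unreserve_highatomic_pageblock(ac, false);
                        drain_all_pages(NULL);  /* now counted as memory stall too */
                        drained = true;
                        goto retry;
                }
        out:
                psi_memstall_leave(&pflags);
                return page;
        }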

    Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: Tim Murray <timmurray@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 011c300849 mm: enforce pageblock_order < MAX_ORDER
Bugzilla: https://bugzilla.redhat.com/2120352

commit b3d40a2b6d10c9d0424d2b398bf962fb6adad87e
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Mar 22 14:43:20 2022 -0700

    mm: enforce pageblock_order < MAX_ORDER

    Some places in the kernel don't really expect pageblock_order >=
    MAX_ORDER, and it looks like this is only possible in corner cases:

    1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
       pages via __free_pages_core(), which cannot possibly work.

    2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE
       start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
       pageblock_order, we could have a single pageblock partially managed by
       two zones.

    3) compaction code runs into __fragmentation_index() with order
       >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]

    4) mm/page_reporting.c won't be reporting any pages with default
       page_reporting_order == pageblock_order, as we'll be skipping the
       reporting loop inside page_reporting_process_zone().

    5) __rmqueue_fallback() will never be able to steal with
       ALLOC_NOFRAGMENT.

    pageblock_order >= MAX_ORDER is weird either way: it's a pure
    optimization for making alloc_contig_range(), as used for allocation of
    gigantic pages, a little more reliable to succeed.  However, if there is
    demand for somewhat reliable allocation of gigantic pages, affected
    setups should be using CMA or boottime allocations instead.

    So let's make sure that pageblock_order < MAX_ORDER and simplify.
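
    A sketch of the boot-time clamping this guarantee allows in the
    HUGETLB_PAGE_SIZE_VARIABLE case (condensed from set_pageblock_order()):

        void __init set_pageblock_order(void)
        {
                unsigned int order = MAX_ORDER - 1;

                /* already set up by an earlier caller */
                if (pageblock_order)
                        return;

                /* never exceed the maximum allocation granularity of the buddy */
                if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < order)
                        order = HUGETLB_PAGE_ORDER;

                pageblock_order = order;
        }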

    [1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com

    Link: https://lkml.kernel.org/r/20220214174132.219303-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Frank Rowand <frowand.list@gmail.com>
    Cc: John Garry via iommu <iommu@lists.linux-foundation.org>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Rob Herring <robh+dt@kernel.org>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen ab6763a437 mm/page_alloc: don't pass pfn to free_unref_page_commit()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 566513775dca7f0d4ba15da4bc8394cdb2c98829
Author: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Date:   Tue Mar 22 14:43:14 2022 -0700

    mm/page_alloc: don't pass pfn to free_unref_page_commit()

    free_unref_page_commit() doesn't make use of its pfn argument, so get
    rid of it.

    Link: https://lkml.kernel.org/r/20220202140451.415928-1-nsaenzju@redhat.com
    Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 002a87236c mm: page_alloc: avoid merging non-fallbackable pageblocks with others
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1dd214b8f21ca46d5431be9b2db8513c59e07a26
Author: Zi Yan <ziy@nvidia.com>
Date:   Tue Mar 22 14:43:05 2022 -0700

    mm: page_alloc: avoid merging non-fallbackable pageblocks with others

    This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
    It prepares for the upcoming removal of the MAX_ORDER-1 alignment
    requirement for CMA and alloc_contig_range().

    MIGRATE_HIGHATOMIC should not merge with other migratetypes like
    MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.

    Remove MIGRATE_CMA and MIGRATE_ISOLATE from fallbacks list, since they
    are never used.
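
    The mergeability helper and its use in the buddy merge path, roughly as
    introduced by this commit (condensed):

        /* only the regular pcp migratetypes may merge with / steal from others */
        static inline bool migratetype_is_mergeable(int mt)
        {
                return mt < MIGRATE_PCPTYPES;
        }

        /* in __free_one_page(), before merging across a pageblock boundary: */
        if (migratetype != buddy_mt &&
            (!migratetype_is_mergeable(migratetype) ||
             !migratetype_is_mergeable(buddy_mt)))
                goto done_merging;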

    [1] https://lore.kernel.org/linux-mm/20211130100853.GP3366@techsingularity.net/

    Link: https://lkml.kernel.org/r/20220124175957.1261961-1-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Mike Rapoport <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 9c8f59f0d2 delayacct: track delays from memory compact
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5bf18281534451bf1ad56a45a3085cd7ad46860d
Author: wangyong <wang.yong12@zte.com.cn>
Date:   Wed Jan 19 18:10:15 2022 -0800

    delayacct: track delays from memory compact

    Delay accounting does not track the delay of memory compaction.  When
    there is not enough free memory, tasks can spend some of their time
    waiting for compaction.

    To get the impact of tasks in direct memory compact, measure the delay
    when allocating memory through memory compact.

    Also update tools/accounting/getdelays.c:

        / # ./getdelays_next  -di -p 304
        print delayacct stats ON
        printing IO accounting
        PID     304

        CPU             count     real total  virtual total    delay total  delay average
                          277      780000000      849039485       18877296          0.068ms
        IO              count    delay total  delay average
                            0              0              0ms
        SWAP            count    delay total  delay average
                            0              0              0ms
        RECLAIM         count    delay total  delay average
                            5    11088812685           2217ms
        THRASHING       count    delay total  delay average
                            0              0              0ms
        COMPACT         count    delay total  delay average
                            3          72758              0ms
        watch: read=0, write=0, cancelled_write=0
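
    The accounting pattern itself is simple: take a timestamp around the
    compaction wait and accumulate a count and a total delay.  A hedged
    userspace sketch of that pattern (the struct and function names here are
    illustrative, not the delayacct API):

        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        /* Per-task delay stats, loosely modelled on task delay accounting. */
        struct compact_delay {
            uint64_t delay_total_ns;    /* total time spent waiting */
            uint32_t count;             /* number of delays recorded */
        };

        static uint64_t now_ns(void)
        {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
        }

        /* Stand-in for the wait while the allocator performs compaction. */
        static void do_compaction(void)
        {
            usleep(2000);
        }

        int main(void)
        {
            struct compact_delay d = { 0 };
            uint64_t start;

            for (int i = 0; i < 3; i++) {
                start = now_ns();                       /* "compact start" */
                do_compaction();
                d.delay_total_ns += now_ns() - start;   /* "compact end" */
                d.count++;
            }

            printf("COMPACT count=%u delay total=%lluns delay average=%.3fms\n",
                   d.count, (unsigned long long)d.delay_total_ns,
                   d.delay_total_ns / (d.count * 1e6));
            return 0;
        }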

    Link: https://lkml.kernel.org/r/1638619795-71451-1-git-send-email-wang.yong12@zte.com.cn
    Signed-off-by: wangyong <wang.yong12@zte.com.cn>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Zhang Wenya <zhang.wenya1@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: Balbir Singh <bsingharora@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:42 -04:00
Chris von Recklinghausen 508db56386 mm/page_alloc.c: modify the comment section for alloc_contig_pages()
Bugzilla: https://bugzilla.redhat.com/2120352

commit eaab8e753632b8e961701d02a5bb398c820f309c
Author: Anshuman Khandual <anshuman.khandual@arm.com>
Date:   Fri Jan 14 14:07:33 2022 -0800

    mm/page_alloc.c: modify the comment section for alloc_contig_pages()

    Clarify that the alloc_contig_pages() allocated range will always be
    aligned to the requested nr_pages.

    Link: https://lkml.kernel.org/r/1639545478-12160-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen ee38b98d63 mm: page_alloc: fix building error on -Werror=array-compare
Bugzilla: https://bugzilla.redhat.com/2120352

commit ca831f29f8f25c97182e726429b38c0802200c8f
Author: Xiongwei Song <sxwjean@gmail.com>
Date:   Fri Jan 14 14:07:24 2022 -0800

    mm: page_alloc: fix building error on -Werror=array-compare

    Arthur Marsh reported we would hit the error below when building kernel
    with gcc-12:

      CC      mm/page_alloc.o
      mm/page_alloc.c: In function `mem_init_print_info':
      mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
       8173 |                 if (start <= pos && pos < end && size > adj) \
            |

    In C++20, comparison between two arrays is deprecated, so gcc-12 now
    warns about it.
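
    A standalone example of the warning and one way to express the same
    check without it (the array names below are made up; the kernel code
    compares linker-provided section symbols, which bound one contiguous
    image, so the relational comparison is meaningful there):

        #include <stdio.h>

        char region_start[16];
        char region_mid[16];
        char region_end[16];

        int main(void)
        {
            /*
             * With gcc-12, comparing the array names directly, e.g.
             *     if (region_start <= region_mid && region_mid < region_end)
             * triggers -Warray-compare (an error under -Werror).
             * Comparing the decayed element addresses keeps the intent
             * while avoiding the deprecated form:
             */
            if (&region_start[0] <= &region_mid[0] &&
                &region_mid[0] < &region_end[0])
                printf("region_mid sits between start and end\n");
            else
                printf("region_mid is outside the range\n");
            return 0;
        }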

    Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
    Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
    Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen 73067bf28e mm/memremap: add ZONE_DEVICE support for compound pages
Conflicts:
	include/linux/memremap.h - The presence of
		536939ff5163 ("mm: Add three folio wrappers")
		and
		dc90f0846df4 ("mm: don't include <linux/memremap.h> in <linux/mm.h>")
		causes a merge conflict. make sure all 4 functions are defined.
	mm/memremap.c - The backport of
		b80892ca022e ("memremap: remove support for external pgmap refcounts")
		changed percpu_ref_get_many to take the address of the ref.
		This patch wants to pass the ref to percpu_ref_get_many by
		value but later merge commit
		f56caedaf94f ("Merge branch 'akpm' (patches from Andrew)")
		changed it back to passing the ref by address. squash that
		change in with this one.

Bugzilla: https://bugzilla.redhat.com/2120352

commit c4386bd8ee3a921c3c799b7197dc898ade76a453
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Fri Jan 14 14:04:22 2022 -0800

    mm/memremap: add ZONE_DEVICE support for compound pages

    Add a new @vmemmap_shift property for struct dev_pagemap which specifies
    that a devmap is composed of a set of compound pages of order
    @vmemmap_shift, instead of base pages.  When a compound page devmap is
    requested, all but the first page are initialised as tail pages instead
    of order-0 pages.

    For certain ZONE_DEVICE users like device-dax which have a fixed page
    size, this creates an opportunity to optimize GUP and GUP-fast walkers,
    treating it the same way as THP or hugetlb pages.

    Additionally, commit 7118fc2906 ("hugetlb: address ref count racing in
    prep_compound_gigantic_page") removed set_page_count() because the
    setting of page ref count to zero was redundant.  devmap pages don't
    come from page allocator though and only head page refcount is used for
    compound pages, hence initialize tail page count to zero.
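
    A hedged userspace model of what "initialised as tail pages with a zero
    refcount" means for the devmap metadata (the struct fields and names are
    simplified stand-ins for struct page internals):

        #include <stdio.h>

        /* Greatly simplified stand-in for struct page metadata. */
        struct page {
            unsigned long compound_head;   /* tail pages point at the head (| 1) */
            unsigned int compound_order;   /* valid on the head page only */
            int refcount;
        };

        #define VMEMMAP_SHIFT 4            /* devmap built from order-4 pages */
        #define NR_PAGES      (1u << VMEMMAP_SHIFT)

        static struct page pages[NR_PAGES * 2];   /* two compound pages of metadata */

        /* Initialise one compound page: a head plus (2^order - 1) tails. */
        static void init_compound_devmap_page(struct page *head, unsigned int order)
        {
            unsigned int nr = 1u << order;

            head->compound_order = order;
            head->refcount = 1;                    /* only the head refcount is used */

            for (unsigned int i = 1; i < nr; i++) {
                struct page *tail = head + i;

                tail->compound_head = (unsigned long)head | 1;
                tail->refcount = 0;                /* tails never carry a refcount */
            }
        }

        int main(void)
        {
            /* memmap_init_zone_device()-style walk in compound-sized steps. */
            for (unsigned int i = 0; i < NR_PAGES * 2; i += NR_PAGES)
                init_compound_devmap_page(&pages[i], VMEMMAP_SHIFT);

            printf("head: order=%u ref=%d; first tail: ref=%d tail-bit=%d\n",
                   pages[0].compound_order, pages[0].refcount,
                   pages[1].refcount, (int)(pages[1].compound_head & 1));
            return 0;
        }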

    Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
Chris von Recklinghausen f3d31b7be9 mm/page_alloc: refactor memmap_init_zone_device() page init
Bugzilla: https://bugzilla.redhat.com/2120352

commit 46487e0095f895c25da9feae27dc06d2aa76793d
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Fri Jan 14 14:04:18 2022 -0800

    mm/page_alloc: refactor memmap_init_zone_device() page init

    Move struct page init to a helper function __init_zone_device_page().

    This is in preparation for sharing the storage for compound page
    metadata.

    Link: https://lkml.kernel.org/r/20211202204422.26777-4-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
Chris von Recklinghausen d6a800636f mm/page_alloc: split prep_compound_page into head and tail subparts
Conflicts: mm/page_alloc.c - We already have
	5232c63f46fd ("mm: Make compound_pincount always available")
	which removed the hpage_pincount_available check before calling
	atomic_set(compound_pincount_ptr(page), 0), leading to a difference
	in deleted code. The upstream version of this patch adds a
	call to hpage_pincount_available in prep_compound_head. Remove that
	too.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 5b24eeef06701cca6852f1bf768248ccc912819b
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Fri Jan 14 14:04:15 2022 -0800

    mm/page_alloc: split prep_compound_page into head and tail subparts

    Patch series "mm, device-dax: Introduce compound pages in devmap", v7.

    This series converts device-dax to use compound pages, and moves away
    from the 'struct page per basepage on PMD/PUD' that is done today.

    Doing so
     1) unlocks a few noticeable improvements on unpin_user_pages() and
        makes device-dax+altmap case 4x times faster in pinning (numbers
        below and in last patch)
     2) as mentioned in various other threads it's one important step
        towards cleaning up ZONE_DEVICE refcounting.

    I've split the compound-pages-on-devmap part from the rest based on
    recent discussions on devmap pending and future work planned[5][6].
    There is consensus that device-dax should be using compound pages to
    represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
    specialization of the dax parts.  I will pursue the rest of the work in
    parallel once this part is merged, particular the GUP-{slow,fast}
    improvements [7] and the tail struct page deduplication memory savings
    part[8].

    To summarize what the series does:

    Patch 1: Prepare hwpoisoning to work with dax compound pages.

    Patches 2-3: Split the current utility function of prep_compound_page()
    into head and tail and use those two helpers where appropriate to take
    advantage of caches being warm after __init_single_page().  This is used
    when initializing zone device when we bring up device-dax namespaces.

    Patches 4-10: Add devmap support for compound pages in device-dax.
    memmap_init_zone_device() initialize its metadata as compound pages, and
    it introduces a new devmap property known as vmemmap_shift which
    outlines how the vmemmap is structured (defaults to base pages as done
    today).  The property essentially describes the page order of the
    metadata.  While at it, do a few cleanups in device-dax in patches
    5-9.  Finally enable device-dax usage of devmap @vmemmap_shift to a
    value based on its own @align property.  @vmemmap_shift returns 0 by
    default (which is today's case of base pages in devmap, like fsdax or
    the others) and the usage of compound devmap is optional.  Starting with
    device-dax (*not* fsdax) we enable it by default.  There are a few
    pinning improvements particular on the unpinning case and altmap, as
    well as unpin_user_page_range_dirty_lock() being just as effective as
    THP/hugetlb[0] pages.

        $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
        (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
        [altmap]
        (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms

         $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
        (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
        [altmap with -m 127004]
        (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms

    Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with
    and without altmap), alongside gup_test selftests with dynamic dax
    regions and static dax regions.  Coupled with ndctl unit tests for
    dynamic dax devices that exercise all of this.  Note, for dynamic dax
    regions I had to revert commit 8aa83e6395 ("x86/setup: Call
    early_reserve_memory() earlier"); it is a known issue that this commit
    broke efi_fake_mem=.

    This patch (of 11):

    Split the utility function prep_compound_page() into head and tail
    counterparts, and use them accordingly.

    This is in preparation for sharing the storage for compound page
    metadata.
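
    A minimal sketch of the head/tail split on a toy struct page (field names
    and the TAIL_MAPPING marker below are stand-ins, not the kernel
    definitions); the point is that the per-tail work becomes a helper that
    other initialisation paths can call on their own:

        #include <stdio.h>

        /* Minimal stand-in for the fields the helpers touch. */
        struct page {
            void *mapping;
            struct page *compound_head;
            unsigned int compound_order;
            int compound_mapcount;
        };

        #define TAIL_MAPPING ((void *)0x400)    /* poison-style marker */

        /* Head-only initialisation, done once per compound page. */
        static void prep_compound_head(struct page *head, unsigned int order)
        {
            head->compound_order = order;
            head->compound_mapcount = -1;
        }

        /* Tail initialisation, done for every page after the head. */
        static void prep_compound_tail(struct page *head, int tail_idx)
        {
            struct page *p = head + tail_idx;

            p->mapping = TAIL_MAPPING;
            p->compound_head = head;
        }

        /* The original single helper becomes a thin loop over the two parts. */
        static void prep_compound_page(struct page *page, unsigned int order)
        {
            int nr_pages = 1 << order;

            for (int i = 1; i < nr_pages; i++)
                prep_compound_tail(page, i);
            prep_compound_head(page, order);
        }

        int main(void)
        {
            struct page pages[8] = { 0 };

            prep_compound_page(pages, 3);
            printf("head order=%u, tail[1] -> head at %p\n",
                   pages[0].compound_order, (void *)pages[1].compound_head);
            return 0;
        }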

    Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
Chris von Recklinghausen 16b2c55cfa mm/page_alloc: remove the throttling logic from the page allocator
Bugzilla: https://bugzilla.redhat.com/2120352

commit 132b0d21d21f14f74fbe44dd5b8b1848215fff09
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:38 2021 -0700

    mm/page_alloc: remove the throttling logic from the page allocator

    The page allocator stalls based on the number of pages that are waiting
    for writeback to start but this should now be redundant.
    shrink_inactive_list() will wake flusher threads if the pages at the
    tail of the LRU are unqueued dirty pages, so the flusher should be
    active.  If it fails to
    make progress due to pages under writeback not being completed quickly
    then it should stall on VMSCAN_THROTTLE_WRITEBACK.

    Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen 809d37d23f mm/vmscan: throttle reclaim until some writeback completes if congested
Conflicts:
	mm/filemap.c - We already have
		4268b48077e5 ("mm/filemap: Add folio_end_writeback()")
		so put the acct_reclaim_writeback call between the
		folio_wake call and the folio_put call and pass it a
		folio
	mm/internal.h - We already have
		646010009d35 ("mm: Add folio_raw_mapping()")
		so keep definition of folio_raw_mapping.
		Squash in changes from merge commit
		512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
		to be compatible with existing folio changes.
	mm/vmscan.c - Squash in changes from merge commit
                512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
                to be compatible with existing folio changes.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 8cd7c588decf470bf7e14f2be93b709f839a965e
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:25 2021 -0700

    mm/vmscan: throttle reclaim until some writeback completes if congested

    Patch series "Remove dependency on congestion_wait in mm/", v5.

    This series removes all calls to congestion_wait in mm/ and deletes
    wait_iff_congested.  It's not a clever implementation but
    congestion_wait has been broken for a long time [1].

    Even if congestion throttling worked, it was never a great idea.  While
    excessive dirty/writeback pages at the tail of the LRU are one reason
    that reclaim may be slow, there is also the problem of too
    many pages being isolated and reclaim failing for other reasons
    (elevated references, too many pages isolated, excessive LRU contention
    etc).

    This series replaces the "congestion" throttling with 3 different types.

     - If there are too many dirty/writeback pages, sleep until a timeout or
       enough pages get cleaned

     - If too many pages are isolated, sleep until enough isolated pages are
       either reclaimed or put back on the LRU

     - If no progress is being made, direct reclaim tasks sleep until
       another task makes progress with acceptable efficiency.

    This was initially tested with a mix of workloads that used to trigger
    corner cases that no longer work.  A new test case was created called
    "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
    created XFS filesystem.  Note that it may be necessary to increase the
    timeout of ssh if executing remotely as ssh itself can get throttled and
    the connection may time out.

    stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
    to check the impact as the number of direct reclaimers increase.  It has
    four types of worker.

     - One "anon latency" worker creates small mappings with mmap() and
       times how long it takes to fault the mapping reading it 4K at a time

     - X file writers which is fio randomly writing X files where the total
       size of the files add up to the allowed dirty_ratio. fio is allowed
       to run for a warmup period to allow some file-backed pages to
       accumulate. The duration of the warmup is based on the best-case
       linear write speed of the storage.

     - Y file readers which is fio randomly reading small files

     - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

     - Total estimated WSS = (100+dirty_ratio) percentage of memory

    X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

    The intent is to maximise the total WSS with a mix of file and anon
    memory where some anonymous memory must be swapped and there is a high
    likelihood of dirty/writeback pages reaching the end of the LRU.

    The test can be configured to have no background readers to stress
    dirty/writeback pages.  The results below are based on having zero
    readers.

    The short summary of the results is that the series works and stalls
    until some event occurs but the timeouts may need adjustment.

    The test results are not broken down by patch as the series should be
    treated as one block that replaces a broken throttling mechanism with a
    working one.

    Finally, three machines were tested but I'm reporting the worst set of
    results.  The other two machines had much better latencies for example.

    First the results of the "anon latency" latency

      stutterp
                                    5.15.0-rc1             5.15.0-rc1
                                       vanilla mm-reclaimcongest-v5r4
      Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
      Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
      Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
      Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
      Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
      Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
      Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
      Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
      Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
      Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
      Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
      Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
      Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
      Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
      Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
      Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
      Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
      Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
      Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
      Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
      Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
      Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
      Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
      Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

    For most thread counts, the time to mmap() is unfortunately increased.
    In earlier versions of the series, this was lower but a large number of
    throttling events were reaching their timeout increasing the amount of
    inefficient scanning of the LRU.  There is no prioritisation of reclaim
    tasks making progress based on each tasks rate of page allocation versus
    progress of reclaim.  The variance is also impacted for high worker
    counts but in all cases, the differences in latency are not
    statistically significant due to very large maximum outliers.  Max-90
    shows that 90% of the stalls are comparable but the Max results show the
    massive outliers which are increased due to stalling.

    It is expected that this will be very machine dependent.  Due to the
    test design, reclaim is difficult so allocations stall and there are
    variances depending on whether THPs can be allocated or not.  The amount
    of memory will affect exactly how bad the corner cases are and how often
    they trigger.  The warmup period calculation is not ideal as it's based
    on linear writes whereas fio is randomly writing multiple files from
    multiple tasks so the start state of the test is variable.  For example,
    these are the latencies on a single-socket machine that had more memory

      Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
      Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
      Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
      Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
      Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

    The overall system CPU usage and elapsed time is as follows

                        5.15.0-rc3  5.15.0-rc3
                           vanilla mm-reclaimcongest-v5r4
      Duration User        6989.03      983.42
      Duration System      7308.12      799.68
      Duration Elapsed     2277.67     2092.98

    The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
    stalling.

    The high-level /proc/vmstats show

                                           5.15.0-rc1     5.15.0-rc1
                                              vanilla mm-reclaimcongest-v5r2
      Ops Direct pages scanned          1056608451.00   503594991.00
      Ops Kswapd pages scanned           109795048.00   147289810.00
      Ops Kswapd pages reclaimed          63269243.00    31036005.00
      Ops Direct pages reclaimed          10803973.00     6328887.00
      Ops Kswapd efficiency %                   57.62          21.07
      Ops Kswapd velocity                    48204.98       57572.86
      Ops Direct efficiency %                    1.02           1.26
      Ops Direct velocity                   463898.83      196845.97

    Kswapd scanned less pages but the detailed pattern is different.  The
    vanilla kernel scans slowly over time whereas the patches exhibit
    burst patterns of scan activity.  Direct reclaim scanning is reduced by
    52% due to stalling.

    The pattern for stealing pages is also slightly different.  Both kernels
    exhibit spikes but the vanilla kernel when reclaiming shows pages being
    reclaimed over a period of time whereas the patches tend to reclaim in
    spikes.  The difference is that vanilla is not throttling and instead
    scanning constantly, finding some pages over time, whereas the patched
    kernel throttles and reclaims in spikes.

      Ops Percentage direct scans               90.59          77.37

    For direct reclaim, 90.59% of scanned pages came from direct reclaim in
    the vanilla kernel, whereas with the patches only 77.37% did, due to
    throttling.

      Ops Page writes by reclaim           2613590.00     1687131.00

    Page writes from reclaim context are reduced.

      Ops Page writes anon                 2932752.00     1917048.00

    And there is less swapping.

      Ops Page reclaim immediate         996248528.00   107664764.00

    The number of pages encountered at the tail of the LRU tagged for
    immediate reclaim but still dirty/writeback is reduced by 89%.

      Ops Slabs scanned                     164284.00      153608.00

    Slab scan activity is similar.

    ftrace was used to gather stall activity

      Vanilla
      -------
          1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
          2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
          8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
         29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
      82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

    The vast majority of wait_iff_congested calls do not stall at all.  What
    is likely happening is that cond_resched() reschedules the task for a
    short period when the BDI is not registering congestion (which it never
    will in this test setup).

          1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
          2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
          4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
        380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
        778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

    congestion_wait, if called, always exceeds the timeout as there is no
    trigger to wake it up.

    Bottom line: Vanilla will throttle but it's not effective.

    Patch series
    ------------

    Kswapd throttle activity was always due to scanning pages tagged for
    immediate reclaim at the tail of the LRU

          1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority of events did not stall or stalled for a short period.
    Roughly 16% of stalls reached the timeout before expiry.  For direct
    reclaim, the number of times stalled for each reason were

       6624 reason=VMSCAN_THROTTLE_ISOLATED
      93246 reason=VMSCAN_THROTTLE_NOPROGRESS
      96934 reason=VMSCAN_THROTTLE_WRITEBACK

    The most common reason to stall was due to excessive pages tagged for
    immediate reclaim at the tail of the LRU followed by a failure to make
    forward progress.  A relatively small number were due to too many pages
    isolated from the LRU by parallel threads.

    For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

          9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
         12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
         83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
       6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

    Most did not stall at all.  A small number reached the timeout.

    For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
    the map

          1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
          6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
         11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
         16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
         18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
         21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
         26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
         27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
         28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
         29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
         31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
         32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
         33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
         35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
         35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
         36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
         36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
         37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
         38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
         40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
         43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
         55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
         56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
         58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
         59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
         61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
         71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
         71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
         79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
         82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
         82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
         85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
         85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
         88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
         90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
         90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
         94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
        118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
        119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
        126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
        146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
        148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
        148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
        159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
        178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
        183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
        237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
        266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
        313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
        347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
        470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
        559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
        964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
       2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
       2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
       7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
      22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
      51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

    The full timeout is often hit but a large number also do not stall at
    all.  The remainder slept a little allowing other reclaim tasks to make
    progress.

    While this timeout could be further increased, it could also negatively
    impact worst-case behaviour when there is no prioritisation of what task
    should make progress.

    For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

          1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
          2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
          3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
         12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
         16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
         24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
         28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
         32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
         42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
         77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
         99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
        137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
        190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
        339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
        518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
        852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
       3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
       7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
      83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority hit the timeout in direct reclaim context although a
    sizable number did not stall at all.  This is very different to kswapd
    where only a tiny percentage of stalls due to writeback reached the
    timeout.

    Bottom line, the throttling appears to work and the wakeup events may
    limit worst case stalls.  There might be some grounds for adjusting
    timeouts but it's likely futile as the worst-case scenarios depend on
    the workload, memory size and the speed of the storage.  A better
    approach to improve the series further would be to prioritise tasks
    based on their rate of allocation with the caveat that it may be very
    expensive to track.

    This patch (of 5):

    Page reclaim throttles on wait_iff_congested under the following
    conditions:

     - kswapd is encountering pages under writeback and marked for immediate
       reclaim implying that pages are cycling through the LRU faster than
       pages can be cleaned.

     - Direct reclaim will stall if all dirty pages are backed by congested
       inodes.

    wait_iff_congested is almost completely broken with few exceptions.
    This patch adds a new node-based workqueue and tracks the number of
    throttled tasks and pages written back since throttling started.  If
    enough pages belonging to the node are written back then the throttled
    tasks will wake early.  If not, the throttled tasks sleep until the
    timeout expires.
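
    A hedged userspace model of the throttle/wake mechanism described above
    (a mutex plus condition variable stands in for the per-node waitqueue;
    the names and the wake threshold are illustrative):

        #include <pthread.h>
        #include <stdio.h>
        #include <time.h>

        /*
         * Per-node throttling state, loosely modelled on what the patch adds:
         * a waitqueue plus counters for throttled tasks and for pages written
         * back since throttling started.
         */
        struct node_throttle {
            pthread_mutex_t lock;
            pthread_cond_t wq;
            unsigned long nr_written;       /* pages cleaned so far */
            unsigned long wake_threshold;   /* wake throttled tasks early here */
            int nr_throttled;
        };

        static struct node_throttle nt = {
            .lock = PTHREAD_MUTEX_INITIALIZER,
            .wq = PTHREAD_COND_INITIALIZER,
            .wake_threshold = 32,
        };

        /* Reclaimer side: sleep until enough writeback completes or a timeout. */
        static void reclaim_throttle(long timeout_ms)
        {
            struct timespec deadline;

            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += timeout_ms / 1000;
            deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
            if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec++;
                deadline.tv_nsec -= 1000000000L;
            }

            pthread_mutex_lock(&nt.lock);
            nt.nr_throttled++;
            while (nt.nr_written < nt.wake_threshold) {
                if (pthread_cond_timedwait(&nt.wq, &nt.lock, &deadline))
                    break;                  /* timeout expired, stop waiting */
            }
            nt.nr_throttled--;
            pthread_mutex_unlock(&nt.lock);
        }

        /* Writeback side: account cleaned pages and wake throttled tasks early. */
        static void *writeback_done(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 64; i++) {
                pthread_mutex_lock(&nt.lock);
                nt.nr_written++;
                if (nt.nr_throttled && nt.nr_written >= nt.wake_threshold)
                    pthread_cond_broadcast(&nt.wq);
                pthread_mutex_unlock(&nt.lock);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t;

            pthread_create(&t, NULL, writeback_done, NULL);
            reclaim_throttle(100);          /* a WRITEBACK-style stall */
            pthread_join(t, NULL);
            printf("woke with %lu pages written back\n", nt.nr_written);
            return 0;
        }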

    [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
    [hdanton@sina.com: Avoid race when reclaim starts]
    [vbabka@suse.cz: vmstat irq-safe api, clarifications]

    Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
    Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: NeilBrown <neilb@suse.de>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen 4e48ab5bc4 mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged
Bugzilla: https://bugzilla.redhat.com/2120352

commit bd3400ea173fb611cdf2030d03620185ff6c0b0e
Author: Liangcai Fan <liangcaifan19@gmail.com>
Date:   Fri Nov 5 13:41:36 2021 -0700

    mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged

    When initializing transparent huge pages, min_free_kbytes would be
    calculated according to what khugepaged expected.

    So when transparent huge pages get disabled, min_free_kbytes should be
    recalculated instead of keeping the higher value set for khugepaged.

    Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
    Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
    Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 87f4fface3 mm/page_alloc: use clamp() to simplify code
Bugzilla: https://bugzilla.redhat.com/2120352

commit 59d336bdf6931a6a8c2ba41e533267d1cc799fc9
Author: Wang ShaoBo <bobo.shaobowang@huawei.com>
Date:   Fri Nov 5 13:40:55 2021 -0700

    mm/page_alloc: use clamp() to simplify code

    This patch uses clamp() to simplify code in init_per_zone_wmark_min().
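
    For illustration, the before/after shape in plain C (the clamp() macro
    here is a simplified stand-in for the kernel's type-checked version):

        #include <stdio.h>

        /* Minimal stand-in for the kernel's clamp() macro. */
        #define clamp(val, lo, hi) \
            ((val) < (lo) ? (lo) : ((val) > (hi) ? (hi) : (val)))

        int main(void)
        {
            /*
             * Before: two separate "raise to 128" / "cap at 262144" checks;
             * after: a single clamp() keeps the value in [128, 262144].
             */
            int new_min_free_kbytes[] = { 50, 4096, 500000 };

            for (int i = 0; i < 3; i++)
                printf("raw=%d clamped=%d\n", new_min_free_kbytes[i],
                       clamp(new_min_free_kbytes[i], 128, 262144));
            return 0;
        }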

    Link: https://lkml.kernel.org/r/20211021034830.1049150-1-bobo.shaobowang@huawei.com
    Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Wei Yongjun <weiyongjun1@huawei.com>
    Cc: Li Bin <huawei.libin@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 18db1af2f9 mm: page_alloc: use migrate_disable() in drain_local_pages_wq()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9c25cbfcb38462803a3d68f5d88e66a587f5f045
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri Nov 5 13:40:52 2021 -0700

    mm: page_alloc: use migrate_disable() in drain_local_pages_wq()

    drain_local_pages_wq() disables preemption to avoid CPU migration during
    CPU hotplug and can't use cpus_read_lock().

    Using migrate_disable() works here, too.  The scheduler won't take the
    CPU offline until the task left the migrate-disable section.  The
    problem with disabled preemption here is that drain_local_pages()
    acquires locks which are turned into sleeping locks on PREEMPT_RT and
    can't be acquired with disabled preemption.

    Use migrate_disable() in drain_local_pages_wq().

    Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 1419281dae mm/page_alloc.c: show watermark_boost of zone in zoneinfo
Bugzilla: https://bugzilla.redhat.com/2120352

commit a6ea8b5b9f1ce3403a1c8516035d653006741e80
Author: Liangcai Fan <liangcaifan19@gmail.com>
Date:   Fri Nov 5 13:40:37 2021 -0700

    mm/page_alloc.c: show watermark_boost of zone in zoneinfo

    min/low/high_wmark_pages(z) is defined as

      (z->_watermark[WMARK_MIN/LOW/HIGH] + z->watermark_boost)

    If kswapd is frequently woken up due to the increase of
    min/low/high_wmark_pages, printing watermark_boost makes it quick to tell
    whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
    min/low/high_wmark_pages to increase.
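
    A toy model of the relationship being exposed (the field and macro names
    are simplified stand-ins for the mmzone.h definitions):

        #include <stdio.h>

        enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

        /* Pared-down model of the zone fields involved. */
        struct zone {
            unsigned long _watermark[NR_WMARK];
            unsigned long watermark_boost;
        };

        /* Effective watermark = static watermark + temporary boost. */
        #define wmark_pages(z, i) ((z)->_watermark[i] + (z)->watermark_boost)

        int main(void)
        {
            struct zone z = {
                ._watermark = { 1024, 1280, 1536 },
                .watermark_boost = 3072,    /* e.g. boosted after fragmentation */
            };

            /*
             * With the boost printed in zoneinfo, a jump in the effective
             * watermarks can be attributed to either field at a glance.
             */
            printf("min %lu low %lu high %lu boost %lu\n",
                   wmark_pages(&z, WMARK_MIN), wmark_pages(&z, WMARK_LOW),
                   wmark_pages(&z, WMARK_HIGH), z.watermark_boost);
            return 0;
        }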

    Link: https://lkml.kernel.org/r/1632472566-12246-1-git-send-email-liangcaifan19@gmail.com
    Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
    Cc: Chunyan Zhang <zhang.lyra@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 7815cb2138 mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 8446b59baaf45e83e1187cdb174ac78ac5d7d0ae
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 5 13:40:31 2021 -0700

    mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()

    Grabbing zone lock in is_free_buddy_page() gives a wrong sense of
    safety, and has potential performance implications when zone is
    experiencing lock contention.

    In any case, if a caller needs a stable result, it should grab zone lock
    before calling this function.
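
    The locking point generalises beyond the page allocator; a small hedged
    pthread sketch of why taking the lock only inside the query is not
    enough for callers that act on the answer:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
        static bool page_is_free;       /* stands in for buddy-list state */

        /*
         * Taking the lock only inside the query gives a false sense of
         * safety: the answer may be stale the moment the lock is dropped.
         */
        static bool query_with_internal_lock(void)
        {
            bool ret;

            pthread_mutex_lock(&zone_lock);
            ret = page_is_free;
            pthread_mutex_unlock(&zone_lock);
            return ret;                 /* stale if another thread races */
        }

        /* A caller needing a stable answer holds the lock across check and use. */
        static void caller_needing_stable_result(void)
        {
            pthread_mutex_lock(&zone_lock);
            if (page_is_free)
                page_is_free = false;   /* safe: state cannot change under us */
            pthread_mutex_unlock(&zone_lock);
        }

        int main(void)
        {
            page_is_free = true;
            printf("query-only result: %d\n", query_with_internal_lock());
            caller_needing_stable_result();
            printf("state after locked caller: %d\n", page_is_free);
            return 0;
        }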

    Link: https://lkml.kernel.org/r/20210922152833.4023972-1-eric.dumazet@gmail.com
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 8350d36a42 mm/page_alloc: use accumulated load when building node fallback list
Bugzilla: https://bugzilla.redhat.com/2120352

commit 54d032ced98378bcb9d32dd5e378b7e402b36ad8
Author: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Date:   Fri Nov 5 13:40:21 2021 -0700

    mm/page_alloc: use accumulated load when building node fallback list

    In build_zonelists(), when the fallback list is built for the nodes, the
    node load gets reinitialized during each iteration.  This results in
    nodes with the same distance occupying the same slot in different node
    fallback lists rather than appearing in the intended round-robin
    manner.  This results in one node getting picked for allocation more
    often than other nodes with the same distance.

    As an example, consider a 4 node system with the following distance
    matrix.

      Node 0  1  2  3
      ----------------
      0    10 12 32 32
      1    12 10 32 32
      2    32 32 10 12
      3    32 32 12 10

    For this case, the node fallback list gets built like this:

      Node  Fallback list
      ---------------------
      0     0 1 2 3
      1     1 0 3 2
      2     2 3 0 1
      3     3 2 0 1 <-- Unexpected fallback order

    In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
    same order which results in more allocations getting satisfied from node
    0 compared to node 1.

    The effect of this on remote memory bandwidth as seen by stream
    benchmark is shown below:

      Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
            (numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
      Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
            (numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)

      ----------------------------------------
                    BANDWIDTH (MB/s)
          TEST      Case 1          Case 2
      ----------------------------------------
          COPY      57479.6         110791.8
         SCALE      55372.9         105685.9
           ADD      50460.6         96734.2
        TRIADD      50397.6         97119.1
      ----------------------------------------

    The bandwidth drop in Case 1 occurs because most of the allocations get
    satisfied by node 0 as it appears first in the fallback order for both
    nodes 2 and 3.

    This can be fixed by accumulating the node load in build_zonelists()
    rather than reinitializing it during each iteration.  With this the
    nodes with the same distance rightly get assigned in the round robin
    manner.

    In fact this was how it was originally until commit f0c0b2b808
    ("change zonelist order: zonelist order selection logic") dropped the
    load accumulation and resorted to initializing the load during each
    iteration.

    While zonelist ordering was removed by commit c9bff3eebc ("mm,
    page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
    accumulation in build_zonelists() remained.  So essentially this patch
    reverts back to the accumulated node load logic.

    After this fix, the fallback order gets built like this:

      Node Fallback list
      ------------------
      0    0 1 2 3
      1    1 0 3 2
      2    2 3 0 1
      3    3 2 1 0 <-- Note the change here

    The bandwidth in Case 1 improves and matches Case 2 as shown below.

      ----------------------------------------
                    BANDWIDTH (MB/s)
          TEST      Case 1          Case 2
      ----------------------------------------
          COPY      110438.9        110107.2
         SCALE      105930.5        105817.5
           ADD      97005.1         96159.8
        TRIADD      97441.5         96757.1
      ----------------------------------------

    The correctness of the fallback list generation has been verified for
    the above node configuration where node 3 starts as a memory-less node
    and comes up online only during memory hotplug.
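
    The effect is easy to reproduce in a userspace toy model of
    build_zonelists()/find_next_best_node().  The scaling factor and the
    "prefer the next node" penalty below are simplified approximations, kept
    only so that node_load acts as a tiebreaker, but they reproduce both
    fallback tables shown above:

        #include <limits.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define NR_NODES 4
        #define SCALE    (NR_NODES * NR_NODES)

        /* The 4-node distance matrix from the changelog. */
        static const int dist[NR_NODES][NR_NODES] = {
            { 10, 12, 32, 32 },
            { 12, 10, 32, 32 },
            { 32, 32, 10, 12 },
            { 32, 32, 12, 10 },
        };

        static int node_load[NR_NODES];

        static int find_next_best_node(int local, bool used[])
        {
            int best = -1, min_val = INT_MAX;

            if (!used[local]) {             /* use the local node first */
                used[local] = true;
                return local;
            }
            for (int n = 0; n < NR_NODES; n++) {
                int val;

                if (used[n])
                    continue;
                val = dist[local][n] + (n < local);   /* prefer the next node */
                val = val * SCALE + node_load[n];     /* prefer less loaded nodes */
                if (val < min_val) {
                    min_val = val;
                    best = n;
                }
            }
            if (best >= 0)
                used[best] = true;
            return best;
        }

        static void build_zonelists(int local, bool accumulate)
        {
            bool used[NR_NODES] = { false };
            int node, prev = local, load = NR_NODES;

            printf("%d: ", local);
            while ((node = find_next_best_node(local, used)) >= 0) {
                if (dist[local][node] != dist[local][prev]) {
                    if (accumulate)
                        node_load[node] += load;      /* accumulated (the fix) */
                    else
                        node_load[node] = load;       /* reinitialised (old) */
                }
                printf("%d ", node);
                prev = node;
                load--;
            }
            printf("\n");
        }

        int main(void)
        {
            for (int mode = 0; mode < 2; mode++) {
                printf("%s node_load:\n", mode ? "accumulated" : "reinitialised");
                for (int n = 0; n < NR_NODES; n++)
                    node_load[n] = 0;
                for (int n = 0; n < NR_NODES; n++)
                    build_zonelists(n, mode);
            }
            return 0;
        }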

    [bharata@amd.com: Added changelog, review, test validation]

    Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
    Fixes: f0c0b2b808 ("change zonelist order: zonelist order selection logic")
    Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Bharata B Rao <bharata@amd.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Chris von Recklinghausen 978a8fb0f0 mm/page_alloc: print node fallback order
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6cf253925df72e522c06dac09ede7e81a6e38121
Author: Bharata B Rao <bharata@amd.com>
Date:   Fri Nov 5 13:40:18 2021 -0700

    mm/page_alloc: print node fallback order

    Patch series "Fix NUMA nodes fallback list ordering".

    For a NUMA system that has multiple nodes at same distance from other
    nodes, the fallback list generation prefers same node order for them
    instead of round-robin thereby penalizing one node over others.  This
    series fixes it.

    More description of the problem and the fix is present in the patch
    description.

    This patch (of 2):

    Print information message about the allocation fallback order for each
    NUMA node during boot.

    No functional changes here.  This makes it easier to illustrate the
    problem in the node fallback list generation, which the next patch
    fixes.

    Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
    Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
    Signed-off-by: Bharata B Rao <bharata@amd.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen a7216c2dfb mm/page_alloc.c: use helper function zone_spans_pfn()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 86fb05b9cc1ac7cdcf37e5408b927dd3ad95db96
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:11 2021 -0700

    mm/page_alloc.c: use helper function zone_spans_pfn()

    Use helper function zone_spans_pfn() to check whether pfn is within a
    zone to simplify the code slightly.
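
    For reference, zone_spans_pfn() is roughly the following range check, so
    the conversion replaces an open-coded comparison (before/after below is
    illustrative; do_something() is just a placeholder):

        /* open-coded check being replaced */
        if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
                do_something();

        /* helper form, same semantics */
        if (zone_spans_pfn(zone, pfn))
                do_something();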

    Link: https://lkml.kernel.org/r/20210902121242.41607-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen 9d23d1b42b mm/page_alloc.c: simplify the code by using macro K()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ff7ed9e4532d14e0478d192548e77d78d72387e9
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:05 2021 -0700

    mm/page_alloc.c: simplify the code by using macro K()

    Use helper macro K() to convert the pages to the corresponding size.
    Minor readability improvement.
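
    The macro itself is a one-liner converting a page count to kibibytes;
    the usage below is an illustrative pr_info() with made-up variables,
    not one of the actual call sites:

        #define K(x) ((x) << (PAGE_SHIFT - 10))        /* pages -> KiB */

        pr_info("node %d: %lu pages (%lu kB) free\n", nid, nr_free, K(nr_free));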

    Link: https://lkml.kernel.org/r/20210902121242.41607-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen b603fd9497 mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ea808b4efd15f6f019e9779617a166c9708856c1
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:02 2021 -0700

    mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()

    Patch series "Cleanups and fixup for page_alloc", v2.

    This series contains cleanups to remove meaningless VM_BUG_ON(), use
    helpers to simplify the code and remove obsolete comment.  Also we avoid
    allocating highmem pages via alloc_pages_exact[_nid].  More details can be
    found in the respective changelogs.

    This patch (of 5):

    It's meaningless to VM_BUG_ON() order != pageblock_order just after
    setting order to pageblock_order.  Remove it.
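
    Roughly the shape of the helper in question (not necessarily the exact
    RHEL source), which makes the redundancy obvious:

        static inline unsigned int pindex_to_order(unsigned int pindex)
        {
                unsigned int order = pindex / MIGRATE_PCPTYPES;

                if (order > PAGE_ALLOC_COSTLY_ORDER) {
                        order = pageblock_order;
                        VM_BUG_ON(order != pageblock_order); /* trivially false; removed */
                }

                return order;
        }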

    Link: https://lkml.kernel.org/r/20210902121242.41607-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210902121242.41607-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen 6bf0954126 arch/x86/mm/numa: Do not initialize nodes twice
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1ca75fa7f19d694c58af681fa023295072b03120
Author: Oscar Salvador <osalvador@suse.de>
Date:   Tue Mar 22 14:43:51 2022 -0700

    arch/x86/mm/numa: Do not initialize nodes twice

    On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA
    nodes could be allocated at three different places.

     - numa_register_memblks
     - init_cpu_to_node
     - init_gi_nodes

    All these calls happen at setup_arch, and have the following order:

    setup_arch
      ...
      x86_numa_init
       numa_init
        numa_register_memblks
      ...
      init_cpu_to_node
       init_memory_less_node
        alloc_node_data
        free_area_init_memoryless_node
      init_gi_nodes
       init_memory_less_node
        alloc_node_data
        free_area_init_memoryless_node

    numa_register_memblks() is only interested in those nodes which have
    memory, so it skips over any memoryless node it finds.  Later on, when
    we have read ACPI's SRAT table, we call init_cpu_to_node() and
    init_gi_nodes(), which initialize any memoryless node we might have that
    have either CPU or Initiator affinity, meaning we allocate pg_data_t
    struct for them and we mark them as ONLINE.

    So far so good, but the thing is that after ("mm: handle uninitialized
    numa nodes gracefully"), we allocate all possible NUMA nodes in
    free_area_init(), meaning we have a picture like the following:

    setup_arch
      x86_numa_init
       numa_init
        numa_register_memblks  <-- allocate non-memoryless node
      x86_init.paging.pagetable_init
       ...
        free_area_init
         free_area_init_memoryless <-- allocate memoryless node
      init_cpu_to_node
       alloc_node_data             <-- allocate memoryless node with CPU
       free_area_init_memoryless_node
      init_gi_nodes
       alloc_node_data             <-- allocate memoryless node with Initiator
       free_area_init_memoryless_node

    free_area_init() already allocates all possible NUMA nodes, but
    init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
    go ahead and allocate a new pg_data_t struct without checking anything,
    meaning we end up allocating twice.

    It should be made clear that this only happens in the case where a
    memoryless NUMA node happens to have a CPU/Initiator affinity.

    So get rid of init_memory_less_node() and just set the node online.
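
    A sketch of what that boils down to in init_cpu_to_node()/init_gi_nodes()
    (simplified; the pg_data_t was already allocated earlier by free_area_init()):

        if (!node_online(node))
                node_set_online(node);  /* no second alloc_node_data() */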

    Note that setting the node online is needed, otherwise we choke down the
    chain when bringup_nonboot_cpus() ends up calling
    __try_online_node()->register_one_node()->...  and we blow up in
    bus_add_device().  As can be seen here:

      BUG: kernel NULL pointer dereference, address: 0000000000000060
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
      RIP: 0010:bus_add_device+0x5a/0x140
      Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
      RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
      RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
      RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
      R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
      R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
      Call Trace:
       device_add+0x4c0/0x910
       __register_one_node+0x97/0x2d0
       __try_online_node+0x85/0xc0
       try_online_node+0x25/0x40
       cpu_up+0x4f/0x100
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x79
       kernel_init_freeable+0x130/0x2f1
       kernel_init+0x17/0x150
       ret_from_fork+0x22/0x30

    The reason is simple: by the time bringup_nonboot_cpus() gets called, we
    have not registered the node_subsys bus yet, so we crash when
    bus_add_device() tries to dereference bus()->p.

    The following shows the order of the calls:

    kernel_init_freeable
     smp_init
      bringup_nonboot_cpus
       ...
         bus_add_device()      <- we did not register node_subsys yet
     do_basic_setup
      do_initcalls
       postcore_initcall(register_node_type);
        register_node_type
         subsys_system_register
          subsys_register
           bus_register         <- register node_subsys bus

    Why does setting the node online save us, then? Well, simply because
    __try_online_node() backs off when the node is online, meaning we do not
    end up calling register_one_node() in the first place.

    This is subtle, broken and deserves a deep analysis and thought about
    how to put this into shape, but for now let us have this easy fix for
    the leaking memory issue.

    [osalvador@suse.de: add comments]
      Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de

    Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
    Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
    Signed-off-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Rafael Aquini <raquini@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Alexey Makhalov <amakhalov@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:16 -04:00
Chris von Recklinghausen 63d42d92cd mm: page table check
Bugzilla: https://bugzilla.redhat.com/2120352

commit df4e817b710809425d899340dbfa8504a3ca4ba5
Author: Pasha Tatashin <pasha.tatashin@soleen.com>
Date:   Fri Jan 14 14:06:37 2022 -0800

    mm: page table check

    Check user page table entries at the time they are added and removed.

    This allows synchronously catching memory corruption issues related to
    double mapping.

    When a pte for an anonymous page is added into a page table, we verify
    that this pte does not already point to a file-backed page; conversely,
    if a file-backed page is being added, we verify that this page does not
    already have an anonymous mapping.

    We also enforce that only read-only sharing is allowed for anonymous
    pages (i.e. CoW after fork).  All other sharing must be for file pages.

    The page table check allows protecting and debugging cases where
    "struct page" metadata became corrupted for some reason.  For example,
    when refcnt or mapcount become invalid.
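
    A conceptual sketch of the enforced invariants; the real code lives in
    mm/page_table_check.c and keeps the counters in page_ext, so the struct
    and function below are illustrative only:

        struct ptc_counters {
                atomic_t anon_map_count;        /* anonymous mappings of the page */
                atomic_t file_map_count;        /* file-backed mappings of the page */
        };

        static void ptc_on_set_pte(struct ptc_counters *ptc, bool anon, bool writable)
        {
                if (anon) {
                        /* an anon pte must not target a file-mapped page ... */
                        BUG_ON(atomic_read(&ptc->file_map_count));
                        /* ... and writable anon mappings must be exclusive (CoW) */
                        BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && writable);
                } else {
                        /* a file pte must not target an anonymously mapped page */
                        BUG_ON(atomic_read(&ptc->anon_map_count));
                        atomic_inc(&ptc->file_map_count);
                }
        }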

    Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
    Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Masahiro Yamada <masahiroy@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Sami Tolvanen <samitolvanen@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:14 -04:00
Izabela Bakollari 49b9685e02 mm: prevent page_frag_alloc() from corrupting the memory
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2104445

A number of drivers call page_frag_alloc() with a fragment's size >
PAGE_SIZE.

In low memory conditions, __page_frag_cache_refill() may fail the order-3
cache allocation and fall back to order 0; in this case, the cache
will be smaller than the fragment, causing memory corruption.

Prevent this from happening by checking if the newly allocated cache is
large enough for the fragment; if not, the allocation will fail and
page_frag_alloc() will return NULL.
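
A simplified sketch of the added check (not the verbatim upstream hunk):

    if (unlikely(fragsz > PAGE_SIZE)) {
        /*
         * The refilled cache may be a single order-0 page and cannot
         * hold this fragment; returning NULL avoids handing out memory
         * that overlaps the next allocation.
         */
        return NULL;
    }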

Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
Fixes: b63ae8ca09 ("mm/net: Rename and move page fragment handling from net/ to mm/")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Cc: Chen Lin <chen45464546@163.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit dac22531bbd4af2426c4e29e05594415ccfa365d)
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
2022-10-04 15:31:49 +02:00
Patrick Talbert 45f9f33cc3 Merge: mm/munlock: Fix sleeping function called from invalid context bug
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1168

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1168

The 2nd patch fixes the "sleeping function called from invalid context"
bug reported in the BZ. The first patch is added to minimize the context
diff with the upstream patch.

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (2):
  mm/migration: add trace events for base page and HugeTLB migrations
  mm/munlock: protect the per-CPU pagevec by a local_lock_t

 arch/x86/mm/init.c             |  1 -
 include/trace/events/migrate.h | 31 +++++++++++++++++++++++
 mm/internal.h                  |  6 +++--
 mm/migrate.c                   |  6 +++--
 mm/mlock.c                     | 46 ++++++++++++++++++++++++++--------
 mm/page_alloc.c                |  1 +
 mm/rmap.c                      | 10 ++++++--
 mm/swap.c                      |  4 ++-
 8 files changed, 87 insertions(+), 18 deletions(-)

Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-08-03 11:54:33 -04:00
Patrick Talbert 840d62781b Merge: cgroup: Miscellaneous bug fixes and enhancements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/609

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2060150
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/609

This patchset pulls in miscellaneous cgroup fixes and enhancements.

[v3: Drop commit a06247c6804f as it has been merged and also drop commit b1e2c8df0f00 ("cgroup: use irqsave in
 cgroup_rstat_flush_locked().") as it may cause performance regression.]
[v4: Drop bpf commits as they have been merged. Drop commit 0061270307f2 ("group: cgroup-v1: do not exclude
    cgrp_dfl_root") as it cause network performance regression, and add back commit b1e2c8df0f00 ("cgroup: use irqsave in
 cgroup_rstat_flush_locked().")]

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (12):
  cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem
  cgroup: reduce dependency on cgroup_mutex
  cgroup: remove cgroup_mutex from cgroupstats_build
  cgroup: no need for cgroup_mutex for /proc/cgroups
  cgroup: Fix rootcg cpu.stat guest double counting
  mm/page_alloc: detect allocation forbidden by cpuset and bail out
    early
  cgroup/cpuset: Don't let child cpusets restrict parent in default
    hierarchy
  cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
  psi: Fix uaf issue when psi trigger is destroyed while being polled
  cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
  cgroup-v1: Correct privileges check in release_agent writes
  cgroup: use irqsave in cgroup_rstat_flush_locked().

 Documentation/accounting/psi.rst |   3 +-
 include/linux/cpuset.h           |  17 ++++
 include/linux/mmzone.h           |  22 ++++++
 include/linux/psi.h              |   2 +-
 include/linux/psi_types.h        |   3 -
 kernel/cgroup/cgroup-v1.c        |  20 ++---
 kernel/cgroup/cgroup.c           |  62 +++++++++------
 kernel/cgroup/cpuset.c           | 131 +++++++++++++++++++++----------
 kernel/cgroup/rstat.c            |  15 +++-
 kernel/sched/psi.c               |  66 +++++++---------
 mm/page_alloc.c                  |  13 +++
 11 files changed, 229 insertions(+), 125 deletions(-)

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-08-01 08:02:11 -04:00
Waiman Long e462accf60 mm/munlock: protect the per-CPU pagevec by a local_lock_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
Conflicts: A minor fuzz in mm/migrate.c due to missing upstream commit
	   1eba86c096e3 ("mm: change page type prior to adding page
	   table entry"). Pulling it, however, will require taking in
	   a number of additional patches. So it is not done here.

commit adb11e78c5dc5e26774acb05f983da36447f7911
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri, 1 Apr 2022 11:28:33 -0700

    mm/munlock: protect the per-CPU pagevec by a local_lock_t

    The access to mlock_pvec is protected by disabling preemption via
    get_cpu_var() or implicit by having preemption disabled by the caller
    (in mlock_page_drain() case).  This breaks on PREEMPT_RT since
    folio_lruvec_lock_irq() acquires a sleeping lock in this section.

    Create struct mlock_pvec which consists of the local_lock_t and the
    pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
    Replace mlock_page_drain() with a _local() version which is invoked on
    the local CPU and acquires the local_lock_t, and a _remote() version
    which uses the pagevec from a remote CPU that has gone offline.
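
    A sketch of the resulting pattern; the struct and locking calls follow the
    commit text, while the drain helper name is illustrative of the mm/mlock.c
    internals of that era:

        struct mlock_pvec {
                local_lock_t lock;
                struct pagevec vec;
        };

        static DEFINE_PER_CPU(struct mlock_pvec, mlock_pvec) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };

        void mlock_page_drain_local(void)
        {
                struct pagevec *pvec;

                local_lock(&mlock_pvec.lock);
                pvec = this_cpu_ptr(&mlock_pvec.vec);
                if (pagevec_count(pvec))
                        mlock_pagevec(pvec);    /* drain the batched pages */
                local_unlock(&mlock_pvec.lock);
        }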

    Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-21 14:50:55 -04:00
Patrick Talbert 379ca607c0 Merge: mm: folio backports part 2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1097

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Omitted-fix: a04cd1600b831a16625b45226b90a292c8f6e8d9

This is the second part of folio backports for 9.1. Like the first part, I tried
to avoid touching other subsystems as much as possible. Since folio conversions
leave the original functions as compatibility layer, other teams can bring their
subsystems changes whenever they want.

These are not all folio changes for 9.1 and the work will continue in 9.2.

adb11e78c5dc5 was not backported due to b74355078b not being present
a04cd1600b831 fixes an issue already fixed by ec4858e07ed62eceb, which is strange because ec4858e07ed62eceb was committed earlier

v2:
- added missing fixes and dependencies
- fixed a backport error on "mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()"
- added Conflicts for everything to keep scripts happy

v3:
- included 3ed4bb77156d patchset as requested

v4:
- fixed bisect build problems

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Conflicts:
- drivers/gpu/drm/drm_cache.c: context differs due to !717.
- drivers/gpu/drm/nouveau/nouveau_dmem.c: context differs due to !717.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-15 10:00:05 +02:00
Aristeu Rozanski 02c4025a8d mm: Make compound_pincount always available
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: Notice we have RHEL-only 44740bc20b applied, but that shouldn't be a problem space-wise since we don't ship 32-bit kernels anymore and we're well under the 40-byte limit

commit 5232c63f46fdd779303527ec36c518cc1e9c6b4e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Jan 6 16:46:43 2022 -0500

    mm: Make compound_pincount always available

    Move compound_pincount from the third page to the second page, which
    means it's available for all compound pages.  That lets us delete
    hpage_pincount_available().

    On 32-bit systems, there isn't enough space for both compound_pincount
    and compound_nr in the second page (it would collide with page->private,
    which is in use for pages in the swap cache), so revert the optimisation
    of storing both compound_order and compound_nr on 32-bit systems.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Joel Savitz 4c8ad89c62 mm/page_alloc: always attempt to allocate at least one page during bulk allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2094045

commit c572e4888ad1be123c1516ec577ad30a700bbec4
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Thu May 26 10:12:10 2022 +0100

    mm/page_alloc: always attempt to allocate at least one page during bulk allocation

    Peter Pavlisko reported the following problem on kernel bugzilla 216007.

            When I try to extract an uncompressed tar archive (2.6 million
            files, 760.3 GiB in size) on newly created (empty) XFS file system,
            after first low tens of gigabytes extracted the process hangs in
            iowait indefinitely. One CPU core is 100% occupied with iowait,
            the other CPU core is idle (on 2-core Intel Celeron G1610T).

    It was bisected to c9fa563072 ("xfs: use alloc_pages_bulk_array() for
    buffers") but XFS is only the messenger.  The problem is that nothing is
    waking kswapd to reclaim some pages at a time the PCP lists cannot be
    refilled until some reclaim happens.  The bulk allocator checks that there
    are some pages in the array and the original intent was that a bulk
    allocator did not necessarily need all the requested pages and it was best
    to return as quickly as possible.

    This was fine for the first user of the API but both NFS and XFS require
    the requested number of pages be available before making progress.  Both
    could be adjusted to call the page allocator directly if a bulk allocation
    fails but it puts a burden on users of the API.  Adjust the semantics to
    attempt at least one allocation via __alloc_pages() before returning so
    kswapd is woken if necessary.
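
    A caller-side sketch of the "need all pages" pattern this serves
    (illustrative only, not code taken from NFS or XFS):

        struct page *pages[16] = { NULL };
        unsigned long want = ARRAY_SIZE(pages), filled = 0;

        do {
                /* already-populated slots are skipped; returns total populated */
                filled = alloc_pages_bulk_array(GFP_KERNEL, want, pages);
                /*
                 * With this fix every call attempts at least one __alloc_pages(),
                 * which wakes kswapd, so the loop can make forward progress even
                 * when the PCP lists start out empty.
                 */
        } while (filled < want);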

    It was reported via bugzilla that the patch addressed the problem and that
    the tar extraction completed successfully.  This may also address bug
    215975 but has yet to be confirmed.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216007
    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215975
    Link: https://lkml.kernel.org/r/20220526091210.GC3441@techsingularity.net
    Fixes: 387ba26fb1 ("mm/page_alloc: add a bulk page allocator")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Chuck Lever <chuck.lever@oracle.com>
    Cc: <stable@vger.kernel.org>    [5.13+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
2022-06-15 10:35:41 -04:00
Waiman Long 83fb75916e mm/page_alloc: detect allocation forbidden by cpuset and bail out early
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2060150

commit 8ca1b5a49885f0c0c486544da46a9e0ac790831d
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri, 5 Nov 2021 13:40:34 -0700

    mm/page_alloc: detect allocation forbidden by cpuset and bail out early

    There was a report that starting an Ubuntu in docker while using cpuset
    to bind it to movable nodes (a node only has movable zone, like a node
    for hotplug or a Persistent Memory node in normal usage) will fail due
    to memory allocation failure, and then OOM is involved and many other
    innocent processes got killed.

    It can be reproduced with command:

        $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

    (where node 4 is a movable node)

      runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
      CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
      Call Trace:
       dump_stack+0x6b/0x88
       dump_header+0x4a/0x1e2
       oom_kill_process.cold+0xb/0x10
       out_of_memory.part.0+0xaf/0x230
       out_of_memory+0x3d/0x80
       __alloc_pages_slowpath.constprop.0+0x954/0xa20
       __alloc_pages_nodemask+0x2d3/0x300
       pipe_write+0x322/0x590
       new_sync_write+0x196/0x1b0
       vfs_write+0x1c3/0x1f0
       ksys_write+0xa7/0xe0
       do_syscall_64+0x52/0xd0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

      Mem-Info:
      active_anon:392832 inactive_anon:182 isolated_anon:0
       active_file:68130 inactive_file:151527 isolated_file:0
       unevictable:2701 dirty:0 writeback:7
       slab_reclaimable:51418 slab_unreclaimable:116300
       mapped:45825 shmem:735 pagetables:2540 bounce:0
       free:159849484 free_pcp:73 free_cma:0
      Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
      Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
      lowmem_reserve[]: 0 0 0 0 0
      Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

      oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
      Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
      oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
      oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    The reason is that in this case the target cpuset nodes only have a
    movable zone, while the creation of an OS in docker sometimes needs to
    allocate memory in non-movable zones (dma/dma32/normal), e.g. for
    GFP_HIGHUSER requests.  The cpuset limit forbids the allocation, and
    out-of-memory killing then kicks in even though normal nodes and movable
    nodes both have plenty of free memory.

    The OOM killer cannot help to resolve the situation as there is no
    usable memory for the request in the cpuset scope.  The only reasonable
    measure to take is to fail the allocation right away and have the caller
    to deal with it.

    So add a check for cases like this in the slowpath of allocation, and
    bail out early returning NULL for the allocation.

    As page allocation is one of the hottest paths in the kernel, this check
    would hurt all users with a sane cpuset configuration, so add a static
    branch check and detect the abnormal config during cpuset memory binding
    setup so that the extra check cost in page allocation is not paid by
    everyone.
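
    A sketch of the shape of the bail-out in the allocation slowpath; the
    cpusets_insane_config() static-branch helper comes from this patch, while
    the zonelist probe below is simplified:

        if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
                struct zoneref *z = first_zones_zonelist(ac->zonelist,
                                                         ac->highest_zoneidx,
                                                         &cpuset_current_mems_allowed);
                if (!z->zone)   /* no usable zone within the cpuset: give up early */
                        goto nopage;
        }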

    [thanks to Micho Hocko and David Rientjes for suggesting not handling
     it inside OOM code, adding cpuset check, refining comments]

    Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-06-13 09:47:53 -04:00
Patrick Talbert 407ad35116 Merge: mm: backport folio support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/678

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests with a stock kernel test run for comparison

This backport includes the base folio patches *without* touching any subsystems.
Patches are mostly straight forward converting functions to use folios.

v2: merge conflict, dropped 78525c74d9e7d1a6ce69bd4388f045f6e474a20b as contradicts the fact we're trying to not do subsystems converting in this MR

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Carlos Maiolino <cmaiolino@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-03 10:59:25 +02:00
Patrick Talbert dfb49ebc4b Merge: Preallocate pgdat struct for all nodes during boot
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/630

```
The page allocator blows up when an allocation from a possible node is requested.
The underlying reason is that NODE_DATA for the specific node is not allocated.

Preallocate the pgdat struct for all nodes rather than only online nodes to handle these cases more gracefully.
```
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054

Upstream-Status: Linus

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-04-19 12:23:39 +02:00
Aristeu Rozanski 3b6cedb421 mm/page_alloc: Add folio allocation functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok
Conflicts: context due c00b6b9610991c042ff4c3153daaa3ea8522c210 being backported already

commit cc09cb134124a42fbe3bdcebefdc54e286d8f3e5
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Dec 15 22:55:54 2020 -0500

    mm/page_alloc: Add folio allocation functions

    The __folio_alloc(), __folio_alloc_node() and folio_alloc() functions
    are mostly for type safety, but they also ensure that the page allocator
    allocates a compound page and initialises the deferred list if the page
    is large enough to have one.
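
    An illustrative use of the new allocators (not part of the patch itself):

        struct folio *folio = folio_alloc(GFP_KERNEL, 2);       /* order-2 folio */

        if (folio) {
                memset(folio_address(folio), 0, folio_size(folio));
                folio_put(folio);
        }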

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:31 -04:00
Aristeu Rozanski 98caaaf947 mm/memcg: Convert mem_cgroup_uncharge() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit bbc6b703b21963e909f633cf7718903ed5094319
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat May 1 20:42:23 2021 -0400

    mm/memcg: Convert mem_cgroup_uncharge() to take a folio

    Convert all the callers to call page_folio().  Most of them were already
    using a head page, but a few of them I can't prove were, so this may
    actually fix a bug.
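
    The caller-side conversion pattern looks roughly like this (illustrative):

        struct folio *folio = page_folio(page); /* resolves to the head folio */

        mem_cgroup_uncharge(folio);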

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Nico Pache 5a2c4c0b3c mm: make free_area_init_node aware of memory less nodes
commit 7c30daac20698cb035255089c896f230982b085e
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:47:03 2022 -0700

    mm: make free_area_init_node aware of memory less nodes

    free_area_init_node is also called from the memory less node initialization
    path (free_area_init_memoryless_node).  It doesn't really make much sense
    to display the physical memory range for those nodes:

        Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]

    Instead be explicit that the node is memoryless:

        Initmem setup node XX as memoryless
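
    Approximately the shape of the changed printout (not verbatim):

        if (start_pfn != end_pfn)
                pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
                        (u64)start_pfn << PAGE_SHIFT,
                        ((u64)end_pfn << PAGE_SHIFT) - 1);
        else
                pr_info("Initmem setup node %d as memoryless\n", nid);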

    Link: https://lkml.kernel.org/r/20220127085305.20890-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Alexey Makhalov <amakhalov@vmware.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Nico Pache <npache@redhat.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:39 -06:00
Nico Pache 078bc11654 mm, memory_hotplug: reorganize new pgdat initialization
commit 70b5b46a754245d383811b4d2f2c76c34bb7e145
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:47:00 2022 -0700

    mm, memory_hotplug: reorganize new pgdat initialization

    When a !node_online node is brought up it needs a hotplug specific
    initialization because the node could be either uninitialized yet or it
    could have been recycled after previous hotremove.  hotadd_init_pgdat is
    responsible for that.

    Internal pgdat state is initialized at two places currently
            - hotadd_init_pgdat
            - free_area_init_core_hotplug

    There is no real clear-cut rule for what should go where, but this patch
    chooses to move the whole internal state initialization into
    free_area_init_core_hotplug.  hotadd_init_pgdat is still responsible for
    pulling all the parts together - most notably for initializing zonelists
    because those depend on the overall topology.

    This patch doesn't introduce any functional change.

    Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Alexey Makhalov <amakhalov@vmware.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nico Pache <npache@redhat.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:39 -06:00
Nico Pache 8e3254a841 mm: handle uninitialized numa nodes gracefully
commit 09f49dca570a917a8c6bccd7e8c61f5141534e3a
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:46:54 2022 -0700

    mm: handle uninitialized numa nodes gracefully

    We have had several reports [1][2][3] that the page allocator blows up when
    allocation from a possible node is requested.  The underlying reason is
    that NODE_DATA for the specific node is not allocated.

    NUMA specific initialization is arch specific and it can vary a lot.  E.g.
    x86 tries to initialize all nodes that have some cpu affinity (see
    init_cpu_to_node) but this can be insufficient because the node might be
    cpuless for example.

    One way to address this problem would be to check for !node_online nodes
    when trying to get a zonelist and silently fall back to another node.
    That is unfortunately adding a branch into allocator hot path and it
    doesn't handle any other potential NODE_DATA users.

    This patch takes a different approach (following a lead of [3]) and it
    preallocates pgdat for all possible nodes in arch-independent code -
    free_area_init.  All uninitialized nodes are treated as memoryless nodes.
    node_state of the node is not changed because that would lead to other
    side effects - e.g.  sysfs representation of such a node and from past
    discussions [4] it is known that some tools might have problems digesting
    that.

    Newly allocated pgdat only gets a minimal initialization and the rest of
    the work is expected to be done by the memory hotplug - hotadd_new_pgdat
    (renamed to hotadd_init_pgdat).

    generic_alloc_nodedata is changed to use the memblock allocator because
    neither page nor slab allocators are available at the stage when all
    pgdats are allocated.  Hotplug doesn't allocate pgdat anymore so we can
    use the early boot allocator.  The only arch specific implementation is
    ia64 and that is changed to use the early allocator as well.
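
    A conceptual sketch of the pre-allocation described above; the real loop
    in free_area_init() differs in detail and hides the memblock allocation
    behind arch helpers:

        for_each_node(nid) {
                if (!node_online(nid)) {
                        /* neither the page nor the slab allocator is up yet */
                        pg_data_t *pgdat = memblock_alloc(sizeof(*pgdat),
                                                          SMP_CACHE_BYTES);

                        arch_refresh_nodedata(nid, pgdat);
                        free_area_init_memoryless_node(nid);
                        /* node_states deliberately left untouched */
                        continue;
                }
                free_area_init_node(nid);
        }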

    [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
    [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
    [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
    [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com

    [akpm@linux-foundation.org: replace comment, per Mike]

    Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
    Reported-by: Alexey Makhalov <amakhalov@vmware.com>
    Tested-by: Alexey Makhalov <amakhalov@vmware.com>
    Reported-by: Nico Pache <npache@redhat.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Tested-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:38 -06:00
Rafael Aquini b501affe62 mm/page_alloc: check high-order pages for corruption during PCP operations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 77fe7f136a7312954b1b8b7eeb4bc91fc3c14a3f
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:44:00 2022 -0700

    mm/page_alloc: check high-order pages for corruption during PCP operations

    Eric Dumazet pointed out that commit 44042b4498 ("mm/page_alloc: allow
    high-order pages to be stored on the per-cpu lists") only checks the
    head page during PCP refill and allocation operations.  This was an
    oversight and all pages should be checked.  This will incur a small
    performance penalty but it's necessary for correctness.
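
    Conceptually, the fix amounts to validating every constituent page of a
    high-order block moved through the PCP lists, not only the head page; the
    real checks live in the PCP refill/free paths, so this helper is
    illustrative:

        static inline bool pcp_block_is_corrupted(struct page *page, unsigned int order)
        {
                for (unsigned int i = 0; i < (1U << order); i++) {
                        if (check_new_page(page + i))
                                return true;    /* bad page state detected */
                }
                return false;
        }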

    Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net
    Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:43 -04:00
Rafael Aquini 08458eeb32 mm/page_alloc: do not prefetch buddies during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 2a791f4412cba41330453527a3045cf39818e72a
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:48 2022 -0700

    mm/page_alloc: do not prefetch buddies during bulk free

    free_pcppages_bulk() has taken two passes through the pcp lists since
    commit 0a5f4e5b45 ("mm/free_pcppages_bulk: do not hold lock when
    picking pages to free") due to deferring the cost of selecting PCP lists
    until the zone lock is held.

    As the list processing now takes place under the zone lock, it's less
    clear that this will always benefit for two reasons.

    1. There is a guaranteed cost to calculating the buddy which definitely
       has to be calculated again. However, as the zone lock is held and
       there is no deferring of buddy merging, there is no guarantee that the
       prefetch will have completed when the second buddy calculation takes
       place and buddies are being merged.  With or without the prefetch, there
       may be further stalls depending on how many pages get merged. In other
       words, a stall due to merging is inevitable and at best only one stall
       might be avoided at the cost of calculating the buddy location twice.

    2. As the zone lock is held, prefetch_nr makes less sense as once
       prefetch_nr expires, the cache lines of interest have already been
       merged.

    The main concern is that there is a definite cost to calculating the
    buddy location early for the prefetch and it is a "maybe win" depending
    on whether the CPU prefetch logic and memory is fast enough.  Remove the
    prefetch logic on the basis that reduced instructions in a path is
    always a saving where as the prefetch might save one memory stall
    depending on the CPU and memory.

    In most cases, this has marginal benefit as the calculations are a small
    part of the overall freeing of pages.  However, it was detectable on at
    least one machine.

                                  5.17.0-rc3             5.17.0-rc3
                        mm-highpcplimit-v2r1     mm-noprefetch-v1r1
    Min       elapsed      630.00 (   0.00%)      610.00 (   3.17%)
    Amean     elapsed      639.00 (   0.00%)      623.00 *   2.50%*
    Max       elapsed      660.00 (   0.00%)      660.00 (   0.00%)

    Link: https://lkml.kernel.org/r/20220221094119.15282-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Suggested-by: Aaron Lu <aaron.lu@intel.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:42 -04:00
Rafael Aquini ece5251696 mm/page_alloc: limit number of high-order pages on PCP during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit f26b3fa046116a7dedcaafe30083402113941451
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:45 2022 -0700

    mm/page_alloc: limit number of high-order pages on PCP during bulk free

    When a PCP is mostly used for frees then high-order pages can exist on
    PCP lists for some time.  This is problematic when the allocation
    pattern is all allocations from one CPU and all frees from another,
    resulting in colder pages being used.  When bulk freeing pages, limit
    the number of high-order pages that are stored on the PCP lists.

    Netperf running on localhost exhibits this pattern and while it does not
    matter for some machines, it does matter for others with smaller caches
    where cache misses cause problems due to reduced page reuse.  Pages
    freed directly to the buddy list may be reused quickly while still cache
    hot where as storing on the PCP lists may be cold by the time
    free_pcppages_bulk() is called.

    Using perf kmem:mm_page_alloc, the 5 most used page frames were

    5.17-rc3
      13041 pfn=0x111a30
      13081 pfn=0x5814d0
      13097 pfn=0x108258
      13121 pfn=0x689598
      13128 pfn=0x5814d8

    5.17-revert-highpcp
     192009 pfn=0x54c140
     195426 pfn=0x1081d0
     200908 pfn=0x61c808
     243515 pfn=0xa9dc20
     402523 pfn=0x222bb8

    5.17-full-series
     142693 pfn=0x346208
     162227 pfn=0x13bf08
     166413 pfn=0x2711e0
     166950 pfn=0x2702f8

    The spread is wider as there is still time before pages freed to one PCP
    get released with a tradeoff between fast reuse and reduced zone lock
    acquisition.

    On the machine used to gather the traces, the headline performance was
    equivalent.

    netperf-tcp
                                5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                   vanilla  mm-reverthighpcp-v1r1     mm-highpcplimit-v2
    Hmean     64         839.93 (   0.00%)      840.77 (   0.10%)      841.02 (   0.13%)
    Hmean     128       1614.22 (   0.00%)     1622.07 *   0.49%*     1636.41 *   1.37%*
    Hmean     256       2952.00 (   0.00%)     2953.19 (   0.04%)     2977.76 *   0.87%*
    Hmean     1024     10291.67 (   0.00%)    10239.17 (  -0.51%)    10434.41 *   1.39%*
    Hmean     2048     17335.08 (   0.00%)    17399.97 (   0.37%)    17134.81 *  -1.16%*
    Hmean     3312     22628.15 (   0.00%)    22471.97 (  -0.69%)    22422.78 (  -0.91%)
    Hmean     4096     25009.50 (   0.00%)    24752.83 *  -1.03%*    24740.41 (  -1.08%)
    Hmean     8192     32745.01 (   0.00%)    31682.63 *  -3.24%*    32153.50 *  -1.81%*
    Hmean     16384    39759.59 (   0.00%)    36805.78 *  -7.43%*    38948.13 *  -2.04%*

    On a 1-socket skylake machine with a small CPU cache that suffers more if
    cache misses are too high

    netperf-tcp
                                5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                   vanilla    mm-reverthighpcp-v1     mm-highpcplimit-v2
    Hmean     64         938.95 (   0.00%)      941.50 *   0.27%*      943.61 *   0.50%*
    Hmean     128       1843.10 (   0.00%)     1857.58 *   0.79%*     1861.09 *   0.98%*
    Hmean     256       3573.07 (   0.00%)     3667.45 *   2.64%*     3674.91 *   2.85%*
    Hmean     1024     13206.52 (   0.00%)    13487.80 *   2.13%*    13393.21 *   1.41%*
    Hmean     2048     22870.23 (   0.00%)    23337.96 *   2.05%*    23188.41 *   1.39%*
    Hmean     3312     31001.99 (   0.00%)    32206.50 *   3.89%*    31863.62 *   2.78%*
    Hmean     4096     35364.59 (   0.00%)    36490.96 *   3.19%*    36112.54 *   2.11%*
    Hmean     8192     48497.71 (   0.00%)    49954.05 *   3.00%*    49588.26 *   2.25%*
    Hmean     16384    58410.86 (   0.00%)    60839.80 *   4.16%*    62282.96 *   6.63%*

    Note that this was a machine that did not benefit from caching high-order
    pages and performance is almost restored with the series applied.  It's
    not fully restored as cache misses are still higher.  This is a trade-off
    between optimising for a workload that does all allocs on one CPU and
    frees on another or more general workloads that need high-order pages for
    SLUB and benefit from avoiding zone->lock for every SLUB refill/drain.

    Link: https://lkml.kernel.org/r/20220217002227.5739-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:41 -04:00
Rafael Aquini dbf0d205e6 mm/page_alloc: free pages in a single pass during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 8b10b465d0e18b002b290b2162145abc7167e53d
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:42 2022 -0700

    mm/page_alloc: free pages in a single pass during bulk free

    free_pcppages_bulk() has taken two passes through the pcp lists since
    commit 0a5f4e5b45 ("mm/free_pcppages_bulk: do not hold lock when
    picking pages to free") due to deferring the cost of selecting PCP lists
    until the zone lock is held.  Now that list selection is simpler, the
    main cost during selection is bulkfree_pcp_prepare() which in the normal
    case is a simple check and prefetching.  As the list manipulations have
    cost in itself, go back to freeing pages in a single pass.

    The series up to this point was evaluated using a trunc microbenchmark
    that is truncating sparse files stored in page cache (mmtests config
    config-io-trunc).  Sparse files were used to limit filesystem
    interaction.  The results versus a revert of storing high-order pages in
    the PCP lists is

    1-socket Skylake
                                   5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                      vanilla      mm-reverthighpcp-v1     mm-highpcpopt-v2
     Min       elapsed      540.00 (   0.00%)      530.00 (   1.85%)      530.00 (   1.85%)
     Amean     elapsed      543.00 (   0.00%)      530.00 *   2.39%*      530.00 *   2.39%*
     Stddev    elapsed        4.83 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)
     CoeffVar  elapsed        0.89 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)
     Max       elapsed      550.00 (   0.00%)      530.00 (   3.64%)      530.00 (   3.64%)
     BAmean-50 elapsed      540.00 (   0.00%)      530.00 (   1.85%)      530.00 (   1.85%)
     BAmean-95 elapsed      542.22 (   0.00%)      530.00 (   2.25%)      530.00 (   2.25%)
     BAmean-99 elapsed      542.22 (   0.00%)      530.00 (   2.25%)      530.00 (   2.25%)

    2-socket CascadeLake
                                   5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                      vanilla    mm-reverthighpcp-v1       mm-highpcpopt-v2
     Min       elapsed      510.00 (   0.00%)      500.00 (   1.96%)      500.00 (   1.96%)
     Amean     elapsed      529.00 (   0.00%)      521.00 (   1.51%)      510.00 *   3.59%*
     Stddev    elapsed       16.63 (   0.00%)       12.87 (  22.64%)       11.55 (  30.58%)
     CoeffVar  elapsed        3.14 (   0.00%)        2.47 (  21.46%)        2.26 (  27.99%)
     Max       elapsed      550.00 (   0.00%)      540.00 (   1.82%)      530.00 (   3.64%)
     BAmean-50 elapsed      516.00 (   0.00%)      512.00 (   0.78%)      500.00 (   3.10%)
     BAmean-95 elapsed      526.67 (   0.00%)      518.89 (   1.48%)      507.78 (   3.59%)
     BAmean-99 elapsed      526.67 (   0.00%)      518.89 (   1.48%)      507.78 (   3.59%)

    The original motivation for multi-passes was will-it-scale page_fault1
    using $nr_cpu processes.

    2-socket CascadeLake (40 cores, 80 CPUs HT enabled)
                                                         5.17.0-rc3                 5.17.0-rc3
                                                            vanilla           mm-highpcpopt-v2
     Hmean     page_fault1-processes-2        2694662.26 (   0.00%)      2695780.35 (   0.04%)
     Hmean     page_fault1-processes-5        6425819.34 (   0.00%)      6435544.57 *   0.15%*
     Hmean     page_fault1-processes-8        9642169.10 (   0.00%)      9658962.39 (   0.17%)
     Hmean     page_fault1-processes-12      12167502.10 (   0.00%)     12190163.79 (   0.19%)
     Hmean     page_fault1-processes-21      15636859.03 (   0.00%)     15612447.26 (  -0.16%)
     Hmean     page_fault1-processes-30      25157348.61 (   0.00%)     25169456.65 (   0.05%)
     Hmean     page_fault1-processes-48      27694013.85 (   0.00%)     27671111.46 (  -0.08%)
     Hmean     page_fault1-processes-79      25928742.64 (   0.00%)     25934202.02 (   0.02%) <--
     Hmean     page_fault1-processes-110     25730869.75 (   0.00%)     25671880.65 *  -0.23%*
     Hmean     page_fault1-processes-141     25626992.42 (   0.00%)     25629551.61 (   0.01%)
     Hmean     page_fault1-processes-172     25611651.35 (   0.00%)     25614927.99 (   0.01%)
     Hmean     page_fault1-processes-203     25577298.75 (   0.00%)     25583445.59 (   0.02%)
     Hmean     page_fault1-processes-234     25580686.07 (   0.00%)     25608240.71 (   0.11%)
     Hmean     page_fault1-processes-265     25570215.47 (   0.00%)     25568647.58 (  -0.01%)
     Hmean     page_fault1-processes-296     25549488.62 (   0.00%)     25543935.00 (  -0.02%)
     Hmean     page_fault1-processes-320     25555149.05 (   0.00%)     25575696.74 (   0.08%)

    The differences are mostly within the noise and the difference close to
    $nr_cpus is negligible.

    Link: https://lkml.kernel.org/r/20220217002227.5739-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:40 -04:00
Rafael Aquini 30fdd7202c mm/page_alloc: drain the requested list first during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit d61372bc41cfe91d6170434fc44b6af49cd2c755
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:38 2022 -0700

    mm/page_alloc: drain the requested list first during bulk free

    Prior to the series, pindex 0 (order-0 MIGRATE_UNMOVABLE) was always
    skipped first and the precise reason is forgotten.  A potential reason
    may have been to artificially preserve MIGRATE_UNMOVABLE but there is no
    reason why that would be optimal as it depends on the workload.  The
    more likely reason is that it was less complicated to do a pre-increment
    instead of a post-increment in terms of overall code flow.  As
    free_pcppages_bulk() now typically receives the pindex of the PCP list
    that exceeded high, always start draining that list.
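
    A minimal user-space sketch of the idea (illustrative only, not code from the
    patch; the list layout and names are made up):

        #include <stdio.h>

        #define NR_PCP_LISTS 4

        /* Hypothetical per-cpu lists, reduced to per-list page counts. */
        static int pcp_count[NR_PCP_LISTS] = { 3, 0, 5, 2 };

        /* Drain up to 'to_free' pages, starting with the list that went over high. */
        static void free_pcppages_bulk_sketch(int to_free, int pindex)
        {
            int empty_seen = 0;

            while (to_free > 0 && empty_seen < NR_PCP_LISTS) {
                if (pcp_count[pindex] == 0) {
                    pindex = (pindex + 1) % NR_PCP_LISTS;  /* round-robin onwards */
                    empty_seen++;
                    continue;
                }
                empty_seen = 0;
                pcp_count[pindex]--;                       /* "free" one page */
                to_free--;
                printf("freed a page from list %d\n", pindex);
            }
        }

        int main(void)
        {
            free_pcppages_bulk_sketch(7, 2);  /* list 2 exceeded high: drain it first */
            return 0;
        }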

    Link: https://lkml.kernel.org/r/20220217002227.5739-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:39 -04:00
Rafael Aquini 06ca44e003 mm/page_alloc: simplify how many pages are selected per pcp list during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit fd56eef258a17bbc8eda2ca773fa538f354c5f49
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:36 2022 -0700

    mm/page_alloc: simplify how many pages are selected per pcp list during bulk free

    free_pcppages_bulk() selects pages to free by round-robining between
    lists.  Originally this was to evenly shrink pages by migratetype but
    uneven freeing is inevitable due to high pages.  Simplify list selection
    by starting with a list that definitely has pages on it in
    free_unref_page_commit() and for drain, it does not matter where
    draining starts as all pages are removed.

    Link: https://lkml.kernel.org/r/20220217002227.5739-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:38 -04:00
Rafael Aquini 6b5d51fbb5 mm/page_alloc: track range of active PCP lists during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 35b6d770e6334aa470080570f0f81c8b74a07afd
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:33 2022 -0700

    mm/page_alloc: track range of active PCP lists during bulk free

    free_pcppages_bulk() frees pages in a round-robin fashion.  Originally,
    this was dealing only with migratetypes but storing high-order pages
    means that there can be many more empty lists that are uselessly
    checked.  Track the minimum and maximum active pindex to reduce the
    search space.
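
    A rough, self-contained sketch of the bookkeeping (hypothetical names; the
    real code tracks the active range inside the bulk-free path itself):

        #include <limits.h>
        #include <stdio.h>

        #define NR_PCP_LISTS 12

        static int pcp_count[NR_PCP_LISTS];
        static int min_pindex = INT_MAX, max_pindex = -1;

        /* Record a page landing on list 'pindex' and remember the active range. */
        static void pcp_add(int pindex)
        {
            pcp_count[pindex]++;
            if (pindex < min_pindex)
                min_pindex = pindex;
            if (pindex > max_pindex)
                max_pindex = pindex;
        }

        /* Bulk free only needs to scan [min_pindex, max_pindex]. */
        static void pcp_drain(void)
        {
            for (int i = min_pindex; i <= max_pindex; i++) {
                printf("list %d: freeing %d pages\n", i, pcp_count[i]);
                pcp_count[i] = 0;
            }
            min_pindex = INT_MAX;
            max_pindex = -1;
        }

        int main(void)
        {
            pcp_add(3);
            pcp_add(7);
            pcp_add(3);
            pcp_drain();   /* scans lists 3..7 instead of 0..11 */
            return 0;
        }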

    Link: https://lkml.kernel.org/r/20220217002227.5739-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:38 -04:00
Rafael Aquini bfeafadd15 mm/page_alloc: fetch the correct pcp buddy during bulk free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit ca7b59b1de72450b3e696bada3506a519ac5455c
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Tue Mar 22 14:43:30 2022 -0700

    mm/page_alloc: fetch the correct pcp buddy during bulk free

    Patch series "Follow-up on high-order PCP caching", v2.

    Commit 44042b4498 ("mm/page_alloc: allow high-order pages to be stored
    on the per-cpu lists") was primarily aimed at reducing the cost of SLUB
    cache refills of high-order pages in two ways.  Firstly, zone lock
    acquisitions was reduced and secondly, there were fewer buddy list
    modifications.  This is a follow-up series fixing some issues that
    became apparent after merging.

    Patch 1 is a functional fix.  It's harmless but inefficient.

    Patches 2-5 reduce the overhead of bulk freeing of PCP pages.  While the
    overhead is small, it's cumulative and noticeable when truncating large
    files.  The changelog for patch 4 includes results of a microbench that
    deletes large sparse files with data in page cache.  Sparse files were
    used to eliminate filesystem overhead.

    Patch 6 addresses issues with high-order PCP pages being stored on PCP
    lists for too long.  Pages freed on a CPU potentially may not be quickly
    reused and in some cases this can increase cache miss rates.  Details
    are included in the changelog.

    This patch (of 6):

    free_pcppages_bulk() prefetches buddies about to be freed but the order
    must also be passed in as PCP lists store multiple orders.
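
    For illustration, a tiny program computing the buddy pfn the way the buddy
    allocator does (pfn XOR (1 << order)), which shows why the order matters:

        #include <stdio.h>

        /* Buddy of the block starting at 'pfn' for a block of 2^order pages. */
        static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
        {
            return pfn ^ (1UL << order);
        }

        int main(void)
        {
            unsigned long pfn = 0x1000;

            /* Treating every pcp page as order-0 would prefetch the wrong
             * buddy for the high-order lists. */
            printf("order-0 buddy of %#lx: %#lx\n", pfn, buddy_pfn(pfn, 0));
            printf("order-3 buddy of %#lx: %#lx\n", pfn, buddy_pfn(pfn, 3));
            return 0;
        }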

    Link: https://lkml.kernel.org/r/20220217002227.5739-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20220217002227.5739-2-mgorman@techsingularity.net
    Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Aaron Lu <aaron.lu@intel.com>
    Tested-by: Aaron Lu <aaron.lu@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:37 -04:00
Rafael Aquini 8bf8296d82 mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit ddbc84f3f595cf1fc8234a191193b5d20ad43938
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Mar 22 14:43:26 2022 -0700

    mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node

    ZONE_MOVABLE uses the remaining memory in each node.  Its starting pfn
    is also aligned to MAX_ORDER_NR_PAGES.  It is possible for the remaining
    memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
    not enough room for ZONE_MOVABLE on that node.

    Unfortunately this condition is not checked for.  This leads to
    zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
    node.

    calculate_node_totalpages() then sets zone->present_pages to be greater
    than zone->spanned_pages which is invalid, as spanned_pages represents
    the maximum number of pages in a zone assuming no holes.

    Subsequently it is possible free_area_init_core() will observe a zone of
    size zero with present pages.  In this case it will skip setting up the
    zone, including the initialisation of free_lists[].

    However populated_zone() checks zone->present_pages to see if a zone has
    memory available.  This is used by iterators such as
    walk_zones_in_node().  pagetypeinfo_showfree() uses this to walk the
    free_list of each zone in each node, which are assumed to be initialised
    due to the zone not being empty.

    As free_area_init_core() never initialised the free_lists[] this results
    in the following kernel crash when trying to read /proc/pagetypeinfo:

      BUG: kernel NULL pointer dereference, address: 0000000000000000
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
      CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
      RIP: 0010:pagetypeinfo_show+0x163/0x460
      Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
      RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
      RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
      RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
      RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
      R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
      R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
      FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
      Call Trace:
       seq_read_iter+0x128/0x460
       proc_reg_read_iter+0x51/0x80
       new_sync_read+0x113/0x1a0
       vfs_read+0x136/0x1d0
       ksys_read+0x70/0xf0
       __x64_sys_read+0x1a/0x20
       do_syscall_64+0x3b/0xc0
       entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix this by checking that the aligned zone_movable_pfn[] does not exceed
    the end of the node, and if it does skip creating a movable zone on this
    node.
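
    A simplified stand-alone sketch of the check (the alignment unit and pfn
    values are made up for illustration):

        #include <stdio.h>

        #define MAX_ORDER_NR_PAGES 1024UL   /* illustrative alignment unit */

        /* Round 'pfn' up to the next MAX_ORDER_NR_PAGES boundary. */
        static unsigned long align_up(unsigned long pfn)
        {
            return (pfn + MAX_ORDER_NR_PAGES - 1) & ~(MAX_ORDER_NR_PAGES - 1);
        }

        int main(void)
        {
            unsigned long node_end_pfn = 0x8200;   /* node has a small tail */
            unsigned long movable_start = align_up(0x8100);

            if (movable_start >= node_end_pfn)
                printf("no room after alignment: skip ZONE_MOVABLE on this node\n");
            else
                printf("ZONE_MOVABLE spans %#lx-%#lx\n", movable_start, node_end_pfn);
            return 0;
        }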

    Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
    Fixes: 2a1e274acf ("Create the ZONE_MOVABLE zone")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:36 -04:00
Rafael Aquini 5e1b86d952 mm/page_alloc: mark pagesets as __maybe_unused
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit a4812d47deff0642b3315f0528d579f0a99c45c2
Author: Nathan Chancellor <nathan@kernel.org>
Date:   Tue Mar 22 14:43:23 2022 -0700

    mm/page_alloc: mark pagesets as __maybe_unused

    Commit 9983a9d577db ("locking/local_lock: Make the empty local_lock_*()
    function a macro.") in the -tip tree converted the local_lock_*()
    functions into macros, which causes a warning with clang with
    CONFIG_PREEMPT_RT=n + CONFIG_DEBUG_LOCK_ALLOC=n:

      mm/page_alloc.c:131:40: error: variable 'pagesets' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
      static DEFINE_PER_CPU(struct pagesets, pagesets) = {
                                             ^
      1 error generated.

    Prior to that change, clang was not able to tell that pagesets was
    unused in this configuration because it does not perform cross function
    analysis in the frontend.  After that change, it sees that the macros
    just do a typecheck on the lock member of pagesets, which is evaluated
    at compile time (so the variable is technically "used"), meaning the
    variable is not needed in the final assembly, as the warning states.

    Mark the variable as __maybe_unused to make it clear to clang that this
    is expected in this configuration so there is no more warning.
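
    A stand-alone sketch of the annotation; in the kernel, __maybe_unused expands
    to the compiler's "unused" attribute, roughly as modelled here:

        /* Compile with: gcc -Wall -Wunused -c maybe_unused_sketch.c */
        #define __maybe_unused __attribute__((__unused__))

        struct pagesets_like {
            int lock;
        };

        /* Without the annotation, clang can warn that the variable is not
         * needed when every use of it folds away at compile time. */
        static struct pagesets_like pagesets_sketch __maybe_unused;

        int main(void)
        {
            return 0;
        }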

    Link: https://github.com/ClangBuiltLinux/linux/issues/1593
    Link: https://lkml.kernel.org/r/20220215184322.440969-1-nathan@kernel.org
    Signed-off-by: Nathan Chancellor <nathan@kernel.org>
    Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
    Reported-by: "kernelci.org bot" <bot@kernelci.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:35 -04:00
Rafael Aquini fd51d8153c mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 7cba630bd830472f88ada5fdd34bbfd5825d9217
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Nov 5 13:40:08 2021 -0700

    mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()

    The second two paragraphs about "all pages pinned" and pages_scanned are
    obsolete.  There are PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP orders
    in pcp, so the same-order assumption no longer holds.

    Link: https://lkml.kernel.org/r/20210902121242.41607-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:18 -04:00
Rafael Aquini ea678bc416 mm/large system hash: avoid possible NULL deref in alloc_large_system_hash
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 084f7e2377e89ccbc8375b5486c6b4c16682f602
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 5 13:39:59 2021 -0700

    mm/large system hash: avoid possible NULL deref in alloc_large_system_hash

    If __vmalloc() returned NULL, is_vm_area_hugepages(NULL) will fault if
    CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
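
    The defensive pattern, sketched with a plain user-space allocator standing in
    for __vmalloc() and an invented helper standing in for the dereferencing call:

        #include <stdio.h>
        #include <stdlib.h>

        /* Stand-in for a helper that dereferences its argument unconditionally. */
        static int looks_huge(const unsigned char *table)
        {
            return table[0] == 0xab;   /* would crash if table were NULL */
        }

        int main(void)
        {
            unsigned char *table = calloc(1, 1u << 20);  /* stand-in for __vmalloc() */

            if (!table) {
                fprintf(stderr, "allocation failed, caller must retry or bail out\n");
                return 1;
            }
            /* Only interrogate the mapping once the allocation is known to exist. */
            printf("huge mapping: %d\n", looks_huge(table));
            free(table);
            return 0;
        }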

    Link: https://lkml.kernel.org/r/20210915212530.2321545-1-eric.dumazet@gmail.com
    Fixes: 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:18 -04:00
Herton R. Krzesinski 7c794ec2d4 Merge: Backport page unpoisoning fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/490

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

This patchset fixes reference counting issues that still exist in RHEL9 and
can be reproduced by soft poisoning/unpoisoning along with fixes to prevent
silent corruption in tmpfs and shmem when a page is poisoned.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-02-16 22:47:32 +00:00
Aristeu Rozanski d5f97bda11 mm/hwpoison: fix unpoison_memory()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit bf181c582588f8f7406d52f2ee228539b465f173
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Fri Jan 14 14:09:09 2022 -0800

    mm/hwpoison: fix unpoison_memory()

    After recent soft-offline rework, error pages can be taken off from
    buddy allocator, but the existing unpoison_memory() does not properly
    undo the operation.  Moreover, due to the recent change on
    __get_hwpoison_page(), get_page_unless_zero() is hardly called for
    hwpoisoned pages.  So __get_hwpoison_page() highly likely returns -EBUSY
    (meaning to fail to grab page refcount) and unpoison just clears
    PG_hwpoison without releasing a refcount.  That does not lead to a
    critical issue like kernel panic, but unpoisoned pages never get back to
    buddy (leaked permanently), which is not good.

    To (partially) fix this, we need to identify "taken off" pages from
    other types of hwpoisoned pages.  We can't use refcount or page flags
    for this purpose, so a pseudo flag is defined by hacking ->private
    field.  Someone might think that put_page() is enough to cancel
    taken-off pages, but the normal free path contains some operations not
    suitable for the current purpose, and can fire VM_BUG_ON().
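
    A loose user-space model of the "pseudo flag in ->private" idea (the marker
    value, field names and helpers here are invented for illustration):

        #include <stdio.h>

        /* Invented marker value; the kernel uses its own constant and helpers. */
        #define PAGE_WAS_TAKEN_OFF 0x4861UL

        struct fake_page {
            unsigned long flags;        /* bit 0 plays the role of PG_hwpoison here */
            unsigned long private_data; /* models page->private */
        };

        static void take_page_off_buddy(struct fake_page *p)
        {
            p->flags |= 1UL;                        /* page is now "poisoned" */
            p->private_data = PAGE_WAS_TAKEN_OFF;   /* remember how it left buddy */
        }

        static void unpoison(struct fake_page *p)
        {
            p->flags &= ~1UL;
            if (p->private_data == PAGE_WAS_TAKEN_OFF) {
                p->private_data = 0;
                printf("give the page back to the buddy allocator\n");
            } else {
                printf("a plain refcount release is enough\n");
            }
        }

        int main(void)
        {
            struct fake_page p = { 0, 0 };

            take_page_off_buddy(&p);
            unpoison(&p);
            return 0;
        }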

    Note that unpoison_memory() is now supposed to cancel hwpoison events
    injected only by madvise() or
    /sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
    injection, so please don't try to use unpoison when testing with MCE
    injection.

    [lkp@intel.com: report build failure for ARCH=i386]

    Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Ding Hui <dinghui@sangfor.com.cn>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:53 -05:00
Baoquan He 4168f57c4f mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages
This is back ported from upstream directly, no conflict

Bugzilla: https://bugzilla.redhat.com/2024381
Upstream: In Linus's tree

commit c4dc63f0032c77464fbd4e7a6afc22fa6913c4a7
Author: Baoquan He <bhe@redhat.com>
Date:   Fri Jan 14 14:07:44 2022 -0800

    mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages

Signed-off-by: Baoquan He <bhe@redhat.com>
2022-01-24 03:05:38 -05:00
Baoquan He efe0220509 mm_zone: add function to check if managed dma zone exists
This is back ported from upstream directly, no conflict

Bugzilla: https://bugzilla.redhat.com/2024381
Upstream: In Linus's tree

commit 62b3107073646e0946bd97ff926832bafb846d17
Author: Baoquan He <bhe@redhat.com>
Date:   Fri Jan 14 14:07:37 2022 -0800

    mm_zone: add function to check if managed dma zone exists

Signed-off-by: Baoquan He <bhe@redhat.com>
2022-01-24 03:05:37 -05:00
Herton R. Krzesinski f13f32b81b Merge: sched: backports from 5.16 merge window
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/217
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2020279

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2029640

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1921343

Upstream Status: Linux
Tested: By me, with scheduler stress and sanity tests. Boot tested
    on Alderlake for topology changes.

5.16+ scheduler fixes. This includes some commits requested by
the Livepatch team and some AlderLake topology changes. A few
additional patches were pulled in to make the rest apply. With
those and the dependency all patches apply cleanly.

v2: added 3 more commits from sched/urgent.

Added one last (hopefully) fix from sched/urgent.

Signed-off-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Wander Lairson Costa <wander@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-22 10:22:13 -03:00
Phil Auld 0b20e27c69 mm: move node_reclaim_distance to fix NUMA without SMP
Bugzilla: http://bugzilla.redhat.com/2020279

commit 61bb6cd2f765b90cfc5f0f91696889c366a6a13d
Author: Geert Uytterhoeven <geert+renesas@glider.be>
Date:   Fri Nov 5 13:40:24 2021 -0700

    mm: move node_reclaim_distance to fix NUMA without SMP

    Patch series "Fix NUMA without SMP".

    SuperH is the only architecture which still supports NUMA without SMP,
    for good reasons (various memories scattered around the address space,
    each with varying latencies).

    This series fixes two build errors due to variables and functions used
    by the NUMA code being provided by SMP-only source files or sections.

    This patch (of 2):

    If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):

        sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
        page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'

    Fix this by moving the declaration of node_reclaim_distance from an
    SMP-only to a generic file.

    Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
    Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
    Fixes: a55c7454a8 ("sched/topology: Improve load balancing on AMD EPYC systems")
    Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yoshinori Sato <ysato@users.osdn.me>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Gon Solo <gonsolo@gmail.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:51 -05:00
Rafael Aquini a77ba4ce70 mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit eac96c3efdb593df1a57bb5b95dbe037bfa9a522
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu Oct 28 14:36:11 2021 -0700

    mm: filemap: check if THP has hwpoisoned subpage for PMD page fault

    When handling shmem page fault the THP with corrupted subpage could be
    PMD mapped if certain conditions are satisfied.  But kernel is supposed
    to send SIGBUS when trying to map hwpoisoned page.

    There are two paths which may do PMD map: fault around and regular
    fault.

    Before commit f9ce0be71d ("mm: Cleanup faultaround and finish_fault()
    codepaths") the thing was even worse in fault around path.  The THP
    could be PMD mapped as long as the VMA fits regardless what subpage is
    accessed and corrupted.  After this commit as long as head page is not
    corrupted the THP could be PMD mapped.

    In the regular fault path the THP could be PMD mapped as long as the
    corrupted page is not accessed and the VMA fits.

    This loophole could be fixed by iterating every subpage to check if any
    of them is hwpoisoned or not, but it is somewhat costly in page fault
    path.

    So introduce a new page flag called HasHWPoisoned on the first tail
    page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
    subpage of THP is found hwpoisoned by memory failure and after the
    refcount is bumped successfully, then cleared when the THP is freed or
    split.

    The soft offline path doesn't need this since soft offline handler just
    marks a subpage hwpoisoned when the subpage is migrated successfully.
    But shmem THP didn't get split then migrated at all.
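
    A toy model of the summary-flag idea (names and sizes are illustrative, not
    the kernel's page-flag machinery):

        #include <stdbool.h>
        #include <stdio.h>

        #define NR_SUBPAGES 512

        struct fake_thp {
            bool has_hwpoisoned;                  /* one summary bit on the THP */
            bool subpage_poisoned[NR_SUBPAGES];
        };

        static void memory_failure_sketch(struct fake_thp *thp, int subpage)
        {
            thp->subpage_poisoned[subpage] = true;
            thp->has_hwpoisoned = true;           /* set once, cheap to test at fault */
        }

        static void pmd_fault_sketch(const struct fake_thp *thp)
        {
            if (thp->has_hwpoisoned)              /* no 512-subpage scan needed */
                printf("fall back to PTE mapping / SIGBUS handling\n");
            else
                printf("safe to PMD-map the whole THP\n");
        }

        int main(void)
        {
            struct fake_thp thp = { 0 };

            pmd_fault_sketch(&thp);
            memory_failure_sketch(&thp, 37);
            pmd_fault_sketch(&thp);
            return 0;
        }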

    Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
    Fixes: 800d8c63b2 ("shmem: add huge pages support")
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:14 -05:00
Rafael Aquini 7efd7b58cf memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 8dcb3060d81dbfa8d954a2ec64ef7ca330f5bb4d
Author: Shakeel Butt <shakeelb@google.com>
Date:   Thu Oct 28 14:36:04 2021 -0700

    memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNT

    Commit 5c1f4e690e ("mm/vmalloc: switch to bulk allocator in
    __vmalloc_area_node()") switched to bulk page allocator for order 0
    allocation backing vmalloc.  However bulk page allocator does not
    support __GFP_ACCOUNT allocations and there are several users of
    kvmalloc(__GFP_ACCOUNT).

    For now make __GFP_ACCOUNT allocations bypass bulk page allocator.  In
    future if there is workload that can be significantly improved with the
    bulk page allocator with __GFP_ACCOUNT support, we can revisit the
    decision.
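
    The bypass, modelled with a made-up flag bit in a stand-alone sketch:

        #include <stdio.h>

        /* Hypothetical flag bit for the sketch; not the kernel's gfp values. */
        #define SKETCH_GFP_ACCOUNT (1u << 0)

        static int bulk_alloc(unsigned int gfp, int nr)
        {
            if (gfp & SKETCH_GFP_ACCOUNT)
                return 0;    /* bulk path can't charge memcg: allocate nothing */
            return nr;       /* pretend all requested pages were allocated */
        }

        int main(void)
        {
            int nr = 8;
            int got = bulk_alloc(SKETCH_GFP_ACCOUNT, nr);

            /* Caller falls back to the one-page-at-a-time allocator, which
             * does know how to charge the memory cgroup. */
            for (int i = got; i < nr; i++)
                printf("single-page allocation %d (charged to memcg)\n", i);
            return 0;
        }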

    Link: https://lkml.kernel.org/r/20211014151607.2171970-1-shakeelb@google.com
    Fixes: 5c1f4e690e ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()")
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Reported-by: Vasily Averin <vvs@virtuozzo.com>
    Tested-by: Vasily Averin <vvs@virtuozzo.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Roman Gushchin <guro@fb.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:12 -05:00
Rafael Aquini 6c4d18fd2a mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 053cfda102306a3394012f9fe2594811c34925e4
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Wed Sep 8 18:10:11 2021 -0700

    mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype

    If it's not prepared to free unref page, the pcp page migratetype is
    unset.  Thus we will get rubbish from get_pcppage_migratetype() and
    might list_del(&page->lru) again after it has already been deleted from the
    list, leading to complaints about data corruption.
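
    A small stand-alone sketch of the rule "only trust the cached migratetype
    after a successful prepare" (structures and names invented for illustration):

        #include <stdbool.h>
        #include <stdio.h>

        struct fake_page {
            int pcp_migratetype;    /* only meaningful after a successful prepare */
            bool on_list;
        };

        /* Stand-in for free_unref_page_prepare(): may refuse the page entirely. */
        static bool prepare(struct fake_page *p, bool accept)
        {
            if (!accept)
                return false;
            p->pcp_migratetype = 1; /* the cached migratetype is now valid */
            return true;
        }

        static void free_unref_page_sketch(struct fake_page *p, bool accept)
        {
            if (!prepare(p, accept))
                return;             /* never touch the list with an unset value */
            p->on_list = true;
            printf("queued on pcp list %d\n", p->pcp_migratetype);
        }

        int main(void)
        {
            struct fake_page a = { -1, false }, b = { -1, false };

            free_unref_page_sketch(&a, true);
            free_unref_page_sketch(&b, false);  /* rejected: never reaches the list */
            return 0;
        }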

    Link: https://lkml.kernel.org/r/20210902115447.57050-1-linmiaohe@huawei.com
    Fixes: df1acc8569 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:42 -05:00
Rafael Aquini 726fecee67 mm: track present early pages per zone
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 4b0970024408afb17886e0c76e9761c4264db2a8
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:55:19 2021 -0700

    mm: track present early pages per zone

    Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.

    I. Goal

    The goal of this series is improving in-kernel auto-online support.  It
    tackles the fundamental problems that:

     1) We can create zone imbalances when onlining all memory blindly to
        ZONE_MOVABLE, in the worst case crashing the system. We have to know
        upfront how much memory we are going to hotplug such that we can
        safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
        via "online_movable". This is far from practical and only applicable in
        limited setups -- like inside VMs under the RHV/oVirt hypervisor which
        will never hotplug more than 3 times the boot memory (and the
        limitation is only in place due to the Linux limitation).

     2) We see more setups that implement dynamic VM resizing, hot(un)plugging
        memory to resize VM memory. In these setups, we might hotplug a lot of
        memory, but it might happen in various small steps in both directions
        (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
        primary driver of this upstream right now, performing such dynamic
        resizing NUMA-aware via multiple virtio-mem devices.

        Onlining all hotplugged memory to ZONE_NORMAL means we basically have
        no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
        easily run into zone imbalances when growing a VM. We want a mixture,
        and we want as much memory as reasonable/configured in ZONE_MOVABLE.
        Details regarding zone imbalances can be found at [1].

     3) Memory devices consist of 1..X memory block devices, however, the
        kernel doesn't really track the relationship. Consequently, also user
        space has no idea. We want to make per-device decisions.

        As one example, for memory hotunplug it doesn't make sense to use a
        mixture of zones within a single DIMM: we want all MOVABLE if
        possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
        block the whole DIMM from getting hotunplugged.

        As another example, virtio-mem operates on individual units that span
        1..X memory blocks. Similar to a DIMM, we want a unit to either be all
        MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
        all units of a virtio-mem device logically belong together and are
        managed (added/removed) by a single driver. We want as much memory of
        a virtio-mem device to be MOVABLE as possible.

     4) We want memory onlining to be done right from the kernel while adding
        memory, not triggered by user space via udev rules; for example, this
        is required for fast memory hotplug for drivers that add individual
        memory blocks, like virito-mem. We want a way to configure a policy in
        the kernel and avoid implementing advanced policies in user space.

    The auto-onlining support we have in the kernel is not sufficient.  All we
    have is a) online everything MOVABLE (online_movable) b) online everything
    !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
    allows configuring c) to mean instead "online movable if possible
    according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
    -- a new onlining policy.

    II. Approach

    This series does 3 things:

     1) Introduces the "auto-movable" online policy that initially operates on
        individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
        to make a decision whether a memory block will be onlined to
        ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
        memory does not allow for more MOVABLE memory (details in the
        patches). CMA memory is treated like MOVABLE memory.

     2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
        groups and uses group information to make decisions in the
        "auto-movable" online policy across memory blocks of a single memory
        device (modeled as memory group). More details can be found in patch
        #3 or in the DIMM example below.

     3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
        allowing ZONE_NORMAL memory within a dynamic memory group to allow for
        more ZONE_MOVABLE memory within the same memory group. The target use
        case is dynamic VM resizing using virtio-mem. See the virtio-mem
        example below.

    I remember that the basic idea of using a ratio to implement a policy in
    the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
    lost the pointer to that discussion).

    For me, the main use case is using it along with virtio-mem (and DIMMs /
    ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
    amount of memory we can hotunplug reliably again if we might eventually
    hotplug a lot of memory to a VM.

    III. Target Usage

    The target usage will be:

     1) Linux boots with "mhp_default_online_type=offline"

     2) User space (e.g., systemd unit) configures memory onlining (according
        to a config file and system properties), for example:
        * Setting memory_hotplug.online_policy=auto-movable
        * Setting memory_hotplug.auto_movable_ratio=301
        * Setting memory_hotplug.auto_movable_numa_aware=true

     3) User space enables auto onlining via "echo online >
        /sys/devices/system/memory/auto_online_blocks"

     4) User space triggers manual onlining of all already-offline memory
        blocks (go over offline memory blocks and set them to "online")

    IV. Example

    For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
    301% results in the following layout:
            Memory block 0-15:    DMA32   (early)
            Memory block 32-47:   Normal  (early)
            Memory block 48-79:   Movable (DIMM 0)
            Memory block 80-111:  Movable (DIMM 1)
            Memory block 112-143: Movable (DIMM 2)
            Memory block 144-175: Normal  (DIMM 3)
            Memory block 176-207: Normal  (DIMM 4)
            ... all Normal
            (-> hotplugged Normal memory does not allow for more Movable memory)

    For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
    will result in the following layout:
            Memory block 0-15:    DMA32   (early)
            Memory block 32-47:   Normal  (early)
            Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
            Memory block 144:     Normal  (virtio-mem, next 128 MiB)
            Memory block 145-147: Movable (virtio-mem, next 384 MiB)
            Memory block 148:     Normal  (virtio-mem, next 128 MiB)
            Memory block 149-151: Movable (virtio-mem, next 384 MiB)
            ... Normal/Movable mixture as above
            (-> hotplugged Normal memory allows for more Movable memory within
                the same device)

    Which gives us maximum flexibility when dynamically growing/shrinking a
    VM in smaller steps.

    V. Doc Update

    I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
    upstream. Until then, details can be found in patch #2.

    VI. Future Work

     1) Use memory groups for ppc64 dlpar
     2) Being able to specify a portion of (early) kernel memory that will be
        excluded from the ratio. Like "128 MiB globally/per node" are excluded.

        This might be helpful when starting VMs with extremely small memory
        footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
        the first hotplugged units getting onlined to ZONE_MOVABLE. One
        alternative would be a trigger to not consider ZONE_DMA memory
        in the ratio. We'll have to see if this is really required.
     3) Indicate to user space that MOVABLE might be a bad idea -- especially
        relevant when memory ballooning without support for balloon compaction
        is active.

    This patch (of 9):

    For implementing a new memory onlining policy, which determines when to
    online memory blocks to ZONE_MOVABLE semi-automatically, we need the
    number of present early (boot) pages -- present pages excluding hotplugged
    pages.  Let's track these pages per zone.

    Pass a page instead of the zone to adjust_present_page_count(), similar as
    adjust_managed_page_count() and derive the zone from the page.

    It's worth noting that a memory block to be offlined/onlined is either
    completely "early" or "not early".  add_memory() and friends can only add
    complete memory blocks and we only online/offline complete (individual)
    memory blocks.
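
    A stand-alone sketch of the per-zone bookkeeping with the zone derived from
    the page (simplified, hypothetical structures):

        #include <stdbool.h>
        #include <stdio.h>

        struct fake_zone {
            long present_pages;
            long present_early_pages;   /* boot memory only */
        };

        struct fake_page {
            struct fake_zone *zone;
        };

        /* Derive the zone from the page and keep the "early" counter in sync
         * when the adjusted range is boot memory. */
        static void adjust_present_page_count(struct fake_page *page, long nr,
                                              bool early)
        {
            struct fake_zone *zone = page->zone;

            zone->present_pages += nr;
            if (early)
                zone->present_early_pages += nr;
        }

        int main(void)
        {
            struct fake_zone normal = { 0, 0 };
            struct fake_page first = { &normal };

            adjust_present_page_count(&first, 32768, true);    /* boot memory */
            adjust_present_page_count(&first, 32768, false);   /* hotplugged later */
            printf("present=%ld early=%ld\n", normal.present_pages,
                   normal.present_early_pages);
            return 0;
        }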

    Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Marek Kedzierski <mkedzier@redhat.com>
    Cc: Hui Zhu <teawater@gmail.com>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:07 -05:00
Rafael Aquini af51d4fce9 mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 859a85ddf90e714092dea71a0e54c7b9896621be
Author: Mike Rapoport <rppt@kernel.org>
Date:   Tue Sep 7 19:54:52 2021 -0700

    mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE

    Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

    After recent updates to freeing unused parts of the memory map, no
    architecture can have holes in the memory map within a pageblock.  This
    makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
    option redundant.

    The first patch removes them both in a mechanical way and the second patch
    simplifies memory_hotplug::test_pages_in_a_zone() that had
    pfn_valid_within() surrounded by more logic than simple if.

    This patch (of 2):

    After recent changes in freeing of the unused parts of the memory map and
    rework of pfn_valid() in arm and arm64 there are no architectures that can
    have holes in the memory map within a pageblock and so nothing can enable
    CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
    pfn_valid_within().

    With that, pfn_valid_within() is always hardwired to 1 and can be
    completely removed.

    Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

    Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:02 -05:00
Rafael Aquini 2e0da4572f mm/migrate: enable returning precise migrate_pages() success count
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5ac95884a784e822b8cbe3d4bd6e9f96b3b71e3f
Author: Yang Shi <yang.shi@linux.alibaba.com>
Date:   Thu Sep 2 14:59:13 2021 -0700

    mm/migrate: enable returning precise migrate_pages() success count

    Under normal circumstances, migrate_pages() returns the number of pages
    migrated.  In error conditions, it returns an error code.  When returning
    an error code, there is no way to know how many pages were migrated or not
    migrated.

    Make migrate_pages() return how many pages are demoted successfully for
    all cases, including when encountering errors.  Page reclaim behavior will
    depend on this in subsequent patches.
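
    One way to report both an error and the partial success count is an optional
    out-parameter; a user-space sketch of that pattern (names and the error value
    are illustrative, not the kernel API):

        #include <stdio.h>

        static int migrate_items(int total, int fail_at, unsigned int *nr_succeeded)
        {
            unsigned int done = 0;

            for (int i = 0; i < total; i++) {
                if (i == fail_at)
                    break;          /* simulate hitting an error mid-way */
                done++;
            }
            if (nr_succeeded)
                *nr_succeeded = done;   /* report progress even on failure */
            return done == (unsigned int)total ? 0 : -1;
        }

        int main(void)
        {
            unsigned int ok = 0;
            int ret = migrate_items(10, 6, &ok);

            /* The caller can account the 6 migrated items even though ret
             * reports an overall failure. */
            printf("ret=%d, migrated=%u\n", ret, ok);
            return 0;
        }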

    Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
    Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:08 -05:00
Rafael Aquini 489bee842d mm/numa: automatically generate node migration order
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 79c28a41672278283fa72e03d0bf80e6644d4ac4
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Thu Sep 2 14:59:06 2021 -0700

    mm/numa: automatically generate node migration order

    Patch series "Migrate Pages in lieu of discard", v11.

    We're starting to see systems with more and more kinds of memory such as
    Intel's implementation of persistent memory.

    Let's say you have a system with some DRAM and some persistent memory.
    Today, once DRAM fills up, reclaim will start and some of the DRAM
    contents will be thrown out.  Allocations will, at some point, start
    falling over to the slower persistent memory.

    That has two nasty properties.  First, the newer allocations can end up in
    the slower persistent memory.  Second, reclaimed data in DRAM are just
    discarded even if there are gobs of space in persistent memory that could
    be used.

    This patchset implements a solution to these problems.  At the end of the
    reclaim process in shrink_page_list() just before the last page refcount
    is dropped, the page is migrated to persistent memory instead of being
    dropped.

    While I've talked about a DRAM/PMEM pairing, this approach would function
    in any environment where memory tiers exist.

    This is not perfect.  It "strands" pages in slower memory and never brings
    them back to fast DRAM.  Huang Ying has follow-on work which repurposes
    NUMA balancing to promote hot pages back to DRAM.

    This is also all based on an upstream mechanism that allows persistent
    memory to be onlined and used as if it were volatile:

            http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

    With that, the DRAM and PMEM in each socket will be represented as 2
    separate NUMA nodes, with the CPUs sit in the DRAM node.  So the
    general inter-NUMA demotion mechanism introduced in the patchset can
    migrate the cold DRAM pages to the PMEM node.

    We have tested the patchset with the postgresql and pgbench.  On a
    2-socket server machine with DRAM and PMEM, the kernel with the patchset
    can improve the score of pgbench up to 22.1% compared with that of the
    DRAM only + disk case.  This comes from the reduced disk read throughput
    (which reduces up to 70.8%).

    == Open Issues ==

     * Memory policies and cpusets that, for instance, restrict allocations
       to DRAM can be demoted to PMEM whenever they opt in to this
       new mechanism.  A cgroup-level API to opt-in or opt-out of
       these migrations will likely be required as a follow-on.
     * Could be more aggressive about where anon LRU scanning occurs
       since it no longer necessarily involves I/O.  get_scan_count()
       for instance says: "If we have no swap space, do not bother
       scanning anon pages"

    This patch (of 9):

    Prepare for the kernel to auto-migrate pages to other memory nodes with a
    node migration table.  This allows creating single migration target for
    each NUMA node to enable the kernel to do NUMA page migrations instead of
    simply discarding colder pages.  A node with no target is a "terminal
    node", so reclaim acts normally there.  The migration target does not
    fundamentally _need_ to be a single node, but this implementation starts
    there to limit complexity.

    When memory fills up on a node, memory contents can be automatically
    migrated to another node.  The biggest problems are knowing when to
    migrate and to where the migration should be targeted.

    The most straightforward way to generate the "to where" list would be to
    follow the page allocator fallback lists.  Those lists already tell us if
    memory is full where to look next.  It would also be logical to move
    memory in that order.

    But, the allocator fallback lists have a fatal flaw: most nodes appear in
    all the lists.  This would potentially lead to migration cycles (A->B,
    B->A, A->B, ...).

    Instead of using the allocator fallback lists directly, keep a separate
    node migration ordering.  But, reuse the same data used to generate page
    allocator fallback in the first place: find_next_best_node().

    This means that the firmware data used to populate node distances
    essentially dictates the ordering for now.  It should also be
    architecture-neutral since all NUMA architectures have a working
    find_next_best_node().

    RCU is used to allow lock-less read of node_demotion[] and prevent
    demotion cycles from being observed.  If multiple reads of node_demotion[] are
    performed, a single rcu_read_lock() must be held over all reads to ensure
    no cycles are observed.  Details are as follows.

    === What does RCU provide? ===

    Imagine a simple loop which walks down the demotion path looking
    for the last node:

            terminal_node = start_node;
            while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                    terminal_node = node_demotion[terminal_node];
            }

    The initial values are:

            node_demotion[0] = 1;
            node_demotion[1] = NUMA_NO_NODE;

    and are updated to:

            node_demotion[0] = NUMA_NO_NODE;
            node_demotion[1] = 0;

    What guarantees that the cycle is not observed:

            node_demotion[0] = 1;
            node_demotion[1] = 0;

    and would loop forever?

    With RCU, a rcu_read_lock/unlock() can be placed around the loop.  Since
    the write side does a synchronize_rcu(), the loop that observed the old
    contents is known to be complete before the synchronize_rcu() has
    completed.

    RCU, combined with disable_all_migrate_targets(), ensures that the old
    migration state is not visible by the time __set_migration_target_nodes()
    is called.

    === What does READ_ONCE() provide? ===

    READ_ONCE() forbids the compiler from merging or reordering successive
    reads of node_demotion[].  This ensures that any updates are *eventually*
    observed.

    Consider the above loop again.  The compiler could theoretically read the
    entirety of node_demotion[] into local storage (registers) and never go
    back to memory, and *permanently* observe bad values for node_demotion[].
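
    A compilable sketch of the walk, with READ_ONCE() modelled as the volatile
    access it roughly boils down to for scalar values:

        #include <stdio.h>

        /* Rough stand-in for the kernel's READ_ONCE() on word-sized values:
         * a volatile access the compiler may not merge or cache forever. */
        #define READ_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

        #define NUMA_NO_NODE (-1)

        static int node_demotion[2] = { 1, NUMA_NO_NODE };

        int main(void)
        {
            int terminal_node = 0;

            /* In the kernel, rcu_read_lock()/rcu_read_unlock() bracket this walk. */
            while (READ_ONCE_SKETCH(node_demotion[terminal_node]) != NUMA_NO_NODE)
                terminal_node = READ_ONCE_SKETCH(node_demotion[terminal_node]);

            printf("terminal node: %d\n", terminal_node);
            return 0;
        }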

    Note: RCU does not provide any universal compiler-ordering
    guarantees:

            https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

    This code is unused for now.  It will be called later in the
    series.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:06 -05:00
Rafael Aquini 8e2f2a5c11 mm/page_alloc.c: use in_task()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 88dc6f208829cfdbc0b96495c5c73a6af0559300
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Thu Sep 2 14:58:13 2021 -0700

    mm/page_alloc.c: use in_task()

    The obsolete in_interrupt() also covers task context with BH disabled, so
    it's better to use in_task() instead.

    Link: https://lkml.kernel.org/r/877caa99-1994-5545-92d2-d0bb2e394182@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:54 -05:00
Rafael Aquini 971ec3055a mm/page_alloc: make alloc_node_mem_map() __init rather than __ref
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 3b446da6be7a722d769e23f68dbaf4ebb2eda542
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Sep 2 14:58:10 2021 -0700

    mm/page_alloc: make alloc_node_mem_map() __init rather than __ref

    alloc_node_mem_map() is only called from free_area_init_node(), which is
    an __init function.

    Make the actual alloc_node_mem_map() also __init and its stub version
    static inline.

    Link: https://lkml.kernel.org/r/20210716064124.31865-1-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:54 -05:00
Rafael Aquini 6e4487e327 mm/page_alloc.c: fix 'zone_id' may be used uninitialized in this function warning
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit b346075fcf5dda7f9e9ae671703aae60e8a94564
Author: Nico Pache <npache@redhat.com>
Date:   Thu Sep 2 14:58:08 2021 -0700

    mm/page_alloc.c: fix 'zone_id' may be used uninitialized in this function warning

    When compiling with -Werror, cc1 warns that 'zone_id' may be used
    uninitialized in this function.

    Initialize the zone_id as 0.

    It's safe to assume that if the code reaches this point it has at least one
    NUMA node with memory, so there is no need for an assertion before
    init_unavailable_range().

    Link: https://lkml.kernel.org/r/20210716210336.1114114-1-npache@redhat.com
    Fixes: 122e093c17 ("mm/page_alloc: fix memory map initialization for descending nodes")
    Signed-off-by: Nico Pache <npache@redhat.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:53 -05:00
Rafael Aquini f1e6a8f806 mm: introduce memmap_alloc() to unify memory map allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit c803b3c8b3b70f306ee6300bf8acdd70ffd1441a
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Sep 2 14:58:02 2021 -0700

    mm: introduce memmap_alloc() to unify memory map allocation

    There are several places that allocate memory for the memory map:
    alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and
    __populate_section_memmap() for SPARSEMEM.

    The memory allocated in the FLATMEM case is zeroed and it is never
    poisoned, regardless of CONFIG_PAGE_POISON setting.

    The memory allocated in the SPARSEMEM cases is not zeroed and it is
    implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set.

    Introduce memmap_alloc() wrapper for memblock allocators that will be used
    for both FLATMEM and SPARSEMEM cases and will make memory map zeroing and
    poisoning consistent for different memory models.
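
    A rough user-space sketch of the "one wrapper, consistent behaviour" point;
    the in-kernel wrapper chooses between memblock allocators rather than
    memset()ing, so the details below are illustrative only:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Toggle standing in for the page-poisoning configuration. */
        static const int page_poisoning_enabled = 1;

        /* One wrapper so FLATMEM- and SPARSEMEM-style callers get identical
         * zeroing/poisoning behaviour instead of each picking its own allocator. */
        static void *memmap_alloc_sketch(size_t size)
        {
            void *map = malloc(size);

            if (!map)
                return NULL;
            if (page_poisoning_enabled)
                memset(map, 0xaa, size);   /* poison pattern, illustrative */
            else
                memset(map, 0, size);
            return map;
        }

        int main(void)
        {
            unsigned char *map = memmap_alloc_sketch(4096);

            if (map)
                printf("first byte: %#x\n", map[0]);
            free(map);
            return 0;
        }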

    Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Michal Simek <monstr@monstr.eu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:51 -05:00
Rafael Aquini 7056abe778 mm/page_alloc: always initialize memory map for the holes
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit c3ab6baf6a004eab7344a1d8880a971f2414e1b6
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Sep 2 14:57:56 2021 -0700

    mm/page_alloc: always initialize memory map for the holes

    Patch series "mm: ensure consistency of memory map poisoning".

    Currently memory map allocation for FLATMEM case does not poison the
    struct pages regardless of CONFIG_PAGE_POISON setting.

    This happens because allocation of the memory map for FLATMEM and SPARSEMEM
    uses different memblock functions and those that are used for the SPARSEMEM case
    (namely memblock_alloc_try_nid_raw() and memblock_alloc_exact_nid_raw())
    implicitly poison the allocated memory.

    Another side effect of this implicit poisoning is that early setup code
    that uses the same functions to allocate memory burns cycles for the
    memory poisoning even if it was not intended.

    These patches introduce memmap_alloc() wrapper that ensure that the memory
    map allocation is consistent for different memory models.

    This patch (of 4):

    Currently memory map for the holes is initialized only when SPARSEMEM
    memory model is used.  Yet, even with FLATMEM there could be holes in the
    physical memory layout that have memory map entries.

    For instance, the memory reserved using e820 API on i386 or
    "reserved-memory" nodes in device tree would not appear in memblock.memory
    and hence the struct pages for such holes will be skipped during memory
    map initialization.

    These struct pages will be zeroed because the memory map for FLATMEM
    systems is allocated with memblock_alloc_node() that clears the allocated
    memory.  While zeroed struct pages do not cause immediate problems, the
    correct behaviour is to initialize every page using __init_single_page().
    Besides, enabling page poison for FLATMEM case will trigger
    PF_POISONED_CHECK() unless the memory map is properly initialized.

    Make sure init_unavailable_range() is called for both SPARSEMEM and
    FLATMEM so that struct pages representing memory holes would appear as
    PG_Reserved with any memory layout.

    [rppt@kernel.org: fix microblaze]
      Link: https://lkml.kernel.org/r/YQWW3RCE4eWBuMu/@kernel.org

    Link: https://lkml.kernel.org/r/20210714123739.16493-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210714123739.16493-2-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Tested-by: Guenter Roeck <linux@roeck-us.net>
    Cc: Michal Simek <monstr@monstr.eu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:51 -05:00
Rafael Aquini 381a7b55bb mm: add kernel_misc_reclaimable in show_free_areas
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit eb2169cee36fc492407a2d483c286b977a46288a
Author: liuhailong <liuhailong@oppo.com>
Date:   Thu Sep 2 14:53:01 2021 -0700

    mm: add kernel_misc_reclaimable in show_free_areas

    Print NR_KERNEL_MISC_RECLAIMABLE stat from show_free_areas() so users can
    check whether the shrinker is working correctly and to show the current
    memory usage.

    Link: https://lkml.kernel.org/r/20210813104725.4562-1-liuhailong@oppo.com
    Signed-off-by: liuhailong <liuhailong@oppo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:44 -05:00
Rafael Aquini 8715929b88 mm: report a more useful address for reclaim acquisition
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 4f3eaf452a14ff3982f71c1ca8bdf757254231fa
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Sep 2 14:52:58 2021 -0700

    mm: report a more useful address for reclaim acquisition

    A recent lockdep report included these lines:

    [   96.177910] 3 locks held by containerd/770:
    [   96.177934]  #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3},
    at: do_user_addr_fault+0x115/0x770
    [   96.177999]  #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at:
    get_swap_device+0x33/0x140
    [   96.178057]  #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at:
    __fs_reclaim_acquire+0x5/0x30

    While it was not useful to that bug report to know where the reclaim lock
    had been acquired, it might be useful under other circumstances.  Allow
    the caller of __fs_reclaim_acquire to specify the instruction pointer to
    use.
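
    Roughly, the public wrapper records its own caller and forwards the
    instruction pointer to lockdep (sketch with the gfp checks elided):

        static void __fs_reclaim_acquire(unsigned long ip)
        {
                /* lockdep reports 'ip' as the acquisition site */
                lock_acquire_exclusive(&__fs_reclaim_map, 0, 0, NULL, ip);
        }

        void fs_reclaim_acquire(gfp_t gfp_mask)
        {
                /* gfp_mask checks elided for brevity */
                __fs_reclaim_acquire(_RET_IP_); /* the wrapper's caller */
        }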

    Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:43 -05:00
Doug Berger 47aef6010b mm/page_alloc: don't corrupt pcppage_migratetype
When placing pages on a pcp list, migratetype values over
MIGRATE_PCPTYPES get added to the MIGRATE_MOVABLE pcp list.

However, the actual migratetype is preserved in the page and should
not be changed to MIGRATE_MOVABLE or the page may end up on the wrong
free_list.

The impact is that HIGHATOMIC or CMA pages getting bulk freed from the
PCP lists could potentially end up on the wrong buddy list.  There are
various consequences but minimally NR_FREE_CMA_PAGES accounting could
get screwed up.
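
One way to express the fix, as a rough sketch rather than the literal diff:
select the pcp list from a local copy of the migratetype while leaving the
value stored in the page untouched.

    void free_unref_page(struct page *page)
    {
            unsigned long flags;
            unsigned long pfn = page_to_pfn(page);
            int migratetype;

            if (!free_unref_page_prepare(page, pfn))
                    return;

            /*
             * The page keeps its real migratetype; only the list selection
             * is clamped to MIGRATE_MOVABLE for HIGHATOMIC/CMA pages.
             */
            migratetype = get_pcppage_migratetype(page);
            if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
                    if (is_migrate_isolate(migratetype)) {
                            free_one_page(page_zone(page), page, pfn, 0,
                                          migratetype, FPI_NONE);
                            return;
                    }
                    migratetype = MIGRATE_MOVABLE;
            }

            local_lock_irqsave(&pagesets.lock, flags);
            free_unref_page_commit(page, pfn, migratetype);
            local_unlock_irqrestore(&pagesets.lock, flags);
    }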

[mgorman@techsingularity.net: changelog update]

Link: https://lkml.kernel.org/r/20210811182917.2607994-1-opendmb@gmail.com
Fixes: df1acc8569 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20 11:31:42 -07:00
Sergei Trofimovich 69e5d322a2 mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
To reproduce the failure we need the following system:

 - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0

 - kernel config:
    * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
    * CONFIG_INIT_ON_FREE_DEFAULT_ON=y
    * CONFIG_PAGE_POISONING=y

Resulting in:

    0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G     U     O      5.13.1-gentoo-x86_64 #1
    Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
    Call Trace:
     dump_stack+0x64/0x7c
     __kernel_unpoison_pages.cold+0x48/0x84
     post_alloc_hook+0x60/0xa0
     get_page_from_freelist+0xdb8/0x1000
     __alloc_pages+0x163/0x2b0
     __get_free_pages+0xc/0x30
     pgd_alloc+0x2e/0x1a0
     mm_init+0x185/0x270
     dup_mm+0x6b/0x4f0
     copy_process+0x190d/0x1b10
     kernel_clone+0xba/0x3b0
     __do_sys_clone+0x8f/0xb0
     do_syscall_64+0x68/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

Before commit 51cba1ebc6 ("init_on_alloc: Optimize static branches")
init_on_alloc never enabled the static branch by default.  It could only be
enabled explicitly by init_mem_debugging_and_hardening().

But after commit 51cba1ebc6, a static branch could already be enabled
by default.  There was no code to ever disable it.  That caused
page_poison=1 / init_on_free=1 conflict.

This change extends init_mem_debugging_and_hardening() so that it can also
disable the static branch when it should not be enabled.
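
A sketch of the intended behaviour in init_mem_debugging_and_hardening()
(simplified; the init_on_free static branch is handled the same way):

    if (_init_on_alloc_enabled_early) {
            if (page_poisoning_requested)
                    pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n");
            else
                    static_branch_enable(&init_on_alloc);
    } else {
            /* Previously missing: a default-on branch was never undone. */
            static_branch_disable(&init_on_alloc);
    }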

Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
Fixes: 51cba1ebc6 ("init_on_alloc: Optimize static branches")
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Reported-by: <bowsingbetee@pm.me>
Tested-by: <bowsingbetee@protonmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Chuck Lever 061478438d mm/page_alloc: further fix __alloc_pages_bulk() return value
The author of commit b3b64ebd38 ("mm/page_alloc: do bulk array
bounds check after checking populated elements") was possibly
confused by the mixture of return values throughout the function.

The API contract is clear that the function "Returns the number of pages
on the list or array." It does not list zero as a unique return value with
a special meaning.  Therefore zero is a plausible return value only if
@nr_pages is zero or less.

Clean up the return logic to make it clear that the returned value is
always the total number of pages in the array/list, not the number of
pages that were allocated during this call.

The only change in behavior with this patch is the value returned if
prepare_alloc_pages() fails.  To match the API contract, the number of
pages currently in the array/list is returned in this case.
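
The resulting control flow, sketched with the allocation loop elided:

    unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
                                     nodemask_t *nodemask, int nr_pages,
                                     struct list_head *page_list,
                                     struct page **page_array)
    {
            struct alloc_context ac;
            gfp_t alloc_gfp;
            unsigned int alloc_flags = ALLOC_WMARK_LOW;
            int nr_populated = 0;

            /* Count the array slots the caller has already filled in. */
            while (page_array && nr_populated < nr_pages &&
                   page_array[nr_populated])
                    nr_populated++;

            if (unlikely(nr_pages <= 0))
                    goto out;

            if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
                                     &alloc_gfp, &alloc_flags))
                    goto out;       /* not 0: report what is already there */

            /* ... allocation loop appends to page_list/page_array ... */

    out:
            return nr_populated;    /* always the total in the array/list */
    }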

The call site in __page_pool_alloc_pages_slow() also seems to be confused
on this matter.  It should be attended to by someone who is familiar with
that code.

[mel@techsingularity.net: Return nr_populated if 0 pages are requested]

Link: https://lkml.kernel.org/r/20210713152100.10381-4-mgorman@techsingularity.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Cc: Zhang Qiang <Qiang.Zhang@windriver.com>
Cc: Yanfei Xu <yanfei.xu@windriver.com>
Cc: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-15 10:13:49 -07:00
Yanfei Xu e5c15cea33 mm/page_alloc: correct return value when failing at preparing
If the array passed in is already partially populated, we should return
"nr_populated" even failing at preparing arguments stage.

Link: https://lkml.kernel.org/r/20210713152100.10381-3-mgorman@techsingularity.net
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20210709102855.55058-1-yanfei.xu@windriver.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-15 10:13:49 -07:00
Mel Gorman 187ad460b8 mm/page_alloc: avoid page allocator recursion with pagesets.lock held
Syzbot is reporting potential deadlocks due to pagesets.lock when
PAGE_OWNER is enabled.  One example from Desmond Cheong Zhi Xi is as
follows

  __alloc_pages_bulk()
    local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
    prep_new_page():
      post_alloc_hook():
        set_page_owner():
          __set_page_owner():
            save_stack():
              stack_depot_save():
                alloc_pages():
                  alloc_page_interleave():
                    __alloc_pages():
                      get_page_from_freelist():
                        rmqueue():
                          rmqueue_pcplist():
                            local_lock_irqsave(&pagesets.lock, flags);
                            *** DEADLOCK ***

Zhang, Qiang also reported

  BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  .....
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
  ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
  prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
  __alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
  alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
  alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
  stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
  save_stack+0x15e/0x1e0 mm/page_owner.c:120
  __set_page_owner+0x50/0x290 mm/page_owner.c:181
  prep_new_page mm/page_alloc.c:2445 [inline]
  __alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
  alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
  vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
  __vmalloc_area_node mm/vmalloc.c:2845 [inline]
  __vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
  __vmalloc_node mm/vmalloc.c:2996 [inline]
  vzalloc+0x67/0x80 mm/vmalloc.c:3066

There are a number of ways it could be fixed.  The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail.  The bulk allocator could
add a special case to release/reacquire the lock for prep_new_page and
lookup PCP after the lock is reacquired at the cost of performance.  The
pages requiring prep could be tracked using the least significant bit and
looping through the array although it is more complicated for the list
interface.  The options are relatively complex and the second one still
incurs a performance penalty when PAGE_OWNER is active so this patch takes
the simple approach -- disable bulk allocation if PAGE_OWNER is active.
The caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.
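
The gist of the fix is an early bail-out in __alloc_pages_bulk() along
these lines (sketch):

    #ifdef CONFIG_PAGE_OWNER
            /*
             * PAGE_OWNER may recurse into the allocator via stack_depot_save()
             * while pagesets.lock is held. Fall back to the single-page
             * allocator in that case.
             */
            if (static_branch_unlikely(&page_owner_inited))
                    goto failed;
    #endif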

Link: https://lkml.kernel.org/r/20210708081434.GV3840@techsingularity.net
Fixes: dbbee9d5cd ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: "Zhang, Qiang" <Qiang.Zhang@windriver.com>
Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-15 10:13:49 -07:00
Matteo Croce 54aa386661 Revert "mm/page_alloc: make should_fail_alloc_page() static"
This reverts commit f717309003.

Fix an unresolved symbol error when CONFIG_DEBUG_INFO_BTF=y:

    LD      vmlinux
    BTFIDS  vmlinux
  FAILED unresolved symbol should_fail_alloc_page
  make: *** [Makefile:1199: vmlinux] Error 255
  make: *** Deleting file 'vmlinux'

Link: https://lkml.kernel.org/r/20210708191128.153796-1-mcroce@linux.microsoft.com
Fixes: f717309003 ("mm/page_alloc: make should_fail_alloc_page() static")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-15 10:13:49 -07:00
Mel Gorman 6bce244390 mm/page_alloc: Revert pahole zero-sized workaround
Commit dbbee9d5cd ("mm/page_alloc: convert per-cpu list protection to
local_lock") folded in a workaround patch for pahole that was unable to
deal with zero-sized percpu structures.

A superior workaround is achieved with commit a0b8200d06 ("kbuild:
skip per-CPU BTF generation for pahole v1.18-v1.21").

This patch reverts the dummy field and the pahole version check.

Fixes: dbbee9d5cd ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-10 16:09:39 -07:00
Linus Torvalds 71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Mel Gorman f717309003 mm/page_alloc: make should_fail_alloc_page() static
make W=1 generates the following warning for mm/page_alloc.c

  mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
   noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
                 ^~~~~~~~~~~~~~~~~~~~~~

This function is deliberately split out for BPF to allow errors to be
injected.  The function is not used anywhere else so it is local to the
file.  Make it static which should still allow error injection to be used
similar to how block/blk-core.c:should_fail_bio() works.
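
For reference, the pattern being mirrored is a static noinline function
annotated for error injection, roughly:

    #include <linux/error-injection.h>

    static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
            return __should_fail_alloc_page(gfp_mask, order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);

As the revert further up in this log notes, CONFIG_DEBUG_INFO_BTF still
needed the global symbol, which is why this change was later undone.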

Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:02 -07:00
Zhen Lei 041711ce7c mm: fix spelling mistakes
Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness

Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:02 -07:00
Mike Kravetz 7118fc2906 hugetlb: address ref count racing in prep_compound_gigantic_page
In [1], Jann Horn points out a possible race between
prep_compound_gigantic_page and __page_cache_add_speculative.  The root
cause of the possible race is prep_compound_gigantic_page unconditionally
setting the ref count of pages to zero.  It does this because
prep_compound_gigantic_page is handed a 'group' of pages from an allocator
and needs to convert that group of pages to a compound page.  The ref
count of each page in this 'group' is one as set by the allocator.
However, the ref count of compound page tail pages must be zero.

The potential race comes about when ref counted pages are returned from
the allocator.  When this happens, other mm code could also take a
reference on the page.  __page_cache_add_speculative is one such example.
Therefore, prep_compound_gigantic_page can not just set the ref count of
pages to zero as it does today.  Doing so would lose the reference taken
by any other code.  This would lead to BUGs in code checking ref counts
and could possibly even lead to memory corruption.

There are two possible ways to address this issue.

1) Make all allocators of gigantic groups of pages be able to return a
   properly constructed compound page.

2) Make prep_compound_gigantic_page be more careful when constructing a
   compound page.

This patch takes approach 2.

In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
that the extra ref count will be dropped during an rcu grace period.  This
is not a performance critical code path and the wait should be
acceptable.  If the ref count is still inflated after the grace period,
then undo any modifications made and return an error.
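
A sketch of how that description can map to code; page_ref_freeze() is the
cmpxchg-based helper, and the retry after synchronize_rcu() shown here is
illustrative:

    /* Tail pages of the compound page must end up with a zero ref count. */
    for (i = 1; i < nr_pages; i++) {
            struct page *p = page + i;

            if (!page_ref_freeze(p, 1)) {
                    /*
                     * A speculative reference (e.g. from
                     * __page_cache_add_speculative) is transiently held.
                     * Wait a grace period and retry once before failing.
                     */
                    synchronize_rcu();
                    if (!page_ref_freeze(p, 1))
                            goto out_error; /* caller frees the pages */
            }
            set_compound_head(p, page);
    }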

Currently prep_compound_gigantic_page is type void and does not return
errors.  Modify the two callers to check for and handle error returns.  On
error, the caller must free the 'group' of pages as they can not be used
to form a gigantic page.  After freeing pages, the runtime caller
(alloc_fresh_huge_page) will retry the allocation once.  Boot time
allocations can not be retried.

The routine prep_compound_page also unconditionally sets the ref count of
compound page tail pages to zero.  However, in this case the buddy
allocator is constructing a compound page from freshly allocated pages.
The ref count on those freshly allocated pages is already zero, so the
set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
remove it.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
Fixes: 58a84aa927 ("thp: set compound tail page _count to zero")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Youquan Song <youquan.song@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Linus Torvalds 65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Mel Gorman 203c06eef5 mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device.

  Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
  ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
  use" mode and being managed via the buddy just like the normal RAM.

  The PMEM zones are big ones:

        present  65011712 = 248 G
        high       134595 = 525 M

  The PMEM nodes, of course, don't have any CPUs in them.

  With your series, the pcp->high value per-cpu is 69584 pages or about
  270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
  worst-case memory in the pcps per zone, or roughly 10% of the size of
  the zone.

This should not cause a problem as such although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node.  It
is not possible to treat cpuless nodes exactly the same as normal nodes
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.
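
Sketch of the adjusted calculation (simplified from the zone_highsize()
logic introduced by this series):

    static int zone_highsize(struct zone *zone, int batch, int cpu_online)
    {
            unsigned long total_pages = low_wmark_pages(zone);
            int nr_split_cpus, high;

            /*
             * Normally pcp->high is split across the CPUs local to the zone.
             * For cpuless (e.g. PMEM) nodes that count is zero, so fall back
             * to all online CPUs to bound the worst case.
             */
            nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)))
                            + cpu_online;
            if (!nr_split_cpus)
                    nr_split_cpus = num_online_cpus();

            high = total_pages / nr_split_cpus;
            return max(high, batch << 2);
    }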

Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.net
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Tang, Feng" <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman 44042b4498 mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
The per-cpu page allocator (PCP) only stores order-0 pages.  This means
that all THP and "cheap" high-order allocations including SLUB contends on
the zone->lock.  This patch extends the PCP allocator to store THP and
"cheap" high-order pages.  Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.

Note that this is not necessarily a universal performance win because of
how it is implemented.  High-order pages can cause pcp->high to be
exceeded prematurely for lower orders so, for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.

That said, basic performance testing passed.  The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.

netperf-udp
                                 5.13.0-rc2             5.13.0-rc2
                           mm-pcpburst-v3r4   mm-pcphighorder-v1r7
Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*

Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations.  The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.

Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mike Rapoport 43b02ba93b mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mike Rapoport a9ee6cf5c6 mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
		$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

[rppt@linux.ibm.com: fix arm boot crash]
  Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mike Rapoport bb1c50d396 mm: remove CONFIG_DISCONTIGMEM
There are no architectures that support DISCONTIGMEM left.

Remove the configuration option and the dead code it was guarding in the
generic memory management code.

Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman 21d02f8f84 mm/page_alloc: move free_the_page
Patch series "Allow high order pages to be stored on PCP", v2.

The per-cpu page allocator (PCP) only handles order-0 pages.  With the
series "Use local_lock for pcp protection and reduce stat overhead" and
"Calculate pcp->high based on zone sizes and active CPUs", it's now
feasible to store high-order pages on PCP lists.

This small series allows PCP to store "cheap" orders where cheap is
determined by PAGE_ALLOC_COSTLY_ORDER and THP-sized allocations.

This patch (of 2):

In the next patch, free_compound_page is going to use the common helper
free_the_page.  This patch moves the definition to ease review.  No
functional change.

Link: https://lkml.kernel.org/r/20210603142220.10851-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210603142220.10851-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Liu Shixin f7ec104458 mm/page_alloc: fix counting of managed_pages
commit f63661566f ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if
the zone is empty") clears out zone->lowmem_reserve[] if zone is empty.
But when zone is not empty and sysctl_lowmem_reserve_ratio[i] is set to
zero, zone_managed_pages(zone) is not counted in the managed_pages either.
This is inconsistent with the description of lowmem_reserve, so fix it.

Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com
Fixes: f63661566f ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Dong Aisheng e47aa90568 mm/page_alloc: improve memmap_pages dbg msg
Make debug message more accurate.

Link: https://lkml.kernel.org/r/20210531091908.1738465-6-aisheng.dong@nxp.com
Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman 74f4482209 mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention.  However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled.  This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.

  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=8
  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

[mgorman@techsingularity.net: fix documentation]
  Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman c49c2c47da mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
When kswapd is active then direct reclaim is potentially active.  In
either case, it is possible that a zone would be balanced if pages were
not trapped on PCP lists.  Instead of draining remote pages, simply limit
the size of the PCP lists while kswapd is active.
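
The limit can be applied at free time, roughly like this (sketch):

    static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
    {
            int high = READ_ONCE(pcp->high);

            if (unlikely(!high))
                    return 0;

            if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
                    return high;

            /* Reclaim is active: keep only a few batches on the PCP lists. */
            return min(READ_ONCE(pcp->batch) << 2, high);
    }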

Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 3b12e7e979 mm/page_alloc: scale the number of pages that are batch freed
When a task is freeing a large number of order-0 pages, it may acquire the
zone->lock multiple times freeing pages in batches.  This may
unnecessarily contend on the zone lock when freeing very large number of
pages.  This patch adapts the size of the batch based on the recent
pattern to scale the batch size for subsequent frees.

As the machines I used to test this are not large enough to illustrate a
problem, a debugging patch shows patterns like the following (slightly
edited for clarity)

Baseline vanilla kernel
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378

With patches
  time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814

Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 04f8cfeaed mm/page_alloc: adjust pcp->high after CPU hotplug events
The PCP high watermark is based on the number of online CPUs so the
watermarks must be adjusted during CPU hotplug.  At the time of
hot-remove, the number of online CPUs is already adjusted but during
hot-add, a delta needs to be applied to update PCP to the correct value.
After this patch is applied, the high watermarks are adjusted correctly.

  # grep high: /proc/zoneinfo  | tail -1
              high:  649
  # echo 0 > /sys/devices/system/cpu/cpu4/online
  # grep high: /proc/zoneinfo  | tail -1
              high:  664
  # echo 1 > /sys/devices/system/cpu/cpu4/online
  # grep high: /proc/zoneinfo  | tail -1
              high:  649

Link: https://lkml.kernel.org/r/20210525080119.5455-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman b92ca18e8c mm/page_alloc: disassociate the pcp->high from pcp->batch
The pcp high watermark is based on the batch size but there is no
relationship between them other than it is convenient to use early in
boot.

This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone while the batch
size remains the same to avoid increasing allocation latencies.  The
intent behind the default pcp->high is "set the number of PCP pages such
that if they are all full that background reclaim is not started
prematurely".

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events which is handled later in the
series.

On a test KVM instance;

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  378
              batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

[mgorman@techsingularity.net:  fix __setup_per_zone_wmarks for parallel memory
hotplug]
  Link: https://lkml.kernel.org/r/20210528105925.GN30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman bbbecb35a4 mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone.  With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static.  It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem.  Another is a
bug report based on an older kernel where a multi-terabyte process can
take several minutes to exit.  A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies.  Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally).  To store high-order pages on PCP, the
pcp->high values need to be increased first.

This patch (of 6):

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem.  While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems.  For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.

Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Minchan Kim 151e084af4 mm: page_alloc: dump migrate-failed pages only at -EBUSY
alloc_contig_dump_pages() aims to help debug page migration failures
caused by a page refcount elevated above the expected_count.  (For the
details, please look at migrate_page_move_mapping.)

However, -ENOMEM simply means the system is under memory pressure and has
nothing to do with the page refcount.  Thus, dumping the page list is not
helpful from a debugging point of view.

Link: https://lkml.kernel.org/r/YKa2Wyo9xqIErpfa@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 902499937e mm/page_alloc: update PGFREE outside the zone lock in __free_pages_ok
VM events do not need explicit protection by disabling IRQs so update the
counter with IRQs enabled in __free_pages_ok.

Link: https://lkml.kernel.org/r/20210512095458.30632-10-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman df1acc8569 mm/page_alloc: avoid conflating IRQs disabled with zone->lock
Historically when freeing pages, free_one_page() assumed that callers had
IRQs disabled and the zone->lock could be acquired with spin_lock().  This
confuses the scope of what local_lock_irq is protecting and what
zone->lock is protecting in free_unref_page_list in particular.

This patch uses spin_lock_irqsave() for the zone->lock in free_one_page()
instead of relying on callers to have disabled IRQs.
free_unref_page_commit() is changed to only deal with PCP pages protected
by the local lock.  free_unref_page_list() then first frees isolated pages
to the buddy lists with free_one_page() and frees the rest of the pages to
the PCP via free_unref_page_commit().  The end result is that
free_one_page() is no longer depending on side-effects of local_lock to be
correct.
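
After the change, free_one_page() takes the zone lock itself, roughly:

    static void free_one_page(struct zone *zone, struct page *page,
                              unsigned long pfn, unsigned int order,
                              int migratetype, fpi_t fpi_flags)
    {
            unsigned long flags;

            spin_lock_irqsave(&zone->lock, flags);
            if (unlikely(has_isolate_pageblock(zone) ||
                         is_migrate_isolate(migratetype)))
                    migratetype = get_pfnblock_migratetype(page, pfn);
            __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
            spin_unlock_irqrestore(&zone->lock, flags);
    }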

Note that this may incur a performance penalty while memory hot-remove is
running but that is not a common operation.

[lkp@intel.com: Ensure CMA pages get added to correct pcp list]

Link: https://lkml.kernel.org/r/20210512095458.30632-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 56f0e661ea mm/page_alloc: explicitly acquire the zone lock in __free_pages_ok
__free_pages_ok() disables IRQs before calling a common helper
free_one_page() that acquires the zone lock.  This is not safe according
to Documentation/locking/locktypes.rst and in this context, IRQ disabling
is not protecting a per_cpu_pages structure either or a local_lock would
be used.

This patch explicitly acquires the lock with spin_lock_irqsave instead of
relying on a helper.  This removes the last instance of local_irq_save()
in page_alloc.c.

Link: https://lkml.kernel.org/r/20210512095458.30632-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 43c95bcc51 mm/page_alloc: reduce duration that IRQs are disabled for VM counters
IRQs are left disabled for the zone and node VM event counters.  This is
unnecessary as the affected counters are allowed to race with preemption
and IRQs.

This patch reduces the scope of IRQs being disabled via
local_[lock|unlock]_irq on !PREEMPT_RT kernels.  One
__mod_zone_freepage_state is still called with IRQs disabled.  While this
could be moved out, it's not free on all architectures as some require
IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels.

Link: https://lkml.kernel.org/r/20210512095458.30632-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 3e23060b2d mm/page_alloc: batch the accounting updates in the bulk allocator
Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be batch
updated without adding too much complexity with protected RMW updates or
using xchg.

Link: https://lkml.kernel.org/r/20210512095458.30632-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman f19298b951 mm/vmstat: convert NUMA statistics to basic NUMA counters
NUMA statistics are maintained on the zone level for hits, misses, foreign
etc but nothing relies on them being perfectly accurate for functional
correctness.  The counters are used by userspace to get a general overview
of a workload's NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES.  There even is a sysctl vm.numa_stat to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.

This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events.  There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
The counters are no longer updated from vmstat_refresh context as it is
unnecessary overhead for counters that may never be read by userspace.
Note that counters could be maintained at the node level to save space but
it would have a user-visible impact due to /proc/zoneinfo.

[lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]

Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman dbbee9d5cd mm/page_alloc: convert per-cpu list protection to local_lock
There is a lack of clarity of what exactly
local_irq_save/local_irq_restore protects in page_alloc.c.  It conflates
the protection of per-cpu page allocation structures with per-cpu vmstat
deltas.

This patch protects the PCP structure using local_lock which for most
configurations is identical to IRQ enabling/disabling.  The scope of the
lock is still wider than it should be but this is decreased later.

It is possible for the local_lock to be embedded safely within struct
per_cpu_pages but it adds complexity to free_unref_page_list.
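
The lock itself is a per-CPU local_lock kept next to, rather than inside,
the PCP structure, used roughly as follows:

    struct pagesets {
            local_lock_t lock;
    };
    static DEFINE_PER_CPU(struct pagesets, pagesets) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    /* Typical PCP access pattern after the conversion: */
    unsigned long flags;

    local_lock_irqsave(&pagesets.lock, flags);
    /* ... manipulate this CPU's per_cpu_pages lists ... */
    local_unlock_irqrestore(&pagesets.lock, flags);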

[akpm@linux-foundation.org: coding style fixes]
[mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
  Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
[lkp@intel.com: Make pagesets static]

Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Mel Gorman 28f836b677 mm/page_alloc: split per cpu page lists and zone stats
The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues.  For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used.  Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.

This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.

Why local_lock?  PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

   local_irq_disable();
   spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock).  While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately.  To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed.  That is a
custom lock that is similar to, but worse than, local_lock.  Furthermore, on
PREEMPT_RT, it's undesirable to leave IRQs disabled for too long.  By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections for
PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking.  As a
bonus, local_lock also means that PROVE_LOCKING does something useful.

After that, it's obvious that zone_statistics incurs too much overhead and
leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
zone_statistics uses perfectly accurate counters requiring IRQs be
disabled for parallel RMW sequences when inaccurate ones like vm_events
would do.  The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on
!PREEMPT_RT.

The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency.  Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock.  The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so the
locking scope is a bit clearer.  The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.

No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT.  However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series.  Focusing on the array
variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

         Baseline        Patched
 1       56.383          54.225 (+3.83%)
 2       40.047          35.492 (+11.38%)
 3       37.339          32.643 (+12.58%)
 4       35.578          30.992 (+12.89%)
 8       33.592          29.606 (+11.87%)
 16      32.362          28.532 (+11.85%)
 32      31.476          27.728 (+11.91%)
 64      30.633          27.252 (+11.04%)
 128     30.596          27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

This patch (of 9):

The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists.  This is inconsistent because the vmstats for a
node are stored on a dedicated structure.  The bigger issue is that the
per_cpu_pages structure is not cache-aligned, so stat updates either
cache-conflict with the adjacent per-cpu lists, incurring a runtime cost,
or padding is required, incurring a memory cost.

This patch splits the per-cpu pagelists and the vmstat deltas into
separate structures.  It's mostly a mechanical conversion but some
variable renaming is done to clearly distinguish the per-cpu pages
structure (pcp) from the vmstats (pzstats).

Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is no
impact overall.
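
A hedged sketch of the resulting split (fields abbreviated; the real
definitions live in include/linux/mmzone.h and differ in detail):

    /* Per-CPU page lists only. */
    struct per_cpu_pages {
            int count;              /* number of pages in the lists */
            int high;               /* high watermark, emptying needed */
            int batch;              /* chunk size for buddy add/remove */
    #ifdef CONFIG_NUMA
            int expire;             /* moved here, filling a structure hole */
    #endif
            struct list_head lists[MIGRATE_PCPTYPES];
    };

    /* Per-CPU zone stat deltas only (the "pzstats" of the changelog). */
    struct per_cpu_zonestat {
    #ifdef CONFIG_SMP
            s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
            s8 stat_threshold;
    #endif
    };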

[mgorman@techsingularity.net: make it W=1 cleaner]
  Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
[mgorman@techsingularity.net: make it W=1 even cleaner]
  Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
[lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
[vbabka@suse.cz: Init zone->per_cpu_zonestats properly]

Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Heiner Kallweit 9660ecaa79 mm/page_alloc: switch to pr_debug
Having such debug messages in the dmesg log may confuse users.  Therefore
restrict debug output to cases where DEBUG is defined or dynamic debugging
is enabled for the respective code piece.
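
For instance, a message of this kind moves from an unconditional printk to
pr_debug(), which is compiled out unless DEBUG is defined or dynamic debug
is enabled for the file (illustrative example, not the exact hunk; the
variable names are assumed):

    pr_debug("On node %d, zone %s: %lu pages in unavailable ranges\n",
             pgdat->node_id, zone_names[zone_id], pgcnt);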

Link: https://lkml.kernel.org/r/976adb93-3041-ce63-48fc-55a6096a51c1@gmail.com
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:53 -07:00
Matthew Wilcox (Oracle) ca891f41c4 mm: constify get_pfnblock_flags_mask and get_pfnblock_migratetype
The struct page is not modified by these routines, so it can be marked
const.
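
The effect on a prototype is along these lines (a sketch, not the exact
diff):

    /* Callers can now pass a const struct page *. */
    unsigned long get_pfnblock_flags_mask(const struct page *page,
                                          unsigned long pfn,
                                          unsigned long mask);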

Link: https://lkml.kernel.org/r/20210416231531.2521383-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:53 -07:00
Aaron Tomlin 691d949728 mm/page_alloc: bail out on fatal signal during reclaim/compaction retry attempt
A customer experienced a low-memory situation and decided to issue a
SIGKILL (i.e.  a fatal signal).  Instead of promptly terminating as one
would expect, the aforementioned task remained unresponsive.

Further investigation indicated that the task was "stuck" in the
reclaim/compaction retry loop.  Now, it does not make sense to retry
compaction when a fatal signal is pending.

In the context of try_to_compact_pages(), COMPACT_SKIPPED can indeed be
returned, albeit not every zone on the zone list would be considered when a
fatal signal is found to be pending.  Yet, in should_compact_retry(), given
the last known compaction result, each zone on the zone list can be
considered/checked (see compaction_zonelist_suitable()).  For example, if a
zone was found suitable, then reclaim/compaction would be tried again
(notwithstanding the above).

This patch ensures that compaction is not needlessly retried, irrespective
of the last known compaction result (e.g. if it was skipped), in the
unlikely case a fatal signal is found pending.  That way, the OOM path is
at least attempted.
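
A minimal sketch of the kind of early bail-out this adds to the retry
decision (placement and exact form assumed, not the verbatim hunk):

    /* In should_compact_retry(): never retry once a fatal signal is
     * pending, so the task can fall through to the OOM path and exit. */
    if (fatal_signal_pending(current))
            return false;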

Link: https://lkml.kernel.org/r/20210520142901.3371299-1-atomlin@redhat.com
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:53 -07:00
Matthew Wilcox (Oracle) d2f07ec052 mm: make __dump_page static
Patch series "Constify struct page arguments".

While working on various solutions to the 32-bit struct page size
regression, one of the problems I found was the networking stack expects
to be able to pass const struct page pointers around, and the mm doesn't
provide a lot of const-friendly functions to call.  The root tangle of
problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls
dump_page(), which calls a lot of functions which don't take a const
struct page (but could be const).

This patch (of 6):

The only caller of __dump_page() now opencodes dump_page(), so remove it
as an externally visible symbol.

Link: https://lkml.kernel.org/r/20210416231531.2521383-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20210416231531.2521383-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:53 -07:00
Mel Gorman ff4b2b4014 mm/page_alloc: correct return value of populated elements if bulk array is populated
Dave Jones reported the following

	This made it into 5.13 final, and completely breaks NFSD for me
	(Serving tcp v3 mounts).  Existing mounts on clients hang, as do
	new mounts from new clients.  Rebooting the server back to rc7
	everything recovers.

The commit b3b64ebd38 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure.  Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.
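
The logic being corrected can be sketched like this (hypothetical helper
name; the real code lives inline in __alloc_pages_bulk()):

    /* Count how much of the caller's array is already populated; when
     * nothing is left to allocate, report that count rather than 0,
     * because 0 reads as an allocation failure to callers. */
    static unsigned long bulk_already_populated(struct page **page_array,
                                                unsigned long nr_pages)
    {
            unsigned long nr_populated = 0;

            while (nr_populated < nr_pages && page_array[nr_populated])
                    nr_populated++;

            return nr_populated;
    }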

Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net
Fixes: b3b64ebd38 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [5.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:45 -07:00
Mike Rapoport 122e093c17 mm/page_alloc: fix memory map initialization for descending nodes
On systems with memory nodes sorted in descending order, for instance Dell
Precision WorkStation T5500, the struct pages for higher PFNs and
respectively lower nodes could be overwritten by the initialization of
struct pages corresponding to the holes in the memory sections.

For example for the below memory layout

[    0.245624] Early memory node ranges
[    0.248496]   node   1: [mem 0x0000000000001000-0x0000000000090fff]
[    0.251376]   node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[    0.254256]   node   1: [mem 0x0000000100000000-0x0000001423ffffff]
[    0.257144]   node   0: [mem 0x0000001424000000-0x0000002023ffffff]

the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in
the middle of a section and will be considered as a hole during the
initialization of the last section in node 1.

The wrong initialization of the memory map causes panic on boot when
CONFIG_DEBUG_VM is enabled.

Reorder loop order of the memory map initialization so that the outer loop
will always iterate over populated memory regions in the ascending order
and the inner loop will select the zone corresponding to the PFN range.

This way initialization of the struct pages for the memory holes will be
always done for the ranges that are actually not populated.
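
In pseudocode, the reordered initialization looks roughly like this (a
sketch; details and helper names abbreviated):

    /* Outer loop: populated ranges in ascending PFN order.
     * Inner loop: pick the zone overlapping the current range. */
    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
            struct pglist_data *node = NODE_DATA(nid);

            for (j = 0; j < MAX_NR_ZONES; j++) {
                    struct zone *zone = node->node_zones + j;

                    /* ... init struct pages for the part of
                     *     [start_pfn, end_pfn) that falls in this zone ... */
            }
            /* Holes below start_pfn were handled by earlier iterations, so
             * only genuinely unpopulated ranges get hole-initialized. */
    }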

[akpm@linux-foundation.org: coding style fixes]

Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org
Fixes: 0740a50b9b ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Boris Petkov <bp@alien8.de>
Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:45 -07:00
Linus Torvalds 9840cfcb97 arm64 updates for 5.14
- Optimise SVE switching for CPUs with 128-bit implementations.
 
  - Fix output format from SVE selftest.
 
  - Add support for versions v1.2 and 1.3 of the SMC calling convention.
 
  - Allow Pointer Authentication to be configured independently for
    kernel and userspace.
 
  - PMU driver cleanups for managing IRQ affinity and exposing event
    attributes via sysfs.
 
  - KASAN optimisations for both hardware tagging (MTE) and out-of-line
    software tagging implementations.
 
  - Relax frame record alignment requirements to facilitate 8-byte
    alignment with KASAN and Clang.
 
  - Cleanup of page-table definitions and removal of unused memory types.
 
  - Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
 
  - Refactoring of our instruction decoding routines and addition of some
    missing encodings.
 
  - Entry code moved into C and hardened against harmful compiler
    instrumentation.
 
  - Update booting requirements for the FEAT_HCX feature, added to v8.7
    of the architecture.
 
  - Fix resume from idle when pNMI is being used.
 
  - Additional CPU sanity checks for MTE and preparatory changes for
    systems where not all of the CPUs support 32-bit EL0.
 
  - Update our kernel string routines to the latest Cortex Strings
    implementation.
 
  - Big cleanup of our cache maintenance routines, which were confusingly
    named and inconsistent in their implementations.
 
  - Tweak linker flags so that GDB can understand vmlinux when using RELR
    relocations.
 
  - Boot path cleanups to enable early initialisation of per-cpu
    operations needed by KCSAN.
 
  - Non-critical fixes and miscellaneous cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmDUh1YQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNDaUCAC+2Jy2Yopd94uBPYajGybM0rqCUgE7b5n1
 A7UzmQ6fia2hwqCPmxGG+sRabovwN7C1bKrUCc03RIbErIa7wum1edeyqmF/Aw44
 DUDY1MAOSZaFmX8L62QCvxG1hfdLPtGmHMd1hdXvxYK7PCaigEFnzbLRWTtgE+Ok
 JhdvNfsoeITJObHnvYPF3rV3NAbyYni9aNJ5AC/qb3dlf6XigEraXaMj29XHKfwc
 +vmn+25oqFkLHyFeguqIoK+vUQAy/8TjFfjX83eN3LZknNhDJgWS1Iq1Nm+Vxt62
 RvDUUecWJjAooCWgmil6pt0enI+q6E8LcX3A3cWWrM6psbxnYzkU
 =I6KS
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "There's a reasonable amount here and the juicy details are all below.

  It's worth noting that the MTE/KASAN changes strayed outside of our
  usual directories due to core mm changes and some associated changes
  to some other architectures; Andrew asked for us to carry these [1]
  rather that take them via the -mm tree.

  Summary:

   - Optimise SVE switching for CPUs with 128-bit implementations.

   - Fix output format from SVE selftest.

   - Add support for versions v1.2 and 1.3 of the SMC calling
     convention.

   - Allow Pointer Authentication to be configured independently for
     kernel and userspace.

   - PMU driver cleanups for managing IRQ affinity and exposing event
     attributes via sysfs.

   - KASAN optimisations for both hardware tagging (MTE) and out-of-line
     software tagging implementations.

   - Relax frame record alignment requirements to facilitate 8-byte
     alignment with KASAN and Clang.

   - Cleanup of page-table definitions and removal of unused memory
     types.

   - Reduction of ARCH_DMA_MINALIGN back to 64 bytes.

   - Refactoring of our instruction decoding routines and addition of
     some missing encodings.

   - Entry code moved into C and hardened against harmful compiler
     instrumentation.

   - Update booting requirements for the FEAT_HCX feature, added to v8.7
     of the architecture.

   - Fix resume from idle when pNMI is being used.

   - Additional CPU sanity checks for MTE and preparatory changes for
     systems where not all of the CPUs support 32-bit EL0.

   - Update our kernel string routines to the latest Cortex Strings
     implementation.

   - Big cleanup of our cache maintenance routines, which were
     confusingly named and inconsistent in their implementations.

   - Tweak linker flags so that GDB can understand vmlinux when using
     RELR relocations.

   - Boot path cleanups to enable early initialisation of per-cpu
     operations needed by KCSAN.

   - Non-critical fixes and miscellaneous cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
  arm64: tlb: fix the TTL value of tlb_get_level
  arm64: Restrict undef hook for cpufeature registers
  arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
  arm64: insn: avoid circular include dependency
  arm64: smp: Bump debugging information print down to KERN_DEBUG
  drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
  perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
  arm64: suspend: Use cpuidle context helpers in cpu_suspend()
  PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
  arm64: Convert cpu_do_idle() to using cpuidle context helpers
  arm64: Add cpuidle context save/restore helpers
  arm64: head: fix code comments in set_cpu_boot_mode_flag
  arm64: mm: drop unused __pa(__idmap_text_start)
  arm64: mm: fix the count comments in compute_indices
  arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
  arm64: mm: Pass original fault address to handle_mm_fault()
  arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
  arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
  arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
  arm64: Conditionally configure PTR_AUTH key of the kernel.
  ...
2021-06-28 14:04:24 -07:00
Mel Gorman 66d9282523 mm/page_alloc: Correct return value of populated elements if bulk array is populated
Dave Jones reported the following

	This made it into 5.13 final, and completely breaks NFSD for me
	(Serving tcp v3 mounts).  Existing mounts on clients hang, as do
	new mounts from new clients.  Rebooting the server back to rc7
	everything recovers.

The commit b3b64ebd38 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure. Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.

Fixes: b3b64ebd38 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [5.13+]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-28 10:00:54 -07:00
Mel Gorman b3b64ebd38 mm/page_alloc: do bulk array bounds check after checking populated elements
Dan Carpenter reported the following

  The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface
  to the bulk page allocator" from Apr 29, 2021, leads to the following
  static checker warning:

        mm/page_alloc.c:5338 __alloc_pages_bulk()
        warn: potentially one past the end of array 'page_array[nr_populated]'

The problem can occur if an array is passed in that is fully populated.
That potentially ends up allocating a single page and storing it past
the end of the array.  This patch returns 0 if the array is fully
populated.

Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net
Fixes: 0f87d9d30f ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Rasmus Villemoes b08e50dd64 mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array
In the event that somebody would call this with an already fully
populated page_array, the last loop iteration would do an access beyond
the end of page_array.

It's of course extremely unlikely that would ever be done, but this
triggers my internal static analyzer.  Also, if it really is not
supposed to be invoked this way (i.e., with no NULL entries in
page_array), the nr_populated<nr_pages check could simply be removed
instead.

Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk
Fixes: 0f87d9d30f ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Ding Hui bac9c6fa1f mm/page_alloc: fix counting of free pages after take off from buddy
Recently we found that there is a lot of MemFree left in /proc/meminfo
after soft offlining a lot of pages, which is not quite correct.

Before Oscar's rework of soft offline for free pages [1], if we soft
offline free pages, these pages are left in buddy with HWPoison flag,
and NR_FREE_PAGES is not updated immediately.  So the difference between
NR_FREE_PAGES and the real number of available free pages can be big even
at the beginning.

However, with the workload running, when we subsequently catch an HWPoison
page in any alloc function, we remove it from buddy and update
NR_FREE_PAGES at the same time, then try again, so NR_FREE_PAGES gets
closer and closer to the real number of available free pages.
(regardless of unpoison_memory())

Now, for offlined free pages, after a successful call to
take_page_off_buddy(), the page no longer belongs to the buddy allocator
and will not be used any more, but we missed accounting for NR_FREE_PAGES
in this situation, and there is no chance for it to be updated later.

Do the update in take_page_off_buddy() like rmqueue() does, but avoid
double counting if someone has already called set_migratetype_isolate() on
the page.
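
A sketch of the accounting added (the exact form in take_page_off_buddy()
may differ slightly):

    del_page_from_free_list(page, zone, order);
    /* Isolated pageblocks are already excluded from NR_FREE_PAGES,
     * so only adjust the counter for non-isolated migratetypes. */
    if (!is_migrate_isolate(migratetype))
            __mod_zone_freepage_state(zone, -1, migratetype);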

[1]: commit 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")

Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn
Fixes: 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05 08:58:11 -07:00
Peter Collingbourne c275c5c6d5 kasan: disable freed user page poisoning with HW tags
Poisoning freed pages protects against kernel use-after-free. The
likelihood of such a bug involving kernel pages is significantly higher
than that for user pages. At the same time, poisoning freed pages can
impose a significant performance cost, which cannot always be justified
for user pages given the lower probability of finding a bug. Therefore,
disable freed user page poisoning when using HW tags. We identify
"user" pages via the flag set GFP_HIGHUSER_MOVABLE, which indicates
a strong likelihood of not being directly accessible to the kernel.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://linux-review.googlesource.com/id/I716846e2de8ef179f44e835770df7e6307be96c9
Link: https://lore.kernel.org/r/20210602235230.3928842-5-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-06-04 19:32:21 +01:00
Peter Collingbourne 013bb59dbb arm64: mte: handle tags zeroing at page allocation time
Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags. It is, however, more
efficient to clear the tags at the same time as zeroing the data on
allocation. To avoid clearing the tags on any page (which may not be
mapped as tagged), only do this if the vma flags contain VM_MTE. This
requires introducing a new GFP flag that is used to determine whether
to clear the tags.
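
A minimal sketch of how the new flag gets selected (the flag name is the
one this patch introduces; the surrounding code is assumed):

    /* Only ask for tag zeroing on tagged (PROT_MTE) mappings. */
    gfp_t flags = GFP_HIGHUSER_MOVABLE;

    if (vma->vm_flags & VM_MTE)
            flags |= __GFP_ZEROTAGS;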

The DC GZVA instruction with a 0 top byte (and 0 tag) requires
top-byte-ignore. Set the TCR_EL1.{TBI1,TBID1} bits irrespective of
whether KASAN_HW is enabled.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://linux-review.googlesource.com/id/Id46dc94e30fe11474f7e54f5d65e7658dbdddb26
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20210602235230.3928842-4-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-06-04 19:32:21 +01:00
Peter Collingbourne 7a3b835371 kasan: use separate (un)poison implementation for integrated init
Currently, with integrated init, page_alloc.c needs to know whether
kasan_alloc_pages() will zero-initialize memory, but this will start
becoming more complicated once we start adding tag initialization
support for user pages. To avoid page_alloc.c needing to know more
details of what integrated init will do, move the unpoisoning logic
for integrated init into the HW tags implementation. Currently the
logic is identical but it will diverge in subsequent patches.

For symmetry do the same for poisoning although this logic will
be unaffected by subsequent patches.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://linux-review.googlesource.com/id/I2c550234c6c4a893c48c18ff0c6ce658c7c67056
Link: https://lore.kernel.org/r/20210602235230.3928842-3-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-06-04 19:32:21 +01:00
Lu Jialin baf2f90ba4 mm: fix typos in comments
succed -> succeed in mm/hugetlb.c
wil -> will in mm/mempolicy.c
wit -> with in mm/page_alloc.c
Retruns -> Returns in mm/page_vma_mapped.c
confict -> conflict in mm/secretmem.c
No functionality changed.

Link: https://lkml.kernel.org/r/20210408140027.60623-1-lujialin4@huawei.com
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Zhiyuan Dai 68d68ff6eb mm/mempool: minor coding style tweaks
Various coding style tweaks to various files under mm/

[daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:27 -07:00
Mel Gorman 8ca559132a mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
zone_pcp_reset allegedly protects against a race with drain_pages using
local_irq_save but this is bogus.  local_irq_save only operates on the
local CPU.  If memory hotplug is running on CPU A and drain_pages is
running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.

This patch deletes IRQ disable/enable on the grounds that IRQs protect
nothing and assumes the existing hotplug paths guarantee the PCP cannot
be used after zone_pcp_enable().  That should be the case already
because all the pages have been freed and there is no page to put on the
PCP lists.

Link: https://lkml.kernel.org/r/20210412090346.GQ3697@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
Pavel Tatashin 8e3560d963 mm: honor PF_MEMALLOC_PIN for all movable pages
PF_MEMALLOC_PIN is only honored for CMA pages; extend this flag to work
for any allocations from ZONE_MOVABLE by removing __GFP_MOVABLE from
gfp_mask when this flag is passed in the current context.

Add is_pinnable_page() to return true if the page is a pinnable page.  A
pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.
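
The helper's intent can be sketched as follows (the exact upstream form may
differ):

    static inline bool is_pinnable_page(struct page *page)
    {
            /* Pinnable: neither in ZONE_MOVABLE nor a CMA page. */
            return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
                     is_migrate_cma_page(page));
    }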

Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
Pavel Tatashin da6df1b0fc mm: apply per-task gfp constraints in fast path
Function current_gfp_context() is called after the fast path.  However,
soon we will add more constraints which will also limit zones based on
context.  Move this call into the fast path, and apply the correct
constraints for all allocations.

Also update .reclaim_idx based on value returned by
current_gfp_context() because it soon will modify the allowed zones.

Note:
With this patch we will do one extra current->flags load during fast path,
but we already load current->flags in fast-path:

__alloc_pages()
 prepare_alloc_pages()
  current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone constraint logic to current_gfp_context(), we
will be able to remove the current->flags load from current_alloc_flags,
and therefore return the fast path to the current performance level.

Link: https://lkml.kernel.org/r/20210215161349.246722-7-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
Pavel Tatashin 1a08ae36cf mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not
return pages that might belong to CMA region.  This is currently used
for long term gup to make sure that such pins are not going to be done
on any CMA pages.

When PF_MEMALLOC_NOCMA was introduced, we did not realize that it focuses
too narrowly on CMA pages and that there is a larger class of pages that
needs the same treatment.  The MOVABLE zone cannot contain any long-term
pins either, so it makes sense to reuse and redefine this flag for that
use case as well.  Rename the flag to PF_MEMALLOC_PIN, which
defines an allocation context which can only get pages suitable for
long-term pins.

Also rename: memalloc_nocma_save()/memalloc_nocma_restore to
memalloc_pin_save()/memalloc_pin_restore() and make the new functions
common.
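
Usage of the renamed scope API, roughly (a sketch; the work inside the
scope is elided):

    unsigned int flags = memalloc_pin_save();

    /* ... long-term pinning work: allocations in this scope will not
     *     return pages unsuitable for long-term pins ... */

    memalloc_pin_restore(flags);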

[rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
  Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org

Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
Minchan Kim 78fa51503f mm: use proper type for cma_[alloc|release]
size_t in cma_alloc is confusing since it makes people think it's a byte
count, not a page count.  Change it to unsigned long[1].

The unsigned int in cma_release is also not right so change it.  Since we
have unsigned long in cma_release, free_contig_range should also respect
it.

[1] 67a2e213e7, mm: cma: fix incorrect type conversion for size during dma allocation

Link: https://lore.kernel.org/linux-mm/20210324043434.GP1719932@casper.infradead.org/
Link: https://lkml.kernel.org/r/20210331164018.710560-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Minchan Kim 361a2a229f mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]
Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable].  There is not much to gain from having
additional abstraction.

Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.

Note: migrate_prep_local() in compaction.c is changed to lru_add_drain() so
the old behaviour is kept while avoiding the scheduling cost of involving
many other CPUs.
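
Callers end up bracketing the migration directly, roughly like this (a
sketch; the migration work itself is elided):

    lru_cache_disable();
    /* ... migrate_pages() / alloc_contig_range() work runs here ... */
    lru_cache_enable();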

Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: John Dias <joaodias@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Minchan Kim d479960e44 mm: disable LRU pagevec during the migration temporarily
The LRU pagevec holds a refcount on its pages until the pagevec is drained.
This can prevent migration since the refcount of the page is greater than
what the migration logic expects.  To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.

However, that is not enough because pages coming into the pagevec after the
draining call can still sit in the pagevec and keep preventing page
migration.  Since some callers of migrate_pages have retry logic with LRU
draining, the page would migrate on the next trial, but this is still
fragile in that it doesn't close the fundamental race between LRU pages
entering the pagevec and migration, so the migration failure can cause a
contiguous memory allocation failure in the end.

To close the race, this patch disables LRU caches (i.e. the pagevec) during
ongoing migration until the migration is done.

Since the race is really hard to reproduce, I measured how many times
migrate_pages retried with force mode (i.e. a fallback to synchronous
migration) using the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
			..
			..

  if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
          /* Log the failing pfn and error code, then dump the page state. */
          printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
          dump_page(page, "fail to migrate");
  }

The test was repeating android apps launching with cma allocation in
background every five seconds.  Total cma allocation count was about 500
during the testing.  With this patch, the dump_page count was reduced
from 400 to 30.

The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure.  This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once.  This is also in
line with the pcp allocator caches which are disabled for the offlining as
well.

Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Charan Teja Reddy 06dac2f467 mm: compaction: update the COMPACT[STALL|FAIL] events properly
By definition, COMPACT[STALL|FAIL] events need to be counted when there
is 'At least in one zone compaction wasn't deferred or skipped from the
direct compaction'.  And when compaction is skipped or deferred,
COMPACT_SKIPPED will be returned but it will still go and update these
compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
is counted without even trying the compaction.

Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction.  This indirectly also avoids
the unnecessary call into get_page_from_freelist() when compaction is not
even tried.

There is a corner case where compaction is skipped but the COMPACTSTALL
event is still counted: an IRQ came in and freed the page, and that page is
captured in capture_control.
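
A sketch of the change in __alloc_pages_direct_compact() (exact placement
assumed):

    /* Compaction was neither attempted nor stalled: do not count it. */
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
            return NULL;

    count_vm_event(COMPACTSTALL);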

Link: https://lkml.kernel.org/r/1613151184-21213-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Dave Hansen 202e35db5e mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
RECLAIM_ZONE was assumed to be unused because it was never explicitly
used in the kernel.  However, there were a number of places where it was
checked implicitly by checking 'node_reclaim_mode' for a zero value.

These zero checks are not great because it is not obvious what a zero
mode *means* in the code.  Replace them with a helper which makes it
more obvious: node_reclaim_enabled().

This helper also provides a handy place to explicitly check the
RECLAIM_ZONE bit itself.  Check it explicitly there to make it more
obvious where the bit can affect behavior.

This should have no functional impact.
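
The helper is essentially a one-liner (sketched here; the mask may be
spelled slightly differently upstream):

    static inline bool node_reclaim_enabled(void)
    {
            /* Any of the reclaim mode bits enables node reclaim. */
            return node_reclaim_mode &
                   (RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_UNMAP);
    }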

Link: https://lkml.kernel.org/r/20210219172559.BF589C44@viggo.jf.intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:23 -07:00
Oscar Salvador eb14d4eefd mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig
pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.  We can drop the in-use page check since
__alloc_contig_pages can migrate away those pages, and the hugetlb page
check can go too since isolate_migratepages_range is now capable of
dealing with hugetlb pages.  Either way, those checks are racy so let the
end function handle it when the time comes.

Link: https://lkml.kernel.org/r/20210419075413.1064-8-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Oscar Salvador c2ad7a1ffe mm,compaction: let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn
== 0 vs pfn != 0 scheme to let the caller know whether there was any error
during isolation.

This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions such as alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.  Since isolate_migratepages_block will stop returning the next pfn
to be scanned, we reuse the cc->migrate_pfn field to keep track of that.

Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Oscar Salvador c8e28b47af mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range
Patch series "Make alloc_contig_range handle Hugetlb pages", v10.

alloc_contig_range lacks the ability to handle HugeTLB pages.  This can
be problematic for some users, e.g: CMA and virtio-mem, where those
users will fail the call if alloc_contig_range ever sees a HugeTLB page,
even when those pages lie in ZONE_MOVABLE and are free.  That problem
can be easily solved by replacing the page in the free hugepage pool.

In-use HugeTLB are no exception though, as those can be isolated and
migrated as any other LRU or Movable page.

This aims to improve alloc_contig_range->isolate_migratepages_block, so
that HugeTLB pages can be recognized and handled.

Since we also need to start reporting errors down the chain (e.g:
-ENOMEM due to not being able to allocate a new hugetlb page),
isolate_migratepages_{range,block} interfaces need to change to start
reporting error codes instead of the pfn == 0 vs pfn != 0 scheme it is
using right now.  From now on, isolate_migratepages_block will no longer
return the next pfn to be scanned, but -EINTR, -ENOMEM or 0, and the next
pfn to be scanned will be recorded in the cc->migrate_pfn field (as is
already done in isolate_migratepages_range()).

Below is an insight from David (thanks), where the problem can clearly be
seen:

 "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
  ZONE_MOVABLE. Allocate 512 huge pages.

  [root@localhost ~]# cat /proc/meminfo
  MemTotal:        5061512 kB
  MemFree:         3319396 kB
  MemAvailable:    3457144 kB
  ...
  HugePages_Total:     512
  HugePages_Free:      512
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB

  The huge pages get partially allocate from ZONE_MOVABLE. Try unplugging
  1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

  [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"

And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
     again.

  c) Allocated/populated but all 512 huge pages are read/written in a
     loop: all memory can get unplugged again, but I get a single

     [  121.192345] alloc_contig_range: [180000, 188000) PFNs busy

     Most probably because it happened to try migrating a huge page
     while it was busy.  As virtio-mem retries on ZONE_MOVABLE a couple of
     times, it can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:        5061568 kB
  MemFree:          186560 kB
  MemAvailable:     354524 kB
  ...
  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0

  Triggering unplug would require to dissolve+alloc - which now fails
  when trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding.  +
  I get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0"

This patch (of 7):

Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or
-EBUSY, and report them down the chain.  The problem is that when
migrate_pages() reports -ENOMEM, we keep going till we exhaust all the
try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a
fatal error.  Do the same here instead of continuing to retry.  Note that
this is not fixing a real issue, just a cosmetic change, although we can
save some cycles by backing off earlier.
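
The bail-out can be sketched as follows (inside the retry loop of
__alloc_contig_migrate_range(); surrounding code elided and assumed):

    ret = migrate_pages(&cc->migratepages, alloc_migration_target,
                        NULL, (unsigned long)&mtc, cc->mode,
                        MR_CONTIG_RANGE);
    if (ret == -ENOMEM)
            break;  /* fatal: stop retrying instead of looping 5 times */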

Link: https://lkml.kernel.org/r/20210419075413.1064-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20210419075413.1064-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Sergei Trofimovich 9df65f5225 mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
On !ARCH_SUPPORTS_DEBUG_PAGEALLOC (like ia64) debug_pagealloc=1 implies
page_poison=on:

    if (page_poisoning_enabled() ||
         (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
          debug_pagealloc_enabled()))
            static_branch_enable(&_page_poisoning_enabled);

page_poison=on needs to override init_on_free=1.

Before the change it did not work as expected for the following case:
- have PAGE_POISONING=y
- have page_poison unset
- have !ARCH_SUPPORTS_DEBUG_PAGEALLOC arch (like ia64)
- have init_on_free=1
- have debug_pagealloc=1

That way we get both keys enabled:
- static_branch_enable(&init_on_free);
- static_branch_enable(&_page_poisoning_enabled);

which leads to poisoned pages returned for __GFP_ZERO pages.

After the change we execute only:
- static_branch_enable(&_page_poisoning_enabled);
  and ignore init_on_free=1.
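
A hedged sketch of the resulting precedence logic (the local variable names
here are assumptions, not the exact upstream identifiers):

    bool page_poisoning_requested = false;

    if (page_poisoning_enabled() ||
         (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
          debug_pagealloc_enabled())) {
            static_branch_enable(&_page_poisoning_enabled);
            page_poisoning_requested = true;
    }

    /* init_on_free=1 is ignored whenever poisoning is in effect, so
     * __GFP_ZERO allocations are never handed out poisoned. */
    if (want_init_on_free_cmdline && !page_poisoning_requested)
            static_branch_enable(&init_on_free);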

Link: https://lkml.kernel.org/r/20210329222555.3077928-1-slyfox@gentoo.org
Link: https://lkml.org/lkml/2021/3/26/443
Fixes: 8db26a3d47 ("mm, page_poison: use static key more efficiently")
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:43 -07:00
Jesper Dangaard Brouer 3b822017b6 mm/page_alloc: inline __rmqueue_pcplist
When __alloc_pages_bulk() was introduced, two callers of __rmqueue_pcplist
existed and the compiler chose not to inline this function.

  ./scripts/bloat-o-meter vmlinux-before vmlinux-inline__rmqueue_pcplist
  add/remove: 0/1 grow/shrink: 2/0 up/down: 164/-125 (39)
  Function                                     old     new   delta
  rmqueue                                     2197    2296     +99
  __alloc_pages_bulk                          1921    1986     +65
  __rmqueue_pcplist                            125       -    -125
  Total: Before=19374127, After=19374166, chg +0.00%

modprobe page_bench04_bulk loops=$((10**7))

Type:time_bulk_page_alloc_free_array
 -  Per elem: 106 cycles(tsc) 29.595 ns (step:64)
 - (measurement period time:0.295955434 sec time_interval:295955434)
 - (invoke count:10000000 tsc_interval:1065447105)

Before:
 - Per elem: 110 cycles(tsc) 30.633 ns (step:64)

Link: https://lkml.kernel.org/r/20210325114228.27719-6-mgorman@techsingularity.net
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:43 -07:00
Jesper Dangaard Brouer ce76f9a1d9 mm/page_alloc: optimize code layout for __alloc_pages_bulk
Looking at the perf report and ASM code for __alloc_pages_bulk() it is
clear that the generated code is suboptimal.  The compiler guesses wrong
and places unlikely code at the beginning.  Due to the use of the
WARN_ON_ONCE() macro the UD2 asm instruction is added to the code, which
confuses the I-cache prefetcher in the CPU.

[mgorman@techsingularity.net: minor changes and rebasing]

Link: https://lkml.kernel.org/r/20210325114228.27719-5-mgorman@techsingularity.net
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-By: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:43 -07:00
Mel Gorman 0f87d9d30f mm/page_alloc: add an array-based interface to the bulk page allocator
The proposed callers for the bulk allocator store pages from the bulk
allocator in an array.  This patch adds an array-based interface to the
API to avoid multiple list iterations.  The page list interface is
preserved to avoid requiring all users of the bulk API to allocate and
manage enough storage to store the pages.
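
A sketch of a caller using the array interface (error handling elided; the
wrapper name is the one added by this series):

    struct page *pages[16] = { NULL };
    unsigned long nr;

    nr = alloc_pages_bulk_array(GFP_KERNEL, ARRAY_SIZE(pages), pages);
    /* nr is the number of populated entries in pages[]; slots that already
     * held a page on entry are skipped rather than overwritten. */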

[akpm@linux-foundation.org: remove now unused local `allocated']

Link: https://lkml.kernel.org/r/20210325114228.27719-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:43 -07:00
Mel Gorman 387ba26fb1 mm/page_alloc: add a bulk page allocator
This patch adds a new page allocator interface via alloc_pages_bulk, and
__alloc_pages_bulk_nodemask.  A caller requests a number of pages to be
allocated and added to a list.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation.  It's up to the caller to request more pages in batch if
necessary.

Note that this implementation is not very efficient and could be
improved but it would require refactoring.  The intent is to make it
available early to determine what semantics are required by different
callers.  Once the full semantics are nailed down, it can be refactored.
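
A sketch of a caller using the list interface (the wrapper name is the one
introduced here; callers must cope with getting fewer pages than asked
for):

    LIST_HEAD(page_list);
    unsigned long allocated;

    allocated = alloc_pages_bulk_list(GFP_KERNEL, 32, &page_list);
    if (allocated < 32) {
            /* fall back to single-page allocations for the remainder */
    }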

[mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew]
  Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net
[mgorman@techsingularity.net: fix uninit var warning]
  Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net
[mgorman@techsingularity.net: fix comment, per Vlastimil]
  Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net

Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Tested-by: Colin Ian King <colin.king@canonical.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:43 -07:00