Commit Graph

2073 Commits

Author SHA1 Message Date
Patrick Talbert 0c06cdf722 Merge: CVE-2024-53113: mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5916

JIRA: https://issues.redhat.com/browse/RHEL-69700
CVE: CVE-2024-53113

```
mm: fix NULL pointer dereference in alloc_pages_bulk_noprof

We triggered a NULL pointer dereference for ac.preferred_zoneref->zone in
alloc_pages_bulk_noprof() when the task is migrated between cpusets.

When cpuset is enabled, in prepare_alloc_pages(), ac->nodemask may be
&current->mems_allowed.  When first_zones_zonelist() is called to find
preferred_zoneref, the ac->nodemask may be modified concurrently if the
task is migrated between different cpusets.  Assuming we have 2 NUMA nodes,
when traversing Node1 in ac->zonelist, the nodemask is 2, and when
traversing Node2 in ac->zonelist, the nodemask is 1.  As a result, the
ac->preferred_zoneref points to NULL zone.

In alloc_pages_bulk_noprof(), for_each_zone_zonelist_nodemask() finds an
allowable zone and calls zonelist_node_idx(ac.preferred_zoneref), leading
to a NULL pointer dereference.

__alloc_pages_noprof() fixes this issue by checking NULL pointer in commit
ea57485af8 ("mm, page_alloc: fix check for NULL preferred_zone") and
commit df76cee6bb ("mm, page_alloc: remove redundant checks from alloc
fastpath").

To fix it, check NULL pointer for preferred_zoneref->zone.

Link: https://lkml.kernel.org/r/20241113083235.166798-1-tujinjiang@huawei.com
Fixes: 387ba26fb1 ("mm/page_alloc: add a bulk page allocator")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 8ce41b0f9d77cca074df25afd39b86e2ee3aa68e)
```
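
The fix quoted above boils down to one extra NULL check before the bulk-allocation fast path dereferences the preferred zoneref. Below is a minimal userspace sketch of that idea; the struct layout, the bulk_alloc() helper and its fallback behaviour are simplified stand-ins for illustration, not the kernel's actual code.

```
/* Illustrative sketch of the fix described above: bail out of the bulk
 * allocation fast path when the preferred zoneref resolved to no zone,
 * instead of dereferencing it. Types and helpers are toy stand-ins. */
#include <stdio.h>
#include <stddef.h>

struct zone { int node; };
struct zoneref { struct zone *zone; };

static int bulk_alloc(struct zoneref *preferred, int nr_pages)
{
	/* The fix: if a concurrent cpuset migration left us without a
	 * preferred zone, fall back instead of dereferencing it. */
	if (preferred == NULL || preferred->zone == NULL)
		return 0;	/* caller falls back to single-page allocation */

	/* ... normal bulk allocation would use preferred->zone here ... */
	return nr_pages;
}

int main(void)
{
	struct zoneref empty = { .zone = NULL };
	struct zone node0 = { .node = 0 };
	struct zoneref ok = { .zone = &node0 };

	printf("no preferred zone -> %d pages\n", bulk_alloc(&empty, 8));
	printf("preferred zone ok -> %d pages\n", bulk_alloc(&ok, 8));
	return 0;
}
```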

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>

---

Created 2024-12-02 14:42 UTC by backporter - [KWF FAQ](https://red.ht/kernel_workflow_doc) - [Slack #team-kernel-workflow](https://redhat-internal.slack.com/archives/C04LRUPMJQ5) - [Source](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/webhook/utils/backporter.py) - [Documentation](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/docs/README.backporter.md) - [Report an issue](https://gitlab.com/cki-project/kernel-workflow/-/issues/new?issue%5Btitle%5D=backporter%20webhook%20issue)

Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Audra Mitchell <aubaker@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2024-12-30 07:30:14 -05:00
Rafael Aquini 766d961233 mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
JIRA: https://issues.redhat.com/browse/RHEL-27745
JIRA: https://issues.redhat.com/browse/RHEL-69683
CVE: CVE-2024-53105
Conflicts:
  * mm/swap.c: minor differences as RHEL-9 misses upstream v6.9 commit
    f1ee018baee9 ("mm: use __page_cache_release() in folios_put()"),
    which changes the comment section above the branch being moved
    from __page_cache_release() to free_pages_prepare()

This patch is a backport of the following upstream commit:
commit 66edc3a5894c74f8887c8af23b97593a0dd0df4d
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Wed Nov 6 19:53:54 2024 +0000

    mm: page_alloc: move mlocked flag clearance into free_pages_prepare()

    Syzbot reported a bad page state problem caused by a page being freed
    using free_page() still having a mlocked flag at free_pages_prepare()
    stage:

      BUG: Bad page state in process syz.5.504  pfn:61f45
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45
      flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394
       set_page_owner include/linux/page_owner.h:32 [inline]
       post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
       prep_new_page mm/page_alloc.c:1545 [inline]
       get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457
       __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733
       alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
       kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
       kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
       kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline]
       kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530
       __do_compat_sys_ioctl fs/ioctl.c:1007 [inline]
       __se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950
       do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
       __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
       do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
       entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      page last free pid 8399 tgid 8399 stack trace:
       reset_page_owner include/linux/page_owner.h:25 [inline]
       free_pages_prepare mm/page_alloc.c:1108 [inline]
       free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686
       folios_put_refs+0x76c/0x860 mm/swap.c:1007
       free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335
       __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
       tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
       tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
       tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373
       tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465
       exit_mmap+0x496/0xc40 mm/mmap.c:1926
       __mmput+0x115/0x390 kernel/fork.c:1348
       exit_mm+0x220/0x310 kernel/exit.c:571
       do_exit+0x9b2/0x28e0 kernel/exit.c:926
       do_group_exit+0x207/0x2c0 kernel/exit.c:1088
       __do_sys_exit_group kernel/exit.c:1099 [inline]
       __se_sys_exit_group kernel/exit.c:1097 [inline]
       __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
       x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      Modules linked in:
      CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:94 [inline]
       dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
       bad_page+0x176/0x1d0 mm/page_alloc.c:501
       free_page_is_bad mm/page_alloc.c:918 [inline]
       free_pages_prepare mm/page_alloc.c:1100 [inline]
       free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638
       kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline]
       kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386
       kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143
       __fput+0x23f/0x880 fs/file_table.c:431
       task_work_run+0x24f/0x310 kernel/task_work.c:239
       exit_task_work include/linux/task_work.h:43 [inline]
       do_exit+0xa2f/0x28e0 kernel/exit.c:939
       do_group_exit+0x207/0x2c0 kernel/exit.c:1088
       __do_sys_exit_group kernel/exit.c:1099 [inline]
       __se_sys_exit_group kernel/exit.c:1097 [inline]
       __ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
       ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253
       do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
       __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
       do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
       entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      RIP: 0023:0xf745d579
      Code: Unable to access opcode bytes at 0xf745d54f.
      RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4
      RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       </TASK>

    The problem was originally introduced by commit b109b87050df ("mm/munlock:
    replace clear_page_mlock() by final clearance"): it was focused on
    handling pagecache and anonymous memory and wasn't suitable for lower
    level get_page()/free_page() API's used for example by KVM, as with this
    reproducer.

    Fix it by moving the mlocked flag clearance down to free_pages_prepare().

    The bug itself is fairly old and harmless, aside from generating these
    warnings and a small memory leak - "bad" pages are stopped from being
    allocated again.

    Link: https://lkml.kernel.org/r/20241106195354.270757-1-roman.gushchin@linux.dev
    Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reported-by: syzbot+e985d3026c4fd041578e@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/6729f475.050a0220.701a.0019.GAE@google.com
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
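
As an illustration of the move described above, here is a minimal userspace sketch of a free-preparation step that clears the mlocked state itself instead of relying on the higher-level pagecache release path. The PG_MLOCKED bit, the struct page layout and the bad-page check below are toy stand-ins, not the kernel's.

```
#include <stdio.h>
#include <stdbool.h>

#define PG_MLOCKED (1u << 0)

struct page { unsigned int flags; };

static bool free_pages_prepare(struct page *page)
{
	/* Previously only the pagecache release path cleared this; doing it
	 * in the common free-preparation step covers the low-level
	 * get_page()/free_page() users as well. */
	if (page->flags & PG_MLOCKED)
		page->flags &= ~PG_MLOCKED;

	/* Bad-page check: any flag still set at free time is a bug. */
	if (page->flags != 0) {
		fprintf(stderr, "BUG: bad page state, flags=%#x\n", page->flags);
		return false;
	}
	return true;
}

int main(void)
{
	struct page p = { .flags = PG_MLOCKED };

	printf("free ok: %s\n", free_pages_prepare(&p) ? "yes" : "no");
	return 0;
}
```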

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:59 -05:00
Rafael Aquini 87c482db30 mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
JIRA: https://issues.redhat.com/browse/RHEL-27745
JIRA: https://issues.redhat.com/browse/RHEL-66794
CVE: CVE-2024-50219

This patch is a backport of the following upstream commit:
commit 281dd25c1a018261a04d1b8bf41a0674000bfe38
Author: Matt Fleming <mfleming@cloudflare.com>
Date:   Fri Oct 11 13:07:37 2024 +0100

    mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves

    Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to
    fail even though free pages are available in the highatomic reserves.
    GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock()
    since it's only run from reclaim.

    Given that such allocations will pass the watermarks in
    __zone_watermark_unusable_free(), it makes sense to fallback to highatomic
    reserves the same way that ALLOC_OOM can.

    This fixes order-0 page allocation failures observed on Cloudflare's fleet
    when handling network packets:

      kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC),
      nodemask=(null),cpuset=/,mems_allowed=0-7
      CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G           O 6.6.43-CUSTOM #1
      Hardware name: MACHINE
      Call Trace:
       <IRQ>
       dump_stack_lvl+0x3c/0x50
       warn_alloc+0x13a/0x1c0
       __alloc_pages_slowpath.constprop.0+0xc9d/0xd10
       __alloc_pages+0x327/0x340
       __napi_alloc_skb+0x16d/0x1f0
       bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en]
       bnxt_rx_pkt+0x201/0x15e0 [bnxt_en]
       __bnxt_poll_work+0x156/0x2b0 [bnxt_en]
       bnxt_poll+0xd9/0x1c0 [bnxt_en]
       __napi_poll+0x2b/0x1b0
       bpf_trampoline_6442524138+0x7d/0x1000
       __napi_poll+0x5/0x1b0
       net_rx_action+0x342/0x740
       handle_softirqs+0xcf/0x2b0
       irq_exit_rcu+0x6c/0x90
       sysvec_apic_timer_interrupt+0x72/0x90
       </IRQ>

    [mfleming@cloudflare.com: update comment]
      Link: https://lkml.kernel.org/r/20241015125158.3597702-1-matt@readmodwrite.com
    Link: https://lkml.kernel.org/r/20241011120737.3300370-1-matt@readmodwrite.com
    Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/
    Fixes: 1d91df85f3 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs")
    Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
    Suggested-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
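
The policy change above can be pictured with a small userspace model: an order-0 atomic allocation that would otherwise fail is allowed to dip into the highatomic reserve, much like ALLOC_OOM-style allocations. The zone fields and the ALLOC_HIGHATOMIC flag below are illustrative assumptions, not the kernel's definitions.

```
#include <stdio.h>
#include <stdbool.h>

struct zone {
	long free_pages;
	long highatomic_reserve;	/* pages set aside for high-order atomics */
};

#define ALLOC_HIGHATOMIC 0x1		/* assumed flag: "may use the reserve" */

static bool order0_alloc(struct zone *z, unsigned int alloc_flags)
{
	long usable = z->free_pages - z->highatomic_reserve;

	if (alloc_flags & ALLOC_HIGHATOMIC)
		usable = z->free_pages;		/* the reserve is fair game */

	if (usable <= 0)
		return false;

	z->free_pages--;
	return true;
}

int main(void)
{
	struct zone z = { .free_pages = 4, .highatomic_reserve = 4 };

	printf("without reserve access: %d\n", order0_alloc(&z, 0));
	printf("with reserve access:    %d\n", order0_alloc(&z, ALLOC_HIGHATOMIC));
	return 0;
}
```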

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:53 -05:00
Rafael Aquini fb85eb34b8 mm: fix endless reclaim on machines with unaccepted memory
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 807174a93d24c456503692dc3f5af322ee0b640a
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Fri Aug 9 14:48:47 2024 +0300

    mm: fix endless reclaim on machines with unaccepted memory

    Unaccepted memory is considered unusable free memory, which is not counted
    as free on the zone watermark check.  This causes get_page_from_freelist()
    to accept more memory to hit the high watermark, but it creates problems
    in the reclaim path.

    The reclaim path encounters a failed zone watermark check and attempts to
    reclaim memory.  This is usually successful, but if there is little or no
    reclaimable memory, it can result in endless reclaim with little to no
    progress.  This can occur early in the boot process, just after start of
    the init process when the only reclaimable memory is the page cache of the
    init executable and its libraries.

    Make unaccepted memory free from watermark check point of view.  This way
    unaccepted memory will never be the trigger of memory reclaim.  Accept
    more memory in the get_page_from_freelist() if needed.

    Link: https://lkml.kernel.org/r/20240809114854.3745464-2-kirill.shutemov@linux.intel.com
    Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Reported-by: Jianxiong Gao <jxgao@google.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Tested-by: Jianxiong Gao <jxgao@google.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
    Cc: Tom Lendacky <thomas.lendacky@amd.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>    [6.5+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
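
A rough model of the behaviour described above: unaccepted memory counts as free for the watermark check, so it never drives reclaim, and the allocator accepts more of it only when the zone actually runs short. All fields and helper names below are simplified stand-ins.

```
#include <stdio.h>
#include <stdbool.h>

struct zone {
	long free_pages;
	long unaccepted_pages;	/* present but not yet accepted by the guest */
	long watermark;
};

static bool watermark_ok(const struct zone *z)
{
	/* Unaccepted memory counts as free here, so a zone full of it does
	 * not look like it needs reclaim. */
	return z->free_pages + z->unaccepted_pages >= z->watermark;
}

static void accept_memory_if_needed(struct zone *z, long want)
{
	/* The allocator accepts memory lazily, only when truly short. */
	while (z->free_pages < want && z->unaccepted_pages > 0) {
		z->unaccepted_pages--;
		z->free_pages++;
	}
}

int main(void)
{
	struct zone z = { .free_pages = 1, .unaccepted_pages = 64, .watermark = 32 };

	printf("watermark ok: %d\n", watermark_ok(&z));
	accept_memory_if_needed(&z, 16);
	printf("free after accepting: %ld\n", z.free_pages);
	return 0;
}
```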

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:34 -05:00
Rafael Aquini 622d2a11c3 mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 66eca1021a42856d6af2a9802c99e160278aed91
Author: Li Zhijian <lizhijian@fujitsu.com>
Date:   Tue Jul 23 14:44:28 2024 +0800

    mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()

    It's expected that no page should be left in pcp_list after calling
    zone_pcp_disable() in offline_pages().  Previously, it was observed that
    offline_pages() gets stuck [1] due to some pages remaining in pcp_list.

    Cause:
    There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
    involving the pcp->count variable. See below scenario:

             CPU0                              CPU1
        ----------------                    ---------------
                                          spin_lock(&pcp->lock);
                                          __rmqueue_pcplist() {
    zone_pcp_disable() {
                                            /* list is empty */
                                            if (list_empty(list)) {
                                              /* add pages to pcp_list */
                                              alloced = rmqueue_bulk()
      mutex_lock(&pcp_batch_high_lock)
      ...
      __drain_all_pages() {
        drain_pages_zone() {
          /* read pcp->count, it's 0 here */
          count = READ_ONCE(pcp->count)
          /* 0 means nothing to drain */
                                              /* update pcp->count */
                                              pcp->count += alloced << order;
          ...
                                          ...
                                          spin_unlock(&pcp->lock);

    In this case, even after calling zone_pcp_disable(), there are still some
    pages in pcp_list. And since these pages in pcp_list are neither movable nor
    isolated, offline_pages() gets stuck as a result.

    Solution:
    Expand the scope of the pcp->lock to also protect pcp->count in
    drain_pages_zone(), to ensure no pages are left in the pcp list after
    zone_pcp_disable().

    [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/

    Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com
    Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
    Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
    Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
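
To illustrate the locking change above, the sketch below reads and clears pcp->count only while holding the lock, so a concurrent refill cannot slip in between the read and the drain. It is a userspace model with a pthread mutex standing in for the kernel spinlock, and the structure is reduced to the two fields that matter here.

```
#include <pthread.h>
#include <stdio.h>

struct per_cpu_pages {
	pthread_mutex_t lock;
	int count;		/* pages currently on the pcp list */
};

static void drain_pages_zone(struct per_cpu_pages *pcp)
{
	pthread_mutex_lock(&pcp->lock);
	/* count is now stable: a refill in __rmqueue_pcplist() cannot race
	 * with the read below and leave pages behind after the drain. */
	if (pcp->count)
		pcp->count = 0;	/* free everything back to the buddy */
	pthread_mutex_unlock(&pcp->lock);
}

int main(void)
{
	struct per_cpu_pages pcp = { PTHREAD_MUTEX_INITIALIZER, 31 };

	drain_pages_zone(&pcp);
	printf("pages left on pcp list: %d\n", pcp.count);
	return 0;
}
```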

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:31 -05:00
Rafael Aquini 022010821a mm: refactor folio_undo_large_rmappable()
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * this is a direct port from v6.6 LTS branch backport commit eb6b6d3e1f1e
    ("mm: refactor folio_undo_large_rmappable()"), due to RHEL9 missing
    upstream commit 90491d87dd46 ("mm: add free_unref_folios()") and its
    big accompanying series.

This patch is a backport of the following upstream commit:
commit 593a10dabe08dcf93259fce2badd8dc2528859a8
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue May 21 21:03:15 2024 +0800

    mm: refactor folio_undo_large_rmappable()

    Folios of order <= 1 are not in deferred list, the check of order is added
    into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support
    order-1 folios in the page cache"), but there is a repeated check for
    small folio (order 0) during each call of the
    folio_undo_large_rmappable(), so only keep folio_order() check inside the
    function.

    In addition, move all the checks into header file to save a function call
    for non-large-rmappable or empty deferred_list folio.

    Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Lance Yang <ioworker0@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
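
A minimal sketch of the refactor described above, assuming simplified folio fields: a cheap inline wrapper (what would live in the header) filters out folios that need no work, including the order <= 1 case, so most frees never pay for an out-of-line call.

```
#include <stdio.h>
#include <stdbool.h>

struct folio {
	unsigned int order;
	bool large_rmappable;
	bool on_deferred_list;
};

static void __folio_undo_large_rmappable(struct folio *folio)
{
	/* Real work: remove the folio from the deferred split list. */
	folio->on_deferred_list = false;
}

/* What would live in the header: filter out folios that need no work. */
static inline void folio_undo_large_rmappable(struct folio *folio)
{
	if (folio->order <= 1)		/* small folios are never queued */
		return;
	if (!folio->large_rmappable || !folio->on_deferred_list)
		return;
	__folio_undo_large_rmappable(folio);
}

int main(void)
{
	struct folio f = { .order = 4, .large_rmappable = true,
			   .on_deferred_list = true };

	folio_undo_large_rmappable(&f);
	printf("still on deferred list: %d\n", f.on_deferred_list);
	return 0;
}
```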

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:22 -05:00
Rafael Aquini 307d3c3a96 mm/page_alloc: Separate THP PCP into movable and non-movable categories
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit bf14ed81f571f8dba31cd72ab2e50fbcc877cc31
Author: yangge <yangge1116@126.com>
Date:   Thu Jun 20 08:59:50 2024 +0800

    mm/page_alloc: Separate THP PCP into movable and non-movable categories

    Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
    THP-sized allocations") no longer differentiates the migration type of
    pages in the THP-sized PCP list, it's possible that non-movable allocation
    requests may get a CMA page from the list, which in some cases is not
    acceptable.

    If a large number of CMA memory are configured in system (for example, the
    CMA memory accounts for 50% of the system memory), starting a virtual
    machine with device passthrough will get stuck.  During starting the
    virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM,
    ...) to pin memory.  Normally if a page is present and in CMA area,
    pin_user_pages_remote() will migrate the page from CMA area to non-CMA
    area because of FOLL_LONGTERM flag.  But if non-movable allocation
    requests return CMA memory, migrate_longterm_unpinnable_pages() will
    migrate a CMA page to another CMA page, which will fail to pass the check
    in check_and_migrate_movable_pages() and cause endless migration.

    Call trace:
    pin_user_pages_remote
    --__gup_longterm_locked // endless loops in this function
    ----_get_user_pages_locked
    ----check_and_migrate_movable_pages
    ------migrate_longterm_unpinnable_pages
    --------alloc_migration_target

    This problem will also have a negative impact on CMA itself.  For example,
    when CMA is borrowed by THP, and we need to reclaim it through cma_alloc()
    or dma_alloc_coherent(), we must move those pages out to ensure CMA's
    users can retrieve that contiguous memory.  Currently, CMA's memory is
    occupied by non-movable pages, meaning we can't relocate them.  As a
    result, cma_alloc() is more likely to fail.

    To fix the problem above, we add one PCP list for THP, which will not
    introduce a new cacheline for struct per_cpu_pages.  THP will have 2 PCP
    lists, one PCP list is used by MOVABLE allocation, and the other PCP list
    is used by UNMOVABLE allocation.  MOVABLE allocation contains GFP_MOVABLE,
    and UNMOVABLE allocation contains GFP_UNMOVABLE and GFP_RECLAIMABLE.

    Link: https://lkml.kernel.org/r/1718845190-4456-1-git-send-email-yangge1116@126.com
    Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
    Signed-off-by: yangge <yangge1116@126.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Barry Song <21cnbao@gmail.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
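
The list split above can be illustrated with a toy model in which THP-sized pages sit on two per-cpu lists, selected by whether the request is movable, so an unmovable request can never be handed a (possibly CMA) page cached by a movable one. The enum, counters and helper names below are assumptions made for the sketch.

```
#include <stdio.h>
#include <stdbool.h>

enum thp_pcp_list { THP_PCP_MOVABLE, THP_PCP_UNMOVABLE, NR_THP_PCP };

struct per_cpu_pages {
	int count[NR_THP_PCP];		/* cached THP pages per list */
};

static enum thp_pcp_list thp_list_for(bool movable_request)
{
	return movable_request ? THP_PCP_MOVABLE : THP_PCP_UNMOVABLE;
}

static bool take_thp(struct per_cpu_pages *pcp, bool movable_request)
{
	enum thp_pcp_list list = thp_list_for(movable_request);

	if (pcp->count[list] == 0)
		return false;	/* refill from buddy with the matching type */
	pcp->count[list]--;
	return true;
}

int main(void)
{
	struct per_cpu_pages pcp = { .count = { [THP_PCP_MOVABLE] = 3 } };

	/* An unmovable request misses even though movable (possibly CMA)
	 * THPs are cached, which is exactly the point of the split. */
	printf("unmovable hit: %d\n", take_thp(&pcp, false));
	printf("movable hit:   %d\n", take_thp(&pcp, true));
	return 0;
}
```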

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:17 -05:00
Rafael Aquini e18b41a3d1 mm: page_alloc: use the correct THP order for THP PCP
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 6303d1c553c8d758f068de70a41668622b7a917c
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Apr 3 21:47:21 2024 +0800

    mm: page_alloc: use the correct THP order for THP PCP

    Commit 44042b4498 ("mm/page_alloc: allow high-order pages to be stored
    on the per-cpu lists") extends the PCP allocator to store THP pages, and
    it determines whether to cache THP pages in PCP by comparing with
    pageblock_order.  But the pageblock_order is not always equal to THP
    order.  It might also be MAX_PAGE_ORDER, which could prevent PCP from
    caching THP pages.

    Therefore, using HPAGE_PMD_ORDER instead to determine the need for caching
    THP for PCP will fix this issue.

    Link: https://lkml.kernel.org/r/a25c9e14cd03907d5978b60546a69e6aa3fc2a7d.1712151833.git.baolin.wang@linux.alibaba.com
    Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
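
A small sketch of the check described above, with illustrative constants: PCP eligibility for large orders is decided against the THP order rather than pageblock_order, which may be larger and would then never match a THP.

```
#include <stdio.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define HPAGE_PMD_ORDER         9	/* 2 MiB THP with 4 KiB base pages */

static bool pcp_allowed_order(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
	/* Before the fix this compared against pageblock_order, which can be
	 * MAX_PAGE_ORDER on some configurations and then never matches a THP. */
	return order == HPAGE_PMD_ORDER;
}

int main(void)
{
	printf("order 9 (THP) cacheable on PCP: %d\n", pcp_allowed_order(9));
	printf("order 10 cacheable on PCP:      %d\n", pcp_allowed_order(10));
	return 0;
}
```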

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:03 -05:00
Rafael Aquini 533a2348af mm: page_alloc: control latency caused by zone PCP draining
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 55f77df7d715110299f12c27f4365bd6332d1adb
Author: Lucas Stach <l.stach@pengutronix.de>
Date:   Mon Mar 18 21:07:36 2024 +0100

    mm: page_alloc: control latency caused by zone PCP draining

    Patch series "mm/treewide: Remove pXd_huge() API", v2.

    In previous work [1], we removed the pXd_large() API, which is arch
    specific.  This patchset further removes the hugetlb pXd_huge() API.

    Hugetlb was never special on creating huge mappings when compared with
    other huge mappings.  Having a standalone API just to detect such pgtable
    entries is more or less redundant, especially after the pXd_leaf() API set
    is introduced with/without CONFIG_HUGETLB_PAGE.

    When looking at this problem, a few issues are also exposed that we don't
    have a clear definition of the *_huge() variance API.  This patchset
    started by cleaning these issues first, then replace all *_huge() users to
    use *_leaf(), then drop all *_huge() code.

    On x86/sparc, swap entries will be reported "true" in pXd_huge(), while
    for all the rest archs they're reported "false" instead.  This part is
    done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix,
    but I'll leave that to hmm experts to decide.

    Besides, there are three archs (arm, arm64, powerpc) that have slightly
    different definitions between the *_huge() v.s.  *_leaf() variances.  I
    tackled them separately so that it'll be easier for arch experts to chime
    in when necessary.  This part is done in patch 6-9.

    The final patches 10-14 do the rest on the final removal, since *_leaf()
    will be the ultimate API in the future, and we seem to have quite some
    confusions on how *_huge() APIs can be defined, provide a rich comment for
    *_leaf() API set to define them properly to avoid future misuse, and
    hopefully that'll also help new archs to start support huge mappings and
    avoid traps (like either swap entries, or PROT_NONE entry checks).

    [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com

    This patch (of 14):

    When the complete PCP is drained a much larger number of pages than the
    usual batch size might be freed at once, causing large IRQ and preemption
    latency spikes, as they are all freed while holding the pcp and zone
    spinlocks.

    To avoid those latency spikes, limit the number of pages freed in a single
    bulk operation to common batch limits.

    Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de
    Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andreas Larsson <andreas@gaisler.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Bjorn Andersson <andersson@kernel.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Fabio Estevam <festevam@denx.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
    Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
    Cc: Mark Salter <msalter@redhat.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
    Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Shawn Guo <shawnguo@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
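
The latency fix above amounts to draining the per-cpu list in batch-sized chunks and dropping the lock between rounds. Below is a userspace sketch of that loop; the mutex, field names and batch value are stand-ins for the kernel's spinlock and tunables.

```
#include <pthread.h>
#include <stdio.h>

struct per_cpu_pages {
	pthread_mutex_t lock;
	int count;		/* pages on the per-cpu free list */
	int batch;		/* normal bulk-free batch size */
};

static void drain_zone_pages(struct per_cpu_pages *pcp)
{
	int to_drain;

	do {
		pthread_mutex_lock(&pcp->lock);
		to_drain = pcp->count < pcp->batch ? pcp->count : pcp->batch;
		pcp->count -= to_drain;	/* free this chunk back to the buddy */
		pthread_mutex_unlock(&pcp->lock);
		/* lock dropped here: IRQs/preemption are not held off for
		 * the entire list, only for one batch at a time */
	} while (to_drain);
}

int main(void)
{
	struct per_cpu_pages pcp = { PTHREAD_MUTEX_INITIALIZER, 10000, 63 };

	drain_zone_pages(&pcp);
	printf("remaining: %d\n", pcp.count);
	return 0;
}
```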

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:54 -05:00
Rafael Aquini c642587310 mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 803de9000f334b771afacb6ff3e78622916668b0
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Wed Feb 21 12:43:58 2024 +0100

    mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations

    Sven reports an infinite loop in __alloc_pages_slowpath() for costly order
    __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO.  Such combination
    can happen in a suspend/resume context where a GFP_KERNEL allocation can
    have __GFP_IO masked out via gfp_allowed_mask.

    Quoting Sven:

    1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
       with __GFP_RETRY_MAYFAIL set.

    2. page alloc's __alloc_pages_slowpath tries to get a page from the
       freelist. This fails because there is nothing free of that costly
       order.

    3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
       which bails out because a zone is ready to be compacted; it pretends
       to have made a single page of progress.

    4. page alloc tries to compact, but this always bails out early because
       __GFP_IO is not set (it's not passed by the snd allocator, and even
       if it were, we are suspending so the __GFP_IO flag would be cleared
       anyway).

    5. page alloc believes reclaim progress was made (because of the
       pretense in item 3) and so it checks whether it should retry
       compaction. The compaction retry logic thinks it should try again,
       because:
        a) reclaim is needed because of the early bail-out in item 4
        b) a zonelist is suitable for compaction

    6. goto 2. indefinite stall.

    (end quote)

    The immediate root cause is confusing the COMPACT_SKIPPED returned from
    __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
    indicating a lack of order-0 pages, and in step 5 evaluating that in
    should_compact_retry() as a reason to retry, before incrementing and
    limiting the number of retries.  There are however other places that
    wrongly assume that compaction can happen while we lack __GFP_IO.

    To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
    evaluation and switch the open-coded test in try_to_compact_pages() to use
    it.

    Also use the new helper in:
    - compaction_ready(), which will make reclaim not bail out in step 3, so
      there's at least one attempt to actually reclaim, even if chances are
      small for a costly order
    - in_reclaim_compaction() which will make should_continue_reclaim()
      return false and we don't over-reclaim unnecessarily
    - in __alloc_pages_slowpath() to set a local variable can_compact,
      which is then used to avoid retrying reclaim/compaction for costly
      allocations (step 5) if we can't compact and also to skip the early
      compaction attempt that we do in some cases

    Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz
    Fixes: 3250845d05 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reported-by: Sven van Ashbrook <svenva@chromium.org>
    Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/
    Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Curtis Malainey <cujomalainey@chromium.org>
    Cc: Jaroslav Kysela <perex@perex.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Takashi Iwai <tiwai@suse.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
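
A minimal sketch of the helper introduced above and one of its call sites: compaction needs __GFP_IO, so a retry decision first asks whether compaction is allowed for the gfp mask at all. The flag values and the retry cap below are illustrative, not the kernel's.

```
#include <stdio.h>
#include <stdbool.h>

#define __GFP_IO            (1u << 0)
#define __GFP_RETRY_MAYFAIL (1u << 1)

typedef unsigned int gfp_t;

static inline bool gfp_compaction_allowed(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_IO) != 0;
}

static bool should_retry_costly_alloc(gfp_t gfp_mask, int retries)
{
	/* Without this check, a costly GFP_NOIO | __GFP_RETRY_MAYFAIL request
	 * could loop forever waiting for compaction that can never run. */
	if (!gfp_compaction_allowed(gfp_mask))
		return false;
	return retries < 16;
}

int main(void)
{
	gfp_t noio = __GFP_RETRY_MAYFAIL;	/* __GFP_IO masked out */

	printf("retry without __GFP_IO: %d\n", should_retry_costly_alloc(noio, 0));
	printf("retry with __GFP_IO:    %d\n",
	       should_retry_costly_alloc(noio | __GFP_IO, 0));
	return 0;
}
```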

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:27 -05:00
Rafael Aquini c8c9c0b259 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * arch/*/Kconfig: all hunks dropped as there were only text blurbs and comments
     being changed with no functional changes whatsoever, and RHEL9 is missing
     several (unrelated) commits to these arches that transform the text blurbs in
     the way these non-functional hunks were expecting;
  * drivers/accel/qaic/qaic_data.c: hunk dropped due to RHEL-only commit
     083c0cdce2 ("Merge DRM changes from upstream v6.8..v6.9");
  * drivers/gpu/drm/i915/gem/selftests/huge_pages.c: hunk dropped due to RHEL-only
     commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * drivers/gpu/drm/ttm/tests/ttm_pool_test.c: all hunks dropped due to RHEL-only
     commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * drivers/video/fbdev/vermilion/vermilion.c: hunk dropped as RHEL9 misses
     commit dbe7e429fe ("vmlfb: framebuffer driver for Intel Vermilion Range");
  * include/linux/pageblock-flags.h: differences due to out-of-order backport
    of upstream commits 72801513b2bf ("mm: set pageblock_order to HPAGE_PMD_ORDER
    in case with !CONFIG_HUGETLB_PAGE but THP enabled"), and 3a7e02c040b1
    ("minmax: avoid overly complicated constant expressions in VM code");
  * mm/mm_init.c: differences on the 3rd, and 4th hunks are due to RHEL
     backport commit 1845b92dcf ("mm: move most of core MM initialization to
     mm/mm_init.c") ignoring the out-of-order backport of commit 3f6dac0fd1b8
     ("mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks")
     thus partially reverting the changes introduced by the latter;

This patch is a backport of the following upstream commit:
commit 5e0a760b44417f7cadd79de2204d6247109558a0
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Thu Dec 28 17:47:04 2023 +0300

    mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

    commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
    changed the definition of MAX_ORDER to be inclusive.  This has caused
    issues with code that was not yet upstream and depended on the previous
    definition.

    To draw attention to the altered meaning of the define, rename MAX_ORDER
    to MAX_PAGE_ORDER.

    Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:17 -05:00
Rafael Aquini 9f578eff61 mm, treewide: introduce NR_PAGE_ORDERS
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * drivers/gpu/drm/*, include/drm/ttm/ttm_pool.h: all hunks dropped due to
    RHEL-only commit ca8b16c11b ("Merge DRM changes from upstream v6.7..v6.8");
  * include/linux/mmzone.h: 3rd hunk dropped due to RHEL-only commit
    afa0ca9cf7 ("Partial backport of mm, treewide: introduce NR_PAGE_ORDERS");

This patch is a backport of the following upstream commit:
commit fd37721803c6e73619108f76ad2e12a9aa5fafaf
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Thu Dec 28 17:47:03 2023 +0300

    mm, treewide: introduce NR_PAGE_ORDERS

    NR_PAGE_ORDERS defines the number of page orders supported by the page
    allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.

    NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
    more natural iteration over them.

    [kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
      Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
    Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
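
A tiny example of what the new constant buys, using the terms from the message above (the MAX_ORDER value here is illustrative): arrays indexed by order and loops over every order no longer need the error-prone "+ 1" arithmetic.

```
#include <stdio.h>

#define MAX_ORDER      10
#define NR_PAGE_ORDERS (MAX_ORDER + 1)

int main(void)
{
	unsigned long nr_free[NR_PAGE_ORDERS];

	/* Iterating "over all orders" no longer needs a manual "+ 1". */
	for (int order = 0; order < NR_PAGE_ORDERS; order++)
		nr_free[order] = 1ul << order;	/* pages per block of this order */

	printf("orders tracked: %d (0..%d)\n", NR_PAGE_ORDERS, MAX_ORDER);
	printf("pages in a max-order block: %lu\n", nr_free[MAX_ORDER]);
	return 0;
}
```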

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:14 -05:00
Rafael Aquini 15d43a67db mm: add page_rmappable_folio() wrapper
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 23e4883248f0472d806c8b3422ba6257e67bf1a5
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 3 02:25:33 2023 -0700

    mm: add page_rmappable_folio() wrapper

    folio_prep_large_rmappable() is being used repeatedly along with a
    conversion from page to folio, a non-NULL check, an order > 1 check: wrap
    it all up into struct folio *page_rmappable_folio(struct page *).

    Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Tejun heo <tj@kernel.org>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:15 -05:00
Rafael Aquini f7632ac537 mm: page_alloc: check the order of compound page even when the order is zero
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 76f26535d1446373d4735a252ea4247c39d64ba6
Author: Hyesoo Yu <hyesoo.yu@samsung.com>
Date:   Mon Oct 23 17:32:16 2023 +0900

    mm: page_alloc: check the order of compound page even when the order is zero

    For compound pages, the head sets the PG_head flag and the tail sets the
    compound_head to indicate the head page.  If a user allocates a compound
    page and frees it with a different order, the compound page information
    will not be properly initialized.  To detect this problem,
    compound_order(page) and the order argument are compared, but this is not
    checked when the order argument is zero.  That error should be checked
    regardless of the order.

    Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com
    Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
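
To illustrate the check described above: on free, the order recorded in the compound page is compared with the order the caller passed, even when that order is zero. The page layout and helpers below are toy stand-ins.

```
#include <stdio.h>
#include <stdbool.h>

struct page {
	bool head;		/* PG_head: this is a compound head page */
	unsigned int order;	/* order recorded when it was allocated */
};

static unsigned int compound_order(const struct page *page)
{
	return page->head ? page->order : 0;
}

static bool free_page_check(const struct page *page, unsigned int order)
{
	/* Previously this comparison was skipped when order == 0. */
	if (compound_order(page) != order) {
		fprintf(stderr, "bad free: compound order %u, freeing order %u\n",
			compound_order(page), order);
		return false;
	}
	return true;
}

int main(void)
{
	struct page thp = { .head = true, .order = 9 };

	/* Freeing a compound page as a single order-0 page is now caught. */
	printf("bad free detected: %d\n", !free_page_check(&thp, 0));
	return 0;
}
```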

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:07 -05:00
CKI Backport Bot d700461f87 mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
JIRA: https://issues.redhat.com/browse/RHEL-69700
CVE: CVE-2024-53113

commit 8ce41b0f9d77cca074df25afd39b86e2ee3aa68e
Author: Jinjiang Tu <tujinjiang@huawei.com>
Date:   Wed Nov 13 16:32:35 2024 +0800

    mm: fix NULL pointer dereference in alloc_pages_bulk_noprof

    We triggered a NULL pointer dereference for ac.preferred_zoneref->zone in
    alloc_pages_bulk_noprof() when the task is migrated between cpusets.

    When cpuset is enabled, in prepare_alloc_pages(), ac->nodemask may be
    &current->mems_allowed.  When first_zones_zonelist() is called to find
    preferred_zoneref, the ac->nodemask may be modified concurrently if the
    task is migrated between different cpusets.  Assuming we have 2 NUMA nodes,
    when traversing Node1 in ac->zonelist, the nodemask is 2, and when
    traversing Node2 in ac->zonelist, the nodemask is 1.  As a result, the
    ac->preferred_zoneref points to NULL zone.

    In alloc_pages_bulk_noprof(), for_each_zone_zonelist_nodemask() finds an
    allowable zone and calls zonelist_node_idx(ac.preferred_zoneref), leading
    to a NULL pointer dereference.

    __alloc_pages_noprof() fixes this issue by checking NULL pointer in commit
    ea57485af8 ("mm, page_alloc: fix check for NULL preferred_zone") and
    commit df76cee6bb ("mm, page_alloc: remove redundant checks from alloc
    fastpath").

    To fix it, check NULL pointer for preferred_zoneref->zone.

    Link: https://lkml.kernel.org/r/20241113083235.166798-1-tujinjiang@huawei.com
    Fixes: 387ba26fb1 ("mm/page_alloc: add a bulk page allocator")
    Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Alexander Lobakin <alobakin@pm.me>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-12-02 14:42:53 +00:00
Rafael Aquini c6fc2dab3f mm: add large_rmappable page flag
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit de53c05f2ae3d47d30db58e9c4e54e3bbc868377
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:56 2023 +0100

    mm: add large_rmappable page flag

    Stored in the first tail page's flags, this flag replaces the destructor.
    That removes the last of the destructors, so remove all references to
    folio_dtor and compound_dtor.

    Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:51 -04:00
Rafael Aquini 980ab30d90 mm: remove HUGETLB_PAGE_DTOR
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * mm/hugetlb.c: conflict on the 4th hunk due to out-of-order backport of
      commit d8f5f7e445f0 ("hugetlb: set hugetlb page flag before optimizing vmemmap")

This patch is a backport of the following upstream commit:
commit 9c5ccf2db04b8d7c3df363fdd4856c2b79ab2c6a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:55 2023 +0100

    mm: remove HUGETLB_PAGE_DTOR

    We can use a bit in page[1].flags to indicate that this folio belongs to
    hugetlb instead of using a value in page[1].dtors.  That lets
    folio_test_hugetlb() become an inline function like it should be.  We can
    also get rid of NULL_COMPOUND_DTOR.

    Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:50 -04:00
Rafael Aquini bb9af3555c mm: remove free_compound_page() and the compound_page_dtors array
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * include/linux/mm.h: due to a series of backports sorting out-of-order
      conflicts, the prototype for nr_free_buffer_pages() ended up in the
      wrong context wrt its current upstream placement. To avoid further
      silly conflicts in the future, we're finally fixing its placement
      with this backport.

This patch is a backport of the following upstream commit:
commit 0f2f43fabb95192c73b19586ef7536d7ac7c2f8c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:54 2023 +0100

    mm: remove free_compound_page() and the compound_page_dtors array

    The only remaining destructor is free_compound_page().  Inline it into
    destroy_large_folio() and remove the array it used to live in.

    Link: https://lkml.kernel.org/r/20230816151201.3655946-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:49 -04:00
Rafael Aquini 6e53c42dda mm: convert prep_transhuge_page() to folio_prep_large_rmappable()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit da6e7bf3a0315025e4199d599bd31763f0df3b4a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:53 2023 +0100

    mm: convert prep_transhuge_page() to folio_prep_large_rmappable()

    Match folio_undo_large_rmappable(), and move the casting from page to
    folio into the callers (which they were largely doing anyway).

    Link: https://lkml.kernel.org/r/20230816151201.3655946-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:48 -04:00
Rafael Aquini d987bc96eb mm: convert free_transhuge_folio() to folio_undo_large_rmappable()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 8dc4a8f1e038189cb575f89bcd23364698b88cc1
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:52 2023 +0100

    mm: convert free_transhuge_folio() to folio_undo_large_rmappable()

    Indirect calls are expensive, thanks to Spectre.  Test for
    TRANSHUGE_PAGE_DTOR and destroy the folio appropriately.  Move the
    free_compound_page() call into destroy_large_folio() to simplify later
    patches.

    Link: https://lkml.kernel.org/r/20230816151201.3655946-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:47 -04:00
Rafael Aquini a70f0dc41c mm: convert free_huge_page() to free_huge_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 454a00c40a21c59e99c526fe8cc57bd029cf8f0e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:51 2023 +0100

    mm: convert free_huge_page() to free_huge_folio()

    Pass a folio instead of the head page to save a few instructions.  Update
    the documentation, at least in English.

    Link: https://lkml.kernel.org/r/20230816151201.3655946-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:47 -04:00
Rafael Aquini e64e619a6d mm: call free_huge_page() directly
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit dd6fa0b61814492476463149c91110e529364e82
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:50 2023 +0100

    mm: call free_huge_page() directly

    Indirect calls are expensive, thanks to Spectre.  Call free_huge_page()
    directly if the folio belongs to hugetlb.

    Link: https://lkml.kernel.org/r/20230816151201.3655946-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:46 -04:00
Rafael Aquini 016a93d5e2 mm/page_alloc: use get_pfnblock_migratetype to avoid extra page_to_pfn
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit b5ffd2973365298c4d829802653133038837a6f9
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Fri Aug 11 19:59:45 2023 +0800

    mm/page_alloc: use get_pfnblock_migratetype to avoid extra page_to_pfn

    We have get_pageblock_migratetype and get_pfnblock_migratetype to get
    migratetype of page.  get_pfnblock_migratetype accepts both page and pfn
    from caller while get_pageblock_migratetype only accept page and get pfn
    with page_to_pfn from page.

    In case we already record pfn of page, we can simply call
    get_pfnblock_migratetype to avoid a page_to_pfn.

    Link: https://lkml.kernel.org/r/20230811115945.3423894-3-shikemeng@huaweicloud.com
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
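
A sketch of the micro-optimisation above: when the caller already has the pfn, it calls the pfn-based helper directly instead of letting the page-based one recompute it via page_to_pfn(). The types below and the "cost" of page_to_pfn() are simulated for illustration only.

```
#include <stdio.h>

struct page { unsigned long pfn; int migratetype; };

static unsigned long page_to_pfn(const struct page *page)
{
	return page->pfn;	/* pretend this lookup is the costly part */
}

static int get_pfnblock_migratetype(const struct page *page, unsigned long pfn)
{
	(void)pfn;		/* the real helper uses pfn to find the pageblock */
	return page->migratetype;
}

static int get_pageblock_migratetype(const struct page *page)
{
	return get_pfnblock_migratetype(page, page_to_pfn(page));
}

int main(void)
{
	struct page p = { .pfn = 0x61f45, .migratetype = 1 };
	unsigned long pfn = 0x61f45;	/* caller already recorded the pfn */

	/* Preferred when the pfn is at hand: no extra page_to_pfn(). */
	printf("migratetype: %d\n", get_pfnblock_migratetype(&p, pfn));
	/* Still correct, but recomputes the pfn internally. */
	printf("migratetype: %d\n", get_pageblock_migratetype(&p));
	return 0;
}
```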

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:39 -04:00
Rafael Aquini de1bb2803d mm/page_alloc: remove unnecessary inner __get_pfnblock_flags_mask
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit a04d12c2481fbf2752b5686d8a8049dd59e61e37
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Fri Aug 11 19:59:44 2023 +0800

    mm/page_alloc: remove unnecessary inner __get_pfnblock_flags_mask

    Patch series "Two minor cleanups for get pageblock migratetype".

    This series contains two minor cleanups for get pageblock migratetype.
    More details can be found in respective patches.

    This patch (of 2):

    get_pfnblock_flags_mask() just calls inline inner
    __get_pfnblock_flags_mask without any extra work.  Just opencode
    __get_pfnblock_flags_mask in get_pfnblock_flags_mask and replace call to
    __get_pfnblock_flags_mask with call to get_pfnblock_flags_mask to remove
    unnecessary __get_pfnblock_flags_mask.

    Link: https://lkml.kernel.org/r/20230811115945.3423894-1-shikemeng@huaweicloud.com
    Link: https://lkml.kernel.org/r/20230811115945.3423894-2-shikemeng@huaweicloud.com
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:38 -04:00
Rafael Aquini 6f0996c6e1 mm: page_alloc: remove unused parameter from reserve_highatomic_pageblock()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 368d983b985572e33422432d849f5956268bce21
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Wed Aug 9 15:33:23 2023 +0800

    mm: page_alloc: remove unused parameter from reserve_highatomic_pageblock()

    Just remove the redundant parameter alloc_order from
    reserve_highatomic_pageblock(). No functional modification involved.

    Link: https://lkml.kernel.org/r/20230809073323.1065286-1-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:36 -04:00
Rafael Aquini 2643f23bfc mm/page_alloc: remove unneeded variable base
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit c1dc69e6ce65d95e3d4d080868c3007f1a6fc4fe
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 3 19:49:34 2023 +0800

    mm/page_alloc: remove unneeded variable base

    Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
    THP-sized allocations"), local variable base is just the same as order.  So
    remove it.  No functional change intended.

    Link: https://lkml.kernel.org/r/20230803114934.693989-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:47 -04:00
Rafael Aquini 334b208e9b mm/page_alloc: avoid unneeded alike_pages calculation
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit ebddd111fcd13fefd7350f77201dfc5605672909
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 1 20:37:23 2023 +0800

    mm/page_alloc: avoid unneeded alike_pages calculation

    When free_pages is 0, alike_pages is not used.  So alike_pages calculation
    can be avoided by checking free_pages early to save cpu cycles.  Also fix
    typo 'comparable'.  It should be 'compatible' here.

    Link: https://lkml.kernel.org/r/20230801123723.2225543-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:12 -04:00
Rafael Aquini 2492503ed4 mm: page_alloc: avoid false page outside zone error info
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 82d9b8c85b7e4bd85c679ac2da26b57224c4999d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 4 19:18:23 2023 +0800

    mm: page_alloc: avoid false page outside zone error info

    If pfn is outside zone boundaries in the first round, ret will be set to
    1.  But if pfn is changed to inside the zone boundaries in zone span
    seqretry path, ret is still set to 1 leading to false page outside zone
    error info.

    This is from code inspection.  The race window should be really small thus
    hard to trigger in real world.

    [akpm@linux-foundation.org: code simplification, per Matthew]
    Link: https://lkml.kernel.org/r/20230704111823.940331-1-linmiaohe@huawei.com
    Fixes: bdc8cb9845 ("[PATCH] memory hotplug locking: zone span seqlock")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:45 -04:00
Rafael Aquini accaf9e40b mm: page_alloc: use the correct type of list for free pages
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 1bf61092bc90a9054d01cfdf35b42c1bf6fe47c7
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Jun 21 16:14:28 2023 +0800

    mm: page_alloc: use the correct type of list for free pages

    Commit bf75f200569d ("mm/page_alloc: add page->buddy_list and
    page->pcp_list") introduces page->buddy_list and page->pcp_list as a union
    with page->lru, but missed updating get_page_from_free_area() to use
    page->buddy_list, which would clarify the correct type of list for a free
    page.

    Link: https://lkml.kernel.org/r/7e7ab533247d40c0ea0373c18a6a48e5667f9e10.1687333557.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:27 -04:00
Rafael Aquini 646c680a0c mm: page_alloc: make compound_page_dtors static
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit cf01724e2d73a90524450e3dd8798cfb9d7aca05
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jun 17 11:46:22 2023 +0800

    mm: page_alloc: make compound_page_dtors static

    It's only used inside page_alloc.c now. So make it static and remove the
    declaration in mm.h.

    Link: https://lkml.kernel.org/r/20230617034622.1235913-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:21 -04:00
Rafael Aquini 7bec5764a9 mm: page_alloc: remove unneeded header files
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 12dd992accd96c09ec765a3bb4706a3ed24ae5b3
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jun 3 19:25:58 2023 +0800

    mm: page_alloc: remove unneeded header files

    Remove some unneeded header files. No functional change intended.

    Link: https://lkml.kernel.org/r/20230603112558.213694-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:28 -04:00
Rafael Aquini cf4d465e89 mm: compaction: simplify should_compact_retry()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 511a69b27fe6c2d7312789bd9e2e40b00e3903ef
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Fri May 19 14:39:56 2023 +0200

    mm: compaction: simplify should_compact_retry()

    The different branches for retry are unnecessarily complicated.  There are
    really only three outcomes: progress (retry n times), skipped (retry if
    reclaim can help), failed (retry with higher priority).

    Rearrange the branches and the retry counter to make it simpler.

    [hannes@cmpxchg.org: restore behavior when hitting max_retries]
      Link: https://lkml.kernel.org/r/20230602144705.GB161817@cmpxchg.org
    Link: https://lkml.kernel.org/r/20230519123959.77335-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:42 -04:00
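
A hedged sketch of the three-outcome retry structure the commit describes (progress, skipped, failed); the outcome names, constants, and escalation step are illustrative, not the kernel's should_compact_retry():

```c
/*
 * Three outcomes: progress -> retry a bounded number of times; skipped ->
 * retry only if reclaim could free more memory; failed -> retry at a
 * higher compaction priority while one is available.
 */
#include <stdbool.h>
#include <stdio.h>

enum retry_outcome { OUTCOME_PROGRESS, OUTCOME_SKIPPED, OUTCOME_FAILED };

#define MAX_COMPACT_RETRIES  16   /* illustrative bound */
#define MIN_COMPACT_PRIORITY 0

static bool should_retry(enum retry_outcome outcome, int *retries,
                         int *priority, bool reclaim_may_help)
{
    switch (outcome) {
    case OUTCOME_PROGRESS:            /* progress: bounded retries */
        return ++(*retries) <= MAX_COMPACT_RETRIES;
    case OUTCOME_SKIPPED:             /* skipped: only if reclaim can help */
        return reclaim_may_help;
    case OUTCOME_FAILED:              /* failed: escalate priority if possible */
        if (*priority > MIN_COMPACT_PRIORITY) {
            (*priority)--;
            *retries = 0;
            return true;
        }
        return false;
    }
    return false;
}

int main(void)
{
    int retries = 0, priority = 2;
    printf("%d\n", should_retry(OUTCOME_FAILED, &retries, &priority, false));  /* 1: escalated */
    printf("%d\n", should_retry(OUTCOME_SKIPPED, &retries, &priority, false)); /* 0 */
    return 0;
}
```
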
Rafael Aquini 4a2d05d5ec mm: compaction: remove compaction result helpers
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit ecd8b2928f2efc7b678b361d51920c15b5ef3885
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Fri May 19 14:39:55 2023 +0200

    mm: compaction: remove compaction result helpers

    Patch series "mm: compaction: cleanups & simplifications".

    These compaction cleanups are split out from the huge page allocator
    series[1], as requested by reviewer feedback.

    [1] https://lore.kernel.org/linux-mm/20230418191313.268131-1-hannes@cmpxchg.org/

    This patch (of 5):

    The compaction result helpers encode quirks that are specific to the
    allocator's retry logic.  E.g.  COMPACT_SUCCESS and COMPACT_COMPLETE
    actually represent failures that should be retried upon, and so on.  I
    frequently found myself pulling up the helper implementation in order to
    understand and work on the retry logic.  They're not quite clean
    abstractions; rather they split the retry logic into two locations.

    Remove the helpers and inline the checks.  Then comment on the result
    interpretations directly where the decision making happens.

    Link: https://lkml.kernel.org/r/20230519123959.77335-1-hannes@cmpxchg.org
    Link: https://lkml.kernel.org/r/20230519123959.77335-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:42 -04:00
Rafael Aquini af18d2ce98 mm: page_alloc: set sysctl_lowmem_reserve_ratio storage-class-specifier to static
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 62069aace145658bc8ce79cbf7b6cf611db4a22f
Author: Tom Rix <trix@redhat.com>
Date:   Thu May 18 10:11:19 2023 -0400

    mm: page_alloc: set sysctl_lowmem_reserve_ratio storage-class-specifier to static

    smatch reports
    mm/page_alloc.c:247:5: warning: symbol
      'sysctl_lowmem_reserve_ratio' was not declared. Should it be static?

    This variable is only used in its defining file, so it should be static

    Link: https://lkml.kernel.org/r/20230518141119.927074-1-trix@redhat.com
    Signed-off-by: Tom Rix <trix@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:39 -04:00
Rafael Aquini 7125e59119 mm/page_alloc: drop the unnecessary pfn_valid() for start pfn
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 3c4322c94b9af33dc62e47cb80c057f9814fb595
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Mon Apr 24 21:45:39 2023 +0800

    mm/page_alloc: drop the unnecessary pfn_valid() for start pfn

    __pageblock_pfn_to_page() currently performs both a pfn_valid() check and
    pfn_to_online_page().  The former is redundant because the latter is a
    stronger check.  Drop pfn_valid().

    Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:14 -04:00
Chris von Recklinghausen 8bfb38d72a mm and cache_info: remove unnecessary CPU cache info update
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 5cec4eb7fad6fb1e9a3dd8403b558d1eff7490ff
Author: Huang Ying <ying.huang@intel.com>
Date:   Fri Jan 26 16:19:44 2024 +0800

    mm and cache_info: remove unnecessary CPU cache info update

    For each CPU hotplug event, we will update the per-CPU data slice size
    and the corresponding PCP configuration for every online CPU, to keep the
    implementation simple.  But Kyle reported that this takes tens of seconds
    during boot on a machine with 34 zones and 3840 CPUs.

    So, in this patch, for each CPU hotplug event, we only update the per-CPU
    data slice size and the corresponding PCP configuration for the CPUs that
    share caches with the hotplugged CPU.  With the patch, the system boot
    time is reduced by 67 seconds on that machine.

    Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com
    Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Originally-by: Kyle Meyer <kyle.meyer@hpe.com>
    Reported-and-tested-by: Kyle Meyer <kyle.meyer@hpe.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:13 -04:00
Chris von Recklinghausen 6d5dc73b7d mm, pcp: reduce detecting time of consecutive high order page freeing
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 6ccdcb6d3a741c4e005ca6ffd4a62ddf8b5bead3
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:30:02 2023 +0800

    mm, pcp: reduce detecting time of consecutive high order page freeing

    In the current PCP auto-tuning design, if the number of pages allocated
    on a CPU is much larger than the number of pages freed, the PCP high may
    become the maximal value even if the allocating/freeing depth is small,
    for example in the sender of network workloads.  If a CPU that was
    originally used as a sender is then used as a receiver after a context
    switch, the whole PCP has to be filled up to the maximal high before PCP
    draining for consecutive high-order freeing is triggered.  This hurts the
    performance of some network workloads.

    To solve the issue, in this patch we track consecutive page freeing with
    a counter instead of relying on PCP draining, so consecutive page freeing
    can be detected much earlier.

    On a 2-socket Intel server with 128 logical CPU, we tested
    SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
    With the patch, the network bandwidth improves 5.0%.  This restores the
    performance drop caused by PCP auto-tuning.

    Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:13 -04:00
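
A small user-space model of the consecutive-free counter described above; the field names (free_count, batch) are modelled on the commit text and the struct layout shown later in the series, and the thresholds are illustrative:

```c
/*
 * A counter grows on frees and shrinks on allocations; once it exceeds a
 * threshold, the free path switches to free_high behaviour (drain).
 */
#include <stdbool.h>
#include <stdio.h>

struct pcp_model {
    int batch;
    int free_count;      /* net run of recent frees */
};

static void note_alloc(struct pcp_model *p, int pages)
{
    p->free_count -= pages;
    if (p->free_count < 0)
        p->free_count = 0;
}

static bool note_free(struct pcp_model *p, int pages)
{
    p->free_count += pages;
    /* long enough run of frees: drain aggressively (free_high) */
    return p->free_count >= p->batch;
}

int main(void)
{
    struct pcp_model p = { .batch = 63, .free_count = 0 };
    note_alloc(&p, 10);
    for (int i = 0; i < 8; i++)
        printf("free_high=%d\n", note_free(&p, 16));
    return 0;
}
```
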
Chris von Recklinghausen fa5b2e0df3 mm, pcp: decrease PCP high if free pages < high watermark
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 57c0419c5f0ea2ccab8700895c8fac20ba1eb21f
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:30:01 2023 +0800

    mm, pcp: decrease PCP high if free pages < high watermark

    One target of PCP is to minimize the pages held in PCP when the system's
    free pages are too few.  To reach that target, when page reclaiming is
    active for the zone (ZONE_RECLAIM_ACTIVE), we stop increasing PCP high in
    the allocating path, and decrease PCP high and free some pages in the
    freeing path.  But this may be too late, because the background page
    reclaiming may introduce latency for some workloads.  So, in this patch,
    during page allocation we detect whether the number of free pages in the
    zone is below the high watermark.  If so, we stop increasing PCP high in
    the allocating path, and decrease PCP high and free some pages in the
    freeing path.  With this, we reduce the possibility of premature
    background page reclaiming caused by a too-large PCP.

    The high watermark check is done in the allocating path to keep the
    overhead out of the hotter freeing path.

    Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:13 -04:00
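
A sketch of the watermark clamp described above, assuming illustrative field names and an arbitrary decay step; this is not the kernel's heuristics:

```c
/*
 * On the allocation path, when the zone's free pages drop below the high
 * watermark, pcp->high is no longer grown and is nudged back toward
 * high_min, so fewer pages sit idle in the per-CPU cache.
 */
#include <stdio.h>

struct zone_model { long free_pages, high_wmark; };
struct pcp_model  { int high, high_min, high_max; };

static void tune_pcp_high_on_alloc(struct pcp_model *pcp, const struct zone_model *z)
{
    if (z->free_pages < z->high_wmark) {
        /* memory is tight: shrink the per-CPU cache instead of growing it */
        pcp->high -= pcp->high / 8;
        if (pcp->high < pcp->high_min)
            pcp->high = pcp->high_min;
        return;
    }
    /* plenty of free memory: allow the cache to grow */
    if (pcp->high < pcp->high_max)
        pcp->high += 1;
}

int main(void)
{
    struct zone_model z = { .free_pages = 1000, .high_wmark = 4000 };
    struct pcp_model pcp = { .high = 2048, .high_min = 256, .high_max = 8192 };
    tune_pcp_high_on_alloc(&pcp, &z);
    printf("pcp->high = %d\n", pcp.high);
    return 0;
}
```
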
Chris von Recklinghausen acd9029606 mm: tune PCP high automatically
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 51a755c56dc05a8b31ed28d24f28354946dc7529
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:30:00 2023 +0800

    mm: tune PCP high automatically

    The target to tune PCP high automatically is as follows,

    - Minimize allocation/freeing from/to shared zone

    - Minimize idle pages in PCP

    - Minimize pages in PCP when the system's free pages are too few

    To reach these target, a tuning algorithm as follows is designed,

    - When we refill PCP by allocating from the zone, increase PCP high,
      because with a larger PCP we could have avoided allocating from the
      zone.

    - In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
      decrease PCP high to try to free possible idle PCP pages.

    - When page reclaiming is active for the zone, stop increasing PCP
      high in allocating path, decrease PCP high and free some pages in
      freeing path.

    So, the PCP high can be tuned to the page allocating/freeing depth of
    workloads eventually.

    One issue of the algorithm is that if the number of pages allocated is
    much more than that of pages freed on a CPU, the PCP high may become the
    maximal value even if the allocating/freeing depth is small.  But this
    isn't a severe issue, because there are no idle pages in this case.

    One alternative choice is to increase PCP high when we drain PCP via
    trying to free pages to the zone, but don't increase PCP high during PCP
    refilling.  This can avoid the issue above.  But if the number of pages
    allocated is much less than that of pages freed on a CPU, there will be
    many idle pages in PCP and it is hard to free these idle pages.

    PCP high will be decreased by 1/8 (>> 3) periodically.  The value 1/8 is
    somewhat arbitrary; it just makes sure that the idle PCP pages will be
    freed eventually.

    On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
    in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
    kbuild server that is used by 0-Day kbuild service.  With the patch, the
    build time decreases 3.5%.  The cycles% of the spinlock contention (mostly
    for zone lock) decreases from 11.0% to 0.5%.  The number of PCP draining
    for high order pages freeing (free_high) decreases 65.6%.  The number of
    pages allocated from zone (instead of from PCP) decreases 83.9%.

    Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Suggested-by: Mel Gorman <mgorman@techsingularity.net>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:13 -04:00
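
A minimal model of the three tuning rules listed in the commit above (grow on refill, decay periodically, stop growing under reclaim); constants and names are illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

struct pcp_model { int high, high_min, high_max; };

static void on_pcp_refill(struct pcp_model *p, int batch, bool reclaim_active)
{
    if (reclaim_active)
        return;                       /* rule 3: don't grow under reclaim */
    p->high += batch;                 /* rule 1: a refill means PCP was too small */
    if (p->high > p->high_max)
        p->high = p->high_max;
}

static void periodic_decay(struct pcp_model *p)
{
    p->high -= p->high >> 3;          /* rule 2: shed ~1/8 of likely-idle pages */
    if (p->high < p->high_min)
        p->high = p->high_min;
}

int main(void)
{
    struct pcp_model p = { .high = 256, .high_min = 256, .high_max = 4096 };
    for (int i = 0; i < 5; i++)
        on_pcp_refill(&p, 63, false);
    printf("after refills: %d\n", p.high);
    periodic_decay(&p);
    printf("after decay:   %d\n", p.high);
    return 0;
}
```
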
Chris von Recklinghausen 16720507f4 mm: add framework for PCP high auto-tuning
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 90b41691b9881376fe784e13b5766ec3676fdb55
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:29:59 2023 +0800

    mm: add framework for PCP high auto-tuning

    The page allocation performance requirements of different workloads are
    usually different.  So, we need to tune PCP (per-CPU pageset) high to
    optimize the workload page allocation performance.  Now, we have a system
    wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
    But, it's hard to find out the best value by hand.  And one global
    configuration may not work best for the different workloads that run on
    the same system.  One solution to these issues is to tune PCP high of each
    CPU automatically.

    This patch adds the framework for PCP high auto-tuning.  With it,
    pcp->high of each CPU will be changed automatically by the tuning
    algorithm at runtime.  The minimal high (pcp->high_min) is the original
    PCP high value calculated based on the low watermark pages, while the
    maximal high (pcp->high_max) is the PCP high value when the
    percpu_pagelist_high_fraction sysctl knob is set to
    MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high that
    can be set via the sysctl knob by hand.

    It's possible that PCP high auto-tuning doesn't work well for some
    workloads.  So, when PCP high is tuned by hand via the sysctl knob, the
    auto-tuning will be disabled.  The PCP high set by hand will be used
    instead.

    This patch only adds the framework, so pcp->high will always be set to
    pcp->high_min (the original default).  The actual auto-tuning algorithm
    is added in the following patches of the series.

    Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
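
A sketch of the high_min/high_max framework and the manual-override behaviour described above; the logic is a simplification, not the kernel code:

```c
#include <stdbool.h>
#include <stdio.h>

struct pcp_model {
    int high, high_min, high_max;
    bool manual;                   /* sysctl set by hand -> auto-tuning disabled */
};

static void pcp_set_high_manual(struct pcp_model *p, int high)
{
    p->manual = true;
    p->high = high;
}

static void pcp_autotune(struct pcp_model *p, int proposed_high)
{
    if (p->manual)
        return;                    /* hand-set value wins */
    if (proposed_high < p->high_min)
        proposed_high = p->high_min;
    if (proposed_high > p->high_max)
        proposed_high = p->high_max;
    p->high = proposed_high;
}

int main(void)
{
    struct pcp_model p = { .high = 256, .high_min = 256, .high_max = 4096 };
    pcp_autotune(&p, 10000);
    printf("%d\n", p.high);        /* clamped to high_max */
    pcp_set_high_manual(&p, 512);
    pcp_autotune(&p, 4096);
    printf("%d\n", p.high);        /* stays 512: auto-tuning disabled */
    return 0;
}
```
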
Chris von Recklinghausen 5cbf0fb521 mm, page_alloc: scale the number of pages that are batch allocated
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit c0a242394cb980bd00e1f61dc8aacb453d2bbe6a
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:29:58 2023 +0800

    mm, page_alloc: scale the number of pages that are batch allocated

    When a task is allocating a large number of order-0 pages, it may acquire
    the zone->lock multiple times, allocating pages in batches.  This may
    unnecessarily contend on the zone lock when allocating a very large
    number of pages.  This patch adapts the batch size based on the recent
    allocation pattern, scaling it up for subsequent allocations.

    On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
    in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
    kbuild server that is used by 0-Day kbuild service.  With the patch, the
    cycles% of the spinlock contention (mostly for zone lock) decreases from
    12.6% to 11.0% (with PCP size == 367).

    Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Suggested-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
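
A sketch of the batch-scaling idea above: consecutive refills double the refill size up to a cap, so a task allocating many pages takes the zone lock fewer times; alloc_factor and the cap are modelled on the commit text, not copied from the kernel:

```c
#include <stdio.h>

#define MAX_ALLOC_SCALE 5                   /* illustrative cap: batch << 5 */

struct pcp_model { int batch; int alloc_factor; };

static int next_refill_size(struct pcp_model *p)
{
    int nr = p->batch << p->alloc_factor;

    if (p->alloc_factor < MAX_ALLOC_SCALE)
        p->alloc_factor++;                  /* consecutive refills -> bigger batches */
    return nr;
}

static void on_free_burst(struct pcp_model *p)
{
    p->alloc_factor = 0;                    /* pattern changed: reset the scaling */
}

int main(void)
{
    struct pcp_model p = { .batch = 63, .alloc_factor = 0 };
    for (int i = 0; i < 4; i++)
        printf("refill %d pages\n", next_refill_size(&p));
    on_free_burst(&p);
    printf("refill %d pages\n", next_refill_size(&p));
    return 0;
}
```
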
Chris von Recklinghausen b4efb8d044 mm: restrict the pcp batch scale factor to avoid too long latency
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 52166607ecc980391b1fffbce0be3074a96d0c7b
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:29:57 2023 +0800

    mm: restrict the pcp batch scale factor to avoid too long latency

    In the page allocator, the PCP (Per-CPU Pageset) is refilled and drained
    in batches to increase page allocation throughput, reduce page
    allocation/freeing latency per page, and reduce zone lock contention.
    But a batch size that is too large causes too long a maximal
    allocation/freeing latency, which may punish arbitrary users.  So the
    default batch size is chosen carefully (in zone_batchsize(), the value is
    63 for zones > 1GB) to avoid that.

    In commit 3b12e7e979 ("mm/page_alloc: scale the number of pages that are
    batch freed"), the batch size is scaled up when freeing a large number of
    pages, to improve page freeing performance and reduce zone lock
    contention.  A similar optimization can be used when allocating a large
    number of pages too.

    To find a suitable max batch scale factor (that is, the max effective
    batch size), some tests and measurements were done on several machines,
    as follows.

    A set of debug patches are implemented as follows,

    - Set PCP high to be 2 * batch to reduce the effect of PCP high

    - Disable free batch size scaling to get the raw performance.

    - The code with zone lock held is extracted from rmqueue_bulk() and
      free_pcppages_bulk() to 2 separate functions to make it easy to
      measure the function run time with ftrace function_graph tracer.

    - The batch size is hard coded to be 63 (default), 127, 255, 511,
      1023, 2047, 4095.

    Then will-it-scale/page_fault1 is used to generate the page
    allocation/freeing workload.  The page allocation/freeing throughput
    (page/s) is measured via will-it-scale.  The page allocation/freeing
    average latency (alloc/free latency avg, in us) and allocation/freeing
    latency at 99 percentile (alloc/free latency 99%, in us) are measured with
    ftrace function_graph tracer.

    The test results are as follows,

    Sapphire Rapids Server
    ======================
    Batch   throughput      free latency    free latency    alloc latency   alloc latency
            page/s          avg / us        99% / us        avg / us        99% / us
    -----   ----------      ------------    ------------    -------------   -------------
      63    513633.4         2.33            3.57            2.67             6.83
     127    517616.7         4.35            6.65            4.22            13.03
     255    520822.8         8.29           13.32            7.52            25.24
     511    524122.0        15.79           23.42           14.02            49.35
    1023    525980.5        30.25           44.19           25.36            94.88
    2047    526793.6        59.39           84.50           45.22           140.81

    Ice Lake Server
    ===============
    Batch   throughput      free latency    free latency    alloc latency   alloc latency
            page/s          avg / us        99% / us        avg / us        99% / us
    -----   ----------      ------------    ------------    -------------   -------------
      63    620210.3         2.21            3.68            2.02            4.35
     127    627003.0         4.09            6.86            3.51            8.28
     255    630777.5         7.70           13.50            6.17           15.97
     511    633651.5        14.85           22.62           11.66           31.08
    1023    637071.1        28.55           42.02           20.81           54.36
    2047    638089.7        56.54           84.06           39.28           91.68

    Cascade Lake Server
    ===================
    Batch   throughput      free latency    free latency    alloc latency   alloc latency
            page/s          avg / us        99% / us        avg / us        99% / us
    -----   ----------      ------------    ------------    -------------   -------------
      63    404706.7         3.29             5.03           3.53             4.75
     127    422475.2         6.12             9.09           6.36             8.76
     255    411522.2        11.68            16.97          10.90            16.39
     511    428124.1        22.54            31.28          19.86            32.25
    1023    414718.4        43.39            62.52          40.00            66.33
    2047    429848.7        86.64           120.34          71.14           106.08

    Comet Lake Desktop
    ==================
    Batch   throughput      free latency    free latency    alloc latency   alloc latency
            page/s          avg / us        99% / us        avg / us        99% / us
    -----   ----------      ------------    ------------    -------------   -------------
      63    795183.13        2.18            3.55            2.03            3.05
     127    803067.85        3.91            6.56            3.85            5.52
     255    812771.10        7.35           10.80            7.14           10.20
     511    817723.48       14.17           27.54           13.43           30.31
    1023    818870.19       27.72           40.10           27.89           46.28

    Coffee Lake Desktop
    ===================
    Batch   throughput      free latency    free latency    alloc latency   alloc latency
            page/s          avg / us        99% / us        avg / us        99% / us
    -----   ----------      ------------    ------------    -------------   -------------
      63    510542.8         3.13             4.40           2.48            3.43
     127    514288.6         5.97             7.89           4.65            6.04
     255    516889.7        11.86            15.58           8.96           12.55
     511    519802.4        23.10            28.81          16.95           26.19
    1023    520802.7        45.30            52.51          33.19           45.95
    2047    519997.1        90.63           104.00          65.26           81.74

    From the above data, to keep the allocation/freeing latency below 100 us
    in most cases, the max batch scale factor needs to be less than or equal
    to 5.

    Although it is reasonable to use 5 as the max batch scale factor for the
    systems tested, there are also slower systems, where a smaller value
    should be used to constrain the page allocation/freeing latency.

    So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added to
    set the max batch scale factor.  Its default value is 5, and users can
    reduce it when necessary.

    Link: https://lkml.kernel.org/r/20231016053002.756205-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
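
A sketch of the cap the new Kconfig option introduces: the effective batch under the zone lock is batch << scale, clamped by the configured maximum; the macro below merely stands in for CONFIG_PCP_BATCH_SCALE_MAX:

```c
#include <stdio.h>

#define PCP_BATCH_SCALE_MAX 5   /* stand-in for CONFIG_PCP_BATCH_SCALE_MAX (default 5) */

static int effective_batch(int batch, int scale)
{
    if (scale > PCP_BATCH_SCALE_MAX)
        scale = PCP_BATCH_SCALE_MAX;
    return batch << scale;
}

int main(void)
{
    /* with batch = 63, the worst-case burst is 63 << 5 = 2016 pages,
     * which keeps per-burst alloc/free latency around or below ~100 us
     * on the machines measured in the commit above */
    printf("%d\n", effective_batch(63, 8));
    return 0;
}
```
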
Chris von Recklinghausen 0ae67db50a mm, pcp: reduce lock contention for draining high-order pages
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 362d37a106dd3f6431b2fdd91d9208b0d023b50d
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:29:56 2023 +0800

    mm, pcp: reduce lock contention for draining high-order pages

    In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
    on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
    PCP is mostly used for high-order pages freeing to improve the cache-hot
    pages reusing between page allocating and freeing CPUs.

    On a system with a small per-CPU data cache slice, pages shouldn't be
    cached before draining, to guarantee that they stay cache-hot.  But on a
    system with a large per-CPU data cache slice, some pages can be cached
    before draining to reduce zone lock contention.

    So, in this patch, instead of draining without any caching, "pcp->batch"
    pages will be cached in PCP before draining if the size of the per-CPU
    data cache slice is more than "3 * batch".

    In theory, if the size of per-CPU data cache slice is more than "2 *
    batch", we can reuse cache-hot pages between CPUs.  But considering the
    other usage of cache (code, other data accessing, etc.), "3 * batch" is
    used.

    Note: "3 * batch" is chosen to make sure the optimization works on recent
    x86_64 server CPUs.  If you want to increase it, please check whether it
    breaks the optimization.

    On a 2-socket Intel server with 128 logical CPU, with the patch, the
    network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite
    with 16-pair processes increases 70.5%.  The cycles% of the spinlock
    contention (mostly for zone lock) decreases from 46.1% to 21.3%.  The
    number of PCP draining for high order pages freeing (free_high) decreases
    89.9%.  The cache miss rate stays at 0.2%.

    Link: https://lkml.kernel.org/r/20231016053002.756205-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
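
A sketch of the "3 * batch" decision described above; the cache-slice helper is a stand-in for the cacheinfo-derived per-CPU value:

```c
#include <stdio.h>

#define PAGE_SIZE 4096

static long per_cpu_cache_slice_bytes(void) { return 2 * 1024 * 1024; } /* stand-in */

static int pages_to_keep_on_drain(int batch)
{
    long slice_pages = per_cpu_cache_slice_bytes() / PAGE_SIZE;

    /* large cache slice: freshly freed pages can stay cache-hot locally,
     * so keep one batch in the PCP and only drain the rest */
    return slice_pages > 3L * batch ? batch : 0;
}

int main(void)
{
    printf("keep %d pages\n", pages_to_keep_on_drain(63));
    return 0;
}
```
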
Chris von Recklinghausen af3055ada0 mm, pcp: avoid to drain PCP when process exit
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit ca71fe1ad9221a89c6a25f49159c600d9e598ae1
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 16 13:29:54 2023 +0800

    mm, pcp: avoid to drain PCP when process exit

    Patch series "mm: PCP high auto-tuning", v3.

    The page allocation performance requirements of different workloads are
    often different.  So, we need to tune the PCP (Per-CPU Pageset) high on
    each CPU automatically to optimize the page allocation performance.

    The list of patches in series is as follows,

    [1/9] mm, pcp: avoid to drain PCP when process exit
    [2/9] cacheinfo: calculate per-CPU data cache size
    [3/9] mm, pcp: reduce lock contention for draining high-order pages
    [4/9] mm: restrict the pcp batch scale factor to avoid too long latency
    [5/9] mm, page_alloc: scale the number of pages that are batch allocated
    [6/9] mm: add framework for PCP high auto-tuning
    [7/9] mm: tune PCP high automatically
    [8/9] mm, pcp: decrease PCP high if free pages < high watermark
    [9/9] mm, pcp: reduce detecting time of consecutive high order page freeing

    Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
    high-order pages freeing.

    Patch [4/9], [5/9] optimize batch freeing and allocating.

    Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
    auto-tuning method.

    Patch [9/9] optimize the PCP draining for consecutive high order page
    freeing based on PCP high auto-tuning.

    The test results for patches with performance impact are as follows,

    kbuild
    ======

    On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
    in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
    kbuild server that is used by 0-Day kbuild service.

            build time   lock contend%      free_high       alloc_zone
            ----------      ----------      ---------       ----------
    base         100.0            14.0          100.0            100.0
    patch1        99.5            12.8           19.5             95.6
    patch3        99.4            12.6            7.1             95.6
    patch5        98.6            11.0            8.1             97.1
    patch7        95.1             0.5            2.8             15.6
    patch9        95.0             1.0            8.8             20.0

    The PCP draining optimization (patch [1/9], [3/9]) and the PCP batch
    allocation optimization (patch [5/9]) reduce zone lock contention a
    little.  The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time
    visibly: the tuning target, the number of pages allocated from the zone,
    is reduced greatly, and so the zone lock contention cycles% is reduced
    greatly as well.

    With the PCP tuning patches (patch [7/9], [9/9]), the average used memory
    during the test increases by up to 18.4% because more pages are cached in
    PCP.  But at the end of the test, the amount of used memory decreases to
    the same level as with the base patch.  That is, the pages cached in PCP
    are released back to the zone once they are no longer used actively.

    netperf SCTP_STREAM_MANY
    ========================

    On a 2-socket Intel server with 128 logical CPU, we tested
    SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.

                 score   lock contend%      free_high       alloc_zone  cache miss rate%
                 -----      ----------      ---------       ----------  ----------------
    base         100.0             2.1          100.0            100.0               1.3
    patch1        99.4             2.1           99.4             99.4               1.3
    patch3       106.4             1.3           13.3            106.3               1.3
    patch5       106.0             1.2           13.2            105.9               1.3
    patch7       103.4             1.9            6.7             90.3               7.6
    patch9       108.6             1.3           13.7            108.6               1.3

    The PCP draining optimization (patch [1/9]+[3/9]) improves performance.
    The PCP high auto-tuning (patch [7/9]) reduces performance a little
    because PCP draining sometimes cannot be triggered in time, so the cache
    miss rate% increases.  The further PCP draining optimization (patch
    [9/9]) based on PCP tuning restores the performance.

    lmbench3 UNIX (AF_UNIX)
    =======================

    On a 2-socket Intel server with 128 logical CPU, we tested UNIX
    (AF_UNIX socket) test case of lmbench3 test suite with 16-pair
    processes.

                 score   lock contend%      free_high       alloc_zone  cache miss rate%
                 -----      ----------      ---------       ----------  ----------------
    base         100.0            51.4          100.0            100.0               0.2
    patch1       116.8            46.1           69.5            104.3               0.2
    patch3       199.1            21.3            7.0            104.9               0.2
    patch5       200.0            20.8            7.1            106.9               0.3
    patch7       191.6            19.9            6.8            103.8               2.8
    patch9       193.4            21.7            7.0            104.7               2.1

    The PCP draining optimization (patch [1/9], [3/9]) improves performance
    considerably.  The PCP tuning (patch [7/9]) reduces performance a little
    because PCP draining sometimes cannot be triggered in time.  The further
    PCP draining optimization (patch [9/9]) based on PCP tuning partially
    restores the performance.

    The patchset adds several fields in struct per_cpu_pages.  The struct
    layout before/after the patchset is as follows,

    base
    ====

    struct per_cpu_pages {
            spinlock_t                 lock;                 /*     0     4 */
            int                        count;                /*     4     4 */
            int                        high;                 /*     8     4 */
            int                        batch;                /*    12     4 */
            short int                  free_factor;          /*    16     2 */
            short int                  expire;               /*    18     2 */

            /* XXX 4 bytes hole, try to pack */

            struct list_head           lists[13];            /*    24   208 */

            /* size: 256, cachelines: 4, members: 7 */
            /* sum members: 228, holes: 1, sum holes: 4 */
            /* padding: 24 */
    } __attribute__((__aligned__(64)));

    patched
    =======

    struct per_cpu_pages {
            spinlock_t                 lock;                 /*     0     4 */
            int                        count;                /*     4     4 */
            int                        high;                 /*     8     4 */
            int                        high_min;             /*    12     4 */
            int                        high_max;             /*    16     4 */
            int                        batch;                /*    20     4 */
            u8                         flags;                /*    24     1 */
            u8                         alloc_factor;         /*    25     1 */
            u8                         expire;               /*    26     1 */

            /* XXX 1 byte hole, try to pack */

            short int                  free_count;           /*    28     2 */

            /* XXX 2 bytes hole, try to pack */

            struct list_head           lists[13];            /*    32   208 */

            /* size: 256, cachelines: 4, members: 11 */
            /* sum members: 237, holes: 2, sum holes: 3 */
            /* padding: 16 */
    } __attribute__((__aligned__(64)));

    The size of the struct doesn't change with the patchset.

    This patch (of 9):

    In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
    on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
    PCP is mostly used for high-order pages freeing to improve the cache-hot
    pages reusing between page allocation and freeing CPUs.

    But the PCP draining mechanism may be triggered unexpectedly when a
    process exits.  With a customized trace point, it was found that PCP
    draining (free_high == true) was triggered by an order-1 page freeing
    with the following call stack,

     => free_unref_page_commit
     => free_unref_page
     => __mmdrop
     => exit_mm
     => do_exit
     => do_group_exit
     => __x64_sys_exit_group
     => do_syscall_64

    Checking the source code, this is the page table PGD freeing
    (mm_free_pgd()).  It's an order-1 page freeing if
    CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
    security.

    Just before that, page freeing with the following call stack was found,

     => free_unref_page_commit
     => free_unref_page_list
     => release_pages
     => tlb_batch_pages_flush
     => tlb_finish_mmu
     => exit_mmap
     => __mmput
     => exit_mm
     => do_exit
     => do_group_exit
     => __x64_sys_exit_group
     => do_syscall_64

    So, when a process exits,

    - a large number of user pages of the process will be freed without
      page allocation, so it's highly possible that pcp->free_factor becomes
      > 0.  In fact, this is expected behavior to improve process exit
      performance.

    - after freeing all user pages, the PGD will be freed, which is an
      order-1 page freeing, and the PCP will be drained.

    All in all, when a process exits, it's highly possible that the PCP will
    be drained.  This is unexpected behavior.

    To avoid this, in this patch, PCP draining will only be triggered after 2
    consecutive high-order page freeings.

    On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
    in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
    kbuild server that is used by 0-Day kbuild service.  With the patch, the
    cycles% of the spinlock contention (mostly for zone lock) decreases from
    14.0% to 12.8% (with PCP size == 367).  The number of PCP draining for
    high order pages freeing (free_high) decreases 80.5%.

    This helps network workloads too, thanks to the reduced zone lock
    contention.  On a 2-socket Intel server with 128 logical CPU, with the
    patch, the network bandwidth of the UNIX (AF_UNIX) test case of lmbench
    test suite with 16-pair processes increases 16.8%.  The cycles% of the
    spinlock contention (mostly for zone lock) decreases from 51.4% to 46.1%.
    The number of PCP draining for high order pages freeing (free_high)
    decreases 30.5%.  The cache miss rate stays at 0.2%.

    Link: https://lkml.kernel.org/r/20231016053002.756205-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20231016053002.756205-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
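
A sketch of the "two consecutive high-order frees" guard described above; the flag and the order bound are modelled on the commit text, not copied from the kernel:

```c
/*
 * A free_high drain only triggers when the previous free was also
 * high-order, so the single order-1 PGD free at process exit no longer
 * drains the PCP.
 */
#include <stdbool.h>
#include <stdio.h>

struct pcp_model {
    bool prev_free_high_order;    /* was the previous free high-order? */
    int  free_factor;
};

static bool free_page_order(struct pcp_model *p, unsigned int order)
{
    bool free_high = false;

    if (order && order <= 3) {    /* a high-order free that still goes via PCP (bound illustrative) */
        if (p->prev_free_high_order && p->free_factor)
            free_high = true;     /* 2nd consecutive high-order free: drain */
        p->prev_free_high_order = true;
    } else {
        p->prev_free_high_order = false;
    }
    return free_high;
}

int main(void)
{
    struct pcp_model p = { .prev_free_high_order = false, .free_factor = 1 };
    printf("%d\n", free_page_order(&p, 0));  /* user pages at exit: 0 */
    printf("%d\n", free_page_order(&p, 1));  /* lone PGD free: still 0 */
    printf("%d\n", free_page_order(&p, 1));  /* consecutive high-order: 1 */
    return 0;
}
```
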
Chris von Recklinghausen 63fabe6239 mm/page_alloc: remove unnecessary parameter batch of nr_pcp_free
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 1305870529d9e16170bb744148aab6dffb19bb23
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Wed Aug 9 18:07:54 2023 +0800

    mm/page_alloc: remove unnecessary parameter batch of nr_pcp_free

    We get batch from the pcp and just pass it to nr_pcp_free immediately.
    Get batch from the pcp inside nr_pcp_free to remove the unnecessary batch
    parameter.

    Link: https://lkml.kernel.org/r/20230809100754.3094517-3-shikemeng@huaweicloud.com
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
Chris von Recklinghausen a419bd2641 mm/page_alloc: remove track of active PCP lists range in bulk free
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit f142b2c2530c1383a45e1ada1d641974b9723a35
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Wed Aug 9 18:07:53 2023 +0800

    mm/page_alloc: remove track of active PCP lists range in bulk free

    Patch series "Two minor cleanups for pcp list in page_alloc".

    There are two minor cleanups for the pcp lists in page_alloc.  More
    details can be found in the respective patches.

    This patch (of 2):

    After commit fd56eef258a17 ("mm/page_alloc: simplify how many pages are
    selected per pcp list during bulk free"), we drain all pages in the
    selected pcp list, and we ensured that the passed count is < pcp->count.
    The search will therefore finish before wrapping around, so the tracking
    of the active PCP lists range, intended for the wrap-around case, is no
    longer needed.

    Link: https://lkml.kernel.org/r/20230809100754.3094517-1-shikemeng@huaweicloud.com
    Link: https://lkml.kernel.org/r/20230809100754.3094517-2-shikemeng@huaweicloud.com
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
Chris von Recklinghausen 0572a333f0 mm: page_alloc: move is_check_pages_enabled() into page_alloc.c
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit ecbb490d8ee38fd84a0d682282589ff723dc62c0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue May 16 14:38:21 2023 +0800

    mm: page_alloc: move is_check_pages_enabled() into page_alloc.c

    is_check_pages_enabled() is only used in page_alloc.c, so move it into
    page_alloc.c and also use it in free_tail_page_prepare().

    Link: https://lkml.kernel.org/r/20230516063821.121844-14-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pavel Machek <pavel@ucw.cz>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
Chris von Recklinghausen 28429327ca mm: page_alloc: move sysctls into it own fils
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit e95d372c4cd46b6ec4eeacc07adcb7260ab4cfa0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue May 16 14:38:20 2023 +0800

    mm: page_alloc: move sysctls into it own fils

    This moves all page-alloc-related sysctls to their own file, as part of
    the kernel/sysctl.c spring cleaning, and also moves some function
    declarations from mm.h into internal.h.

    Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pavel Machek <pavel@ucw.cz>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:12 -04:00
Chris von Recklinghausen c724e38631 mm: page_alloc: move pm_* function into power
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 07f44ac3c90c50a201307d3fe4dda120ee8394f5
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue May 16 14:38:18 2023 +0800

    mm: page_alloc: move pm_* function into power

    pm_restrict_gfp_mask()/pm_restore_gfp_mask() are only used in power code,
    so let's move them out of page_alloc.c.

    Add a general gfp_has_io_fs() function, which returns true if gfp has
    both the __GFP_IO and __GFP_FS flags set, and use it inside
    pm_suspended_storage(); pm_suspended_storage() itself is moved into
    suspend.h.

    Link: https://lkml.kernel.org/r/20230516063821.121844-11-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pavel Machek <pavel@ucw.cz>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:11 -04:00
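
The commit above adds a gfp_has_io_fs() helper that reports whether a gfp mask allows both IO and FS operations. A user-space sketch of the idea, with stand-in flag bits:

```c
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int gfp_t;

#define __GFP_IO  (1u << 0)   /* stand-in bit values, not the kernel's */
#define __GFP_FS  (1u << 1)

static bool gfp_has_io_fs(gfp_t gfp)
{
    /* true only when both IO and FS are allowed */
    return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS);
}

int main(void)
{
    printf("%d\n", gfp_has_io_fs(__GFP_IO));             /* 0 */
    printf("%d\n", gfp_has_io_fs(__GFP_IO | __GFP_FS));  /* 1 */
    return 0;
}
```
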
Chris von Recklinghausen b0e0e4f7a0 mm: page_alloc: move mark_free_page() into snapshot.c
JIRA: https://issues.redhat.com/browse/RHEL-20141

commit 31a1b9d7fe768db521b12287ec6426983e9787e3
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue May 16 14:38:17 2023 +0800

    mm: page_alloc: move mark_free_page() into snapshot.c

    mark_free_page() is only used in kernel/power/snapshot.c, so move it out
    to shrink page_alloc.c a bit.

    Link: https://lkml.kernel.org/r/20230516063821.121844-10-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Pavel Machek <pavel@ucw.cz>
    Cc: Rafael J. Wysocki <rafael@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-06-07 13:14:11 -04:00