Commit Graph

1261 Commits

Aristeu Rozanski 1c1f6235c1 mm: memcontrol: rename memcg_kmem_enabled()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7a449f779608efe1941a0e0c4bd7b5f57000be7
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Mon Feb 13 11:29:22 2023 -0800

    mm: memcontrol: rename memcg_kmem_enabled()

    Currently there are two kmem-related helper functions with confusing
    semantics: memcg_kmem_enabled() and mem_cgroup_kmem_disabled().

    The problem is that the obvious expectation,
    memcg_kmem_enabled() == !mem_cgroup_kmem_disabled(),
    can be false.

    mem_cgroup_kmem_disabled() is similar to mem_cgroup_disabled(): it returns
    true only if CONFIG_MEMCG_KMEM is not set or the kmem accounting is
    disabled using a boot time kernel option "cgroup.memory=nokmem".  It never
    changes the value dynamically.

    memcg_kmem_enabled() is different: it always returns false until the first
    non-root memory cgroup comes online (assuming the kernel memory
    accounting is enabled).  Its goal is to improve performance on
    systems without the cgroupfs mounted/memory controller enabled or on
    systems with only the root memory cgroup.

    To make things more obvious and avoid potential bugs, let's rename
    memcg_kmem_enabled() to memcg_kmem_online().

    Link: https://lkml.kernel.org/r/20230213192922.1146370-1-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
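
A minimal sketch of the two predicates described above, using simplified stand-ins for the kernel's static keys (illustrative, not the upstream code):

    #include <stdbool.h>

    /* Set at boot by "cgroup.memory=nokmem"; never changes afterwards. */
    static bool cgroup_memory_nokmem;

    /* Flipped once the first non-root memory cgroup comes online. */
    static bool kmem_online;

    static inline bool mem_cgroup_kmem_disabled(void)
    {
            return cgroup_memory_nokmem;    /* static answer */
    }

    /* Formerly memcg_kmem_enabled(); renamed to reflect its dynamic nature. */
    static inline bool memcg_kmem_online(void)
    {
            return kmem_online;             /* false until a non-root memcg is online */
    }
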
Aristeu Rozanski d908e3177a mm: add vma_has_recency()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 8788f6781486769d9598dcaedc3fe0eb12fc3e59
Author: Yu Zhao <yuzhao@google.com>
Date:   Fri Dec 30 14:52:51 2022 -0700

    mm: add vma_has_recency()

    Add vma_has_recency() to indicate whether a VMA may exhibit temporal
    locality that the LRU algorithm relies on.

    This function returns false for VMAs marked by VM_SEQ_READ or
    VM_RAND_READ.  While the former flag indicates linear access, i.e., a
    special case of spatial locality, both flags indicate a lack of temporal
    locality, i.e., the reuse of an area within a relatively small duration.

    "Recency" is chosen over "locality" to avoid confusion between temporal
    and spatial localities.

    Before this patch, the active/inactive LRU only ignored the accessed bit
    from VMAs marked by VM_SEQ_READ.  After this patch, the active/inactive
    LRU and MGLRU share the same logic: they both ignore the accessed bit if
    vma_has_recency() returns false.

    For the active/inactive LRU, the following fio test showed a [6, 8]%
    increase in IOPS when randomly accessing mapped files under memory
    pressure.

      kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
      kb=$((kb - 8*1024*1024))

      modprobe brd rd_nr=1 rd_size=$kb
      dd if=/dev/zero of=/dev/ram0 bs=1M

      mkfs.ext4 /dev/ram0
      mount /dev/ram0 /mnt/
      swapoff -a

      fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \
          --size=8G --rw=randrw --time_based --runtime=10m \
          --group_reporting

    The discussion that led to this patch is here [1].  Additional test
    results are available in that thread.

    [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/

    Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andrea Righi <andrea.righi@canonical.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
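
A minimal sketch of the predicate, with the VMA flag values written out as illustrative stand-ins (the real definitions live in include/linux/mm.h):

    #include <stdbool.h>

    #define VM_SEQ_READ   0x00008000UL      /* hint: application expects sequential access */
    #define VM_RAND_READ  0x00010000UL      /* hint: application expects random access */

    /* simplified stand-in for the kernel's struct */
    struct vm_area_struct { unsigned long vm_flags; };

    static bool vma_has_recency(struct vm_area_struct *vma)
    {
            /* Both hints indicate a lack of temporal locality, so the LRU
             * ignores the accessed bit for such VMAs. */
            if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
                    return false;

            return true;
    }
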
Aristeu Rozanski c9156ec75f mm/swap: convert deactivate_page() to folio_deactivate()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: f3cd4ab0aabf was backported before this change

commit 5a9e34747c9f731bbb6b7fd7521c4fec0d840593
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Wed Dec 21 10:08:48 2022 -0800

    mm/swap: convert deactivate_page() to folio_deactivate()

    Deactivate_page() has already been converted to use folios internally; this
    change converts it to take a folio argument instead of calling page_folio().
    It also renames the function to folio_deactivate() to be more consistent with
    other folio functions.

    [akpm@linux-foundation.org: fix left-over comments, per Yu Zhao]
    Link: https://lkml.kernel.org/r/20221221180848.20774-5-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:02 -04:00
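
The shape of a typical call-site conversion, as I read the change (illustrative):

    /* before: callers passed a page; the helper called page_folio() internally */
    deactivate_page(page);

    /* after: renamed, and callers pass the folio directly */
    folio_deactivate(folio);
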
Audra Mitchell 086595e4f0 mm: Remove pointless barrier() after pmdp_get_lockless()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit eb780dcae02d5a71e6979aa7b8c708dea8597adf
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Oct 21 13:47:29 2022 +0200

    mm: Remove pointless barrier() after pmdp_get_lockless()

    pmdp_get_lockless() should itself imply any ordering required.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221022114425.298833095%40infradead.org

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:02 -04:00
Audra Mitchell 9c11ced5c1 mm: disable top-tier fallback to reclaim on proactive reclaim
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 6b426d071419a40f61fe41fe1bd9e1b4fa5aeb37
Author: Mina Almasry <almasrymina@google.com>
Date:   Thu Dec 1 15:33:17 2022 -0800

    mm: disable top-tier fallback to reclaim on proactive reclaim

    Reclaiming directly from top tier nodes breaks the aging pipeline of
    memory tiers.  If we have a RAM -> CXL -> storage hierarchy, we should
    demote from RAM to CXL and from CXL to storage.  If we reclaim a page from
    RAM, it means we 'demote' it directly from RAM to storage, potentially
    bypassing a huge number of pages in CXL that are colder than it.

    However, disabling reclaim from top tier nodes entirely would cause OOMs in
    edge scenarios where lower tier memory is unreclaimable for whatever
    reason, e.g. memory being mlocked or too hot to reclaim.  In these
    cases we would rather the job run with a performance regression than
    OOM altogether.

    However, we can disable reclaim from top tier nodes for proactive reclaim.
    That reclaim is not real memory pressure, and we don't have any cause to
    be breaking the aging pipeline.

    [akpm@linux-foundation.org: restore comment layout, per Ying Huang]
    Link: https://lkml.kernel.org/r/20221201233317.1394958-1-almasrymina@google.com
    Signed-off-by: Mina Almasry <almasrymina@google.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:02 -04:00
Audra Mitchell 821f801a0a mm: vmscan: use sysfs_emit() to instead of scnprintf()
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context difference due to out of order backport
    58a78c1704 ("mm: multi-gen LRU: add helpers in page table walks")

This patch is a backport of the following upstream commit:
commit 8ef9c32a12a8a0012a4988050947c45521260c5d
Author: Xu Panda <xu.panda@zte.com.cn>
Date:   Thu Nov 24 19:29:01 2022 +0800

    mm: vmscan: use sysfs_emit() to instead of scnprintf()

    Replace open-coded snprintf() with sysfs_emit() to simplify the code.

    Link: https://lkml.kernel.org/r/202211241929015476424@zte.com.cn
    Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
    Signed-off-by: Yang Yang <yang.yang29@zte.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:00 -04:00
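
The general shape of such a conversion in a sysfs show() handler (the attribute and value shown are hypothetical):

    static ssize_t example_show(struct kobject *kobj, struct kobj_attribute *attr,
                                char *buf)
    {
            /* was: return scnprintf(buf, PAGE_SIZE, "%u\n", example_val); */
            return sysfs_emit(buf, "%u\n", example_val);    /* PAGE_SIZE bound handled internally */
    }
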
Audra Mitchell a7c35d6d98 mm: make drop_caches keep reclaiming on all nodes
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit e83b39d6bbdb6d25bd6f5c258832774635d29b47
Author: Jan Kara <jack@suse.cz>
Date:   Tue Nov 15 13:32:55 2022 +0100

    mm: make drop_caches keep reclaiming on all nodes

    Currently, drop_caches reclaims node-by-node, looping on each node
    until reclaim can no longer make progress.  This can however leave quite a
    few slab entries (such as filesystem inodes) unreclaimed if objects on, say,
    node 1 keep objects on node 0 pinned.  So move the "loop until no
    progress" loop out to wrap the node-by-node iteration, so that reclaim is
    retried on all nodes as long as reclaim on any node made progress.  This
    fixes a problem where drop_caches was not reclaiming lots of otherwise
    perfectly reclaimable inodes.

    Link: https://lkml.kernel.org/r/20221115123255.12559-1-jack@suse.cz
    Signed-off-by: Jan Kara <jack@suse.cz>
    Reported-by: You Zhou <you.zhou@intel.com>
    Reported-by: Pengfei Xu <pengfei.xu@intel.com>
    Tested-by: Pengfei Xu <pengfei.xu@intel.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:57 -04:00
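
A simplified sketch of the reworked drop_slab() loop described above (close to upstream in structure, but not the literal code):

    void drop_slab(void)
    {
            int nid;
            int shift = 0;
            unsigned long freed;

            do {
                    freed = 0;
                    /* one pass over every node ... */
                    for_each_online_node(nid) {
                            if (fatal_signal_pending(current))
                                    return;
                            freed += drop_slab_node(nid);
                    }
                    /* ... repeated while the whole pass still makes progress, so
                     * objects unpinned by another node's pass get another try */
            } while ((freed >> shift++) > 1);
    }
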
Chris von Recklinghausen af0daa0b92 mm/vmscan: use vma iterator instead of vm_next
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 78ba531ff3ec2a444001853d8636ff39ed11ca28
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Tue Sep 6 19:49:05 2022 +0000

    mm/vmscan: use vma iterator instead of vm_next

    Use the vma iterator in get_next_vma() instead of the linked list.

    [yuzhao@google.com: mm/vmscan: use the proper VMA iterator]
      Link: https://lkml.kernel.org/r/Yx+QGOgHg1Wk8tGK@google.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-68-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:57 -04:00
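
An illustrative use of the VMA iterator that replaces the vm_next walk (not the literal get_next_vma() code):

    struct vm_area_struct *vma;
    VMA_ITERATOR(vmi, mm, addr);    /* iterator over mm's maple tree, starting at addr */

    for_each_vma(vmi, vma) {
            /* VMAs are visited in address order without touching vma->vm_next */
    }
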
Nico Pache 6dca0738a9 mm, vmscan: remove ISOLATE_UNMAPPED
commit 3dfbb555c98ac55b9d911f9af0e35014b445fb41
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Thu Sep 14 15:16:39 2023 +0200

    mm, vmscan: remove ISOLATE_UNMAPPED

    This isolate_mode_t flag is effectively unused since 89f6c88a6ab4 ("mm:
    __isolate_lru_page_prepare() in isolate_migratepages_block()") as
    sc->may_unmap is now checked directly (and only node_reclaim has a mode
    that sets it to 0).  The last remaining place is the mm_vmscan_lru_isolate
    tracepoint's isolate_mode parameter.  That one was mainly used to
    indicate the active/inactive mode, which the trace-vmscan-postprocess.pl
    script consumed, but that got silently broken.  After the previous patch
    fixed the script, it no longer needs isolate_mode.  So just
    remove the parameter and, with it, the whole ISOLATE_UNMAPPED flag.

    Link: https://lkml.kernel.org/r/20230914131637.12204-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 11:38:35 -06:00
Nico Pache 19eae72cdd mm/mglru: skip special VMAs in lru_gen_look_around()
Conflicts:
       mm/vmscan.c: missing commit c33c794828f2 ("mm: ptep_get() conversion")
        which uses a helper to access pte_t

commit c28ac3c7eb945fee6e20f47d576af68fdff1392a
Author: Yu Zhao <yuzhao@google.com>
Date:   Fri Dec 22 21:56:47 2023 -0700

    mm/mglru: skip special VMAs in lru_gen_look_around()

    Special VMAs like VM_PFNMAP can contain anon pages from COW.  There isn't
    much profit in doing lookaround on them.  Besides, they can trigger the
    pte_special() warning in get_pte_pfn().

    Skip them in lru_gen_look_around().

    Link: https://lkml.kernel.org/r/20231223045647.1566043-1-yuzhao@google.com
    Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: syzbot+03fd9b3f71641f0ebf2d@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/000000000000f9ff00060d14c256@google.com/
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:40 -06:00
Nico Pache d255baa734 mm/mglru: reclaim offlined memcgs harder
commit 4376807bf2d5371c3e00080c972be568c3f8a7d1
Author: Yu Zhao <yuzhao@google.com>
Date:   Thu Dec 7 23:14:07 2023 -0700

    mm/mglru: reclaim offlined memcgs harder

    In the effort to reduce zombie memcgs [1], it was discovered that the
    memcg LRU doesn't apply enough pressure on offlined memcgs.  Specifically,
    instead of rotating them to the tail of the current generation
    (MEMCG_LRU_TAIL) for a second attempt, it moves them to the next
    generation (MEMCG_LRU_YOUNG) after the first attempt.

    Not applying enough pressure on offlined memcgs can cause them to build
    up, and this can be particularly harmful to memory-constrained systems.

    On Pixel 8 Pro, launching apps for 50 cycles:
                     Before  After  Change
      Zombie memcgs  45      35     -22%

    [1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com
    Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: T.J. Mercier <tjmercier@google.com>
    Tested-by: T.J. Mercier <tjmercier@google.com>
    Cc: Charan Teja Kalla <quic_charante@quicinc.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
    Cc: Kairui Song <ryncsn@gmail.com>
    Cc: Kalesh Singh <kaleshsingh@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:40 -06:00
Nico Pache 2934d4c93d mm/mglru: try to stop at high watermarks
Conflicts:
       mm/vmscan.c: upstream commit 7a704474b3022 ("mm: memcg: rename and
        document global_reclaim()") renamed global_reclaim() to
        root_reclaim(), which we don't have

commit 5095a2b23987d3c3c47dd16b3d4080e2733b8bb9
Author: Yu Zhao <yuzhao@google.com>
Date:   Thu Dec 7 23:14:05 2023 -0700

    mm/mglru: try to stop at high watermarks

    The initial MGLRU patchset didn't include the memcg LRU support, and it
    relied on should_abort_scan(), added by commit f76c83378851 ("mm:
    multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
    overshooting their aggregate reclaim target by too much".

    Later on when the memcg LRU was added, should_abort_scan() was deemed
    unnecessary, and the test results [1] showed no side effects after it was
    removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction
    fairness safeguard").

    However, that test used memory.reclaim, which sets nr_to_reclaim to
    SWAP_CLUSTER_MAX.  So it can overshoot only by SWAP_CLUSTER_MAX-1 pages,
    i.e., from nr_reclaimed=nr_to_reclaim-1 to
    nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1.  Compared with the batch
    size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny.  Therefore
    that test isn't able to reproduce the worst case scenario, i.e., kswapd
    overshooting GBs on large systems and "consuming 100% CPU" (see the Closes
    tag).

    Bring back a simplified version of should_abort_scan() on top of the memcg
    LRU, so that kswapd stops when all eligible zones are above their
    respective high watermarks plus a small delta to lower the chance of
    KSWAPD_HIGH_WMARK_HIT_QUICKLY.  Note that this only applies to order-0
    reclaim, meaning compaction-induced reclaim can still run wild (which is a
    different problem).

    On Android, launching 55 apps sequentially:
               Before     After      Change
      pgpgin   838377172  802955040  -4%
      pgpgout  38037080   34336300   -10%

    [1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/

    Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com
    Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
    Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
    Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
    Tested-by: Kalesh Singh <kaleshsingh@google.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Kairui Song <ryncsn@gmail.com>
    Cc: T.J. Mercier <tjmercier@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:40 -06:00
Nico Pache 32ede53fe2 mm/mglru: fix underprotected page cache
commit 081488051d28d32569ebb7c7a23572778b2e7d57
Author: Yu Zhao <yuzhao@google.com>
Date:   Thu Dec 7 23:14:04 2023 -0700

    mm/mglru: fix underprotected page cache

    Unmapped folios accessed through file descriptors can be underprotected.
    Those folios are added to the oldest generation based on:

    1. The fact that they are less costly to reclaim (no need to walk the
       rmap and flush the TLB) and have less impact on performance (don't
       cause major PFs and can be non-blocking if needed again).
    2. The observation that they are likely to be single-use. E.g., for
       client use cases like Android, its apps parse configuration files
       and store the data in heap (anon); for server use cases like MySQL,
       it reads from InnoDB files and holds the cached data for tables in
       buffer pools (anon).

    However, the oldest generation can be very short lived, and if so, it
    doesn't provide the PID controller with enough time to respond to a surge
    of refaults.  (Note that the PID controller uses weighted refaults and
    those from evicted generations only take a half of the whole weight.) In
    other words, for a short lived generation, the moving average smooths out
    the spike quickly.

    To fix the problem:
    1. For folios that are already on LRU, if they can be beyond the
       tracking range of tiers, i.e., five accesses through file
       descriptors, move them to the second oldest generation to give them
       more time to age. (Note that tiers are used by the PID controller
       to statistically determine whether folios accessed multiple times
       through file descriptors are worth protecting.)
    2. When adding unmapped folios to LRU, adjust the placement of them so
       that they are not too close to the tail. The effect of this is
       similar to the above.

    On Android, launching 55 apps sequentially:
                               Before     After      Change
      workingset_refault_anon  25641024   25598972   0%
      workingset_refault_file  115016834  106178438  -8%

    Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
    Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Tested-by: Kalesh Singh <kaleshsingh@google.com>
    Cc: T.J. Mercier <tjmercier@google.com>
    Cc: Kairui Song <ryncsn@gmail.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:40 -06:00
Nico Pache a25260c6f7 mm: multi-gen LRU: reuse some legacy trace events
commit 8c2214fc9a470aee0c615aeb14d8c7ce98e45a08
Author: Jaewon Kim <jaewon31.kim@samsung.com>
Date:   Tue Oct 3 20:41:55 2023 +0900

    mm: multi-gen LRU: reuse some legacy trace events

    As the legacy lru provides, the mglru needs some trace events for
    debugging.  Let's reuse following legacy events for the mglru.

      trace_mm_vmscan_lru_isolate
      trace_mm_vmscan_lru_shrink_inactive

    Here's an example
      mm_vmscan_lru_isolate: classzone=2 order=0 nr_requested=4096 nr_scanned=64 nr_skipped=0 nr_taken=64 lru=inactive_file
      mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=64 nr_reclaimed=63 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=1 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

    Link: https://lkml.kernel.org/r/20231003114155.21869-1-jaewon31.kim@samsung.com
    Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kalesh Singh <kaleshsingh@google.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: T.J. Mercier <tjmercier@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:40 -06:00
Nico Pache d2e0b6354b mm: multi-gen LRU: improve design doc
commit 32d32ef140de3cc3f6817999415a72f7b0cb52f5
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Tue Feb 14 03:54:45 2023 +0000

    mm: multi-gen LRU: improve design doc

    This patch improves the design doc. Specifically,
      1. add a section for the per-memcg mm_struct list, and
      2. add a section for the PID controller.

    Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:40 -06:00
Nico Pache d872902d79 mm: multi-gen LRU: clean up sysfs code
commit 9a52b2f32a0942047348b30f866b846da5fcf4e3
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Tue Feb 14 03:54:44 2023 +0000

    mm: multi-gen LRU: clean up sysfs code

    This patch cleans up the sysfs code. Specifically,
      1. use sysfs_emit(),
      2. use __ATTR_RW(), and
      3. constify multi-gen LRU struct attribute_group.

    Link: https://lkml.kernel.org/r/20230214035445.1250139-1-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-28667
Signed-off-by: Nico Pache <npache@redhat.com>
2024-03-13 10:35:39 -06:00
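
An illustrative shape of the three cleanups; the attribute and group names are plausible but written from memory, not copied from the tree:

    /* 1. sysfs_emit() instead of open-coded formatting (store handler omitted) */
    static ssize_t min_ttl_ms_show(struct kobject *kobj,
                                   struct kobj_attribute *attr, char *buf)
    {
            return sysfs_emit(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
    }

    /* 2. __ATTR_RW() instead of spelling out the kobj_attribute by hand */
    static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR_RW(min_ttl_ms);

    /* 3. constified attribute_group */
    static const struct attribute_group lru_gen_attr_group = {
            .name  = "lru_gen",
            .attrs = lru_gen_attrs,
    };
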
Marcelo Tosatti 359cd6f7a1 vmstat: allow_direct_reclaim should use zone_page_state_snapshot
commit 501b26510ae3bbdf9333b83addcd4e5c4214346d
JIRA: https://issues.redhat.com/browse/RHEL-21922

A customer provided evidence indicating that a process
was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   The function wait_event_killable() was called to wait for the condition
   allow_direct_reclaim(pgdat) to become true for the current node.
   allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which just returns the value in
   zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This vmstat inconsistency was caused by the percpu vmstat counters on
   nohz_full cpus. Every increment/decrement of vmstat is performed
   on a percpu vmstat counter first, then the pooled diffs are folded
   into the zone's vmstat counter in a timely manner. However, on nohz_full
   cpus (in this customer's system, 48 of 52 cpus) these pooled
   diffs were never folded in once a cpu had no events on it and
   started sleeping indefinitely.
   I checked the percpu vmstat and found a total of 69 counts not yet
   folded into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number
   of free pages on the node via zone_page_state_snapshot(), which
   accounts for pending counts in the percpu vmstat.
   Therefore kswapd correctly saw the 69 free pages.
   Since zone->_watermark = {8, 20, 32}, kswapd did not run because
   69 was greater than the high watermark of 32.

Change allow_direct_reclaim to use zone_page_state_snapshot, which
allows a more precise version of the vmstat counters to be used.

allow_direct_reclaim will only be called from try_to_free_pages,
which is not a hot path.

Testing: Due to difficulties accessing the system, it has not been
possible for the reporter to test the patch (however, it's
clear from the available data and analysis that it should fix the problem).

Link: https://lkml.kernel.org/r/20230530145335.677325196@redhat.com
Reviewed-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2024-01-17 17:11:06 -03:00
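
The core of the change in allow_direct_reclaim(), as described above (simplified):

    pfmemalloc_reserve += min_wmark_pages(zone);
    /* was zone_page_state(zone, NR_FREE_PAGES), which reads only the folded
     * zone counter and can miss per-CPU deltas stuck on nohz_full CPUs */
    free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
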
Paolo Bonzini 538bf6f332 mm, treewide: redefine MAX_ORDER sanely
JIRA: https://issues.redhat.com/browse/RHEL-10059

MAX_ORDER is currently defined as the number of orders the page allocator
supports: users can ask the buddy allocator for page orders between 0 and
MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all
over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders a
user can ask the buddy allocator for is now 0..MAX_ORDER.

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 23baf831a32c04f9a968812511540b1b3e648bf5)

[RHEL: Fix conflicts by changing MAX_ORDER - 1 to MAX_ORDER,
       ">= MAX_ORDER" to "> MAX_ORDER", etc.]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-30 09:12:37 +01:00
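
The shape of a typical caller fix-up after the redefinition, matching the RHEL conflict note above (illustrative):

    /* before: valid orders were 0 .. MAX_ORDER-1 */
    if (order >= MAX_ORDER)
            return -EINVAL;
    for (order = 0; order < MAX_ORDER; order++)
            ...

    /* after: MAX_ORDER is inclusive, valid orders are 0 .. MAX_ORDER */
    if (order > MAX_ORDER)
            return -EINVAL;
    for (order = 0; order <= MAX_ORDER; order++)
            ...
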
Chris von Recklinghausen ea93adbcc4 Multi-gen LRU: skip CMA pages when they are not eligible
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b7108d66318abf3e060c7839eabcba52e9461568
Author: Charan Teja Kalla <quic_charante@quicinc.com>
Date:   Wed Aug 9 13:35:44 2023 +0530

    Multi-gen LRU: skip CMA pages when they are not eligible

    This patch is based on commit 5da226dbfce3 ("mm: skip CMA pages when
    they are not available"), which skips reclaim of CMA pages when they are
    not eligible for the current allocation context.  In MGLRU, such pages are
    added to the tail of the immediate generation to maintain better LRU
    order, unlike the conventional LRU where such pages are added
    directly to the head of the LRU list (akin to adding to the head of the
    youngest generation in MGLRU).

    No observable issue has been seen on MGLRU without this patch, but logically
    it makes sense to skip reclaim of CMA pages when they cannot satisfy the
    current allocation context.

    Link: https://lkml.kernel.org/r/1691568344-13475-1-git-send-email-quic_charante@quicinc.com
    Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
    Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:24 -04:00
Chris von Recklinghausen 55a7897f81 Multi-gen LRU: fix can_swap in lru_gen_look_around()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a3235ea2a88b7874204c39ebb20feb712f4dba9d
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Tue Aug 1 19:56:04 2023 -0700

    Multi-gen LRU: fix can_swap in lru_gen_look_around()

    walk->can_swap might be invalid since it's not guaranteed to be
    initialized for the particular lruvec.  Instead deduce it from the folio
    type (anon/file).

    Link: https://lkml.kernel.org/r/20230802025606.346758-3-kaleshsingh@google.com
    Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
    Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Steven Barrett <steven@liquorix.net>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:23 -04:00
Chris von Recklinghausen 5c27a48dde Multi-gen LRU: avoid race in inc_min_seq()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bb5e7f234eacf34b65be67ebb3613e3b8cf11b87
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Tue Aug 1 19:56:03 2023 -0700

    Multi-gen LRU: avoid race in inc_min_seq()

    inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This
    is because the generations are reused (the last oldest now empty
    generation will become the next youngest generation).

    inc_min_seq() is retried until successful, dropping the lru_lock
    and yielding the CPU on each failure, and retaking the lock before
    trying again:

            while (!inc_min_seq(lruvec, type, can_swap)) {
                    spin_unlock_irq(&lruvec->lru_lock);
                    cond_resched();
                    spin_lock_irq(&lruvec->lru_lock);
            }

    However, the initial condition that required incrementing the min_seq
    (nr_gens == MAX_NR_GENS) is not retested. This can change by another
    call to inc_max_seq() from run_aging() with force_scan=true from the
    debugfs interface.

    Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
    unnecessarily incrementing the min_seq by rechecking the number of
    generations before each attempt.

    This issue was uncovered in previous discussion on the list by Yu Zhao
    and Aneesh Kumar [1].

    [1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
    Fixes: d6c3af7d8a2b ("mm: multi-gen LRU: debugfs interface")
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
    Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Steven Barrett <steven@liquorix.net>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:23 -04:00
Chris von Recklinghausen 3c845255b9 Multi-gen LRU: fix per-zone reclaim
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 669281ee7ef731fb5204df9d948669bf32a5e68d
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Tue Aug 1 19:56:02 2023 -0700

    Multi-gen LRU: fix per-zone reclaim

    MGLRU has a LRU list for each zone for each type (anon/file) in each
    generation:

            long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];

    The min_seq (oldest generation) can progress independently for each
    type but the max_seq (youngest generation) is shared for both anon and
    file. This is to maintain a common frame of reference.

    In order for eviction to advance the min_seq of a type, all the per-zone
    lists in the oldest generation of that type must be empty.

    The eviction logic only considers pages from eligible zones for
    eviction or promotion.

        scan_folios() {
            ...
            for (zone = sc->reclaim_idx; zone >= 0; zone--)  {
                ...
                sort_folio();       // Promote
                ...
                isolate_folio();    // Evict
            }
            ...
        }

    Consider the system has the movable zone configured and default 4
    generations. The current state of the system is as shown below
    (only illustrating one type for simplicity):

    Type: ANON

            Zone    DMA32     Normal    Movable    Device

            Gen 0       0          0        4GB         0

            Gen 1       0        1GB        1MB         0

            Gen 2     1MB        4GB        1MB         0

            Gen 3     1MB        1MB        1MB         0

    Now consider there is a GFP_KERNEL allocation request (eligible zone
    index <= Normal), evict_folios() will return without doing any work
    since there are no pages to scan in the eligible zones of the oldest
    generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
    allocation request, which may not happen soon if there is a lot of free
    memory in the movable zone. This can lead to OOM kills, although there
    is 1GB of pages in the Normal zone of Gen 1 that we have not yet tried to
    reclaim.

    This issue is not seen in the conventional active/inactive LRU since
    there are no per-zone lists.

    If there are no (not enough) folios to scan in the eligible zones, move
    folios from ineligible zone (zone_index > reclaim_index) to the next
    generation. This allows for the progression of min_seq and reclaiming
    from the next generation (Gen 1).

    Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.

    [1] https://github.com/raspberrypi/linux/issues/5395

    Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
    Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
    Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
    Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Steven Barrett <steven@liquorix.net>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:23 -04:00
Chris von Recklinghausen 1a5e855a72 mm: multi-gen LRU: don't spin during memcg release
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6867c7a3320669cbe44b905a3eb35db725c6d470
Author: T.J. Mercier <tjmercier@google.com>
Date:   Mon Aug 14 15:16:36 2023 +0000

    mm: multi-gen LRU: don't spin during memcg release

    When a memcg is in the process of being released mem_cgroup_tryget will
    fail because its reference count has already reached 0.  This can happen
    during reclaim if the memcg has already been offlined, and we reclaim all
    remaining pages attributed to the offlined memcg.  shrink_many attempts to
    skip the empty memcg in this case, and continues reclaiming from the
    remaining memcgs in the old generation.  If there is only one memcg
    remaining, or if all remaining memcgs are in the process of being released,
    then shrink_many will spin until all memcgs have finished being released.
    The release occurs through a workqueue, so it can take a while before
    kswapd is able to make any further progress.

    This fix results in reductions in kswapd activity and direct reclaim in
    a test where 28 apps (working set size > total memory) are repeatedly
    launched in a random sequence:

                                           A          B      delta   ratio(%)
               allocstall_movable       5962       3539      -2423     -40.64
                allocstall_normal       2661       2417       -244      -9.17
    kswapd_high_wmark_hit_quickly      53152       7594     -45558     -85.71
                       pageoutrun      57365      11750     -45615     -79.52

    Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com
    Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
    Signed-off-by: T.J. Mercier <tjmercier@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:21 -04:00
Chris von Recklinghausen 302d171e4e mm/mglru: make memcg_lru->lock irq safe
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 814bc1de03ea4361101408e63a68e4b82aef22cb
Author: Yu Zhao <yuzhao@google.com>
Date:   Mon Jun 19 13:38:21 2023 -0600

    mm/mglru: make memcg_lru->lock irq safe

    lru_gen_rotate_memcg() can happen in softirq if memory.soft_limit_in_bytes
    is set.  This requires memcg_lru->lock to be irq safe.  Lockdep warns on
    this.

    This problem only affects memcg v1.

    Link: https://lkml.kernel.org/r/20230619193821.2710944-1-yuzhao@google.com
    Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: syzbot+87c490fd2be656269b6a@syzkaller.appspotmail.com
    Closes: https://syzkaller.appspot.com/bug?extid=87c490fd2be656269b6a
    Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:20 -04:00
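
The shape of the conversion (illustrative):

    unsigned long flags;

    /* was: spin_lock(&pgdat->memcg_lru.lock); */
    spin_lock_irqsave(&pgdat->memcg_lru.lock, flags);
    /* ... rotate the memcg within the per-node memcg LRU ... */
    spin_unlock_irqrestore(&pgdat->memcg_lru.lock, flags);
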
Chris von Recklinghausen 03baeae568 mm/mglru: allow pte_offset_map_nolock() to fail
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 52fc048320adf1b1c07a2627461dca9f7d7956ff
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:37:12 2023 -0700

    mm/mglru: allow pte_offset_map_nolock() to fail

    Make MGLRU's walk_pte_range() use the safer pte_offset_map_nolock(), rather
    than pte_lockptr(), to get the ptl for its trylock.  Just return false and
    move on to the next extent if it fails, like when the trylock fails.  Remove
    the VM_WARN_ON_ONCE(pmd_leaf) since that can happen, rarely.

    Link: https://lkml.kernel.org/r/51ece73e-7398-2e4a-2384-56708c87844f@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:18 -04:00
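
A sketch of the pattern described above, as used in walk_pte_range() (simplified):

    pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
    if (!pte)
            return false;           /* mapping changed under us: move on to the next extent */

    if (!spin_trylock(ptl)) {
            pte_unmap(pte);
            return false;           /* same handling as a failed trylock */
    }
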
Chris von Recklinghausen 04aff219a6 mm: skip CMA pages when they are not available
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5da226dbfce3a2f44978c2c7cf88166e69a6788b
Author: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Date:   Wed May 31 10:51:01 2023 +0800

    mm: skip CMA pages when they are not available

    This patch fixes unproductive reclaiming of CMA pages by skipping them
    when they are not available for current context.  It arises from the below
    OOM issue, which was caused by a large proportion of MIGRATE_CMA pages
    among free pages.

    [   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
    [   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
    [   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
    ...
    [   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    [   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
    [   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0

    This change further decreases the chance for wrong OOMs in the presence
    of a lot of CMA memory.

    [david@redhat.com: changelog addition]
    Link: https://lkml.kernel.org/r/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com
    Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: ke.wang <ke.wang@unisoc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:11 -04:00
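
A simplified sketch of the check this patch introduces (close in spirit to upstream's skip_cma(); details recalled from memory, not copied):

    #ifdef CONFIG_CMA
    static bool skip_cma(struct folio *folio, struct scan_control *sc)
    {
            /* CMA pageblocks can only satisfy movable allocations; kswapd is
             * exempt because its gfp_mask cannot express the caller's context. */
            return !current_is_kswapd() &&
                   gfp_migratetype(sc->gfp_mask) != MIGRATE_MOVABLE &&
                   get_pageblock_migratetype(folio_page(folio, 0)) == MIGRATE_CMA;
    }
    #else
    static bool skip_cma(struct folio *folio, struct scan_control *sc)
    {
            return false;
    }
    #endif
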
Chris von Recklinghausen 877a79531b Multi-gen LRU: fix workingset accounting
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3af0191a594d5ca5d6d2e3602b5d4284c6835e77
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Tue May 23 13:59:21 2023 -0700

    Multi-gen LRU: fix workingset accounting

    On Android app cycle workloads, MGLRU showed a significant reduction in
    workingset refaults although pgpgin/pswpin remained relatively unchanged.
    This indicated MGLRU may be undercounting workingset refaults.

    This has impact on userspace programs, like Android's LMKD, that monitor
    workingset refault statistics to detect thrashing.

    It was found that refaults were only accounted if the MGLRU shadow entry
    was for a recently evicted folio.  However, recently evicted folios should
    be accounted as workingset activation, and refaults should be accounted
    regardless of recency.

    Fix MGLRU's workingset refault and activation accounting to more closely
    match that of the conventional active/inactive LRU.

    Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com
    Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:10 -04:00
Chris von Recklinghausen 58a78c1704 mm: multi-gen LRU: add helpers in page table walks
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd02df412cbb9a63e945a647e3dbe4d6f9e06d19
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Mon May 22 11:20:57 2023 +0000

    mm: multi-gen LRU: add helpers in page table walks

    Add helpers to page table walking code:
     - Clarifies intent via name "should_walk_mmu" and "should_clear_pmd_young"
     - Avoids repeating same logic in two places

    Link: https://lkml.kernel.org/r/20230522112058.2965866-3-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Reviewed-by: Yuanchu Xie <yuanchu@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:09 -04:00
Chris von Recklinghausen 03db559323 mm: multi-gen LRU: cleanup lru_gen_soft_reclaim()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5c7e7a0d79072eb02780a2c0dee730b23cde711d
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Mon May 22 11:20:56 2023 +0000

    mm: multi-gen LRU: cleanup lru_gen_soft_reclaim()

    lru_gen_soft_reclaim() gets the lruvec from the memcg and node ID to keep a
    cleaner interface on the caller side.

    Link: https://lkml.kernel.org/r/20230522112058.2965866-2-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Reviewed-by: Yuanchu Xie <yuanchu@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:09 -04:00
Chris von Recklinghausen 85f55a542e mm: multi-gen LRU: use macro for bitmap
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 0285762c6f161c3a93ffc75ba278aad21719460a
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Mon May 22 11:20:55 2023 +0000

    mm: multi-gen LRU: use macro for bitmap

    Use DECLARE_BITMAP macro when possible.

    Link: https://lkml.kernel.org/r/20230522112058.2965866-1-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yuanchu Xie <yuanchu@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:09 -04:00
Chris von Recklinghausen 892292a2cd mm: Multi-gen LRU: remove wait_event_killable()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7f63cf2d9b9bbe7b90f808927558a66ff737d399
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Thu Apr 13 14:43:26 2023 -0700

    mm: Multi-gen LRU: remove wait_event_killable()

    Android 14 and later default to MGLRU [1] and field telemetry showed
    occasional long tail latency (>100ms) in the reclaim path.

    Tracing revealed priority inversion in the reclaim path.  In
    try_to_inc_max_seq(), when high priority tasks were blocked on
    wait_event_killable(), the preemption of the low priority task to call
    wake_up_all() caused those high priority tasks to wait longer than
    necessary.  In general, this problem is not different from others of its
    kind, e.g., one caused by mutex_lock().  However, it is specific to MGLRU
    because it introduced the new wait queue lruvec->mm_state.wait.

    The purpose of this new wait queue is to avoid the thundering herd
    problem.  If many direct reclaimers rush into try_to_inc_max_seq(), only
    one can succeed, i.e., the one to wake up the rest, and the rest who
    failed might cause premature OOM kills if they do not wait.  So far there
    is no evidence supporting this scenario, based on how often the wait has
    been hit.  This raises the question of how useful the wait queue is in
    practice.

    Based on Minchan's recommendation, which is in line with his commit
    6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path") and the
    rest of the MGLRU code which also uses trylock when possible, remove the
    wait queue.

    [1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf

    Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com
    Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Suggested-by: Minchan Kim <minchan@kernel.org>
    Reported-by: Wei Wang <wvw@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
    Cc: Suleiman Souhlal <suleiman@google.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:05 -04:00
Chris von Recklinghausen bc51777160 vmscan: memcg: sleep when flushing stats during reclaim
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 0d856cfedd6bc0cb8e19c1e70d400e79b655cfdd
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Thu Mar 30 19:18:00 2023 +0000

    vmscan: memcg: sleep when flushing stats during reclaim

    Memory reclaim is a sleepable context.  Flushing is an expensive operation
    that scales with the number of cpus and the number of cgroups in the
    system, so avoid doing it atomically where unnecessary.  This can slow down
    the reclaim code if flushing stats takes too long, but there are already
    multiple cond_resched() calls in the reclaim code.

    Link: https://lkml.kernel.org/r/20230330191801.1967435-8-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Michal Koutný <mkoutny@suse.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:05 -04:00
Chris von Recklinghausen 1f0ec65b6a memcg: sleep during flushing stats in safe contexts
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9fad9aee1f267a8ad1f86b87ae70b2c4d6796164
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Thu Mar 30 19:17:58 2023 +0000

    memcg: sleep during flushing stats in safe contexts

    Currently, all contexts that flush memcg stats do so with sleeping not
    allowed.  Some of these contexts are perfectly safe to sleep in, such as
    reading cgroup files from userspace or the background periodic flusher.
    Flushing is an expensive operation that scales with the number of cpus and
    the number of cgroups in the system, so avoid doing it atomically where
    possible.

    Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
    sleepable), and provide a separate atomic version.  The atomic version is
    used in reclaim, refault, writeback, and in mem_cgroup_usage().  All other
    code paths are left to use the non-atomic version.  This includes
    callbacks for userspace reads and the periodic flusher.

    Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
    change it to mem_cgroup_flush_stats_atomic_ratelimited().  Reclaim and
    refault code paths are modified to do non-atomic flushing in separate
    later patches -- so it will eventually be changed back to
    mem_cgroup_flush_stats_ratelimited().

    Link: https://lkml.kernel.org/r/20230330191801.1967435-6-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Michal Koutný <mkoutny@suse.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:04 -04:00
Chris von Recklinghausen 2f029d0688 mm: multi-gen LRU: avoid futile retries
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9f550d78b40da21b4da515db4c37d8d7b12aa1a6
Author: Yu Zhao <yuzhao@google.com>
Date:   Mon Feb 13 00:53:22 2023 -0700

    mm: multi-gen LRU: avoid futile retries

    Recall that the per-node memcg LRU has two generations and they alternate
    when the last memcg (of a given node) is moved from one to the other.
    Each generation is also sharded into multiple bins to improve scalability.
    A reclaimer starts with a random bin (in the old generation) and, if it
    fails, it will retry, i.e., to try the rest of the bins.

    If a reclaimer fails with the last memcg, it should move this memcg to the
    young generation first, which causes the generations to alternate, and
    then retry.  Otherwise, the retries will be futile because all other bins
    are empty.

    Link: https://lkml.kernel.org/r/20230213075322.1416966-1-yuzhao@google.com
    Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: T.J. Mercier <tjmercier@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:59 -04:00
Chris von Recklinghausen 807c9e758a mm: multi-gen LRU: simplify lru_gen_look_around()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit abf086721a2f1e6897c57796f7268df1b194c750
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:27 2023 +0000

    mm: multi-gen LRU: simplify lru_gen_look_around()

    Update the folio generation in place with or without
    current->reclaim_state->mm_walk.  The LRU lock is held for longer if
    mm_walk is NULL and the number of folios to update is more than
    PAGEVEC_SIZE.

    This causes a measurable regression from the LRU lock contention during a
    microbenchmark.  But a tiny regression is not worth the complexity.

    Link: https://lkml.kernel.org/r/20230118001827.1040870-8-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:58 -04:00
Chris von Recklinghausen c3a81791e6 mm: multi-gen LRU: improve walk_pmd_range()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b5ff4133617d0eced35b685da0bd0929dd9fabb7
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:26 2023 +0000

    mm: multi-gen LRU: improve walk_pmd_range()

    Improve readability of walk_pmd_range() and walk_pmd_range_locked().

    Link: https://lkml.kernel.org/r/20230118001827.1040870-7-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:57 -04:00
Chris von Recklinghausen a824614126 mm: multi-gen LRU: improve lru_gen_exit_memcg()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 37cc99979d04cca677c0ad5c0acd1149ec165d1b
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:25 2023 +0000

    mm: multi-gen LRU: improve lru_gen_exit_memcg()

    Add warnings and poison ->next.

    Link: https://lkml.kernel.org/r/20230118001827.1040870-6-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:57 -04:00
Chris von Recklinghausen 2999479542 mm: multi-gen LRU: section for memcg LRU
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 36c7b4db7c942ae9e1b111f0c6b468c8b2e33842
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:24 2023 +0000

    mm: multi-gen LRU: section for memcg LRU

    Move memcg LRU code into a dedicated section.  Improve the design doc to
    outline its architecture.

    Link: https://lkml.kernel.org/r/20230118001827.1040870-5-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:57 -04:00
Chris von Recklinghausen 0f6cc837a3 mm: multi-gen LRU: section for Bloom filters
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit ccbbbb85945d8f0255aa9dbc1b617017e2294f2c
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:23 2023 +0000

    mm: multi-gen LRU: section for Bloom filters

    Move Bloom filters code into a dedicated section.  Improve the design doc
    to explain Bloom filter usage and connection between aging and eviction in
    their use.

    Link: https://lkml.kernel.org/r/20230118001827.1040870-4-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:56 -04:00
Chris von Recklinghausen ec67930f48 mm: multi-gen LRU: section for rmap/PT walk feedback
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit db19a43d9b3a8876552f00f656008206ef9a5efa
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:22 2023 +0000

    mm: multi-gen LRU: section for rmap/PT walk feedback

    Add a section for lru_gen_look_around() in the code and the design doc.

    Link: https://lkml.kernel.org/r/20230118001827.1040870-3-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:56 -04:00
Chris von Recklinghausen 461c2c84f0 mm: multi-gen LRU: section for working set protection
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7b8144e63d84716f16a1b929e0c7e03ae5c4d5c1
Author: T.J. Alumbaugh <talumbau@google.com>
Date:   Wed Jan 18 00:18:21 2023 +0000

    mm: multi-gen LRU: section for working set protection

    Patch series "mm: multi-gen LRU: improve".

    This patch series improves a few MGLRU functions, collects related
    functions, and adds additional documentation.

    This patch (of 7):

    Add a section for working set protection in the code and the design doc.
    The admin doc already contains its usage.

    Link: https://lkml.kernel.org/r/20230118001827.1040870-1-talumbau@google.com
    Link: https://lkml.kernel.org/r/20230118001827.1040870-2-talumbau@google.com
    Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:56 -04:00
Chris von Recklinghausen a5c269d49f mm: multi-gen LRU: simplify arch_has_hw_pte_young() check
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f386e9314025ea99dae639ed2032560a92081430
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:06 2022 -0700

    mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

    Scanning page tables when hardware does not set the accessed bit has
    no real use cases.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-9-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:50 -04:00
Chris von Recklinghausen db2ea18713 mm: multi-gen LRU: clarify scan_control flags
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e9d4e1ee788097484606c32122f146d802a9c5fb
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:05 2022 -0700

    mm: multi-gen LRU: clarify scan_control flags

    Among the flags in scan_control:
    1. sc->may_swap, which indicates swap constraint due to memsw.max, is
       supported as usual.
    2. sc->proactive, which indicates reclaim by memory.reclaim, may not
       opportunistically skip the aging path, since it is considered less
       latency sensitive.
    3. !(sc->gfp_mask & __GFP_IO), which indicates IO constraint, lowers
       swappiness to prioritize file LRU, since clean file folios are more
       likely to exist.
    4. sc->may_writepage and sc->may_unmap, which indicate opportunistic
       reclaim, are rejected, since unmapped clean folios are already
       prioritized. Scanning for more of them is likely futile and can
       cause high reclaim latency when there is a large number of memcgs.

    The rest are handled by the existing code.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-8-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:50 -04:00
Chris von Recklinghausen 62d9864b57 mm: multi-gen LRU: per-node lru_gen_folio lists
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e4dde56cd208674ce899b47589f263499e5b8cdc
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:04 2022 -0700

    mm: multi-gen LRU: per-node lru_gen_folio lists

    For each node, memcgs are divided into two generations: the old and
    the young. For each generation, memcgs are randomly sharded into
    multiple bins to improve scalability. For each bin, an RCU hlist_nulls
    is virtually divided into three segments: the head, the tail and the
    default.

    An onlining memcg is added to the tail of a random bin in the old
    generation. The eviction starts at the head of a random bin in the old
    generation. The per-node memcg generation counter, whose remainder (mod
    2) indexes the old generation, is incremented when all its bins become
    empty.

    There are four operations:
    1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
       its current generation (old or young) and updates its "seg" to
       "head";
    2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
       its current generation (old or young) and updates its "seg" to
       "tail";
    3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
       the old generation, updates its "gen" to "old" and resets its "seg"
       to "default";
    4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
       in the young generation, updates its "gen" to "young" and resets
       its "seg" to "default".

    The events that trigger the above operations are:
    1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
    2. The first attempt to reclaim an memcg below low, which triggers
       MEMCG_LRU_TAIL;
    3. The first attempt to reclaim an memcg below reclaimable size
       threshold, which triggers MEMCG_LRU_TAIL;
    4. The second attempt to reclaim an memcg below reclaimable size
       threshold, which triggers MEMCG_LRU_YOUNG;
    5. Attempting to reclaim an memcg below min, which triggers
       MEMCG_LRU_YOUNG;
    6. Finishing the aging on the eviction path, which triggers
       MEMCG_LRU_YOUNG;
    7. Offlining an memcg, which triggers MEMCG_LRU_OLD.

    Note that memcg LRU only applies to global reclaim, and the
    round-robin incrementing of their max_seq counters ensures the
    eventual fairness to all eligible memcgs. For memcg reclaim, it still
    relies on mem_cgroup_iter().

    Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:50 -04:00
Chris von Recklinghausen 158e1791a7 mm: multi-gen LRU: shuffle should_run_aging()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 77d4459a4a1a472b7309e475f962dda87d950abd
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:03 2022 -0700

    mm: multi-gen LRU: shuffle should_run_aging()

    Move should_run_aging() next to its only caller left.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-6-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:49 -04:00
Chris von Recklinghausen cfb511dc3a mm: multi-gen LRU: remove aging fairness safeguard
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 7348cc91821b0cb24dfb00e578047f68299a50ab
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:02 2022 -0700

    mm: multi-gen LRU: remove aging fairness safeguard

    Recall that the aging produces the youngest generation: first it scans
    for accessed folios and updates their gen counters; then it increments
    lrugen->max_seq.

    The current aging fairness safeguard for kswapd uses two passes to
    ensure the fairness to multiple eligible memcgs. On the first pass,
    which is shared with the eviction, it checks whether all eligible
    memcgs are low on cold folios. If so, it requires a second pass, on
    which it ages all those memcgs at the same time.

    With memcg LRU, the aging, while ensuring eventual fairness, will run
    when necessary. Therefore the current aging fairness safeguard for
    kswapd will not be needed.

    Note that memcg LRU only applies to global reclaim. For memcg reclaim,
    the aging can be unfair to different memcgs, i.e., their
    lrugen->max_seq can be incremented at different paces.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:49 -04:00
Chris von Recklinghausen 41a681490c mm: multi-gen LRU: remove eviction fairness safeguard
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a579086c99ed70cc4bfc104348dbe3dd8f2787e6
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:01 2022 -0700

    mm: multi-gen LRU: remove eviction fairness safeguard

    Recall that the eviction consumes the oldest generation: first it
    bucket-sorts folios whose gen counters were updated by the aging and
    reclaims the rest; then it increments lrugen->min_seq.

    The current eviction fairness safeguard for global reclaim has a
    dilemma: when there are multiple eligible memcgs, should it continue
    or stop upon meeting the reclaim goal? If it continues, it overshoots
    and increases direct reclaim latency; if it stops, it loses fairness
    between memcgs it has taken memory away from and those it has yet to.

    With memcg LRU, the eviction, while ensuring eventual fairness, will
    stop upon meeting its goal. Therefore the current eviction fairness
    safeguard for global reclaim will not be needed.

    Note that memcg LRU only applies to global reclaim. For memcg reclaim,
    the eviction will continue, even if it is overshooting. This becomes
    unconditional due to code simplification.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:49 -04:00
Chris von Recklinghausen 2151fed10f mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6df1b2212950aae2b2188c6645ea18e2a9e3fdd5
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:19:00 2022 -0700

    mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]

    lru_gen_folio will be chained into per-node lists by the coming
    lrugen->list.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:48 -04:00
Chris von Recklinghausen b180a6d21f mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 391655fe08d1f942359a11148aa9aaf3f99d6d6f
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Dec 21 21:18:59 2022 -0700

    mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio

    Patch series "mm: multi-gen LRU: memcg LRU", v3.

    Overview
    ========

    An memcg LRU is a per-node LRU of memcgs.  It is also an LRU of LRUs,
    since each node and memcg combination has an LRU of folios (see
    mem_cgroup_lruvec()).

    Its goal is to improve the scalability of global reclaim, which is
    critical to system-wide memory overcommit in data centers.  Note that
    memcg reclaim is currently out of scope.

    Its memory bloat is a pointer to each lruvec and negligible to each
    pglist_data.  In terms of traversing memcgs during global reclaim, it
    improves the best-case complexity from O(n) to O(1) and does not affect
    the worst-case complexity O(n).  Therefore, on average, it has a sublinear
    complexity in contrast to the current linear complexity.

    The basic structure of an memcg LRU can be understood by an analogy to
    the active/inactive LRU (of folios):
    1. It has the young and the old (generations), i.e., the counterparts
       to the active and the inactive;
    2. The increment of max_seq triggers promotion, i.e., the counterpart
       to activation;
    3. Other events trigger similar operations, e.g., offlining an memcg
       triggers demotion, i.e., the counterpart to deactivation.

    In terms of global reclaim, it has two distinct features:
    1. Sharding, which allows each thread to start at a random memcg (in
       the old generation) and improves parallelism;
    2. Eventual fairness, which allows direct reclaim to bail out at will
       and reduces latency without affecting fairness over some time.

    The commit message in patch 6 details the workflow:
    https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/

    The following is a simple test to quickly verify its effectiveness.

      Test design:
      1. Create multiple memcgs.
      2. Each memcg contains a job (fio).
      3. All jobs access the same amount of memory randomly.
      4. The system does not experience global memory pressure.
      5. Periodically write to the root memory.reclaim.

      Desired outcome:
      1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
         over mean(pgsteal) is close to 0%.
      2. The total pgsteal is close to the total requested through
         memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
         to 100%.

      Actual outcome [1]:
                                         MGLRU off    MGLRU on
      stddev(pgsteal) / mean(pgsteal)    75%          20%
      sum(pgsteal) / sum(requested)      425%         95%

      ####################################################################
      MEMCGS=128

      for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
          mkdir /sys/fs/cgroup/memcg$memcg
      done

      start() {
          echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

          fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
              --filename=/dev/zero --size=1920M --rw=randrw \
              --rate=64m,64m --random_distribution=random \
              --fadvise_hint=0 --time_based --runtime=10h \
              --group_reporting --minimal
      }

      for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
          start &
      done

      sleep 600

      for ((i = 0; i < 600; i++)); do
          echo 256m >/sys/fs/cgroup/memory.reclaim
          sleep 6
      done

      for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
          grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
      done
      ####################################################################

    [1]: This was obtained from running the above script (touches less
         than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
         hour.

    This patch (of 8):

    The new name lru_gen_folio will be more distinct from the coming
    lru_gen_memcg.

    Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
    Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:48 -04:00
Chris von Recklinghausen 4ea2a030e8 mm: multi-gen LRU: fix crash during cgroup migration
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit de08eaa6156405f2e9369f06ba5afae0e4ab3b62
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Jan 15 20:44:05 2023 -0700

    mm: multi-gen LRU: fix crash during cgroup migration

    lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself.  This
    isn't true for the following scenario:

        CPU 1                         CPU 2

      clone()
        cgroup_can_fork()
                                    cgroup_procs_write()
        cgroup_post_fork()
                                      task_lock()
                                      lru_gen_migrate_mm()
                                      task_unlock()
        task_lock()
        lru_gen_add_mm()
        task_unlock()

    And when the above happens, kernel crashes because of linked list
    corruption (mm_struct->lru_gen.list).

    Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/
    Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com
    Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: msizanoen <msizanoen@qtmlabs.xyz>
    Tested-by: msizanoen <msizanoen@qtmlabs.xyz>
    Cc: <stable@vger.kernel.org>    [6.1+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:39 -04:00
Chris von Recklinghausen 56f21bc526 mm: Rename pmd_read_atomic()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit dab6e717429e5ec795d558a0e9a5337a1ed33a3d
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Nov 26 17:20:28 2020 +0100

    mm: Rename pmd_read_atomic()

    There's no point in having the identical routines for PTE/PMD have
    different names.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20221022114424.841277397%40infradead.org

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:37 -04:00
Chris von Recklinghausen b55ed7d6c3 mm: memcg: fix swapcached stat accounting
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit c449deb2b99ff2458214ed4a3526277bc9e40757
Author: Hugh Dickins <hughd@google.com>
Date:   Sun Dec 4 17:01:03 2022 -0800

    mm: memcg: fix swapcached stat accounting

    I'd been worried by high "swapcached" counts in memcg OOM reports and
    thought we had a problem freeing swapcache, but it was just the accounting
    that was wrong.

    Two issues:

    1.  When __remove_mapping() removes swapcache,
       __delete_from_swap_cache() relies on memcg_data for the right counts to
       be updated; but that had already been reset by mem_cgroup_swapout().
       Swap those calls around - mem_cgroup_swapout() does not require the
       swapcached flag to be set.

       6.1 commit ac35a4902374 ("mm: multi-gen LRU: minimal
       implementation") already made a similar swap for workingset_eviction(),
       but not for this.

    2.  memcg's "swapcached" count was added for memcg v2 stats, but
       displayed on OOM even for memcg v1: so mem_cgroup_move_account() ought
       to move it.
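
    The counter in question is the v2 "swapcached" entry in memory.stat; a
    minimal way to watch it from userspace (the cgroup2 mount point and group
    name are assumptions used only for illustration):

      # with the accounting fixed, this value should track actual swapcache usage
      grep -w swapcached /sys/fs/cgroup/example-group/memory.stat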

    Link: https://lkml.kernel.org/r/b8b96ee0-1e1e-85f8-df97-c82a11d7cd14@google.com
    Fixes: b603894248 ("mm: memcg: add swapcache stat for memcg v2")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:35 -04:00
Chris von Recklinghausen cde03b30f9 mm: memcg: fix stale protection of reclaim target memcg
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit adb8213014b25c7f1d75d5b219becaadcd695efb
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Fri Dec 2 03:15:10 2022 +0000

    mm: memcg: fix stale protection of reclaim target memcg

    Patch series "mm: memcg: fix protection of reclaim target memcg", v3.

    This series fixes a bug in calculating the protection of the reclaim
    target memcg where we end up using stale effective protection values from
    the last reclaim operation, instead of completely ignoring the protection
    of the reclaim target as intended.  More detailed explanation and examples
    in patch 1, which includes the fix.  Patches 2 & 3 introduce a selftest
    case that catches the bug.

    This patch (of 3):

    When we are doing memcg reclaim, the intended behavior is that we
    ignore any protection (memory.min, memory.low) of the target memcg (but
    not its children).  Ever since the patch pointed to by the "Fixes" tag,
    we actually read a stale value for the target memcg protection when
    deciding whether to skip the memcg or not because it is protected.  If
    the stale value happens to be high enough, we don't reclaim from the
    target memcg.

    Essentially, in some cases we may falsely skip reclaiming from the
    target memcg of reclaim because we read a stale protection value from
    last time we reclaimed from it.

    During reclaim, mem_cgroup_calculate_protection() is used to determine the
    effective protection (emin and elow) values of a memcg.  The protection of
    the reclaim target is ignored, but we cannot set their effective
    protection to 0 due to a limitation of the current implementation (see
    comment in mem_cgroup_protection()).  Instead, we leave their effective
    protection values unchaged, and later ignore it in
    mem_cgroup_protection().

    However, mem_cgroup_protection() is called later in
    shrink_lruvec()->get_scan_count(), which is after the
    mem_cgroup_below_{min/low}() checks in shrink_node_memcgs().  As a result,
    the stale effective protection values of the target memcg may lead us to
    skip reclaiming from the target memcg entirely, before calling
    shrink_lruvec().  This can be even worse with recursive protection, where
    the stale target memcg protection can be higher than its standalone
    protection.  See two examples below (a similar version of example (a) is
    added to test_memcontrol in a later patch).

    (a) A simple example with proactive reclaim is as follows. Consider the
    following hierarchy:
    ROOT
     |
     A
     |
     B (memory.min = 10M)

    Consider the following scenario:
    - B has memory.current = 10M.
    - The system undergoes global reclaim (or memcg reclaim in A).
    - In shrink_node_memcgs():
      - mem_cgroup_calculate_protection() calculates the effective min (emin)
        of B as 10M.
      - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
    - Now if we want to reclaim 5M from B using proactive reclaim
      (memory.reclaim), we should be able to, as the protection of the
      target memcg should be ignored.
    - In shrink_node_memcgs():
      - mem_cgroup_calculate_protection() immediately returns for B without
        doing anything, as B is the target memcg, relying on
        mem_cgroup_protection() to ignore B's stale effective min (still 10M).
      - mem_cgroup_below_min() reads the stale effective min for B and we
        skip it instead of ignoring its protection as intended, as we never
        reach mem_cgroup_protection().

    (b) A more complex example with recursive protection is as follows.
    Consider the following hierarchy with memory_recursiveprot:
    ROOT
     |
     A (memory.min = 50M)
     |
     B (memory.min = 10M, memory.high = 40M)

    Consider the following scenario:
    - B has memory.current = 35M.
    - The system undergoes global reclaim (target memcg is NULL).
    - B will have an effective min of 50M (all of A's unclaimed protection).
    - B will not be reclaimed from.
    - Now allocate 10M more memory in B, pushing it above its high limit.
    - The system undergoes memcg reclaim from B (target memcg is B).
    - Like example (a), we do nothing in mem_cgroup_calculate_protection(),
      then call mem_cgroup_below_min(), which will read the stale effective
      min for B (50M) and skip it. In this case, it's even worse because we
      are not just considering B's standalone protection (10M), but we are
      reading a much higher stale protection (50M) which will cause us to not
      reclaim from B at all.

    This is an artifact of commit 45c7f7e1ef ("mm, memcg: decouple
    e{low,min} state mutations from protection checks") which made
    mem_cgroup_calculate_protection() only change the state without returning
    any value.  Before that commit, we used to return MEMCG_PROT_NONE for the
    target memcg, which would cause us to skip the
    mem_cgroup_below_{min/low}() checks.  After that commit we do not return
    anything and we end up checking the min & low effective protections for
    the target memcg, which are stale.

    Update mem_cgroup_supports_protection() to also check if we are reclaiming
    from the target, and rename it to mem_cgroup_unprotected() (now returns
    true if we should not protect the memcg, much simpler logic).
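
    A rough userspace reproduction of example (a) above, using cgroup v2 and
    memory.reclaim; the mount point, group name and sizes are illustrative
    assumptions, not part of the patch:

      # leaf cgroup with memory.min = 10M
      echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
      mkdir /sys/fs/cgroup/B
      echo 10M > /sys/fs/cgroup/B/memory.min
      echo $$ > /sys/fs/cgroup/B/cgroup.procs
      # ... run a workload here that keeps roughly 10M resident ...

      # proactive reclaim targeting B should ignore B's own protection;
      # with the stale emin bug, B could be skipped entirely instead
      echo 5M > /sys/fs/cgroup/B/memory.reclaim
      grep -E '^(pgscan|pgsteal) ' /sys/fs/cgroup/B/memory.stat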

    Link: https://lkml.kernel.org/r/20221202031512.1365483-1-yosryahmed@google.com
    Link: https://lkml.kernel.org/r/20221202031512.1365483-2-yosryahmed@google.com
    Fixes: 45c7f7e1ef ("mm, memcg: decouple e{low,min} state mutations from protection checks")
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Chris Down <chris@chrisdown.name>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:34 -04:00
Chris von Recklinghausen d93e594726 mm: multi-gen LRU: remove NULL checks on NODE_DATA()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 931b6a8b36a2de3985eca27e758900e70cd99779
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue Nov 15 18:38:08 2022 -0700

    mm: multi-gen LRU: remove NULL checks on NODE_DATA()

    NODE_DATA() is preallocated for all possible nodes after commit
    09f49dca570a ("mm: handle uninitialized numa nodes gracefully").  Checking
    its return value against NULL is now unnecessary.

    Link: https://lkml.kernel.org/r/20221116013808.3995280-2-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:28 -04:00
Chris von Recklinghausen 279f2298f9 mm: vmscan: split khugepaged stats from direct reclaim stats
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 57e9cc50f4dd926d6c38751799d25cad89fb2bd9
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed Oct 26 14:01:33 2022 -0400

    mm: vmscan: split khugepaged stats from direct reclaim stats

    Direct reclaim stats are useful for identifying a potential source for
    application latency, as well as spotting issues with kswapd.  However,
    khugepaged currently distorts the picture: as a kernel thread it doesn't
    impose allocation latencies on userspace, and it explicitly opts out of
    kswapd reclaim.  Its activity showing up in the direct reclaim stats is
    misleading.  Counting it as kswapd reclaim could also cause confusion when
    trying to understand actual kswapd behavior.

    Break out khugepaged from the direct reclaim counters into new
    pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters.

    Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS):

    pgsteal_kswapd 1342185
    pgsteal_direct 0
    pgsteal_khugepaged 3623
    pgscan_kswapd 1345025
    pgscan_direct 0
    pgscan_khugepaged 3623
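
    The new split can be inspected directly from /proc/vmstat, for example:

      # khugepaged's reclaim activity now has its own counters,
      # separate from kswapd and direct reclaim
      grep -E '^(pgsteal|pgscan)_(kswapd|direct|khugepaged) ' /proc/vmstat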

    Link: https://lkml.kernel.org/r/20221026180133.377671-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reported-by: Eric Bergen <ebergen@meta.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:22 -04:00
Chris von Recklinghausen 2863cdd5ac mm: introduce arch_has_hw_nonleaf_pmd_young()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4aaf269c768dbacd6268af73fda2ffccaa3f1d88
Author: Juergen Gross <jgross@suse.com>
Date:   Wed Nov 23 07:45:10 2022 +0100

    mm: introduce arch_has_hw_nonleaf_pmd_young()

    When running as a Xen PV guest, commit eed9a328aa1a ("mm: x86: add
    CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in
    pmdp_test_and_clear_young():

     BUG: unable to handle page fault for address: ffff8880083374d0
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0003) - permissions violation
     PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065
     Oops: 0003 [#1] PREEMPT SMP NOPTI
     CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1
     RIP: e030:pmdp_test_and_clear_young+0x25/0x40

    This happens because the Xen hypervisor can't emulate direct writes to
    page table entries other than PTEs.

    This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young()
    similar to arch_has_hw_pte_young() and test that instead of
    CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG.

    Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com
    Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG")
    Signed-off-by: Juergen Gross <jgross@suse.com>
    Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
    Acked-by: David Hildenbrand <david@redhat.com>  [core changes]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:13 -04:00
Chris von Recklinghausen e8c2a460a5 mm: multi-gen LRU: retry folios written back while isolated
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 359a5e1416caaf9ce28396a65ed3e386cc5de663
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue Nov 15 18:38:07 2022 -0700

    mm: multi-gen LRU: retry folios written back while isolated

    The page reclaim isolates a batch of folios from the tail of one of the
    LRU lists and works on those folios one by one.  For a suitable
    swap-backed folio, if the swap device is async, it queues that folio for
    writeback.  After the page reclaim finishes an entire batch, it puts back
    the folios it queued for writeback to the head of the original LRU list.

    In the meantime, the page writeback flushes the queued folios also by
    batches.  Its batching logic is independent from that of the page reclaim.
    For each of the folios it writes back, the page writeback calls
    folio_rotate_reclaimable() which tries to rotate a folio to the tail.

    folio_rotate_reclaimable() only works for a folio after the page reclaim
    has put it back.  If an async swap device is fast enough, the page
    writeback can finish with that folio while the page reclaim is still
    working on the rest of the batch containing it.  In this case, that folio
    will remain at the head and the page reclaim will not retry it before
    reaching there.

    This patch adds a retry to evict_folios().  After evict_folios() has
    finished an entire batch and before it puts back folios it cannot free
    immediately, it retries those that may have missed the rotation.

    Before this patch, ~60% of folios swapped to an Intel Optane missed
    folio_rotate_reclaimable().  After this patch, ~99% of missed folios were
    reclaimed upon retry.

    This problem affects relatively slow async swap devices like Samsung 980
    Pro much less and does not affect sync swap devices like zram or zswap at
    all.

    Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
    Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:11 -04:00
Chris von Recklinghausen 5281fe11f1 mglru: mm/vmscan.c: fix imprecise comments
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e4fea72b143848d8bbbeae6d39a890212bcf848e
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Sep 28 12:46:20 2022 -0600

    mglru: mm/vmscan.c: fix imprecise comments

    Link: https://lkml.kernel.org/r/YzSWfFI+MOeb1ils@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:54 -04:00
Chris von Recklinghausen b291297c02 mm/mglru: don't sync disk for each aging cycle
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 14aa8b2d5c2ebead01b542f62d68029023054774
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Sep 28 13:36:58 2022 -0600

    mm/mglru: don't sync disk for each aging cycle

    wakeup_flusher_threads() was added under the assumption that if a system
    runs out of clean cold pages, it might want to write back dirty pages more
    aggressively so that they can become clean and be dropped.

    However, doing so can breach the rate limit a system wants to impose on
    writeback, resulting in early SSD wearout.

    Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
    Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reported-by: Axel Rasmussen <axelrasmussen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:54 -04:00
Chris von Recklinghausen a38c145ddd memcg: convert mem_cgroup_swap_full() to take a folio
Conflicts: mm/memcontrol.c - We already have
	b25806dcd3d5 ("mm: memcontrol: deprecate swapaccounting=0 mode")
	and
	b6c1a8af5b1ee ("mm: memcontrol: add new kernel parameter cgroup.memory=nobpf")
	so keep existing check in mem_cgroup_get_nr_swap_pages
	(surrounding context)

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 9202d527b715f67bcdccbb9b712b65fe053f8109
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:43 2022 +0100

    memcg: convert mem_cgroup_swap_full() to take a folio

    All callers now have a folio, so convert the function to take a folio.
    Saves a couple of calls to compound_head().

    Link: https://lkml.kernel.org/r/20220902194653.1739778-48-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:03 -04:00
Chris von Recklinghausen 64858af7ce mm/swap: convert put_swap_page() to put_swap_folio()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4081f7446d95a9d3ced12dc04ff02c187a761e90
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:09 2022 +0100

    mm/swap: convert put_swap_page() to put_swap_folio()

    With all callers now using a folio, we can convert this function.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:53 -04:00
Chris von Recklinghausen e83f60fd5c mm/swapfile: convert try_to_free_swap() to folio_free_swap()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bdb0ed54a4768dc3c2613d4c45f94c887d43cd7a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:06 2022 +0100

    mm/swapfile: convert try_to_free_swap() to folio_free_swap()

    Add kernel-doc for folio_free_swap() and make it return bool.  Add a
    try_to_free_swap() compatibility wrapper.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:52 -04:00
Chris von Recklinghausen 525f303b5c mm/vmscan: fix a lot of comments
Conflicts: mm/vmscan.c - We already have
	6fc7da1ce3 ("mm: vmscan: make rotations a secondary factor in balancing anon vs file")
	so keep calling lru_note_cost with nr_scanned - nr_reclaimed.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 49fd9b6df54e610d817f04ab0f94919f5c1a4f66
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:45:57 2022 +0100

    mm/vmscan: fix a lot of comments

    Patch series "MM folio changes for 6.1", v2.

    My focus this round has been on shmem.  I believe it is now fully
    converted to folios.  Of course, shmem interacts with a lot of the swap
    cache and other parts of the kernel, so there are patches all over the MM.

    This patch series survives a round of xfstests on tmpfs, which is nice,
    but hardly an exhaustive test.  Hugh was nice enough to run a round of
    tests on it and found a bug which is fixed in this edition.

    This patch (of 57):

    A lot of comments mention pages when they should say folios.
    Fix them up.

    [akpm@linux-foundation.org: fixups for mglru additions]
    Link: https://lkml.kernel.org/r/20220902194653.1739778-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20220902194653.1739778-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:50 -04:00
Chris von Recklinghausen 37b681b2ba mm: multi-gen LRU: admin guide
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 07017acb06012d250fb68930e809257e6694d324
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:10 2022 -0600

    mm: multi-gen LRU: admin guide

    Add an admin guide.

    Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:47 -04:00
Chris von Recklinghausen 2903f64630 mm: multi-gen LRU: debugfs interface
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d6c3af7d8a2ba5602c28841248c551a712ac50f5
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:09 2022 -0600

    mm: multi-gen LRU: debugfs interface

    Add /sys/kernel/debug/lru_gen for working set estimation and proactive
    reclaim.  These techniques are commonly used to optimize job scheduling
    (bin packing) in data centers [1][2].

    Compared with the page table-based approach and the PFN-based
    approach, this lruvec-based approach has the following advantages:
    1. It offers better choices because it is aware of memcgs, NUMA nodes,
       shared mappings and unmapped page cache.
    2. It is more scalable because it is O(nr_hot_pages), whereas the
       PFN-based approach is O(nr_total_pages).

    Add /sys/kernel/debug/lru_gen_full for debugging.

    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731
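
    A minimal look at the two files from userspace (assuming debugfs is
    mounted at /sys/kernel/debug):

      # per-memcg, per-node generation summary used for working set estimation
      cat /sys/kernel/debug/lru_gen
      # more detailed output intended for debugging
      cat /sys/kernel/debug/lru_gen_full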

    Link: https://lkml.kernel.org/r/20220918080010.2920238-13-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:47 -04:00
Chris von Recklinghausen b3383ea3d3 mm: multi-gen LRU: thrashing prevention
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 1332a809d95a4fc763cabe5ecb6d4fb6a6d941b2
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:08 2022 -0600

    mm: multi-gen LRU: thrashing prevention

    Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
    requested by many desktop users [1].

    When set to value N, it prevents the working set of N milliseconds from
    getting evicted.  The OOM killer is triggered if this working set cannot
    be kept in memory.  Based on the average human detectable lag (~100ms),
    N=1000 usually eliminates intolerable lags due to thrashing.  Larger
    values like N=3000 make lags less noticeable at the risk of premature OOM
    kills.
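
    In practice the knob is driven from userspace like this (the value 1000
    matches the N=1000 suggestion above; root privileges assumed):

      # keep the working set of the last ~1000ms in memory;
      # trigger the OOM killer rather than thrash below that
      echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms
      cat /sys/kernel/mm/lru_gen/min_ttl_ms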

    Compared with the size-based approach [2], this time-based approach
    has the following advantages:

    1. It is easier to configure because it is agnostic to applications
       and memory sizes.
    2. It is more reliable because it is directly wired to the OOM killer.

    [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
    [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-12-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:47 -04:00
Chris von Recklinghausen 9d674e8130 mm: multi-gen LRU: kill switch
Conflicts: include/linux/cgroup.h, kernel/cgroup/cgroup-internal.h -
	hunks already added as part of RHEL commit
	3a21b833b9 ("cgroup: Export cgroup_mutex")

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 354ed597442952fb680c9cafc7e4eb8a76f9514c
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:07 2022 -0600

    mm: multi-gen LRU: kill switch

    Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
    can be disabled include:
      0x0001: the multi-gen LRU core
      0x0002: walking page table, when arch_has_hw_pte_young() returns
              true
      0x0004: clearing the accessed bit in non-leaf PMD entries, when
              CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
      [yYnN]: apply to all the components above
    E.g.,
      echo y >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0007
      echo 5 >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0005

    NB: the page table walks happen on the scale of seconds under heavy memory
    pressure, in which case the mmap_lock contention is a lesser concern,
    compared with the LRU lock contention and the I/O congestion.  So far the
    only well-known case of the mmap_lock contention happens on Android, due
    to Scudo [1] which allocates several thousand VMAs for merely a few
    hundred MBs.  The SPF and the Maple Tree also have provided their own
    assessments [2][3].  However, if walking page tables does worsen the
    mmap_lock contention, the kill switch can be used to disable it.  In this
    case the multi-gen LRU will suffer a minor performance degradation, as
    shown previously.

    Clearing the accessed bit in non-leaf PMD entries can also be disabled,
    since this behavior was not tested on x86 varieties other than Intel and
    AMD.
    [1] https://source.android.com/devices/tech/debug/scudo
    [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
    [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:46 -04:00
Chris von Recklinghausen 7ff3a3e8b7 mm: multi-gen LRU: optimize multiple memcgs
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f76c83378851f8e70f032848c4e61203f39480e4
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:06 2022 -0600

    mm: multi-gen LRU: optimize multiple memcgs

    When multiple memcgs are available, it is possible to use generations as a
    frame of reference to make better choices and improve overall performance
    under global memory pressure.  This patch adds a basic optimization to
    select memcgs that can drop single-use unmapped clean pages first.  Doing
    so reduces the chance of going into the aging path or swapping, which can
    be costly.

    A typical example that benefits from this optimization is a server running
    mixed types of workloads, e.g., heavy anon workload in one memcg and heavy
    buffered I/O workload in the other.

    Though this optimization can be applied to both kswapd and direct reclaim,
    it is only added to kswapd to keep the patchset manageable.  Later
    improvements may cover the direct reclaim path.

    While ensuring certain fairness to all eligible memcgs, proportional scans
    of individual memcgs also require proper backoff to avoid overshooting
    their aggregate reclaim target by too much.  Otherwise it can cause high
    direct reclaim latency.  The conditions for backoff are:

    1. At low priorities, for direct reclaim, if aging fairness or direct
       reclaim latency is at risk, i.e., aging one memcg multiple times or
       swapping after the target is met.
    2. At high priorities, for global reclaim, if per-zone free pages are
       above respective watermarks.
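
    A user-space model of these two conditions (every name and threshold
    below is invented; the real logic keys off the reclaim priority, the
    scan targets and the zone watermarks described above):

      #include <stdbool.h>

      struct reclaim_state_toy {
          bool direct;                 /* direct reclaim vs. kswapd */
          int  priority;               /* DEF_PRIORITY (12) down to 0 */
          bool aged_same_memcg_twice;  /* aging fairness at risk */
          bool swapped_past_target;    /* direct reclaim latency at risk */
          bool zones_above_watermarks; /* global reclaim already satisfied */
      };

      bool should_back_off(const struct reclaim_state_toy *s)
      {
          if (s->direct && s->priority <= 2)     /* 1. low priorities */
              return s->aged_same_memcg_twice || s->swapped_past_target;
          if (!s->direct && s->priority >= 10)   /* 2. high priorities */
              return s->zones_above_watermarks;
          return false;
      }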

    Server benchmark results:
      Mixed workloads:
        fio (buffered I/O): +[19, 21]%
                    IOPS         BW
          patch1-8: 1880k        7343MiB/s
          patch1-9: 2252k        8796MiB/s

        memcached (anon): +[119, 123]%
                    Ops/sec      KB/sec
          patch1-8: 862768.65    33514.68
          patch1-9: 1911022.12   74234.54

      Mixed workloads:
        fio (buffered I/O): +[75, 77]%
                    IOPS         BW
          5.19-rc1: 1279k        4996MiB/s
          patch1-9: 2252k        8796MiB/s

        memcached (anon): +[13, 15]%
                    Ops/sec      KB/sec
          5.19-rc1: 1673524.04   65008.87
          patch1-9: 1911022.12   74234.54

      Configurations:
        (changes since patch 6)

        cat mixed.sh
        modprobe brd rd_nr=2 rd_size=56623104

        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0

        mkfs.ext4 /dev/ram1
        mount -t ext4 /dev/ram1 /mnt

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000

        fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=90m --group_reporting &
        pid=$!

        sleep 200

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

        kill -INT $pid
        wait

    Client benchmark results:
      no change (CONFIG_MEMCG=n)

    Link: https://lkml.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:46 -04:00
Chris von Recklinghausen b92cce1ea6 mm: multi-gen LRU: support page table walks
Conflicts:
	fs/exec.c - We already have
		33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"")
		so don't add call to timens_on_fork back in
	include/linux/mmzone.h - We already have
		e6ad640bc404 ("mm: deduplicate cacheline padding code")
		so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_)
	mm/vmscan.c - The backport of
		badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")
		added an #include <linux/debugfs.h>. Keep it.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd74fdaea146029e4fa12c6de89adbe0779348a9
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:05 2022 -0600

    mm: multi-gen LRU: support page table walks

    To further exploit spatial locality, the aging prefers to walk page tables
    to search for young PTEs and promote hot pages.  A kill switch will be
    added in the next patch to disable this behavior.  When disabled, the
    aging relies on the rmap only.

    NB: this behavior has nothing in common with the page table scanning in the
    2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
    to swapcache and unmaps them.

    To avoid confusion, the term "iteration" specifically means the traversal
    of an entire mm_struct list; the term "walk" will be applied to page
    tables and the rmap, as usual.

    An mm_struct list is maintained for each memcg, and an mm_struct follows
    its owner task to the new memcg when this task is migrated.  Given an
    lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot pages
    before it increments max_seq.

    When multiple page table walkers iterate the same list, each of them gets
    a unique mm_struct; therefore they can run concurrently.  Page table
    walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
    pages it left in the previous memcg will not be promoted when its current
    memcg is under reclaim.  Similarly, page table walkers will not promote
    pages from nodes other than the one under reclaim.

    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that
       page table walkers can skip processes that have been sleeping since
       the last iteration.
    2. It uses generational Bloom filters to record populated branches so
       that page table walkers can reduce their search space based on the
       query results, e.g., to skip page tables containing mostly holes or
       misplaced pages (a toy sketch of such a filter follows this list).
    3. It takes advantage of the accessed bit in non-leaf PMD entries when
       CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table. This
       improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
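
    A toy user-space model of the generational Bloom filter mentioned in
    optimization 2 (sizes, hash functions and names are invented; the point
    is only the two-filter scheme: the current filter is populated during a
    walk, the previous one is queried, and they swap when a new generation
    is created):

      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      #define BLOOM_BITS  (1u << 15)

      struct gen_bloom {
          uint64_t cur[BLOOM_BITS / 64];   /* being populated */
          uint64_t old[BLOOM_BITS / 64];   /* being queried */
      };

      /* two cheap hashes over a key such as (mm, PMD address) */
      static uint32_t hash1(uint64_t key) { return (key * 0x9e3779b97f4a7c15ull) >> 49; }
      static uint32_t hash2(uint64_t key) { return (key * 0xc2b2ae3d27d4eb4full) >> 49; }

      void bloom_set(struct gen_bloom *b, uint64_t key)   /* record a populated branch */
      {
          b->cur[hash1(key) / 64] |= 1ull << (hash1(key) % 64);
          b->cur[hash2(key) / 64] |= 1ull << (hash2(key) % 64);
      }

      /* a branch recorded last iteration always tests positive; a negative
         result therefore lets the walker skip the range safely */
      bool bloom_test(const struct gen_bloom *b, uint64_t key)
      {
          return ((b->old[hash1(key) / 64] >> (hash1(key) % 64)) & 1) &&
                 ((b->old[hash2(key) / 64] >> (hash2(key) % 64)) & 1);
      }

      void bloom_flip(struct gen_bloom *b)   /* e.g. when max_seq is incremented */
      {
          memcpy(b->old, b->cur, sizeof(b->old));
          memset(b->cur, 0, sizeof(b->cur));
      }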

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[8, 10]%
                    Ops/sec      KB/sec
          patch1-7: 1147696.57   44640.29
          patch1-8: 1245274.91   48435.66

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

        patch1-8
          49.44%  lzo1x_1_do_compress (real work)
           6.19%  page_vma_mapped_walk (overhead)
           5.97%  _raw_spin_unlock_irq
           3.13%  get_pfn_folio
           2.85%  ptep_clear_flush
           2.42%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.92%  memmove
           1.44%  alloc_zspage
           1.36%  memset

      Configurations:
        no change

    Thanks to the following developers for their efforts [3].
      kernel test robot <lkp@intel.com>

    [1] https://lwn.net/Articles/23732/
    [2] https://llvm.org/docs/ScudoHardenedAllocator.html
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:46 -04:00
Chris von Recklinghausen 23d981c266 mm: multi-gen LRU: exploit locality in rmap
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 018ee47f14893d500131dfca2ff9f3ff8ebd4ed2
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:04 2022 -0600

    mm: multi-gen LRU: exploit locality in rmap

    Searching the rmap for PTEs mapping each page on an LRU list (to test and
    clear the accessed bit) can be expensive because pages from different VMAs
    (PA space) are not cache friendly to the rmap (VA space).  For workloads
    mostly using mapped pages, searching the rmap can incur the highest CPU
    cost in the reclaim path.

    This patch exploits spatial locality to reduce the trips into the rmap.
    When shrink_page_list() walks the rmap and finds a young PTE, a new
    function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
    PTEs.  On finding another young PTE, it clears the accessed bit and
    updates the gen counter of the page mapped by this PTE to
    (max_seq%MAX_NR_GENS)+1.
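
    A toy user-space rendering of that look-around step (the types and the
    window placement are invented; the BITS_PER_LONG-1 neighbour limit, the
    accessed-bit clearing and the (max_seq % MAX_NR_GENS)+1 update are taken
    from the description above):

      #include <stdbool.h>

      #define BITS_PER_LONG  64u
      #define MAX_NR_GENS    4u

      struct toy_pte {
          bool accessed;        /* hardware accessed bit */
          unsigned int gen;     /* gen counter of the mapped page */
      };

      /* idx is the PTE the rmap walk just found young; returns how many
         neighbours were also young and got promoted without extra rmap trips */
      unsigned int look_around(struct toy_pte *ptes, unsigned int nr,
                               unsigned int idx, unsigned long max_seq)
      {
          unsigned int start = idx > BITS_PER_LONG / 2 ? idx - BITS_PER_LONG / 2 : 0;
          unsigned int end = start + BITS_PER_LONG < nr ? start + BITS_PER_LONG : nr;
          unsigned int young = 0;

          for (unsigned int i = start; i < end; i++) {
              if (i == idx || !ptes[i].accessed)
                  continue;
              ptes[i].accessed = false;                    /* test and clear */
              ptes[i].gen = max_seq % MAX_NR_GENS + 1;     /* youngest generation */
              young++;
          }
          return young;
      }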

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[3, 5]%
                    Ops/sec      KB/sec
          patch1-6: 1106168.46   43025.04
          patch1-7: 1147696.57   44640.29

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-6
          39.03%  lzo1x_1_do_compress (real work)
          18.47%  page_vma_mapped_walk (overhead)
           6.74%  _raw_spin_unlock_irq
           3.97%  do_raw_spin_lock
           2.49%  ptep_clear_flush
           2.48%  anon_vma_interval_tree_iter_first
           1.92%  folio_referenced_one
           1.88%  __zram_bvec_write
           1.48%  memmove
           1.31%  vma_interval_tree_iter_next

        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

      Configurations:
        no change

    Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Barry Song <baohua@kernel.org>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:45 -04:00
Chris von Recklinghausen e11a2a9f34 mm: multi-gen LRU: minimal implementation
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit ac35a490237446b71e3b4b782b1596967edd0aa8
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:03 2022 -0600

    mm: multi-gen LRU: minimal implementation

    To avoid confusion, the terms "promotion" and "demotion" will be applied
    to the multi-gen LRU, as a new convention; the terms "activation" and
    "deactivation" will be applied to the active/inactive LRU, as usual.

    The aging produces young generations.  Given an lruvec, it increments
    max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
    hot pages to the youngest generation when it finds them accessed through
    page tables; the demotion of cold pages happens consequently when it
    increments max_seq.  Promotion in the aging path does not involve any LRU
    list operations, only the updates of the gen counter and
    lrugen->nr_pages[]; demotion, unless as the result of the increment of
    max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
    aging has the complexity O(nr_hot_pages), since it is only interested in
    hot pages.

    The eviction consumes old generations.  Given an lruvec, it increments
    min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
    A feedback loop modeled after the PID controller monitors refaults over
    anon and file types and decides which type to evict when both types are
    available from the same generation.
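
    A minimal sketch of the comparison such a feedback loop performs (names
    invented; the in-kernel controller keeps more state, this only shows how
    refaults-per-evicted can pick the cheaper type to evict):

      #include <stdbool.h>

      struct evict_ctrl {
          unsigned long refaulted;   /* refaults charged to this type */
          unsigned long evicted;     /* pages of this type evicted */
      };

      /* true if file pages currently look no more expensive to evict than
         anon pages; cross-multiplied to avoid dividing by zero */
      bool evict_file_first(const struct evict_ctrl *anon,
                            const struct evict_ctrl *file)
      {
          return file->refaulted * (anon->evicted + 1) <=
                 anon->refaulted * (file->evicted + 1);
      }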

    The protection of pages accessed multiple times through file descriptors
    takes place in the eviction path.  Each generation is divided into
    multiple tiers.  A page accessed N times through file descriptors is in
    tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
    bits in folio->flags.  The aforementioned feedback loop also monitors
    refaults over all tiers and decides when to protect pages in which tiers
    (N>1), using the first tier (N=0,1) as a baseline.  The first tier
    contains single-use unmapped clean pages, which are most likely the best
    choices.  In contrast to promotion in the aging path, the protection of a
    page in the eviction path is achieved by moving this page to the next
    generation, i.e., min_seq+1, if the feedback loop decides so.  This
    approach has the following advantages:

    1. It removes the cost of activation in the buffered access path by
       inferring whether pages accessed multiple times through file
       descriptors are statistically hot and thus worth protecting in the
       eviction path.
    2. It takes pages accessed through page tables into account and avoids
       overprotecting pages accessed multiple times through file
       descriptors. (Pages accessed through page tables are in the first
       tier, since N=0.)
    3. More tiers provide better protection for pages accessed more than
       twice through file descriptors, when under heavy buffered I/O
       workloads.
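
    To make the tier arithmetic above concrete, a small stand-alone program
    (order_base_2() reimplemented here for illustration) mapping an access
    count N to its tier:

      #include <stdio.h>

      /* smallest order with (1 << order) >= n, i.e. ceil(log2(n)) for n >= 1 */
      unsigned int order_base_2(unsigned int n)
      {
          unsigned int order = 0;

          while ((1u << order) < n)
              order++;
          return order;
      }

      int main(void)
      {
          /* N accesses through file descriptors put a page in tier order_base_2(N) */
          for (unsigned int refs = 1; refs <= 8; refs++)
              printf("refs=%u -> tier %u\n", refs, order_base_2(refs));
          return 0;   /* prints tiers 0, 1, 2, 2, 3, 3, 3, 3 */
      }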

    Server benchmark results:
      Single workload:
        fio (buffered I/O): +[30, 32]%
                    IOPS         BW
          5.19-rc1: 2673k        10.2GiB/s
          patch1-6: 3491k        13.3GiB/s

      Single workload:
        memcached (anon): -[4, 6]%
                    Ops/sec      KB/sec
          5.19-rc1: 1161501.04   45177.25
          patch1-6: 1106168.46   43025.04

      Configurations:
        CPU: two Xeon 6154
        Mem: total 256G

        Node 1 was only used as a ram disk to reduce the variance in the
        results.

        patch drivers/block/brd.c <<EOF
        99,100c99,100
        <   gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
        <   page = alloc_page(gfp_flags);
        ---
        >   gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
        >   page = alloc_pages_node(1, gfp_flags, 0);
        EOF

        cat >>/etc/systemd/system.conf <<EOF
        CPUAffinity=numa
        NUMAPolicy=bind
        NUMAMask=0
        EOF

        cat >>/etc/memcached.conf <<EOF
        -m 184320
        -s /var/run/memcached/memcached.sock
        -a 0766
        -t 36
        -B binary
        EOF

        cat fio.sh
        modprobe brd rd_nr=1 rd_size=113246208
        swapoff -a
        mkfs.ext4 /dev/ram0
        mount -t ext4 /dev/ram0 /mnt

        mkdir /sys/fs/cgroup/user.slice/test
        echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
        echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
        fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=5m --group_reporting

        cat memcached.sh
        modprobe brd rd_nr=1 rd_size=113246208
        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    Client benchmark results:
      kswapd profiles:
        5.19-rc1
          40.33%  page_vma_mapped_walk (overhead)
          21.80%  lzo1x_1_do_compress (real work)
           7.53%  do_raw_spin_lock
           3.95%  _raw_spin_unlock_irq
           2.52%  vma_interval_tree_iter_next
           2.37%  folio_referenced_one
           2.28%  vma_interval_tree_subtree_search
           1.97%  anon_vma_interval_tree_iter_first
           1.60%  ptep_clear_flush
           1.06%  __zram_bvec_write

        patch1-6
          39.03%  lzo1x_1_do_compress (real work)
          18.47%  page_vma_mapped_walk (overhead)
           6.74%  _raw_spin_unlock_irq
           3.97%  do_raw_spin_lock
           2.49%  ptep_clear_flush
           2.48%  anon_vma_interval_tree_iter_first
           1.92%  folio_referenced_one
           1.88%  __zram_bvec_write
           1.48%  memmove
           1.31%  vma_interval_tree_iter_next

      Configurations:
        CPU: single Snapdragon 7c
        Mem: total 4G

        ChromeOS MemoryPressure [1]

    [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:45 -04:00
Chris von Recklinghausen 2af7596eac mm: multi-gen LRU: groundwork
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit ec1c86b25f4bdd9dce6436c0539d2a6ae676e1c4
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:02 2022 -0600

    mm: multi-gen LRU: groundwork

    Evictable pages are divided into multiple generations for each lruvec.
    The youngest generation number is stored in lrugen->max_seq for both
    anon and file types as they are aged on an equal footing. The oldest
    generation numbers are stored in lrugen->min_seq[] separately for anon
    and file types as clean file pages can be evicted regardless of swap
    constraints. These three variables are monotonically increasing.

    Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
    in order to fit into the gen counter in folio->flags. Each truncated
    generation number is an index to lrugen->lists[]. The sliding window
    technique is used to track at least MIN_NR_GENS and at most
    MAX_NR_GENS generations. The gen counter stores a value within [1,
    MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
    stores 0.
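
    A user-space model of that packing (the bit position and helper names are
    invented; with MAX_NR_GENS = 4 the counter needs order_base_2(4+1) = 3
    bits, and the value 0 means "not on any lrugen list"):

      #include <assert.h>
      #include <stdio.h>

      #define MAX_NR_GENS  4ul
      #define GEN_BITS     3u                  /* order_base_2(MAX_NR_GENS + 1) */
      #define GEN_MASK     ((1ul << GEN_BITS) - 1)
      #define GEN_SHIFT    8u                  /* invented position inside flags */

      unsigned long set_gen(unsigned long flags, unsigned long seq)
      {
          unsigned long gen = seq % MAX_NR_GENS + 1;     /* in [1, MAX_NR_GENS] */

          return (flags & ~(GEN_MASK << GEN_SHIFT)) | (gen << GEN_SHIFT);
      }

      long get_gen(unsigned long flags)
      {
          /* -1 means the page is not on any lrugen->lists[] */
          return (long)((flags >> GEN_SHIFT) & GEN_MASK) - 1;
      }

      int main(void)
      {
          unsigned long flags = 0;

          assert(get_gen(flags) == -1);        /* counter is 0: off-list */
          flags = set_gen(flags, 42);          /* pretend max_seq is 42 */
          printf("list index: %ld\n", get_gen(flags));   /* 42 % 4 == 2 */
          return 0;
      }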

    There are two conceptually independent procedures: "the aging", which
    produces young generations, and "the eviction", which consumes old
    generations.  They form a closed-loop system, i.e., "the page reclaim".
    Both procedures can be invoked from userspace for the purposes of working
    set estimation and proactive reclaim.  These techniques are commonly used
    to optimize job scheduling (bin packing) in data centers [1][2].

    To avoid confusion, the terms "hot" and "cold" will be applied to the
    multi-gen LRU, as a new convention; the terms "active" and "inactive" will
    be applied to the active/inactive LRU, as usual.

    The protection of hot pages and the selection of cold pages are based
    on page access channels and patterns. There are two access channels:
    one through page tables and the other through file descriptors. The
    protection of the former channel is by design stronger because:
    1. The uncertainty in determining the access patterns of the former
       channel is higher due to the approximation of the accessed bit.
    2. The cost of evicting the former channel is higher due to the TLB
       flushes required and the likelihood of encountering the dirty bit.
    3. The penalty of underprotecting the former channel is higher because
       applications usually do not prepare themselves for major page
       faults like they do for blocked I/O. E.g., GUI applications
       commonly use dedicated I/O threads to avoid blocking rendering
       threads.

    There are also two access patterns: one with temporal locality and the
    other without.  For the reasons listed above, the former channel is
    assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
    present; the latter channel is assumed to follow the latter pattern unless
    outlying refaults have been observed [3][4].

    The next patch will address the "outlying refaults".  Three macros, i.e.,
    LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
    this patch to make the entire patchset less diffy.

    A page is added to the youngest generation on faulting.  The aging needs
    to check the accessed bit at least twice before handing this page over to
    the eviction.  The first check takes care of the accessed bit set on the
    initial fault; the second check makes sure this page has not been used
    since then.  This protocol, AKA second chance, requires a minimum of two
    generations, hence MIN_NR_GENS.

    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731
    [3] https://lwn.net/Articles/495543/
    [4] https://lwn.net/Articles/815342/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:45 -04:00
Chris von Recklinghausen 68caecbf96 mm/vmscan.c: refactor shrink_node()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit f1e1a7be4718609042e3285bc2110d74825ad9d1
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:00 2022 -0600

    mm/vmscan.c: refactor shrink_node()

    This patch refactors shrink_node() to improve readability for the upcoming
    changes to mm/vmscan.c.

    Link: https://lkml.kernel.org/r/20220918080010.2920238-4-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:44 -04:00
Chris von Recklinghausen 39091324ad mm: fix null-ptr-deref in kswapd_is_running()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b4a0215e11dcfe23a48c65c6d6c82c0c2c551a48
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Sat Aug 27 19:19:59 2022 +0800

    mm: fix null-ptr-deref in kswapd_is_running()

    kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with
    kswapd_is_running() in kcompactd(),

    kswapd_run/stop()                       kcompactd()
                                              kswapd_is_running()
      pgdat->kswapd // error or normal ptr
                                              verify pgdat->kswapd
                                                // load non-NULL
                                                pgdat->kswapd
      pgdat->kswapd = NULL
                                              task_is_running(pgdat->kswapd)
                                                // NULL pointer dereference

    KASAN reports the null-ptr-deref shown below,

      vmscan: Failed to start kswapd on node 0
      ...
      BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
      Read of size 8 at addr 0000000000000024 by task kcompactd0/37

      CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G           OE     5.10.60 #1
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      Call trace:
       dump_backtrace+0x0/0x394
       show_stack+0x34/0x4c
       dump_stack+0x158/0x1e4
       __kasan_report+0x138/0x140
       kasan_report+0x44/0xdc
       __asan_load8+0x94/0xd0
       kcompactd+0x440/0x504
       kthread+0x1a4/0x1f0
       ret_from_fork+0x10/0x18

    At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected
    by mem_hotplug_begin/done(), but kcompactd() itself is not. There is no need
    to involve the memory hotplug lock in kcompactd(), so let's add a new mutex
    to protect pgdat->kswapd accesses.

    Also, because the kcompactd task will check the state of kswapd task, it's
    better to call kcompactd_stop() before kswapd_stop() to reduce lock
    conflicts.
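
    A user-space model of the fix (pthread instead of the kernel mutex, all
    names invented): the pointer is only tested and dereferenced under the
    same lock the writer takes to clear it, so the load/NULL race above
    cannot happen:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct toy_task { bool running; };

      static pthread_mutex_t kswapd_lock = PTHREAD_MUTEX_INITIALIZER;
      static struct toy_task *kswapd;           /* models pgdat->kswapd */

      bool toy_kswapd_is_running(void)
      {
          bool ret;

          pthread_mutex_lock(&kswapd_lock);
          ret = kswapd && kswapd->running;      /* check and dereference together */
          pthread_mutex_unlock(&kswapd_lock);
          return ret;
      }

      void toy_kswapd_stop(void)
      {
          pthread_mutex_lock(&kswapd_lock);
          kswapd = NULL;                        /* cannot race with the check above */
          pthread_mutex_unlock(&kswapd_lock);
      }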

    [akpm@linux-foundation.org: add comments]
    Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:38 -04:00
Chris von Recklinghausen 44e96704c4 mm/vmscan: define macros for refaults in struct lruvec
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e9c2dbc8bf71a5039604a1dc45b10f24a2098f3b
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Mon Aug 8 00:56:45 2022 +0000

    mm/vmscan: define macros for refaults in struct lruvec

    The magic numbers 0 and 1 are used in several places in vmscan.c.
    Define macros for them to improve code readability.

    Link: https://lkml.kernel.org/r/20220808005644.1721066-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:25 -04:00
Dave Wysochanski 924daddc03 mm: merge folio_has_private()/filemap_release_folio() call pairs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2209756

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache.  The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix.  Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
add a check for this.

This patch (of 2):

Make filemap_release_folio() check folio_has_private().  Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.
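
A toy sketch of the resulting call pattern (stand-in types and helpers, not
the kernel API): the "is there anything to release?" test moves inside the
release helper, so callers no longer repeat it:

    #include <stdbool.h>
    #include <stddef.h>

    struct toy_folio { void *private_data; };

    static bool toy_has_private(const struct toy_folio *f)
    {
        return f->private_data != NULL;
    }

    /* the release helper now performs the private-data test itself ... */
    bool toy_release_folio(struct toy_folio *f)
    {
        if (!toy_has_private(f))
            return true;            /* nothing attached: trivially released */
        f->private_data = NULL;     /* stand-in for ->release_folio() */
        return true;
    }

    /* ... so a caller that used to write
     *     if (toy_has_private(f) && !toy_release_folio(f))
     *         keep(f);
     * can simply write
     *     if (!toy_release_folio(f))
     *         keep(f);
     */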

There are a couple of sites in mm/vmscan.c where this can't so easily be
done.  In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory.  This will acquire
additional parts to the condition in a future patch.

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0201ebf274a306a6ebb95e5dc2d6a0a27c737cac)
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
2023-09-13 18:19:41 -04:00
Nico Pache 6fc7da1ce3 mm: vmscan: make rotations a secondary factor in balancing anon vs file
commit 0538a82c39e94d49fa6985c6a0101ca819be11ee
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Oct 13 15:31:13 2022 -0400

    mm: vmscan: make rotations a secondary factor in balancing anon vs file

    We noticed a 2% webserver throughput regression after upgrading from 5.6.
    This could be tracked down to a shift in the anon/file reclaim balance
    (confirmed with swappiness) that resulted in worse reclaim efficiency and
    thus more kswapd activity for the same outcome.

    The change that exposed the problem is aae466b005 ("mm/swap: implement
    workingset detection for anonymous LRU").  By qualifying swapins based on
    their refault distance, it lowered the cost of anon reclaim in this
    workload, in turn causing (much) more anon scanning than before.  Scanning
    the anon list is more expensive due to the higher ratio of mmapped pages
    that may rotate during reclaim, and so the result was an increase in %sys
    time.

    Right now, rotations aren't considered a cost when balancing scan pressure
    between LRUs.  We can end up with very few file refaults putting all the
    scan pressure on hot anon pages that are rotated en masse, don't get
    reclaimed, and never push back on the file LRU again.  We still only
    reclaim file cache in that case, but we burn a lot CPU rotating anon
    pages.  It's "fair" from an LRU age POV, but doesn't reflect the real cost
    it imposes on the system.

    Consider rotations as a secondary factor in balancing the LRUs.  This
    doesn't attempt to make a precise comparison between IO cost and CPU cost;
    it just says: if reloads are about comparable between the lists, or
    rotations are overwhelmingly different, adjust for CPU work.
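
    A toy model of that heuristic (thresholds and names invented) which treats
    rotations as a cost signal only when reloads are roughly comparable or
    rotations are wildly lopsided:

      #include <stdbool.h>

      struct lru_cost {
          unsigned long refaults;    /* IO cost signal (reloads) */
          unsigned long rotations;   /* CPU cost signal */
      };

      bool adjust_for_rotations(const struct lru_cost *anon,
                                const struct lru_cost *file)
      {
          unsigned long hi_ref = anon->refaults > file->refaults ? anon->refaults : file->refaults;
          unsigned long lo_ref = anon->refaults < file->refaults ? anon->refaults : file->refaults;
          unsigned long hi_rot = anon->rotations > file->rotations ? anon->rotations : file->rotations;
          unsigned long lo_rot = anon->rotations < file->rotations ? anon->rotations : file->rotations;

          bool reloads_comparable = hi_ref < 2 * lo_ref + 1;   /* within ~2x of each other */
          bool rotations_lopsided = hi_rot > 8 * lo_rot + 1;   /* overwhelmingly different */

          return reloads_comparable || rotations_lopsided;
      }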

    This fixed the regression on our webservers.  It has since been deployed
    to the entire Meta fleet and hasn't caused any problems.

    Link: https://lkml.kernel.org/r/20221013193113.726425-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:02 -06:00
Nico Pache 011900bb25 mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
commit 81a70c21d9170de67a45843bdd627f4cce9c4215
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Nov 18 12:36:03 2022 +0530

    mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1

    balance_dirty_pages doesn't do the required dirty throttling on cgroupv1.
    See commit 9badce000e ("cgroup, writeback: don't enable cgroup writeback
    on traditional hierarchies").  Instead, the kernel depends on writeback
    throttling in shrink_folio_list to achieve the same goal.  With large
    memory systems, the flusher may not be able to write back quickly enough
    such that we will start finding pages in the shrink_folio_list already in
    writeback.  Hence for cgroupv1 let's do a reclaim throttle after waking up
    the flusher.

    The below test, which used to fail on a 256GB system, now completes until
    the file system is full with this change.

    root@lp2:/sys/fs/cgroup/memory# mkdir test
    root@lp2:/sys/fs/cgroup/memory# cd test/
    root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
    root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
    root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
    Killed

    Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: zefan li <lizefan.x@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:02 -06:00
Nico Pache 63ce9c4eb8 mm: vmscan: fix extreme overreclaim and swap floods
commit f53af4285d775cd9a9a146fc438bd0a1bee1838a
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue Aug 2 12:28:11 2022 -0400

    mm: vmscan: fix extreme overreclaim and swap floods

    During proactive reclaim, we sometimes observe severe overreclaim, with
    several thousand times more pages reclaimed than requested.

    This trace was obtained from shrink_lruvec() during such an instance:

        prio:0 anon_cost:1141521 file_cost:7767
        nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
        nr=[7161123 345 578 1111]

    While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
    by swapping.  These requests take over a minute, during which the write()
    to memory.reclaim is unkillably stuck inside the kernel.

    Digging into the source, this is caused by the proportional reclaim
    bailout logic.  This code tries to resolve a fundamental conflict: to
    reclaim roughly what was requested, while also aging all LRUs fairly and
    in accordance to their size, swappiness, refault rates etc.  The way it
    attempts fairness is that once the reclaim goal has been reached, it stops
    scanning the LRUs with the smaller remaining scan targets, and adjusts the
    remainder of the bigger LRUs according to how much of the smaller LRUs was
    scanned.  It then finishes scanning that remainder regardless of the
    reclaim goal.

    This works fine if priority levels are low and the LRU lists are
    comparable in size.  However, in this instance, the cgroup that is
    targeted by proactive reclaim has almost no files left - they've already
    been squeezed out by proactive reclaim earlier - and the remaining anon
    pages are hot.  Anon rotations cause the priority level to drop to 0,
    which results in reclaim targeting all of anon (a lot) and all of file
    (almost nothing).  By the time reclaim decides to bail, it has scanned
    most or all of the file target, and therefore must also scan most or all of
    the enormous anon target.  This target is thousands of times larger than
    the reclaim goal, thus causing the overreclaim.

    The bailout code hasn't changed in years, so why is this failing now?  The
    most likely explanations are two other recent changes in anon reclaim:

    1. Before the series starting with commit 5df741963d ("mm: fix LRU
       balancing effect of new transparent huge pages"), the VM was
       overall relatively reluctant to swap at all, even if swap was
       configured. This means the LRU balancing code didn't come into play
       as often as it does now, and mostly in high pressure situations
       where pronounced swap activity wouldn't be as surprising.

    2. For historic reasons, shrink_lruvec() loops on the scan targets of
       all LRU lists except the active anon one, meaning it would bail if
       the only remaining pages to scan were active anon - even if there
       were a lot of them.

       Before the series starting with commit ccc5dc6734 ("mm/vmscan:
       make active/inactive ratio as 1:1 for anon lru"), most anon pages
       would live on the active LRU; the inactive one would contain only a
       handful of preselected reclaim candidates. After the series, anon
       gets aged similarly to file, and the inactive list is the default
       for new anon pages as well, making it often the much bigger list.

       As a result, the VM is now more likely to actually finish large
       anon targets than before.

    Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
    larger LRU lists is made before bailing out on a met reclaim goal.

    This fixes the extreme overreclaim problem.

    Fairness is more subtle and harder to evaluate.  No obvious misbehavior
    was observed on the test workload, in any case.  Conceptually, fairness
    should primarily be a cumulative effect from regular, lower priority
    scans.  Once the VM is in trouble and needs to escalate scan targets to
    make forward progress, fairness needs to take a backseat.  This is also
    acknowledged by the myriad exceptions in get_scan_count().  This patch
    makes fairness decrease gradually, as it keeps fairness work static over
    increasing priority levels with growing scan targets.  This should make
    more sense - although we may have to re-visit the exact values.

    Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:02 -06:00
Rafael Aquini 8cfc52e479 mm/demotion: demote pages according to allocation fallback order
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186559

This patch is a backport of the following upstream commit:
commit 32008027289239100d8d2876f50b15d92bde1855
Author: Jagdish Gediya <jvgediya.oss@gmail.com>
Date:   Thu Aug 18 18:40:40 2022 +0530

    mm/demotion: demote pages according to allocation fallback order

    Currently, a higher tier node can only be demoted to selected nodes on the
    next lower tier as defined by the demotion path.  This strict demotion
    order does not work in all use cases (e.g.  some use cases may want to
    allow cross-socket demotion to another node in the same demotion tier as a
    fallback when the preferred demotion node is out of space).  This demotion
    order is also inconsistent with the page allocation fallback order when
    all the nodes in a higher tier are out of space: The page allocation can
    fall back to any node from any lower tier, whereas the demotion order
    doesn't allow that currently.

    This patch adds support to get all the allowed demotion targets for a
    memory tier.  demote_page_list() function is now modified to utilize this
    allowed node mask as the fallback allocation mask.

    Link: https://lkml.kernel.org/r/20220818131042.113280-9-aneesh.kumar@linux.ibm.com
    Signed-off-by: Jagdish Gediya <jvgediya.oss@gmail.com>
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Wei Xu <weixugc@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hesham Almatary <hesham.almatary@huawei.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Tim Chen <tim.c.chen@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-04-26 08:55:46 -04:00
Rafael Aquini 892be419d9 mm/demotion: move memory demotion related code
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186559

This patch is a backport of the following upstream commit:
commit 9195244022788935eac0df16132394ffa5613542
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Thu Aug 18 18:40:34 2022 +0530

    mm/demotion: move memory demotion related code

    This moves memory demotion related code to mm/memory-tiers.c.  No
    functional change in this patch.

    Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Wei Xu <weixugc@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hesham Almatary <hesham.almatary@huawei.com>
    Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Tim Chen <tim.c.chen@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-04-26 08:55:41 -04:00
Chris von Recklinghausen a7fb36ec82 mm: shrinkers: fix deadlock in shrinker debugfs
Conflicts: mm/vmscan.c - We don't have
	d6c3af7d8a2b ("mm: multi-gen LRU: debugfs interface")
	so add an #include of linux/debugfs.h

Bugzilla: https://bugzilla.redhat.com/2160210

commit badc28d4924bfed73efc93f716a0c3aa3afbdf6f
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Thu Feb 2 18:56:12 2023 +0800

    mm: shrinkers: fix deadlock in shrinker debugfs

    debugfs_remove_recursive() is invoked by unregister_shrinker(), which
    holds the write lock of shrinker_rwsem.  It waits for the handler of the
    debugfs file to complete.  The handler also needs to take the read lock of
    shrinker_rwsem to do its work.  So it may cause the following deadlock:

            CPU0                            CPU1

    debugfs_file_get()
    shrinker_debugfs_count_show()/shrinker_debugfs_scan_write()

                                    unregister_shrinker()
                                    --> down_write(&shrinker_rwsem);
                                        debugfs_remove_recursive()
                                            // wait for (A)
                                        --> wait_for_completion();

        // wait for (B)
    --> down_read_killable(&shrinker_rwsem)
    debugfs_file_put() -- (A)

                                        up_write() -- (B)

    The down_read_killable() can be interrupted by a fatal signal, so the
    deadlock above can be recovered from.  But that still requires an explicit
    kill, and until then all subsequent shrinker-related operations are
    blocked, so it's better to fix it.

    [akpm@linux-foundation.org: fix CONFIG_SHRINKER_DEBUG=n stub]
    Link: https://lkml.kernel.org/r/20230202105612.64641-1-zhengqi.arch@bytedance.com
    Fixes: 5035ebc644ae ("mm: shrinkers: introduce debugfs interface for memory shrinkers")
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Kent Overstreet <kent.overstreet@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:35 -04:00
Chris von Recklinghausen e60aca2995 vmscan: check folio_test_private(), not folio_get_private()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 36a3b14b5febdaf0e7f70c4ca6f62c8ea75fabfe
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:26:39 2022 +0100

    vmscan: check folio_test_private(), not folio_get_private()

    These two predicates are the same for file pages, but are not the same for
    anonymous pages.

    Link: https://lkml.kernel.org/r/20220902192639.1737108-3-willy@infradead.org
    Fixes: 07f67a8dedc0 ("mm/vmscan: convert shrink_active_list() to use a folio")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reported-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen 0b0d5d11c5 mm: vmpressure: don't count proactive reclaim in vmpressure
Bugzilla: https://bugzilla.redhat.com/2160210

commit 73b73bac90d97400e29e585c678c4d0ebfd2680d
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Thu Jul 14 06:49:18 2022 +0000

    mm: vmpressure: don't count proactive reclaim in vmpressure

    memory.reclaim is a cgroup v2 interface that allows users to proactively
    reclaim memory from a memcg, without real memory pressure.  Reclaim
    operations invoke vmpressure, which is used: (a) To notify userspace of
    reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being
    under memory pressure for networking (see
    mem_cgroup_under_socket_pressure()).

    For (a), vmpressure notifications in v1 are not affected by this change
    since memory.reclaim is a v2 feature.

    For (b), the effects of the vmpressure signal (according to Shakeel [1])
    are as follows:
    1. Reducing send and receive buffers of the current socket.
    2. May drop packets on the rx path.
    3. May throttle current thread on the tx path.

    Since proactive reclaim is invoked directly by userspace, not by memory
    pressure, it makes sense not to throttle networking.  Hence, this change
    makes sure that proactive reclaim caused by memory.reclaim does not
    trigger vmpressure.

    [1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/
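
    A simplified sketch of the shape of the change (flag and field names
    follow the upstream patch as recalled here; this is not the exact
    hunk): memory.reclaim tags its reclaim request as proactive, and the
    vmpressure() calls in vmscan are skipped for such scans:

      /* memory.reclaim write handler in memcontrol.c: */
      reclaimed = try_to_free_mem_cgroup_pages(memcg,
                              nr_to_reclaim - nr_reclaimed, GFP_KERNEL,
                              MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE);

      /* vmscan.c, after scanning a memcg: */
      if (!sc->proactive)
              vmpressure(sc->gfp_mask, memcg, false,
                         sc->nr_scanned - scanned,
                         sc->nr_reclaimed - reclaimed);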

    [yosryahmed@google.com: update documentation]
      Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com
    Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
Chris von Recklinghausen f78820ee59 mm: shrinkers: fix double kfree on shrinker name
Bugzilla: https://bugzilla.redhat.com/2160210

commit 14773bfa70e67f4d4ebd60e60cb6e25e8c84d4c0
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Wed Jul 20 23:47:55 2022 +0900

    mm: shrinkers: fix double kfree on shrinker name

    syzbot is reporting a double kfree() at free_prealloced_shrinker() [1],
    because destroy_unused_super() calls free_prealloced_shrinker() even if
    prealloc_shrinker() returned an error.  Explicitly clear the shrinker
    name whenever prealloc_shrinker() has called kfree() on it.
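
    The shape of the fix, as a sketch (applied in free_prealloced_shrinker()
    and on the error paths that free the name):

      kfree_const(shrinker->name);
      shrinker->name = NULL;   /* a later free of the name is now a
                                * harmless no-op */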

    [roman.gushchin@linux.dev: zero shrinker->name in all cases where shrinker->name is freed]
      Link: https://lkml.kernel.org/r/YtgteTnQTgyuKUSY@castle
    Link: https://syzkaller.appspot.com/bug?extid=8b481578352d4637f510 [1]
    Link: https://lkml.kernel.org/r/ffa62ece-6a42-2644-16cf-0d33ef32c676@I-love.SAKURA.ne.jp
    Fixes: e33c267ab70de424 ("mm: shrinkers: provide shrinkers with names")
    Reported-by: syzbot <syzbot+8b481578352d4637f510@syzkaller.appspotmail.com>
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:28 -04:00
Chris von Recklinghausen 735b1144c9 mm, docs: fix comments that mention mem_hotplug_end()
Bugzilla: https://bugzilla.redhat.com/2160210

commit e8da368a1e42a8056d1a6b419e1b91b6cf11d77e
Author: Yun-Ze Li <p76091292@gs.ncku.edu.tw>
Date:   Mon Jun 20 07:15:16 2022 +0000

    mm, docs: fix comments that mention mem_hotplug_end()

    Comments that mention mem_hotplug_end() are confusing as there is no
    function called mem_hotplug_end().  Fix them by replacing all the
    occurrences of mem_hotplug_end() in the comments with mem_hotplug_done().

    [akpm@linux-foundation.org: grammatical fixes]
    Link: https://lkml.kernel.org/r/20220620071516.1286101-1-p76091292@gs.ncku.edu.tw
    Signed-off-by: Yun-Ze Li <p76091292@gs.ncku.edu.tw>
    Cc: Souptick Joarder <jrdr.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen 626f944477 mm/swap: convert __delete_from_swap_cache() to a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit ceff9d3354e95ca17e12ad869acea5407cc467f9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:20 2022 +0100

    mm/swap: convert __delete_from_swap_cache() to a folio

    All callers now have a folio, so convert the entire function to operate
    on folios.

    Link: https://lkml.kernel.org/r/20220617175020.717127-23-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen 13005f9b18 mm: convert page_swap_flags to folio_swap_flags
Bugzilla: https://bugzilla.redhat.com/2160210

commit b98c359f1d921deae04bb5dbbbbbb9d8705b7c4c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:18 2022 +0100

    mm: convert page_swap_flags to folio_swap_flags

    The only caller already has a folio, so push the folio->page conversion
    down a level.

    Link: https://lkml.kernel.org/r/20220617175020.717127-21-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen eed5a2e492 mm: convert destroy_compound_page() to destroy_large_folio()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5375336c8c42a343c3b440b6f1e21c65e7b174b9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:17 2022 +0100

    mm: convert destroy_compound_page() to destroy_large_folio()

    All callers now have a folio, so push the folio->page conversion
    down to this function.

    [akpm@linux-foundation.org: uninline destroy_large_folio() to fix build issue]
    Link: https://lkml.kernel.org/r/20220617175020.717127-20-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen 5f78168909 mm/vmscan: convert reclaim_pages() to use a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit a83f0551f49682c81444d682053d49f9dfcbe5fa
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 16:42:48 2022 +0100

    mm/vmscan: convert reclaim_pages() to use a folio

    Remove a few hidden calls to compound_head, saving 76 bytes of text.

    Link: https://lkml.kernel.org/r/20220617154248.700416-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
Chris von Recklinghausen b80fd4fb22 mm/vmscan: convert shrink_active_list() to use a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit 07f67a8dedc0788f3f91d945bc6e987cf9cccd4a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 16:42:47 2022 +0100

    mm/vmscan: convert shrink_active_list() to use a folio

    Remove a few hidden calls to compound_head, saving 411 bytes of text.

    Link: https://lkml.kernel.org/r/20220617154248.700416-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
Chris von Recklinghausen a1783a48b2 mm/vmscan: convert move_pages_to_lru() to use a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit ff00a170d950309f9daef836caa3d54671b883b8
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 16:42:46 2022 +0100

    mm/vmscan: convert move_pages_to_lru() to use a folio

    Remove a few hidden calls to compound_head, saving 387 bytes of text on
    my test configuration.

    Link: https://lkml.kernel.org/r/20220617154248.700416-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
Chris von Recklinghausen 7c42c35c60 mm/vmscan: convert isolate_lru_pages() to use a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit 166e3d32276f4c9ffd290f92b9df55b255f5fed7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 16:42:45 2022 +0100

    mm/vmscan: convert isolate_lru_pages() to use a folio

    Remove a few hidden calls to compound_head, saving 279 bytes of text.

    Link: https://lkml.kernel.org/r/20220617154248.700416-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
Chris von Recklinghausen d816e11222 mm/vmscan: convert reclaim_clean_pages_from_list() to folios
Bugzilla: https://bugzilla.redhat.com/2160210

commit b8cecb9376b9d3031cf62b476a0db087b6b01072
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 16:42:44 2022 +0100

    mm/vmscan: convert reclaim_clean_pages_from_list() to folios

    Patch series "nvert much of vmscan to folios"

    vmscan always operates on folios since it puts the pages on the LRU list.
    Switching all of these functions from pages to folios saves 1483 bytes of
    text by removing all the baggage around calling compound_head() and
    similar functions.

    This patch (of 5):

    This is a straightforward conversion which removes several hidden calls
    to compound_head, saving 330 bytes of kernel text.
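
    A representative sketch of what these conversions look like (for
    illustration only, not the exact upstream hunk; it assumes folio_list
    is the list being filtered): the list is walked as folios, so the flag
    tests no longer go through compound_head() on every call:

      struct folio *folio, *next;
      LIST_HEAD(clean_folios);

      list_for_each_entry_safe(folio, next, folio_list, lru) {
              if (folio_is_file_lru(folio) && !folio_test_dirty(folio) &&
                  !folio_test_unevictable(folio))
                      list_move(&folio->lru, &clean_folios);
      }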

    Link: https://lkml.kernel.org/r/20220617154248.700416-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20220617154248.700416-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
Chris von Recklinghausen 8dced2b153 mm: shrinkers: provide shrinkers with names
Bugzilla: https://bugzilla.redhat.com/2160210

commit e33c267ab70de4249d22d7eab1cc7d68a889bac2
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Tue May 31 20:22:24 2022 -0700

    mm: shrinkers: provide shrinkers with names

    Currently shrinkers are anonymous objects.  For debugging purposes they
    can be identified by their count/scan function names, but that's not
    always useful: e.g. for a superblock's shrinker it's nice to have at
    least an idea of which superblock the shrinker belongs to.

    This commit adds names to shrinkers.  The register_shrinker() and
    prealloc_shrinker() functions are extended to take a format string and
    arguments used to generate a name.

    In some cases it's not possible to determine a good name at the time when
    a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
    provided.

    The expected format is:
        <subsystem>-<shrinker_type>[:<instance>]-<id>
    For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
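
    For example, callers now pass the name as a printf-style format string;
    the calls below are modeled on converted users and shown only for
    illustration:

      /* superblock shrinker, named after the filesystem type: */
      err = prealloc_shrinker(&s->s_shrink, "sb-%s", type->name);

      /* a shrinker with a fixed name: */
      err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");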

    After this change the shrinker debugfs directory looks like:
      $ cd /sys/kernel/debug/shrinker/
      $ ls
        dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
        mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
        mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
        rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
        sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
        sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
        sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
        sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
        sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
        sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
        sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
        sb-dax-11           sb-proc-45       sb-tmpfs-35
        sb-debugfs-7        sb-proc-46       sb-tmpfs-40

    [roman.gushchin@linux.dev: fix build warnings]
      Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
      Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Kent Overstreet <kent.overstreet@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen 22b1bd509b mm: shrinkers: introduce debugfs interface for memory shrinkers
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5035ebc644aec92d55d1bbfe042f35341e4bffb5
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Tue May 31 20:22:23 2022 -0700

    mm: shrinkers: introduce debugfs interface for memory shrinkers

    This commit introduces the /sys/kernel/debug/shrinker debugfs interface
    which provides an ability to observe the state of individual kernel memory
    shrinkers.

    Because the feature adds some memory overhead (which shouldn't be large
    unless there is a huge number of registered shrinkers), it's guarded by a
    config option (enabled by default).

    This commit introduces the "count" interface for each shrinker registered
    in the system.

    The output is in the following format:
    <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>...
    <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>...
    ...

    To reduce the size of the output on machines with many thousands of
    cgroups, the line is omitted if the total number of objects on all nodes
    is 0.

    If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed as
    cgroup inode id.  If the shrinker is not numa-aware, 0's are printed for
    all nodes except the first one.

    This commit gives debugfs entries simple numeric names, which are not very
    convenient.  The following commit in the series will provide shrinkers
    with more meaningful names.

    [akpm@linux-foundation.org: remove WARN_ON_ONCE(), per Roman]
      Reported-by: syzbot+300d27c79fe6d4cbcc39@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/20220601032227.4076670-3-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen 73c174d303 vmscan: Add check_move_unevictable_folios()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 77414d195f905dd43f58bce82118775ffa59575c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jun 4 17:39:09 2022 -0400

    vmscan: Add check_move_unevictable_folios()

    Move the guts of check_move_unevictable_pages() into a new, folio-based
    check_move_unevictable_folios() and keep check_move_unevictable_pages()
    as a wrapper around it.
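
    The wrapper converts the pagevec into a folio_batch and hands it to the
    new function; a rough sketch, simplified from the upstream change:

      void check_move_unevictable_pages(struct pagevec *pvec)
      {
              struct folio_batch fbatch;
              unsigned int i;

              folio_batch_init(&fbatch);
              for (i = 0; i < pvec->nr; i++) {
                      struct page *page = pvec->pages[i];

                      /* only head pages are moved; skip THP tail pages */
                      if (PageTransTail(page))
                              continue;
                      folio_batch_add(&fbatch, page_folio(page));
              }
              check_move_unevictable_folios(&fbatch);
      }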

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:16 -04:00
Chris von Recklinghausen 2244d30d80 Revert "mm/vmscan: never demote for memcg reclaim"
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3f1509c57b1ba5646de0fb8d81bd7107aec22257
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed May 18 15:09:11 2022 -0400

    Revert "mm/vmscan: never demote for memcg reclaim"

    This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.

    Its premise was that cgroup reclaim cares about freeing memory inside the
    cgroup, and that demotion just moves pages around within the cgroup
    limit.  Hence, pages from toptier nodes should be reclaimed directly.

    However, with NUMA balancing now doing tier promotions, demotion is part
    of the page aging process.  Global reclaim demotes the coldest toptier
    pages to secondary memory, where their life continues and from which they
    have a chance to get promoted back.  Essentially, tiered memory systems
    have an LRU order that spans multiple nodes.

    When cgroup reclaims pages coming off the toptier directly, there can be
    colder pages on lower tier nodes that were demoted by global reclaim.
    This is an aging inversion, not unlike if cgroups were to reclaim directly
    from the active lists while there are inactive pages.

    Proactive reclaim is another factor.  Its goal is to offload colder pages
    from expensive RAM to cheaper storage.  When lower tier memory is
    available as an intermediate layer, we want offloading to take advantage
    of it instead of bypassing it and going straight to storage.

    Revert the patch so that cgroups respect the LRU order spanning the memory
    hierarchy.

    Of note is a specific undercommit scenario, where all cgroup limits in the
    system add up to <= available toptier memory.  In that case, shuffling
    pages out to lower tiers first to reclaim them from there is inefficient.
    This is something that could be optimized or short-circuited later on
    (although care must be taken not to accidentally recreate the aging
    inversion).
    Let's ensure correctness first.

    Link: https://lkml.kernel.org/r/20220518190911.82400-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen 7b1db0833d mm: don't be stuck to rmap lock on reclaim path
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d4675e601357834dadd2ba1d803f6484596015c
Author: Minchan Kim <minchan@kernel.org>
Date:   Thu May 19 14:08:54 2022 -0700

    mm: don't be stuck to rmap lock on reclaim path

    The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can be contended
    under memory pressure if processes keep working on their vmas (e.g.,
    fork, mmap, munmap), which gets the reclaim path stuck.  In our real
    workload traces, we see kswapd waiting on the lock for 300ms+ (worst
    case, a second), which pushes other processes into direct reclaim, where
    they also get stuck on the lock.

    This patch makes the lru aging path use try_lock mode, like
    shrink_page_list, so the reclaim context keeps working on the next lru
    pages instead of getting stuck.  If it finds the rmap lock contended, it
    rotates the page back to the head of the lru in both the active and
    inactive lrus for consistent behavior; this is a basic starting point
    rather than adding more heuristics.

    Since this patch introduces a new "contended" out-parameter alongside the
    try_lock in-parameter in rmap_walk_control, the structure is no longer
    immutable when try_lock is set, so remove the const keywords from the
    rmap-related functions.  Since rmap walking is already an expensive
    operation, I doubt the const provided any sizable benefit (and we didn't
    have it until 5.17).
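
    Concretely, rmap_walk_control gains a try_lock in-parameter and a
    contended out-parameter; a simplified sketch modeled on the aging path
    in folio_referenced() (pra and the handler are from that surrounding
    function and are not shown here):

      struct rmap_walk_control rwc = {
              .rmap_one  = folio_referenced_one,
              .arg       = (void *)&pra,
              .anon_lock = folio_lock_anon_vma_read,
              .try_lock  = true,    /* don't sleep on contended rmap locks */
      };

      rmap_walk(folio, &rwc);

      /* A contended walk is reported back so the caller can just rotate
       * the folio on the LRU instead of stalling reclaim. */
      return rwc.contended ? -1 : pra.referenced;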

    In a heavy app workload in Android, trace shows following statistics.  It
    almost removes rmap lock contention from reclaim path.

    Martin Liu reported:

    Before:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
             1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
              601            0             601   145.544681        28817    198  rmap_walk_file

    After:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
              NaN          NaN              NaN          NaN          NaN    0.0             NaN
                0            0                0     0.127645            1     12  rmap_walk_file

    [minchan@kernel.org: add comment, per Matthew]
      Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
    Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: John Dias <joaodias@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Martin Liu <liumartin@google.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 0fa6b9a627 vmscan: remove remaining uses of page in shrink_page_list
Conflicts: mm/vmscan.c - The backport of
	d4b4084ac315 ("mm: Turn can_split_huge_page() into can_split_folio()")
	didn't change page_maybe_dma_pinned to folio_maybe_dma_pinned.
	Do that here.

Bugzilla: https://bugzilla.redhat.com/2160210

commit c28a0e9695b724fbaa58b1f5bbf0a03c5a79d721
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu May 12 20:23:03 2022 -0700

    vmscan: remove remaining uses of page in shrink_page_list

    These are all straightforward conversions to the folio API.

    Link: https://lkml.kernel.org/r/20220504182857.4013401-16-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:07 -04:00