commit da27503010
Author: Mark Salter <msalter@redhat.com>
Date: 2024-10-31 10:35:27 -04:00

arm64: support batched/deferred tlb shootdown during page reclamation/migration
JIRA: https://issues.redhat.com/browse/RHEL-40604

 Conflicts:
	arch/arm64/Kconfig
	Context from RT kernel patches

	arch/arm64/include/asm/tlbflush.h
	Out of order backports of:
	commit 1af5a8109904 "mmu_notifiers: rename invalidate_range notifier"
	commit 6bbd42e2df8f "mmu_notifiers: call invalidate_range() when
	invalidating TLBs"

commit 43b3dfdd04553171488cb11d46d21948b6b90e27
Author: Barry Song <v-songbaohua@oppo.com>
Date: Mon, 17 Jul 2023 21:10:04 +0800

    On x86, batched and deferred tlb shootdown has led to a 90% performance
    increase in tlb shootdown.  On arm64, the HW can do tlb shootdown without
    a software IPI, but the synchronous tlbi is still quite expensive.
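
    (For reference, the batching that already exists on x86 hangs off the
    reclaim path roughly as in the simplified sketch below.  The names follow
    mm/rmap.c and include/linux/sched.h, but the bodies are heavily trimmed,
    so treat this as an illustration of the idea rather than the exact code:)

     /* While unmapping a page for reclaim: queue the flush, don't wait. */
     static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
                                           unsigned long uaddr)
     {
             struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

             arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
             tlb_ubc->flush_required = true;
     }

     /* After a whole batch of pages has been unmapped: one flush for all. */
     void try_to_unmap_flush(void)
     {
             struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

             if (!tlb_ubc->flush_required)
                     return;

             arch_tlbbatch_flush(&tlb_ubc->arch);
             tlb_ubc->flush_required = false;
     }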

    Even running the simplest program that requires swapout shows how
    expensive this synchronous flush is:
     #include <sys/types.h>
     #include <unistd.h>
     #include <sys/mman.h>
     #include <string.h>

     int main(void)
     {
     #define SIZE (1 * 1024 * 1024)
             /* 1MB of swap-backed anonymous memory */
             volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);

             if (p == MAP_FAILED)
                     return 1;

             memset((void *)p, 0x88, SIZE);

             for (int k = 0; k < 10000; k++) {
                     /* swap in: touch one byte per page to fault it back */
                     for (int i = 0; i < SIZE; i += 4096) {
                             (void)p[i];
                     }

                     /* swap out: push the whole range back to swap (zRAM here) */
                     madvise((void *)p, SIZE, MADV_PAGEOUT);
             }

             return 0;
     }
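
    (The commit does not show how the benchmark was built; assuming the source
    above is saved as swap.c, something like the following produces the a.out
    used in the perf runs below:)

     ~ # gcc swap.c
     ~ # taskset -c 4 ./a.out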

    Perf results on a Snapdragon 888 with 8 cores, using zRAM as the swap
    block device:

     ~ # perf record taskset -c 4 ./a.out
     [ perf record: Woken up 10 times to write data ]
     [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
     ~ # perf report
     # To display the perf.data header info, please use --header/--header-only options.
     #
     #
     # Total Lost Samples: 0
     #
     # Samples: 60K of event 'cycles'
     # Event count (approx.): 35706225414
     #
     # Overhead  Command  Shared Object      Symbol
     # ........  .......  .................  ......
     #
        21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
         8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
         6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
         6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
         5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
         3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
         3.49%  a.out    [kernel.kallsyms]  [k] memset64
         1.63%  a.out    [kernel.kallsyms]  [k] clear_page
         1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
         1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
         1.23%  a.out    [kernel.kallsyms]  [k] xas_load
         1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

    ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark, which
    swaps a page mapped by only one process in and out.  If the page is mapped
    by multiple processes (typically more than 100 on a phone), the overhead
    is much higher, since the tlb flush has to run 100 times for one single
    page.  In addition, tlb flush overhead increases with the number of CPU
    cores due to the poor scalability of tlb shootdown in HW, so ARM64 servers
    should expect much higher overhead.

    Further perf annotate shows that 95% of the cpu time of ptep_clear_flush()
    is actually spent in the final dsb(), waiting for the tlb flush to
    complete.  This gives us a very good opportunity to leverage the existing
    batched tlb flush support in the kernel: the minimal modification is to
    issue only the asynchronous tlbi in the first stage, and to issue the dsb
    in the second stage, when we have to synchronize.
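
    On arm64 this means implementing the generic batch-flush hooks shown
    earlier.  A minimal sketch of the idea (simplified from the actual
    tlbflush.h changes; the helper names match those used upstream, but the
    bodies are trimmed, so treat this as a sketch rather than the literal
    diff) looks roughly like:

     /*
      * Stage 1: queue the invalidation.  Broadcast a TLBI for this address
      * but do not wait for completion; the expensive dsb(ish) is deferred.
      */
     static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                  struct mm_struct *mm,
                                                  unsigned long uaddr)
     {
             __flush_tlb_page_nosync(mm, uaddr);  /* tlbi without trailing dsb */
     }

     /*
      * Stage 2: one barrier waits for every queued invalidation at once,
      * instead of one dsb per unmapped page.
      */
     static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
     {
             dsb(ish);
     }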

    With the simple micro-benchmark above, the elapsed time to finish the
    program decreases by around 5%.

    Typical elapsed time w/o patch:
     ~ # time taskset -c 4 ./a.out
     0.21user 14.34system 0:14.69elapsed
    w/ patch:
     ~ # time taskset -c 4 ./a.out
     0.22user 13.45system 0:13.80elapsed

    Also tested with the benchmark from the commit on a Kunpeng920 arm64
    server; an improvement of around 12.5% was observed with the command
    `time ./swap_bench`:
            w/o             w/
    real    0m13.460s       0m11.771s
    user    0m0.248s        0m0.279s
    sys     0m12.039s       0m11.458s

    Originally, a 16.99% overhead from ptep_clear_flush() was observed; it is
    eliminated by this patch:

    [root@localhost yang]# perf record -- ./swap_bench && perf report
    [...]
    16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

    It was tested on 4-, 8- and 128-CPU platforms and shown to be beneficial
    on large systems, but it may show no improvement on small systems such as
    a 4-CPU platform.

    This patch also improves the performance of page migration.  Using pmbench
    and migrating the pages of pmbench between node 0 and node 1 one hundred
    times for 1G of memory, this patch decreases the time used by around 20%
    (from 18.338318910 sec to 13.981866350 sec) and eliminates the time spent
    in ptep_clear_flush().
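
    (The migration loop itself is driven from user space.  As an illustration
    only, and not the harness actually used, all pages of a running pmbench
    can be pushed back and forth between node 0 and node 1 with libnuma's
    numa_migrate_pages(); the sketch below is built with -lnuma:)

     #include <numa.h>
     #include <stdio.h>
     #include <stdlib.h>

     /* Usage: ./migrate <pid> -- ping-pong all pages of <pid> between nodes 0 and 1 */
     int main(int argc, char **argv)
     {
             struct bitmask *n0, *n1;
             int pid, i;

             if (argc != 2 || numa_available() < 0)
                     return 1;
             pid = atoi(argv[1]);

             n0 = numa_allocate_nodemask();
             n1 = numa_allocate_nodemask();
             numa_bitmask_setbit(n0, 0);
             numa_bitmask_setbit(n1, 1);

             for (i = 0; i < 100; i++) {
                     /* odd rounds move node 0 -> 1, even rounds move 1 -> 0 */
                     struct bitmask *from = (i & 1) ? n1 : n0;
                     struct bitmask *to   = (i & 1) ? n0 : n1;

                     if (numa_migrate_pages(pid, from, to) < 0) {
                             perror("numa_migrate_pages");
                             return 1;
                     }
             }

             numa_free_nodemask(n0);
             numa_free_nodemask(n1);
             return 0;
     }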

    Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
    Tested-by: Yicong Yang <yangyicong@hisilicon.com>
    Tested-by: Xin Hao <xhao@linux.alibaba.com>
    Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Barry Song <v-songbaohua@oppo.com>
    Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Darren Hart <darren@os.amperecomputing.com>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: lipeifeng <lipeifeng@oppo.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Steven Miao <realmz6@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zeng Tao <prime.zeng@hisilicon.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Mark Salter <msalter@redhat.com>