Commit Graph

290 Commits

Author SHA1 Message Date
Rafael Aquini fa01616b1d percpu: scoped objcg protection
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c63b835d0eafc956c43b8c6605708240ac52b8cd
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Thu Oct 19 15:53:45 2023 -0700

    percpu: scoped objcg protection

    Similar to slab and kmem, switch to a scope-based protection of the
    objcg pointer to avoid unnecessary reference counting.
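
    A minimal sketch of the pattern, using the helpers introduced by the
    same series (hedged, not the literal diff):

        /* before: pin the objcg with an explicit reference */
        objcg = get_obj_cgroup_from_current();
        /* ... charge the allocation ... */
        obj_cgroup_put(objcg);

        /* after: objcg is stable for the scope; no get/put needed */
        objcg = current_obj_cgroup();
        /* ... charge the allocation ... */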

    Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
    Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:58 -05:00
Rafael Aquini c2a3a026db mm/percpu.c: print error message too if atomic alloc failed
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit f7d77dfc91f747f64cb00884fd6d7940c3b49fca
Author: Baoquan He <bhe@redhat.com>
Date:   Fri Jul 28 11:02:55 2023 +0800

    mm/percpu.c: print error message too if atomic alloc failed

    The variable 'err' is assigned an error message if atomic alloc
    failed, but it has no chance of being printed if is_atomic is true.

    Here, change to print the error message too if atomic alloc failed,
    while avoiding the dump_stack() call in that case.
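
    Roughly, the fail path in pcpu_alloc() then becomes (a hedged sketch,
    not the literal diff):

        if (do_warn && warn_limit) {
                pr_warn("allocation failed, size=%zu align=%zu atomic=%d, %s\n",
                        size, align, is_atomic, err);
                if (!is_atomic)
                        dump_stack();
                if (!--warn_limit)
                        pr_info("limit reached, disable warning\n");
        }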

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:08 -04:00
Rafael Aquini 781159954d mm/percpu.c: optimize the code in pcpu_setup_first_chunk() a little bit
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 7ee1e758bebe13d96217bcfd5230892ed44760e7
Author: Baoquan He <bhe@redhat.com>
Date:   Sat Jul 22 09:14:37 2023 +0800

    mm/percpu.c: optimize the code in pcpu_setup_first_chunk() a little bit

    This removes the need for the local variable 'chunk', and optimizes
    the code calling pcpu_alloc_first_chunk() to initialize the reserved
    chunk and dynamic chunk, making it simpler.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    [Dennis: reworded first chunk init comment]
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:07 -04:00
Rafael Aquini 09bb07696d mm/percpu.c: remove redundant check
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 5b672085e70c2ea40f4c9d6a23848079bf0ff700
Author: Baoquan He <bhe@redhat.com>
Date:   Fri Jul 21 21:17:58 2023 +0800

    mm/percpu.c: remove redundant check

    The conditional check '(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE)'
    already covers the check '(!ai->dyn_size)'.
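
    A hedged sketch of the change in pcpu_setup_first_chunk(), assuming
    PERCPU_DYNAMIC_EARLY_SIZE is always non-zero:

        /* removed: */
        PCPU_SETUP_BUG_ON(!ai->dyn_size);
        /* kept; already implies the check above: */
        PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);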

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:07 -04:00
Rafael Aquini 16893a5375 mm/percpu: Remove some local variables in pcpu_populate_pte
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 41fd59b7f9bdde2a473450680411c2016017b992
Author: Bibo Mao <maobibo@loongson.cn>
Date:   Wed Jul 12 11:16:20 2023 +0800

    mm/percpu: Remove some local variables in pcpu_populate_pte

    In function pcpu_populate_pte() there are already variables defined
    that can be reused later, so remove the duplicated local variables.

    Signed-off-by: Bibo Mao <maobibo@loongson.cn>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:06 -04:00
Aristeu Rozanski 1c1f6235c1 mm: memcontrol: rename memcg_kmem_enabled()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7a449f779608efe1941a0e0c4bd7b5f57000be7
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Mon Feb 13 11:29:22 2023 -0800

    mm: memcontrol: rename memcg_kmem_enabled()

    Currently there are two kmem-related helper functions with a confusing
    semantics: memcg_kmem_enabled() and mem_cgroup_kmem_disabled().

    The problem is that an obvious expectation
    memcg_kmem_enabled() == !mem_cgroup_kmem_disabled(),
    can be false.

    mem_cgroup_kmem_disabled() is similar to mem_cgroup_disabled(): it returns
    true only if CONFIG_MEMCG_KMEM is not set or the kmem accounting is
    disabled using a boot time kernel option "cgroup.memory=nokmem".  It never
    changes the value dynamically.

    memcg_kmem_enabled() is different: it always returns false until the first
    non-root memory cgroup will get online (assuming the kernel memory
    accounting is enabled).  Its goal is to improve the performance on
    systems without the cgroupfs mounted/memory controller enabled or on the
    systems with only the root memory cgroup.

    To make things more obvious and avoid potential bugs, let's rename
    memcg_kmem_enabled() to memcg_kmem_online().
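
    The intended usage contrast, as a hedged sketch:

        /* flips on once the first non-root memcg comes online */
        if (memcg_kmem_online()) {
                /* ... account the allocation ... */
        }

        /* boot-time state only: CONFIG_MEMCG_KMEM=n or "nokmem" */
        if (mem_cgroup_kmem_disabled())
                return;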

    Link: https://lkml.kernel.org/r/20230213192922.1146370-1-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Lucas Zampieri 6f794c0e0b
Merge: MM update to v6.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3738

JIRA: https://issues.redhat.com/browse/RHEL-27739

Depends: !3662

Dropped Patches and the reason they were dropped:

Needs to be evaluated by the FS team:
138060ba92b3 ("fs: pass dentry to set acl method")
3b4c7bc01727 ("xattr: use rbtree for simple_xattrs")

Needs to be evaluated by the NVME team:
4003f107fa2e ("mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages")

Needs to be evaluated by the ZRAM team:
7c2af309abd2 ("zram: add size class equals check into recompression")

Signed-off-by: Audra Mitchell <audra@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-17 10:14:56 -03:00
Audra Mitchell a23585f50b mm/percpu.c: remove the lcm code since block size is fixed at page size
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 3289e0533e70aafa9fb6d128fd4452db1b8befe8
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:33 2022 +0800

    mm/percpu.c: remove the lcm code since block size is fixed at page size

    Since commit b239f7daf5 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to
    PAGE_SIZE"), PCPU_BITMAP_BLOCK_SIZE has been fixed at page size.
    So the lcm code in pcpu_alloc_first_chunk() doesn't make sense
    any more; clean it up.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell a4b0f4aadc mm/percpu: replace the goto with break
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 83d261fc9e5fb03e8c32e365ca4ee53952611a2b
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:32 2022 +0800

    mm/percpu: replace the goto with break

    In function pcpu_reclaim_populated(), the goto jump is unnecessary
    since the label 'end_chunk' is near the end of the for loop; use
    break instead.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell 443bfa2d5b mm/percpu: add comment to state the empty populated pages accounting
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 73046f8d31701c379f6db899cb09ba70a3285143
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Oct 25 11:45:16 2022 +0800

    mm/percpu: add comment to state the empty populated pages accounting

    When allocating an area from a chunk, pcpu_block_update_hint_alloc()
    is called to update chunk metadata, including chunk's and global
    nr_empty_pop_pages. However, if the allocation is not atomic, some
    blocks may not be populated with pages yet, while we still subtract
    the number here. The number of pages will be added back with
    pcpu_chunk_populated() when populating pages.

    Add a code comment to make that more understandable.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell a432c1d810 mm/percpu: Update the code comment when creating new chunk
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit e04cb6976340d5ebf2b28ad91bf6a13a285aa566
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:30 2022 +0800

    mm/percpu: Update the code comment when creating new chunk

    The code taking the pcpu_alloc_mutex lock has been moved to the
    beginning of pcpu_alloc() for non-atomic allocations, so the code
    comment above the pcpu_create_chunk() callsite needs to be updated.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Audra Mitchell 91e0cae202 mm/percpu: use list_first_entry_or_null in pcpu_reclaim_populated()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit c1f6688d35d47ca11200789b000b3b20f5ecdbd9
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Oct 25 11:11:45 2022 +0800

    mm/percpu: use list_first_entry_or_null in pcpu_reclaim_populated()

    Replace the list_empty()/list_first_entry() pair to simplify the code.
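
    A hedged sketch of the simplification in pcpu_reclaim_populated():

        /* before */
        while (!list_empty(&pcpu_chunk_lists[pcpu_to_depopulate_slot])) {
                chunk = list_first_entry(&pcpu_chunk_lists[pcpu_to_depopulate_slot],
                                         struct pcpu_chunk, list);

        /* after */
        while ((chunk = list_first_entry_or_null(
                        &pcpu_chunk_lists[pcpu_to_depopulate_slot],
                        struct pcpu_chunk, list))) {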

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Audra Mitchell 10f60902d9 mm/percpu: remove unused pcpu_map_extend_chunks
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 5a7d596a05dddd09c44ae462f881491cf87ed120
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:28 2022 +0800

    mm/percpu: remove unused pcpu_map_extend_chunks

    Since commit 40064aeca3 ("percpu: replace area map allocator with
    bitmap"), it is unneeded.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Artem Savkov 4ab5be6999 mm/percpu.c: introduce pcpu_alloc_size()
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit b460bc8302f222d346f0c15bba980eb8c36d6278
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Oct 20 21:31:57 2023 +0800

    mm/percpu.c: introduce pcpu_alloc_size()
    
    Introduce pcpu_alloc_size() to get the size of the dynamic per-cpu
    area. It will be used by bpf memory allocator in the following patches.
    BPF memory allocator maintains per-cpu area caches for multiple area
    sizes and its free API only has the to-be-freed per-cpu pointer, so it
    needs the size of dynamic per-cpu area to select the corresponding cache
    when bpf program frees the dynamic per-cpu pointer.
    
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20231020133202.4043247-3-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:55 +01:00
Artem Savkov 513a6387a1 mm/percpu.c: don't acquire pcpu_lock for pcpu_chunk_addr_search()
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 394e6869f0185e89cb815db29bf819474df858ae
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Oct 20 21:31:56 2023 +0800

    mm/percpu.c: don't acquire pcpu_lock for pcpu_chunk_addr_search()
    
    There is no need to acquire pcpu_lock for pcpu_chunk_addr_search():
    1) both pcpu_first_chunk & pcpu_reserved_chunk must have been
       initialized before the invocation of free_percpu().
    2) The dynamically-created chunk must be valid before the per-cpu
       pointers allocated from it are freed.
    
    So acquire pcpu_lock after the invocation of pcpu_chunk_addr_search().
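
    A hedged sketch of the reordering in free_percpu():

        /* before */
        spin_lock_irqsave(&pcpu_lock, flags);
        chunk = pcpu_chunk_addr_search(addr);

        /* after: the chunk is pinned by its still-live objects */
        chunk = pcpu_chunk_addr_search(addr);
        spin_lock_irqsave(&pcpu_lock, flags);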
    
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20231020133202.4043247-2-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:55 +01:00
Chris von Recklinghausen c766891728 percpu: improve percpu_alloc_percpu event trace
Bugzilla: https://bugzilla.redhat.com/2160210

commit f67bed134a053663852a1a3ab1b3223bfc2104a2
Author: Vasily Averin <vvs@openvz.org>
Date:   Thu May 12 20:23:07 2022 -0700

    percpu: improve percpu_alloc_percpu event trace

    Add call_site, bytes_alloc and gfp_flags fields to the output of the
    percpu_alloc_percpu ftrace event:

    mkdir-4393  [001]   169.334788: percpu_alloc_percpu:
     call_site=mem_cgroup_css_alloc+0xa6 reserved=0 is_atomic=0 size=2408 align=8
      base_addr=0xffffc7117fc00000 off=402176 ptr=0x3dc867a62300 bytes_alloc=14448
       gfp_flags=GFP_KERNEL_ACCOUNT

    This is required to track memcg-accounted percpu allocations.

    Link: https://lkml.kernel.org/r/a07be858-c8a3-7851-9086-e3262cbcf707@openvz.org
    Signed-off-by: Vasily Averin <vvs@openvz.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Waiman Long 470c25f269 mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2151065

commit a317ebccaa3609917a2c021af870cf3fa607ab0c
Author: Patrick Wang <patrick.wang.shcn@gmail.com>
Date:   Tue, 5 Jul 2022 19:31:58 +0800

    mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free()

    Kmemleak recently added an rbtree to store the objects allocated with
    physical address.  Those objects can't be freed with kmemleak_free().

    According to the comments, percpu allocations are tracked by kmemleak
    separately.  Kmemleak_free() was used to avoid the unnecessary
    tracking.  If kmemleak_free() fails, those objects would be scanned by
    kmemleak, which is unnecessary but shouldn't lead to other effects.

    Use kmemleak_ignore_phys() instead of kmemleak_free() for those
    objects.
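
    A hedged sketch of the substitution, e.g. for the first-chunk areas:

        /* before */
        kmemleak_free(areas);

        /* after: memblock objects live in kmemleak's physical-address tree */
        kmemleak_ignore_phys(__pa(areas));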

    Link: https://lkml.kernel.org/r/20220705113158.127600-1-patrick.wang.shcn@gmail.com
    Fixes: 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA")
    Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-02-07 14:19:38 -05:00
Chris von Recklinghausen 3654bc9b37 mm: percpu: add generic pcpu_populate_pte() function
Bugzilla: https://bugzilla.redhat.com/2120352

commit 20c035764626c56c4f6514936b9ee4be0f4cd962
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Jan 19 18:07:53 2022 -0800

    mm: percpu: add generic pcpu_populate_pte() function

    With NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, we need a function to
    populate pte, this patch adds a generic pcpu populate pte function,
    pcpu_populate_pte(), which is marked __weak and used on most
    architectures, but it is overridden on x86, which has its own
    implementation.
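
    The shape of the override mechanism, as a hedged sketch:

        /* generic version in mm/percpu.c */
        void __init __weak pcpu_populate_pte(unsigned long addr)
        {
                /* walk pgd/p4d/pud/pmd, allocating any missing levels */
        }

        /* x86 keeps its own strong definition, which takes precedence */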

    Link: https://lkml.kernel.org/r/20211216112359.103822-5-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:42 -04:00
Chris von Recklinghausen 18cad680eb mm: percpu: add generic pcpu_fc_alloc/free funciton
Bugzilla: https://bugzilla.redhat.com/2120352

commit 23f917169ef157aa7a6bf80d8c4aad6f1282852c
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Jan 19 18:07:49 2022 -0800

    mm: percpu: add generic pcpu_fc_alloc/free funciton

    With the previous patch, we can add generic pcpu first chunk
    allocate and free functions to clean up the duplicated definitions
    on each architecture.

    Link: https://lkml.kernel.org/r/20211216112359.103822-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:42 -04:00
Chris von Recklinghausen 983533c507 mm: percpu: add pcpu_fc_cpu_to_node_fn_t typedef
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1ca3fb3abd2b615c4b61728de545760a6e2c2d8b
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Wed Jan 19 18:07:45 2022 -0800

    mm: percpu: add pcpu_fc_cpu_to_node_fn_t typedef

    Add pcpu_fc_cpu_to_node_fn_t and pass it into pcpu_fc_alloc_fn_t.
    pcpu first chunk allocation will call it to allocate memblock on
    the corresponding node; this is preparation for the next patch.
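
    The typedef itself, with a callback such as early_cpu_to_node in mind:

        typedef int pcpu_fc_cpu_to_node_fn_t(int cpu);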

    Link: https://lkml.kernel.org/r/20211216112359.103822-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:42 -04:00
Chris von Recklinghausen adf3d212d2 bitmap: unify find_bit operations
Bugzilla: https://bugzilla.redhat.com/2120352

commit ec288a2cf7ca40a939316b6df206ab845bb112d1
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 14 14:17:11 2021 -0700

    bitmap: unify find_bit operations

    bitmap_for_each_{set,clear}_region() are similar to the for_each_bit()
    macros in include/linux/find.h, but their interface and implementation
    are different.

    This patch adds the for_each_bitrange() macros and drops the unused
    bitmap_*_region() API for the sake of unification.
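
    A hedged before/after sketch of the conversion:

        /* before */
        bitmap_for_each_set_region(bitmap, rs, re, 0, nbits)
                /* ... handle the range [rs, re) ... */;

        /* after */
        for_each_set_bitrange(rs, re, bitmap, nbits)
                /* ... handle the range [rs, re) ... */;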

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:41 -04:00
Chris von Recklinghausen b027f02790 mm/percpu: micro-optimize pcpu_is_populated()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 801a57365fc836d7ec866e2069d0b21d79925c1e
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 14 14:17:10 2021 -0700

    mm/percpu: micro-optimize pcpu_is_populated()

    bitmap_next_clear_region() calls find_next_zero_bit() and find_next_bit()
    sequentially to find a range of clear bits. In the case of
    pcpu_is_populated() there's a chance to return earlier if the bitmap has
    all bits set.
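
    The early return, as a hedged sketch:

        start = find_next_zero_bit(chunk->populated, page_end, page_start);
        if (start >= page_end)
                return true;    /* fully populated: skip find_next_bit() */
        end = find_next_bit(chunk->populated, page_end, start + 1);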

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:41 -04:00
Chris von Recklinghausen b36217e840 mm: memcg/percpu: account extra objcg space to memory cgroups
Bugzilla: https://bugzilla.redhat.com/2120352

commit 8c57c07741bf28e7d867f1200aa80120b8ca663e
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Jan 14 14:09:12 2022 -0800

    mm: memcg/percpu: account extra objcg space to memory cgroups

    Similar to slab memory allocator, for each accounted percpu object there
    is an extra space which is used to store obj_cgroup membership.  Charge
    it too.

    [akpm@linux-foundation.org: fix layout]

    Link: https://lkml.kernel.org/r/20211126040606.97836-1-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:41 -04:00
Al Stone d1fd9d18f4 memblock: use memblock_free for freeing virtual pointers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071840
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for NXP i.MX8 platforms.  At this stage, this
 has been tested by ensuring we can survive the CI/CD loop -- i.e.,
 that we have not broken anything else, and a simple boot test.  When
 sufficient drivers have been brought in for i.MX8M, we will be able
 to run further tests.

Conflicts:
    init/main.c

    This patch is being applied out of order, but is a simple
    function name replacement, so applied manually.

commit 4421cca0a3e4833b3bf0f20de98eb580ab8c7290
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Nov 5 13:43:22 2021 -0700

    memblock: use memblock_free for freeing virtual pointers

    Rename memblock_free_ptr() to memblock_free() and use memblock_free()
    when freeing a virtual pointer so that memblock_free() will be a
    counterpart of memblock_alloc()

    The callers are updated with the below semantic patch and manual
    addition of (void *) casting to pointers that are represented by
    unsigned long variables.

        @@
        identifier vaddr;
        expression size;
        @@
        (
        - memblock_phys_free(__pa(vaddr), size);
        + memblock_free(vaddr, size);
        |
        - memblock_free_ptr(vaddr, size);
        + memblock_free(vaddr, size);
        )

    [sfr@canb.auug.org.au: fixup]
      Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au

    Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    (cherry picked from commit 4421cca0a3e4833b3bf0f20de98eb580ab8c7290)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-07-01 17:07:00 -06:00
Al Stone 14289d8c8f memblock: rename memblock_free to memblock_phys_free
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071840
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for NXP i.MX8 platforms.  At this stage, this
 has been tested by ensuring we can survive the CI/CD loop -- i.e.,
 that we have not broken anything else, and a simple boot test.  When
 sufficient drivers have been brought in for i.MX8M, we will be able
 to run further tests.

Conflicts:
    arch/s390/kernel/setup.c
    arch/s390/kernel/smp.c

    These have been modified in ways that no longer strictly
    match the upstream code, throwing off the auto-merge; this
    is a simple function name replacement, however, so easily
    done manually instead.

commit 3ecc68349bbab6bff1d12cbc7951ca6019b2faf6
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Nov 5 13:43:19 2021 -0700

    memblock: rename memblock_free to memblock_phys_free

    Since memblock_free() operates on a physical range, make its name
    reflect it and rename it to memblock_phys_free(), so it will be a
    logical counterpart to memblock_phys_alloc().

    The callers are updated with the below semantic patch:

        @@
        expression addr;
        expression size;
        @@
        - memblock_free(addr, size);
        + memblock_phys_free(addr, size);

    Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    (cherry picked from commit 3ecc68349bbab6bff1d12cbc7951ca6019b2faf6)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-07-01 17:06:59 -06:00
Al Stone 3b2e45e437 memblock: drop memblock_free_early_nid() and memblock_free_early()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071840
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for NXP i.MX8 platforms.  At this stage, this
 has been tested by ensuring we can survive the CI/CD loop -- i.e.,
 that we have not broken anything else, and a simple boot test.  When
 sufficient drivers have been brought in for i.MX8M, we will be able
 to run further tests.

commit fa27717110ae51b9b9013ced0b5143888257bb79
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Nov 5 13:43:13 2021 -0700

    memblock: drop memblock_free_early_nid() and memblock_free_early()

    memblock_free_early_nid() is unused and memblock_free_early() is an
    alias for memblock_free().

    Replace calls to memblock_free_early() with calls to memblock_free() and
    remove memblock_free_early() and memblock_free_early_nid().

    Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    (cherry picked from commit fa27717110ae51b9b9013ced0b5143888257bb79)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-07-01 17:06:59 -06:00
Rafael Aquini 3e417b25d5 percpu: remove export of pcpu_base_addr
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 3843c50a782c397422765cf0839a95e75e523229
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Tue Sep 7 19:57:27 2021 -0700

    percpu: remove export of pcpu_base_addr

    This is not needed by any modules, so remove the export.

    Link: https://lkml.kernel.org/r/20210722185814.504541-1-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:30 -05:00
Rafael Aquini 55ccb8623f mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 319814504992f51ed17af60edb1a237ada1892e8
Author: Jing Xiangfeng <jingxiangfeng@huawei.com>
Date:   Thu Sep 2 15:01:00 2021 -0700

    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()

    Commit b239f7daf5 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE")
    removed the parameter 'for_alloc', so remove this comment.

    Link: https://lkml.kernel.org/r/1630576043-21367-1-git-send-email-jingxiangfeng@huawei.com
    Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:32 -05:00
Dennis Zhou 93274f1dd6 percpu: flush tlb in pcpu_reclaim_populated()
Prior to "percpu: implement partial chunk depopulation",
pcpu_depopulate_chunk() was called only on the destruction path. This
meant the virtual address range was on its way back to vmalloc which
will handle flushing the tlbs for us.

However, with pcpu_reclaim_populated(), we are now calling
pcpu_depopulate_chunk() during the active lifecycle of a chunk.
Therefore, we need to flush the tlb as well otherwise we can end up
accessing the wrong page through an invalid tlb mapping as reported in
[1].

[1] https://lore.kernel.org/lkml/20210702191140.GA3166599@roeck-us.net/
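
A hedged sketch of the fix in pcpu_reclaim_populated(), reusing the
existing pcpu_post_unmap_tlb_flush() helper:

    pcpu_depopulate_chunk(chunk, rs, re);
    /* vmalloc no longer flushes this range for us on this path */
    pcpu_post_unmap_tlb_flush(chunk, rs, re);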

Fixes: f183324133 ("percpu: implement partial chunk depopulation")
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-07-04 18:30:17 +00:00
Linus Torvalds e267992f9e Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:

 - percpu chunk depopulation - depopulate backing pages for chunks with
   empty pages when we exceed a global threshold without those pages.
   This lets us reclaim a portion of memory that would previously be
   lost until the full chunk would be freed (possibly never).

 - memcg accounting cleanup - previously separate chunks were managed
   for normal allocations and __GFP_ACCOUNT allocations. These are now
   consolidated which cleans up the code quite a bit.

 - a few misc clean ups for clang warnings

* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: optimize locking in pcpu_balance_workfn()
  percpu: initialize best_upa variable
  percpu: rework memcg accounting
  mm, memcg: introduce mem_cgroup_kmem_disabled()
  mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
  percpu: make symbol 'pcpu_free_slot' static
  percpu: implement partial chunk depopulation
  percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  percpu: factor out pcpu_check_block_hint()
  percpu: split __pcpu_balance_workfn()
  percpu: fix a comment about the chunks ordering
2021-07-01 17:17:24 -07:00
Roman Gushchin e4d777003a percpu: optimize locking in pcpu_balance_workfn()
pcpu_balance_workfn() unconditionally calls pcpu_balance_free(),
pcpu_reclaim_populated(), pcpu_balance_populated() and
pcpu_balance_free() again.

Each call to pcpu_balance_free() and pcpu_reclaim_populated() will
cause at least one acquisition of the pcpu_lock. So even if the
balancing was scheduled because of a failed atomic allocation,
pcpu_lock will be acquired at least 4 times. This obviously
increases the contention on the pcpu_lock.

To optimize the scheme let's grab the pcpu_lock on the upper level
(in pcpu_balance_workfn()) and keep it generally locked for the whole
duration of the scheduled work, but release conditionally to perform
any slow operations like chunk (de)population and creation of new
chunks.
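
The resulting shape of pcpu_balance_workfn(), as a hedged sketch
(pcpu_lock is dropped and retaken internally around slow operations):

    mutex_lock(&pcpu_alloc_mutex);
    spin_lock_irq(&pcpu_lock);

    pcpu_balance_free(false);
    pcpu_reclaim_populated();
    pcpu_balance_populated();
    pcpu_balance_free(true);

    spin_unlock_irq(&pcpu_lock);
    mutex_unlock(&pcpu_alloc_mutex);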

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-06-17 23:05:24 +00:00
Dennis Zhou 4829c791b2 percpu: initialize best_upa variable
Tom reported this finding from clang 10's static analysis [1].

Due to the way the code is written, it will always see a successful loop
iteration. Instead of setting an initial value, check that it was set
instead with BUG_ON() because 0 units per allocation is bogus.

[1] https://lore.kernel.org/lkml/20210515180817.1751084-1-trix@redhat.com/
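
A hedged sketch of the fix in pcpu_build_alloc_info():

    int upa, best_upa = -1;

    /* ... the loop always assigns best_upa at least once ... */

    BUG_ON(best_upa == -1);    /* 0 units per allocation is bogus */
    upa = best_upa;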

Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-06-14 14:42:05 +00:00
Roman Gushchin faf65dde84 percpu: rework memcg accounting
The current implementation of the memcg accounting of the percpu
memory is based on the idea of having two separate sets of chunks for
accounted and non-accounted memory. This approach has an advantage
of not wasting any extra memory for memcg data for non-accounted
chunks, however it complicates the code and leads to a higher chunks
number due to a lower chunk utilization.

Instead of having two chunk types it's possible to declare all* chunks
memcg-aware unless the kernel memory accounting is disabled globally
by a boot option. The size of objcg_array is usually small in
comparison to chunks themselves (it obviously depends on the number of
CPUs), so even if some chunk will have no accounted allocations, the
memory waste isn't significant and will likely be compensated by
a higher chunk utilization. Also, with time more and more percpu
allocations will likely become accounted.

* The first chunk is initialized before the memory cgroup subsystem,
  so we don't know for sure whether we need to allocate obj_cgroups.
  Because it's small, let's make it free for use. Then we don't need
  to allocate obj_cgroups for it.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-06-05 20:43:15 +00:00
Wei Yongjun 8d55ba5df3 percpu: make symbol 'pcpu_free_slot' static
The sparse tool complains as follows:

mm/percpu.c:138:5: warning:
 symbol 'pcpu_free_slot' was not declared. Should it be static?

This symbol is not used outside of percpu.c, so mark it static.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-05-14 20:57:54 +00:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Roman Gushchin f183324133 percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.

The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.

This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:
--

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    mkdir percpu_test/cg_"${i}"
    for j in `seq 1 10`; do
	mkdir percpu_test/cg_"${i}"_"${j}"
    done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    for j in `seq 1 10`; do
	rmdir percpu_test/cg_"${i}"_"${j}"
    done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
  vanilla:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481152 kB
    Percpu:           481152 kB

  with this patchset applied:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481408 kB
    Percpu:           135552 kB

The total size of the percpu memory was reduced by more than 3.5 times.

This patch:

This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.

The depopulation is scheduled on the free path if the chunk is all of
the following:
  1) has more than 1/4 of total pages free and populated
  2) the system has enough free percpu pages aside of this chunk
  3) isn't the reserved chunk
  4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.

pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can satisfy a new allocation, sidelined chunks are first
checked before creating a new chunk.

Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21 18:17:40 +00:00
Dennis Zhou 1c29a3ceaf percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
This prepares for adding a to_depopulate list and sidelined list after
the free slot in the set of lists in pcpu_slot.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21 18:17:40 +00:00
Roman Gushchin 8ea2e1e35d percpu: factor out pcpu_check_block_hint()
Factor out the pcpu_check_block_hint() helper, which will be useful
in the future. The new function checks if the allocation can likely
fit within the contig hint.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21 18:17:35 +00:00
Roman Gushchin 67c2669d69 percpu: split __pcpu_balance_workfn()
__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excessive free chunks and the population of the
necessary amount of free pages.

In order to simplify the code and prepare for adding of a new
functionality, split it in two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-16 20:57:59 +00:00
Roman Gushchin ac9380f6b8 percpu: fix a comment about the chunks ordering
Since the commit 3e54097beb ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-16 20:57:49 +00:00
Roman Gushchin 0760fa3d8f percpu: make pcpu_nr_empty_pop_pages per chunk type
nr_empty_pop_pages is used to guarantee that there are some free
populated pages to satisfy atomic allocations. Accounted and
non-accounted allocations are using separate sets of chunks,
so both need to have a surplus of empty pages.

This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
per chunk type.

[Dennis]
This issue came up as I was reviewing [1] and realized I missed this.
Simultaneously, it was reported btrfs was seeing failed atomic
allocations in fsstress tests [2] and [3].

[1] https://lore.kernel.org/linux-mm/20210324190626.564297-1-guro@fb.com/
[2] https://lore.kernel.org/linux-mm/20210401185158.3275.409509F4@e16-tech.com/
[3] https://lore.kernel.org/linux-mm/CAL3q7H5RNBjCi708GH7jnczAOe0BLnacT9C+OBgA-Dx9jhB6SQ@mail.gmail.com/

Fixes: 3c7be18ac9 ("mm: memcg/percpu: account percpu memory to memory cgroups")
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Roman Gushchin <guro@fb.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-09 13:58:38 +00:00
Dennis Zhou 258e0815e2 percpu: fix clang modpost section mismatch
pcpu_build_alloc_info() is an __init function that makes a call to
cpumask_clear_cpu(). With CONFIG_GCOV_PROFILE_ALL enabled, the inline
heuristics are modified and such cpumask_clear_cpu() which is marked
inline doesn't get inlined. Because it works on mask in __initdata,
modpost throws a section mismatch error.

Arnd sent a patch with the flatten attribute as an alternative [2]. I've
added it to compiler_attributes.h.

modpost complaint:
  WARNING: modpost: vmlinux.o(.text+0x735425): Section mismatch in reference from the function cpumask_clear_cpu() to the variable .init.data:pcpu_build_alloc_info.mask
  The function cpumask_clear_cpu() references
  the variable __initdata pcpu_build_alloc_info.mask.
  This is often because cpumask_clear_cpu lacks a __initdata
  annotation or the annotation of pcpu_build_alloc_info.mask is wrong.

clang output:
  mm/percpu.c:2724:5: remark: cpumask_clear_cpu not inlined into pcpu_build_alloc_info because too costly to inline (cost=725, threshold=325) [-Rpass-missed=inline]

[1] https://lore.kernel.org/linux-mm/202012220454.9F6Bkz9q-lkp@intel.com/
[2] https://lore.kernel.org/lkml/CAK8P3a2ZWfNeXKSm8K_SUhhwkor17jFo3xApLXjzfPqX0eUDUA@mail.gmail.com/

Reported-by: kernel test robot <lkp@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-02-14 18:15:15 +00:00
Wonhyuk Yang d7d29ac76f percpu: reduce the number of cpu distance comparisons
To build group_map[] and group_cnt[], we find out which group a CPU
belongs to by comparing CPU distances. However, this includes cases
where comparisons are not required.

This patch uses a bitmap to record CPUs that have not yet been
classified into a group. CPUs whose group is already known are
cleared from the bitmap. As a result, we can reduce the number of
unnecessary comparisons.

Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
[Dennis: added cpumask_clear() call and #include cpumask.h.]
2021-02-14 17:34:05 +00:00
Dennis Zhou 61cf93d3e1 percpu: convert flexible array initializers to use struct_size()
Use the safer macro as sparked by the long discussion in [1].

[1] https://lore.kernel.org/lkml/20200917204514.GA2880159@google.com/
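
A hedged sketch of the conversion for a flexible-array size:

    /* before: open-coded, can overflow */
    sizeof(*ai) + nr_groups * sizeof(ai->groups[0])

    /* after: overflow-checked by struct_size() */
    struct_size(ai, groups, nr_groups)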

Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2020-10-30 23:02:28 +00:00
Roman Gushchin 279c3393e2 mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()
Patch series "mm: kmem: kernel memory accounting in an interrupt context".

This patchset implements memcg-based memory accounting of allocations made
from an interrupt context.

Historically, such allocations were passed unaccounted mostly because
charging the memory cgroup of the current process wasn't an option.  Also
performance reasons were likely a reason too.

The remote charging API allows to temporarily overwrite the currently
active memory cgroup, so that all memory allocations are accounted towards
some specified memory cgroup instead of the memory cgroup of the current
process.

This patchset extends the remote charging API so that it can be used from
an interrupt context.  Then it removes the fence that prevented the
accounting of allocations made from an interrupt context.  It also
contains a couple of optimizations/code refactorings.

This patchset doesn't directly enable accounting for any specific
allocations, but prepares the code base for it.  The bpf memory accounting
will likely be the first user of it: a typical example is a bpf program
parsing an incoming network packet, which allocates an entry in hashmap
map to store some information.

This patch (of 4):

Currently memcg_kmem_bypass() is called before obtaining the current
memory/obj cgroup using get_mem/obj_cgroup_from_current().  Moving
memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
number of call sites and allows further code simplifications.
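
A hedged sketch of the move:

    /* before: repeated at every call site */
    if (memcg_kmem_bypass())
            return NULL;
    objcg = get_obj_cgroup_from_current();

    /* after: checked once inside the helper */
    struct obj_cgroup *get_obj_cgroup_from_current(void)
    {
            if (memcg_kmem_bypass())
                    return NULL;
            /* ... */
    }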

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Sunghyun Jin b3b33d3c43 percpu: fix first chunk size calculation for populated bitmap
The variable 'populated', which is a member of struct pcpu_chunk, is a
bitmap sized in units of unsigned long. However, the size of 'populated'
is miscounted. So, I fix this minor part.

Fixes: 8ab16c43ea ("percpu: change the number of pages marked in the first_chunk pop bitmap")
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Sunghyun Jin <mcsmonk@gmail.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2020-09-17 17:34:39 +00:00
Roman Gushchin 772616b031 mm: memcg/percpu: per-memcg percpu memory statistics
Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs.  Let's track
percpu memory usage for each memcg and display it in memory.stat.

A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page.  So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.

[guro@fb.com: v3]
  Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:55 -07:00
Roman Gushchin 3c7be18ac9 mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow.  Let's introduce two types of percpu
chunks: root and memcg.  What makes memcg chunks different is an
additional space allocated to store memcg membership information.  If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
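
A hedged sketch of what distinguishes a memcg chunk:

    struct pcpu_chunk {
            /* ... */
    #ifdef CONFIG_MEMCG_KMEM
            struct obj_cgroup **obj_cgroups; /* one slot per object */
    #endif
            /* ... */
    };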

[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
  Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:55 -07:00
Roman Gushchin 5b32af91b5 percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.

This patchset adds percpu memory accounting to memory cgroups.  It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis.  So the per-object tracking
introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks).  These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object.  On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk.  Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation).  The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset.  On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged.  Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.

To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter.  The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it.  Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead.  This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.

This patch (of 5):

To implement accounting of percpu memory we need the information about the
size of freed object.  Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:55 -07:00
Kees Cook 3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
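
What the script rewrites, as a hedged sketch:

    /* before: expands to "int ret = ret;", silencing the warning */
    int uninitialized_var(ret);

    /* after */
    int ret;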

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00