Commit Graph

43 Commits

Author SHA1 Message Date
Viktor Malik 3552b1ff33
bpf: fix order of args in call to bpf_map_kvcalloc
JIRA: https://issues.redhat.com/browse/RHEL-30773

commit af253aef183a31ce62d2e39fc520b0ebfb562bb9
Author: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com>
Date:   Wed Jul 10 12:05:22 2024 +0200

    bpf: fix order of args in call to bpf_map_kvcalloc

    The original function call passed the size of smap->bucket before the number
    of buckets, which raises the 'calloc-transposed-args' error on compilation.

    Vlastimil Babka added:

    The order of parameters can be traced back all the way to 6ac99e8f23
    ("bpf: Introduce bpf sk local storage") across several refactorings,
    and that's why the commit is used as a Fixes: tag.

    In v6.10-rc1, a different commit 2c321f3f70bc ("mm: change inlined
    allocation helpers to account at the call site") however exposed the
    order of args in a way that gcc-14 has enough visibility to start
    warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is
    then a macro alias for kvcalloc instead of a static inline wrapper.

    To sum up, the warning happens when the following conditions are all met:

    - gcc-14 is used (didn't see it with gcc-13)
    - commit 2c321f3f70bc is present
    - CONFIG_MEMCG is not enabled in .config
    - CONFIG_WERROR turns this from a compiler warning to error

    Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
    Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
    Tested-by: Christian Kujau <lists@nerdbynature.de>
    Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Link: https://lore.kernel.org/r/20240710100521.15061-2-vbabka@suse.cz
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-11-19 07:40:50 +01:00
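For context, the fix amounts to swapping the two middle arguments of the call in bpf_local_storage_map_alloc(); a rough sketch of the corrected call (paraphrased, not the verbatim upstream diff):

    /* bpf_map_kvcalloc(map, n, size, flags) mirrors kvcalloc(n, size, flags),
     * so the bucket count must come before the bucket size.
     */
    smap->buckets = bpf_map_kvcalloc(&smap->map, nbuckets,
                                     sizeof(*smap->buckets),
                                     GFP_USER | __GFP_NOWARN);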
Viktor Malik 013281ada1
bpf: Fix typos in comments
JIRA: https://issues.redhat.com/browse/RHEL-30773

commit a7de265cb2d849f8986a197499ad58dca0a4f209
Author: Rafael Passos <rafael@rcpassos.me>
Date:   Wed Apr 17 15:49:14 2024 -0300

    bpf: Fix typos in comments
    
    Found the following typos in comments, and fixed them:
    
    s/unpriviledged/unprivileged/
    s/reponsible/responsible/
    s/possiblities/possibilities/
    s/Divison/Division/
    s/precsion/precision/
    s/havea/have a/
    s/reponsible/responsible/
    s/responsibile/responsible/
    s/tigher/tighter/
    s/respecitve/respective/
    
    Signed-off-by: Rafael Passos <rafael@rcpassos.me>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/6af7deb4-bb24-49e8-b3f1-8dd410597337@smtp-relay.sendinblue.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-11-11 07:44:49 +01:00
Jerome Marchand 4acbcbe529 bpf: Allow compiler to inline most of bpf_local_storage_lookup()
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit 68bc61c26cacf152baf905786b5949769700f40d
Author: Marco Elver <elver@google.com>
Date:   Wed Feb 7 13:26:17 2024 +0100

    bpf: Allow compiler to inline most of bpf_local_storage_lookup()

    In various performance profiles of kernels with BPF programs attached,
    bpf_local_storage_lookup() appears as a significant portion of CPU
    cycles spent. To enable the compiler to generate more optimal code, turn
    bpf_local_storage_lookup() into a static inline function, where only the
    cache insertion code path is outlined.

    Notably, outlining cache insertion helps avoid bloating callers by
    duplicating the setup of calls to raw_spin_lock_irqsave() and
    raw_spin_unlock_irqrestore() (on architectures which do not inline
    spin_lock/unlock, such as x86), which would cause the compiler to produce
    worse code by deciding to outline otherwise inlinable functions. The call
    overhead is neutral, because we make 2 calls either way: either calling
    raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(); or calling
    __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(),
    followed by a tail-call to raw_spin_unlock_irqrestore() where the compiler
    can perform TCO and (in optimized uninstrumented builds) turns it into a
    plain jump. The call to
    __bpf_local_storage_insert_cache() can be elided entirely if
    cacheit_lockit is a false constant expression.

    Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
    reboot between each trial; x86 defconfig + BPF, clang 16) this produces
    improvements in throughput and latency in the majority of cases, with an
    average (geomean) improvement of 8%:

    +---- Hashmap Control --------------------
    |
    | + num keys: 10
    | :                                         <before>             | <after>
    | +-+ hashmap (control) sequential get    +----------------------+----------------------
    |   +- hits throughput                    | 14.789 M ops/s       | 14.745 M ops/s (  ~  )
    |   +- hits latency                       | 67.679 ns/op         | 67.879 ns/op   (  ~  )
    |   +- important_hits throughput          | 14.789 M ops/s       | 14.745 M ops/s (  ~  )
    |
    | + num keys: 1000
    | :                                         <before>             | <after>
    | +-+ hashmap (control) sequential get    +----------------------+----------------------
    |   +- hits throughput                    | 12.233 M ops/s       | 12.170 M ops/s (  ~  )
    |   +- hits latency                       | 81.754 ns/op         | 82.185 ns/op   (  ~  )
    |   +- important_hits throughput          | 12.233 M ops/s       | 12.170 M ops/s (  ~  )
    |
    | + num keys: 10000
    | :                                         <before>             | <after>
    | +-+ hashmap (control) sequential get    +----------------------+----------------------
    |   +- hits throughput                    | 7.220 M ops/s        | 7.204 M ops/s  (  ~  )
    |   +- hits latency                       | 138.522 ns/op        | 138.842 ns/op  (  ~  )
    |   +- important_hits throughput          | 7.220 M ops/s        | 7.204 M ops/s  (  ~  )
    |
    | + num keys: 100000
    | :                                         <before>             | <after>
    | +-+ hashmap (control) sequential get    +----------------------+----------------------
    |   +- hits throughput                    | 5.061 M ops/s        | 5.165 M ops/s  (+2.1%)
    |   +- hits latency                       | 198.483 ns/op        | 194.270 ns/op  (-2.1%)
    |   +- important_hits throughput          | 5.061 M ops/s        | 5.165 M ops/s  (+2.1%)
    |
    | + num keys: 4194304
    | :                                         <before>             | <after>
    | +-+ hashmap (control) sequential get    +----------------------+----------------------
    |   +- hits throughput                    | 2.864 M ops/s        | 2.882 M ops/s  (  ~  )
    |   +- hits latency                       | 365.220 ns/op        | 361.418 ns/op  (-1.0%)
    |   +- important_hits throughput          | 2.864 M ops/s        | 2.882 M ops/s  (  ~  )
    |
    +---- Local Storage ----------------------
    |
    | + num_maps: 1
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 33.005 M ops/s       | 39.068 M ops/s (+18.4%)
    |   +- hits latency                       | 30.300 ns/op         | 25.598 ns/op   (-15.5%)
    |   +- important_hits throughput          | 33.005 M ops/s       | 39.068 M ops/s (+18.4%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 37.151 M ops/s       | 44.926 M ops/s (+20.9%)
    |   +- hits latency                       | 26.919 ns/op         | 22.259 ns/op   (-17.3%)
    |   +- important_hits throughput          | 37.151 M ops/s       | 44.926 M ops/s (+20.9%)
    |
    | + num_maps: 10
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 32.288 M ops/s       | 38.099 M ops/s (+18.0%)
    |   +- hits latency                       | 30.972 ns/op         | 26.248 ns/op   (-15.3%)
    |   +- important_hits throughput          | 3.229 M ops/s        | 3.810 M ops/s  (+18.0%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 34.473 M ops/s       | 41.145 M ops/s (+19.4%)
    |   +- hits latency                       | 29.010 ns/op         | 24.307 ns/op   (-16.2%)
    |   +- important_hits throughput          | 12.312 M ops/s       | 14.695 M ops/s (+19.4%)
    |
    | + num_maps: 16
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 32.524 M ops/s       | 38.341 M ops/s (+17.9%)
    |   +- hits latency                       | 30.748 ns/op         | 26.083 ns/op   (-15.2%)
    |   +- important_hits throughput          | 2.033 M ops/s        | 2.396 M ops/s  (+17.9%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 34.575 M ops/s       | 41.338 M ops/s (+19.6%)
    |   +- hits latency                       | 28.925 ns/op         | 24.193 ns/op   (-16.4%)
    |   +- important_hits throughput          | 11.001 M ops/s       | 13.153 M ops/s (+19.6%)
    |
    | + num_maps: 17
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 28.861 M ops/s       | 32.756 M ops/s (+13.5%)
    |   +- hits latency                       | 34.649 ns/op         | 30.530 ns/op   (-11.9%)
    |   +- important_hits throughput          | 1.700 M ops/s        | 1.929 M ops/s  (+13.5%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 31.529 M ops/s       | 36.110 M ops/s (+14.5%)
    |   +- hits latency                       | 31.719 ns/op         | 27.697 ns/op   (-12.7%)
    |   +- important_hits throughput          | 9.598 M ops/s        | 10.993 M ops/s (+14.5%)
    |
    | + num_maps: 24
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 18.602 M ops/s       | 19.937 M ops/s (+7.2%)
    |   +- hits latency                       | 53.767 ns/op         | 50.166 ns/op   (-6.7%)
    |   +- important_hits throughput          | 0.776 M ops/s        | 0.831 M ops/s  (+7.2%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 21.718 M ops/s       | 23.332 M ops/s (+7.4%)
    |   +- hits latency                       | 46.047 ns/op         | 42.865 ns/op   (-6.9%)
    |   +- important_hits throughput          | 6.110 M ops/s        | 6.564 M ops/s  (+7.4%)
    |
    | + num_maps: 32
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 14.118 M ops/s       | 14.626 M ops/s (+3.6%)
    |   +- hits latency                       | 70.856 ns/op         | 68.381 ns/op   (-3.5%)
    |   +- important_hits throughput          | 0.442 M ops/s        | 0.458 M ops/s  (+3.6%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 17.111 M ops/s       | 17.906 M ops/s (+4.6%)
    |   +- hits latency                       | 58.451 ns/op         | 55.865 ns/op   (-4.4%)
    |   +- important_hits throughput          | 4.776 M ops/s        | 4.998 M ops/s  (+4.6%)
    |
    | + num_maps: 100
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 5.281 M ops/s        | 5.528 M ops/s  (+4.7%)
    |   +- hits latency                       | 192.398 ns/op        | 183.059 ns/op  (-4.9%)
    |   +- important_hits throughput          | 0.053 M ops/s        | 0.055 M ops/s  (+4.9%)
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 6.265 M ops/s        | 6.498 M ops/s  (+3.7%)
    |   +- hits latency                       | 161.436 ns/op        | 152.877 ns/op  (-5.3%)
    |   +- important_hits throughput          | 1.636 M ops/s        | 1.697 M ops/s  (+3.7%)
    |
    | + num_maps: 1000
    | :                                         <before>             | <after>
    | +-+ local_storage cache sequential get  +----------------------+----------------------
    |   +- hits throughput                    | 0.355 M ops/s        | 0.354 M ops/s  (  ~  )
    |   +- hits latency                       | 2826.538 ns/op       | 2827.139 ns/op (  ~  )
    |   +- important_hits throughput          | 0.000 M ops/s        | 0.000 M ops/s  (  ~  )
    | :
    | :                                         <before>             | <after>
    | +-+ local_storage cache interleaved get +----------------------+----------------------
    |   +- hits throughput                    | 0.404 M ops/s        | 0.403 M ops/s  (  ~  )
    |   +- hits latency                       | 2481.190 ns/op       | 2487.555 ns/op (  ~  )
    |   +- important_hits throughput          | 0.102 M ops/s        | 0.101 M ops/s  (  ~  )

    The on_lookup test in {cgrp,task}_ls_recursion.c is removed
    because bpf_local_storage_lookup() is no longer traceable
    and adding a tracepoint would make the compiler generate worse
    code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/

    Signed-off-by: Marco Elver <elver@google.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:09 +02:00
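To make the split concrete, here is a heavily simplified sketch of the resulting shape (function and field names follow the commit text; the RCU-check details and error handling are assumptions and are omitted or abridged):

    /* Out of line: the only path that takes the storage spinlock. */
    void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
                                          struct bpf_local_storage_map *smap,
                                          struct bpf_local_storage_elem *selem);

    static inline struct bpf_local_storage_data *
    bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
                             struct bpf_local_storage_map *smap,
                             bool cacheit_lockit)
    {
        struct bpf_local_storage_data *sdata;
        struct bpf_local_storage_elem *selem;

        /* Fast path: per-owner cache hit, fully inlined into the caller. */
        sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
        if (sdata && rcu_access_pointer(sdata->smap) == smap)
            return sdata;

        /* Slow path: walk the owner's selem list. */
        hlist_for_each_entry_rcu(selem, &local_storage->list, snode)
            if (rcu_access_pointer(SDATA(selem)->smap) == smap)
                break;
        if (!selem)
            return NULL;

        /* Elided entirely when cacheit_lockit is a false constant expression. */
        if (cacheit_lockit)
            __bpf_local_storage_insert_cache(local_storage, smap, selem);

        return SDATA(selem);
    }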
Jerome Marchand 3ce6ddb3b3 bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 55d49f750b1cb1f177fb1b00ae02cba4613bcfb7
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Fri Sep 1 16:11:28 2023 -0700

    bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc

    The commit c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions
    for reuse") refactored bpf_{sk,task,inode}_storage_free() into
    bpf_local_storage_unlink_nolock(), which was later renamed to
    bpf_local_storage_destroy(). The commit accidentally passed the
    "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock()
    which then stopped the uncharge from happening to the sk->sk_omem_alloc.

    This missing uncharge only happens when the sk is going away (during
    __sk_destruct).

    This patch fixes it by always passing "uncharge_mem = true". It is a
    noop to the task/inode/cgroup storage because they do not have the
    map_local_storage_(un)charge enabled in the map_ops. A followup patch
    will be done in bpf-next to remove the uncharge_mem argument.

    A selftest is added in the next patch.

    Fixes: c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions for reuse")
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:03 +01:00
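The fix itself is a single argument flip in bpf_local_storage_destroy(); roughly (a sketch, not the verbatim diff; the reuse_now value shown is an assumption based on the destroy path described elsewhere in this log):

    /* Always uncharge. For task/inode/cgroup storage this is a no-op because
     * their map_ops have no map_local_storage_(un)charge, but for sk storage
     * it restores the missing sk->sk_omem_alloc uncharge during __sk_destruct.
     */
    free_storage = bpf_selem_unlink_storage_nolock(local_storage, selem,
                                                   /* uncharge_mem = */ true,
                                                   /* reuse_now    = */ true);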
Jerome Marchand 3e2b654be9 bpf: bpf_sk_storage: Fix invalid wait context lockdep report
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit a96a44aba556c42b432929d37d60158aca21ad4c
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Fri Sep 1 16:11:27 2023 -0700

    bpf: bpf_sk_storage: Fix invalid wait context lockdep report

    './test_progs -t test_local_storage' reported a splat:

    [   27.137569] =============================
    [   27.138122] [ BUG: Invalid wait context ]
    [   27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G           O
    [   27.139542] -----------------------------
    [   27.140106] test_progs/1729 is trying to lock:
    [   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
    [   27.141834] other info that might help us debug this:
    [   27.142437] context-{5:5}
    [   27.142856] 2 locks held by test_progs/1729:
    [   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
    [   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
    [   27.145855] stack backtrace:
    [   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b16b0a #247
    [   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    [   27.149127] Call Trace:
    [   27.149490]  <TASK>
    [   27.149867]  dump_stack_lvl+0x130/0x1d0
    [   27.152609]  dump_stack+0x14/0x20
    [   27.153131]  __lock_acquire+0x1657/0x2220
    [   27.153677]  lock_acquire+0x1b8/0x510
    [   27.157908]  local_lock_acquire+0x29/0x130
    [   27.159048]  obj_cgroup_charge+0xf4/0x3c0
    [   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
    [   27.161931]  __kmem_cache_alloc_node+0x51/0x210
    [   27.163557]  __kmalloc+0xaa/0x210
    [   27.164593]  bpf_map_kzalloc+0xbc/0x170
    [   27.165147]  bpf_selem_alloc+0x130/0x510
    [   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
    [   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
    [   27.169199]  bpf_map_update_value+0x415/0x4f0
    [   27.169871]  map_update_elem+0x413/0x550
    [   27.170330]  __sys_bpf+0x5e9/0x640
    [   27.174065]  __x64_sys_bpf+0x80/0x90
    [   27.174568]  do_syscall_64+0x48/0xa0
    [   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    [   27.175932] RIP: 0033:0x7effb40e41ad
    [   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
    [   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
    [   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
    [   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
    [   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
    [   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
    [   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
    [   27.184958]  </TASK>

    It complains about acquiring a local_lock while holding a raw_spin_lock.
    It means it should not allocate memory while holding a raw_spin_lock
    since it is not safe for RT.

    raw_spin_lock is needed because bpf_local_storage supports tracing
    context. In particular for task local storage, it is easy to
    get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
    However, task (and cgroup) local storage has already been moved to
    bpf mem allocator which can be used after raw_spin_lock.

    The splat is for the sk storage. For sk (and inode) storage,
    it has not been moved to bpf mem allocator. Using raw_spin_lock or not,
    kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
    However, the local storage helper requires a verifier-accepted
    sk pointer (PTR_TO_BTF_ID), so it is hypothetical whether that (meaning
    running a bpf prog in a kzalloc-unsafe context while also holding a
    verifier-accepted sk pointer) could happen.

    This patch avoids kzalloc after raw_spin_lock to silence the splat.
    There is an existing kzalloc before the raw_spin_lock. At that point,
    a kzalloc is very likely required because a lookup has just been done
    before. Thus, this patch always does the kzalloc before acquiring
    the raw_spin_lock and removes the later kzalloc usage after the
    raw_spin_lock. After this change, it will have a charge and then
    uncharge during the syscall bpf_map_update_elem() code path.
    This patch opts for simplicity and does not continue the old
    optimization of saving one charge and uncharge.

    This issue dates back to the very first commit of bpf_sk_storage
    which had been refactored multiple times to create task, inode, and
    cgroup storage. This patch uses a Fixes tag with a more recent
    commit that should be easier to backport.

    Fixes: b00fa38a9c1c ("bpf: Enable non-atomic allocations in local storage")
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:03 +01:00
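Conceptually the update path changes from "lock, then allocate" to "allocate, then lock"; a condensed sketch of the resulting ordering in bpf_local_storage_update() (simplified; the alloc_selem bookkeeping and error handling shown are assumptions about the exact shape):

    /* Allocate (and memcg-charge) the selem before taking the raw_spin_lock,
     * so no allocation happens inside the RT-unsafe locked region.
     */
    alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
    if (!alloc_selem)
        return ERR_PTR(-ENOMEM);

    raw_spin_lock_irqsave(&local_storage->lock, flags);
    /* ... link the pre-allocated selem; on failure, free and uncharge it
     * after dropping the lock ...
     */
    raw_spin_unlock_irqrestore(&local_storage->lock, flags);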
Viktor Malik daad46d419
bpf: Centralize permissions checks for all BPF map types
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 6c3eba1c5e283fd2bb1c076dbfcb47f569c3bfde
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jun 13 15:35:32 2023 -0700

    bpf: Centralize permissions checks for all BPF map types
    
    This allows making more centralized decisions later on, and generally
    makes it very explicit which maps are privileged and which are not
    (e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants,
    as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit
    and easy to verify).
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:20 +02:00
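Illustratively, the centralization boils down to one capability check in map_create() keyed on the map type, instead of per-type checks scattered across the map_alloc callbacks; a partial sketch covering only the map types named above (bucket placement per the commit message, everything else assumed):

    /* In map_create(), before the type-specific ops are invoked: */
    switch (attr->map_type) {
    case BPF_MAP_TYPE_HASH:
    case BPF_MAP_TYPE_PERCPU_HASH:
    case BPF_MAP_TYPE_ARRAY:
        /* unprivileged */
        break;
    case BPF_MAP_TYPE_LRU_HASH:
    case BPF_MAP_TYPE_LRU_PERCPU_HASH:
        /* privileged variants of the hash maps above */
        if (!bpf_capable())
            return -EPERM;
        break;
    /* ... all remaining map types handled explicitly ... */
    default:
        return -EINVAL;
    }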
Artem Savkov 2cc47c6a2e bpf: Handle NULL in bpf_local_storage_free.
Bugzilla: https://bugzilla.redhat.com/2221599

commit 10fd5f70c397782a97f411f25bfb312ea92b55bc
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Apr 12 10:12:52 2023 -0700

    bpf: Handle NULL in bpf_local_storage_free.
    
    During OOM, bpf_local_storage_alloc() may fail to allocate 'storage' and a
    call to bpf_local_storage_free() with a NULL pointer will cause a crash like:
    [ 271718.917646] BUG: kernel NULL pointer dereference, address: 00000000000000a0
    [ 271719.019620] RIP: 0010:call_rcu+0x2d/0x240
    [ 271719.216274]  bpf_local_storage_alloc+0x19e/0x1e0
    [ 271719.250121]  bpf_local_storage_update+0x33b/0x740
    
    Fixes: 7e30a8477b0b ("bpf: Add bpf_local_storage_free()")
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20230412171252.15635-1-alexei.starovoitov@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:29 +02:00
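The fix is an early NULL check; a minimal sketch (the surrounding signature reflects the bpf_ma/reuse_now parameters introduced by the series above and is an assumption here):

    static void bpf_local_storage_free(struct bpf_local_storage *local_storage,
                                       struct bpf_local_storage_map *smap,
                                       bool bpf_ma, bool reuse_now)
    {
        if (!local_storage)
            return;   /* bpf_local_storage_alloc() failed under OOM */

        /* ... existing call_rcu / bpf_mem_cache_free logic ... */
    }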
Artem Savkov b1533c4649 bpf: Use bpf_mem_cache_alloc/free for bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 6ae9d5e99e1dd26babdd9502759fa25a3fd348ad
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Wed Mar 22 14:52:44 2023 -0700

    bpf: Use bpf_mem_cache_alloc/free for bpf_local_storage
    
    This patch uses bpf_mem_cache_alloc/free for allocating and freeing
    bpf_local_storage for task and cgroup storage.
    
    The changes are similar to the previous patch. A few things
    worth mentioning for bpf_local_storage:
    
    The local_storage is freed when the last selem is deleted.
    Before deleting a selem from local_storage, it needs to retrieve the
    local_storage->smap because the bpf_selem_unlink_storage_nolock()
    may have set it to NULL. Note that local_storage->smap may have
    already been NULL when the selem created this local_storage has
    been removed. In this case, call_rcu will be used to free the
    local_storage.
    Also, the bpf_ma (true or false) value is needed before calling
    bpf_local_storage_free(). The bpf_ma can either be obtained from
    the local_storage->smap (if available) or any of its selem's smap.
    A new helper check_storage_bpf_ma() is added to obtain
    bpf_ma for a deleting bpf_local_storage.
    
    When bpf_local_storage_alloc() gets reused memory, all
    fields either already hold the correct values or will be initialized.
    'cache[]' must already be all NULLs. 'list' must be empty.
    Others will be initialized.
    
    Cc: Namhyung Kim <namhyung@kernel.org>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230322215246.1675516-4-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:23 +02:00
Artem Savkov 7b478e7484 bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem
Bugzilla: https://bugzilla.redhat.com/2221599

commit 08a7ce384e33e53e0732c500a8af67a73f8fceca
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Wed Mar 22 14:52:43 2023 -0700

    bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem
    
    This patch uses bpf_mem_alloc for the task and cgroup local storage, where
    the bpf prog can easily get hold of the storage owner's PTR_TO_BTF_ID.
    E.g. bpf_get_current_task_btf() can be used in some of the kmalloc code
    paths, which would cause deadlock/recursion. bpf_mem_cache_alloc is
    deadlock free and will solve a legit use case in [1].
    
    For sk storage, its batch creation benchmark shows a few percent
    regression when the sk create/destroy batch size is larger than 32.
    The sk creation/destruction happens much more often and
    depends on external traffic. Considering it is hypothetical
    to be able to cause deadlock with sk storage, it can cross
    the bridge to use bpf_mem_alloc till a legit (ie. useful)
    use case comes up.
    
    For inode storage, bpf_local_storage_destroy() is called before
    waiting for a rcu gp and its memory cannot be reused immediately.
    inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp.
    
    A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc().
    Only task and cgroup storage have 'bpf_ma == true' which
    means to use bpf_mem_cache_alloc/free(). This patch only changes
    selem to use bpf_mem_alloc for task and cgroup. The next patch
    will change the local_storage to use bpf_mem_alloc also for
    task and cgroup.
    
    Here is some more details on the changes:
    
    * memory allocation:
    After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because
    bpf_mem_cache_alloc() could return a reused selem. It is to keep
    the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data
    is zero-ed. SDATA(selem)->data is the visible part to the bpf prog.
    No need to use zero_map_value() to do the zeroing because
    bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using
    the selem before returning the selem through bpf_mem_cache_free().
    For the internal fields of selem, they will be initialized when
    linking to the new smap and the new local_storage.
    
    When 'bpf_ma == false', nothing changes in this patch. It will
    stay with the bpf_map_kzalloc().
    
    * memory free:
    The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle
    the bpf_ma == true case.
    
    For the common selem free path where its owner is also being destroyed,
    the mem is freed in bpf_local_storage_destroy(), the owner (task
    and cgroup) has gone through a rcu gp. The memory can be reused
    immediately, so bpf_local_storage_destroy() will call
    bpf_selem_free(..., reuse_now = true) which will do
    bpf_mem_cache_free() for immediate reuse consideration.
    
    An exception is the delete elem code path. The delete elem code path
    is called from the helper bpf_*_storage_delete() and the syscall
    bpf_map_delete_elem(). This path is an unusual case for local
    storage because the common use case is to have the local storage
    staying with its owner life time so that the bpf prog and the user
    space does not have to monitor the owner's destruction. For the delete
    elem path, the selem cannot be reused immediately because there could
    be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false)
    and it will wait for a rcu tasks trace gp before freeing the elem. The
    rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree().
    
    When 'bpf_ma == false', it should be the same as before.
    __bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu().
    A few words on the 'reuse_now == true'. When 'reuse_now == true',
    it is still racing with bpf_local_storage_map_free which is under rcu
    protection, so it still needs to wait for a rcu gp instead of kfree().
    Otherwise, the selem may be reused by slab for a totally different struct
    while the bpf_local_storage_map_free() is still using it (as a
    rcu reader). For the inode case, there may be other rcu readers also.
    In short, when bpf_ma == false and reuse_now == true => vanilla rcu.
    
    [1]: https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
    
    Cc: Namhyung Kim <namhyung@kernel.org>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230322215246.1675516-3-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:23 +02:00
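A rough sketch of the allocation-side branch described above (the selem_ma field name and the *_alloc_flags helper are assumptions; treat this as illustrative only):

    if (smap->bpf_ma) {
        migrate_disable();
        selem = bpf_mem_cache_alloc_flags(&smap->selem_ma, gfp_flags);
        migrate_enable();
        if (selem)
            /* Reused memory may hold stale data; only the part visible to
             * the bpf prog needs zeroing to match bpf_map_kzalloc().
             */
            memset(SDATA(selem)->data, 0, smap->map.value_size);
    } else {
        /* bpf_ma == false: unchanged kzalloc-based path. */
        selem = bpf_map_kzalloc(&smap->map, smap->elem_size,
                                gfp_flags | __GFP_NOWARN);
    }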
Artem Savkov 809b14ffd5 bpf: Add bpf_local_storage_free()
Bugzilla: https://bugzilla.redhat.com/2221599

commit 7e30a8477b0bdd13dfd0b24e4f32b26d22b96e6c
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:30 2023 -0800

    bpf: Add bpf_local_storage_free()
    
    This patch refactors local_storage freeing logic into
    bpf_local_storage_free(). It is a preparation work for a later
    patch that uses bpf_mem_cache_alloc/free. The other kfree(local_storage)
    cases are also changed to bpf_local_storage_free(..., reuse_now = true).
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-12-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov b739e20aa2 bpf: Add bpf_local_storage_rcu callback
Bugzilla: https://bugzilla.redhat.com/2221599

commit 1288aaa2786b1e58c9e88e53f7654d520ebe0f3b
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:29 2023 -0800

    bpf: Add bpf_local_storage_rcu callback
    
    The existing bpf_local_storage_free_rcu is renamed to
    bpf_local_storage_free_trace_rcu. A new bpf_local_storage_rcu
    callback is added to do the kfree instead of using kfree_rcu.
    It is a preparation work for a later patch using
    bpf_mem_cache_alloc/free.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-11-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov b6bb8e5ab4 bpf: Add bpf_selem_free()
Bugzilla: https://bugzilla.redhat.com/2221599

commit c0d63f309186d8492577c67c67984c714b6b72bc
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:28 2023 -0800

    bpf: Add bpf_selem_free()
    
    This patch refactors the selem freeing logic into bpf_selem_free().
    It is a preparation work for a later patch using
    bpf_mem_cache_alloc/free. The other kfree(selem) cases
    are also changed to bpf_selem_free(..., reuse_now = true).
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-10-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov 95c10c386f bpf: Add bpf_selem_free_rcu callback
Bugzilla: https://bugzilla.redhat.com/2221599

commit f8ccf30c179ec1ac16654f6e6ceb40cce1530b91
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:27 2023 -0800

    bpf: Add bpf_selem_free_rcu callback
    
    Add bpf_selem_free_rcu() callback to do the kfree() instead
    of using kfree_rcu. It is a preparation work for using
    bpf_mem_cache_alloc/free in a later patch.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-9-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov 54c0ab75e9 bpf: Remove bpf_selem_free_fields*_rcu
Bugzilla: https://bugzilla.redhat.com/2221599

commit c609981342dca634e5dea8c6ca175b6533581261
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:26 2023 -0800

    bpf: Remove bpf_selem_free_fields*_rcu
    
    This patch removes the bpf_selem_free_fields*_rcu. The
    bpf_obj_free_fields() can be done before the call_rcu_tasks_trace()
    and kfree_rcu(). It is needed when a later patch uses
    bpf_mem_cache_alloc/free. In bpf hashtab, bpf_obj_free_fields()
    is also called before calling bpf_mem_cache_free. The discussion
    can be found in
    https://lore.kernel.org/bpf/f67021ee-21d9-bfae-6134-4ca542fab843@linux.dev/
    
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-8-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov 6358cb43d4 bpf: Repurpose use_trace_rcu to reuse_now in bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2221599

commit a47eabf216f77cb6f22ceb38d46f1bb95968579c
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:25 2023 -0800

    bpf: Repurpose use_trace_rcu to reuse_now in bpf_local_storage
    
    This patch re-purposes use_trace_rcu to mean
    whether the freed memory can be reused immediately or not.
    use_trace_rcu is renamed to reuse_now. Other than
    the boolean test being reversed, it should be a no-op.
    
    The following explains the reason for the rename and how it will
    be used in a later patch.
    
    In a later patch, bpf_mem_cache_alloc/free will be used
    in the bpf_local_storage. The bpf mem allocator will reuse
    the freed memory immediately. Some of the free paths in
    bpf_local_storage do not support memory being reused immediately.
    These paths are the "delete" elem cases from the bpf_*_storage_delete()
    helper and the map_delete_elem() syscall. Note that "delete" elem
    before the owner's (sk/task/cgrp/inode) lifetime has ended is not
    the common usage for the local storage.
    
    The common free path, bpf_local_storage_destroy(), can reuse the
    memory immediately. This common path means the storage stays with
    its owner until the owner is destroyed.
    
    The above mentioned "delete" elem paths that cannot
    reuse immediately always have 'use_trace_rcu == true'.
    The cases that are safe for immediate reuse always have
    'use_trace_rcu == false'. Instead of adding another arg
    in a later patch, this patch re-purposes this arg
    as reuse_now and reverses the test logic.
    
    In a later patch, 'reuse_now == true' will free to the
    bpf_mem_cache_free() where the memory can be reused
    immediately. 'reuse_now == false' will go through the
    call_rcu_tasks_trace().
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-7-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
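Since the rename only inverts the boolean, the call sites change mechanically; a sketch of the two kinds of callers (helper names per the commit text):

    /* Owner teardown, bpf_local_storage_destroy(): the owner is gone, so the
     * memory may be handed back for immediate reuse (was use_trace_rcu = false).
     */
    bpf_selem_free(selem, smap, /* reuse_now = */ true);

    /* "delete" elem paths, i.e. the bpf_*_storage_delete() helper and the
     * map_delete_elem() syscall: a bpf prog may still be reading the selem,
     * so wait for the RCU tasks trace gp (was use_trace_rcu = true).
     */
    bpf_selem_free(selem, smap, /* reuse_now = */ false);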
Artem Savkov e7bb64f21d bpf: Remember smap in bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2221599

commit fc6652aab6ad545de70b772550da9043d0b47f1c
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:24 2023 -0800

    bpf: Remember smap in bpf_local_storage
    
    This patch remembers which smap triggers the allocation
    of a 'struct bpf_local_storage' object. The local_storage is
    allocated during the very first selem added to the owner.
    The smap pointer is needed when using the bpf_mem_cache_free
    in a later patch because it needs to free to the correct
    smap's bpf_mem_alloc object.
    
    When a selem is being removed, it needs to check if it is
    the selem that triggers the creation of the local_storage.
    If it is, the local_storage->smap pointer will be reset to NULL.
    This NULL reset is done under the local_storage->lock in
    bpf_selem_unlink_storage_nolock() when a selem is being removed.
    Also note that the local_storage may not go away even
    when local_storage->smap is NULL, because there may be other
    selems still stored in the local_storage.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-6-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
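In code terms this is a recorded pointer plus a reset under the lock; a condensed sketch (the __rcu-style accessors are an assumption):

    /* bpf_local_storage_alloc(): remember which smap triggered the allocation,
     * so a later bpf_mem_cache_free() can return the memory to the right
     * bpf_mem_alloc.
     */
    RCU_INIT_POINTER(storage->smap, smap);

    /* bpf_selem_unlink_storage_nolock(), under local_storage->lock: when the
     * selem that created the local_storage is removed, reset the pointer.
     * The local_storage itself may live on if other selems remain linked,
     * in which case the freeing path falls back to call_rcu.
     */
    RCU_INIT_POINTER(local_storage->smap, NULL);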
Artem Savkov f3db5f461d bpf: Remove the preceding __ from __bpf_selem_unlink_storage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 121f31f3e00dfc1acbca43f6f35779e050b56cfc
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:23 2023 -0800

    bpf: Remove the preceding __ from __bpf_selem_unlink_storage
    
    __bpf_selem_unlink_storage takes the spin lock and there is
    no name collision either. Having the preceding '__' is confusing
    when reviewing the later patch.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-5-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov a8bcc2f868 bpf: Remove __bpf_local_storage_map_alloc
Bugzilla: https://bugzilla.redhat.com/2221599

commit 62827d612ae525695799b3635a087cb49c55e977
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:22 2023 -0800

    bpf: Remove __bpf_local_storage_map_alloc
    
    bpf_local_storage_map_alloc() is the only caller of
    __bpf_local_storage_map_alloc().  The remaining logic in
    bpf_local_storage_map_alloc() is only a one liner setting
    the smap->cache_idx.
    
    Remove __bpf_local_storage_map_alloc() to simplify code.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-4-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov b0b486d336 bpf: Refactor codes into bpf_local_storage_destroy
Bugzilla: https://bugzilla.redhat.com/2221599

commit 2ffcb6fc50174d1efc8f98633eb2647d84483c68
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:21 2023 -0800

    bpf: Refactor codes into bpf_local_storage_destroy
    
    This patch first renames bpf_local_storage_unlink_nolock to
    bpf_local_storage_destroy(). It better reflects that it is only
    used when the storage's owner (sk/task/cgrp/inode) is being kfree()'d.
    
    All of bpf_local_storage_destroy's callers take the spin lock and
    then free the storage. This patch also moves these two steps into
    bpf_local_storage_destroy().
    
    This is a preparation work for a later patch that uses
    bpf_mem_cache_alloc/free in the bpf_local_storage.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-3-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov 3ccae517bb bpf: Move a few bpf_local_storage functions to static scope
Bugzilla: https://bugzilla.redhat.com/2221599

commit 4cbd23cc92c49173e402753cab62b8a7754ed18f
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Mar 7 22:59:20 2023 -0800

    bpf: Move a few bpf_local_storage functions to static scope
    
    This patch moves the bpf_local_storage_free_rcu() and
    bpf_selem_unlink_map() to static because they are
    not used outside of bpf_local_storage.c.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20230308065936.1550103-2-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:15 +02:00
Artem Savkov a62a4634aa bpf, net: bpf_local_storage memory usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 7490b7f1c02ef825ef98f7230662049d4a464a21
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:11 2023 +0000

    bpf, net: bpf_local_storage memory usage
    
    A new helper is introduced into bpf_local_storage map to calculate the
    memory usage. This helper is also used by other maps like
    bpf_cgrp_storage, bpf_inode_storage, bpf_task_storage, etc.
    
    Note that currently the dynamically allocated storage elements are not
    counted in the usage, since that would add extra runtime overhead in the
    element update or delete path. So let's put it aside for now, and implement
    it in the future when someone really needs it.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-15-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:13 +02:00
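A sketch of what such a map_mem_usage callback looks like for the local storage map, counting only the map struct and its bucket array per the note above (assumed shape, not the verbatim upstream code):

    static u64 bpf_local_storage_map_mem_usage(const struct bpf_map *map)
    {
        struct bpf_local_storage_map *smap =
            container_of(map, struct bpf_local_storage_map, map);
        u64 usage = sizeof(*smap);

        /* Dynamically allocated selems are deliberately not counted. */
        usage += sizeof(*smap->buckets) * (1ULL << smap->bucket_log);

        return usage;
    }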
Artem Savkov cb25976a9a bpf: Use separate RCU callbacks for freeing selem
Bugzilla: https://bugzilla.redhat.com/2221599

commit e768e3c5aab44ee63f58649d4c8cbbb3270e5c06
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Mar 3 15:15:42 2023 +0100

    bpf: Use separate RCU callbacks for freeing selem
    
    Martin suggested that instead of using a byte in the hole (which he has
    a use for in his future patch) in bpf_local_storage_elem, we can
    dispatch a different call_rcu callback based on whether we need to free
    special fields in bpf_local_storage_elem data. The free path, described
    in commit 9db44fdd8105 ("bpf: Support kptrs in local storage maps"),
    only waits for call_rcu callbacks when there are special (kptrs, etc.)
    fields in the map value, hence it is necessary that we only access
    smap in this case.
    
    Therefore, dispatch different RCU callbacks based on whether the BPF map
    has a valid btf_record, and dereference and use smap's btf_record only when
    it is valid.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230303141542.300068-1-memxor@gmail.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:10 +02:00
Artem Savkov ef8295203b bpf: Support kptrs in local storage maps
Bugzilla: https://bugzilla.redhat.com/2221599

commit 9db44fdd8105da00669d425acab887c668df75f6
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sat Feb 25 16:40:09 2023 +0100

    bpf: Support kptrs in local storage maps
    
    Enable support for kptrs in local storage maps by wiring up the freeing
    of these kptrs from map value. Freeing of bpf_local_storage_map is only
    delayed in case there are special fields, therefore bpf_selem_free_*
    path can also only dereference smap safely in that case. This is
    recorded using a bool utilizing a hole in bpf_local_storage_elem. It
    could have been tagged in the pointer value smap using the lowest bit
    (since alignment > 1), but since there was already a hole I went with
    the simpler option. Only the map structure freeing is delayed using RCU
    barriers, as the buckets aren't used when selem is being freed, so they
    can be freed once all readers of the bucket lists can no longer access
    it.
    
    Cc: Martin KaFai Lau <martin.lau@kernel.org>
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230225154010.391965-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Artem Savkov 7bd4d6c66e bpf: Annotate data races in bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 0a09a2f933c73dc76ab0b72da6855f44342a8903
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Feb 21 21:06:42 2023 +0100

    bpf: Annotate data races in bpf_local_storage
    
    There are a few cases where hlist_node is checked to be unhashed without
    holding the lock protecting its modification. In this case, one must use
    hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix
    this by using the lockless variant in places not protected by the lock.
    
    Since this is not prompted by any actual KCSAN reports but only from
    code review, I have not included a fixes tag.
    
    Cc: Martin KaFai Lau <martin.lau@kernel.org>
    Cc: KP Singh <kpsingh@kernel.org>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230221200646.2500777-4-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:06 +02:00
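The annotation pattern is small; a sketch of the kind of helper pair involved (names follow the upstream style but are assumed, not verbatim):

    /* Checked without holding the lock that protects hashing/unhashing, so
     * use the _lockless variant (a READ_ONCE-based check) to avoid load
     * tearing and KCSAN reports.
     */
    static bool selem_linked_to_storage_lockless(const struct bpf_local_storage_elem *selem)
    {
        return !hlist_unhashed_lockless(&selem->snode);
    }

    /* Lock-protected callers keep the plain check. */
    static bool selem_linked_to_storage(const struct bpf_local_storage_elem *selem)
    {
        return !hlist_unhashed(&selem->snode);
    }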
Viktor Malik 17c5fbcb2c bpf: use bpf_map_kvcalloc in bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2178930

commit ddef81b5fd1da4d7c3cc8785d2043b73b72f38ef
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Feb 10 15:47:32 2023 +0000

    bpf: use bpf_map_kvcalloc in bpf_local_storage
    
    Introduce a new helper bpf_map_kvcalloc() for the memory allocation in
    bpf_local_storage. Then the allocation will be charged to the map's memcg
    instead of to current's, though currently they are the same thing as the
    helper is only used in the map creation path now. Charging the map's memory
    to the memcg obtained from the map makes things clearer.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:27 +02:00
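A sketch of the memcg-aware helper in the CONFIG_MEMCG case (condensed; the bpf_map_get_memcg() step is assumed from the existing bpf_map_kzalloc-style helpers):

    void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
                           gfp_t flags)
    {
        struct mem_cgroup *memcg, *old_memcg;
        void *ptr;

        /* Charge the allocation to the map's memcg rather than current's. */
        memcg = bpf_map_get_memcg(map);
        old_memcg = set_active_memcg(memcg);
        ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
        set_active_memcg(old_memcg);
        mem_cgroup_put(memcg);

        return ptr;
    }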
Viktor Malik 46a4a194b1 bpf: Reduce smap->elem_size
Bugzilla: https://bugzilla.redhat.com/2178930

commit 552d42a356ebf78df9d2f4b73e077d2459966fac
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Dec 20 17:30:36 2022 -0800

    bpf: Reduce smap->elem_size
    
    'struct bpf_local_storage_elem' has an unused 56 byte padding at the
    end due to the struct's cache-line alignment requirement. This padding
    space is overlapped by storage value contents, so if we use sizeof()
    to calculate the total size, we overinflate it by 56 bytes. Use
    offsetof() instead to calculate more exact memory use.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20221221013036.3427431-1-martin.lau@linux.dev

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:26 +02:00
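The size computation becomes a one-liner based on offsetof(); roughly (sketch):

    /* End the element at the last value byte instead of at the cache-line
     * aligned end of struct bpf_local_storage_elem, saving the 56 padding
     * bytes per element.
     */
    smap->elem_size = offsetof(struct bpf_local_storage_elem,
                               sdata.data[attr->value_size]);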
Jerome Marchand 2b8a340165 bpf: Consolidate spin_lock, timer management into btf_record
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from already backported commit 997849c4b969
("bpf: Zeroing allocated object from slab in bpf memory allocator")

commit db559117828d2448fe81ada051c60bcf39f822e9
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:56 2022 +0530

    bpf: Consolidate spin_lock, timer management into btf_record

    Now that kptr_off_tab has been refactored into btf_record, and can hold
    more than one specific field type, accommodate bpf_spin_lock and
    bpf_timer as well.

    While they don't require any more metadata than offset, having all
    special fields in one place allows us to share the same code for
    allocated user defined types and handle both map values and these
    allocated objects in a similar fashion.

    As an optimization, we still keep spin_lock_off and timer_off offsets in
    the btf_record structure, just to avoid having to find the btf_field
    struct each time their offset is needed. This is mostly needed to
    manipulate such objects in a map value at runtime. It's ok to hardcode
    just one offset as more than one field is disallowed.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
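For orientation, the consolidated record keeps the cached offsets alongside the generic field array; a simplified sketch of the structure described above (field list abridged; details assumed):

    struct btf_record {
        u32 cnt;                 /* number of special fields */
        u32 field_mask;          /* which special field types are present */
        int spin_lock_off;       /* cached bpf_spin_lock offset (at most one) */
        int timer_off;           /* cached bpf_timer offset (at most one) */
        struct btf_field fields[];
    };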
Jerome Marchand 344e0ea3a2 bpf: Refactor some inode/task/sk storage functions for reuse
Bugzilla: https://bugzilla.redhat.com/2177177

commit c83597fa5dc6b322e9bdf929e5f4136a3f4aa4db
Author: Yonghong Song <yhs@fb.com>
Date:   Tue Oct 25 21:28:45 2022 -0700

    bpf: Refactor some inode/task/sk storage functions for reuse

    Refactor the code so that the inode/task/sk storage implementations
    can maximally share the same code. I also added some comments
    in the new function bpf_local_storage_unlink_nolock() to make
    the code easy to understand. There is no functionality change.

    Acked-by: David Vernet <void@manifault.com>
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221026042845.672944-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:58 +02:00
Jerome Marchand 675ec47664 bpf: Avoid taking spinlock in bpf_task_storage_get if potential deadlock is detected
Bugzilla: https://bugzilla.redhat.com/2177177

commit e8b02296a6b8d07de752d6157d863a642117bcd3
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Oct 25 11:45:19 2022 -0700

    bpf: Avoid taking spinlock in bpf_task_storage_get if potential deadlock is detected

    bpf_task_storage_get() does a lookup and optionally inserts
    new data if BPF_LOCAL_STORAGE_GET_F_CREATE is present.

    During lookup, it will cache the lookup result, and caching requires
    acquiring a spinlock.  When a potential deadlock is detected (by the
    bpf_task_storage_busy pcpu-counter added in
    commit bc235cdb42 ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")),
    the current behavior is returning NULL immediately to avoid deadlock.  It is
    too pessimistic.  This patch will go ahead to do a lookup (which is a
    lockless operation) but it will avoid caching it in order to avoid
    acquiring the spinlock.

    When lookup fails to find the data and BPF_LOCAL_STORAGE_GET_F_CREATE
    is set, an insertion is needed and this requires acquiring a spinlock.
    This patch will still return NULL when a potential deadlock is detected.

    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20221025184524.3526117-5-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:57 +02:00
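The behavioral change can be summarized in a couple of lines of the helper; a condensed sketch (variable names are assumptions):

    /* In bpf_task_storage_get(): */
    nobusy = bpf_task_storage_trylock();

    /* The lookup itself is lockless; only caching its result needs the
     * spinlock, so do the lookup either way and merely skip the caching
     * when the busy counter signals a potential deadlock.
     */
    sdata = task_storage_lookup(task, map, /* cacheit_lockit = */ nobusy);
    if (sdata)
        goto out;

    /* Insertion (BPF_LOCAL_STORAGE_GET_F_CREATE) still requires the lock,
     * so it continues to return NULL when the lock cannot be taken.
     */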
Jerome Marchand cf7c9b6723 bpf: Use rcu_trace_implies_rcu_gp() in local storage map
Bugzilla: https://bugzilla.redhat.com/2177177

commit d39d1445d37747032e2b26732fed6fe25161cd36
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Oct 14 19:39:45 2022 +0800

    bpf: Use rcu_trace_implies_rcu_gp() in local storage map

    Local storage map is accessible for both sleepable and non-sleepable bpf
    program, and its memory is freed by using both call_rcu_tasks_trace() and
    kfree_rcu() to wait for both RCU-tasks-trace grace period and RCU grace
    period to pass.

    With the introduction of rcu_trace_implies_rcu_gp(), both
    bpf_selem_free_rcu() and bpf_local_storage_free_rcu() can check whether
    or not a normal RCU grace period has also passed after a RCU-tasks-trace
    grace period has passed. If it is true, it is safe to call kfree()
    directly.

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20221014113946.965131-4-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:53 +02:00
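The resulting callback pattern looks roughly like this (sketch; the callback name follows the commit text):

    static void bpf_selem_free_rcu(struct rcu_head *rcu)
    {
        struct bpf_local_storage_elem *selem =
            container_of(rcu, struct bpf_local_storage_elem, rcu);

        /* If an RCU-tasks-trace grace period also implies a normal RCU grace
         * period on this kernel, skip the extra kfree_rcu() round trip.
         */
        if (rcu_trace_implies_rcu_gp())
            kfree(selem);
        else
            kfree_rcu(selem, rcu);
    }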
Artem Savkov 764948d90c bpf: Do not copy spin lock field from user in bpf_selem_alloc
Bugzilla: https://bugzilla.redhat.com/2166911

commit 836e49e103dfeeff670c934b7d563cbd982fce87
Author: Xu Kuohai <xukuohai@huawei.com>
Date:   Mon Nov 14 08:47:19 2022 -0500

    bpf: Do not copy spin lock field from user in bpf_selem_alloc
    
    bpf_selem_alloc function is used by inode_storage, sk_storage and
    task_storage maps to set map value, for these map types, there may
    be a spin lock in the map value, so if we use memcpy to copy the whole
    map value from user, the spin lock field may be initialized incorrectly.
    
    Since the spin lock field is zeroed by kzalloc, call copy_map_value
    instead of memcpy to skip copying the spin lock field to fix it.
    
    Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
    Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
    Link: https://lore.kernel.org/r/20221114134720.1057939-2-xukuohai@huawei.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:24 +01:00
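The fix replaces the raw memcpy of the whole value with the field-aware copy helper; roughly (sketch of the relevant lines in bpf_selem_alloc()):

    if (value)
        /* copy_map_value() skips the bpf_spin_lock field, leaving the
         * zero-initialization from kzalloc intact, instead of copying
         * whatever the user put at that offset.
         */
        copy_map_value(&smap->map, SDATA(selem)->data, value);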
Artem Savkov 34ce43379f bpf: Use this_cpu_{inc|dec|inc_return} for bpf_task_storage_busy
Bugzilla: https://bugzilla.redhat.com/2166911

commit 197827a05e13808c60f52632e9887eede63f1c16
Author: Hou Tao <houtao1@huawei.com>
Date:   Thu Sep 1 14:19:35 2022 +0800

    bpf: Use this_cpu_{inc|dec|inc_return} for bpf_task_storage_busy
    
    Now migrate_disable() does not disable preemption and under some
    architectures (e.g. arm64) __this_cpu_{inc|dec|inc_return} are neither
    preemption-safe nor IRQ-safe, so for fully preemptible kernel concurrent
    lookups or updates on the same task local storage and on the same CPU
    may make bpf_task_storage_busy be imbalanced, and
    bpf_task_storage_trylock() on the specific cpu will always fail.
    
    Fix it by using this_cpu_{inc|dec|inc_return} when manipulating
    bpf_task_storage_busy.
    
    Fixes: bc235cdb42 ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:06 +01:00
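A sketch of the trylock path after the change (the counter and helper names follow the task-storage code referenced above; treat the exact shape as an assumption):

    static bool bpf_task_storage_trylock(void)
    {
        migrate_disable();
        /* this_cpu_inc_return() is preemption- and IRQ-safe on all
         * architectures, unlike the __this_cpu_*() variants which assume
         * preemption is already disabled.
         */
        if (unlikely(this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
            this_cpu_dec(bpf_task_storage_busy);
            migrate_enable();
            return false;
        }
        return true;
    }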
Artem Savkov 954e5bcd83 bpf: Use bpf_map_area_alloc consistently on bpf map creation
Bugzilla: https://bugzilla.redhat.com/2166911

commit 73cf09a36bf7bfb3e5a3ff23755c36d49137c44d
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Aug 10 15:18:29 2022 +0000

    bpf: Use bpf_map_area_alloc consistently on bpf map creation
    
    Let's use the generic helper bpf_map_area_alloc() instead of the
    open-coded kzalloc helpers in the bpf map creation path.
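
    For the local storage maps this boils down to something like the
    following in the map allocation path (a sketch, not the exact hunk):

        /* before */
        smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
        /* after */
        smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE);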
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20220810151840.16394-5-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
Jerome Marchand 0e62381974 bpf: Enable non-atomic allocations in local storage
Bugzilla: https://bugzilla.redhat.com/2120966

commit b00fa38a9c1cba044a32a601b49a55a18ed719d1
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Thu Mar 17 21:55:52 2022 -0700

    bpf: Enable non-atomic allocations in local storage

    Currently, local storage memory can only be allocated atomically
    (GFP_ATOMIC). This restriction is too strict for sleepable bpf
    programs.

    In this patch, the verifier detects whether the program is sleepable,
    and passes the corresponding GFP_KERNEL or GFP_ATOMIC flag as a
    5th argument to bpf_task/sk/inode_storage_get. This flag will propagate
    down to the local storage functions that allocate memory.

    Please note that bpf_task/sk/inode_storage_update_elem functions are
    invoked by userspace applications through syscalls. Preemption is
    disabled before bpf_task/sk/inode_storage_update_elem is called, which
    means they will always have to allocate memory atomically.
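
    A condensed, illustrative sketch of the allocation path with the new
    gfp_t argument (memory charging and error handling elided; the real
    bpf_selem_alloc() in bpf_local_storage.c carries more detail):

        struct bpf_local_storage_elem *
        bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
                        void *value, bool charge_mem, gfp_t gfp_flags)
        {
                struct bpf_local_storage_elem *selem;

                /* gfp_flags is GFP_KERNEL only when the caller may sleep */
                selem = bpf_map_kzalloc(&smap->map, smap->elem_size, gfp_flags);
                if (!selem)
                        return NULL;
                if (value)
                        copy_map_value(&smap->map, SDATA(selem)->data, value);
                return selem;
        }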

    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: KP Singh <kpsingh@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220318045553.3091807-2-joannekoong@fb.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:06 +02:00
Jerome Marchand c5056baccd bpf: Cleanup comments
Bugzilla: https://bugzilla.redhat.com/2120966

commit c561d11063009323a0e57c528cb1d77b7d2c41e0
Author: Tom Rix <trix@redhat.com>
Date:   Sun Feb 20 10:40:55 2022 -0800

    bpf: Cleanup comments

    Add leading space to spdx tag
    Use // for spdx c file comment

    Replacements
    resereved to reserved
    inbetween to in between
    everytime to every time
    intutivie to intuitive
    currenct to current
    encontered to encountered
    referenceing to referencing
    upto to up to
    exectuted to executed

    Signed-off-by: Tom Rix <trix@redhat.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20220220184055.3608317-1-trix@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:51 +02:00
Artem Savkov 9c42002344 bpf: Fix usage of trace RCU in local storage.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit dcf456c9a095a6e71f53d6f6f004133ee851ee70
Author: KP Singh <kpsingh@kernel.org>
Date:   Mon Apr 18 15:51:58 2022 +0000

    bpf: Fix usage of trace RCU in local storage.

    bpf_{sk,task,inode}_storage_free() do not need to use
    call_rcu_tasks_trace as no BPF program should be accessing the owner
    as it's being destroyed. The only other reader at this point is
    bpf_local_storage_map_free() which uses normal RCU.

    The only paths that need trace RCU (see the sketch below) are:

    * bpf_local_storage_{delete,update} helpers
    * map_{delete,update}_elem() syscalls
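
    A sketch of the resulting distinction in the element free path, assuming
    the use_trace_rcu flag added by this patch:

        /* Only callers that can race with sleepable BPF programs need to
         * wait for an RCU-tasks-trace grace period before freeing.
         */
        if (use_trace_rcu)
                call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
        else
                kfree_rcu(selem, rcu);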

    Fixes: 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
    Signed-off-by: KP Singh <kpsingh@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220418155158.2865678-1-kpsingh@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:56 +02:00
Artem Savkov a3732d50aa bpf: Allow bpf_local_storage to be used by sleepable programs
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 0fe4b381a59ebc53522fce579b281a67a9e1bee6
Author: KP Singh <kpsingh@kernel.org>
Date:   Fri Dec 24 15:29:15 2021 +0000

    bpf: Allow bpf_local_storage to be used by sleepable programs

    Other maps like hashmaps are already available to sleepable programs.
    Sleepable BPF programs run under trace RCU. Allow task, sk and inode
    storage to be used from sleepable programs. This allows sleepable and
    non-sleepable programs to provide shareable annotations on kernel
    objects.

    Sleepable programs run under trace RCU, whereas non-sleepable programs
    run in a normal RCU critical section, i.e. __bpf_prog_enter{_sleepable}
    and __bpf_prog_exit{_sleepable} (rcu_read_lock or rcu_read_lock_trace).

    In order to make the local storage maps accessible to both sleepable
    and non-sleepable programs, one needs to call both
    call_rcu_tasks_trace and call_rcu to wait for both trace and classical
    RCU grace periods to expire before freeing memory.

    Paul's work on call_rcu_tasks_trace allows per-CPU queueing for
    call_rcu_tasks_trace. This behaviour can be enabled by setting the
    rcupdate.rcu_task_enqueue_lim=<num_cpus> boot parameter.

    In light of these new performance changes and to keep the local storage
    code simple, avoid adding a new flag for sleepable maps / local storage
    to select the RCU synchronization (trace / classical).

    Also, update the pointer dereferences to use rcu_dereference_check
    (valid with either the trace or the normal RCU lock held) via a common
    bpf_rcu_lock_held helper method.
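
    A sketch of the helper and one of its call sites (the cache field names
    are taken from bpf_local_storage and may differ slightly):

        static inline bool bpf_rcu_lock_held(void)
        {
                return rcu_read_lock_held() || rcu_read_lock_trace_held() ||
                       rcu_read_lock_bh_held();
        }

        /* example dereference site */
        sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
                                      bpf_rcu_lock_held());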

    Signed-off-by: KP Singh <kpsingh@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20211224152916.1550677-2-kpsingh@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:51 +02:00
Song Liu bc235cdb42 bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]
BPF helpers bpf_task_storage_[get|delete] could hold two locks:
bpf_local_storage_map_bucket->lock and bpf_local_storage->lock. Calling
these helpers from fentry/fexit programs on functions in bpf_*_storage.c
may cause a deadlock on either lock.

Prevent such deadlock with a per cpu counter, bpf_task_storage_busy. We
need this counter to be global, because the two locks here belong to two
different objects: bpf_local_storage_map and bpf_local_storage. If we
pick one of them as the owner of the counter, it is still possible to
trigger deadlock on the other lock. For example, if bpf_local_storage_map
owns the counters, it cannot prevent deadlock on bpf_local_storage->lock
when two maps are used.
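
A sketch of the per-cpu counter and trylock pattern (condensed from
bpf_task_storage.c; error paths and callers omitted):

	static DEFINE_PER_CPU(int, bpf_task_storage_busy);

	static bool bpf_task_storage_trylock(void)
	{
		migrate_disable();
		if (unlikely(__this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
			/* a helper on this CPU already holds one of the locks */
			__this_cpu_dec(bpf_task_storage_busy);
			migrate_enable();
			return false;
		}
		return true;
	}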

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210225234319.336131-3-songliubraving@fb.com
2021-02-26 11:51:48 -08:00
Song Liu a10787e6d5 bpf: Enable task local storage for tracing programs
To access per-task data, BPF programs usually create a hash table with
pid as the key. This is not ideal because:
 1. The user needs to estimate the proper size of the hash table, which may
    be inaccurate;
 2. Big hash tables are slow;
 3. To clean up the data properly during task termination, the user needs
    to write extra logic.

Task local storage overcomes these issues and offers a better option for
such per-task data. Task local storage has so far only been available to
BPF_LSM; now enable it for tracing programs as well.

Unlike LSM programs, tracing programs can be called in IRQ contexts.
Helpers that access task local storage are updated to use
raw_spin_lock_irqsave() instead of raw_spin_lock_bh().

Tracing programs can attach to functions on the task free path, e.g.
exit_creds(). To avoid allocating task local storage after
bpf_task_storage_free(), bpf_task_storage_get() is updated to not allocate
new storage when the task is not refcounted (task->usage == 0).
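
A condensed sketch of that guard inside the bpf_task_storage_get() helper
(surrounding locking and error handling omitted):

	/* Only create storage for a live, refcounted task; never allocate
	 * on the task free path (task->usage == 0).
	 */
	if (!sdata && refcount_read(&task->usage) &&
	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
		sdata = bpf_local_storage_update(task,
				(struct bpf_local_storage_map *)map,
				value, BPF_NOEXIST);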

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210225234319.336131-2-songliubraving@fb.com
2021-02-26 11:51:47 -08:00
Roman Gushchin ab31be378a bpf: Eliminate rlimit-based memory accounting for bpf local storage maps
Do not use rlimit-based memory accounting for bpf local storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin e9aae8beba bpf: Memcg-based memory accounting for bpf local storage maps
Account memory used by bpf local storage maps:
per-socket, per-inode and per-task storages.
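
In the local storage code this typically shows up as charging allocations
to the map's memory cgroup, e.g. by adding __GFP_ACCOUNT to the
allocation flags (an illustrative sketch, not the exact hunk):

	selem = kzalloc(smap->elem_size,
			GFP_ATOMIC | __GFP_NOWARN | __GFP_ACCOUNT);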

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-16-guro@fb.com
2020-12-02 18:32:45 -08:00
Martin KaFai Lau 70b971118e bpf: Use hlist_add_head_rcu when linking to local_storage
The local_storage->list can be traversed by RCU readers in parallel.
Thus, hlist_add_head_rcu() is needed in bpf_selem_link_storage_nolock().
This patch fixes it.
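
A sketch of the linking helper with the RCU-aware insertion (field names
as in bpf_local_storage.h; details condensed):

	void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
					   struct bpf_local_storage_elem *selem)
	{
		RCU_INIT_POINTER(selem->local_storage, local_storage);
		hlist_add_head_rcu(&selem->snode, &local_storage->list);
	}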

This part of the code has recently been refactored in bpf-next
and this patch makes changes to the new file "bpf_local_storage.c".
Instead of using the original offending commit in the Fixes tag,
the commit that created the file "bpf_local_storage.c" is used.

A separate fix has been provided to the bpf tree.

Fixes: 450af8d0f6 ("bpf: Split bpf_local_storage to bpf_sk_storage")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200916204453.2003915-1-kafai@fb.com
2020-09-19 01:12:35 +02:00
KP Singh 450af8d0f6 bpf: Split bpf_local_storage to bpf_sk_storage
A purely mechanical change:

	bpf_sk_storage.c = bpf_sk_storage.c + bpf_local_storage.c
	bpf_sk_storage.h = bpf_sk_storage.h + bpf_local_storage.h

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-5-kpsingh@chromium.org
2020-08-25 15:00:04 -07:00