Commit Graph

241 Commits

Author SHA1 Message Date
Artem Savkov 219ca146ad bpf: Add support for absolute value BPF timers
Bugzilla: https://bugzilla.redhat.com/2221599

commit f71f8530494bb5ab43d3369ef0ce8373eb1ee077
Author: Tero Kristo <tero.kristo@linux.intel.com>
Date:   Thu Mar 2 13:46:13 2023 +0200

    bpf: Add support for absolute value BPF timers
    
    Add a new flag BPF_F_TIMER_ABS that can be passed to bpf_timer_start()
    to start an absolute value timer instead of the default relative value.
    This makes the timer expire at an exact point in time, instead of a time
    with latencies induced by both the BPF and timer subsystems.
    
    Suggested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
    Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
    Link: https://lore.kernel.org/r/20230302114614.2985072-2-tero.kristo@linux.intel.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
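
A minimal, illustrative sketch (not part of the patch) of how a program might arm an absolute timer; it assumes a uapi linux/bpf.h that already carries BPF_F_TIMER_ABS plus libbpf's bpf_helpers.h:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    #define CLOCK_MONOTONIC 1    /* uapi clock id, matches bpf_ktime_get_ns() */

    struct map_elem {
        struct bpf_timer timer;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_elem);
    } timer_map SEC(".maps");

    static int timer_cb(void *map, int *key, struct map_elem *val)
    {
        return 0;
    }

    SEC("tc")
    int start_abs_timer(void *ctx)
    {
        struct map_elem *val;
        int key = 0;

        val = bpf_map_lookup_elem(&timer_map, &key);
        if (!val)
            return 0;

        bpf_timer_init(&val->timer, &timer_map, CLOCK_MONOTONIC);
        bpf_timer_set_callback(&val->timer, timer_cb);
        /* expire at an absolute CLOCK_MONOTONIC timestamp 1 ms from now,
         * instead of the default relative delay */
        bpf_timer_start(&val->timer, bpf_ktime_get_ns() + 1000000,
                        BPF_F_TIMER_ABS);
        return 0;
    }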

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:10 +02:00
Artem Savkov 74f2bb6c6c bpf: Make bpf_get_current_[ancestor_]cgroup_id() available for all program types
Bugzilla: https://bugzilla.redhat.com/2221599

commit c501bf55c88b834adefda870c7c092ec9052a437
Author: Tejun Heo <tj@kernel.org>
Date:   Thu Mar 2 09:42:59 2023 -1000

    bpf: Make bpf_get_current_[ancestor_]cgroup_id() available for all program types
    
    These helpers are safe to call from any context and there's no reason to
    restrict access to them. Remove them from bpf_trace and filter lists and add
    to bpf_base_func_proto() under perfmon_capable().
    
    v2: After consulting with Andrii, relocated in bpf_base_func_proto() so that
        they require bpf_capable() but not perfmon_capable() as they don't read
        from or affect others on the system.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/ZAD8QyoszMZiTzBY@slm.duckdns.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
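
As a sketch of what this unlocks (illustrative, not from the patch), a non-tracing program such as a tc classifier can now call these helpers directly, assuming the loader holds the capabilities described above:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    SEC("tc")
    int log_current_cgroup(struct __sk_buff *skb)
    {
        /* both helpers are now served from bpf_base_func_proto() */
        __u64 cgid = bpf_get_current_cgroup_id();
        __u64 ancestor = bpf_get_current_ancestor_cgroup_id(1);

        bpf_printk("cgroup %llu ancestor@1 %llu", cgid, ancestor);
        return 0;
    }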

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:10 +02:00
Artem Savkov db1ae29d75 bpf: Fix bpf_dynptr_slice{_rdwr} to return NULL instead of 0
Bugzilla: https://bugzilla.redhat.com/2221599

commit c45eac537bd8b4977d335c123212140bc5257670
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Wed Mar 1 21:30:14 2023 -0800

    bpf: Fix bpf_dynptr_slice{_rdwr} to return NULL instead of 0
    
    Change bpf_dynptr_slice and bpf_dynptr_slice_rdwr to return NULL instead
    of 0, in accordance with the codebase guidelines.
    
    Fixes: 66e3a13e7c2c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20230302053014.1726219-1-joannelkoong@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Artem Savkov d3a9527441 bpf: Fix doxygen comments for dynptr slice kfuncs
Bugzilla: https://bugzilla.redhat.com/2221599

commit 7ce60b110eece1d7b3d5c322fd11f6d41a29d17b
Author: David Vernet <void@manifault.com>
Date:   Wed Mar 1 13:49:09 2023 -0600

    bpf: Fix doxygen comments for dynptr slice kfuncs
    
    In commit 66e3a13e7c2c ("bpf: Add bpf_dynptr_slice and
    bpf_dynptr_slice_rdwr"), the bpf_dynptr_slice() and
    bpf_dynptr_slice_rdwr() kfuncs were added to BPF. These kfuncs included
    doxygen headers, but unfortunately those headers are not properly
    formatted according to [0], and cause the following warnings during the
    docs build:
    
    ./kernel/bpf/helpers.c:2225: warning: \
        Excess function parameter 'returns' description in 'bpf_dynptr_slice'
    ./kernel/bpf/helpers.c:2303: warning: \
        Excess function parameter 'returns' description in 'bpf_dynptr_slice_rdwr'
    ...
    
    This patch fixes those doxygen comments.
    
    [0]: https://docs.kernel.org/doc-guide/kernel-doc.html#function-documentation
    
    Fixes: 66e3a13e7c2c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20230301194910.602738-1-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Artem Savkov fdc30fa851 bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr
Bugzilla: https://bugzilla.redhat.com/2221599

commit 66e3a13e7c2c44d0c9dd6bb244680ca7529a8845
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Wed Mar 1 07:49:52 2023 -0800

    bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr
    
    Two new kfuncs are added, bpf_dynptr_slice and bpf_dynptr_slice_rdwr.
    The user must pass in a buffer to store the contents of the data slice
    if a direct pointer to the data cannot be obtained.
    
    For skb and xdp type dynptrs, these two APIs are the only way to obtain
    a data slice. However, for other types of dynptrs, there is no
    difference between bpf_dynptr_slice(_rdwr) and bpf_dynptr_data.
    
    For skb type dynptrs, the data is copied into the user provided buffer
    if any of the data is not in the linear portion of the skb. For xdp type
    dynptrs, the data is copied into the user provided buffer if the data is
    between xdp frags.
    
    If the skb is cloned and a call to bpf_dynptr_slice_rdwr is made, then
    the skb will be uncloned (see bpf_unclone_prologue()).
    
    Please note that any bpf_dynptr_write() automatically invalidates any prior
    data slices of the skb dynptr. This is because the skb may be cloned or
    may need to pull its paged buffer into the head. As such, any
    bpf_dynptr_write() will automatically have its prior data slices
    invalidated, even if the write is to data in the skb head of an uncloned
    skb. Please note as well that any other helper calls that change the
    underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
    slices of the skb dynptr as well, for the same reasons.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Link: https://lore.kernel.org/r/20230301154953.641654-10-joannelkoong@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
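
An illustrative sketch (not part of the patch) of reading a header through bpf_dynptr_slice(); the extern kfunc declarations below mirror what the selftests use and are assumptions here:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    char LICENSE[] SEC("license") = "GPL";

    /* kfunc declarations (assumed; mirroring the selftests) */
    extern int bpf_dynptr_from_skb(struct __sk_buff *skb, __u64 flags,
                                   struct bpf_dynptr *ptr__uninit) __ksym;
    extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, __u32 offset,
                                  void *buffer, __u32 buffer__szk) __ksym;

    SEC("tc")
    int parse_eth(struct __sk_buff *skb)
    {
        struct bpf_dynptr ptr;
        struct ethhdr buf, *eth;

        if (bpf_dynptr_from_skb(skb, 0, &ptr))
            return 0;

        /* eth points into the skb head when the bytes are linear,
         * otherwise they are copied into buf and eth points there */
        eth = bpf_dynptr_slice(&ptr, 0, &buf, sizeof(buf));
        if (!eth)
            return 0;

        bpf_printk("h_proto 0x%x", bpf_ntohs(eth->h_proto));
        return 0;
    }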

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:08 +02:00
Artem Savkov 1869c0c1bd bpf: Add xdp dynptrs
Bugzilla: https://bugzilla.redhat.com/2221599

commit 05421aecd4ed65da0dc17b0c3c13779ef334e9e5
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Wed Mar 1 07:49:51 2023 -0800

    bpf: Add xdp dynptrs
    
    Add xdp dynptrs, which are dynptrs whose underlying pointer points
    to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
    benefits. One is that they allow operations on sizes that are not
    statically known at compile-time (eg variable-sized accesses).
    Another is that parsing the packet data through dynptrs (instead of
    through direct access of xdp->data and xdp->data_end) can be more
    ergonomic and less brittle (eg does not need manual if checking for
    being within bounds of data_end).
    
    For reads and writes on the dynptr, this includes reading/writing
    from/to and across fragments. Data slices through the bpf_dynptr_data
    API are not supported; instead bpf_dynptr_slice() and
    bpf_dynptr_slice_rdwr() should be used.
    
    For examples of how xdp dynptrs can be used, please see the attached
    selftests.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Link: https://lore.kernel.org/r/20230301154953.641654-9-joannelkoong@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
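
A small usage sketch (not from the patch), assuming the bpf_dynptr_from_xdp() kfunc declaration as later provided by the selftests:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    /* kfunc declaration (assumed; mirroring the selftests) */
    extern int bpf_dynptr_from_xdp(struct xdp_md *xdp, __u64 flags,
                                   struct bpf_dynptr *ptr__uninit) __ksym;

    SEC("xdp")
    int xdp_read_eth(struct xdp_md *ctx)
    {
        struct bpf_dynptr ptr;
        struct ethhdr eth;

        if (bpf_dynptr_from_xdp(ctx, 0, &ptr))
            return XDP_PASS;

        /* unlike direct data/data_end access, this also works when the
         * header straddles xdp frags */
        if (bpf_dynptr_read(&eth, sizeof(eth), &ptr, 0, 0))
            return XDP_PASS;

        return XDP_PASS;
    }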

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:08 +02:00
Artem Savkov 6735ca36be bpf: Add skb dynptrs
Bugzilla: https://bugzilla.redhat.com/2221599

commit b5964b968ac64c2ec2debee7518499113b27c34e
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Wed Mar 1 07:49:50 2023 -0800

    bpf: Add skb dynptrs
    
    Add skb dynptrs, which are dynptrs whose underlying pointer points
    to a skb. The dynptr acts on skb data. skb dynptrs have two main
    benefits. One is that they allow operations on sizes that are not
    statically known at compile-time (eg variable-sized accesses).
    Another is that parsing the packet data through dynptrs (instead of
    through direct access of skb->data and skb->data_end) can be more
    ergonomic and less brittle (eg does not need manual if checking for
    being within bounds of data_end).
    
    For bpf prog types that don't support writes on skb data, the dynptr is
    read-only (bpf_dynptr_write() will return an error)
    
    For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
    interfaces, reading and writing from/to data in the head as well as from/to
    non-linear paged buffers is supported. Data slices through the
    bpf_dynptr_data API are not supported; instead bpf_dynptr_slice() and
    bpf_dynptr_slice_rdwr() (added in subsequent commit) should be used.
    
    For examples of how skb dynptrs can be used, please see the attached
    selftests.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Link: https://lore.kernel.org/r/20230301154953.641654-8-joannelkoong@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
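
A usage sketch (not from the patch) of the read/write interfaces on an skb dynptr, assuming the bpf_dynptr_from_skb() kfunc declaration as used by the selftests:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    /* kfunc declaration (assumed; mirroring the selftests) */
    extern int bpf_dynptr_from_skb(struct __sk_buff *skb, __u64 flags,
                                   struct bpf_dynptr *ptr__uninit) __ksym;

    SEC("tc")
    int touch_eth(struct __sk_buff *skb)
    {
        struct bpf_dynptr ptr;
        struct ethhdr eth;

        if (bpf_dynptr_from_skb(skb, 0, &ptr))
            return 0;

        /* reads work for linear head data and for paged (non-linear) data */
        if (bpf_dynptr_read(&eth, sizeof(eth), &ptr, 0, 0))
            return 0;

        /* writes return an error for prog types whose skb dynptr is read-only */
        bpf_dynptr_write(&ptr, 0, &eth, sizeof(eth), 0);
        return 0;
    }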

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:08 +02:00
Artem Savkov a1c7ecf835 bpf: Fix bpf_cgroup_from_id() doxygen header
Bugzilla: https://bugzilla.redhat.com/2221599

commit 30a2d8328d8ac1bb0a6bf73f4f4cf03f4f5977cc
Author: David Vernet <void@manifault.com>
Date:   Tue Feb 28 09:28:45 2023 -0600

    bpf: Fix bpf_cgroup_from_id() doxygen header
    
    In commit 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc"), a new
    bpf_cgroup_from_id() kfunc was added which allows a BPF program to
    lookup and acquire a reference to a cgroup from a cgroup id. The
    commit's doxygen comment seems to have copy-pasted fields, which causes
    BPF kfunc helper documentation to fail to render:
    
    <snip>/helpers.c:2114: warning: Excess function parameter 'cgrp'...
    <snip>/helpers.c:2114: warning: Excess function parameter 'level'...
    
    <snip>
    
    <snip>/helpers.c:2114: warning: Excess function parameter 'level'...
    
    This patch fixes the doxygen header.
    
    Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
    Signed-off-by: David Vernet <void@manifault.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20230228152845.294695-1-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:08 +02:00
Artem Savkov 00bc00c8ec bpf: Add bpf_cgroup_from_id() kfunc
Bugzilla: https://bugzilla.redhat.com/2221599

commit 332ea1f697be148bd5e66475d82b5ecc5084da65
Author: Tejun Heo <tj@kernel.org>
Date:   Wed Feb 22 15:29:12 2023 -1000

    bpf: Add bpf_cgroup_from_id() kfunc
    
    cgroup ID is a userspace-visible 64bit value uniquely identifying a given
    cgroup. As the IDs are used widely, it's useful to be able to look up the
    matching cgroups. Add bpf_cgroup_from_id().
    
    v2: Separate out selftest into its own patch as suggested by Alexei.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/Y/bBaG96t0/gQl9/@slm.duckdns.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
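
An illustrative sketch (not from the patch); the kfunc declarations and attach point follow the shape of the accompanying selftests and are assumptions here:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    extern struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
    extern void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

    SEC("tp_btf/task_newtask")
    int BPF_PROG(lookup_root_cgroup, struct task_struct *task, u64 clone_flags)
    {
        struct cgroup *cgrp;

        cgrp = bpf_cgroup_from_id(1);    /* cgroup id 1 is the root cgroup */
        if (!cgrp)
            return 0;

        bpf_printk("root cgroup level %d", cgrp->level);
        bpf_cgroup_release(cgrp);
        return 0;
    }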

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:07 +02:00
Viktor Malik b40d8b1fc1 bpf: Add bpf_rbtree_{add,remove,first} kfuncs
Bugzilla: https://bugzilla.redhat.com/2178930

commit bd1279ae8a691d7ec75852c6d0a22139afb034a4
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Mon Feb 13 16:40:11 2023 -0800

    bpf: Add bpf_rbtree_{add,remove,first} kfuncs
    
    This patch adds implementations of bpf_rbtree_{add,remove,first}
    and teaches verifier about their BTF_IDs as well as those of
    bpf_rb_{root,node}.
    
    All three kfuncs have some nonstandard component to their verification
    that needs to be addressed in future patches before programs can
    properly use them:
    
      * bpf_rbtree_add:     Takes 'less' callback, need to verify it
    
      * bpf_rbtree_first:   Returns ptr_to_node_type(off=rb_node_off) instead
                            of ptr_to_rb_node(off=0). Return value ref is
                            non-owning.
    
      * bpf_rbtree_remove:  Returns ptr_to_node_type(off=rb_node_off) instead
                            of ptr_to_rb_node(off=0). 2nd arg (node) is a
                            non-owning reference.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230214004017.2534011-3-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
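
A sketch of the intended usage once the rest of the series lands, adapted from the shape of the accompanying selftests; __contains, bpf_obj_new and the bpf_rbtree_* declarations are assumed to come from the selftests' bpf_experimental.h header:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include "bpf_experimental.h"    /* assumed selftests header */

    char LICENSE[] SEC("license") = "GPL";

    #ifndef container_of
    #define container_of(ptr, type, member) \
        ((type *)((void *)(ptr) - __builtin_offsetof(type, member)))
    #endif

    struct node_data {
        long key;
        struct bpf_rb_node node;
    };

    #define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
    private(A) struct bpf_spin_lock glock;
    private(A) struct bpf_rb_root groot __contains(node_data, node);

    static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
    {
        struct node_data *na = container_of(a, struct node_data, node);
        struct node_data *nb = container_of(b, struct node_data, node);

        return na->key < nb->key;
    }

    SEC("tc")
    int rbtree_add_one(void *ctx)
    {
        struct node_data *n;

        n = bpf_obj_new(typeof(*n));
        if (!n)
            return 0;
        n->key = 42;

        bpf_spin_lock(&glock);
        bpf_rbtree_add(&groot, &n->node, less);    /* the tree now owns n */
        bpf_spin_unlock(&glock);
        return 0;
    }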

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:29 +02:00
Viktor Malik 7e487d11fc bpf: Add basic bpf_rb_{root,node} support
Bugzilla: https://bugzilla.redhat.com/2178930

commit 9c395c1b99bd23f74bc628fa000480c49593d17f
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Mon Feb 13 16:40:10 2023 -0800

    bpf: Add basic bpf_rb_{root,node} support
    
    This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
    BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
    types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
    map_values.
    
    structs bpf_rb_root and bpf_rb_node are opaque types meant to
    obscure structs rb_root_cached and rb_node, respectively.
    
    btf_struct_access will prevent BPF programs from touching these special
    fields automatically now that they're recognized.
    
    btf_check_and_fixup_fields now groups list_head and rb_root together as
    "graph root" fields and {list,rb}_node as "graph node", and does same
    ownership cycle checking as before. Note that this function does _not_
    prevent ownership type mixups (e.g. rb_root owning list_node) - that's
    handled by btf_parse_graph_root.
    
    After this patch, a bpf program can have a struct bpf_rb_root in a
    map_value, but not add anything to nor do anything useful with it.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
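
For reference, a sketch of what such a map value could look like; the __contains() wrapper shown here matches the BTF declaration tag described above and is an assumption (the selftests later ship it in bpf_experimental.h):

    /* "contains:<type>:<node-field>" BTF declaration tag, wrapped for convenience */
    #define __contains(name, node) \
        __attribute__((btf_decl_tag("contains:" #name ":" #node)))

    struct node_data {
        long key;
        struct bpf_rb_node node;
    };

    struct map_value {
        struct bpf_spin_lock lock;
        struct bpf_rb_root root __contains(node_data, node);
    };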

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:29 +02:00
Viktor Malik 23c9904275 bpf: Add __bpf_kfunc tag to all kfuncs
Bugzilla: https://bugzilla.redhat.com/2178930

commit 400031e05adfcef9e80eca80bdfc3f4b63658be4
Author: David Vernet <void@manifault.com>
Date:   Wed Feb 1 11:30:15 2023 -0600

    bpf: Add __bpf_kfunc tag to all kfuncs

    Now that we have the __bpf_kfunc tag, we should add it to all
    existing kfuncs to ensure that they'll never be elided in LTO builds.

    Signed-off-by: David Vernet <void@manifault.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:20 +02:00
Viktor Malik 8a9358eb85 bpf: rename list_head -> graph_root in field info types
Bugzilla: https://bugzilla.redhat.com/2178930

commit 30465003ad776a922c32b2dac58db14f120f037e
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Sat Dec 17 00:24:57 2022 -0800

    bpf: rename list_head -> graph_root in field info types
    
    Many of the structs recently added to track field info for linked-list
    head are useful as-is for rbtree root. So let's do a mechanical renaming
    of list_head-related types and fields:
    
    include/linux/bpf.h:
      struct btf_field_list_head -> struct btf_field_graph_root
      list_head -> graph_root in struct btf_field union
    kernel/bpf/btf.c:
      list_head -> graph_root in struct btf_field_info
    
    This is a nonfunctional change, functionality to actually use these
    fields for rbtree will be added in further patches.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20221217082506.1570898-5-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:30 +02:00
Viktor Malik 1876dbfb9e bpf: Remove trace_printk_lock
Bugzilla: https://bugzilla.redhat.com/2178930

commit e2bb9e01d589f7fa82573aedd2765ff9b277816a
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Thu Dec 15 22:44:30 2022 +0100

    bpf: Remove trace_printk_lock
    
    Both the bpf_trace_printk and bpf_trace_vprintk helpers use a static buffer
    guarded by the trace_printk_lock spin lock.
    
    The spin lock contention causes issues with bpf programs attached to
    contention_begin tracepoint [1][2].
    
    Andrii suggested we could get rid of the contention by using trylock, but we
    could actually get rid of the spinlock completely by using percpu buffers the
    same way as for bin_args in bpf_bprintf_prepare function.
    
    Add a new return 'buf' argument to struct bpf_bprintf_data and make
    bpf_bprintf_prepare also return the buffer for printk helpers.
    
      [1] https://lore.kernel.org/bpf/CACkBjsakT_yWxnSWr4r-0TpPvbKm9-OBmVUhJb7hV3hY8fdCkw@mail.gmail.com/
      [2] https://lore.kernel.org/bpf/CACkBjsaCsTovQHFfkqJKto6S4Z8d02ud1D7MPESrHa1cVNNTrw@mail.gmail.com/
    
    Reported-by: Hao Sun <sunhao.th@gmail.com>
    Suggested-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20221215214430.1336195-4-jolsa@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:23 +02:00
Viktor Malik 5741f9f020 bpf: Do cleanup in bpf_bprintf_cleanup only when needed
Bugzilla: https://bugzilla.redhat.com/2178930

commit f19a4050455aad847fb93f18dc1fe502eb60f989
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Thu Dec 15 22:44:29 2022 +0100

    bpf: Do cleanup in bpf_bprintf_cleanup only when needed
    
    Currently we always cleanup/decrement bpf_bprintf_nest_level variable
    in bpf_bprintf_cleanup if it's > 0.
    
    There's possible scenario where this could cause a problem, when
    bpf_bprintf_prepare does not get bin_args buffer (because num_args is 0)
    and following bpf_bprintf_cleanup call decrements bpf_bprintf_nest_level
    variable, like:
    
      in task context:
        bpf_bprintf_prepare(num_args != 0) increments 'bpf_bprintf_nest_level = 1'
        -> first irq :
           bpf_bprintf_prepare(num_args == 0)
           bpf_bprintf_cleanup decrements 'bpf_bprintf_nest_level = 0'
        -> second irq:
           bpf_bprintf_prepare(num_args != 0) bpf_bprintf_nest_level = 1
           gets same buffer as task context above
    
    Add a check to bpf_bprintf_cleanup and do the real cleanup only if we
    got bin_args data in the first place.
    
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20221215214430.1336195-3-jolsa@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:23 +02:00
Viktor Malik e1af8144ba bpf: Add struct for bin_args arg in bpf_bprintf_prepare
Bugzilla: https://bugzilla.redhat.com/2178930

commit 78aa1cc9404399a15d2a1205329c6a06236f5378
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Thu Dec 15 22:44:28 2022 +0100

    bpf: Add struct for bin_args arg in bpf_bprintf_prepare
    
    Add struct bpf_bprintf_data to hold the bin_args argument for the
    bpf_bprintf_prepare function.
    
    We will add another return argument to bpf_bprintf_prepare and
    pass the struct to bpf_bprintf_cleanup for proper cleanup in
    following changes.
    
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20221215214430.1336195-2-jolsa@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:22 +02:00
Jerome Marchand 0de8fcfc64 bpf: Use memmove for bpf_dynptr_{read,write}
Bugzilla: https://bugzilla.redhat.com/2177177

commit 76d16077bef0954528ec3896710f9eda8b2b4db1
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Thu Dec 8 02:11:40 2022 +0530

    bpf: Use memmove for bpf_dynptr_{read,write}
    
    It may happen that destination buffer memory overlaps with memory dynptr
    points to. Hence, we must use memmove to correctly copy from dynptr to
    destination buffer, or source buffer to dynptr.
    
    This actually isn't a problem right now, as memcpy implementation falls
    back to memmove on detecting overlap and warns about it, but we
    shouldn't be relying on that.
    
    Acked-by: Joanne Koong <joannelkoong@gmail.com>
    Acked-by: David Vernet <void@manifault.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221207204141.308952-7-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:17 +02:00
Jerome Marchand a59af7f5dd bpf: Rework process_dynptr_func
Bugzilla: https://bugzilla.redhat.com/2177177

commit 270605317366e4535d8d9fc3d9da1ad0fb3c9d45
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Thu Dec 8 02:11:37 2022 +0530

    bpf: Rework process_dynptr_func
    
    Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type
    for use in callback state, because in case of user ringbuf helpers,
    there is no dynptr on the stack that is passed into the callback. To
    reflect such a state, a special register type was created.
    
    However, some checks have been bypassed incorrectly during the addition
    of this feature. First, for arg_type with MEM_UNINIT flag which
    initialize a dynptr, they must be rejected for such register type.
    Secondly, in the future, there are plans to add dynptr helpers that
    operate on the dynptr itself and may change its offset and other
    properties.
    
    In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed
    to such helpers, however the current code simply returns 0.
    
    The rejection for helpers that release the dynptr is already handled.
    
    For fixing this, we take a step back and rework existing code in a way
    that will allow fitting in all classes of helpers and have a coherent
    model for dealing with the variety of use cases in which dynptr is used.
    
    First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together
    with a DYNPTR_TYPE_* constant that denotes the only type it accepts.
    
    Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this
    fact. To make the distinction clear, use MEM_RDONLY flag to indicate
    that the helper only operates on the memory pointed to by the dynptr,
    not the dynptr itself. In C parlance, it would be equivalent to taking
    the dynptr as a pointer-to-const argument.
    
    When either of these flags are not present, the helper is allowed to
    mutate both the dynptr itself and also the memory it points to.
    Currently, the read only status of the memory is not tracked in the
    dynptr, but it would be trivial to add this support inside dynptr state
    of the register.
    
    With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to
    better reflect its usage, it can no longer be passed to helpers that
    initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr.
    
    A note to reviewers is that in code that does mark_stack_slots_dynptr,
    and unmark_stack_slots_dynptr, we implicitly rely on the fact that
    PTR_TO_STACK reg is the only case that can reach that code path, as one
    cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In
    both cases such helpers won't be setting that flag.
    
    The next patch will add a couple of selftest cases to make sure this
    doesn't break.
    
    Fixes: 205715673844 ("bpf: Add bpf_user_ringbuf_drain() helper")
    Acked-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:17 +02:00
Jerome Marchand f19e7a99bb bpf/docs: Document struct cgroup * kfuncs
Bugzilla: https://bugzilla.redhat.com/2177177

commit 36aa10ffd6480b93e32611411be4a8fc49804aba
Author: David Vernet <void@manifault.com>
Date:   Wed Dec 7 14:49:11 2022 -0600

    bpf/docs: Document struct cgroup * kfuncs
    
    bpf_cgroup_acquire(), bpf_cgroup_release(), bpf_cgroup_kptr_get(), and
    bpf_cgroup_ancestor(), are kfuncs that were recently added to
    kernel/bpf/helpers.c. These are "core" kfuncs in that they're available
    for use in any tracepoint or struct_ops BPF program. Though they have no
    ABI stability guarantees, we should still document them. This patch adds
    a struct cgroup * subsection to the Core kfuncs section which describes
    each of these kfuncs.
    
    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221207204911.873646-3-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:16 +02:00
Jerome Marchand 27492209c8 bpf/docs: Document struct task_struct * kfuncs
Bugzilla: https://bugzilla.redhat.com/2177177

commit 25c5e92d197bd721e706444c5910fd386c330456
Author: David Vernet <void@manifault.com>
Date:   Wed Dec 7 14:49:10 2022 -0600

    bpf/docs: Document struct task_struct * kfuncs
    
    bpf_task_acquire(), bpf_task_release(), and bpf_task_from_pid() are
    kfuncs that were recently added to kernel/bpf/helpers.c. These are
    "core" kfuncs in that they're available for use for any tracepoint or
    struct_ops BPF program. Though they have no ABI stability guarantees, we
    should still document them. This patch adds a new Core kfuncs section to
    the BPF kfuncs doc, and adds entries for all of these task kfuncs.
    
    Note that bpf_task_kptr_get() is not documented, as it still returns
    NULL while we're working to resolve how it can use RCU to ensure struct
    task_struct * lifetime.
    
    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221207204911.873646-2-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:16 +02:00
Jerome Marchand 091fd1e4dc bpf: Don't use rcu_users to refcount in task kfuncs
Bugzilla: https://bugzilla.redhat.com/2177177

commit 156ed20d22ee68d470232d26ae6df2cefacac4a0
Author: David Vernet <void@manifault.com>
Date:   Tue Dec 6 15:05:38 2022 -0600

    bpf: Don't use rcu_users to refcount in task kfuncs
    
    A series of prior patches added some kfuncs that allow struct
    task_struct * objects to be used as kptrs. These kfuncs leveraged the
    'refcount_t rcu_users' field of the task for performing refcounting.
    This field was used instead of 'refcount_t usage', as we wanted to
    leverage the safety provided by RCU for ensuring a task's lifetime.
    
    A struct task_struct is refcounted by two different refcount_t fields:
    
    1. p->usage:     The "true" refcount field which tracks task lifetime. The
                     task is freed as soon as this refcount drops to 0.
    
    2. p->rcu_users: An "RCU users" refcount field which is statically
                     initialized to 2, and is co-located in a union with
                     a struct rcu_head field (p->rcu). p->rcu_users
                     essentially encapsulates a single p->usage
                     refcount, and when p->rcu_users goes to 0, an RCU
                     callback is scheduled on the struct rcu_head which
                     decrements the p->usage refcount.
    
    Our logic was that by using p->rcu_users, we would be able to use RCU to
    safely issue refcount_inc_not_zero() on a task's rcu_users field to
    determine if a task could still be acquired, or was exiting.
    Unfortunately, this does not work due to p->rcu_users and p->rcu sharing
    a union. When p->rcu_users goes to 0, an RCU callback is scheduled to
    drop a single p->usage refcount, and because the fields share a union,
    the refcount immediately becomes nonzero again after the callback is
    scheduled.
    
    If we were to split the fields out of the union, this wouldn't be a
    problem. Doing so should also be rather non-controversial, as there are
    a number of places in struct task_struct that have padding which we
    could use to avoid growing the structure by splitting up the fields.
    
    For now, so as to fix the kfuncs to be correct, this patch instead
    updates bpf_task_acquire() and bpf_task_release() to use the p->usage
    field for refcounting via the get_task_struct() and put_task_struct()
    functions. Because we can no longer rely on RCU, the change also guts
    the bpf_task_acquire_not_zero() and bpf_task_kptr_get() functions
    pending a resolution on the above problem.
    
    In addition, the patch fixes the kfunc and rcu_read_lock selftests to
    expect this new behavior.
    
    Fixes: 90660309b0c7 ("bpf: Add kfuncs for storing struct task_struct * as a kptr")
    Fixes: fca1aa75518c ("bpf: Handle MEM_RCU type properly")
    Reported-by: Matus Jokay <matus.jokay@stuba.sk>
    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221206210538.597606-1-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:15 +02:00
Jerome Marchand 02626b368f bpf: Handle MEM_RCU type properly
Bugzilla: https://bugzilla.redhat.com/2177177

commit fca1aa75518c03b04c3c249e9a9134faf9ca18c5
Author: Yonghong Song <yhs@fb.com>
Date:   Sat Dec 3 10:46:02 2022 -0800

    bpf: Handle MEM_RCU type properly

    Commit 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
    introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that
    commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED
    so that it can be passed into kfuncs or helpers as an argument.

    Martin raised a good question in [1] such that the rcu pointer,
    although being able to accessing the object, might have reference
    count of 0. This might cause a problem if the rcu pointer is passed
    to a kfunc which expects trusted arguments where ref count should
    be greater than 0.

    This patch makes the following changes related to MEM_RCU pointer:
      - MEM_RCU pointer might be NULL (PTR_MAYBE_NULL).
      - Introduce KF_RCU so MEM_RCU ptr can be acquired with
        a KF_RCU tagged kfunc which assumes ref count of rcu ptr
        could be zero.
      - For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and
        'a' is tagged with __rcu as well. Let us mark 'b' as
        MEM_RCU | PTR_MAYBE_NULL.

     [1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/

    Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:14 +02:00
Jerome Marchand 0ec7796171 bpf: Add kfunc bpf_rcu_read_lock/unlock()
Bugzilla: https://bugzilla.redhat.com/2177177

commit 9bb00b2895cbfe0ad410457b605d0a72524168c1
Author: Yonghong Song <yhs@fb.com>
Date:   Wed Nov 23 21:32:17 2022 -0800

    bpf: Add kfunc bpf_rcu_read_lock/unlock()

    Add two kfuncs, bpf_rcu_read_lock() and bpf_rcu_read_unlock(). These two kfuncs
    can be used for all program types. The following is an example of how
    rcu pointers are used w.r.t. bpf_rcu_read_lock()/bpf_rcu_read_unlock().

      struct task_struct {
        ...
        struct task_struct              *last_wakee;
        struct task_struct __rcu        *real_parent;
        ...
      };

    Let us say prog does 'task = bpf_get_current_task_btf()' to get a
    'task' pointer. The basic rules are:
      - 'real_parent = task->real_parent' should be inside bpf_rcu_read_lock
        region. This is to simulate rcu_dereference() operation. The
        'real_parent' is marked as MEM_RCU only if (1). task->real_parent is
        inside bpf_rcu_read_lock region, and (2). task is a trusted ptr. So
        MEM_RCU marked ptr can be 'trusted' inside the bpf_rcu_read_lock region.
      - 'last_wakee = real_parent->last_wakee' should be inside bpf_rcu_read_lock
        region since it tries to access rcu protected memory.
      - the ptr 'last_wakee' will be marked as PTR_UNTRUSTED since in general
        it is not clear whether the object pointed by 'last_wakee' is valid or
        not even inside bpf_rcu_read_lock region.

    The verifier will reset all rcu pointer register states to untrusted
    at bpf_rcu_read_unlock() kfunc call site, so any such rcu pointer
    won't be trusted any more outside the bpf_rcu_read_lock() region.

    The current implementation does not support nested rcu read lock
    region in the prog.

    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221124053217.2373910-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
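
The pattern described above, as a compact sketch (not from the patch); the attach point and kfunc declarations follow the selftests and are assumptions here:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    extern void bpf_rcu_read_lock(void) __ksym;
    extern void bpf_rcu_read_unlock(void) __ksym;

    SEC("tp_btf/task_newtask")
    int BPF_PROG(read_parent, struct task_struct *task, u64 clone_flags)
    {
        struct task_struct *real_parent;

        bpf_rcu_read_lock();
        /* rcu_dereference()-style load: real_parent is MEM_RCU here */
        real_parent = task->real_parent;
        if (real_parent)
            bpf_printk("parent pid %d", real_parent->pid);
        bpf_rcu_read_unlock();
        return 0;
    }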

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:12 +02:00
Jerome Marchand 868564cc57 bpf: Introduce might_sleep field in bpf_func_proto
Bugzilla: https://bugzilla.redhat.com/2177177

commit 01685c5bddaa6df3d662c8afed5e5289fcc68e5a
Author: Yonghong Song <yhs@fb.com>
Date:   Wed Nov 23 21:32:11 2022 -0800

    bpf: Introduce might_sleep field in bpf_func_proto

    Introduce bpf_func_proto->might_sleep to indicate a particular helper
    might sleep. This will make it easier to later check whether a helper
    might be sleepable or not.

    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:12 +02:00
Jerome Marchand 3cd5a9ecbe bpf: Add bpf_task_from_pid() kfunc
Bugzilla: https://bugzilla.redhat.com/2177177

commit 3f0e6f2b41d35d4446160c745e8f09037447dd8f
Author: David Vernet <void@manifault.com>
Date:   Tue Nov 22 08:52:59 2022 -0600

    bpf: Add bpf_task_from_pid() kfunc

    Callers can currently store tasks as kptrs using bpf_task_acquire(),
    bpf_task_kptr_get(), and bpf_task_release(). These are useful if a
    caller already has a struct task_struct *, but there may be some callers
    who only have a pid, and want to look up the associated struct
    task_struct * from that to e.g. find task->comm.

    This patch therefore adds a new bpf_task_from_pid() kfunc which allows
    BPF programs to get a struct task_struct * kptr from a pid.

    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221122145300.251210-2-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
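
A brief usage sketch (not from the patch), assuming the kfunc declarations and attach point used by the accompanying selftests:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
    extern void bpf_task_release(struct task_struct *p) __ksym;

    SEC("tp_btf/task_newtask")
    int BPF_PROG(lookup_init, struct task_struct *task, u64 clone_flags)
    {
        struct task_struct *p;

        p = bpf_task_from_pid(1);    /* look up pid 1 (init) */
        if (!p)
            return 0;

        bpf_printk("pid 1 tgid %d", p->tgid);
        bpf_task_release(p);
        return 0;
    }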

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:11 +02:00
Jerome Marchand 81daf48e7c bpf: Don't use idx variable when registering kfunc dtors
Bugzilla: https://bugzilla.redhat.com/2177177

commit 2fcc6081a7bf8f7f531cffdc58b630b822e700a1
Author: David Vernet <void@manifault.com>
Date:   Wed Nov 23 07:52:53 2022 -0600

    bpf: Don't use idx variable when registering kfunc dtors

    In commit fda01efc6160 ("bpf: Enable cgroups to be used as kptrs"), I
    added an 'int idx' variable to kfunc_init() which was meant to
    dynamically set the index of the btf id entries of the
    'generic_dtor_ids' array. This was done to make the code slightly less
    brittle as the struct cgroup * kptr kfuncs such as bpf_cgroup_acquire()
    are compiled out if CONFIG_CGROUPS is not defined. This, however, causes
    an lkp build warning:

    >> kernel/bpf/helpers.c:2005:40: warning: multiple unsequenced
       modifications to 'idx' [-Wunsequenced]
    	.btf_id       = generic_dtor_ids[idx++],

    Fix the warning by just hard-coding the indices.

    Fixes: fda01efc6160 ("bpf: Enable cgroups to be used as kptrs")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: David Vernet <void@manifault.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221123135253.637525-1-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:11 +02:00
Jerome Marchand 0ee084404a bpf: Add bpf_cgroup_ancestor() kfunc
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Missing commit 7f203bc89eb6 ("cgroup: Replace
cgroup->ancestor_ids[] with ->ancestors[]")

commit 5ca7867078296cfa9c100f9a3b2d24be1e139825
Author: David Vernet <void@manifault.com>
Date:   Mon Nov 21 23:54:57 2022 -0600

    bpf: Add bpf_cgroup_ancestor() kfunc

    struct cgroup * objects have a variably sized struct cgroup *ancestors[]
    field which stores pointers to their ancestor cgroups. If using a cgroup
    as a kptr, it can be useful to access these ancestors, but doing so
    requires variable offset accesses for PTR_TO_BTF_ID, which is currently
    unsupported.

    This is a very useful field to access for cgroup kptrs, as programs may
    wish to walk their ancestor cgroups when determining e.g. their
    proportional cpu.weight. So as to enable this functionality with cgroup
    kptrs before var_off is supported for PTR_TO_BTF_ID, this patch adds a
    bpf_cgroup_ancestor() kfunc which accesses the cgroup node on behalf of
    the caller, and acquires a reference on it. Once var_off is supported
    for PTR_TO_BTF_ID, and fields inside a struct can be marked as trusted
    so they retain the PTR_TRUSTED modifier when walked, this can be
    removed.

    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221122055458.173143-4-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
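
An illustrative sketch (not from the patch) of walking one level up from a trusted cgroup pointer; the declarations and tracepoint attach point mirror the selftests and are assumptions here:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    extern struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym;
    extern void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

    SEC("tp_btf/cgroup_mkdir")
    int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp, const char *path)
    {
        struct cgroup *parent;

        /* walk one level up; returns NULL for the root cgroup */
        parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
        if (!parent)
            return 0;

        bpf_printk("parent level %d", parent->level);
        bpf_cgroup_release(parent);
        return 0;
    }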

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:10 +02:00
Jerome Marchand cabc5abf20 bpf: Enable cgroups to be used as kptrs
Bugzilla: https://bugzilla.redhat.com/2177177

commit fda01efc61605af7c6fa03c4109f14d59c9228b7
Author: David Vernet <void@manifault.com>
Date:   Mon Nov 21 23:54:55 2022 -0600

    bpf: Enable cgroups to be used as kptrs

    Now that tasks can be used as kfuncs, and the PTR_TRUSTED flag is
    available for us to easily add basic acquire / get / release kfuncs, we
    can do the same for cgroups. This patch set adds the following kfuncs
    which enable using cgroups as kptrs:

    struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp);
    struct cgroup *bpf_cgroup_kptr_get(struct cgroup **cgrpp);
    void bpf_cgroup_release(struct cgroup *cgrp);

    A follow-on patch will add a selftest suite which validates these
    kfuncs.

    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221122055458.173143-2-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:10 +02:00
Jerome Marchand 4b3197d3e3 bpf: Add a kfunc for generic type cast
Bugzilla: https://bugzilla.redhat.com/2177177

commit a35b9af4ec2c7f69286ef861fd2074a577e354cb
Author: Yonghong Song <yhs@fb.com>
Date:   Sun Nov 20 11:54:37 2022 -0800

    bpf: Add a kfunc for generic type cast

    Implement bpf_rdonly_cast() which tries to cast the object
    to a specified type. This tries to support use cases like the one below:
      #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
    where skb_end_pointer(SKB) is an 'unsigned char *' and needs to
    be cast to 'struct skb_shared_info *'.

    The signature of bpf_rdonly_cast() looks like
       void *bpf_rdonly_cast(void *obj, __u32 btf_id)
    The function returns the same 'obj' but with PTR_TO_BTF_ID with
    btf_id. The verifier will ensure btf_id being a struct type.

    Since the supported type cast may not reflect what the 'obj'
    represents, the returned btf_id is marked as PTR_UNTRUSTED, so
    the return value and subsequent pointer chasing cannot be
    used as helper/kfunc arguments.

    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221120195437.3114585-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
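
A sketch of the skb_shinfo() use case above, modeled on the series' selftests (not part of the patch); it also uses bpf_cast_to_kern_ctx() from the companion patch, and the kfunc declarations are assumptions here:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    char LICENSE[] SEC("license") = "GPL";

    extern void *bpf_cast_to_kern_ctx(void *obj) __ksym;
    extern void *bpf_rdonly_cast(void *obj, __u32 btf_id) __ksym;

    SEC("tc")
    int read_shinfo(struct __sk_buff *skb)
    {
        struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);
        struct skb_shared_info *shinfo;

        /* open-coded skb_shinfo(): cast the end pointer to skb_shared_info;
         * the result is PTR_UNTRUSTED, i.e. read-only, probe-style access */
        shinfo = bpf_rdonly_cast(kskb->head + kskb->end,
                                 bpf_core_type_id_kernel(struct skb_shared_info));
        bpf_printk("nr_frags %u", shinfo->nr_frags);
        return 0;
    }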

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:09 +02:00
Jerome Marchand 750e4d2c71 bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx
Bugzilla: https://bugzilla.redhat.com/2177177

commit fd264ca020948a743e4c36731dfdecc4a812153c
Author: Yonghong Song <yhs@fb.com>
Date:   Sun Nov 20 11:54:32 2022 -0800

    bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx

    Implement bpf_cast_to_kern_ctx() kfunc which does a type cast
    of a uapi ctx object to the corresponding kernel ctx. Previously,
    if users wanted to access some data available in the kctx but not
    in the uapi ctx, the bpf_probe_read_kernel() helper was needed.
    The introduction of bpf_cast_to_kern_ctx() allows direct
    memory access which makes code simpler and easier to understand.

    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221120195432.3113982-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
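
A minimal usage sketch (not from the patch); the kfunc declaration mirrors the selftests and is an assumption here:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    extern void *bpf_cast_to_kern_ctx(void *obj) __ksym;

    SEC("tc")
    int kctx_peek(struct __sk_buff *skb)
    {
        /* direct access to the kernel sk_buff behind the uapi ctx,
         * without bpf_probe_read_kernel() */
        struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);

        bpf_printk("len %u hash %u", kskb->len, kskb->hash);
        return 0;
    }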

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:09 +02:00
Jerome Marchand 96be11db9f bpf: Add support for kfunc set with common btf_ids
Bugzilla: https://bugzilla.redhat.com/2177177

commit cfe1456440c8feaf6558577a400745d774418379
Author: Yonghong Song <yhs@fb.com>
Date:   Sun Nov 20 11:54:26 2022 -0800

    bpf: Add support for kfunc set with common btf_ids

    Later on, we will introduce kfuncs bpf_cast_to_kern_ctx() and
    bpf_rdonly_cast() which apply to all program types. Currently kfunc set
    only supports individual prog types. This patch adds support for kfuncs
    applying to all program types.

    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221120195426.3113828-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:09 +02:00
Jerome Marchand a89a10052c bpf: Disallow bpf_obj_new_impl call when bpf_mem_alloc_init fails
Bugzilla: https://bugzilla.redhat.com/2177177

commit e181d3f143f7957a73c8365829249d8084602606
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Mon Nov 21 02:56:10 2022 +0530

    bpf: Disallow bpf_obj_new_impl call when bpf_mem_alloc_init fails

    In the unlikely event that bpf_global_ma is not correctly initialized,
    instead of checking the boolean every time bpf_obj_new_impl is called,
    simply check it while loading the program and return an error if
    bpf_global_ma_set is false.

    Suggested-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221120212610.2361700-1-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:09 +02:00
Jerome Marchand 3ed0a6a4dd bpf: Add kfuncs for storing struct task_struct * as a kptr
Bugzilla: https://bugzilla.redhat.com/2177177

commit 90660309b0c76c564a31a21f3a81d6641a9acaa0
Author: David Vernet <void@manifault.com>
Date:   Sat Nov 19 23:10:03 2022 -0600

    bpf: Add kfuncs for storing struct task_struct * as a kptr

    Now that BPF supports adding new kernel functions with kfuncs, and
    storing kernel objects in maps with kptrs, we can add a set of kfuncs
    which allow struct task_struct objects to be stored in maps as
    referenced kptrs. The possible use cases for doing this are plentiful.
    During tracing, for example, it would be useful to be able to collect
    some tasks that performed a certain operation, and then periodically
    summarize who they are, which cgroup they're in, how much CPU time
    they've utilized, etc.

    In order to enable this, this patch adds three new kfuncs:

    struct task_struct *bpf_task_acquire(struct task_struct *p);
    struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
    void bpf_task_release(struct task_struct *p);

    A follow-on patch will add selftests validating these kfuncs.

    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20221120051004.3605026-4-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
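
A sketch of stashing a task as a referenced kptr, modeled on the shape of the accompanying selftests (not part of the patch); the __kptr_ref type tag is how libbpf spelled referenced kptrs at the time (later renamed __kptr), and the map layout and attach point are assumptions here:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    extern struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
    extern void bpf_task_release(struct task_struct *p) __ksym;

    struct task_stash {
        struct task_struct __kptr_ref *task;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 128);
        __type(key, int);
        __type(value, struct task_stash);
    } task_map SEC(".maps");

    SEC("tp_btf/task_newtask")
    int BPF_PROG(stash_new_task, struct task_struct *task, u64 clone_flags)
    {
        struct task_stash empty = {}, *stash;
        struct task_struct *acquired, *old;
        int pid = task->pid;

        bpf_map_update_elem(&task_map, &pid, &empty, BPF_NOEXIST);
        stash = bpf_map_lookup_elem(&task_map, &pid);
        if (!stash)
            return 0;

        acquired = bpf_task_acquire(task);
        old = bpf_kptr_xchg(&stash->task, acquired);
        if (old)
            bpf_task_release(old);
        return 0;
    }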

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:08 +02:00
Jerome Marchand aeccbbda92 bpf: Introduce single ownership BPF linked list API
Bugzilla: https://bugzilla.redhat.com/2177177

commit 8cab76ec634995e59a8b6346bf8b835ab7fad3a3
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:26:06 2022 +0530

    bpf: Introduce single ownership BPF linked list API

    Add a linked list API for use in BPF programs, where it expects
    protection from the bpf_spin_lock in the same allocation as the
    bpf_list_head. For now, only one bpf_spin_lock can be present, hence that
    is assumed to be the one protecting the bpf_list_head.

    The following functions are added to kick things off:

    // Add node to beginning of list
    void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node);

    // Add node to end of list
    void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node);

    // Remove node at beginning of list and return it
    struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head);

    // Remove node at end of list and return it
    struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head);

    The lock protecting the bpf_list_head needs to be taken for all
    operations. The verifier ensures that the lock that needs to be taken is
    always held, and only the correct lock is taken for these operations.
    These checks are made statically by relying on the reg->id preserved for
    registers pointing into regions having both bpf_spin_lock and the
    objects protected by it. The comment over check_reg_allocation_locked in
    this change describes the logic in detail.

    Note that bpf_list_push_front and bpf_list_push_back are meant to
    consume the object containing the node in the 1st argument, however that
    specific mechanism is intended to not release the ref_obj_id directly
    until the bpf_spin_unlock is called. In this commit, nothing is done,
    but the next commit will be introducing logic to handle this case, so it
    has been left as is for now.

    bpf_list_pop_front and bpf_list_pop_back delete the first or last item
    of the list respectively, and return pointer to the element at the
    list_node offset. The user can then use container_of style macro to get
    the actual entry type. The verifier however statically knows the actual
    type, so the safety properties are still preserved.

    With these additions, programs can now manage their own linked lists and
    store their objects in them.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
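
A sketch of pushing an allocated object onto such a list from a map value (illustrative, not from the patch); __contains, bpf_obj_new and the bpf_list_* declarations are assumed to come from the selftests' bpf_experimental.h header:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include "bpf_experimental.h"    /* assumed selftests header */

    char LICENSE[] SEC("license") = "GPL";

    struct foo {
        struct bpf_list_node node;
        int data;
    };

    struct map_value {
        struct bpf_spin_lock lock;
        struct bpf_list_head head __contains(foo, node);
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_value);
    } array_map SEC(".maps");

    SEC("tc")
    int push_one(void *ctx)
    {
        struct map_value *v;
        struct foo *f;
        int key = 0;

        v = bpf_map_lookup_elem(&array_map, &key);
        if (!v)
            return 0;

        f = bpf_obj_new(typeof(*f));
        if (!f)
            return 0;
        f->data = 13;

        /* the bpf_spin_lock in the same allocation must be held */
        bpf_spin_lock(&v->lock);
        bpf_list_push_front(&v->head, &f->node);    /* the list now owns f */
        bpf_spin_unlock(&v->lock);
        return 0;
    }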

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:07 +02:00
Jerome Marchand 0c43899670 bpf: Introduce bpf_obj_drop
Bugzilla: https://bugzilla.redhat.com/2177177

commit ac9f06050a3580cf4076a57a470cd71f12a81171
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:26:04 2022 +0530

    bpf: Introduce bpf_obj_drop

    Introduce bpf_obj_drop, which is the kfunc used to free allocated
    objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it
    implicitly destructs the fields that are part of the object automatically without
    user intervention.

    Just like the previous patch, btf_struct_meta that is needed to free up
    the special fields is passed as a hidden argument to the kfunc.

    For the user, a convenience macro hides over the kernel side kfunc which
    is named bpf_obj_drop_impl.

    Continuing the previous example:

    void prog(void) {
    	struct foo *f;

    	f = bpf_obj_new(typeof(*f));
    	if (!f)
    		return;
    	bpf_obj_drop(f);
    }

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:07 +02:00
Jerome Marchand 27b1b8aed6 bpf: Introduce bpf_obj_new
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from already backported commit 997849c4b969
("bpf: Zeroing allocated object from slab in bpf memory allocator"

commit 958cf2e273f0929c66169e0788031310e8118722
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:26:03 2022 +0530

    bpf: Introduce bpf_obj_new

    Introduce type safe memory allocator bpf_obj_new for BPF programs. The
    kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
    to kfuncs still requires having them in prototype, unlike BPF helpers
    which always take 5 arguments and have them checked using bpf_func_proto
    in verifier, ignoring unset argument types.

    Introduce __ign suffix to ignore a specific kfunc argument during type
    checks, then use this to introduce support for passing type metadata to
    the bpf_obj_new_impl kfunc.

    The user passes the BTF ID of the type it wants to allocate in program BTF,
    the verifier then rewrites the first argument as the size of this type,
    after performing some sanity checks (to ensure it exists and it is a
    struct type).

    The second argument is also fixed up and passed by the verifier. This is
    the btf_struct_meta for the type being allocated. It would be needed
    mostly for the offset array which is required for zero initializing
    special fields while leaving the rest of the storage in an uninitialized state.

    It would also be needed in the next patch to perform proper destruction
    of the object's special fields.

    Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
    using the any context BPF memory allocator introduced recently. To this
    end, a global instance of the BPF memory allocator is initialized on
    boot to be used for this purpose. This 'bpf_global_ma' serves all
    allocations for bpf_obj_new. In the future, bpf_obj_new variants will
    allow specifying a custom allocator.

    Note that now that bpf_obj_new can be used to allocate objects that can
    be linked to BPF linked list (when future linked list helpers are
    available), we need to also free the elements using bpf_mem_free.
    However, since the draining of elements is done outside the
    bpf_spin_lock, we need to do migrate_disable around the call since
    bpf_list_head_free can be called from map free path where migration is
    enabled. Otherwise, when called from BPF programs migration is already
    disabled.

    A convenience macro is included in the bpf_experimental.h header to hide
    over the ugly details of the implementation, leading to user code
    looking similar to a language level extension which allocates and
    constructs fields of a user type.

    struct bar {
            struct bpf_list_node node;
    };

    struct foo {
            struct bpf_spin_lock lock;
            struct bpf_list_head head __contains(bar, node);
    };

    void prog(void) {
            struct foo *f;

            f = bpf_obj_new(typeof(*f));
            if (!f)
                    return;
            ...
    }

    A key piece of this story is still missing, i.e. the free function,
    which will come in the next patch.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:07 +02:00
Jerome Marchand 9eb2139c9f bpf: Allow locking bpf_spin_lock in allocated objects
Bugzilla: https://bugzilla.redhat.com/2177177

commit 4e814da0d59917c6d758a80e63e79b5ee212cf11
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:58 2022 +0530

    bpf: Allow locking bpf_spin_lock in allocated objects

    Allow locking a bpf_spin_lock in an allocated object, in addition to
    already supported map value pointers. The handling is similar to that of
    map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC
    as well, and adjusting process_spin_lock to work with them and remember
    the id in verifier state.

    Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID |
    MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the
    reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to
    preserve reg->id, that will be used in env->cur_state->active_spin_lock
    to remember the currently held spin lock.

    Also update the comment describing bpf_spin_lock implementation details
    to also talk about PTR_TO_BTF_ID | MEM_ALLOC type.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand d03c51f6bc bpf: Support bpf_list_head in map values
Bugzilla: https://bugzilla.redhat.com/2177177

commit f0c5941ff5b255413d31425bb327c2aec3625673
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Nov 15 00:45:25 2022 +0530

    bpf: Support bpf_list_head in map values

    Add the support on the map side to parse, recognize, verify, and build
    metadata table for a new special field of the type struct bpf_list_head.
    To parameterize the bpf_list_head for a certain value type and the
    list_node member it will accept in that value type, we use BTF
    declaration tags.

    The definition of bpf_list_head in a map value will be done as follows:

    struct foo {
    	struct bpf_list_node node;
    	int data;
    };

    struct map_value {
    	struct bpf_list_head head __contains(foo, node);
    };

    Then, the bpf_list_head only allows adding to the list 'head' using the
    bpf_list_node 'node' for the type struct foo.

    The 'contains' annotation is a BTF declaration tag composed of four
    parts, "contains:name:node" where the name is then used to look up the
    type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
    the lookup. The node defines name of the member in this type that has
    the type struct bpf_list_node, which is actually used for linking into
    the linked list. For now, 'kind' part is hardcoded as struct.

    This allows building intrusive linked lists in BPF, using container_of
    to obtain pointer to entry, while being completely type safe from the
    perspective of the verifier. The verifier knows exactly the type of the
    nodes, and knows that list helpers return that type at some fixed offset
    where the bpf_list_node member used for this list exists. The verifier
    also uses this information to disallow adding types that are not
    accepted by a certain list.

    For now, no elements can be added to such lists. Support for that is
    coming in future patches, hence draining and freeing items is done with
    a TODO that will be resolved in a future patch.

    Note that the bpf_list_head_free function moves the list out to a local
    variable under the lock and releases it, doing the actual draining of
    the list items outside the lock. While this helps with not holding the
    lock for too long pessimizing other concurrent list operations, it is
    also necessary for deadlock prevention: unless every function called in
    the critical section would be notrace, a fentry/fexit program could
    attach and call bpf_map_update_elem again on the map, leading to the
    same lock being acquired if the key matches and lead to a deadlock.
    While this requires some special effort on part of the BPF programmer to
    trigger and is highly unlikely to occur in practice, it is always better
    if we can avoid such a condition.

    While notrace would prevent this, doing the draining outside the lock
    has advantages of its own, hence it is used to also fix the deadlock
    related problem.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:04 +02:00
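
The following is a minimal sketch of how a map value carrying such a
bpf_list_head could be declared from a BPF program, based on the definition
quoted above. The __contains macro expansion, the surrounding array map and
the accompanying bpf_spin_lock are illustrative assumptions (and, as the
commit notes, no elements can be added to the list yet at this point in the
series):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Assumed expansion of the 'contains' declaration tag described above. */
#define __contains(name, node) \
        __attribute__((btf_decl_tag("contains:" #name ":" #node)))

struct foo {
        struct bpf_list_node node;
        int data;
};

struct map_value {
        struct bpf_spin_lock lock;      /* assumed: protects 'head' */
        struct bpf_list_head head __contains(foo, node);
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_value);
} list_map SEC(".maps");

char LICENSE[] SEC("license") = "GPL";
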
Jerome Marchand 2b8a340165 bpf: Consolidate spin_lock, timer management into btf_record
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from already backported commit 997849c4b969
("bpf: Zeroing allocated object from slab in bpf memory allocator")

commit db559117828d2448fe81ada051c60bcf39f822e9
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:56 2022 +0530

    bpf: Consolidate spin_lock, timer management into btf_record

    Now that kptr_off_tab has been refactored into btf_record, and can hold
    more than one specific field type, accommodate bpf_spin_lock and
    bpf_timer as well.

    While they don't require any more metadata than offset, having all
    special fields in one place allows us to share the same code for
    allocated user defined types and handle both map values and these
    allocated objects in a similar fashion.

    As an optimization, we still keep spin_lock_off and timer_off offsets in
    the btf_record structure, just to avoid having to find the btf_field
    struct each time their offset is needed. This is mostly needed to
    manipulate such objects in a map value at runtime. It's ok to hardcode
    just one offset as more than one field is disallowed.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand dcf538d57d bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from missing commit 7f203bc89eb6 ("cgroup:
Replace cgroup->ancestor_ids[] with ->ancestors[]")

commit c4bcfb38a95edb1021a53f2d0356a78120ecfbe4
Author: Yonghong Song <yhs@fb.com>
Date:   Tue Oct 25 21:28:50 2022 -0700

    bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

    Similar to sk/inode/task storage, implement similar cgroup local storage.

    There already exists a local storage implementation for cgroup-attached
    bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
    bpf_get_local_storage(). But there are use cases where non-cgroup-attached
    bpf progs want to access cgroup local storage data. For example, a
    tc egress prog has access to sk and cgroup. It is possible to use
    sk local storage to emulate cgroup local storage by storing data in the socket,
    but this is wasteful since there could be many sockets belonging to a particular
    cgroup. Alternatively, a separate map can be created with the cgroup id as the key.
    But this will introduce additional overhead to manipulate the new map.
    A cgroup local storage, similar to existing sk/inode/task storage,
    should help for this use case.

    The life-cycle of storage is managed with the life-cycle of the
    cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
    with a call to bpf_cgrp_storage_free() when cgroup itself
    is deleted.

    The userspace map operations can be done by using a cgroup fd as a key
    passed to the lookup, update and delete operations.

    Typically, the following code is used to get the current cgroup:
        struct task_struct *task = bpf_get_current_task_btf();
        ... task->cgroups->dfl_cgrp ...
    and in structure task_struct definition:
        struct task_struct {
            ....
            struct css_set __rcu            *cgroups;
            ....
        }
    With sleepable programs, accessing task->cgroups is not protected by rcu_read_lock.
    So the current implementation only supports non-sleepable programs; supporting
    sleepable programs will be the next step, together with adding rcu_read_lock
    protection for rcu-tagged structures.

    Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
    storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
    for cgroup storage available to non-cgroup-attached bpf programs. The old
    cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
    The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
    functionality. While old cgroup storage pre-allocates storage memory, the new
    mechanism can also pre-allocate with a user space bpf_map_update_elem() call
    to avoid potential run-time memory allocation failure.
    Therefore, the new cgroup storage can provide all the functionality of
    the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased to
    BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup storage
    can be deprecated, since the new one provides the same functionality.

    Acked-by: David Vernet <void@manifault.com>
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:58 +02:00
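
A minimal sketch of how a non-cgroup-attached program might use the new map
type and helper described above; the attach point, value layout and map name
are illustrative assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
        __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, long);
} cgrp_counter SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_per_cgroup, struct pt_regs *regs, long id)
{
        struct task_struct *task = bpf_get_current_task_btf();
        long *cnt;

        /* Storage is keyed by the task's default cgroup. */
        cnt = bpf_cgrp_storage_get(&cgrp_counter, task->cgroups->dfl_cgrp, 0,
                                   BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
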
Artem Savkov d4045a2578 bpf: Add bpf_user_ringbuf_drain() helper
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: fixing previously incorrect order of cases in verifier.c

commit 20571567384428dfc9fe5cf9f2e942e1df13c2dd
Author: David Vernet <void@manifault.com>
Date:   Mon Sep 19 19:00:58 2022 -0500

    bpf: Add bpf_user_ringbuf_drain() helper

    In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which
    will allow user-space applications to publish messages to a ring buffer
    that is consumed by a BPF program in kernel-space. In order for this
    map-type to be useful, it will require a BPF helper function that BPF
    programs can invoke to drain samples from the ring buffer, and invoke
    callbacks on those samples. This change adds that capability via a new BPF
    helper function:

    bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx,
                           u64 flags)

    BPF programs may invoke this function to run callback_fn() on a series of
    samples in the ring buffer. callback_fn() has the following signature:

    long callback_fn(struct bpf_dynptr *dynptr, void *context);

    Samples are provided to the callback in the form of struct bpf_dynptr *'s,
    which the program can read using BPF helper functions for querying
    struct bpf_dynptr's.

    In order to support bpf_user_ringbuf_drain(), a new PTR_TO_DYNPTR register
    type is added to the verifier to reflect a dynptr that was allocated by
    a helper function and passed to a BPF program. Unlike PTR_TO_STACK
    dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR
    dynptrs need not use reference tracking, as the BPF helper is trusted to
    properly free the dynptr before returning. The verifier currently only
    supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL.

    Note that while the corresponding user-space libbpf logic will be added
    in a subsequent patch, this patch does contain an implementation of the
    .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This
    .map_poll() callback guarantees that an epoll-waiting user-space
    producer will receive at least one event notification whenever at least
    one sample is drained in an invocation of bpf_user_ringbuf_drain(),
    provided that the function is not invoked with the BPF_RB_NO_WAKEUP
    flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup
    notification is sent even if no sample was drained.

    Signed-off-by: David Vernet <void@manifault.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:27 +01:00
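
A minimal sketch of a kernel-side consumer built around the helper and
callback signature quoted above; the map size, sample layout and the syscall
tracepoint used as a trigger are illustrative assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_USER_RINGBUF);
        __uint(max_entries, 256 * 1024);
} user_rb SEC(".maps");

static long handle_sample(struct bpf_dynptr *dynptr, void *ctx)
{
        long *drained = ctx;
        __u64 val;

        /* Copy the first 8 bytes of the user-produced sample. */
        if (bpf_dynptr_read(&val, sizeof(val), dynptr, 0, 0))
                return 1;       /* non-zero return stops draining */
        (*drained)++;
        return 0;               /* keep draining */
}

SEC("tracepoint/syscalls/sys_enter_getpgid")
int drain_user_rb(void *ctx)
{
        long drained = 0;

        bpf_user_ringbuf_drain(&user_rb, handle_sample, &drained, 0);
        bpf_printk("drained %ld samples", drained);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
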
Artem Savkov 76989136e7 bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: already applied 7d21225e01 "bpf: Gate dynptr API behind
CAP_BPF"

commit 5679ff2f138f77b281c468959dc5022cc524d400
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Aug 23 03:31:17 2022 +0200

    bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF

    They would require func_info which needs prog BTF anyway. Loading BTF
    and setting the prog btf_fd while loading the prog indirectly requires
    CAP_BPF, so just to reduce confusion, move both of these callback-taking
    helpers under bpf_capable() protection as well, since they cannot be
    used without CAP_BPF.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20220823013117.24916-1-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:25 +01:00
Artem Savkov ca86169c5a bpf: expose bpf_strtol and bpf_strtoul to all program types
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: already applied 7d21225e01 "bpf: Gate dynptr API behind CAP_BPF"

commit 8a67f2de9b1dc3cf8b75b4bf589efb1f08e3e9b8
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Aug 23 15:25:53 2022 -0700

    bpf: expose bpf_strtol and bpf_strtoul to all program types

    bpf_strncmp is already exposed everywhere. The motivation is to keep
    those helpers in kernel/bpf/helpers.c. Otherwise it's tempting to move
    them under kernel/bpf/cgroup.c because they are currently only used
    by sysctl prog types.

    Suggested-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220823222555.523590-4-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:25 +01:00
Artem Savkov 5d1eabf3dc bpf: Export bpf_dynptr_get_size()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 51df4865718540f51bb5d3e552c50dc88e1333d6
Author: Roberto Sassu <roberto.sassu@huawei.com>
Date:   Tue Sep 20 09:59:43 2022 +0200

    bpf: Export bpf_dynptr_get_size()
    
    Export bpf_dynptr_get_size(), so that kernel code dealing with eBPF dynamic
    pointers can obtain the real size of data carried by this data structure.
    
    Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
    Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
    Acked-by: KP Singh <kpsingh@kernel.org>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20220920075951.929132-6-roberto.sassu@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:15 +01:00
Artem Savkov 47f8ca3b6d bpf: Introduce cgroup_{common,current}_func_proto
Bugzilla: https://bugzilla.redhat.com/2166911

commit dea6a4e17013382b20717664ebf3d7cc405e0952
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Aug 23 15:25:51 2022 -0700

    bpf: Introduce cgroup_{common,current}_func_proto
    
    Split cgroup_base_func_proto into the following:
    
    * cgroup_common_func_proto - common helpers for all cgroup hooks
    * cgroup_current_func_proto - common helpers for all cgroup hooks
      running in the process context (== have meaningful 'current').
    
    Move bpf_{g,s}et_retval and other cgroup-related helpers into
    kernel/bpf/cgroup.c so they are closer to where they are being used.
    
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:03 +01:00
Artem Savkov 0e33f79a84 bpf: export crash_kexec() as destructive kfunc
Bugzilla: https://bugzilla.redhat.com/2166911

commit 133790596406ce2658f0864eb7eac64987c2b12f
Author: Artem Savkov <asavkov@redhat.com>
Date:   Wed Aug 10 08:59:04 2022 +0200

    bpf: export crash_kexec() as destructive kfunc
    
    Allow properly marked bpf programs to call crash_kexec().
    
    Signed-off-by: Artem Savkov <asavkov@redhat.com>
    Link: https://lore.kernel.org/r/20220810065905.475418-3-asavkov@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
Artem Savkov b341ead9ab bpf: Add BPF-helper for accessing CLOCK_TAI
Bugzilla: https://bugzilla.redhat.com/2166911

commit c8996c98f703b09afe77a1d247dae691c9849dc1
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Tue Aug 9 08:08:02 2022 +0200

    bpf: Add BPF-helper for accessing CLOCK_TAI
    
    Commit 3dc6ffae2da2 ("timekeeping: Introduce fast accessor to clock tai")
    introduced a fast and NMI-safe accessor for CLOCK_TAI. Especially in time
    sensitive networks (TSN), where all nodes are synchronized by Precision Time
    Protocol (PTP), it's helpful to have the possibility to generate timestamps
    based on CLOCK_TAI instead of CLOCK_MONOTONIC. With a BPF helper for TAI in
    place, it becomes very convenient to correlate activity across different
    machines in the network.
    
    Use cases for such a BPF helper include functionalities such as Tx launch
    time (e.g. ETF and TAPRIO Qdiscs) and timestamping.
    
    Note: CLOCK_TAI is nothing new per se, only the NMI-safe variant of it is.
    
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    [Kurt: Wrote changelog and renamed helper]
    Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
    Link: https://lore.kernel.org/r/20220809060803.5773-2-kurt@linutronix.de
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:53:59 +01:00
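
A minimal sketch of the TAI helper in use, assuming the renamed helper ended
up as bpf_ktime_get_tai_ns() (as it did upstream); the tc hook is an
illustrative choice:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("tc")
int stamp_tai(struct __sk_buff *skb)
{
        __u64 tai_ns  = bpf_ktime_get_tai_ns();
        __u64 mono_ns = bpf_ktime_get_ns();

        /* TAI timestamps are directly comparable across PTP-synchronized nodes. */
        bpf_printk("tai=%llu mono=%llu", tai_ns, mono_ns);
        return 0;       /* TC_ACT_OK */
}

char LICENSE[] SEC("license") = "GPL";
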
Yauheni Kaliuta 10c1a87294 bpf: Add verifier check for BPF_PTR_POISON retval and arg
Bugzilla: http://bugzilla.redhat.com/2120968

commit 47e34cb74d376ddfeaef94abb1d6dfb3c905ee51
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Mon Sep 12 08:45:44 2022 -0700

    bpf: Add verifier check for BPF_PTR_POISON retval and arg
    
    BPF_PTR_POISON was added in commit c0a5a21c25f37 ("bpf: Allow storing
    referenced kptr in map") to denote a bpf_func_proto btf_id which the
    verifier will replace with a dynamically-determined btf_id at verification
    time.
    
    This patch adds verifier 'poison' functionality to BPF_PTR_POISON in
    order to prepare for expanded use of the value to poison ret- and
    arg-btf_id in ongoing work, namely rbtree and linked list patchsets
    [0, 1]. Specifically, when the verifier checks helper calls, it assumes
    that BPF_PTR_POISON'ed ret type will be replaced with a valid type before
    - or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id.
    
    If poisoned btf_id reaches default handling block for either, consider
    this a verifier internal error and fail verification. Otherwise a helper
    w/ poisoned btf_id but no verifier logic replacing the type will cause a
    crash as the invalid pointer is dereferenced.
    
    Also move BPF_PTR_POISON to the existing include/linux/poison.h header and
    remove unnecessary shift.
    
      [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
      [1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:09 +02:00
Yauheni Kaliuta 20067c525d bpf: Fix non-static bpf_func_proto struct definitions
Bugzilla: http://bugzilla.redhat.com/2120968

commit dc368e1c658e4f478a45e8d1d5b0c8392ca87506
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Thu Jun 16 15:54:07 2022 -0700

    bpf: Fix non-static bpf_func_proto struct definitions

    This patch does two things:

    1) Marks the dynptr bpf_func_proto structs that were added in [1]
       as static, as pointed out by the kernel test robot in [2].

    2) There are some bpf_func_proto structs marked as extern which can
       instead be statically defined.

      [1] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
      [2] https://lore.kernel.org/bpf/62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:09 +02:00
Yauheni Kaliuta 9a5f220992 btf: Export bpf_dynptr definition
Bugzilla: https://bugzilla.redhat.com/2120968

commit 00f146413ccb6c84308e559281449755c83f54c5
Author: Roberto Sassu <roberto.sassu@huawei.com>
Date:   Tue Sep 20 09:59:40 2022 +0200

    btf: Export bpf_dynptr definition
    
    eBPF dynamic pointers are a new feature recently added upstream. A dynptr binds
    together a pointer to a memory area and its size. The internal kernel
    structure bpf_dynptr_kern is not accessible by eBPF programs in user space.
    They instead see bpf_dynptr, which is then translated to the internal
    kernel structure by the eBPF verifier.
    
    The problem is that it is not possible to include at the same time the uapi
    include linux/bpf.h and the vmlinux BTF vmlinux.h, as they both contain the
    definition of some structures/enums. The compiler complains saying that the
    structures/enums are redefined.
    
    As bpf_dynptr is defined in the uapi include linux/bpf.h, this makes it
    impossible to include vmlinux.h. However, in some cases, e.g. when using
    kfuncs, vmlinux.h has to be included. The only option until now was to
    include vmlinux.h and add the definition of bpf_dynptr directly in the eBPF
    program source code from linux/bpf.h.
    
    Solve the problem by using the same approach as for bpf_timer (which also
    follows the same scheme with the _kern suffix for the internal kernel
    structure).
    
    Add the following line in one of the dynamic pointer helpers,
    bpf_dynptr_from_mem():
    
    BTF_TYPE_EMIT(struct bpf_dynptr);
    
    Cc: stable@vger.kernel.org
    Cc: Joanne Koong <joannelkoong@gmail.com>
    Fixes: 97e03f521050c ("bpf: Add verifier support for dynptrs")
    Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Tested-by: KP Singh <kpsingh@kernel.org>
    Link: https://lore.kernel.org/r/20220920075951.929132-3-roberto.sassu@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:08 +02:00
Yauheni Kaliuta 7d21225e01 bpf: Gate dynptr API behind CAP_BPF
Bugzilla: https://bugzilla.redhat.com/2120968

commit 8addbfc7b308d591f8a5f2f6bb24d08d9d79dfbb
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Wed Sep 21 16:35:50 2022 +0200

    bpf: Gate dynptr API behind CAP_BPF
    
    This has been enabled for unprivileged programs for only one kernel
    release, hence the expected annoyances due to this move are low. Users
    using ringbuf can stick to non-dynptr APIs. The actual use cases dynptr
    is meant to serve may not make sense in unprivileged BPF programs.
    
    Hence, gate these helpers behind CAP_BPF and limit use to privileged
    BPF programs.
    
    Fixes: 263ae152e962 ("bpf: Add bpf_dynptr_from_mem for local dynptrs")
    Fixes: bc34dee65a65 ("bpf: Dynptr support for ring buffers")
    Fixes: 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write")
    Fixes: 34d4ef5775f7 ("bpf: Add dynptr data slices")
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20220921143550.30247-1-memxor@gmail.com
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:08 +02:00
Yauheni Kaliuta aeaa8ffc97 bpf: Add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs
Bugzilla: https://bugzilla.redhat.com/2120968

commit f8d3da4ef8faf027261e06b7864583930dd7c7b9
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Wed Jul 6 16:25:47 2022 -0700

    bpf: Add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs
    
    Commit 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write")
    added the bpf_dynptr_write() and bpf_dynptr_read() APIs.
    
    However, it will be needed for some dynptr types to pass in flags as
    well (e.g. when writing to a skb, the user may want to invalidate the
    hash or recompute the checksum).
    
    This patch adds a "u64 flags" arg to the bpf_dynptr_read() and
    bpf_dynptr_write() APIs before their UAPI signature freezes where
    we then cannot change them anymore with a 5.19.x released kernel.
    
    Fixes: 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write")
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20220706232547.4016651-1-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:08 +02:00
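
A minimal sketch of the updated signatures in use, with the new trailing
flags argument passed as 0; the backing array map and the trigger are
illustrative assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct scratch {
        char payload[16];
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct scratch);
} scratch_map SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_getpid")
int dynptr_rw(void *ctx)
{
        struct bpf_dynptr ptr;
        struct scratch *s;
        __u32 key = 0;
        char byte = 0x42;

        s = bpf_map_lookup_elem(&scratch_map, &key);
        if (!s)
                return 0;
        if (bpf_dynptr_from_mem(s, sizeof(*s), 0, &ptr))
                return 0;

        /* Both helpers now take a trailing u64 flags argument. */
        bpf_dynptr_write(&ptr, 0, &byte, sizeof(byte), 0);
        bpf_dynptr_read(&byte, sizeof(byte), &ptr, 0, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
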
Yauheni Kaliuta a4778555c5 bpf: Add dynptr data slices
Bugzilla: https://bugzilla.redhat.com/2120968

commit 34d4ef5775f776ec4b0d53a02d588bf3195cada6
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon May 23 14:07:11 2022 -0700

    bpf: Add dynptr data slices
    
    This patch adds a new helper function
    
    void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len);
    
    which returns a pointer to the underlying data of a dynptr. *len*
    must be a statically known value. The bpf program may access the returned
    data slice as a normal buffer (eg can do direct reads and writes), since
    the verifier associates the length with the returned pointer, and
    enforces that no out of bounds accesses occur.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20220523210712.3641569-6-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
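
A minimal sketch of obtaining a data slice from a ring-buffer dynptr and
writing through it directly; the ring buffer size and the trigger are
illustrative assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 4096);
} rb SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_getpid")
int use_data_slice(void *ctx)
{
        struct bpf_dynptr ptr;
        __u32 *slice;

        if (bpf_ringbuf_reserve_dynptr(&rb, sizeof(__u32), 0, &ptr)) {
                bpf_ringbuf_discard_dynptr(&ptr, 0);
                return 0;
        }

        /* len must be statically known; the verifier bounds-checks the slice. */
        slice = bpf_dynptr_data(&ptr, 0, sizeof(__u32));
        if (slice)
                *slice = bpf_get_smp_processor_id();

        bpf_ringbuf_submit_dynptr(&ptr, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
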
Yauheni Kaliuta db083fdfd1 bpf: Add bpf_dynptr_read and bpf_dynptr_write
Bugzilla: https://bugzilla.redhat.com/2120968

commit 13bbbfbea7598ea9f8d9c3d73bf053bb57f9c4b2
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon May 23 14:07:10 2022 -0700

    bpf: Add bpf_dynptr_read and bpf_dynptr_write
    
    This patch adds two helper functions, bpf_dynptr_read and
    bpf_dynptr_write:
    
    long bpf_dynptr_read(void *dst, u32 len, struct bpf_dynptr *src, u32 offset);
    
    long bpf_dynptr_write(struct bpf_dynptr *dst, u32 offset, void *src, u32 len);
    
    The dynptr passed into these functions must be valid dynptrs that have
    been initialized.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220523210712.3641569-5-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
Yauheni Kaliuta b08a7de6e2 bpf: Dynptr support for ring buffers
Bugzilla: https://bugzilla.redhat.com/2120968

commit bc34dee65a65e9c920c420005b8a43f2a721a458
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon May 23 14:07:09 2022 -0700

    bpf: Dynptr support for ring buffers
    
    Currently, our only way of writing dynamically-sized data into a ring
    buffer is through bpf_ringbuf_output but this incurs an extra memcpy
    cost. bpf_ringbuf_reserve + bpf_ringbuf_commit avoids this extra
    memcpy, but it can only safely support reservation sizes that are
    statically known since the verifier cannot guarantee that the bpf
    program won’t access memory outside the reserved space.
    
    The bpf_dynptr abstraction allows for dynamically-sized ring buffer
    reservations without the extra memcpy.
    
    There are 3 new APIs:
    
    long bpf_ringbuf_reserve_dynptr(void *ringbuf, u32 size, u64 flags, struct bpf_dynptr *ptr);
    void bpf_ringbuf_submit_dynptr(struct bpf_dynptr *ptr, u64 flags);
    void bpf_ringbuf_discard_dynptr(struct bpf_dynptr *ptr, u64 flags);
    
    These closely follow the functionalities of the original ringbuf APIs.
    For example, all ringbuffer dynptrs that have been reserved must be
    either submitted or discarded before the program exits.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/bpf/20220523210712.3641569-4-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
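
A minimal sketch of the three new APIs used together for a ring buffer
reservation; the event layout, ring buffer size and trigger are illustrative
assumptions (the trailing flags argument on bpf_dynptr_write reflects the
later flags patch backported above):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
        __u32 pid;
        __u32 cpu;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 4096);
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_getpid")
int emit_event(void *ctx)
{
        struct bpf_dynptr ptr;
        struct event e;

        e.pid = bpf_get_current_pid_tgid() >> 32;
        e.cpu = bpf_get_smp_processor_id();

        if (bpf_ringbuf_reserve_dynptr(&events, sizeof(e), 0, &ptr)) {
                /* Every reserved dynptr must be submitted or discarded. */
                bpf_ringbuf_discard_dynptr(&ptr, 0);
                return 0;
        }

        bpf_dynptr_write(&ptr, 0, &e, sizeof(e), 0);
        bpf_ringbuf_submit_dynptr(&ptr, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
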
Yauheni Kaliuta dc363a11c4 bpf: Add bpf_dynptr_from_mem for local dynptrs
Bugzilla: https://bugzilla.redhat.com/2120968

commit 263ae152e96253f40c2c276faad8629e096b3bad
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon May 23 14:07:08 2022 -0700

    bpf: Add bpf_dynptr_from_mem for local dynptrs
    
    This patch adds a new api bpf_dynptr_from_mem:
    
    long bpf_dynptr_from_mem(void *data, u32 size, u64 flags, struct bpf_dynptr *ptr);
    
    which initializes a dynptr to point to a bpf program's local memory. For now
    only local memory that is of reg type PTR_TO_MAP_VALUE is supported.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220523210712.3641569-3-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
Yauheni Kaliuta ac78494f02 bpf: Add MEM_UNINIT as a bpf_type_flag
Bugzilla: https://bugzilla.redhat.com/2120968

commit 16d1e00c7e8a4950e914223b3112144289a82913
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon May 9 15:42:52 2022 -0700

    bpf: Add MEM_UNINIT as a bpf_type_flag
    
    Instead of having uninitialized versions of arguments as separate
    bpf_arg_types (eg ARG_PTR_TO_UNINIT_MEM as the uninitialized version
    of ARG_PTR_TO_MEM), we can instead use MEM_UNINIT as a bpf_type_flag
    modifier to denote that the argument is uninitialized.
    
    Doing so cleans up some of the logic in the verifier. We no longer
    need to do two checks against an argument type (eg "if
    (base_type(arg_type) == ARG_PTR_TO_MEM || base_type(arg_type) ==
    ARG_PTR_TO_UNINIT_MEM)"), since uninitialized and initialized
    versions of the same argument type will now share the same base type.
    
    In the near future, MEM_UNINIT will be used by dynptr helper functions
    as well.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20220509224257.3222614-2-joannelkoong@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:04 +02:00
Yauheni Kaliuta c0c280946f bpf: add bpf_map_lookup_percpu_elem for percpu map
Bugzilla: https://bugzilla.redhat.com/2120968

commit 07343110b293456d30393e89b86c4dee1ac051c8
Author: Feng Zhou <zhoufeng.zf@bytedance.com>
Date:   Wed May 11 17:38:53 2022 +0800

    bpf: add bpf_map_lookup_percpu_elem for percpu map
    
    Add a new ebpf helper, bpf_map_lookup_percpu_elem.
    
    The implementation is relatively simple: it follows the map_lookup_elem
    implementation for percpu maps, adds a cpu parameter, and looks up the
    value on the specified cpu.
    
    Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
    Link: https://lore.kernel.org/r/20220511093854.411-2-zhoufeng.zf@bytedance.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:04 +02:00
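
A minimal sketch of summing a per-CPU counter across a few CPUs with the new
helper; the map layout, the fixed CPU bound and the trigger are illustrative
assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} pcpu_counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_getpid")
int sum_counts(void *ctx)
{
        __u64 total = 0;
        __u32 key = 0, cpu;

        /* Plain bpf_map_lookup_elem would only see the current CPU's slot;
         * the new helper lets us read an explicitly chosen CPU's slot. */
        for (cpu = 0; cpu < 4; cpu++) {
                __u64 *val = bpf_map_lookup_percpu_elem(&pcpu_counts, &key, cpu);

                if (val)
                        total += *val;
        }
        bpf_printk("total=%llu", total);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
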
Yauheni Kaliuta f008a624b3 bpf: Allow storing referenced kptr in map
Bugzilla: https://bugzilla.redhat.com/2120968

commit c0a5a21c25f37c9fd7b36072f9968cdff1e4aa13
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Mon Apr 25 03:18:51 2022 +0530

    bpf: Allow storing referenced kptr in map
    
    Extending the code in previous commits, introduce referenced kptr
    support, which needs to be tagged using 'kptr_ref' tag instead. Unlike
    unreferenced kptr, referenced kptr have a lot more restrictions. In
    addition to the type matching, only a newly introduced bpf_kptr_xchg
    helper is allowed to modify the map value at that offset. This transfers
    the referenced pointer being stored into the map, releasing the
    references state for the program, and returning the old value and
    creating new reference state for the returned pointer.
    
    Similar to unreferenced pointer case, return value for this case will
    also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
    must either be eventually released by calling the corresponding release
    function, otherwise it must be transferred into another map.
    
    It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
    the value, and obtain the old value if any.
    
    BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
    commit will permit using BPF_LDX for such pointers, with an attempt at
    making it safe, since the lifetime of the object won't be guaranteed.
    
    There are valid reasons to enforce the restriction of permitting only
    bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
    consistent in face of concurrent modification, and any prior values
    contained in the map must also be released before a new one is moved
    into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
    returns the old value, which the verifier would require the user to
    either free or move into another map, and releases the reference held
    for the pointer being moved in.
    
    In the future, direct BPF_XCHG instruction may also be permitted to work
    like bpf_kptr_xchg helper.
    
    Note that process_kptr_func doesn't have to call
    check_helper_mem_access, since we already disallow rdonly/wronly flags
    for map, which is what check_map_access_type checks, and we already
    ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc,
    so check_map_access is also not required.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:52:11 +02:00
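
A minimal sketch of the ownership transfer described above. The
acquire/release kfuncs (my_obj_acquire/my_obj_release) and the object type
are purely hypothetical placeholders, and the __kptr_ref type tag spelling
follows the selftests of this era; only bpf_kptr_xchg() itself is the API
added by the commit:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Referenced-kptr type tag, as spelled at the time of this series. */
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

struct my_obj;  /* hypothetical kernel object type */

/* Hypothetical acquire/release kfuncs; any type with such a pair works alike. */
extern struct my_obj *my_obj_acquire(void) __ksym;
extern void my_obj_release(struct my_obj *p) __ksym;

struct map_value {
        struct my_obj __kptr_ref *obj;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_value);
} kptr_map SEC(".maps");

SEC("tc")
int store_ref(struct __sk_buff *skb)
{
        struct my_obj *p, *old;
        struct map_value *v;
        int key = 0;

        v = bpf_map_lookup_elem(&kptr_map, &key);
        if (!v)
                return 0;

        p = my_obj_acquire();
        if (!p)
                return 0;

        /* Move the new reference into the map and take ownership of the old one. */
        old = bpf_kptr_xchg(&v->obj, p);
        if (old)
                my_obj_release(old);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
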
Jerome Marchand 6102306761 bpf: Replace strncpy() with strscpy()
Bugzilla: https://bugzilla.redhat.com/2120966

commit 03b9c7fa3f15f51bcd07f3828c2a01311e7746c4
Author: Yuntao Wang <ytcoode@gmail.com>
Date:   Fri Mar 4 15:04:08 2022 +0800

    bpf: Replace strncpy() with strscpy()

    Using strncpy() on NUL-terminated strings is considered deprecated[1].
    Moreover, if the length of 'task->comm' is less than the destination buffer
    size, strncpy() will NUL-pad the destination buffer, which is a needless
    performance penalty.

    Replacing strncpy() with strscpy() fixes all these issues.

    [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings

    Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20220304070408.233658-1-ytcoode@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:58 +02:00
Jerome Marchand c5056baccd bpf: Cleanup comments
Bugzilla: https://bugzilla.redhat.com/2120966

commit c561d11063009323a0e57c528cb1d77b7d2c41e0
Author: Tom Rix <trix@redhat.com>
Date:   Sun Feb 20 10:40:55 2022 -0800

    bpf: Cleanup comments

    Add leading space to spdx tag
    Use // for spdx c file comment

    Replacements
    resereved to reserved
    inbetween to in between
    everytime to every time
    intutivie to intuitive
    currenct to current
    encontered to encountered
    referenceing to referencing
    upto to up to
    exectuted to executed

    Signed-off-by: Tom Rix <trix@redhat.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20220220184055.3608317-1-trix@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:51 +02:00
Jerome Marchand ea549a6a9d bpf: make bpf_copy_from_user_task() gpl only
Bugzilla: https://bugzilla.redhat.com/2120966

commit 0407a65f356e6d9340ad673907c17e52fade43e3
Author: Kenta Tada <Kenta.Tada@sony.com>
Date:   Sat Jan 29 02:09:06 2022 +0900

    bpf: make bpf_copy_from_user_task() gpl only

    access_process_vm() is exported by EXPORT_SYMBOL_GPL().

    Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
    Link: https://lore.kernel.org/r/20220128170906.21154-1-Kenta.Tada@sony.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:45 +02:00
Jerome Marchand e13edf1f6a bpf: Add bpf_copy_from_user_task() helper
Bugzilla: https://bugzilla.redhat.com/2120966

commit 376040e47334c6dc6a939a32197acceb00fe4acf
Author: Kenny Yu <kennyyu@fb.com>
Date:   Mon Jan 24 10:54:01 2022 -0800

    bpf: Add bpf_copy_from_user_task() helper

    This adds a helper for bpf programs to read the memory of other
    tasks.

    As an example use case at Meta, we are using a bpf task iterator program
    and this new helper to print C++ async stack traces for all threads of
    a given process.

    Signed-off-by: Kenny Yu <kennyyu@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20220124185403.468466-3-kennyyu@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:43 +02:00
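
A minimal sketch of reading another task's user memory from a sleepable task
iterator, in the spirit of the use case above; the hard-coded user address is
a placeholder (a real tool would receive addresses from user space, e.g. via
a map), and the iterator hook is an illustrative choice:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("iter.s/task")
int dump_task_word(struct bpf_iter__task *ctx)
{
        struct task_struct *task = ctx->task;
        /* Placeholder user-space address; purely illustrative. */
        const void *user_ptr = (const void *)0x7fffffffe000UL;
        __u64 word = 0;

        if (!task)
                return 0;

        /* Sleepable helper: may fault in the remote task's memory. */
        if (bpf_copy_from_user_task(&word, sizeof(word), user_ptr, task, 0))
                return 0;

        bpf_printk("pid %d word 0x%llx", task->pid, word);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
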
Artem Savkov 66a0aad37a bpf: Emit bpf_timer in vmlinux BTF
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 3bd916ee0ecbbdd902fc24845f2fef332b2a310c
Author: Yonghong Song <yhs@fb.com>
Date:   Fri Feb 11 11:49:48 2022 -0800

    bpf: Emit bpf_timer in vmlinux BTF

    Currently the following code in check_and_init_map_value()
      *(struct bpf_timer *)(dst + map->timer_off) =
          (struct bpf_timer){};
    can help generate the bpf_timer definition in vmlinux BTF.
    But the code above may not zero the whole structure
    due to anonymous members, and that code will be replaced
    by memset in the subsequent patch, after which the
    bpf_timer definition will disappear from vmlinux BTF.
    Let us emit the type explicitly so bpf programs can continue
    to use it from vmlinux.h.

    Signed-off-by: Yonghong Song <yhs@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220211194948.3141529-1-yhs@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:55 +02:00
Artem Savkov 7f76bfc54f bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 216e3cd2f28dbbf1fe86848e0e29e6693b9f0a20
Author: Hao Luo <haoluo@google.com>
Date:   Thu Dec 16 16:31:51 2021 -0800

    bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.

    Some helper functions may modify its arguments, for example,
    bpf_d_path, bpf_get_stack etc. Previously, their argument types
    were marked as ARG_PTR_TO_MEM, which is compatible with read-only
    mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
    but technically incorrect, to modify a read-only memory by passing
    it into one of such helper functions.

    This patch tags the bpf_args compatible with immutable memory with
    MEM_RDONLY flag. The arguments that don't have this flag will be
    only compatible with mutable memory types, preventing the helper
    from modifying a read-only memory. The bpf_args that have
    MEM_RDONLY are compatible with both mutable memory and immutable
    memory.

    Signed-off-by: Hao Luo <haoluo@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:50 +02:00
Artem Savkov 9bba0ecd96 bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 34d3a78c681e8e7844b43d1a2f4671a04249c821
Author: Hao Luo <haoluo@google.com>
Date:   Thu Dec 16 16:31:50 2021 -0800

    bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.

    Tag the return type of {per, this}_cpu_ptr with RDONLY_MEM. The
    returned value of this pair of helpers is kernel object, which
    can not be updated by bpf programs. Previously these two helpers
    returned PTR_TO_MEM for kernel objects of scalar type, which allowed
    one to directly modify the memory. Now with RDONLY_MEM tagging,
    the verifier will reject programs that write into RDONLY_MEM.

    Fixes: 63d9b80dcf ("bpf: Introducte bpf_this_cpu_ptr()")
    Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
    Fixes: 4976b718c3 ("bpf: Introduce pseudo_btf_id")
    Signed-off-by: Hao Luo <haoluo@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211217003152.48334-8-haoluo@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:50 +02:00
Artem Savkov a4975f77f2 bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 3c4807322660d4290ac9062c034aed6b87243861
Author: Hao Luo <haoluo@google.com>
Date:   Thu Dec 16 16:31:46 2021 -0800

    bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL

    We have introduced a new type to make bpf_ret composable, by
    reserving high bits to represent flags.

    One of the flags is PTR_MAYBE_NULL, which indicates a pointer
    may be NULL. When applying this flag to ret_types, it means
    the returned value could be a NULL pointer. This patch
    switches the qualified ret_types to use this flag.
    The ret_types changed in this patch include:

    1. RET_PTR_TO_MAP_VALUE_OR_NULL
    2. RET_PTR_TO_SOCKET_OR_NULL
    3. RET_PTR_TO_TCP_SOCK_OR_NULL
    4. RET_PTR_TO_SOCK_COMMON_OR_NULL
    5. RET_PTR_TO_ALLOC_MEM_OR_NULL
    6. RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL
    7. RET_PTR_TO_BTF_ID_OR_NULL

    This patch doesn't eliminate the use of these names, instead
    it makes them aliases to 'RET_PTR_TO_XXX | PTR_MAYBE_NULL'.

    Signed-off-by: Hao Luo <haoluo@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211217003152.48334-4-haoluo@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:49 +02:00
Artem Savkov 75a645a56c add missing bpf-cgroup.h includes
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit aef2feda97b840ec38e9fa53d0065188453304e8
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Dec 15 18:55:37 2021 -0800

    add missing bpf-cgroup.h includes

    We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency,
    make sure those who actually need more than the definition of
    struct cgroup_bpf include bpf-cgroup.h explicitly.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:49 +02:00
Artem Savkov f7cd3db52f bpf: Add bpf_strncmp helper
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit c5fb19937455095573a19ddcbff32e993ed10e35
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Dec 10 22:16:49 2021 +0800

    bpf: Add bpf_strncmp helper

    The helper compares two strings: one string is a null-terminated
    read-only string, and another string has const max storage size
    but doesn't need to be null-terminated. It can be used to compare
    file name in tracing or LSM program.

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211210141652.877186-2-houtao1@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:46 +02:00
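
A minimal sketch of the helper comparing the current task's comm against a
constant pattern; the pattern, hook and buffer size are illustrative
assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Read-only, null-terminated string (placed in .rodata). */
const char target_comm[] = "systemd";

SEC("tracepoint/syscalls/sys_enter_execve")
int match_comm(void *ctx)
{
        char comm[16] = {};

        bpf_get_current_comm(comm, sizeof(comm));

        /* comm has a const max size but need not be null-terminated;
         * target_comm must be the null-terminated read-only string. */
        if (!bpf_strncmp(comm, sizeof(comm), target_comm))
                bpf_printk("matched %s", comm);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
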
Artem Savkov 59cc18238f bpf: Add bpf_loop helper
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit e6f2dd0f80674e9d5960337b3e9c2a242441b326
Author: Joanne Koong <joannekoong@fb.com>
Date:   Mon Nov 29 19:06:19 2021 -0800

    bpf: Add bpf_loop helper

    This patch adds the kernel-side and API changes for a new helper
    function, bpf_loop:

    long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
    u64 flags);

    where long (*callback_fn)(u32 index, void *ctx);

    bpf_loop invokes the "callback_fn" **nr_loops** times or until the
    callback_fn returns 1. The callback_fn can only return 0 or 1, and
    this is enforced by the verifier. The callback_fn index is zero-indexed.

    A few things to please note:
    ~ The "u64 flags" parameter is currently unused but is included in
    case a future use case for it arises.
    ~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
    bpf_callback_t is used as the callback function cast.
    ~ A program can have nested bpf_loop calls but the program must
    still adhere to the verifier constraint of its stack depth (the stack depth
    cannot exceed MAX_BPF_STACK)
    ~ Recursive callback_fns do not pass the verifier, due to the call stack
    for these being too deep.
    ~ The next patch will include the tests and benchmark

    Signed-off-by: Joanne Koong <joannekoong@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:41 +02:00
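
A minimal sketch of the helper and callback shape described above; the loop
body, context struct and trigger are illustrative assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct loop_ctx {
        __u64 sum;
};

static long add_index(__u32 index, void *ctx)
{
        struct loop_ctx *lc = ctx;

        lc->sum += index;
        return 0;       /* 0 = continue looping, 1 = break out early */
}

SEC("tracepoint/syscalls/sys_enter_getpid")
int use_bpf_loop(void *ctx)
{
        struct loop_ctx lc = { .sum = 0 };

        /* Run the callback 100 times; the flags argument is currently unused. */
        bpf_loop(100, add_index, &lc, 0);
        bpf_printk("sum=%llu", lc.sum);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
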
Yauheni Kaliuta c164813e34 bpf: Replace callers of BPF_CAST_CALL with proper function typedef
Bugzilla: http://bugzilla.redhat.com/2069045

commit 102acbacfd9a96d101abd96d1a7a5bf92b7c3e8e
Author: Kees Cook <keescook@chromium.org>
Date:   Tue Sep 28 16:09:46 2021 -0700

    bpf: Replace callers of BPF_CAST_CALL with proper function typedef
    
    In order to keep ahead of cases in the kernel where Control Flow
    Integrity (CFI) may trip over function call casts, enabling
    -Wcast-function-type is helpful. To that end, BPF_CAST_CALL causes
    various warnings and is one of the last places in the kernel
    triggering this warning.
    
    For actual function calls, replace BPF_CAST_CALL() with a typedef, which
    captures the same details about the given function pointers.
    
    This change results in no object code difference.
    
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
    Link: https://github.com/KSPP/linux/issues/20
    Link: https://lore.kernel.org/lkml/CAEf4Bzb46=-J5Fxc3mMZ8JQPtK1uoE0q6+g6WPz53Cvx=CBEhw@mail.gmail.com
    Link: https://lore.kernel.org/bpf/20210928230946.4062144-3-keescook@chromium.org

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:36 +03:00
Yauheni Kaliuta 3c3c123ddb bpf: Add bpf_trace_vprintk helper
Bugzilla: http://bugzilla.redhat.com/2069045

commit 10aceb629e198429c849d5e995c3bb1ba7a9aaa3
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Fri Sep 17 11:29:05 2021 -0700

    bpf: Add bpf_trace_vprintk helper
    
    This helper is meant to be "bpf_trace_printk, but with proper vararg
    support". Follow bpf_snprintf's example and take a u64 pseudo-vararg
    array. Write to /sys/kernel/debug/tracing/trace_pipe using the same
    mechanism as bpf_trace_printk. The functionality of this helper was
    requested in the libbpf issue tracker [0].
    
    [0] Closes: https://github.com/libbpf/libbpf/issues/315
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210917182911.2426606-4-davemarchevsky@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:16:14 +03:00
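
A minimal sketch of the pseudo-vararg calling convention described above,
with every argument packed into a u64 array; the format string and trigger
are illustrative assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_getpid")
int vprintk_example(void *ctx)
{
        static const char fmt[] = "pid=%d cpu=%u ts=%llu";
        __u64 args[3];

        args[0] = bpf_get_current_pid_tgid() >> 32;
        args[1] = bpf_get_smp_processor_id();
        args[2] = bpf_ktime_get_ns();

        /* Unlike bpf_trace_printk, this is not limited to three arguments. */
        bpf_trace_vprintk(fmt, sizeof(fmt), args, sizeof(args));
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
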
Yauheni Kaliuta b2847aa466 bpf: Merge printk and seq_printf VARARG max macros
Bugzilla: http://bugzilla.redhat.com/2069045

commit 335ff4990cf3bfa42d8846f9b3d8c09456f51801
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Fri Sep 17 11:29:03 2021 -0700

    bpf: Merge printk and seq_printf VARARG max macros
    
    MAX_SNPRINTF_VARARGS and MAX_SEQ_PRINTF_VARARGS are used by bpf helpers
    bpf_snprintf and bpf_seq_printf to limit their varargs. Both call into
    bpf_bprintf_prepare for print formatting logic and have convenience
    macros in libbpf (BPF_SNPRINTF, BPF_SEQ_PRINTF) which use the same
    helper macros to convert varargs to a byte array.
    
    Changing shared functionality to support more varargs for either bpf
    helper would affect the other as well, so let's combine the _VARARGS
    macros to make this more obvious.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210917182911.2426606-2-davemarchevsky@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:16:14 +03:00
Jerome Marchand c6feb8361c bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
Bugzilla: https://bugzilla.redhat.com/2041365

Conflicts: Minor context change from missing commit eb18b49ea758 ("bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt")

commit 5e0bc3082e2e403ac0753e099c2b01446bb35578
Author: Dmitrii Banshchikov <me@ubique.spb.ru>
Date:   Sat Nov 13 18:22:26 2021 +0400

    bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs

    Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
    progs may result in locking issues.

    bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that
    isn't safe for any context:
    ======================================================
    WARNING: possible circular locking dependency detected
    5.15.0-syzkaller #0 Not tainted
    ------------------------------------------------------
    syz-executor.4/14877 is trying to acquire lock:
    ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255

    but task is already holding lock:
    ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
           lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
           __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
           _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
           __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
           debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
           debug_init kernel/time/hrtimer.c:468 [inline]
           hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
           ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
           ntp_init+0xa1/0xad kernel/time/ntp.c:1095
           timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
           start_kernel+0x267/0x56e init/main.c:1030
           secondary_startup_64_no_verify+0xb1/0xbb

    -> #0 (tk_core.seq.seqcount){----}-{0:0}:
           check_prev_add kernel/locking/lockdep.c:3051 [inline]
           check_prevs_add kernel/locking/lockdep.c:3174 [inline]
           validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
           __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
           lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
           seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
           ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
           ktime_get_coarse include/linux/timekeeping.h:120 [inline]
           ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
           ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
           bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
           bpf_prog_a99735ebafdda2f1+0x10/0xb50
           bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
           __bpf_prog_run include/linux/filter.h:626 [inline]
           bpf_prog_run include/linux/filter.h:633 [inline]
           BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
           trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
           perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
           perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
           trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
           lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
           __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
           _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
           debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
           debug_deactivate kernel/time/hrtimer.c:481 [inline]
           __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
           __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
           hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
           local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
           __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
           sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
           asm_sysvec_apic_timer_interrupt+0x12/0x20
           __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
           _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
           try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
           wake_up_process kernel/sched/core.c:4200 [inline]
           wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
           futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
           do_futex+0x367/0x560 kernel/futex/syscalls.c:127
           __do_sys_futex kernel/futex/syscalls.c:199 [inline]
           __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
           do_syscall_x64 arch/x86/entry/common.c:50 [inline]
           do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
           entry_SYSCALL_64_after_hwframe+0x44/0xae

    There is a possible deadlock with bpf_timer_* set of helpers:
    hrtimer_start()
      lock_base();
      trace_hrtimer...()
        perf_event()
          bpf_run()
            bpf_timer_start()
              hrtimer_start()
                lock_base()         <- DEADLOCK

    Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
    BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
    and BPF_PROG_TYPE_RAW_TRACEPOINT prog types.

    Fixes: d055126180 ("bpf: Add bpf_ktime_get_coarse_ns helper")
    Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
    Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com
    Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:16 +02:00
Jerome Marchand 2682e1df7c bpf: Add bpf_task_pt_regs() helper
Bugzilla: http://bugzilla.redhat.com/2041365

commit dd6e10fbd9fb86a571d925602c8a24bb4d09a2a7
Author: Daniel Xu <dxu@dxuuu.xyz>
Date:   Mon Aug 23 19:43:49 2021 -0700

    bpf: Add bpf_task_pt_regs() helper

    The motivation behind this helper is to access userspace pt_regs in a
    kprobe handler.

    uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
    pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
    kprobe handler. The final case (kernelspace pt_regs in uprobe) is
    pretty rare (usermode helper) so I think that can be solved later if
    necessary.

    More concretely, this helper is useful in doing BPF-based DWARF stack
    unwinding. Currently the kernel can only do framepointer based stack
    unwinds for userspace code. This is because the DWARF state machines are
    too fragile to be computed in kernelspace [0]. The idea behind
    DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
    stack (while in prog context) and send it up to userspace for unwinding
    (probably with libunwind) [1]. This would effectively enable profiling
    applications with -fomit-frame-pointer using kprobes and uprobes.

    [0]: https://lkml.org/lkml/2012/2/10/356
    [1]: https://github.com/danobi/bpf-dwarf-walk

    Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/e2718ced2d51ef4268590ab8562962438ab82815.1629772842.git.dxu@dxuuu.xyz

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:47 +02:00
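
A minimal sketch of grabbing the current task's saved user-space registers
from a kprobe, per the motivation above; the probed function is an
illustrative choice and the ->sp field name assumes an x86_64 vmlinux.h:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_nanosleep")
int peek_user_regs(struct pt_regs *ctx)
{
        struct task_struct *task = bpf_get_current_task_btf();
        struct pt_regs *user_regs = bpf_task_pt_regs(task);

        /* ctx holds the kernel registers at the probe; user_regs holds the
         * task's user-space registers (useful for user stack unwinding). */
        bpf_printk("user sp=0x%lx", user_regs->sp);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
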
Jerome Marchand 020f743d7a bpf: Extend bpf_base_func_proto helpers with bpf_get_current_task_btf()
Bugzilla: http://bugzilla.redhat.com/2041365

commit a396eda5517ac958fb4eb7358f4708eb829058c4
Author: Daniel Xu <dxu@dxuuu.xyz>
Date:   Mon Aug 23 19:43:48 2021 -0700

    bpf: Extend bpf_base_func_proto helpers with bpf_get_current_task_btf()

    bpf_get_current_task() is already supported so it's natural to also
    include the _btf() variant for btf-powered helpers.

    This is required for non-tracing progs to use bpf_task_pt_regs() in the
    next commit.

    Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/f99870ed5f834c9803d73b3476f8272b1bb987c0.1629772842.git.dxu@dxuuu.xyz

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:47 +02:00
Jerome Marchand fd4b1bfec6 bpf: Support "%c" in bpf_bprintf_prepare().
Bugzilla: http://bugzilla.redhat.com/2041365

commit 3478cfcfcddff0f3aad82891be2992e51c4f7936
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Sat Aug 14 10:57:16 2021 +0900

    bpf: Support "%c" in bpf_bprintf_prepare().

    /proc/net/unix uses "%c" to print a single-byte character to escape '\0' in
    the name of the abstract UNIX domain socket.  The following selftest uses
    it, so this patch adds support for "%c".  Note that it does not support
    wide character ("%lc" and "%llc") for simplicity.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210814015718.42704-3-kuniyu@amazon.co.jp

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:40 +02:00
Jerome Marchand 8bdb9654b1 bpf: Add ambient BPF runtime context stored in current
Bugzilla: http://bugzilla.redhat.com/2041365

Conflicts: Minor conflict with commit a2baf4e8bb ("bpf: Fix
potentially incorrect results with bpf_get_local_storage()")

commit c7603cfa04e7c3a435b31d065f7cbdc829428f6e
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Mon Jul 12 16:06:15 2021 -0700

    bpf: Add ambient BPF runtime context stored in current

    b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
    helper") fixed the problem with cgroup-local storage use in BPF by
    pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
    possible BPF program preemptions and nested executions.

    While this seems to work well in practice, it introduces a new and unnecessary
    failure mode in which not all BPF programs might be executed if we fail to
    find an unused slot for cgroup storage, however unlikely that is. It might also
    not be so unlikely when/if we allow sleepable cgroup BPF programs in the
    future.

    Further, the way that cgroup storage is implemented as an ambiently-available
    property during the entire BPF program execution is a convenient way to pass extra
    information to BPF programs and helpers without requiring user code to pass
    around extra arguments explicitly. So it would be good to have a generic
    solution that can allow implementing this without arbitrary restrictions.
    Ideally, such solution would work for both preemptable and sleepable BPF
    programs in exactly the same way.

    This patch introduces such solution, bpf_run_ctx. It adds one pointer field
    (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
    macros in such a way that it always stays valid throughout BPF program
    execution. BPF program preemption is handled by remembering previous
    current->bpf_ctx value locally while executing nested BPF program and
    restoring old value after nested BPF program finishes. This is handled by two
    helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
    supposed to be used before and after BPF program runs, respectively.
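    A condensed sketch of that save/restore pattern as a program-run site
    might use it (variable names and the run_ctx member layout are
    assumptions; the real call sites live in the BPF_PROG_RUN family of
    macros):

        struct bpf_cg_run_ctx run_ctx;
        struct bpf_run_ctx *old_run_ctx;

        old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); /* current->bpf_ctx = &run_ctx */
        ret = BPF_PROG_RUN(prog, ctx);                   /* nested runs save/restore again */
        bpf_reset_run_ctx(old_run_ctx);                  /* put the previous pointer back */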

    Restoring the old value of the pointer handles preemption, while the bpf_run_ctx
    pointer being a property of current task_struct naturally solves this problem
    for sleepable BPF programs by "following" BPF program execution as it is
    scheduled in and out of CPU. It would even allow CPU migration of BPF
    programs, even though it's not currently allowed by BPF infra.

    This patch cleans up cgroup local storage handling as a first application. The
    design itself is generic, though, with bpf_run_ctx being an empty struct that
    is supposed to be embedded into a specific struct for a given BPF program type
    (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
    this mechanism for other uses within tracing BPF programs.

    To verify that this change doesn't revert the fix to the original cgroup
    storage issue, I ran the same repro as in the original report ([0]) and didn't
    get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
    bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).

      [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/

    Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:33 +02:00
Jerome Marchand 3a219de31f bpf: Implement verifier support for validation of async callbacks.
Bugzilla: http://bugzilla.redhat.com/2041365

commit bfc6bb74e4f16ab264fa73398a7a79d7d2afac2e
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Jul 14 17:54:14 2021 -0700

    bpf: Implement verifier support for validation of async callbacks.

    bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
    PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
    and pass them into the helpers as function callbacks.
    In case of bpf_for_each_map_elem() the callback is invoked synchronously
    and the verifier treats it as a normal subprogram call by adding another
    bpf_func_state and new frame in __check_func_call().
    bpf_timer_set_callback() doesn't invoke the callback directly.
    The subprogram will be called asynchronously from bpf_timer_cb().
    Teach the verifier to validate such async callbacks as a special kind of
    jump by pushing the verifier state onto the stack and letting pop_stack() process it.

    Special care needs to be taken during state pruning.
    The call insn doing bpf_timer_set_callback has to be a prune_point.
    Otherwise short timer callbacks might not have prune points in front of
    bpf_timer_set_callback(), which means is_state_visited() will be called
    after this call insn is processed in __check_func_call(). That in turn means
    another async_cb state will be pushed to be walked later and the verifier
    will eventually hit the BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
    Since push_async_cb() looks like another push_stack() branch, the
    infinite loop detection will trigger a false positive. To recognize
    this case, mark such states as in_async_callback_fn.
    To distinguish infinite loop in async callback vs the same callback called
    with different arguments for different map and timer add async_entry_cnt
    to bpf_func_state.

    Enforce return zero from async callbacks.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:31 +02:00
Jerome Marchand 00ace47307 bpf: Introduce bpf timers.
Bugzilla: http://bugzilla.redhat.com/2041365

commit b00628b1c7d595ae5b544e059c27b1f5828314b4
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Jul 14 17:54:09 2021 -0700

    bpf: Introduce bpf timers.

    Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
    in hash/array/lru maps as a regular field and helpers to operate on it:

    // Initialize the timer.
    // First 4 bits of 'flags' specify clockid.
    // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
    long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);

    // Configure the timer to call 'callback_fn' static function.
    long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);

    // Arm the timer to expire 'nsec' nanoseconds from the current time.
    long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);

    // Cancel the timer and wait for callback_fn to finish if it was running.
    long bpf_timer_cancel(struct bpf_timer *timer);

    Here is what a BPF program might look like:
    struct map_elem {
        int counter;
        struct bpf_timer timer;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1000);
        __type(key, int);
        __type(value, struct map_elem);
    } hmap SEC(".maps");

    static int timer_cb(void *map, int *key, struct map_elem *val);
    /* val points to particular map element that contains bpf_timer. */

    SEC("fentry/bpf_fentry_test1")
    int BPF_PROG(test1, int a)
    {
        struct map_elem *val;
        int key = 0;

        val = bpf_map_lookup_elem(&hmap, &key);
        if (val) {
            bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
            bpf_timer_set_callback(&val->timer, timer_cb);
            bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
        }
    }

    This patch adds helper implementations that rely on hrtimers
    to call bpf functions as timers expire.
    The following patches add necessary safety checks.

    Only programs with CAP_BPF are allowed to use bpf_timer.

    The amount of timers used by the program is constrained by
    the memcg recorded at map creation time.

    The bpf_timer_init() helper needs an explicit 'map' argument because inner maps
    are dynamic and not known at load time, while bpf_timer_set_callback() is
    receiving a hidden 'aux->prog' argument supplied by the verifier.

    The prog pointer is needed to do refcnting of bpf program to make sure that
    program doesn't get freed while the timer is armed. This approach relies on
    "user refcnt" scheme used in prog_array that stores bpf programs for
    bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
    paired with bpf_timer_cancel() that will drop the prog refcnt. The
    ops->map_release_uref is responsible for cancelling the timers and dropping
    prog refcnt when user space reference to a map reaches zero.
    This uref approach is done to make sure that Ctrl-C of user space process will
    not leave timers running forever unless the user space explicitly pinned a map
    that contained timers in bpffs.

    bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
    have user references (is not held by open file descriptor from user space and
    not pinned in bpffs).

    The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
    and free the timer if given map element had it allocated.
    "bpftool map update" command can be used to cancel timers.

    The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
    '__u64 :64' is an anonymous 8-byte bitfield with only 1 byte alignment.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:31 +02:00
Jerome Marchand 3551d32622 bpf: Factor out bpf_spin_lock into helpers.
Bugzilla: http://bugzilla.redhat.com/2041365

commit c1b3fed319d32a721d4b9c17afaeb430444ff773
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Jul 14 17:54:08 2021 -0700

    bpf: Factor out bpf_spin_lock into helpers.

    Move ____bpf_spin_lock/unlock into helpers to make it more clear
    that quadruple underscore bpf_spin_lock/unlock are irqsave/restore variants.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:31 +02:00
Yonghong Song 2d3a1e3615 bpf: Add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() helpers
Currently, if the bpf_get_current_cgroup_id() or
bpf_get_current_ancestor_cgroup_id() helper is
called from sleepable programs, e.g., sleepable
fentry/fmod_ret/fexit/lsm programs, an rcu warning
may appear. For example, if I added the following
hack to test_progs/test_lsm sleepable fentry program
test_sys_setdomainname:

  --- a/tools/testing/selftests/bpf/progs/lsm.c
  +++ b/tools/testing/selftests/bpf/progs/lsm.c
  @@ -168,6 +168,10 @@ int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs)
          int buf = 0;
          long ret;

  +       __u64 cg_id = bpf_get_current_cgroup_id();
  +       if (cg_id == 1000)
  +               copy_test++;
  +
          ret = bpf_copy_from_user(&buf, sizeof(buf), ptr);
          if (len == -2 && ret == 0 && buf == 1234)
                  copy_test++;

I will hit the following rcu warning:

  include/linux/cgroup.h:481 suspicious rcu_dereference_check() usage!
  other info that might help us debug this:
    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by test_progs/260:
      #0: ffffffffa5173360 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xa0
    stack backtrace:
    CPU: 1 PID: 260 Comm: test_progs Tainted: G           O      5.14.0-rc2+ #176
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:
      dump_stack_lvl+0x56/0x7b
      bpf_get_current_cgroup_id+0x9c/0xb1
      bpf_prog_a29888d1c6706e09_test_sys_setdomainname+0x3e/0x89c
      bpf_trampoline_6442469132_0+0x2d/0x1000
      __x64_sys_setdomainname+0x5/0x110
      do_syscall_64+0x3a/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae

I can get similar warning using bpf_get_current_ancestor_cgroup_id() helper.
syzbot reported a similar issue in [1] for syscall program. Helper
bpf_get_current_cgroup_id() or bpf_get_current_ancestor_cgroup_id()
has the following callchain:
   task_dfl_cgroup
     task_css_set
       task_css_set_check
and we have
   #define task_css_set_check(task, __c)                                   \
           rcu_dereference_check((task)->cgroups,                          \
                   lockdep_is_held(&cgroup_mutex) ||                       \
                   lockdep_is_held(&css_set_lock) ||                       \
                   ((task)->flags & PF_EXITING) || (__c))
Since cgroup_mutex/css_set_lock is not held, the task
is not exiting, and rcu_read_lock() is not held, a warning
will be issued. Note that a sleepable bpf program is protected by
rcu_read_lock_trace().

The above sleepable bpf programs are already protected
by migrate_disable(). Adding rcu_read_lock() in these
two helpers will silence the above warning.
I marked the patch as fixing 95b861a793
("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
which added bpf_get_current_ancestor_cgroup_id() to tracing programs
in 5.14. I think backporting to 5.14 is probably good enough as sleepable
programs are not widely used.

This patch should fix [1] as well since syscall program is a sleepable
program protected with migrate_disable().

 [1] https://lore.kernel.org/bpf/0000000000006d5cab05c7d9bb87@google.com/
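A minimal sketch of the fix described above; the BPF_CALL_0 wrapper mirrors
the helper's existing shape in kernel/bpf/helpers.c and should be treated as
an approximation:

    BPF_CALL_0(bpf_get_current_cgroup_id)
    {
        struct cgroup *cgrp;
        u64 cgrp_id;

        rcu_read_lock();                  /* satisfies task_css_set_check() */
        cgrp = task_dfl_cgroup(current);
        cgrp_id = cgroup_id(cgrp);
        rcu_read_unlock();

        return cgrp_id;
    }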

Fixes: 95b861a793 ("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
Reported-by: syzbot+7ee5c2c09c284495371f@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810230537.2864668-1-yhs@fb.com
2021-08-11 11:45:43 -07:00
Yonghong Song a2baf4e8bb bpf: Fix potentially incorrect results with bpf_get_local_storage()
Commit b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed a bug for bpf_get_local_storage() helper so different tasks
won't mess up with each other's percpu local storage.

The percpu data contains 8 slots so it can hold up to 8 contexts (same or
different tasks), for 8 different program runs, at the same time. This in
general is sufficient. But our internal testing showed the following warning
multiple times:

  [...]
  warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
     __cgroup_bpf_run_filter_sock_ops+0x13e/0x180
  RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
  <IRQ>
   tcp_call_bpf.constprop.99+0x93/0xc0
   tcp_conn_request+0x41e/0xa50
   ? tcp_rcv_state_process+0x203/0xe00
   tcp_rcv_state_process+0x203/0xe00
   ? sk_filter_trim_cap+0xbc/0x210
   ? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
   tcp_v6_do_rcv+0x181/0x3e0
   tcp_v6_rcv+0xc65/0xcb0
   ip6_protocol_deliver_rcu+0xbd/0x450
   ip6_input_finish+0x11/0x20
   ip6_input+0xb5/0xc0
   ip6_sublist_rcv_finish+0x37/0x50
   ip6_sublist_rcv+0x1dc/0x270
   ipv6_list_rcv+0x113/0x140
   __netif_receive_skb_list_core+0x1a0/0x210
   netif_receive_skb_list_internal+0x186/0x2a0
   gro_normal_list.part.170+0x19/0x40
   napi_complete_done+0x65/0x150
   mlx5e_napi_poll+0x1ae/0x680
   __napi_poll+0x25/0x120
   net_rx_action+0x11e/0x280
   __do_softirq+0xbb/0x271
   irq_exit_rcu+0x97/0xa0
   common_interrupt+0x7f/0xa0
   </IRQ>
   asm_common_interrupt+0x1e/0x40
  RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
   ? __cgroup_bpf_run_filter_skb+0x378/0x4e0
   ? do_softirq+0x34/0x70
   ? ip6_finish_output2+0x266/0x590
   ? ip6_finish_output+0x66/0xa0
   ? ip6_output+0x6c/0x130
   ? ip6_xmit+0x279/0x550
   ? ip6_dst_check+0x61/0xd0
  [...]

Using drgn [0] to dump the percpu buffer contents showed that on this CPU
slot 0 is still available, but slots 1-7 are occupied and those tasks in
slots 1-7 mostly don't exist any more. So we might have issues in
bpf_cgroup_storage_unset().

Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
Currently, it tries to unset "current" slot with searching from the start.
So the following sequence is possible:

  1. A task is running and claims slot 0
  2. The running BPF program is done; it checked that slot 0 has the "task"
     and is ready to reset it to NULL (but has not yet).
  3. An interrupt happens, another BPF program runs and it claims slot 1
     with the *same* task.
  4. The unset() in interrupt context releases slot 0 since it matches "task".
  5. The interrupt is done, and the task in process context resets slot 0.

In the end, slot 1 is not reset and the same process can continue to occupy
slots 2-7. Finally, when the above steps 1-5 are repeated again, the step 3 BPF
program won't be able to claim an empty slot and a warning will be issued.

To fix the issue, for unset() function, we should traverse from the last slot
to the first. This way, the above issue can be avoided.

The same reverse traversal should also be done in bpf_get_local_storage() helper
itself. Otherwise, incorrect local storage may be returned to BPF program.

  [0] https://github.com/osandov/drgn
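A sketch of the reverse traversal described above; the slot array and the
BPF_CGROUP_STORAGE_NEST_MAX macro follow the shape of include/linux/bpf-cgroup.h
of that era and should be treated as assumptions:

    static inline void bpf_cgroup_storage_unset(void)
    {
        int i;

        /* walk from the last slot to the first so that a nested run which
         * claimed a later slot for the same task is not released here */
        for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
            if (likely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
                continue;

            this_cpu_write(bpf_cgroup_storage_info[i].task, NULL);
            return;
        }
    }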

Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810010413.1976277-1-yhs@fb.com
2021-08-10 10:27:16 +02:00
Daniel Borkmann 71330842ff bpf: Add _kernel suffix to internal lockdown_bpf_read
Rename LOCKDOWN_BPF_READ into LOCKDOWN_BPF_READ_KERNEL so we have naming
more consistent with a LOCKDOWN_BPF_WRITE_USER option that we are adding.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
2021-08-09 21:50:41 +02:00
Toke Høiland-Jørgensen 694cea395f bpf: Allow RCU-protected lookups to happen from bh context
XDP programs are called from a NAPI poll context, which means the RCU
reference liveness is ensured by local_bh_disable(). Add
rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
lockdep understands that the dereferences are safe from inside *either* an
rcu_read_lock() section *or* a local_bh_disable() section. While both
bh_disabled and rcu_read_lock() provide RCU protection, they are
semantically distinct, so we need both conditions to prevent lockdep
complaints.

This change is done in preparation for removing the redundant
rcu_read_lock()s from drivers.
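A condensed sketch of the lockdep condition being added (the lookup function
is hypothetical; the real call sites are the map lookup helpers touched by
this patch):

    static void *__example_map_lookup(void __rcu **slot)
    {
        /* valid from either an rcu_read_lock() section or a
         * local_bh_disable()/NAPI-poll section */
        return rcu_dereference_check(*slot,
                                     rcu_read_lock_held() ||
                                     rcu_read_lock_bh_held());
    }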

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210624160609.292325-5-toke@redhat.com
2021-06-24 19:41:15 +02:00
Daniel Borkmann ff40e51043 bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
Commit 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

  1) The audit events that are triggered due to calls to security_locked_down()
     can OOM kill a machine, see below details [0].

  2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
     when trying to wake up kauditd, for example, when using trace_sched_switch()
     tracepoint, see details in [1]. Triggering this was not via some hypothetical
     corner case, but with existing tools like runqlat & runqslower from bcc, for
     example, which make use of this tracepoint. Rough call sequence goes like:

     rq_lock(rq) -> -------------------------+
       trace_sched_switch() ->               |
         bpf_prog_xyz() ->                   +-> deadlock
           selinux_lockdown() ->             |
             audit_log_end() ->              |
               wake_up_interruptible() ->    |
                 try_to_wake_up() ->         |
                   rq_lock(rq) --------------+

What's worse is that the intention of 59438b4647 to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:

  allow <who> <who> : lockdown { <reason> };

However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:

  bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.

Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.
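A sketch of what moving the check to proto-lookup time looks like, roughly
following the shape of bpf_tracing_func_proto(); treat the details as an
approximation of the actual patch:

    case BPF_FUNC_probe_read_kernel:
        return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
               NULL : &bpf_probe_read_kernel_proto;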

The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick a few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b4647 where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

  I started seeing this with F-34. When I run a container that is traced with
  BPF to record the syscalls it is doing, auditd is flooded with messages like:

  type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
    for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
      scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
        tclass=lockdown permissive=0

  This seems to be leading to auditd running out of space in the backlog buffer
  and eventually OOMs the machine.

  [...]
  auditd running at 99% CPU presumably processing all the messages, eventually I get:
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
  Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
  Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
  Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  [...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
    Serhei Makarov says:

  Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
  bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
  is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
  testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
  ppc64le. Example stack trace:

  [...]
  [  730.868702] stack backtrace:
  [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
  [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [  730.873278] Call Trace:
  [  730.873770]  dump_stack+0x7f/0xa1
  [  730.874433]  check_noncircular+0xdf/0x100
  [  730.875232]  __lock_acquire+0x1202/0x1e10
  [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
  [  730.876844]  lock_acquire+0xc2/0x3a0
  [  730.877551]  ? __wake_up_common_lock+0x52/0x90
  [  730.878434]  ? lock_acquire+0xc2/0x3a0
  [  730.879186]  ? lock_is_held_type+0xa7/0x120
  [  730.880044]  ? skb_queue_tail+0x1b/0x50
  [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
  [  730.881656]  ? __wake_up_common_lock+0x52/0x90
  [  730.882532]  __wake_up_common_lock+0x52/0x90
  [  730.883375]  audit_log_end+0x5b/0x100
  [  730.884104]  slow_avc_audit+0x69/0x90
  [  730.884836]  avc_has_perm+0x8b/0xb0
  [  730.885532]  selinux_lockdown+0xa5/0xd0
  [  730.886297]  security_locked_down+0x20/0x40
  [  730.887133]  bpf_probe_read_compat+0x66/0xd0
  [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
  [  730.888917]  trace_call_bpf+0xe9/0x240
  [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
  [  730.890579]  perf_trace_sched_switch+0x142/0x180
  [  730.891485]  ? __schedule+0x6d8/0xb20
  [  730.892209]  __schedule+0x6d8/0xb20
  [  730.892899]  schedule+0x5b/0xc0
  [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
  [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
  [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Fixes: 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
2021-06-02 21:59:22 +02:00
Florent Revest 0af02eb2a7 bpf: Avoid using ARRAY_SIZE on an uninitialized pointer
The cppcheck static code analysis reported the following error:

    if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(bufs->tmp_bufs))) {
                                             ^
ARRAY_SIZE is a macro that expands to sizeof expressions, so bufs is not actually
dereferenced at runtime, and the code is actually safe. But to keep
things tidy, this patch removes the need for a call to ARRAY_SIZE by
extracting the size of the array into a macro. Cppcheck should no longer
be confused and the code ends up being a bit cleaner.
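A sketch of the change; the macro name and the surrounding error handling
are assumptions based on the description above:

    #define MAX_BPRINTF_NEST_LEVEL 3   /* replaces ARRAY_SIZE(bufs->tmp_bufs) */

    if (WARN_ON_ONCE(nest_level > MAX_BPRINTF_NEST_LEVEL)) {
        this_cpu_dec(bpf_bprintf_nest_level);
        return -EBUSY;
    }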

Fixes: e2d5b2bb76 ("bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-2-revest@chromium.org
2021-05-20 23:48:38 +02:00
Florent Revest 8afcc19fbf bpf: Clarify a bpf_bprintf_prepare macro
The per-cpu buffers contain bprintf data rather than printf arguments.
The macro name and comment were a bit confusing, this rewords them in a
clearer way.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-1-revest@chromium.org
2021-05-20 23:48:38 +02:00
Florent Revest e2d5b2bb76 bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers
The bpf_seq_printf, bpf_trace_printk and bpf_snprintf helpers share one
per-cpu buffer that they use to store temporary data (arguments to
bprintf). They "get" that buffer with try_get_fmt_tmp_buf and "put" it
by the end of their scope with bpf_bprintf_cleanup.

If one of these helpers gets called within the scope of one of these
helpers, for example: a first bpf program gets called, uses
bpf_trace_printk which calls raw_spin_lock_irqsave which is traced by
another bpf program that calls bpf_snprintf, then the second "get"
fails. Essentially, these helpers are not re-entrant. They would return
-EBUSY and print a warning message once.

This patch triples the number of bprintf buffers to allow three levels
of nesting. This is very similar to what was done for tracepoints in
"9594dc3c7e7 bpf: fix nested bpf tracepoints with per-cpu data"

Fixes: d9c9e4db18 ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Reported-by: syzbot+63122d0bc347f18c1884@syzkaller.appspotmail.com
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210511081054.2125874-1-revest@chromium.org
2021-05-11 14:02:33 -07:00
Florent Revest 48cac3f4a9 bpf: Implement formatted output helpers with bstr_printf
BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
and bpf_snprintf. Their signatures specify that all arguments are
provided from the BPF world as u64s (in an array or as registers). All
of these helpers are currently implemented by calling functions such as
snprintf() whose signatures take a variable number of arguments, then
placed in a va_list by the compiler to call vsnprintf().

"d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
a bpf_printf_prepare function that fills an array of u64 sanitized
arguments with an array of "modifiers" which indicate what the "real"
size of each argument should be (given by the format specifier). The
BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
its real size. However, the C promotion rules implicitly cast them all
back to u64s. Therefore, the arguments given to snprintf are u64s and
the va_list constructed by the compiler will use 64 bits for each
argument. On 64 bit machines, this happens to work well because 32 bit
arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
architectures this breaks the layout of the va_list expected by the
called function and mangles values.

In "88a5c690b6 bpf: fix bpf_trace_printk on 32 bit archs", this problem
had been solved for bpf_trace_printk only with a "horrid workaround"
that emitted multiple calls to trace_printk where each call had
different argument types and generated different va_list layouts. One of
the call would be dynamically chosen at runtime. This was ok with the 3
arguments that bpf_trace_printk takes but bpf_seq_printf and
bpf_snprintf accept up to 12 arguments. Because this approach scales
code exponentially, it is not a viable option anymore.

Because the promotion rules are part of the language and because the
construction of a va_list is an arch-specific ABI, it's best to just
avoid variadic arguments and va_lists altogether. Thankfully the
kernel's snprintf() has an alternative in the form of bstr_printf() that
accepts arguments in a "binary buffer representation". These binary
buffers are currently created by vbin_printf and used in the tracing
subsystem to split the cost of printing into two parts: a fast one that
only dereferences and remembers values, and a slower one, called later,
that does the pretty-printing.

This patch refactors bpf_printf_prepare to construct binary buffers of
arguments consumable by bstr_printf() instead of arrays of arguments and
modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
bpf_printf_prepare usage but there are a few gotchas that change how
bpf_printf_prepare needs to do things.

Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
generic storage for strings and IP addresses. With this refactoring, the
temporary buffer now holds all the arguments in a structured binary
format.

To comply with the format expected by bstr_printf, certain format
specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
Because vsnprintf subroutines for these specifiers are hard to expose,
we pre-format these arguments with calls to snprintf().
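A condensed sketch of the resulting two-step flow, roughly as a
bpf_snprintf-style helper consumes it (error handling abbreviated, names
approximate):

    err = bpf_bprintf_prepare(fmt, fmt_size, args, &bin_args, num_args);
    if (err < 0)
        return err;
    /* bstr_printf() walks the pre-built binary argument buffer instead of
     * a va_list, so the layout is architecture independent */
    err = bstr_printf(str, str_size, fmt, bin_args);
    bpf_bprintf_cleanup();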

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
2021-04-27 15:56:31 -07:00
Florent Revest 7b15523a98 bpf: Add a bpf_snprintf helper
The implementation takes inspiration from the existing bpf_trace_printk
helper but there are a few differences:

To allow for a large number of format-specifiers, parameters are
provided in an array, like in bpf_seq_printf.

Because the output string takes two arguments and the array of
parameters also takes two arguments, the format string needs to fit in
one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
a zero-terminated read-only map so we don't need a format string length
arg.

Because the format-string is known at verification time, we also do
a first pass of format string validation in the verifier logic. This
makes debugging easier.
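Illustrative BPF-side usage (not from the patch); the tracepoint is a
placeholder and the usual vmlinux.h/bpf_helpers.h includes are assumed:

    SEC("tp/syscalls/sys_enter_execve")
    int log_execve(void *ctx)
    {
        static const char fmt[] = "execve by pid %d";  /* lands in .rodata */
        char out[32];
        __u64 args[1];

        args[0] = bpf_get_current_pid_tgid() >> 32;
        bpf_snprintf(out, sizeof(out), fmt, args, sizeof(args));
        return 0;
    }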

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
2021-04-19 15:27:36 -07:00
Florent Revest d9c9e4db18 bpf: Factorize bpf_trace_printk and bpf_seq_printf
Two helpers (trace_printk and seq_printf) have very similar
implementations of format string parsing and a third one is coming
(snprintf). To avoid code duplication and make the code easier to
maintain, this moves the operations associated with format string
parsing (validation and argument sanitization) into one generic
function.

The implementation of the two existing helpers already drifted quite a
bit so unifying them entailed a lot of changes:

- bpf_trace_printk always expected fmt[fmt_size] to be the terminating
  NULL character, this is no longer true, the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percentage char).
- bpf_trace_printk now skips width formatting fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
- argument casting on 32 bit has been simplified into one macro and
  using an enum instead of obscure int increments.

- bpf_seq_printf now uses bpf_trace_copy_string instead of
  strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
- bpf_seq_printf now prints longs correctly on 32 bit architectures.

- both were changed to use a global per-cpu tmp buffer instead of one
  stack buffer for trace_printk and 6 small buffers for seq_printf.
- to avoid per-cpu buffer usage conflict, these helpers disable
  preemption while the per-cpu buffer is in use.
- both helpers now support the %ps and %pS specifiers to print symbols.

The implementation is also moved from bpf_trace.c to helpers.c because
the upcoming bpf_snprintf helper will be made available to all BPF
programs and will need it.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
2021-04-19 15:27:36 -07:00
Yonghong Song b910eaaaa4 bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper
Jiri Olsa reported a bug ([1]) in kernel where cgroup local
storage pointer may be NULL in bpf_get_local_storage() helper.
There are two issues uncovered by this bug:
  (1). kprobe or tracepoint prog incorrectly sets cgroup local storage
       before prog run,
  (2). due to change from preempt_disable to migrate_disable,
       preemption is possible and percpu storage might be overwritten
       by other tasks.

This issue (1) is fixed in [2]. This patch tried to address issue (2).
The following shows how things can go wrong:
  task 1:   bpf_cgroup_storage_set() for percpu local storage
         preemption happens
  task 2:   bpf_cgroup_storage_set() for percpu local storage
         preemption happens
  task 1:   run bpf program

task 1 will effectively use the percpu local storage setting by task 2
which will be either NULL or incorrect ones.

Instead of just one common local storage per cpu, this patch fixed
the issue by permitting 8 local storages per cpu and each local
storage is identified by a task_struct pointer. This way, we
allow at most 8 nested preemption between bpf_cgroup_storage_set()
and bpf_cgroup_storage_unset(). The percpu local storage slot
is released (calling bpf_cgroup_storage_unset()) by the same task
after bpf program finished running.
bpf_test_run() is also fixed to use the new bpf_cgroup_storage_set()
interface.

The patch is tested on top of [2] with reproducer in [1].
Without this patch, kernel will emit error in 2-3 minutes.
With this patch, after one hour, still no error.

 [1] https://lore.kernel.org/bpf/CAKH8qBuXCfUz=w8L+Fj74OaUpbosO29niYwTki7e3Ag044_aww@mail.gmail.com/T
 [2] https://lore.kernel.org/bpf/20210309185028.3763817-1-yhs@fb.com

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20210323055146.3334476-1-yhs@fb.com
2021-03-25 18:31:36 -07:00
Yonghong Song 69c087ba62 bpf: Add bpf_for_each_map_elem() helper
The bpf_for_each_map_elem() helper is introduced which
iterates all map elements with a callback function. The
helper signature looks like
  long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
and for each map element, the callback_fn will be called. For example,
like hashmap, the callback signature may look like
  long callback_fn(map, key, val, callback_ctx)

There are two known use cases for this. One is from upstream ([1]) where
a for_each_map_elem helper may help implement a timeout mechanism
in a more generic way. Another is from our internal discussion
for a firewall use case where a map contains all the rules. The packet
data can be compared to all these rules to decide allow or deny
the packet.

For array maps, users can already use a bounded loop to traverse
elements. Using this helper avoids the bounded loop. For other
types of maps (e.g., hash maps) where a bounded loop is hard or
impossible to use, this helper provides a convenient way to
operate on all elements.

For callback_fn, besides map and map element, a callback_ctx,
allocated on caller stack, is also passed to the callback
function. This callback_ctx argument can provide additional
input and allow to write to caller stack for output.

If the callback_fn returns 0, the helper will iterate through next
element if available. If the callback_fn returns 1, the helper
will stop iterating and returns to the bpf program. Other return
values are not used for now.

Currently, this helper is only available with jit. It is possible
to make it work with the interpreter with some effort but I leave it
as future work.

[1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
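Illustrative usage (not from the patch); 'hmap' is assumed to be a hash map
with u32 keys and u64 values declared elsewhere in the program:

    static __u64 count_cb(struct bpf_map *map, __u32 *key,
                          __u64 *val, void *ctx)
    {
        (*(__u64 *)ctx)++;
        return 0;                     /* 0 = keep iterating, 1 = stop */
    }

    SEC("fentry/bpf_fentry_test1")
    int BPF_PROG(count_elems)
    {
        __u64 n = 0;

        bpf_for_each_map_elem(&hmap, count_cb, &n, 0);
        return 0;
    }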

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210226204925.3884923-1-yhs@fb.com
2021-02-26 13:23:52 -08:00
Tobias Klauser 61ca36c8c4 bpf: Simplify cases in bpf_base_func_proto
!perfmon_capable() is checked before the last switch(func_id) in
bpf_base_func_proto. Thus, the cases BPF_FUNC_trace_printk and
BPF_FUNC_snprintf_btf can be moved to that last switch(func_id) to omit
the inline !perfmon_capable() checks.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210127174615.3038-1-tklauser@distanz.ch
2021-01-29 02:20:28 +01:00
Mircea Cirjaliu 301a33d518 bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
I assume this was obtained by copy/paste. Point it to bpf_map_peek_elem()
instead of bpf_map_pop_elem(). In practice it may have been less likely to be
hit under JIT given it was shielded via 84430d4232 ("bpf, verifier: avoid
retpoline for map push/pop/peek operation").

Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mauricio Vasquez <mauriciovasquezbernal@gmail.com>
Link: https://lore.kernel.org/bpf/AM7PR02MB6082663DFDCCE8DA7A6DD6B1BBA30@AM7PR02MB6082.eurprd02.prod.outlook.com
2021-01-19 22:04:08 +01:00
Jakub Kicinski 46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Andrii Nakryiko b7906b70a2 bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Let's
fix it while it is still possible, before UAPI freezes on these helpers.

Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11 14:19:07 -08:00
Dmitrii Banshchikov d055126180 bpf: Add bpf_ktime_get_coarse_ns helper
The helper uses CLOCK_MONOTONIC_COARSE source of time that is less
accurate but more performant.

We have a BPF CGROUP_SKB firewall that supports event logging through
bpf_perf_event_output(). Each event has a timestamp and currently we use
bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20
ns in time required for event logging.

bpf_ktime_get_ns():
EgressLogByRemoteEndpoint                              113.82ns    8.79M

bpf_ktime_get_coarse_ns():
EgressLogByRemoteEndpoint                               95.40ns   10.48M
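Illustrative event-logging snippet (not from the patch); the map, event
layout and calling context are placeholders, assuming a program type where
bpf_perf_event_output() is available:

    struct event { __u64 ts_ns; __u32 verdict; };

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } events SEC(".maps");

    static __always_inline void log_event(void *ctx, __u32 verdict)
    {
        struct event e = {
            .ts_ns   = bpf_ktime_get_coarse_ns(), /* coarse, cheaper timestamp */
            .verdict = verdict,
        };

        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &e, sizeof(e));
    }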

Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201117184549.257280-1-me@ubique.spb.ru
2020-11-18 23:25:32 +01:00
Hao Luo 63d9b80dcf bpf: Introduce bpf_this_cpu_ptr()
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This
helper always returns a valid pointer, therefore no need to check
returned value for NULL. Also note that all programs run with
preemption disabled, which means that the returned pointer is stable
during all the execution of the program.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com
2020-10-02 15:00:49 -07:00
Hao Luo eaa6bcb71e bpf: Introduce bpf_per_cpu_ptr()
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
except that it may return NULL. This happens when the cpu parameter is
out of range. So the caller must check the returned value.
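Illustrative usage (not from the patch), mirroring the style of the
selftests; 'runqueues' is the kernel's per-CPU run queue variable resolved
via BTF, and the raw tracepoint is a placeholder:

    extern const struct rq runqueues __ksym;

    SEC("raw_tp/sys_enter")
    int dump_cpu0(const void *ctx)
    {
        struct rq *rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, 0);

        if (!rq)                      /* out-of-range CPU returns NULL */
            return 0;
        bpf_printk("cpu0 rq cpu=%d", rq->cpu);
        return 0;
    }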

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com
2020-10-02 15:00:49 -07:00
Alan Maguire c4d0bfb450 bpf: Add bpf_snprintf_btf helper
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF).  Its signature is

long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
		      u32 btf_ptr_size, u64 flags);

struct btf_ptr * specifies

- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field is provided for future use; these flags
  are not to be confused with the BTF_F_* flags
  below that control how the btf_ptr is displayed; the
  flags member of the struct btf_ptr may be used to
  disambiguate types in kernel versus module BTF, etc;
  the main distinction is the flags relate to the type
  and information needed in identifying it; not how it
  is displayed.

For example a BPF program with a struct sk_buff *skb
could do the following:

	static struct btf_ptr b = { };

	b.ptr = skb;
	b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);

Default output looks like this:

(struct sk_buff){
 .transport_header = (__u16)65535,
 .mac_header = (__u16)65535,
 .end = (sk_buff_data_t)192,
 .head = (unsigned char *)0x000000007524fd8b,
 .data = (unsigned char *)0x000000007524fd8b,
 .truesize = (unsigned int)768,
 .users = (refcount_t){
  .refs = (atomic_t){
   .counter = (int)1,
  },
 },
}

Flags modifying display are as follows:

- BTF_F_COMPACT:	no formatting around type information
- BTF_F_NONAME:		no struct/union member names/types
- BTF_F_PTR_RAW:	show raw (unobfuscated) pointer values;
			equivalent to %px.
- BTF_F_ZERO:		show zero-valued struct/union members;
			they are not displayed by default

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
2020-09-28 18:26:58 -07:00
Alexei Starovoitov 07be4c4a3e bpf: Add bpf_copy_from_user() helper.
Sleepable BPF programs can now use copy_from_user() to access user memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com
2020-08-28 21:20:33 +02:00
Andrii Nakryiko 457f44363a bpf: Implement BPF ring buffer and verifier support for it
This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
which allows multiple CPUs to submit data to a single shared ring buffer. On
the consumption side, only single consumer is assumed.

Motivation
----------
There are two distinctive motivators for this work, which are not satisfied by
existing perf buffer, which prompted creation of a new ring buffer
implementation.
  - more efficient memory utilization by sharing ring buffer across CPUs;
  - preserving ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer fails to satisfy both.
Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
also solved by having an MPSC implementation of ring buffer. The ordering
problem could technically be solved for perf buffer with some in-kernel
counting, but given the first one requires an MPSC buffer, the same solution
would solve the second problem automatically.

Semantics and APIs
------------------
Single ring buffer is presented to BPF programs as an instance of BPF map of
type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but ultimately
rejected.

One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, have
BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but not enforce the
"same CPU only" rule. This would be a more familiar interface, compatible with
existing perf buffer use in BPF, but it would fail if the application needed more
advanced logic to look up a ring buffer by an arbitrary key. HASH_OF_MAPS addresses
this with current approach. Additionally, given the performance of BPF
ringbuf, many use cases would just opt into a simple single ring buffer shared
among all CPUs, for which current approach would be an overkill.

Another approach could introduce a new concept, alongside BPF map, to
represent generic "container" object, which doesn't necessarily have key/value
interface with lookup/update/delete operations. This approach would add a lot
of extra infrastructure that has to be built for observability and verifier
support. It would also add another concept that BPF developers would have to
familiarize themselves with, new syntax in libbpf, etc. But then it would really
provide no additional benefits over the approach of using a map.
BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but neither
do a few other map types (e.g., queue and stack; array doesn't support
delete, etc).

The approach chosen has an advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc), being
familiar concept (no need to teach users a new type of object in BPF program),
and utilizing existing tooling (bpftool). For common scenario of using
a single ring buffer for all CPUs, it's as simple and straightforward, as
would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
implement a wide variety of topologies, from one ring buffer for each CPU
(e.g., as a replacement for perf buffer use cases), to a complicated
application hashing/sharding of ring buffers (e.g., having a small pool of
ring buffers with hashed task's tgid being a look up key to preserve order,
but reduce contention).

Key and value sizes are enforced to be zero. max_entries is used to specify
the size of ring buffer and has to be a power of 2 value.

There are a bunch of similarities between perf buffer
(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
  - variable-length records;
  - if there is no more space left in ring buffer, reservation fails, no
    blocking;
  - memory-mappable data area for user-space applications for ease of
    consumption and high performance;
  - epoll notifications for new incoming data;
  - but still the ability to do busy polling for new data to achieve the
    lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:
  - bpf_ringbuf_output() allows to *copy* data from one place to a ring
    buffer, similarly to bpf_perf_event_output();
  - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
    split the whole process into two steps. First, a fixed amount of space is
    reserved. If successful, a pointer to a data inside ring buffer data area
    is returned, which BPF programs can use similarly to a data inside
    array/hash maps. Once ready, this piece of memory is either committed or
    discarded. Discard is similar to commit, but makes consumer ignore the
    record.

bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because
record has to be prepared in some other place first. But it allows to submit
records of the length that's not known to verifier beforehand. It also closely
matches bpf_perf_event_output(), so will simplify migration significantly.

bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory
pointer directly to ring buffer memory. In a lot of cases records are larger
than BPF stack space allows, so many programs have to use an extra per-CPU array
as a temporary heap for preparing the sample. bpf_ringbuf_reserve() avoids this need
completely. But in exchange, it only allows a known constant size of memory to
be reserved, such that verifier can verify that BPF program can't access
memory outside its reserved record space. bpf_ringbuf_output(), while slightly
slower due to extra memory copy, covers some use cases that are not suitable
for bpf_ringbuf_reserve().

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary malloc()/free()
within single BPF program invocation.
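An illustrative reserve/submit round trip on the BPF side (not part of the
patch text); the map sizing, event layout and tracepoint are placeholders,
and bpf_ringbuf_submit() is the helper that performs the commit step:

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);     /* must be a power of 2 */
    } rb SEC(".maps");

    struct event { __u32 pid; };

    SEC("tp/sched/sched_process_exec")
    int on_exec(void *ctx)
    {
        struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);

        if (!e)                              /* buffer full: fail, never block */
            return 0;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        bpf_ringbuf_submit(e, 0);            /* or bpf_ringbuf_discard(e, 0) */
        return 0;
    }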

Each reserved record is tracked by verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record, but forget to submit (or discard) it.

bpf_ringbuf_query() helper allows to query various properties of ring buffer.
Currently 4 are supported:
  - BPF_RB_AVAIL_DATA returns amount of unconsumed data in ring buffer;
  - BPF_RB_RING_SIZE returns the size of ring buffer;
  - BPF_RB_CONS_POS/BPF_RB_PROD_POS return the current logical position of
    consumer/producer, respectively.
Returned values are momentary snapshots of the ring buffer state and could be
stale by the time the helper returns, so this should be used only for
debugging/reporting reasons or for implementing various heuristics that take
into account the highly changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in ring buffer. Together with
BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for output/commit/discard helpers,
it allows BPF program a high degree of control and, e.g., more efficient
batched notifications. The default self-balancing strategy, though, should be
adequate for most applications and will work reliably and efficiently already.

Design and implementation
-------------------------
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock,
in which case reservation will fail even if ring buffer is not full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical, ever-increasing counters (which might
wrap around on 32-bit architectures; that's not a problem):
  - the consumer counter shows up to which logical position the consumer has
    consumed the data;
  - the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record
successfully advances the producer counter. At that point, the data is still
not ready to be consumed, though. Each record has an 8-byte header, which
contains the length of the reserved record as well as two extra bits: a busy
bit to denote that the record is still being worked on, and a discard bit,
which might be set at commit time if the record is discarded. In the latter
case, the consumer is supposed to skip the record and move on to the next one.
The record header also encodes the record's relative offset from the beginning
of the ring buffer data area (in pages). This allows
bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only the pointer to the
record itself, without also requiring a pointer to the ring buffer: the ring
buffer memory location is restored from the record's metadata header. This
significantly simplifies the verifier and improves API usability.
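
A sketch of a header layout consistent with this description (names here are
illustrative; the authoritative definition lives in kernel/bpf/ringbuf.c):

/* 8-byte per-record header; BPF programs only ever see the data after it */
struct ringbuf_hdr {
    __u32 len;      /* record length; two top bits are reused as flags below */
    __u32 pg_off;   /* page offset back to the ring buffer, so commit/discard
                     * can find the ring buffer from the record pointer alone */
};

#define RINGBUF_BUSY_BIT    (1U << 31)  /* record is still being written */
#define RINGBUF_DISCARD_BIT (1U << 30)  /* record was discarded at commit time */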

Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
have been committed. It is thus possible for slow producers to temporarily
hold off records that were reserved later.

Reservation/commit/consumer protocol is verified by litmus tests in
Documentation/litmus-test/bpf-rb.

One interesting implementation detail that significantly simplifies (and thus
also speeds up) both the producer and consumer code is that the data area is
mapped twice, contiguously back-to-back, in virtual memory. This makes it
unnecessary to take any special measures for samples that wrap around at the
end of the circular buffer data area, because the page after the last data
page is the first data page again, and thus the sample still appears
completely contiguous in virtual memory. See the comment and a simple ASCII
diagram showing this visually in bpf_ringbuf_area_alloc().
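
The same trick can be illustrated with a plain userspace analogue (a hedged
sketch only; the kernel implementation uses vmap() over a doubled page array
rather than mmap()):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Map a file-backed (e.g. memfd) data area of 'size' bytes twice,
 * back-to-back, so a record that wraps past 'size' is still contiguous
 * in virtual memory. */
static void *map_data_area_twice(int fd, size_t size)
{
    char *area = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (area == MAP_FAILED)
        return NULL;
    if (mmap(area, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
        mmap(area + size, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
        return NULL;
    return area;
}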

Another feature that distinguishes the BPF ringbuf from the perf ring buffer
is its self-pacing notification of new data availability. The
bpf_ringbuf_commit() implementation sends a notification of a new record being
available after commit only if the consumer has already caught up right up
to the record being committed. If not, the consumer still has to catch up and
will thus see the new data anyway, without needing an extra poll notification.
Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that
this allows achieving very high throughput without resorting to tricks like
"notify only every Nth sample", which are necessary with the perf buffer. For
extreme cases, when a BPF program wants more manual control of notifications,
the commit/discard/output helpers accept BPF_RB_NO_WAKEUP and
BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data
availability but require extra caution and diligence in using this API.

Comparison to alternatives
--------------------------
Before considering implementing the BPF ring buffer from scratch, existing
alternatives in the kernel were evaluated, but they didn't seem to meet the
needs. They largely fell into a few categories:
  - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
    outlined above (ordering and memory consumption);
  - linked list-based implementations; while some were multi-producer designs,
    consuming these from user-space would be very complicated and most
    probably not performant; memory-mapping contiguous piece of memory is
    simpler and more performant for user-space consumers;
  - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
    SPSC queue into MPSC w/ lock would have subpar performance compared to
    locked reserve + lockless commit, as with the BPF ring buffer. Fixed-size
    elements would be too limiting for BPF programs, given existing BPF
    programs heavily rely on the variable-sized perf buffer already;
  - specialized implementations (like a new printk ring buffer, [0]) with lots
    of printk-specific limitations and implications, that didn't seem to fit
    well for intended use with BPF programs.

  [0] https://lwn.net/Articles/779550/

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:22 -07:00
John Fastabend f470378c75 bpf: Extend bpf_base_func_proto helpers with probe_* and *current_task*
Often it is useful when applying policy to know something about the
task. If the administrator has CAP_SYS_ADMIN rights then they can
use kprobe + networking hook and link the two programs together to
accomplish this. However, this is a bit clunky and also means we have
to call both the network program and kprobe program when we could just
use a single program and avoid passing metadata through sk_msg/skb->cb,
socket, maps, etc.

To accomplish this, add probe_* helpers to bpf_base_func_proto, guarded by
a perfmon_capable() check. The newly supported helpers are the following:

 BPF_FUNC_get_current_task
 BPF_FUNC_probe_read_user
 BPF_FUNC_probe_read_kernel
 BPF_FUNC_probe_read_user_str
 BPF_FUNC_probe_read_kernel_str
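
A hedged sketch of the kind of single-program policy this enables; the
program type, the vmlinux.h-based task_struct access and the toy comm check
are all illustrative, and the loader needs to be perfmon_capable():

SEC("sk_msg")
int msg_task_policy(struct sk_msg_md *msg)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    char comm[16] = {};

    /* read current->comm directly, no separate kprobe program needed */
    bpf_probe_read_kernel_str(comm, sizeof(comm), &task->comm);

    /* toy policy: only let messages from "backend*" tasks through */
    if (comm[0] != 'b')
        return SK_DROP;
    return SK_PASS;
}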

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159033905529.12355.4368381069655254932.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:20 -07:00
Alexei Starovoitov 2c78ee898d bpf: Implement CAP_BPF
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three are currently equivalent to perfmon_capable(), since leaking
kernel pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf-to-bpf calls and
other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
and subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in
the verifier and run-time mitigations in bpf arrays, and enables indirect
variable access in bpf programs. 'bypass_spec_v4' disables emission of
sanitation code by the verifier.

That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre
mitigations applied. Such a networking BPF program will not be able to leak
kernel pointers and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2020-05-15 17:29:41 +02:00
Maciej Żenczykowski 71d1921477 bpf: add bpf_ktime_get_boot_ns()
On a device like a cellphone, which is constantly suspending
and resuming, CLOCK_MONOTONIC is not particularly useful for
keeping track of or reacting to external network events.
Instead you want to use CLOCK_BOOTTIME.

Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.
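
A hedged usage sketch contrasting the two clocks:

  __u64 mono = bpf_ktime_get_ns();      /* CLOCK_MONOTONIC: stops while suspended */
  __u64 boot = bpf_ktime_get_boot_ns(); /* CLOCK_BOOTTIME: keeps counting across suspend */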

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26 09:43:05 -07:00
Maciej Żenczykowski 082b57e3eb net: bpf: Make bpf_ktime_get_ns() available to non GPL programs
The entire implementation is in kernel/bpf/helpers.c:

BPF_CALL_0(bpf_ktime_get_ns) {
       /* NMI safe access to clock monotonic */
       return ktime_get_mono_fast_ns();
}

const struct bpf_func_proto bpf_ktime_get_ns_proto = {
       .func           = bpf_ktime_get_ns,
       .gpl_only       = false,
       .ret_type       = RET_INTEGER,
};

and this was presumably marked GPL due to kernel/time/timekeeping.c:
  EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);

and while that may make sense for kernel modules (although even that
is doubtful), there is currently AFAICT no other source of time
available to ebpf.

Furthermore this is really just equivalent to clock_gettime(CLOCK_MONOTONIC)
which is exposed to userspace (via vdso even to make it performant)...

As such, I see no reason to keep the GPL restriction.
(In the future I'd like to have access to time from Apache licensed ebpf code)

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26 09:04:14 -07:00
Stanislav Fomichev 6890896bd7 bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n
linux-next build bot reported compile issue [1] with one of its
configs. It looks like when we have CONFIG_NET=n and
CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto
definition (from net/core/filter.c) in cgroup_base_func_proto.

I'm reshuffling the code a bit to make it work. The common helpers
are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is
exported from there.
Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing
bpf_user_rnd_u32.

[1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72

Fixes: 0456ea170c ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
2020-04-26 08:53:13 -07:00
Daniel Borkmann 0f09abd105 bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context 'current' is
always valid and pointing to the app that is performing mentioned
syscalls if it's subject to a v2 cgroup. Also with same motivation of
commit 7723628101 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
enable retrieval of ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.
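
A hedged sketch of such a policy check (map layout and section name are
illustrative):

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);     /* cgroup v2 id */
    __type(value, __u8);    /* presence means: deny connect() */
} denied_cgroups SEC(".maps");

SEC("cgroup/connect4")
int deny_connect4(struct bpf_sock_addr *ctx)
{
    __u64 cgid = bpf_get_current_cgroup_id();
    /* or bpf_get_current_ancestor_cgroup_id(2) to key on an ancestor level */

    if (bpf_map_lookup_elem(&denied_cgroups, &cgid))
        return 0;   /* reject: connect() fails with EPERM */
    return 1;       /* allow */
}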

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:39 -07:00
Carlos Neira b4490c5c4e bpf: Added new helper bpf_get_ns_current_pid_tgid
New bpf helper bpf_get_ns_current_pid_tgid.
This helper returns the pid and tgid of the current task
if the provided dev_t and inode number match the task's pid namespace;
this allows us to instrument a process inside a container.
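
A hedged usage sketch; ns_dev and ns_ino are illustrative values that
userspace would obtain by stat()-ing the container's /proc/<pid>/ns/pid and
pass into the program:

    struct bpf_pidns_info ns = {};

    if (bpf_get_ns_current_pid_tgid(ns_dev, ns_ino, &ns, sizeof(ns)) == 0)
        bpf_printk("namespace-local pid=%u tgid=%u", ns.pid, ns.tgid);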

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com
2020-03-12 17:33:11 -07:00
Martin KaFai Lau 5576b991e9 bpf: Add BPF_FUNC_jiffies64
This patch adds a helper to read the 64-bit jiffies.  It will be used
in a later patch to implement bpf_cubic.c.

The helper is inlined for jit_requested and 64-bit BITS_PER_LONG, as is
done for map_gen_lookup().  Other cases could be considered together
with map_gen_lookup() if needed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
2020-01-22 16:30:10 -08:00
Tejun Heo 743210386c cgroup: use cgrp->kn->id as the cgroup ID
cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf.  This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.

The preceding changes exposed kn->id as the ino: a 64-bit ino on
supported archs, or ino+gen (low 32 bits as ino, high 32 bits as gen)
otherwise.  There's no reason for cgroup to use different IDs.  The
kernfs IDs are unique and userland can easily discover them and map
them back to paths using standard file operations.

This patch replaces cgroup IDs with kernfs IDs.

* cgroup_id() is added and all cgroup ID users are converted to use it.

* kernfs_node creation is moved to earlier during cgroup init so that
  cgroup_id() is available during init.

* While at it, s/cgroup/cgrp/ in psi helpers for consistency.

* Fallback ID value is changed to 1 to be consistent with root cgroup
  ID.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo 67c0496e87 kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value.  I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to.  Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper.  Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen.  This makes ID handling less cumbersome and will
allow using 64-bit inos on supported archs.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12 08:18:03 -08:00
Thomas Gleixner 5b497af42f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 64 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:38 +02:00
Andrey Ignatov d7a4cb9b67 bpf: Introduce bpf_strtol and bpf_strtoul helpers
Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned
long correspondingly. It's similar to user space strtol(3) and
strtoul(3) with a few changes to the API:

* instead of NUL-terminated C string the helpers expect buffer and
  buffer length;

* resulting long or unsigned long is returned in a separate
  result-argument;

* return value is used to indicate success or failure, on success number
  of consumed bytes is returned that can be used to identify position to
  read next if the buffer is expected to contain multiple integers;

* instead of *base* argument, *flags* is used that provides base in 5
  LSB, other bits are reserved for future use;

* number of supported bases is limited.

Documentation for the new helpers is provided in bpf.h UAPI.

The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to
be able to convert string input to e.g. "ulongvec" output.

E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be
parsed by calling to bpf_strtoul three times.
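
As a rough illustration only (bounds handling is simplified; a real program
must keep offsets provably bounded for the verifier), parsing the three
values in a cgroup/sysctl program could look like:

SEC("cgroup/sysctl")
int check_tcp_mem(struct bpf_sysctl *ctx)
{
    unsigned long mem[3];
    char buf[64] = {};
    __u32 off = 0;
    int i, len;

    if (bpf_sysctl_get_new_value(ctx, buf, sizeof(buf)) < 0)
        return 1;                   /* not a write, nothing to parse */

    for (i = 0; i < 3; i++) {
        /* flags: base in the 5 low bits, 0 means auto-detect */
        len = bpf_strtoul(buf + off, sizeof(buf) - off, 0, &mem[i]);
        if (len <= 0)
            return 0;               /* reject malformed input */
        off += len + 1;             /* consumed bytes + separating space */
    }
    return 1;                       /* allow the write */
}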

Implementation notes:

Implementation includes "../../lib/kstrtox.h" to reuse integer parsing
functions. It's done exactly same way as fs/proc/base.c already does.

Unfortunately existing kstrtoX function can't be used directly since
they fail if any invalid character is present right after integer in the
string. Existing simple_strtoX functions can't be used either since
they're obsolete and don't handle overflow properly.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:59 -07:00
Alexei Starovoitov 96049f3afd bpf: introduce BPF_F_LOCK flag
Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
and for map_update() helper function.
In all these cases, take the lock of the existing element (which was provided
in the BTF description) before copying (in or out) the rest of the map value.

Implementation details that are part of uapi:

Array:
The array map takes the element lock for lookup/update.

Hash:
The hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
If the old element exists, it takes the element lock and updates the element in place.
If the element doesn't exist, it allocates a new one and inserts it into the hash table
while holding the bucket lock.
In the rare case the hashmap has to take both the bucket lock and the element lock
to update the old value in place.

Cgroup local storage:
It is similar to the array map: update-in-place and lookup are done with the lock taken.
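
A hedged userspace sketch of the uapi side; map_fd and the value layout are
illustrative, the value must carry a struct bpf_spin_lock described in the
map's BTF, and bpf_map_lookup_elem_flags() is assumed to be libbpf's wrapper
for the flagged lookup:

struct val {
    struct bpf_spin_lock lock;
    long counter;
};

struct val v = { .counter = 42 };
int key = 0;

/* copy the value in while holding the element's bpf_spin_lock */
bpf_map_update_elem(map_fd, &key, &v, BPF_F_LOCK);

/* copy the value out under the same lock; the lock field itself isn't copied */
bpf_map_lookup_elem_flags(map_fd, &key, &v, BPF_F_LOCK);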

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01 20:55:39 +01:00
Alexei Starovoitov d83525ca62 bpf: introduce bpf_spin_lock
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause deadlocks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01 20:55:38 +01:00
Daniel Borkmann 80b0d86a17 bpf: fix direct packet write into pop/peek helpers
Commit f1a2e44a3a ("bpf: add queue and stack maps") probably just
copy-pasted .pkt_access for bpf_map_{pop,peek}_elem() helpers, but
this is buggy in this context since it would allow writes into cloned
skbs which is invalid. Therefore, disable .pkt_access for the two.

Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Mauricio Vasquez B<mauricio.vasquez@polito.it>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Mauricio Vasquez B f1a2e44a3a bpf: add queue and stack maps
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers.  Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:

BPF_MAP_LOOKUP_ELEM            -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM            -> push

Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.

As opposed to other maps, queue and stack maps do not use RCU for protecting
map values; the bpf_map[peek/pop] helpers have an ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where the value of the map is
saved.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size does not
have to be passed as an extra argument.

Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT; however, we foresee
other use cases, like, for example, saving the last N kernel events in a map
and then analysing them from userspace.
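
A hedged sketch of the SNAT port-pool idea from the program side (map
definition, sizes and program type are illustrative):

struct {
    __uint(type, BPF_MAP_TYPE_QUEUE);
    __uint(max_entries, 1024);
    __uint(key_size, 0);                /* queue/stack maps have no key */
    __uint(value_size, sizeof(__u32));
} free_ports SEC(".maps");

SEC("xdp")
int snat_alloc_port(struct xdp_md *ctx)
{
    __u32 port;

    if (bpf_map_pop_elem(&free_ports, &port))
        return XDP_DROP;                /* pool exhausted */

    /* ... use 'port' for the translation; once the flow ends,
     * hand the port back: */
    bpf_map_push_elem(&free_ports, &port, 0);
    return XDP_PASS;
}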

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Roman Gushchin b741f16303 bpf: introduce per-cpu cgroup local storage
This commit introduces per-cpu cgroup local storage.

Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.

The main goal of the per-cpu variant is to implement super fast
counters (e.g. packet counters), which require neither lookups nor
atomic operations.

From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).

Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is provided, exactly
as with other per-cpu maps.
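
A hedged sketch of such a fast packet counter (program section and value
type are illustrative):

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE);
    __type(key, struct bpf_cgroup_storage_key);
    __type(value, __u64);
} pkt_counter SEC(".maps");

SEC("cgroup_skb/egress")
int count_egress(struct __sk_buff *skb)
{
    __u64 *cnt = bpf_get_local_storage(&pkt_counter, 0);

    (*cnt)++;       /* this CPU's slot: no lookup, no atomics */
    return 1;       /* allow the packet */
}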

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Roman Gushchin f294b37ec7 bpf: rework cgroup storage pointer passing
To simplify the following introduction of per-cpu cgroup storage,
let's rework a bit the mechanism of passing a pointer to a cgroup
storage into bpf_get_local_storage(). Let's save a pointer
to the corresponding bpf_cgroup_storage structure, instead of
a pointer to the actual buffer.

It will help us handle per-cpu storage later, which has
a different way of accessing the actual data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Roman Gushchin 8bad74f984 bpf: extend cgroup bpf core to allow multiple cgroup storage types
In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.

This commit is mostly a formal change that replaces the
cgroup_storage pointer with an array of cgroup_storage pointers.
It doesn't actually introduce a new storage type;
that will be done later.

Each bpf program is now able to have one cgroup storage of each type.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-01 16:18:32 +02:00
Roman Gushchin cd33943176 bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.

It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and the flags
argument has to be 0. Further, it can be extended to support
other types of local storage: e.g. thread local storage etc.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Yonghong Song bf6fa2c893 bpf: implement bpf_get_current_cgroup_id() helper
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.

Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.

This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.

The later patch will provide an example to show that
userspace can get the same cgroup id so it could
configure a filter or policy in the bpf program based on
task cgroup id.

The helper is currently implemented for tracing. It can
be added to other program types as well when needed.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 18:22:41 -07:00
Alexei Starovoitov 39f19ebbf5 bpf: rename ARG_PTR_TO_STACK
since ARG_PTR_TO_STACK is no longer just pointer to stack
rename it to ARG_PTR_TO_MEM and adjust comment.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:56:27 -05:00
Daniel Borkmann 2d0e30c30f bpf: add helper for retrieving current numa node id
The use case is mainly for soreuseport to select sockets for the local
numa node, but since it is generic, let's also add this for other networking
and tracing program types.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:52 -04:00
Daniel Borkmann 36bbef52c7 bpf: direct packet write and access for helpers for clsact progs
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.
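
A hedged sketch of that pull-retest-access pattern in a tc program (the
offset and the write are illustrative):

SEC("tc")
int rewrite_one_byte(struct __sk_buff *skb)
{
    const __u32 need = 42 + 1;          /* we want to touch byte 42 */
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    if (data + need > data_end) {
        /* byte sits beyond the linear head: pull it in ... */
        if (bpf_skb_pull_data(skb, need))
            return TC_ACT_OK;
        /* ... and retest, since pulling invalidates old pointers */
        data = (void *)(long)skb->data;
        data_end = (void *)(long)skb->data_end;
        if (data + need > data_end)
            return TC_ACT_OK;
    }

    ((__u8 *)data)[42] = 0;             /* direct write; skb is already uncloned */
    return TC_ACT_OK;
}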

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a220 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. The bpf_csum_update() helper accommodates
the fact that we need to fix up checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:32:11 -04:00
Daniel Borkmann f3694e0012 bpf: add BPF_CALL_x macros for declaring helpers
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.

This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.

BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.
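
For illustration, after the conversion a two-argument helper reads roughly
like this (a sketch based on the pattern, not a verbatim quote):

BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
{
    WARN_ON_ONCE(!rcu_read_lock_held());
    return map->ops->map_delete_elem(map, key);
}

const struct bpf_func_proto bpf_map_delete_elem_proto = {
    .func       = bpf_map_delete_elem,
    .gpl_only   = false,
    .ret_type   = RET_INTEGER,
    .arg1_type  = ARG_CONST_MAP_PTR,
    .arg2_type  = ARG_PTR_TO_MAP_KEY,
};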

I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann 6088b5823b bpf: minor cleanups in helpers
Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
anyway, so we can drop the unneeded test for dev; also few more other
misc bits addressed here.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:03 -07:00
Daniel Borkmann 80b48c4457 bpf: don't use raw processor id in generic helper
Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
instead of the raw variant. This allows for preemption checks when we
have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
need to keep the raw variant for socket filters, but we can reuse the
helper that is already there from cBPF side.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:54:40 -04:00
Daniel Borkmann 074f528eed bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK
This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:40:41 -04:00
Alexei Starovoitov cdc4e47da8 bpf: avoid copying junk bytes in bpf_get_current_comm()
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after the terminating zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as part of a
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee a zero-terminated string.
The bpf verifier checks that the output buffer is zero-initialized,
so even for short task names the output buffer doesn't contain junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf0 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 23:27:30 -05:00
Daniel Borkmann 3ad0040573 bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs
While recently arguing in a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unprivileged user space, I forgot the
fact that the SKF_AD_RANDOM extension actually already does it for some
time in cBPF via commit 4cd3675ebf ("filter: added BPF random opcode").

Since prandom_u32() is being used in a lot of critical networking code,
let's be more conservative and split their states. Furthermore, consolidate
the eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible to privileged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.

One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.

Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 05:26:39 -07:00
Alexei Starovoitov ffeedafbf0 bpf: introduce current->pid, tgid, uid, gid, comm accessors
eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:

u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid

u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid

bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf

They can be used from the programs attached to TC as well to classify packets
based on current task fields.

Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
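
A hedged fragment showing how a program would typically unpack these values:

    char comm[16] = {};
    __u64 pid_tgid = bpf_get_current_pid_tgid();
    __u64 uid_gid = bpf_get_current_uid_gid();
    __u32 tgid = pid_tgid >> 32;        /* userspace-visible "pid" */
    __u32 pid = (__u32)pid_tgid;        /* thread id */
    __u32 uid = (__u32)uid_gid;         /* gid sits in the upper 32 bits */

    bpf_get_current_comm(comm, sizeof(comm));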

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-15 15:53:50 -07:00
Daniel Borkmann 3324b584b6 ebpf: misc core cleanup
Besides others, move bpf_tail_call_proto to the remaining definitions
of other protos, improve comments a bit (i.e. remove some obvious ones,
where the code is already self-documenting, add objectives for others),
simplify bpf_prog_array_compatible() a bit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31 21:44:44 -07:00
Daniel Borkmann 17ca8cbf49 ebpf: allow bpf_ktime_get_ns_proto also for networking
As this is already exported from tracing side via commit d9847d310a
("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
as well want to move it to the core, so also networking users can make
use of it, e.g. to measure diffs for certain flows from ingress/egress.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31 21:44:44 -07:00
Daniel Borkmann c04167ce2c ebpf: add helper for obtaining current processor id
This patch adds the possibility to obtain raw_smp_processor_id() in
eBPF. Currently, this is only possible in classic BPF where commit
da2033c282 ("filter: add SKF_AD_RXHASH and SKF_AD_CPU") has added
facilities for this.

Perhaps most importantly, this would also allow us to track per CPU
statistics with eBPF maps, or to implement a poor-man's per CPU data
structure through eBPF maps.

Example function proto-type looks like:

  u32 (*smp_processor_id)(void) = (void *)BPF_FUNC_get_smp_processor_id;

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15 21:57:25 -04:00
Daniel Borkmann 03e69b508b ebpf: add prandom helper for packet sampling
This work is similar to commit 4cd3675ebf ("filter: added BPF
random opcode") and adds a possibility for packet sampling in eBPF.

Currently, this is only possible in classic BPF and useful to
combine sampling with f.e. packet sockets, possible also with tc.

Example function proto-type looks like:

  u32 (*prandom_u32)(void) = (void *)BPF_FUNC_get_prandom_u32;

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15 21:57:25 -04:00
Daniel Borkmann a2c83fff58 ebpf: constify various function pointer structs
We can move bpf_map_ops and bpf_verifier_ops and other structs into ro
section, bpf_map_type_list and bpf_prog_type_list into read mostly.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 14:05:18 -05:00
Alexei Starovoitov d0003ec01c bpf: allow eBPF programs to use maps
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs
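
A hedged in-program fragment using the accessors (map shown in modern libbpf
declaration style for brevity; get_event_key() is illustrative):

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, __u64);
} counts SEC(".maps");

/* inside a program: count events per key */
    __u32 key = get_event_key();        /* illustrative */
    __u64 init = 1, *val;

    val = bpf_map_lookup_elem(&counts, &key);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&counts, &key, &init, BPF_ANY);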

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:44:00 -05:00