Commit Graph

597 Commits

Author SHA1 Message Date
Artem Savkov 400701606d bpf: Support __kptr to local kptrs
Bugzilla: https://bugzilla.redhat.com/2221599

commit c8e18754091479fac3f5b6c053c6bc4be0b7fb11
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Fri Mar 10 15:07:41 2023 -0800

    bpf: Support __kptr to local kptrs
    
    If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module
    BTF - it must have been allocated by bpf_obj_new and therefore must be
    free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local
    kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new.
    
    This patch adds support for treating __kptr-tagged pointers to "local
    kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr
    acquire / release semantics. Consider the following example:
    
      struct node_data {
              long key;
              long data;
              struct bpf_rb_node node;
      };
    
      struct map_value {
              struct node_data __kptr *node;
      };
    
      struct {
              __uint(type, BPF_MAP_TYPE_ARRAY);
              __type(key, int);
              __type(value, struct map_value);
              __uint(max_entries, 1);
      } some_nodes SEC(".maps");
    
    If struct node_data had a matching definition in kernel BTF, the verifier would
    expect a destructor for the type to be registered. Since struct node_data does
    not match any type in kernel BTF, the verifier knows that there is no kfunc
    that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can
    only come from bpf_obj_new. So instead of searching for a registered dtor,
    a bpf_obj_drop dtor can be assumed.
    
    This allows the runtime to properly destruct such kptrs in
    bpf_obj_free_fields, which enables maps to clean up map_vals w/ such
    kptrs when going away.
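
    For illustration, a minimal BPF-side sketch of these semantics (hedged:
    the bpf_obj_new()/bpf_kptr_xchg()/bpf_obj_drop() usage and the map above
    are assumptions modeled on the selftest style, not a quote of this patch):

      int key = 0;
      struct map_value *v = bpf_map_lookup_elem(&some_nodes, &key);
      struct node_data *n, *old;

      if (!v)
              return 0;
      n = bpf_obj_new(typeof(*n));       /* local kptr, tagged MEM_ALLOC */
      if (!n)
              return 0;
      old = bpf_kptr_xchg(&v->node, n);  /* stash it in the map value */
      if (old)
              bpf_obj_drop(old);         /* drop whatever was displaced */
      return 0;

    Any node still referenced by the map value when the map is destroyed is
    dropped by bpf_obj_free_fields() via the implicit bpf_obj_drop destructor.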
    
    Implementation notes:
      * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr.
        Before this patch, the variable would only ever point to vmlinux or
        module BTFs, but now it can point to some program BTF for local kptr
        type. It's later used to populate the (btf, btf_id) pair in kptr btf
        field.
      * It's necessary to btf_get the program BTF when populating btf_field
        for local kptr. btf_record_free later does a btf_put.
      * Behavior for non-local referenced kptrs is not modified, as
        bpf_find_btf_id helper only searches vmlinux and module BTFs for
        matching BTF type. If such a type is found, btf_field_kptr's btf will
        pass btf_is_kernel check, and the associated release function is
        some one-argument dtor. If btf_is_kernel check fails, associated
        release function is two-arg bpf_obj_drop_impl. Before this patch
        only btf_field_kptr's w/ kernel or module BTFs were created.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:16 +02:00
Artem Savkov a441724b52 bpf: Change btf_record_find enum parameter to field_mask
Bugzilla: https://bugzilla.redhat.com/2221599

commit 74843b57ec70af7b67b7e6153374834ee18d139f
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Thu Mar 9 10:01:08 2023 -0800

    bpf: Change btf_record_find enum parameter to field_mask
    
    btf_record_find's 3rd parameter can be multiple enum btf_field_type values
    masked together. The function is called with BPF_KPTR in two places in
    verifier.c, so it works with masked values already.
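
    As a rough illustration of such a masked lookup (a hypothetical call site;
    the post-patch signature and the BPF_KPTR mask are assumptions based on
    the text above):

      struct btf_field *field;

      /* BPF_KPTR is itself a mask: BPF_KPTR_UNREF | BPF_KPTR_REF */
      field = btf_record_find(map->record, off, BPF_KPTR);
      if (!field)
              return -EACCES;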
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230309180111.1618459-4-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:16 +02:00
Artem Savkov abbb94d5b6 bpf: enforce all maps having memory usage callback
Bugzilla: https://bugzilla.redhat.com/2221599

commit 6b4a6ea2c62d34272d64161d43a19c02355576e2
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:15 2023 +0000

    bpf: enforce all maps having memory usage callback
    
    We have implemented memory usage callback for all maps, and we enforce
    that any newly added map has a callback as well. We check this callback at
    map creation time. If it doesn't have the callback, we will return
    EINVAL.
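
    A sketch of the creation-time check described above (placement and error
    path are assumptions):

      /* in map_create(), before the map is allocated */
      if (!ops->map_mem_usage)
              return -EINVAL;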
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-19-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:13 +02:00
Artem Savkov 7fc5796a7e bpf: offload map memory usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 9629363cd05642fe43aded44938adec067ad1da3
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:14 2023 +0000

    bpf: offload map memory usage
    
    A new helper is introduced to calculate offload map memory usage. But
    currently the memory dynamically allocated in netdev dev_ops, like
    nsim_map_update_elem, is not counted. Let's just put it aside now.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-18-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:13 +02:00
Artem Savkov 0378252f0c bpf: add new map ops ->map_mem_usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 90a5527d7686d3ebe0dd2a831356a6c7d7dc31bc
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:45:58 2023 +0000

    bpf: add new map ops ->map_mem_usage
    
    Add a new map ops ->map_mem_usage to print the memory usage of a
    bpf map.
    
    This is a preparation for the followup change.
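
    As a sketch, the new callback and a trivial per-map implementation could
    look like this (the array example is illustrative, not the exact upstream
    code):

      /* in struct bpf_map_ops */
      u64 (*map_mem_usage)(const struct bpf_map *map);

      static u64 array_map_mem_usage(const struct bpf_map *map)
      {
              struct bpf_array *array = container_of(map, struct bpf_array, map);

              /* rough accounting: the map struct plus the element area */
              return sizeof(*array) + (u64)map->max_entries * array->elem_size;
      }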
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-2-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:12 +02:00
Artem Savkov ef8295203b bpf: Support kptrs in local storage maps
Bugzilla: https://bugzilla.redhat.com/2221599

commit 9db44fdd8105da00669d425acab887c668df75f6
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sat Feb 25 16:40:09 2023 +0100

    bpf: Support kptrs in local storage maps
    
    Enable support for kptrs in local storage maps by wiring up the freeing
    of these kptrs from map value. Freeing of bpf_local_storage_map is only
    delayed in case there are special fields, therefore bpf_selem_free_*
    path can also only dereference smap safely in that case. This is
    recorded using a bool utilizing a hole in bpf_local_storage_elem. It
    could have been tagged in the pointer value smap using the lowest bit
    (since alignment > 1), but since there was already a hole I went with
    the simpler option. Only the map structure freeing is delayed using RCU
    barriers, as the buckets aren't used when selem is being freed, so they
    can be freed once all readers of the bucket lists can no longer access
    it.
    
    Cc: Martin KaFai Lau <martin.lau@kernel.org>
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230225154010.391965-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Artem Savkov 6d840b01af bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
Bugzilla: https://bugzilla.redhat.com/2221599

commit 65334e64a493c6a0976de7ad56bf8b7a9ff04b4a
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sat Feb 25 16:40:08 2023 +0100

    bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
    
    Enable support for kptrs in percpu BPF hashmap and percpu BPF LRU
    hashmap by wiring up the freeing of these kptrs from percpu map
    elements.
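
    A sketch of the wiring described above (assumes bpf_obj_free_fields() and
    the usual per-CPU value layout; not the literal upstream hunk):

      int cpu;

      for_each_possible_cpu(cpu) {
              /* free any kptrs embedded in this CPU's copy of the value */
              bpf_obj_free_fields(map->record, per_cpu_ptr(pptr, cpu));
      }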
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230225154010.391965-2-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Jan Stancek e341c7e709 Merge: bpf, xdp: update to 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583

Rebase bpf and xdp to 6.3.

Bugzilla: https://bugzilla.redhat.com/2178930

Signed-off-by: Viktor Malik <vmalik@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jason Wang <jasowang@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-28 07:52:45 +02:00
Viktor Malik 7e487d11fc bpf: Add basic bpf_rb_{root,node} support
Bugzilla: https://bugzilla.redhat.com/2178930

commit 9c395c1b99bd23f74bc628fa000480c49593d17f
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Mon Feb 13 16:40:10 2023 -0800

    bpf: Add basic bpf_rb_{root,node} support
    
    This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
    BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
    types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
    map_values.
    
    structs bpf_rb_root and bpf_rb_node are opaque types meant to
    obscure structs rb_root_cached and rb_node, respectively.
    
    btf_struct_access will prevent BPF programs from touching these special
    fields automatically now that they're recognized.
    
    btf_check_and_fixup_fields now groups list_head and rb_root together as
    "graph root" fields and {list,rb}_node as "graph node", and does same
    ownership cycle checking as before. Note that this function does _not_
    prevent ownership type mixups (e.g. rb_root owning list_node) - that's
    handled by btf_parse_graph_root.
    
    After this patch, a bpf program can have a struct bpf_rb_root in a
    map_value, but not add anything to nor do anything useful with it.
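
    For illustration, declaring such a field could look like this (a sketch
    modeled on the bpf_list_head examples elsewhere in this log; the
    __contains annotation and the lock are assumptions about the wider rbtree
    series):

      struct node_data {
              long key;
              struct bpf_rb_node node;
      };

      struct map_value {
              struct bpf_spin_lock lock;
              struct bpf_rb_root root __contains(node_data, node);
      };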
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:29 +02:00
Viktor Malik c4b5de4021 bpf: allow to disable bpf map memory accounting
Bugzilla: https://bugzilla.redhat.com/2178930

commit ee53cbfb1ebf990de0d084a7cd6b67b05fe1f7ac
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Feb 10 15:47:33 2023 +0000

    bpf: allow to disable bpf map memory accounting
    
    We can simply set root memcg as the map's memcg to disable bpf memory
    accounting. bpf_map_area_alloc is a little special as it gets the memcg
    from current rather than from the map, so we need to disable GFP_ACCOUNT
    specifically for it.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:27 +02:00
Viktor Malik 17c5fbcb2c bpf: use bpf_map_kvcalloc in bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2178930

commit ddef81b5fd1da4d7c3cc8785d2043b73b72f38ef
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Feb 10 15:47:32 2023 +0000

    bpf: use bpf_map_kvcalloc in bpf_local_storage
    
    Introduce a new helper bpf_map_kvcalloc() for the memory allocation in
    bpf_local_storage(). The allocation will then be charged to the map's memcg
    instead of to current's, though for now these are the same thing, since the
    helper is only used in the map creation path. Charging the map's memory to
    the memcg taken from the map makes the accounting clearer.
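
    A sketch of the intended call pattern (the argument order mirrors
    kvcalloc(); treat the exact signature as an assumption):

      smap->buckets = bpf_map_kvcalloc(&smap->map, nbuckets,
                                       sizeof(*smap->buckets),
                                       GFP_USER | __GFP_NOWARN);
      if (!smap->buckets)
              return -ENOMEM;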
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:27 +02:00
Viktor Malik 9c40fe23e7 bpf: Drop always true do_idr_lock parameter to bpf_map_free_id
Bugzilla: https://bugzilla.redhat.com/2178930

commit 158e5e9eeaa0d7a86f2278313746ef6c8521790d
Author: Tobias Klauser <tklauser@distanz.ch>
Date:   Thu Feb 2 15:19:21 2023 +0100

    bpf: Drop always true do_idr_lock parameter to bpf_map_free_id
    
    The do_idr_lock parameter to bpf_map_free_id was introduced by commit
    bd5f5f4ecb ("bpf: Add BPF_MAP_GET_FD_BY_ID"). However, all callers set
    do_idr_lock = true since commit 1e0bd5a091 ("bpf: Switch bpf_map ref
    counter to atomic64_t so bpf_map_inc() never fails").
    
    While at it also inline __bpf_map_put into its only caller bpf_map_put
    now that do_idr_lock can be dropped from its signature.
    
    Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
    Link: https://lore.kernel.org/r/20230202141921.4424-1-tklauser@distanz.ch
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:21 +02:00
Felix Maurer e96fdaf0aa bpf: Support consuming XDP HW metadata from fext programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit fd7c211d6875013f81acc09868effe199b5d2c0c
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Thu Jan 19 14:15:27 2023 -0800

    bpf: Support consuming XDP HW metadata from fext programs

    Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP
    programs that consume HW metadata, implement support for propagating the
    offload information. The extension program doesn't need to set a flag or
    ifindex; these will just be propagated from the target by the verifier.
    We need to create a separate offload object for the extension program,
    though, since it can be reattached to a different program later (which
    means we can't just inherit the offload information from the target).

    An additional check is added on attach that the new target is compatible
    with the offload information in the extension prog.

    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-9-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:14 +02:00
Felix Maurer e630642b6b bpf: Introduce device-bound XDP programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:24 2023 -0800

    bpf: Introduce device-bound XDP programs

    New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
    to associate a netdev with a BPF program at load time.

    netdevsim checks are dropped in favor of generic check in dev_xdp_attach.
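
    A minimal load-time sketch using libbpf (the opts-based API here is an
    assumption; only BPF_F_XDP_DEV_BOUND_ONLY comes from this patch):

      LIBBPF_OPTS(bpf_prog_load_opts, opts,
                  .prog_ifindex = ifindex,                  /* bind to this netdev */
                  .prog_flags   = BPF_F_XDP_DEV_BOUND_ONLY, /* bound, not offloaded */
      );
      int prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "xdp_devbound", "GPL",
                                  insns, insn_cnt, &opts);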

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:13 +02:00
Felix Maurer c0febc32b2 bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 9d03ebc71a027ca495c60f6e94d3cda81921791f
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:21 2023 -0800

    bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded

    BPF offloading infra will be reused to implement
    bound-but-not-offloaded bpf programs. Rename existing
    helpers for clarity. No functional changes.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:12 +02:00
Viktor Malik 4364d1c4d0 bpf: Remove unused field initialization in bpf's ctl_table
Bugzilla: https://bugzilla.redhat.com/2178930

commit cfca00767febba5f4f5e300fab10e0974491dd4b
Author: Ricardo Ribalda <ribalda@chromium.org>
Date:   Wed Dec 21 20:55:29 2022 +0100

    bpf: Remove unused field initialization in bpf's ctl_table
    
    Maxlen is used by standard proc_handlers such as proc_dointvec(), but in this
    case we have our own proc_handler via bpf_stats_handler(). Therefore, remove
    the initialization.
    
    Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20221221-bpf-syscall-v1-0-9550f5f2c3fc@chromium.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:27 +02:00
Ivan Vecera c3640c0d84 bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170

commit 97c4090badca743451c3798f1c1846e9f3f252de
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Oct 26 14:31:10 2022 +0200

    bpf: Remove the obsolte u64_stats_fetch_*_irq() users.

    Now that the 32bit UP oddity is gone and 32bit always uses a sequence
    count, there is no need for the fetch_irq() variants anymore.

    Convert to the regular interface.
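
    The conversion is mechanical; a minimal sketch of a reader after the
    change (the stats field names are placeholders):

      unsigned int start;
      u64 cnt;

      do {
              start = u64_stats_fetch_begin(&stats->syncp);  /* was _begin_irq() */
              cnt = u64_stats_read(&stats->cnt);
      } while (u64_stats_fetch_retry(&stats->syncp, start)); /* was _retry_irq() */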

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/bpf/20221026123110.331690-1-bigeasy@linutronix.de

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-06-08 13:37:01 +02:00
Jerome Marchand af27d51cf6 bpf: remove the do_idr_lock parameter from bpf_prog_free_id()
Bugzilla: https://bugzilla.redhat.com/2177177

commit e7895f017b79410bf4591396a733b876dc1e0e9d
Author: Paul Moore <paul@paul-moore.com>
Date:   Fri Jan 6 10:44:00 2023 -0500

    bpf: remove the do_idr_lock parameter from bpf_prog_free_id()

    It was determined that the do_idr_lock parameter to
    bpf_prog_free_id() was not necessary as it should always be true.

    Suggested-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230106154400.74211-2-paul@paul-moore.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:20 +02:00
Jerome Marchand 1b2a174b5c bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD
Bugzilla: https://bugzilla.redhat.com/2177177

commit ef01f4e25c1760920e2c94f1c232350277ace69b
Author: Paul Moore <paul@paul-moore.com>
Date:   Fri Jan 6 10:43:59 2023 -0500

    bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD

    When changing the ebpf program put() routines to support being called
    from within IRQ context the program ID was reset to zero prior to
    calling the perf event and audit UNLOAD record generators, which
    resulted in problems as the ebpf program ID was bogus (always zero).
    This patch addresses this problem by removing an unnecessary call to
    bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting
    __bpf_prog_put() to only call bpf_prog_free_id() after audit and perf
    have finished their bpf program unload tasks in
    bpf_prog_put_deferred().  For the record, no one can determine, or
    remember, why it was necessary to free the program ID, and remove it
    from the IDR, prior to executing bpf_prog_put_deferred();
    regardless, both Stanislav and Alexei agree that the approach in this
    patch should be safe.

    It is worth noting that when moving the bpf_prog_free_id() call, the
    do_idr_lock parameter was forced to true as the ebpf devs determined
    this was correct, as do_idr_lock should always be true.  The
    do_idr_lock parameter will be removed in a follow-up patch, but it
    was kept here to keep the patch small in an effort to ease any stable
    backports.

    I also modified the bpf_audit_prog() logic used to associate the
    AUDIT_BPF record with other associated records, e.g. @ctx != NULL.
    Instead of keying off the operation, it now keys off the execution
    context, e.g. '!in_irq && !irqs_disabled()', which is much more
    appropriate and should help better connect the UNLOAD operations with
    the associated audit state (other audit records).

    Cc: stable@vger.kernel.org
    Fixes: d809e134be7a ("bpf: Prepare bpf_prog_put() to be called from irq context.")
    Reported-by: Burn Alting <burn.alting@iinet.net.au>
    Reported-by: Jiri Olsa <olsajiri@gmail.com>
    Suggested-by: Stanislav Fomichev <sdf@google.com>
    Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230106154400.74211-1-paul@paul-moore.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:20 +02:00
Jerome Marchand de6eb19233 bpf: Add comments for map BTF matching requirement for bpf_list_head
Bugzilla: https://bugzilla.redhat.com/2177177

commit c22dfdd21592c5d56b49d5fba8de300ad7bf293c
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:26:08 2022 +0530

    bpf: Add comments for map BTF matching requirement for bpf_list_head

    The old behavior of bpf_map_meta_equal was that it compared timer_off
    to be equal (but not spin_lock_off, because that was not allowed), and
    did memcmp of kptr_off_tab.

    Now, we memcmp the btf_record of two bpf_map structs, which has all
    fields.

    We preserve backwards compat as we kzalloc the array, so if only spin
    lock and timer exist in map, we only compare offset while the rest of
    unused members in the btf_field struct are zeroed out.

    In case of kptr, the btf and everything else comes from vmlinux or a module,
    so as long as the type is the same it will match, since the kernel btf,
    module, and dtor pointer will be the same across maps.

    Now with list_head in the mix, things are a bit complicated. We
    implicitly add a requirement that both BTFs are the same, because struct
    btf_field_list_head has btf and value_rec members.

    We obviously shouldn't force BTFs to be equal by default, as that breaks
    backwards compatibility.

    Currently it is only implicitly required due to list_head matching
    struct btf and value_rec member. value_rec points back into a btf_record
    stashed in the map BTF (btf member of btf_field_list_head). So that
    pointer and btf member have to match exactly.

    Document all these subtle details so that things don't break in the
    future when touching this code.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:07 +02:00
Jerome Marchand 5fb8030979 bpf: Verify ownership relationships for user BTF types
Bugzilla: https://bugzilla.redhat.com/2177177

commit 865ce09a49d79d2b2c1d980f4c05ffc0b3517bdc
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:57 2022 +0530

    bpf: Verify ownership relationships for user BTF types

    Ensure that there can be no ownership cycles among different types by
    way of having owning objects that can hold some other type as their
    element. For instance, a map value can only hold allocated objects, but
    these are allowed to have another bpf_list_head. To prevent unbounded
    recursion while freeing resources, elements of bpf_list_head in local
    kptrs can never have a bpf_list_head which are part of list in a map
    value. Later patches will verify this by having dedicated BTF selftests.

    Also, to make runtime destruction easier, once btf_struct_metas is fully
    populated, we can stash the metadata of the value type directly in the
    metadata of the list_head fields, as that allows easier access to the
    value type's layout to destruct it at runtime from the btf_field entry
    of the list head itself.
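
    A hypothetical shape of the kind of cycle this rejects (illustrative
    only):

      struct bar {
              struct bpf_list_node node;
              struct bpf_list_head head __contains(foo, node);  /* bar owns foo... */
      };

      struct foo {
              struct bpf_list_node node;
              struct bpf_list_head head __contains(bar, node);  /* ...and foo owns bar */
      };

    Freeing either type could then recurse without bound, so the ownership
    check flags it.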

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand ef745b384b bpf: Recognize lock and list fields in allocated objects
Bugzilla: https://bugzilla.redhat.com/2177177

commit 8ffa5cc142137a59d6a10eb5273fa2ba5dcd4947
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:56 2022 +0530

    bpf: Recognize lock and list fields in allocated objects

    Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in an
    allocated object.

    Also update btf_struct_access to reject direct access to these special
    fields.

    A bpf_list_head allows implementing map-in-map style use cases, where an
    allocated object with bpf_list_head is linked into a list in a map
    value. This would require embedding a bpf_list_node, support for which
    is also included. The bpf_spin_lock is used to protect the bpf_list_head
    and other data.

    While we don't strictly require holding a bpf_spin_lock while touching
    the bpf_list_head in such objects, as when we have access to it we have
    complete ownership of the object, the locking constraint is still kept
    and may be conditionally lifted in the future.

    Note that the specification of such types can be done just like map
    values, e.g.:

    struct bar {
    	struct bpf_list_node node;
    };

    struct foo {
    	struct bpf_spin_lock lock;
    	struct bpf_list_head head __contains(bar, node);
    	struct bpf_list_node node;
    };

    struct map_value {
    	struct bpf_spin_lock lock;
    	struct bpf_list_head head __contains(foo, node);
    };

    To recognize such types in user BTF, we build a btf_struct_metas array
    of metadata items corresponding to each BTF ID. This is done once during
    the btf_parse stage to avoid having to repeat it each time the
    verification process needs to inspect the metadata.

    Moreover, the computed metadata needs to be passed to some helpers in
    future patches which requires allocating them and storing them in the
    BTF that is pinned by the program itself, so that valid access can be
    assumed to such data during program runtime.

    A key thing to note is that once a btf_struct_meta is available for a
    type, both the btf_record and btf_field_offs should be available. It is
    critical that btf_field_offs is available in case special fields are
    present, as we extensively rely on special fields being zeroed out in
    map values and allocated objects in later patches. The code ensures that
    by bailing out in case of errors and ensuring both are available
    together. If the record is not available, the special fields won't be
    recognized, so not having both is also fine (in terms of being a
    verification error and not a runtime bug).

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand 1a6464875f bpf: Do btf_record_free outside map_free callback
Bugzilla: https://bugzilla.redhat.com/2177177

commit d7f5ef653c3dd0c0d649cae6ef2708053bb1fb2b
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:52 2022 +0530

    bpf: Do btf_record_free outside map_free callback

    Since the commit being fixed, we now miss freeing btf_record for local
    storage maps which will have a btf_record populated in case they have
    bpf_spin_lock element.

    This was missed because I made the choice of offloading the job to free
    kptr_off_tab (now btf_record) to the map_free callback when adding
    support for kptrs.

    Revisiting the reason for this decision, there is the possibility that
    the btf_record gets used inside map_free callback (e.g. in case of maps
    embedding kptrs) to iterate over them and free them, hence doing it
    before the map_free callback would leak special field memory and lead to
    invalid memory accesses. The btf_record keeps module references which
    is critical to ensure the dtor call made for referenced kptr is safe to
    do.

    If doing it after map_free callback, the map area is already freed, so
    we cannot access bpf_map structure anymore.

    To fix this and prevent such lapses in future, move bpf_map_free_record
    out of the map_free callback, and do it after map_free by remembering
    the btf_record pointer. There is no need to access bpf_map structure in
    that case, and we can avoid missing this case when support for new map
    types is added for other special fields.

    Since a btf_record and its btf_field_offs are used together, for
    consistency delay freeing of field_offs as well. While not a problem
    right now, a lot of code assumes that either both record and field_offs
    are set or none at once.

    Note that in case of map of maps (outer maps), inner_map_meta->record is
    only used during verification, not to free fields in map value, hence we
    simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
    never touch map->inner_map_meta in bpf_map_free_deferred.

    Add a comment making note of these details.

    Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record")
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand 774d15cf41 bpf: Fix early return in map_check_btf
Bugzilla: https://bugzilla.redhat.com/2177177

commit c237bfa5283a562cd5d74dd74b2d9016acd97f45
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:51 2022 +0530

    bpf: Fix early return in map_check_btf

    Instead of returning directly with -EOPNOTSUPP for the timer case, we
    need to free the btf_record before returning to userspace.

    Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record")
    Reported-by: Dan Carpenter <error27@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand b928a49814 bpf: Pass map file to .map_update_batch directly
Bugzilla: https://bugzilla.redhat.com/2177177

commit 3af43ba4c6019b29c048921eb8147eb010165329
Author: Hou Tao <houtao1@huawei.com>
Date:   Wed Nov 16 15:50:58 2022 +0800

    bpf: Pass map file to .map_update_batch directly

    Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
    the target map file, then it invokes generic_map_update_batch() to do
    batch update. generic_map_update_batch() will get the target map file
    by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().

    The problem is that the map file returned by the second fdget() may be
    NULL or a totally different file compared with the map file in
    bpf_map_do_batch(). The reason is that the first fdget() only guarantees
    the liveness of the struct file, not of the file descriptor, and the
    file descriptor may be released by a concurrent close() through
    pick_file().

    It doesn't incur any problem as for now, because maps with batch update
    support don't use map file in .map_fd_get_ptr() ops. But it is better to
    fix the potential access of an invalid map file.

    Using __bpf_map_get() again in generic_map_update_batch() can not fix
    the problem, because batch.map_fd may be closed and reopened, and the
    returned map file may be different from the map file obtained in
    bpf_map_do_batch(), so just pass the map file directly to
    .map_update_batch() in bpf_map_do_batch().

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:05 +02:00
Jerome Marchand d03c51f6bc bpf: Support bpf_list_head in map values
Bugzilla: https://bugzilla.redhat.com/2177177

commit f0c5941ff5b255413d31425bb327c2aec3625673
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Nov 15 00:45:25 2022 +0530

    bpf: Support bpf_list_head in map values

    Add the support on the map side to parse, recognize, verify, and build
    metadata table for a new special field of the type struct bpf_list_head.
    To parameterize the bpf_list_head for a certain value type and the
    list_node member it will accept in that value type, we use BTF
    declaration tags.

    The definition of bpf_list_head in a map value will be done as follows:

    struct foo {
    	struct bpf_list_node node;
    	int data;
    };

    struct map_value {
    	struct bpf_list_head head __contains(foo, node);
    };

    Then, the bpf_list_head only allows adding to the list 'head' using the
    bpf_list_node 'node' for the type struct foo.

    The 'contains' annotation is a BTF declaration tag composed of four
    parts, "contains:name:node" where the name is then used to look up the
    type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
    the lookup. The node defines the name of the member in this type that has
    the type struct bpf_list_node, which is actually used for linking into
    the linked list. For now, 'kind' part is hardcoded as struct.
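
    For reference, the selftests define the annotation roughly as follows (a
    sketch; the exact header is an assumption):

      #define __contains(name, node) \
              __attribute__((btf_decl_tag("contains:" #name ":" #node)))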

    This allows building intrusive linked lists in BPF, using container_of
    to obtain pointer to entry, while being completely type safe from the
    perspective of the verifier. The verifier knows exactly the type of the
    nodes, and knows that list helpers return that type at some fixed offset
    where the bpf_list_node member used for this list exists. The verifier
    also uses this information to disallow adding types that are not
    accepted by a certain list.

    For now, no elements can be added to such lists. Support for that is
    coming in future patches, hence draining and freeing items is done with
    a TODO that will be resolved in a future patch.

    Note that the bpf_list_head_free function moves the list out to a local
    variable under the lock and releases it, doing the actual draining of
    the list items outside the lock. While this helps with not holding the
    lock for too long pessimizing other concurrent list operations, it is
    also necessary for deadlock prevention: unless every function called in
    the critical section would be notrace, a fentry/fexit program could
    attach and call bpf_map_update_elem again on the map, leading to the
    same lock being acquired if the key matches, resulting in a deadlock.
    While this requires some special effort on part of the BPF programmer to
    trigger and is highly unlikely to occur in practice, it is always better
    if we can avoid such a condition.

    While notrace would prevent this, doing the draining outside the lock
    has advantages of its own, hence it is used to also fix the deadlock
    related problem.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:04 +02:00
Jerome Marchand e9b5bda40b bpf: Refactor map->off_arr handling
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Minor changes from already backported commit 1f6e04a1c7b8
("bpf: Fix offset calculation error in __copy_map_value and
zero_map_value")

commit f71b2f64177a199d5b1d2047e155d45fd98f564a
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:57 2022 +0530

    bpf: Refactor map->off_arr handling

    Refactor map->off_arr handling into generic functions that can work on
    their own without hardcoding map specific code. The btf_fields_offs
    structure is now returned from btf_parse_field_offs, which can be reused
    later for types in program BTF.

    All functions like copy_map_value, zero_map_value call generic
    underlying functions so that they can also be reused later for copying
    to values allocated in programs which encode specific fields.

    Later, some helper functions will also require access to this
    btf_field_offs structure to be able to skip over special fields at
    runtime.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand 2b8a340165 bpf: Consolidate spin_lock, timer management into btf_record
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from already backported commit 997849c4b969
("bpf: Zeroing allocated object from slab in bpf memory allocator")

commit db559117828d2448fe81ada051c60bcf39f822e9
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:56 2022 +0530

    bpf: Consolidate spin_lock, timer management into btf_record

    Now that kptr_off_tab has been refactored into btf_record, and can hold
    more than one specific field type, accommodate bpf_spin_lock and
    bpf_timer as well.

    While they don't require any more metadata than offset, having all
    special fields in one place allows us to share the same code for
    allocated user defined types and handle both map values and these
    allocated objects in a similar fashion.

    As an optimization, we still keep spin_lock_off and timer_off offsets in
    the btf_record structure, just to avoid having to find the btf_field
    struct each time their offset is needed. This is mostly needed to
    manipulate such objects in a map value at runtime. It's ok to hardcode
    just one offset as more than one field is disallowed.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand 40100e4a5a bpf: Refactor kptr_off_tab into btf_record
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts:
 - Context change from already backported commit 997849c4b969 ("bpf:
Zeroing allocated object from slab in bpf memory allocator")
 - Minor changes from already backported commit 1f6e04a1c7b8 ("bpf:
Fix offset calculation error in __copy_map_value and zero_map_value")

commit aa3496accc412b3d975e4ee5d06076d73394d8b5
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:55 2022 +0530

    bpf: Refactor kptr_off_tab into btf_record

    To prepare the BPF verifier to handle special fields in both map values
    and program allocated types coming from program BTF, we need to refactor
    the kptr_off_tab handling code into something more generic and reusable
    across both cases to avoid code duplication.

    Later patches also require passing this data to helpers at runtime, so
    that they can work on user defined types, initialize them, destruct
    them, etc.

    The main observation is that both map values and such allocated types
    point to a type in program BTF, hence they can be handled similarly. We
    can prepare a field metadata table for both cases and store them in
    struct bpf_map or struct btf depending on the use case.

    Hence, refactor the code into generic btf_record and btf_field member
    structs. The btf_record represents the fields of a specific btf_type in
    user BTF. The cnt indicates the number of special fields we successfully
    recognized, and field_mask is a bitmask of fields that were found, to
    enable quick determination of availability of a certain field.
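
    A rough shape of the resulting structures (a simplified sketch, not the
    verbatim definitions):

      struct btf_field {
              u32 offset;                /* offset of the field in the value */
              enum btf_field_type type;  /* BPF_KPTR_*, BPF_SPIN_LOCK, ... */
              union {
                      struct btf_field_kptr kptr;
                      /* per-type metadata */
              };
      };

      struct btf_record {
              u32 cnt;         /* number of recognized special fields */
              u32 field_mask;  /* bitmask of field types that were found */
              struct btf_field fields[];
      };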

    Subsequently, refactor the rest of the code to work with these generic
    types, remove assumptions about kptr and kptr_off_tab, rename variables
    to more meaningful names, etc.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand dcf538d57d bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from missing commit 7f203bc89eb6 ("cgroup:
Replace cgroup->ancestor_ids[] with ->ancestors[]")

commit c4bcfb38a95edb1021a53f2d0356a78120ecfbe4
Author: Yonghong Song <yhs@fb.com>
Date:   Tue Oct 25 21:28:50 2022 -0700

    bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

    Similar to sk/inode/task storage, implement similar cgroup local storage.

    There already exists a local storage implementation for cgroup-attached
    bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
    bpf_get_local_storage(). But there are use cases where non-cgroup-attached
    bpf progs want to access cgroup local storage data. For example, a
    tc egress prog has access to sk and cgroup. It is possible to use
    sk local storage to emulate cgroup local storage by storing data in the socket,
    but this is wasteful as there could be lots of sockets belonging to a particular
    cgroup. Alternatively, a separate map can be created with the cgroup id as the key.
    But this will introduce additional overhead to manipulate the new map.
    A cgroup local storage, similar to existing sk/inode/task storage,
    should help for this use case.

    The life-cycle of storage is managed with the life-cycle of the
    cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
    with a call to bpf_cgrp_storage_free() when cgroup itself
    is deleted.

    The userspace map operations can be done by using a cgroup fd as a key
    passed to the lookup, update and delete operations.

    Typically, the following code is used to get the current cgroup:
        struct task_struct *task = bpf_get_current_task_btf();
        ... task->cgroups->dfl_cgrp ...
    and in structure task_struct definition:
        struct task_struct {
            ....
            struct css_set __rcu            *cgroups;
            ....
        }
    With a sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
    So the current implementation only supports non-sleepable programs; supporting
    sleepable programs will be the next step, together with adding rcu_read_lock
    protection for rcu tagged structures.

    Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
    storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
    for cgroup storage available to non-cgroup-attached bpf programs. The old
    cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
    The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
    functionality. While old cgroup storage pre-allocates storage memory, the new
    mechanism can also pre-allocate with a user space bpf_map_update_elem() call
    to avoid potential run-time memory allocation failure.
    Therefore, the new cgroup storage can provide all functionality w.r.t.
    the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased to
    BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
    be deprecated since the new one can provide the same functionality.
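
    For illustration, a minimal BPF-program sketch of the new map type (the
    section name and counter logic are hypothetical; the map type, flags and
    helper come from this patch, assuming the usual vmlinux.h/bpf_helpers.h
    includes):

      struct {
              __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
              __uint(map_flags, BPF_F_NO_PREALLOC);
              __type(key, int);
              __type(value, long);
      } cgrp_counts SEC(".maps");

      SEC("tp_btf/sys_enter")
      int BPF_PROG(count_per_cgroup)
      {
              struct task_struct *task = bpf_get_current_task_btf();
              long *cnt;

              cnt = bpf_cgrp_storage_get(&cgrp_counts, task->cgroups->dfl_cgrp, 0,
                                         BPF_LOCAL_STORAGE_GET_F_CREATE);
              if (cnt)
                      __sync_fetch_and_add(cnt, 1);
              return 0;
      }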

    Acked-by: David Vernet <void@manifault.com>
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:58 +02:00
Jerome Marchand 5a72475fc0 bpf: Remove prog->active check for bpf_lsm and bpf_iter
Bugzilla: https://bugzilla.redhat.com/2177177

commit 271de525e1d7f564e88a9d212c50998b49a54476
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Oct 25 11:45:16 2022 -0700

    bpf: Remove prog->active check for bpf_lsm and bpf_iter

    The commit 64696c40d03c ("bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline")
    removed prog->active check for struct_ops prog.  The bpf_lsm
    and bpf_iter is also using trampoline.  Like struct_ops, the bpf_lsm
    and bpf_iter have fixed hooks for the prog to attach.  The
    kernel does not call the same hook in a recursive way.
    This patch also removes the prog->active check for
    bpf_lsm and bpf_iter.

    A later patch has a test to reproduce the recursion issue
    for a sleepable bpf_lsm program.

    This patch appends the '_recur' naming to the existing
    enter and exit functions that track the prog->active counter.
    New __bpf_prog_{enter,exit}[_sleepable] function are
    added to skip the prog->active tracking. The '_struct_ops'
    version is also removed.

    It also moves the decision on picking the enter and exit function to
    the new bpf_trampoline_{enter,exit}().  It returns the '_recur' ones
    for all tracing progs to use.  For bpf_lsm, bpf_iter,
    struct_ops (no prog->active tracking after 64696c40d03c), and
    bpf_lsm_cgroup (no prog->active tracking after 69fd337a975c7),
    it will return the functions that don't track the prog->active.

    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20221025184524.3526117-2-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:57 +02:00
Artem Savkov fee78f87aa bpf: Prevent bpf program recursion for raw tracepoint probes
Bugzilla: https://bugzilla.redhat.com/2166911

commit 05b24ff9b2cfabfcfd951daaa915a036ab53c9e1
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Fri Sep 16 09:19:14 2022 +0200

    bpf: Prevent bpf program recursion for raw tracepoint probes
    
    We got a report from syzbot [1] about warnings that were caused by
    bpf program attached to contention_begin raw tracepoint triggering
    the same tracepoint by using bpf_trace_printk helper that takes
    trace_printk_lock lock.
    
     Call Trace:
      <TASK>
      ? trace_event_raw_event_bpf_trace_printk+0x5f/0x90
      bpf_trace_printk+0x2b/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      bpf_trace_printk+0x3f/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      bpf_trace_printk+0x3f/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      bpf_trace_printk+0x3f/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      __unfreeze_partials+0x5b/0x160
      ...
    
    This can be reproduced by attaching a bpf program as a raw tracepoint on the
    contention_begin tracepoint. The bpf prog calls bpf_trace_printk
    helper. Then by running perf bench the spin lock code is forced to
    take slow path and call contention_begin tracepoint.
    
    Fix this by skipping execution of the bpf program if it's
    already running, using the bpf prog 'active' field, which is
    currently used by trampoline programs for the same reason.
    
    Moving bpf_prog_inc_misses_counter to syscall.c because
    trampoline.c is compiled in just for CONFIG_BPF_JIT option.
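
    A sketch of the guard described above, as it would sit in the raw
    tracepoint run path (names follow the text; exact placement may differ):

      if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
              bpf_prog_inc_misses_counter(prog);
              goto out;
      }
      /* ... run the bpf program ... */
  out:
      this_cpu_dec(*(prog->active));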
    
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com
    [1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:39 +01:00
Artem Savkov 42bf3eaa55 bpf: use kvmemdup_bpfptr helper
Bugzilla: https://bugzilla.redhat.com/2166911

commit a02c118ee9e898612cbae42121b9e8663455b515
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Tue Sep 13 16:40:33 2022 +0800

    bpf: use kvmemdup_bpfptr helper
    
    Use kvmemdup_bpfptr helper instead of open-coding to
    simplify the code.
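
    The open-coded kvmalloc() plus copy_from_bpfptr() pair collapses to
    roughly this (a sketch):

      value = kvmemdup_bpfptr(uvalue, value_size);
      if (IS_ERR(value))
              return PTR_ERR(value);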
    
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/1663058433-14089-1-git-send-email-wangyufen@huawei.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:13 +01:00
Artem Savkov 7f2b6b92f7 bpf: Ensure correct locking around vulnerable function find_vpid()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 83c10cc362d91c0d8d25e60779ee52fdbbf3894d
Author: Lee Jones <lee@kernel.org>
Date:   Mon Sep 12 14:38:55 2022 +0100

    bpf: Ensure correct locking around vulnerable function find_vpid()
    
    The documentation for find_vpid() clearly states:
    
      "Must be called with the tasklist_lock or rcu_read_lock() held."
    
    Presently we do neither for find_vpid() instance in bpf_task_fd_query().
    Add proper rcu_read_lock/unlock() to fix the issue.
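
    The fix amounts to bracketing the lookup, roughly:

      rcu_read_lock();
      task = get_pid_task(find_vpid(pid), PIDTYPE_PID);
      rcu_read_unlock();
      if (!task)
              return -ENOENT;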
    
    Fixes: 41bdc4b40e ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY")
    Signed-off-by: Lee Jones <lee@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20220912133855.1218900-1-lee@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:12 +01:00
Artem Savkov 4b73ddbd13 bpf: Support kptrs in percpu arraymap
Bugzilla: https://bugzilla.redhat.com/2166911

commit 6df4ea1ff0ff70798ff1e7eed79f98ccb7b5b0a2
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sun Sep 4 22:41:15 2022 +0200

    bpf: Support kptrs in percpu arraymap
    
    Enable support for kptrs in percpu BPF arraymap by wiring up the freeing
    of these kptrs from percpu map elements.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:10 +01:00
Artem Savkov ade3f4aa53 bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
Bugzilla: https://bugzilla.redhat.com/2166911

commit 8d5a8011b35d387c490a5c977b1d9eb4798aa071
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Fri Sep 2 14:10:51 2022 -0700

    bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
    
    SLAB_TYPESAFE_BY_RCU makes kmem_caches non mergeable and slows down
    kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps
    and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
    solves the memory consumption issue, avoids kmem_cache_destroy latency and
    keeps bpf hash map performance the same.
    
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220902211058.60789-10-alexei.starovoitov@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:07 +01:00
Artem Savkov 5c9c04c9ae bpf: prepare for more bpf syscall to be used from kernel and user space.
Bugzilla: https://bugzilla.redhat.com/2166911

commit b88df6979682333815536a0bf43bd56f9499f071
Author: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Date:   Wed Aug 24 15:40:36 2022 +0200

    bpf: prepare for more bpf syscall to be used from kernel and user space.
    
    Add BPF_MAP_GET_FD_BY_ID and BPF_MAP_DELETE_PROG.
    
    Only BPF_MAP_GET_FD_BY_ID needs to be amended to be able
    to access the bpf pointer either from userspace or from the kernel.
    
    Acked-by: Yonghong Song <yhs@fb.com>
    Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
    Link: https://lore.kernel.org/r/20220824134055.1328882-7-benjamin.tissoires@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:04 +01:00
Artem Savkov ad5fbfae98 bpf: prevent leak of lsm program after failed attach
Bugzilla: https://bugzilla.redhat.com/2137876

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit e89f3edffb860a0f54a9ed16deadb7a4a1fa3862
Author: Milan Landaverde <milan@mdaverde.com>
Date:   Tue Dec 13 12:57:14 2022 -0500

    bpf: prevent leak of lsm program after failed attach

    In [0], we added the ability to bpf_prog_attach LSM programs to cgroups,
    but in our validation to make sure the prog is meant to be attached to
    BPF_LSM_CGROUP, we return too early if the check fails. This results in
    the prog's refcnt not being decremented (through bpf_prog_put),
    leaving the LSM program alive past the point of its expected lifecycle.
    This fix allows for the decrement to take place.

    [0] https://lore.kernel.org/all/20220628174314.1216643-4-sdf@google.com/

    Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
    Signed-off-by: Milan Landaverde <milan@mdaverde.com>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20221213175714.31963-1-milan@mdaverde.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:50:16 +01:00
Artem Savkov 1ec0942151 bpf: Restrict bpf_sys_bpf to CAP_PERFMON
Bugzilla: https://bugzilla.redhat.com/2137876

commit 14b20b784f59bdd95f6f1cfb112c9818bcec4d84
Author: YiFei Zhu <zhuyifei@google.com>
Date:   Tue Aug 16 20:55:16 2022 +0000

    bpf: Restrict bpf_sys_bpf to CAP_PERFMON
    
    The verifier cannot perform sufficient validation of any pointers passed
    into bpf_attr and treats them as integers rather than pointers. The helper
    will then read from arbitrary pointers passed into it. Restrict the helper
    to CAP_PERFMON since the security model in BPF of arbitrary kernel read is
    CAP_BPF + CAP_PERFMON.
    
    Fixes: af2ac3e13e ("bpf: Prepare bpf syscall to be used from kernel and user space.")
    Signed-off-by: YiFei Zhu <zhuyifei@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220816205517.682470-1-zhuyifei@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:48 +01:00
Artem Savkov 74fbc8786e bpf: Shut up kern_sys_bpf warning.
Bugzilla: https://bugzilla.redhat.com/2137876

commit 4e4588f1c4d2e67c993208f0550ef3fae33abce4
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Aug 10 23:52:28 2022 -0700

    bpf: Shut up kern_sys_bpf warning.
    
    Shut up this warning:
    kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
    int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
    
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:47 +01:00
Artem Savkov 602a48c545 bpf: Use proper target btf when exporting attach_btf_obj_id
Bugzilla: https://bugzilla.redhat.com/2137876

commit 6644aabbd8973a9f8008cabfd054a36b69a3a3f5
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Aug 4 13:11:39 2022 -0700

    bpf: Use proper target btf when exporting attach_btf_obj_id
    
    When attaching to a program, the program itself might not be attached
    to anything (and, hence, might not have attach_btf), so we can't
    unconditionally use 'prog->aux->dst_prog->aux->attach_btf'.
    
    Instead, use bpf_prog_get_target_btf to pick proper target BTF:
    
      * when attached to dst_prog, use dst_prog->aux->btf
      * when attached to kernel btf, use prog->aux->attach_btf
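
    A simplified sketch of what that helper boils down to (illustrative, not a
    verbatim copy of the kernel source):

        static inline struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog)
        {
                /* attached to another program: use that program's BTF;
                 * otherwise use the kernel/module BTF recorded at load time
                 */
                return prog->aux->dst_prog ? prog->aux->dst_prog->aux->btf
                                           : prog->aux->attach_btf;
        }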
    
    Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP")
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Hao Luo <haoluo@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220804201140.1340684-1-sdf@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:46 +01:00
Artem Savkov 97e5df6f0b bpf: reparent bpf maps on memcg offlining
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: conflict with rhel-only commit 1f50357d24 "mm/memcg:
Exclude mem_cgroup pointer from kABI signature computation". mem_cgroup
is replaced by obj_cgroup, excluding it the same way as previous struct
was excluded.

commit 4201d9ab3e42d9e2a20320b751a931e6239c0df2
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Mon Jul 11 09:28:27 2022 -0700

    bpf: reparent bpf maps on memcg offlining

    The memory consumed by a bpf map is always accounted to the memory
    cgroup of the process which created the map. The map can outlive
    the memory cgroup if it's used by processes in other cgroups or
    is pinned on bpffs. In this case the map pins the original cgroup
    in the dying state.

    For other types of objects (slab objects, non-slab kernel allocations,
    percpu objects and recently LRU pages) there is a reparenting process
    implemented: on cgroup offlining charged objects are getting
    reassigned to the parent cgroup. Because all charges and statistics
    are fully recursive it's a fairly cheap operation.

    For efficiency and consistency with other types of objects, let's do
    the same for bpf maps. Fortunately thanks to the objcg API, the
    required changes are minimal.

    Please note that individual allocations (slabs, percpu and large
    kmallocs) already have the reparenting mechanism. This commit adds
    it to the saved map->memcg pointer by replacing it with map->objcg.
    Because dying cgroups are not visible for a user and all charges are
    recursive, this commit doesn't bring any behavior changes for a user.

    v2:
      added a missing const qualifier

    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Link: https://lore.kernel.org/r/20220711162827.184743-1-roman.gushchin@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:38 +01:00
Artem Savkov 0e3c3201fc bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
Bugzilla: https://bugzilla.redhat.com/2137876

commit b79c9fc9551b45953a94abf550b7bd3b00e3a0f9
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:08 2022 -0700

    bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
    
    We have two options:
    1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
    2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point
    
    I was doing (2) in the original patch, but switching to (1) here:
    
    * bpf_prog_query returns all attached BPF_LSM_CGROUP programs
    regardless of attach_btf_id
    * attach_btf_id is exported via bpf_prog_info
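
    A minimal userspace sketch of such a query (buffer size and error handling
    are illustrative):

        __u32 prog_ids[16], cnt = 16;

        /* returns every BPF_LSM_CGROUP program attached to this cgroup,
         * regardless of which LSM hook (attach_btf_id) each one targets
         */
        if (!bpf_prog_query(cgroup_fd, BPF_LSM_CGROUP, 0, NULL, prog_ids, &cnt))
                printf("%u lsm_cgroup programs attached\n", cnt);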
    
    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 9a33161b25 bpf: per-cgroup lsm flavor
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: already applied 65d9ecfe0ca73 "bpf: Fix ref_obj_id for dynptr
data slices in verifier"

commit 69fd337a975c7e690dfe49d9cb4fe5ba1e6db44e
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:06 2022 -0700

    bpf: per-cgroup lsm flavor

    Allow attaching to lsm hooks in the cgroup context.

    Attaching to per-cgroup LSM works exactly like attaching
    to other per-cgroup hooks. A new BPF_LSM_CGROUP attach type is added
    to trigger the new mode; the actual lsm hook we attach to is
    signaled via the existing attach_btf_id.

    For the hooks that have 'struct socket' or 'struct sock' as their first
    argument, we use the cgroup associated with that socket. For the rest,
    we use 'current' cgroup (this is all on default hierarchy == v2 only).
    Note that for some hooks that work on 'struct sock' we still
    take the cgroup from 'current' because some of them work on the socket
    that hasn't been properly initialized yet.
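
    A hedged sketch of what such a program looks like on the BPF side (hook name
    and return policy are illustrative; the "lsm_cgroup/" section prefix is the
    libbpf-side convention for this attach type):

        SEC("lsm_cgroup/socket_bind")
        int BPF_PROG(allow_bind, struct socket *sock, struct sockaddr *address,
                     int addrlen)
        {
                return 1;       /* 1 = allow, 0 = reject for this attach type */
        }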

    Behind the scenes, we allocate a shim program that is attached
    to the trampoline and runs cgroup effective BPF programs array.
    This shim has some rudimentary ref counting and can be shared
    between several programs attaching to the same lsm hook from
    different cgroups.

    Note that this patch bloats cgroup size because we add 211
    cgroup_bpf_attach_type(s) for simplicity's sake. This will be
    addressed in the subsequent patch.

    Also note that we only add non-sleepable flavor for now. To enable
    sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
    shim programs have to be freed via trace rcu, cgroup_bpf.effective
    should be also trace-rcu-managed + maybe some other changes that
    I'm not aware of.

    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 00639ab4bd bpf: Unify data extension operation of jited_ksyms and jited_linfo
Bugzilla: https://bugzilla.redhat.com/2137876

commit 2cd008522707a59bf38c1f45d5c654eddbb86c20
Author: Pu Lehui <pulehui@huawei.com>
Date:   Mon May 30 17:28:10 2022 +0800

    bpf: Unify data extension operation of jited_ksyms and jited_linfo
    
    We found that a 32-bit environment cannot print BPF line info due to a data
    inconsistency between jited_ksyms[0] and jited_linfo[0].
    
    For example:
    
      jited_ksyms[0] = 0xb800067c, jited_linfo[0] = 0xffffffffb800067c

    We know that both of them store the BPF func address, but due to the different
    data extension operations when extending to u64, they may not be the same.
    We need to unify their data extension operations.
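
    A minimal illustration of the mismatch (not kernel code): on a 32-bit kernel,
    widening the same 32-bit address to u64 through a signed vs. an unsigned
    intermediate type produces different values.

        u32 addr = 0xb800067c;
        u64 sign_extended = (u64)(long)addr;          /* 0xffffffffb800067c on 32-bit */
        u64 zero_extended = (u64)(unsigned long)addr; /* 0x00000000b800067c */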
    
    Signed-off-by: Pu Lehui <pulehui@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/CAEf4BzZ-eDcdJZgJ+Np7Y=V-TVjDDvOMqPwzKjyWrh=i5juv4w@mail.gmail.com
    Link: https://lore.kernel.org/bpf/20220530092815.1112406-2-pulehui@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:28 +01:00
Yauheni Kaliuta 20067c525d bpf: Fix non-static bpf_func_proto struct definitions
Bugzilla: http://bugzilla.redhat.com/2120968

commit dc368e1c658e4f478a45e8d1d5b0c8392ca87506
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Thu Jun 16 15:54:07 2022 -0700

    bpf: Fix non-static bpf_func_proto struct definitions

    This patch does two things:

    1) Marks the dynptr bpf_func_proto structs that were added in [1]
       as static, as pointed out by the kernel test robot in [2].

    2) There are some bpf_func_proto structs marked as extern which can
       instead be statically defined.

      [1] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
      [2] https://lore.kernel.org/bpf/62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:09 +02:00
Yauheni Kaliuta ff04690df8 bpf: Fix resetting logic for unreferenced kptrs
Bugzilla: https://bugzilla.redhat.com/2120968

commit 9fad7fe5b29803584c7f17a2abe6c2936fec6828
Author: Jules Irenge <jbi.octave@gmail.com>
Date:   Wed Sep 7 16:24:20 2022 +0100

    bpf: Fix resetting logic for unreferenced kptrs
    
    Sparse reported a warning at bpf_map_free_kptrs()
    "warning: Using plain integer as NULL pointer"
    During the process of fixing this warning, it was discovered that the current
    code erroneously writes to the pointer variable instead of dereferencing and
    writing to the actual kptr. Hence, the Sparse tool accidentally helped to uncover
    this problem. Fix this by doing WRITE_ONCE(*p, 0) instead of WRITE_ONCE(p, 0).
    
    Note that the effect of this bug is that unreferenced kptrs will not be cleared
    during check_and_free_fields. It is not a problem if the clearing is not done
    during map_free stage, as there is nothing to free for them.
    
    Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
    Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
    Link: https://lore.kernel.org/r/Yxi3pJaK6UDjVJSy@playground
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:08 +02:00
Yauheni Kaliuta d0d03e9325 bpf: refine kernel.unprivileged_bpf_disabled behaviour
Bugzilla: https://bugzilla.redhat.com/2120968

commit c8644cd0efe719608ddcb341bcf087d4bc0bf6b8
Author: Alan Maguire <alan.maguire@oracle.com>
Date:   Thu May 19 15:25:33 2022 +0100

    bpf: refine kernel.unprivileged_bpf_disabled behaviour
    
    With unprivileged BPF disabled, all cmds associated with the BPF syscall
    are blocked to users without CAP_BPF/CAP_SYS_ADMIN.  However there are
    use cases where we may wish to allow interactions with BPF programs
    without being able to load and attach them.  So for example, a process
    with required capabilities loads/attaches a BPF program, and a process
    with fewer capabilities interacts with it; retrieving perf/ring buffer
    events, modifying map-specified config etc.  With all BPF syscall
    commands blocked as a result of unprivileged BPF being disabled,
    this mode of interaction becomes impossible for processes without
    CAP_BPF.
    
    As Alexei notes
    
    "The bpf ACL model is the same as traditional file's ACL.
    The creds and ACLs are checked at open().  Then during file's write/read
    additional checks might be performed. BPF has such functionality already.
    Different map_creates have capability checks while map_lookup has:
    map_get_sys_perms(map, f) & FMODE_CAN_READ.
    In other words it's enough to gate FD-receiving parts of bpf
    with unprivileged_bpf_disabled sysctl.
    The rest is handled by availability of FD and access to files in bpffs."
    
    So key fd creation syscall commands BPF_PROG_LOAD and BPF_MAP_CREATE
    are blocked with unprivileged BPF disabled and no CAP_BPF.
    
    And as Alexei notes, with unprivileged BPF enabled (unprivileged_bpf_disabled
    off), map creation is blocked for all map types aside from array, hash and
    ringbuf maps.
    
    Programs responsible for loading and attaching the BPF program
    can still control access to its pinned representation by restricting
    permissions on the pin path, as with normal files.
    
    Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
    Acked-by: KP Singh <kpsingh@kernel.org>
    Link: https://lore.kernel.org/r/1652970334-30510-2-git-send-email-alan.maguire@oracle.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
Yauheni Kaliuta 1c3a7dd065 bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm.
Bugzilla: https://bugzilla.redhat.com/2120968

commit 2fcc82411e74e5e6aba336561cf56fb899bfae4e
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Tue May 10 13:59:21 2022 -0700

    bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm.
    
    Pass a cookie along with BPF_LINK_CREATE requests.
    
    Add a bpf_cookie field to struct bpf_tracing_link to attach a cookie.
    The cookie of a bpf_tracing_link is available by calling
    bpf_get_attach_cookie when running the BPF program of the attached
    link.
    
    The value of a cookie will be set at bpf_tramp_run_ctx by the
    trampoline of the link.
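
    A hedged sketch of the resulting flow (cookie value and attach type are
    illustrative):

        /* user space: create the tracing link and hand it a cookie */
        LIBBPF_OPTS(bpf_link_create_opts, opts, .tracing.cookie = 0x1234);
        int link_fd = bpf_link_create(prog_fd, 0, BPF_TRACE_FENTRY, &opts);

        /* BPF side: the attached program reads the cookie back */
        __u64 cookie = bpf_get_attach_cookie(ctx);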
    
    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220510205923.3206889-4-kuifeng@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:03 +02:00
Yauheni Kaliuta 2755884830 bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack
Bugzilla: https://bugzilla.redhat.com/2120968

commit e384c7b7b46d0a5f4bf3c554f963e6e9622d0ab1
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Tue May 10 13:59:20 2022 -0700

    bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack

    BPF trampolines will create a bpf_tramp_run_ctx, a bpf_run_ctx, on
    stacks and set/reset the current bpf_run_ctx before/after calling a
    bpf_prog.

    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220510205923.3206889-3-kuifeng@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:03 +02:00
Yauheni Kaliuta 503bec2387 bpf, x86: Generate trampolines from bpf_tramp_links
Bugzilla: https://bugzilla.redhat.com/2120968
Conflicts: already applied
  1d5f82d9dd47 ("bpf, x86: fix freeing of not-finalized bpf_prog_pack")

commit f7e0beaf39d3868dc700d4954b26cf8443c5d423
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Tue May 10 13:59:19 2022 -0700

    bpf, x86: Generate trampolines from bpf_tramp_links

    Replace struct bpf_tramp_progs with struct bpf_tramp_links to collect
    struct bpf_tramp_link(s) for a trampoline.  struct bpf_tramp_link
    extends bpf_link to act as a linked list node.

    arch_prepare_bpf_trampoline() accepts a struct bpf_tramp_links to
    collects all bpf_tramp_link(s) that a trampoline should call.

    Change BPF trampoline and bpf_struct_ops to pass bpf_tramp_links
    instead of bpf_tramp_progs.

    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220510205923.3206889-2-kuifeng@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:03 +02:00
Yauheni Kaliuta 1bcd6c0f58 bpf: Add bpf_link iterator
Bugzilla: https://bugzilla.redhat.com/2120968

commit 9f88361273082825d9f0d13a543d49f9fa0d44a8
Author: Dmitrii Dolgov <9erthalion6@gmail.com>
Date:   Tue May 10 17:52:30 2022 +0200

    bpf: Add bpf_link iterator
    
    Implement bpf_link iterator to traverse links via bpf_seq_file
    operations. The changeset is mostly shamelessly copied from
    commit a228a64fc1 ("bpf: Add bpf_prog iterator")
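
    A small iterator program sketch (section and context field names assumed from
    the existing bpf_iter infrastructure):

        SEC("iter/bpf_link")
        int dump_bpf_link(struct bpf_iter__bpf_link *ctx)
        {
                struct bpf_link *link = ctx->link;

                if (!link)
                        return 0;
                BPF_SEQ_PRINTF(ctx->meta->seq, "link id %u type %d\n",
                               link->id, link->type);
                return 0;
        }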
    
    Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20220510155233.9815-2-9erthalion6@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:02 +02:00
Yauheni Kaliuta 12c4199b33 bpf: Wire up freeing of referenced kptr
Bugzilla: https://bugzilla.redhat.com/2120968

commit 14a324f6a67ef6a53e04362a70160a47eb8afffa
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Mon Apr 25 03:18:55 2022 +0530

    bpf: Wire up freeing of referenced kptr
    
    A destructor kfunc can be defined as void func(type *), where type may
    be void or any other pointer type as per convenience.
    
    In this patch, we ensure that the type is sane and capture the function
    pointer into off_desc of ptr_off_tab for the specific pointer offset,
    with the invariant that the dtor pointer is always set when 'kptr_ref'
    tag is applied to the pointer's pointee type, which is indicated by the
    flag BPF_MAP_VALUE_OFF_F_REF.
    
    Note that only BTF IDs whose destructor kfunc is registered become allowed
    BTF IDs for embedding as a referenced kptr. Hence the lookup serves both to
    find the dtor kfunc BTF ID and to act as a check against the whitelist of
    allowed BTF IDs for this purpose.

    Finally, wire up the actual freeing of the referenced pointer, if any, at
    all available offsets, so that no references are leaked after the BPF map
    goes away when a BPF program previously moved ownership of a referenced
    pointer into it.
    
    The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
    will free any existing referenced kptr. The same case is with LRU map's
    bpf_lru_push_free/htab_lru_push_free functions, which are extended to
    reset unreferenced and free referenced kptr.
    
    Note that unlike BPF timers, kptr is not reset or freed when map uref
    drops to zero.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:46:59 +02:00
Yauheni Kaliuta e4f793f22d bpf: Adapt copy_map_value for multiple offset case
Bugzilla: https://bugzilla.redhat.com/2120968

Omitted-fix: 1f6e04a1c7b8 ("bpf: Fix offset calculation error in __copy_map_value and zero_map_value")
  The patch has a bunch of dependencies and crashes if implemented
  against current codebase.

commit 4d7d7f69f4b104b2ddeec6a1e7fcfd2d044ed8c4
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Mon Apr 25 03:18:53 2022 +0530

    bpf: Adapt copy_map_value for multiple offset case

    Since now there might be at most 10 offsets that need handling in
    copy_map_value, the manual shuffling and special case is no longer going
    to work. Hence, let's generalise the copy_map_value function by using
    a sorted array of offsets to skip regions that must be avoided while
    copying into and out of a map value.

    When the map is created, we populate the offset array in struct bpf_map.
    Then, copy_map_value uses this sorted offset array to memcpy while skipping
    the timer, spin lock, and kptr fields. The array is allocated separately
    because in most cases none of these special fields are present in the map
    value, so we save space in the common case by not embedding the entire
    object inside the bpf_map struct.
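
    Conceptually the generalised copy is the following loop (an illustrative
    sketch, not the exact kernel implementation):

        /* off[]/sz[] hold the sorted special-field offsets and sizes */
        u32 cur = 0;

        for (i = 0; i < nr_fields; i++) {
                memcpy(dst + cur, src + cur, off[i] - cur); /* copy up to the field */
                cur = off[i] + sz[i];                       /* then skip over it    */
        }
        memcpy(dst + cur, src + cur, value_size - cur);     /* copy the tail        */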

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:45:51 +02:00
Yauheni Kaliuta 9b0b6285f7 bpf: Allow storing unreferenced kptr in map
Bugzilla: https://bugzilla.redhat.com/2120968

commit 61df10c7799e27807ad5e459eec9d77cddf8bf45
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Mon Apr 25 03:18:49 2022 +0530

    bpf: Allow storing unreferenced kptr in map
    
    This commit introduces a new pointer type 'kptr' which can be embedded
    in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during
    its invocation. When storing such a kptr, BPF program's PTR_TO_BTF_ID
    register must have the same type as in the map value's BTF, and loading
    a kptr marks the destination register as PTR_TO_BTF_ID with the correct
    kernel BTF and BTF ID.
    
    Such kptrs are unreferenced, i.e. by the time another invocation of the
    BPF program loads this pointer, the object which the pointer points to
    may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
    patched to PROBE_MEM loads by the verifier, it is safe to allow the user
    to still access such an invalid pointer, but passing such pointers into
    BPF helpers and kfuncs should not be permitted. A future patch in this
    series will close this gap.

    The flexibility offered by allowing programs to dereference such invalid
    pointers while being safe at runtime frees the verifier from doing
    complex lifetime tracking. As long as the user ensures that the object
    remains valid, the data it reads from the kernel object will be valid.
    
    The user indicates that a certain pointer must be treated as kptr
    capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
    a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
    information is recorded in the object BTF which will be passed into the
    kernel by way of map's BTF information. The name and kind from the map
    value BTF is used to look up the in-kernel type, and the actual BTF and
    BTF ID is recorded in the map struct in a new kptr_off_tab member. For
    now, only storing pointers to structs is permitted.
    
    An example of this specification is shown below:
    
    	#define __kptr __attribute__((btf_type_tag("kptr")))
    
    	struct map_value {
    		...
    		struct task_struct __kptr *task;
    		...
    	};
    
    Then, in a BPF program, the user may store a PTR_TO_BTF_ID with the type
    task_struct into the map, and then load it later.
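
    A hedged BPF-side sketch of that store/load round trip (the map name and the
    helper used to obtain the pointer are illustrative):

        struct map_value *v;
        struct task_struct *task;

        v = bpf_map_lookup_elem(&some_map, &key);
        if (!v)
                return 0;
        v->task = bpf_get_current_task_btf();   /* BPF_STX of a PTR_TO_BTF_ID */
        task = v->task;                         /* marked PTR_TO_BTF_ID_OR_NULL */
        if (task)
                bpf_printk("stored task pid %d", task->pid);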
    
    Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
    the verifier cannot know whether the value is NULL or not statically, it
    must treat all potential loads at that map value offset as loading a
    possibly NULL pointer.
    
    Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
    are allowed instructions that can access such a pointer. On BPF_LDX, the
    destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
    it is checked whether the source register type is a PTR_TO_BTF_ID with
    same BTF type as specified in the map BTF. The access size must always
    be BPF_DW.
    
    For the map in map support, the kptr_off_tab for outer map is copied
    from the inner map's kptr_off_tab. It was chosen to do a deep copy
    instead of introducing a refcount to kptr_off_tab, because the copy only
    needs to be done when parameterizing using inner_map_fd in the map in map
    case, hence it would be unnecessary for all other users.
    
    It is not permitted to use MAP_FREEZE command and mmap for BPF map
    having kptrs, similar to the bpf_timer case. A kptr also requires that
    BPF program has both read and write access to the map (hence both
    BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG are disallowed).
    
    Note that check_map_access must be called from both
    check_helper_mem_access and for the BPF instructions, hence the kptr
    check must distinguish between ACCESS_DIRECT and ACCESS_HELPER, and
    reject ACCESS_HELPER cases. We rename stack_access_src to bpf_access_src
    and reuse it for this purpose.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220424214901.2743946-2-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:52:11 +02:00
Yauheni Kaliuta 73afe5573f bpf: Allow attach TRACING programs through LINK_CREATE command
Bugzilla: https://bugzilla.redhat.com/2120968

commit df86ca0d2f0fa6be525a25b0b3d836d361f85754
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Wed Apr 20 20:39:43 2022 -0700

    bpf: Allow attach TRACING programs through LINK_CREATE command
    
    Allow attaching BTF-aware TRACING programs, previously attachable only
    through BPF_RAW_TRACEPOINT_OPEN command, through LINK_CREATE command:
    
      - BTF-aware raw tracepoints (tp_btf in libbpf lingo);
      - fentry/fexit/fmod_ret programs;
      - BPF LSM programs.
    
    This change converges all bpf_link-based attachments under LINK_CREATE
    command allowing to further extend the API with features like BPF cookie
    under "multiplexed" link_create section of bpf_attr.
    
    Non-BTF-aware raw tracepoints are left under BPF_RAW_TRACEPOINT_OPEN,
    but there is nothing preventing opening them up to LINK_CREATE as well.
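
    With this in place, an already loaded tracing program can be attached through
    the generic link API, e.g. (a hedged sketch using the libbpf wrapper):

        /* prog_fd: a loaded program with expected_attach_type BPF_TRACE_FENTRY */
        int link_fd = bpf_link_create(prog_fd, 0, BPF_TRACE_FENTRY, NULL);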
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Kuifeng Lee <kuifeng@fb.com>
    Link: https://lore.kernel.org/bpf/20220421033945.3602803-2-andrii@kernel.org

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:52:11 +02:00
Yauheni Kaliuta 473f280b74 bpf: Move BPF sysctls from kernel/sysctl.c to BPF core
Bugzilla: https://bugzilla.redhat.com/2120968
Conflicts:
 - applied RHEL-only 786ca1cae5 ("bpf: Fix unprivileged_bpf_disabled setup")
 - headers due to applied a467257ffe4b ("kernel/kexec_core: move kexec_core sysctls into its own file")

commit 2900005ea287b11dcc8c1b9fcf24893b7ff41d6d
Author: Yan Zhu <zhuyan34@huawei.com>
Date:   Thu Apr 7 15:07:59 2022 +0800

    bpf: Move BPF sysctls from kernel/sysctl.c to BPF core

    We're moving sysctls out of kernel/sysctl.c as it is a mess. We
    already moved all filesystem sysctls out. And with time the goal
    is to move all sysctls out to their own subsystem/actual user.

    kernel/sysctl.c has grown into an insane mess and it's easy to run
    into conflicts with it. The effort to move them out into various
    subsystems is part of this.

    Signed-off-by: Yan Zhu <zhuyan34@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Link: https://lore.kernel.org/bpf/20220407070759.29506-1-zhuyan34@huawei.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:51:28 +02:00
Jerome Marchand 84681bc21d bpf: Disallow bpf programs call prog_run command.
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
Code change from missing commit e384c7b7b46d ("bpf, x86: Create
bpf_tramp_run_ctx on the caller thread's stack")

commit 86f44fcec22ce2979507742bc53db8400e454f46
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Mon Aug 8 20:58:09 2022 -0700

    bpf: Disallow bpf programs call prog_run command.

    The verifier cannot perform sufficient validation of bpf_attr->test.ctx_in
    pointer, therefore bpf programs should not be allowed to call BPF_PROG_RUN
    command from within the program.
    To fix this issue split bpf_sys_bpf() bpf helper into normal kern_sys_bpf()
    kernel function that can only be used by the kernel light skeleton directly.

    Reported-by: YiFei Zhu <zhuyifei@google.com>
    Fixes: b1d18a7574d0 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:10 +02:00
Jerome Marchand 683541ee89 bpf: Add cookie support to programs attached with kprobe multi link
Bugzilla: https://bugzilla.redhat.com/2120966

commit ca74823c6e16dd42b7cf60d9fdde80e2a81a67bb
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Mar 16 13:24:12 2022 +0100

    bpf: Add cookie support to programs attached with kprobe multi link

    Adding support to call bpf_get_attach_cookie helper from
    kprobe programs attached with kprobe multi link.

    The cookie is provided by an array of u64 values, where each
    value is paired with the provided function address or symbol
    at the same array index.

    When the cookie array is provided it is sorted together with the
    addresses (check bpf_kprobe_multi_cookie_swap). This way
    we can find the cookie based on the address in the
    bpf_get_attach_cookie helper.
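
    A hedged libbpf-side sketch of supplying cookies (symbols and cookie values
    are illustrative):

        const char *syms[] = { "vfs_read", "vfs_write" };
        __u64 cookies[] = { 1, 2 };
        LIBBPF_OPTS(bpf_kprobe_multi_opts, opts,
                    .syms = syms,
                    .cookies = cookies,
                    .cnt = 2);
        struct bpf_link *link;

        link = bpf_program__attach_kprobe_multi_opts(prog, NULL, &opts);
        /* inside the program, bpf_get_attach_cookie(ctx) then returns 1 or 2
         * depending on which function fired
         */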

    Suggested-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220316122419.933957-7-jolsa@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:04 +02:00
Jerome Marchand 0c66788f0e bpf: Add multi kprobe link
Bugzilla: https://bugzilla.redhat.com/2120966

commit 0dcac272540613d41c05e89679e4ddb978b612f1
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Mar 16 13:24:09 2022 +0100

    bpf: Add multi kprobe link

    Adding new link type BPF_LINK_TYPE_KPROBE_MULTI that attaches kprobe
    program through fprobe API.

    The fprobe API allows attaching a probe to multiple functions at once
    very quickly, because it works on top of ftrace. On the other hand, this
    limits the probe point to the function entry or return.

    The kprobe program gets the same pt_regs input ctx as when it's attached
    through the perf API.

    Adding new attach type BPF_TRACE_KPROBE_MULTI that allows attaching the
    kprobe program to multiple functions with the new link.

    The user provides an array of addresses or symbols, plus a count, to attach
    the kprobe program to. The new link_create uapi interface looks like:

      struct {
              __u32           flags;
              __u32           cnt;
              __aligned_u64   syms;
              __aligned_u64   addrs;
      } kprobe_multi;

    The flags field allows a single bit (BPF_F_KPROBE_MULTI_RETURN) to be set to
    create a return (kretprobe-style) multi kprobe.

    Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220316122419.933957-4-jolsa@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:04 +02:00
Jerome Marchand c8f828883c bpf: Add "live packet" mode for XDP in BPF_PROG_RUN
Bugzilla: https://bugzilla.redhat.com/2120966

commit b530e9e1063ed2b817eae7eec6ed2daa8be11608
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Wed Mar 9 11:53:42 2022 +0100

    bpf: Add "live packet" mode for XDP in BPF_PROG_RUN

    This adds support for running XDP programs through BPF_PROG_RUN in a mode
    that enables live packet processing of the resulting frames. Previous uses
    of BPF_PROG_RUN for XDP returned the XDP program return code and the
    modified packet data to userspace, which is useful for unit testing of XDP
    programs.

    The existing BPF_PROG_RUN for XDP allows userspace to set the ingress
    ifindex and RXQ number as part of the context object being passed to the
    kernel. This patch reuses that code, but adds a new mode with different
    semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
    flag.

    When running BPF_PROG_RUN in this mode, the XDP program return codes will
    be honoured: returning XDP_PASS will result in the frame being injected
    into the networking stack as if it came from the selected networking
    interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
    being transmitted out that interface. XDP_TX is translated into an
    XDP_REDIRECT operation to the same interface, since the real XDP_TX action
    is only possible from within the network drivers themselves, not from the
    process context where BPF_PROG_RUN is executed.

    Internally, this new mode of operation creates a page pool instance while
    setting up the test run, and feeds pages from that into the XDP program.
    The setup cost of this is amortised over the number of repetitions
    specified by userspace.

    To support the performance testing use case, we further optimise the setup
    step so that all pages in the pool are pre-initialised with the packet
    data, and pre-computed context and xdp_frame objects stored at the start of
    each page. This makes it possible to entirely avoid touching the page
    content on each XDP program invocation, and enables sending up to 9
    Mpps/core on my test box.

    Because the data pages are recycled by the page pool, and the test runner
    doesn't re-initialise them for each run, subsequent invocations of the XDP
    program will see the packet data in the state it was after the last time it
    ran on that particular page. This means that an XDP program that modifies
    the packet before redirecting it has to be careful about which assumptions
    it makes about the packet content, but that is only an issue for the most
    naively written programs.

    Enabling the new flag is only allowed when not setting ctx_out and data_out
    in the test specification, since using it means frames will be redirected
    somewhere else, so they can't be returned.
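
    A hedged userspace sketch of selecting the new mode (packet buffer and repeat
    count are illustrative):

        LIBBPF_OPTS(bpf_test_run_opts, opts,
                    .data_in = &pkt,
                    .data_size_in = sizeof(pkt),
                    .repeat = 1 << 20,
                    .flags = BPF_F_TEST_XDP_LIVE_FRAMES);
        /* ctx_out/data_out must stay unset in this mode */
        err = bpf_prog_test_run_opts(prog_fd, &opts);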

    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:58 +02:00
Jerome Marchand c5056baccd bpf: Cleanup comments
Bugzilla: https://bugzilla.redhat.com/2120966

commit c561d11063009323a0e57c528cb1d77b7d2c41e0
Author: Tom Rix <trix@redhat.com>
Date:   Sun Feb 20 10:40:55 2022 -0800

    bpf: Cleanup comments

    Add leading space to spdx tag
    Use // for spdx c file comment

    Replacements
    resereved to reserved
    inbetween to in between
    everytime to every time
    intutivie to intuitive
    currenct to current
    encontered to encountered
    referenceing to referencing
    upto to up to
    exectuted to executed

    Signed-off-by: Tom Rix <trix@redhat.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20220220184055.3608317-1-trix@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:51 +02:00
Jerome Marchand 20d1401ee1 bpf: Call maybe_wait_bpf_programs() only once from generic_map_delete_batch()
Bugzilla: https://bugzilla.redhat.com/2120966

commit 9087c6ff8dfe0a070e4e05a434399080603c29de
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 18 10:18:01 2022 -0800

    bpf: Call maybe_wait_bpf_programs() only once from generic_map_delete_batch()

    As stated in the comment found in maybe_wait_bpf_programs(),
    the synchronize_rcu() barrier is only needed before returning
    to userspace, not after each deletion in the batch.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20220218181801.2971275-1-eric.dumazet@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:50 +02:00
Jerome Marchand e8ea7c6063 bpf: Convert bpf_preload.ko to use light skeleton.
Bugzilla: https://bugzilla.redhat.com/2120966

commit cb80ddc67152e72f28ff6ea8517acdf875d7381d
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Feb 9 15:20:01 2022 -0800

    bpf: Convert bpf_preload.ko to use light skeleton.

    The main change is a move of the single line
      #include "iterators.lskel.h"
    from iterators/iterators.c to bpf_preload_kern.c.
    This means that the generated light skeleton can be used from user space or
    a user mode driver like iterators.c, or from a kernel module or the kernel itself.
    The direct use of the light skeleton from the kernel module simplifies the code,
    since UMD is no longer necessary. The libbpf.a approach required user space and UMD.
    The CO-RE support in the kernel and the generated "loader bpf program" used by the
    light skeleton are capable of performing complex loading operations traditionally
    provided by libbpf. In addition, the UMD approach launched a UMD process
    every time bpffs had to be mounted. With the light skeleton in the kernel,
    the bpf_preload kernel module loads bpf iterators once and pins them
    multiple times into different bpffs mounts.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220209232001.27490-6-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:49 +02:00
Jerome Marchand 852ab0920c bpf: Extend sys_bpf commands for bpf_syscall programs.
Bugzilla: https://bugzilla.redhat.com/2120966

commit b1d18a7574d0df5eb4117c14742baf8bc2b9bb74
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Feb 9 15:19:57 2022 -0800

    bpf: Extend sys_bpf commands for bpf_syscall programs.

    bpf_syscall programs can be used directly by kernel modules
    to load programs and create maps via kernel skeleton.
    . Export bpf_sys_bpf syscall wrapper to be used in kernel skeleton.
    . Export bpf_map_get to be used in kernel skeleton.
    . Allow prog_run cmd for bpf_syscall programs with recursion check.
    . Enable link_create and raw_tp_open cmds.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220209232001.27490-2-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:49 +02:00
Jiri Benc d1647a95d0 bpf: generalise tail call map compatibility check
Bugzilla: https://bugzilla.redhat.com/2120966

commit f45d5b6ce2e835834c94b8b700787984f02cd662
Author: Toke Hoiland-Jorgensen <toke@redhat.com>
Date:   Fri Jan 21 11:10:02 2022 +0100

    bpf: generalise tail call map compatibility check

    The check for tail call map compatibility ensures that tail calls only
    happen between maps of the same type. To ensure backwards compatibility for
    XDP frags we need a similar type of check for cpumap and devmap
    programs, so move the state from bpf_array_aux into bpf_map, add
    xdp_has_frags to the check, and apply the same check to cpumap and devmap.

    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Toke Hoiland-Jorgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/f19fd97c0328a39927f3ad03e1ca6b43fd53cdfd.1642758637.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:42 +02:00
Jerome Marchand aa96201939 bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program
Bugzilla: https://bugzilla.redhat.com/2120966

commit c2f2cdbeffda7b153c19e0f3d73149c41026c0db
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date:   Fri Jan 21 11:09:52 2022 +0100

    bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program

    Introduce BPF_F_XDP_HAS_FRAGS and the related field in bpf_prog_aux
    in order to notify the driver that the loaded program supports xdp frags.
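
    At load time this is just another bit in prog_flags, e.g. (a minimal sketch
    using the raw syscall interface):

        union bpf_attr attr = {};
        ...
        attr.prog_flags |= BPF_F_XDP_HAS_FRAGS;   /* declare xdp frags support */
        prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));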

    Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Link: https://lore.kernel.org/r/db2e8075b7032a356003f407d1b0deb99adaa0ed.1642758637.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:41 +02:00
Jerome Marchand 1a72cc1321 bpf: support BPF_PROG_QUERY for progs attached to sockmap
Bugzilla: https://bugzilla.redhat.com/2120966

commit 748cd5729ac7421091316e32dcdffb0578563880
Author: Di Zhu <zhudi2@huawei.com>
Date:   Wed Jan 19 09:40:04 2022 +0800

    bpf: support BPF_PROG_QUERY for progs attached to sockmap

    Right now there is no way to query whether BPF programs are
    attached to a sockmap or not.

    We can use the standard interface in libbpf to query, such as:
    bpf_prog_query(mapFd, BPF_SK_SKB_STREAM_PARSER, 0, NULL, ...);
    where mapFd is the fd of the sockmap.

    Signed-off-by: Di Zhu <zhudi2@huawei.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/r/20220119014005.1209-1-zhudi2@huawei.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:41 +02:00
Artem Savkov 7ad8abda5f bpf: Add schedule points in batch ops
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 75134f16e7dd0007aa474b281935c5f42e79f2c8
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 17 10:19:02 2022 -0800

    bpf: Add schedule points in batch ops

    syzbot reported various soft lockups caused by bpf batch operations.

     INFO: task kworker/1:1:27 blocked for more than 140 seconds.
     INFO: task hung in rcu_barrier

    Nothing prevents batch ops from processing huge amounts of data;
    we need to add schedule points in them.
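
    The fix boils down to a cond_resched() in each batch loop, e.g. (illustrative
    shape, not the full diff):

        for (cp = 0; cp < max_count; cp++) {
                /* copy key (and value) from/to user, perform the map op ... */
                cond_resched();     /* yield so huge batches cannot soft-lock a CPU */
        }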

    Note that maybe_wait_bpf_programs(map) calls from
    generic_map_delete_batch() can be factorized by moving
    the call after the loop.

    This will be done later in -next tree once we get this fix merged,
    unless there is strong opinion doing this optimization sooner.

    Fixes: aa2e93b8e5 ("bpf: Add generic support for update and delete batch ops")
    Fixes: cb4d03ab49 ("bpf: Add generic support for lookup batch op")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Acked-by: Brian Vazquez <brianvv@google.com>
    Link: https://lore.kernel.org/bpf/20220217181902.808742-1-eric.dumazet@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:55 +02:00
Artem Savkov 7f76bfc54f bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 216e3cd2f28dbbf1fe86848e0e29e6693b9f0a20
Author: Hao Luo <haoluo@google.com>
Date:   Thu Dec 16 16:31:51 2021 -0800

    bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.

    Some helper functions may modify its arguments, for example,
    bpf_d_path, bpf_get_stack etc. Previously, their argument types
    were marked as ARG_PTR_TO_MEM, which is compatible with read-only
    mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
    but technically incorrect, to modify a read-only memory by passing
    it into one of such helper functions.

    This patch tags the bpf_args compatible with immutable memory with
    MEM_RDONLY flag. The arguments that don't have this flag will be
    only compatible with mutable memory types, preventing the helper
    from modifying a read-only memory. The bpf_args that have
    MEM_RDONLY are compatible with both mutable memory and immutable
    memory.

    Signed-off-by: Hao Luo <haoluo@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:50 +02:00
Artem Savkov 75a645a56c add missing bpf-cgroup.h includes
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit aef2feda97b840ec38e9fa53d0065188453304e8
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Dec 15 18:55:37 2021 -0800

    add missing bpf-cgroup.h includes

    We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency,
    make sure those who actually need more than the definition of
    struct cgroup_bpf include bpf-cgroup.h explicitly.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:49 +02:00
Artem Savkov 77c4b3ac35 bpf: Pass a set of bpf_core_relo-s to prog_load command.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit fbd94c7afcf99c9f3b1ba1168657ecc428eb2c8d
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Dec 1 10:10:28 2021 -0800

    bpf: Pass a set of bpf_core_relo-s to prog_load command.

    struct bpf_core_relo is generated by llvm and processed by libbpf.
    It's a de-facto uapi.
    With CO-RE in the kernel the struct bpf_core_relo becomes uapi de-jure.
    Add an ability to pass a set of 'struct bpf_core_relo' to prog_load command
    and let the kernel perform CO-RE relocations.

    Note the struct bpf_line_info and struct bpf_func_info have the same
    layout when passed from LLVM to libbpf and from libbpf to the kernel
    except "insn_off" fields means "byte offset" when LLVM generates it.
    Then libbpf converts it to "insn index" to pass to the kernel.
    The struct bpf_core_relo's "insn_off" field is always "byte offset".
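
    For reference, the record being passed in looks like this in the uapi header
    (field layout as described above):

        struct bpf_core_relo {
                __u32 insn_off;
                __u32 type_id;
                __u32 access_str_off;
                enum bpf_core_relo_kind kind;
        };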

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20211201181040.23337-6-alexei.starovoitov@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:42 +02:00
Artem Savkov c7330c1abb bpf: Change bpf_kallsyms_lookup_name size type to ARG_CONST_SIZE_OR_ZERO
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit d4efb170861827290f7f571020001a60d001faaf
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Nov 23 05:27:31 2021 +0530

    bpf: Change bpf_kallsyms_lookup_name size type to ARG_CONST_SIZE_OR_ZERO

    Andrii mentioned in [0] that switching to ARG_CONST_SIZE_OR_ZERO lets the
    user avoid having to prove that the string size at runtime is not zero and
    helps with not having to suppress clang optimizations.

      [0]: https://lore.kernel.org/bpf/CAEf4BzZa_vhXB3c8atNcTS6=krQvC25H7K7c3WWZhM=27ro=Wg@mail.gmail.com

    Suggested-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20211122235733.634914-2-memxor@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:42 +02:00
Yauheni Kaliuta 120680ed7a bpf: Add bpf_kallsyms_lookup_name helper
Bugzilla: http://bugzilla.redhat.com/2069045

commit d6aef08a872b9e23eecc92d0e92393473b13c497
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Thu Oct 28 12:04:54 2021 +0530

    bpf: Add bpf_kallsyms_lookup_name helper
    
    This helper allows us to get the address of a kernel symbol from inside
    a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
    relocate typeless ksym vars.
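
    A hedged sketch of its use from a BPF_PROG_TYPE_SYSCALL program (the symbol
    name is illustrative):

        __u64 addr;
        long err;

        err = bpf_kallsyms_lookup_name("bpf_fentry_test1",
                                       sizeof("bpf_fentry_test1"), 0, &addr);
        if (!err)
                bpf_printk("resolved to %llx", addr);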
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:48 +03:00
Yauheni Kaliuta dff30e48af bpf: Add bloom filter map implementation
Bugzilla: http://bugzilla.redhat.com/2069045

commit 9330986c03006ab1d33d243b7cfe598a7a3c1baa
Author: Joanne Koong <joannekoong@fb.com>
Date:   Wed Oct 27 16:45:00 2021 -0700

    bpf: Add bloom filter map implementation
    
    This patch adds the kernel-side changes for the implementation of
    a bpf bloom filter map.
    
    The bloom filter map supports peek (determining whether an element
    is present in the map) and push (adding an element to the map)
    operations.These operations are exposed to userspace applications
    through the already existing syscalls in the following way:
    
    BPF_MAP_LOOKUP_ELEM -> peek
    BPF_MAP_UPDATE_ELEM -> push
    
    The bloom filter map does not have keys, only values. In light of
    this, the bloom filter map's API matches that of queue stack maps:
    user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
    which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
    and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
    APIs to query or add an element to the bloom filter map. When the
    bloom filter map is created, it must be created with a key_size of 0.
    
    For updates, the user will pass in the element to add to the map
    as the value, with a NULL key. For lookups, the user will pass in the
    element to query in the map as the value, with a NULL key. In the
    verifier layer, this requires us to modify the argument type of
    a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE;
    as well, in the syscall layer, we need to copy over the user value
    so that in bpf_map_peek_elem, we know which specific value to query.
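
    A hedged userspace sketch of that calling convention (value and map fd are
    illustrative):

        __u32 val = 1234;

        /* push: add the element to the filter */
        bpf_map_update_elem(map_fd, NULL, &val, BPF_ANY);

        /* peek: success means "probably present", an ENOENT error means
         * "definitely absent"
         */
        if (bpf_map_lookup_elem(map_fd, NULL, &val))
                /* definitely never pushed */ ;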
    
    A few things to please take note of:
     * If there are any concurrent lookups + updates, the user is
    responsible for synchronizing this to ensure no false negative lookups
    occur.
     * The number of hashes to use for the bloom filter is configurable from
    userspace. If no number is specified, the default used will be 5 hash
    functions. The benchmarks later in this patchset can help compare the
    performance of using different number of hashes on different entry
    sizes. In general, using more hashes decreases both the false positive
    rate and the speed of a lookup.
     * Deleting an element in the bloom filter map is not supported.
     * The bloom filter map may be used as an inner map.
     * The "max_entries" size that is specified at map creation time is used
    to approximate a reasonable bitmap size for the bloom filter, and is not
    otherwise strictly enforced. If the user wishes to insert more entries
    into the bloom filter than "max_entries", they may do so but they should
    be aware that this may lead to a higher false positive rate.
    
    Signed-off-by: Joanne Koong <joannekoong@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:47 +03:00
Yauheni Kaliuta c0d9cb1fab bpf: Use u64_stats_t in struct bpf_prog_stats
Bugzilla: http://bugzilla.redhat.com/2069045

commit 61a0abaee2092eee69e44fe60336aa2f5b578938
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Oct 26 14:41:33 2021 -0700

    bpf: Use u64_stats_t in struct bpf_prog_stats
    
    Commit 316580b69d ("u64_stats: provide u64_stats_t type")
    fixed possible load/store tearing on 64bit arches.
    
    For instance the following C code
    
    stats->nsecs += sched_clock() - start;
    
    Could be rightfully implemented like this by a compiler,
    confusing concurrent readers a lot:
    
    stats->nsecs += sched_clock();
    // arbitrary delay
    stats->nsecs -= start;
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211026214133.3114279-4-eric.dumazet@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:47 +03:00
Yauheni Kaliuta ed11533568 bpf: Add verified_insns to bpf_prog_info and fdinfo
Bugzilla: http://bugzilla.redhat.com/2069045

commit aba64c7da98330141dcdadd5612f088043a83696
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Wed Oct 20 00:48:17 2021 -0700

    bpf: Add verified_insns to bpf_prog_info and fdinfo
    
    This stat is currently printed in the verifier log and not stored
    anywhere. To ease consumption of this data, add a field to bpf_prog_aux
    so it can be exposed via BPF_OBJ_GET_INFO_BY_FD and fdinfo.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20211020074818.1017682-2-davemarchevsky@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:43 +03:00
Jerome Marchand 483ae4a299 bpf: Fix potential race in tail call compatibility check
Bugzilla: http://bugzilla.redhat.com/2041365

commit 54713c85f536048e685258f880bf298a74c3620d
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Tue Oct 26 13:00:19 2021 +0200

    bpf: Fix potential race in tail call compatibility check

    Lorenzo noticed that the code testing for program type compatibility of
    tail call maps is potentially racy in that two threads could encounter a
    map with an unset type simultaneously and both return true even though they
    are inserting incompatible programs.

    The race window is quite small, but artificially enlarging it by adding a
    usleep_range() inside the check in bpf_prog_array_compatible() makes it
    trivial to trigger from userspace with a program that does, essentially:

            map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
            pid = fork();
            if (pid) {
                    key = 0;
                    value = xdp_fd;
            } else {
                    key = 1;
                    value = tc_fd;
            }
            err = bpf_map_update_elem(map_fd, &key, &value, 0);

    While the race window is small, it has potentially serious ramifications in
    that triggering it would allow a BPF program to tail call to a program of a
    different type. So let's get rid of it by protecting the update with a
    spinlock. The commit in the Fixes tag is the last commit that touches the
    code in question.

    v2:
    - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
    v3:
    - Put lock and the members it protects into an embedded 'owner' struct (Daniel)

    Fixes: 3324b584b6 ("ebpf: misc core cleanup")
    Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:15 +02:00
Jerome Marchand 3fe05f2996 bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
Bugzilla: http://bugzilla.redhat.com/2041365

commit fda7a38714f40b635f5502ec4855602c6b33dad2
Author: Xu Kuohai <xukuohai@huawei.com>
Date:   Tue Oct 19 03:29:34 2021 +0000

    bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()

    1. The ufd in generic_map_update_batch() should be read from batch.map_fd;
    2. A call to fdget() should be followed by a symmetric call to fdput().
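    
    A sketch of the corrected pattern (not the verbatim kernel code):
    
      int ufd = attr->batch.map_fd;       /* read the fd from the batch sub-struct */
      struct fd f = fdget(ufd);
    
      if (!f.file)
              return -EBADF;
      /* ... look up the map and apply the batched updates ... */
      fdput(f);                           /* every successful fdget() is paired */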

    Fixes: aa2e93b8e5 ("bpf: Add generic support for update and delete batch ops")
    Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211019032934.1210517-1-xukuohai@huawei.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:15 +02:00
Jerome Marchand 4cbbc10481 bpf: Use kvmalloc for map keys in syscalls
Bugzilla: http://bugzilla.redhat.com/2041365

commit 44779a4b85abd1d1dab9e5b90bd5e6adcfc8143a
Author: Stanislav Fomichev <sdf@google.com>
Date:   Wed Aug 18 16:52:16 2021 -0700

    bpf: Use kvmalloc for map keys in syscalls

    Same as previous patch but for the keys. memdup_bpfptr is renamed
    to kvmemdup_bpfptr (and converted to kvmalloc).

    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20210818235216.1159202-2-sdf@google.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:43 +02:00
Jerome Marchand 353752951e bpf: Use kvmalloc for map values in syscall
Bugzilla: http://bugzilla.redhat.com/2041365

commit f0dce1d9b7c81fc3dc9d0cc0bc7ef9b3eae22584
Author: Stanislav Fomichev <sdf@google.com>
Date:   Wed Aug 18 16:52:15 2021 -0700

    bpf: Use kvmalloc for map values in syscall

    Use kvmalloc/kvfree for temporary value when manipulating a map via
    syscall. kmalloc might not be sufficient for percpu maps where the value
    is big (and further multiplied by hundreds of CPUs).

    Can be reproduced with netcnt test on qemu with "-smp 255".
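    
    The value buffer handling then follows the usual kvmalloc pattern
    (a sketch, not the exact kernel code):
    
      value = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
      if (!value)
              return -ENOMEM;
      if (copy_from_user(value, uvalue, value_size)) {
              kvfree(value);
              return -EFAULT;
      }
      /* ... update the map element ... */
      kvfree(value);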

    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:43 +02:00
Jerome Marchand b0371ec3e5 bpf: Allow to specify user-provided bpf_cookie for BPF perf links
Bugzilla: http://bugzilla.redhat.com/2041365

commit 82e6b1eee6a8875ef4eacfd60711cce6965c6b04
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sun Aug 15 00:05:58 2021 -0700

    bpf: Allow to specify user-provided bpf_cookie for BPF perf links

    Add ability for users to specify custom u64 value (bpf_cookie) when creating
    BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
    tracepoints).

    This is useful for cases when the same BPF program is used for attaching and
    processing invocation of different tracepoints/kprobes/uprobes in a generic
    fashion, but such that each invocation is distinguished from each other (e.g.,
    BPF program can look up additional information associated with a specific
    kernel function without having to rely on function IP lookups). This enables
    new use cases to be implemented simply and efficiently that previously were
    possible only through code generation (and thus multiple instances of almost
    identical BPF program) or compilation at runtime (BCC-style) on target hosts
    (even more expensive resource-wise). For uprobes it is not even possible in
    some cases to know function IP before hand (e.g., when attaching to shared
    library without PID filtering, in which case base load address is not known
    for a library).

    This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
    corresponding to each attached and run BPF program. Given cgroup BPF programs
    already use two 8-byte pointers for their needs and cgroup BPF programs don't
    have (yet?) support for bpf_cookie, reuse that space through union of
    cgroup_storage and new bpf_cookie field.

    Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
    This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
    program execution code, which luckily is now also split from
    BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
    giving access to this user-provided cookie value from inside a BPF program.
    Generic perf_event BPF programs will access this value from perf_event itself
    through passed in BPF program context.
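    
    For illustration, attaching the same program to two kprobes with different
    cookies could look like this from user space (a sketch, assuming the libbpf
    opts-based attach API that accompanies this series):
    
      DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, opts, .bpf_cookie = 1);
      struct bpf_link *l1, *l2;
    
      l1 = bpf_program__attach_kprobe_opts(prog, "tcp_v4_connect", &opts);
      opts.bpf_cookie = 2;
      l2 = bpf_program__attach_kprobe_opts(prog, "tcp_v6_connect", &opts);
      /* inside the program, the cookie distinguishes the two attach points */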

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:41 +02:00
Jerome Marchand c767012e36 bpf: Implement minimal BPF perf link
Bugzilla: http://bugzilla.redhat.com/2041365

commit b89fbfbb854c9afc3047e8273cc3a694650b802e
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sun Aug 15 00:05:57 2021 -0700

    bpf: Implement minimal BPF perf link

    Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
    BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
    the common BPF link infrastructure, allowing to list all active perf_event
    based attachments, auto-detaching BPF program from perf_event when link's FD
    is closed, get generic BPF link fdinfo/get_info functionality.

    BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
    are currently supported.

    Force-detaching and atomic BPF program updates are not yet implemented, but
    with perf_event-based BPF links we now have common framework for this without
    the need to extend ioctl()-based perf_event interface.

    One interesting consideration is a new value for bpf_attach_type, which
    BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
    bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
    bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
    BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
    program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
    mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
    define a single BPF_PERF_EVENT attach type for all of them and adjust
    link_create()'s logic for checking correspondence between attach type and
    program type.

    The alternative would be to define three new attach types (e.g., BPF_KPROBE,
    BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
    and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
    libbpf. I chose to not do this to avoid unnecessary proliferation of
    bpf_attach_type enum values and not have to deal with naming conflicts.
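    
    In raw syscall terms the attach then looks roughly like this (a sketch):
    
      union bpf_attr attr = {};
    
      attr.link_create.prog_fd = prog_fd;       /* kprobe/tp/perf_event program */
      attr.link_create.target_fd = perf_fd;     /* FD from perf_event_open() */
      attr.link_create.attach_type = BPF_PERF_EVENT;
      link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));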

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/bpf/20210815070609.987780-5-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:41 +02:00
Jerome Marchand 103c5a16ea bpf: Add map side support for bpf timers.
Bugzilla: http://bugzilla.redhat.com/2041365

commit 68134668c17f31f51930478f75495b552a411550
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Jul 14 17:54:10 2021 -0700

    bpf: Add map side support for bpf timers.

    Restrict bpf timers to array, hash (both preallocated and kmalloced), and
    lru map types. The per-cpu maps with timers don't make sense, since 'struct
    bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
    the number of timers depends on number of possible cpus and timers would not be
    accessible from all cpus. lpm map support can be added in the future.
    The timers in inner maps are supported.

    The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
    bpf_timer in a given map element.

    Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
    that map element indeed contains 'struct bpf_timer'.

    Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
    map element data is reused in preallocated htab and lru maps.

    Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
    map element. There could be one of each, but not more than one. Due to 'one
    bpf_timer in one element' restriction do not support timers in global data,
    since global data is a map of single element, but from bpf program side it's
    seen as many global variables and restriction of single global timer would be
    odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
    timers, since user space could have corrupted mmap element and crashed the
    kernel. The maps with timers cannot be readonly. Due to these restrictions
    search for bpf_timer in datasec BTF in case it was placed in the global data to
    report clear error.

    The previous patch allowed 'struct bpf_timer' as a first field in a map
    element only. Relax this restriction.

    Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
    the timer when lru map deletes an element as a part of it eviction algorithm.

    Make sure that a bpf program cannot access 'struct bpf_timer' via direct load/store.
    The timer operations are done through helpers only.
    This is similar to 'struct bpf_spin_lock'.
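    
    A map layout that becomes legal with this change (sketch):
    
      struct elem {
              int counter;
              struct bpf_timer t;       /* no longer required to be the first field */
      };
    
      struct {
              __uint(type, BPF_MAP_TYPE_HASH);
              __uint(max_entries, 64);
              __type(key, int);
              __type(value, struct elem);
      } timer_map SEC(".maps");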

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:31 +02:00
Jerome Marchand f6f5ce1a8d bpf: Prepare bpf_prog_put() to be called from irq context.
Bugzilla: http://bugzilla.redhat.com/2041365

commit d809e134be7a1fdd9f5b99ab3291c6da5c0b8240
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Jul 14 17:54:07 2021 -0700

    bpf: Prepare bpf_prog_put() to be called from irq context.

    Currently bpf_prog_put() is called from the task context only.
    With addition of bpf timers the timer related helpers will start calling
    bpf_prog_put() from irq-saved region and in rare cases might drop
    the refcnt to zero.
    To address this case, first, convert bpf_prog_free_id() to use irq-save
    locking (this is similar to bpf_map_free_id), and, second, defer the calls
    that are not appropriate with irqs disabled into a work queue.
    For example:
    bpf_audit_prog() is calling kmalloc and wake_up_interruptible,
    bpf_prog_kallsyms_del_all()->bpf_ksym_del()->spin_unlock_bh().
    They are not safe with irqs disabled.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:31 +02:00
Jiri Olsa 8a02e88911 bpf: Fix toctou on read-only map's constant scalar tracking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2029198
CVE: CVE-2021-4001

commit 353050be4c19e102178ccc05988101887c25ae53
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Tue Nov 9 18:48:08 2021 +0000

    bpf: Fix toctou on read-only map's constant scalar tracking

    Commit a23740ec43 ("bpf: Track contents of read-only maps as scalars") is
    checking whether maps are read-only both from BPF program side and user space
    side, and then, given their content is constant, reading out their data via
    map->ops->map_direct_value_addr() which is then subsequently used as known
    scalar value for the register, that is, it is marked as __mark_reg_known()
    with the read value at verification time. Before a23740ec43, the register
    content was marked as an unknown scalar so the verifier could not make any
    assumptions about the map content.

    The current implementation however is prone to a TOCTOU race, meaning, the
    value read as known scalar for the register is not guaranteed to be exactly
    the same at a later point when the program is executed, and as such, the
    prior made assumptions of the verifier with regards to the program will be
    invalid which can cause issues such as OOB access, etc.

    While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
    specified at map creation time, the map->frozen property is initially set to
    false for the map given the map value needs to be populated, e.g. for global
    data sections. Once complete, the loader "freezes" the map from user space
    such that no subsequent updates/deletes are possible anymore. For the rest
    of the lifetime of the map, this freeze one-time trigger cannot be undone
    anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
    cmd calls which would update/delete map entries will be rejected with -EPERM
    since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
    means that pending update/delete map entries must still complete before this
    guarantee is given. This corner case is not an issue for loaders since they
    create and prepare such program private map in successive steps.

    However, a malicious user is able to trigger this TOCTOU race in two different
    ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
    used to expand the competition interval, so that map_update_elem() can modify
    the contents of the map after map_freeze() and bpf_prog_load() were executed.
    This works, because userfaultfd halts the parallel thread which triggered a
    map_update_elem() at the time where we copy key/value from the user buffer and
    this already passed the FMODE_CAN_WRITE capability test given at that time the
    map was not "frozen". Then, the main thread performs the map_freeze() and
    bpf_prog_load(), and once that had completed successfully, the other thread
    is woken up to complete the pending map_update_elem() which then changes the
    map content. For ii) the idea of the batched update is similar, meaning, when
    there are a large number of updates to be processed, it can increase the
    competition interval between the two. It is therefore possible in practice to
    modify the contents of the map after executing map_freeze() and bpf_prog_load().

    One way to fix both i) and ii) at the same time is to expand the use of the
    map's map->writecnt. The latter was introduced in fc9702273e ("bpf: Add mmap()
    support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2 ("bpf:
    Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
    the rationale to make a writable mmap()'ing of a map mutually exclusive with
    read-only freezing. The counter indicates writable mmap() mappings and then
    prevents/fails the freeze operation. Its semantics can be expanded beyond
    just mmap() by generally indicating ongoing write phases. This would essentially
    span any parallel regular and batched flavor of update/delete operation and
    then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
    the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
    last pending writes have completed via bpf_map_write_active() test. Once the
    map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
    only then we are really guaranteed to use the map's data as known constants.
    For map->frozen being set and pending writes in process of still being completed
    we fall back to marking that register as unknown scalar so we don't end up
    making assumptions about it. With this, both TOCTOU reproducers from i) and
    ii) are fixed.

    Note that the map->writecnt has been converted into an atomic64 in the fix in
    order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
    map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
    the freeze_mutex over entire map update/delete operations in syscall side
    would not be possible due to then causing everything to be serialized.
    Similarly, something like synchronize_rcu() after setting map->frozen to wait
    for update/deletes to complete is not possible either since it would also
    have to span the user copy which can sleep. On the libbpf side, this won't
    break d66562fba1 ("libbpf: Add BPF object skeleton support") as the
    anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
    mmap()-ed memory where for .rodata it's non-writable.
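    
    The verifier-side condition then conceptually becomes (a sketch, not verbatim):
    
      static bool bpf_map_is_rdonly(const struct bpf_map *map)
      {
              return (map->map_flags & BPF_F_RDONLY_PROG) &&
                     map->frozen &&
                     !bpf_map_write_active(map);
      }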

    Fixes: a23740ec43 ("bpf: Track contents of read-only maps as scalars")
    Reported-by: w1tcher.bupt@gmail.com
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
2021-12-09 22:11:29 +01:00
Jiri Olsa 786ca1cae5 bpf: Fix unprivileged_bpf_disabled setup
There's a recent change [1] that adds a new config option
(CONFIG_BPF_UNPRIV_DEFAULT_OFF) and sets unprivileged_bpf_disabled to 2
if the option is enabled.

The current RHEL-specific behaviour is to set unprivileged_bpf_disabled
to 1 by default and add a boot command line argument to enable
unprivileged bpf.

The config option is enabled in the previous patch; this patch adds the taint
for the proc/sysctl unprivileged_bpf_disabled setup.

  # sysctl kernel.unprivileged_bpf_disabled
  kernel.unprivileged_bpf_disabled = 2
  # cat /proc/sys/kernel/tainted
  0
  # sysctl kernel.unprivileged_bpf_disabled=0
  [   45.751085] Unprivileged BPF has been enabled, tainting the kernel
  kernel.unprivileged_bpf_disabled = 0
  # sysctl kernel.unprivileged_bpf_disabled=1
  kernel.unprivileged_bpf_disabled = 1
  # sysctl kernel.unprivileged_bpf_disabled=0
  sysctl: setting key "kernel.unprivileged_bpf_disabled": Operation not permitted
  # sysctl kernel.unprivileged_bpf_disabled=2
  sysctl: setting key "kernel.unprivileged_bpf_disabled": Operation not permitted
  # cat /proc/sys/kernel/tainted
  2147483648

[1] 08389d8882 ("bpf: Add kconfig knob for disabling unpriv bpf by default")
[2] 607f0e89af ("bpf: set unprivileged_bpf_disabled to 1 by default, add a boot parameter")

Fixes: 607f0e89af ("bpf: set unprivileged_bpf_disabled to 1 by default, add a boot parameter")
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
2021-08-30 14:31:15 -04:00
Eugene Syromiatnikov 9ce398ab77 bpf: set unprivileged_bpf_disabled to 1 by default, add a boot parameter
Message-id: <133022c6c389ca16060bd20ef69199de0800200b.1528991396.git.esyr@redhat.com>
Patchwork-id: 8250
O-Subject: [kernel team] [RHEL8 PATCH v4 2/5] [bpf] bpf: set unprivileged_bpf_disabled to 1 by default, add a boot parameter
Bugzilla: 1561171
RH-Acked-by: Jiri Benc <jbenc@redhat.com>
RH-Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

This patch sets kernel.unprivileged_bpf_disabled sysctl knob to 1
by default, and provides the ability (in the form of a boot-time parameter)
to reset it to 0, as it is impossible to do so at runtime.  Since
unprivileged BPF is considered unsupported, it also taints the kernel.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1561171
Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16716594
Upstream: RHEL only.  The patch (in a more generic form) has been
          proposed upstream[1] and subsequently rejected.

[1] https://lkml.org/lkml/2018/5/21/344

Upstream Status: RHEL only
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-08-30 14:29:35 -04:00
David S. Miller a52171ae7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-06-17

The following pull-request contains BPF updates for your *net-next* tree.

We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).

The main changes are:

1) BPF infrastructure to migrate TCP child sockets from a listener to another
   in the same reuseport group/map, from Kuniyuki Iwashima.

2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
   noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.

3) Streamline error reporting changes in libbpf as planned out in the
   'libbpf: the road to v1.0' effort, from Andrii Nakryiko.

4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.

5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
   types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.

6) Support new LLVM relocations in libbpf to make them more linker friendly,
   also add a doc to describe the BPF backend relocations, from Yonghong Song.

7) Silence long standing KUBSAN complaints on register-based shifts in
   interpreter, from Daniel Borkmann and Eric Biggers.

8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
   target arch cannot be determined, from Lorenz Bauer.

9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.

10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.

11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
    programs to bpf_helpers.h header, from Florent Revest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17 11:54:56 -07:00
Kuniyuki Iwashima d5e4ddaeb6 bpf: Support socket migration by eBPF.
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
to check if the attached eBPF program is capable of migrating sockets. When
the eBPF program is attached, we run it for socket migration if the
expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
net.ipv4.tcp_migrate_req is enabled.

Currently, the expected_attach_type is not enforced for the
BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
earlier idea in the commit aac3fc320d ("bpf: Post-hooks for sys_bind") to
fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().

Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to
select a new listener based on the child socket. migrating_sk varies
depending on if it is migrating a request in the accept queue or during
3WHS.

  - accept_queue : sock (ESTABLISHED/SYN_RECV)
  - 3WHS         : request_sock (NEW_SYN_RECV)

In the eBPF program, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.

  - SK_PASS with selected_sk, select it as a new listener
  - SK_PASS with selected_sk NULL, fallbacks to the random selection
  - SK_DROP, cancel the migration.

There is a noteworthy point. We select a listening socket in three places,
but we do not have struct skb at closing a listener or retransmitting a
SYN+ACK. On the other hand, some helper functions do not expect skb to be NULL
(e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer()
in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb
temporarily before running the eBPF program.
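
A minimal program shape for the new attach point (a sketch; the reuseport map
definition and includes are omitted):

  SEC("sk_reuseport/migrate")
  int migrate_select(struct sk_reuseport_md *md)
  {
          int zero = 0;

          if (!md->migrating_sk)          /* regular SYN handling, not a migration */
                  return SK_PASS;
          /* pick a live listener out of the reuseport map; SK_DROP cancels it */
          bpf_sk_select_reuseport(md, &reuseport_map, &zero, 0);
          return SK_PASS;
  }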

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
2021-06-15 18:01:06 +02:00
Jakub Kicinski 5ada57a9a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 09:55:10 -07:00
Denis Salopek 3e87f192b4 bpf: Add lookup_and_delete_elem support to hashtab
Extend the existing bpf_map_lookup_and_delete_elem() functionality to
hashtab map types, in addition to stacks and queues.
Create a new hashtab bpf_map_ops function that does lookup and deletion
of the element under the same bucket lock and add the created map_ops to
bpf.h.
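
From user space this is reachable through the existing libbpf wrapper (sketch):

  int key = 42;
  long value;

  /* fetch and remove the element under the same bucket lock */
  if (!bpf_map_lookup_and_delete_elem(map_fd, &key, &value))
          printf("drained %ld\n", value);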

Signed-off-by: Denis Salopek <denis.salopek@sartura.hr>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/4d18480a3e990ffbf14751ddef0325eed3be2966.1620763117.git.denis.salopek@sartura.hr
2021-05-24 13:30:26 -07:00
Pu Lehui 3a2daa7248 bpf: Make some symbols static
The sparse tool complains as follows:

kernel/bpf/syscall.c:4567:29: warning:
 symbol 'bpf_sys_bpf_proto' was not declared. Should it be static?
kernel/bpf/syscall.c:4592:29: warning:
 symbol 'bpf_sys_close_proto' was not declared. Should it be static?

These symbols are not used outside of syscall.c, so mark them static.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210519064116.240536-1-pulehui@huawei.com
2021-05-19 10:47:43 -07:00
Alexei Starovoitov 3abea08924 bpf: Add bpf_sys_close() helper.
Add bpf_sys_close() helper to be used by the syscall/loader program to close
intermediate FDs and other cleanup.
Note this helper must never be allowed inside fdget/fdput bracketing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-11-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov 3d78417b60 bpf: Add bpf_btf_find_by_name_kind() helper.
Add new helper:
long bpf_btf_find_by_name_kind(char *name, int name_sz, u32 kind, int flags)
Description
	Find BTF type with given name and kind in vmlinux BTF or in module's BTFs.
Return
	Returns btf_id and btf_obj_fd in lower and upper 32 bits.

It will be used by loader program to find btf_id to attach the program to
and to find btf_ids of ksyms.
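
The packed return value is meant to be split by the caller (a sketch; name and
name_sz are whatever the loader program is resolving):

  long r = bpf_btf_find_by_name_kind(name, name_sz, BTF_KIND_FUNC, 0);

  if (r < 0)
          return r;
  btf_id = (int)r;                  /* lower 32 bits */
  btf_obj_fd = r >> 32;             /* upper 32 bits; 0 means vmlinux BTF */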

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-10-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov 387544bfa2 bpf: Introduce fd_idx
Typical program loading sequence involves creating bpf maps and applying
map FDs into bpf instructions in various places in the bpf program.
This job is done by libbpf that is using compiler generated ELF relocations
to patch certain instruction after maps are created and BTFs are loaded.
The goal of fd_idx is to allow bpf instructions to stay immutable
after compilation. At load time the libbpf would still create maps as usual,
but it wouldn't need to patch instructions. It would store map_fds into
__u32 fd_array[] and would pass that pointer to sys_bpf(BPF_PROG_LOAD).
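
Conceptually the load then looks like this (a sketch, with most prog_load
attributes omitted; map_fd_a/map_fd_b are hypothetical map FDs):

  __u32 fd_array[] = { map_fd_a, map_fd_b };
  union bpf_attr attr = {};

  attr.fd_array = (__u64)(unsigned long)fd_array;
  /* ld_imm64 insns refer to maps via BPF_PSEUDO_MAP_IDX plus an index into
   * fd_array instead of a patched-in FD */
  prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));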

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-9-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov c571bd752e bpf: Make btf_load command to be bpfptr_t compatible.
Similar to prog_load, make the btf_load command available to the
bpf_prog_type_syscall program.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-7-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov af2ac3e13e bpf: Prepare bpf syscall to be used from kernel and user space.
With the help from bpfptr_t prepare relevant bpf syscall commands
to be used from kernel and user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-4-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov 79a7f8bdb1 bpf: Introduce bpf_sys_bpf() helper and program type.
Add placeholders for bpf_sys_bpf() helper and new program type.
Make sure to check that expected_attach_type is zero for future extensibility.
Allow tracing helper functions to be used in this program type, since they will
only execute from user context via bpf_prog_test_run.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-2-alexei.starovoitov@gmail.com
2021-05-19 00:33:39 +02:00
Daniel Borkmann 08389d8882 bpf: Add kconfig knob for disabling unpriv bpf by default
Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.

This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.

We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.

Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.
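
The procfs handler change described above might look roughly like this in the
sysctl table (a sketch, field values assumed rather than quoted):

  {
          .procname     = "unprivileged_bpf_disabled",
          .data         = &sysctl_unprivileged_bpf_disabled,
          .maxlen       = sizeof(sysctl_unprivileged_bpf_disabled),
          .mode         = 0644,
          .proc_handler = bpf_unpriv_handler,
          .extra1       = SYSCTL_ZERO,
          .extra2       = &two,           /* lets an admin toggle between 0 and 2 */
  },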

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
2021-05-11 13:56:16 -07:00
Jiri Olsa f3a9507554 bpf: Allow trampoline re-attach for tracing and lsm programs
Currently we don't allow re-attaching of trampolines. Once
it's detached, it can't be re-attached even when the program
is still loaded.

Adding the possibility to re-attach the loaded tracing and
lsm programs.

Fixing missing unlock with proper cleanup goto jump reported
by Julia.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210414195147.1624932-2-jolsa@kernel.org
2021-04-25 21:09:01 -07:00
Toke Høiland-Jørgensen 441e8c66b2 bpf: Return target info when a tracing bpf_link is queried
There is currently no way to discover the target of a tracing program
attachment after the fact. Add this information to bpf_link_info and return
it when querying the bpf_link fd.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com
2021-04-13 18:18:57 -07:00
Cong Wang a7ba4558e6 sock_map: Introduce BPF_SK_SKB_VERDICT
Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
confusing and more importantly we still want to distinguish them
from user-space. So we can just reuse the stream verdict code but
introduce a new type of eBPF program, skb_verdict. Users are not
allowed to attach stream_verdict and skb_verdict programs to the
same map.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Martin KaFai Lau e6ac2450d6 bpf: Support bpf program calling kernel function
This patch adds support to BPF verifier to allow bpf program calling
kernel function directly.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation,  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be white listed
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The white listed functions are not bounded to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them has changed, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs which have to be adjusted accordingly.

This patch is to make the required changes in the bpf verifier.

First change is in btf.c, it adds a case in "btf_check_func_arg_match()".
When the passed in "btf->kernel_btf == true", it means matching the
verifier regs' states with a kernel function.  This will handle the
PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
and PTR_TO_TCP_SOCK to its kernel's btf_id.

In the later libbpf patch, the insn calling a kernel function will
look like:

insn->code == (BPF_JMP | BPF_CALL)
insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
insn->imm == func_btf_id /* btf_id of the running kernel */

[ For the future calling function-in-kernel-module support, an array
  of module btf_fds can be passed at the load time and insn->off
  can be used to index into this array. ]

At the early stage of verifier, the verifier will collect all kernel
function calls into "struct bpf_kfunc_desc".  Those
descriptors are stored in "prog->aux->kfunc_tab" and will
be available to the JIT.  Since this "add" operation is similar
to the current "add_subprog()" and looking for the same insn->code,
they are done together in the new "add_subprog_and_kfunc()".

In the "do_check()" stage, the new "check_kfunc_call()" is added
to verify the kernel function call instruction:
1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
   A new bpf_verifier_ops "check_kfunc_call" is added to do that.
   The bpf-tcp-cc struct_ops program will implement this function in
   a later patch.
2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
   used as the args of a kernel function.
3. Mark the regs' type, subreg_def, and zext_dst.

At the later do_misc_fixups() stage, the new fixup_kfunc_call()
will replace the insn->imm with the function address (relative
to __bpf_call_base).  If needed, the jit can find the btf_func_model
by calling the new bpf_jit_find_kfunc_model(prog, insn).
With the imm set to the function address, "bpftool prog dump xlated"
will be able to display the kernel function calls the same way as
it displays other bpf helper calls.

gpl_compatible program is required to call kernel function.

This feature currently requires JIT.

The verifier selftests are adjusted because of the changes in
the verbose log in add_subprog_and_kfunc().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
2021-03-26 20:41:51 -07:00
Martin KaFai Lau e16301fbe1 bpf: Simplify freeing logic in linfo and jited_linfo
This patch simplifies the linfo freeing logic by combining
"bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()"
into the new "bpf_prog_jit_attempt_done()".
It is a prep work for the kernel function call support.  In a later
patch, freeing the kernel function call descriptors will also
be done in the "bpf_prog_jit_attempt_done()".

"bpf_prog_free_linfo()" is removed since it is only called by
"__bpf_prog_put_noref()".  The kvfree() are directly called
instead.

It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo
allocation.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015130.1544323-1-kafai@fb.com
2021-03-26 20:41:50 -07:00
Alexei Starovoitov 350a5c4dd2 bpf: Dont allow vmlinux BTF to be used in map_create and prog_load.
Syzbot got an FD of vmlinux BTF and passed it into map_create, which caused
crash in btf_type_id_size() when it tried to access resolved_ids. The vmlinux
BTF doesn't have 'resolved_ids' and 'resolved_sizes' initialized to save
memory. To avoid such issues disallow using vmlinux BTF in prog_load and
map_create commands.

Fixes: 5329722057 ("bpf: Assign ID to vmlinux BTF and return extra info for BTF in GET_OBJ_INFO")
Reported-by: syzbot+8bab8ed346746e7540e8@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210307225248.79031-1-alexei.starovoitov@gmail.com
2021-03-08 13:32:46 +01:00
Alexei Starovoitov 9ed9e9ba23 bpf: Count the number of times recursion was prevented
Add per-program counter for number of times recursion prevention mechanism
was triggered and expose it via show_fdinfo and bpf_prog_info.
Teach bpftool to print it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-7-alexei.starovoitov@gmail.com
2021-02-11 16:19:20 +01:00
Alexei Starovoitov 700d4796ef bpf: Optimize program stats
Move bpf_prog_stats from prog->aux into prog to avoid one extra load
in critical path of program execution.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-2-alexei.starovoitov@gmail.com
2021-02-11 16:17:50 +01:00
Jiri Olsa 5541075a34 bpf: Prevent double bpf_prog_put call from bpf_tracing_prog_attach
The bpf_tracing_prog_attach error path calls bpf_prog_put
on prog, which causes refcount underflow when it's called
from link_create function.

  link_create
    prog = bpf_prog_get              <-- get
    ...
    tracing_bpf_link_attach(prog..
      bpf_tracing_prog_attach(prog..
        out_put_prog:
          bpf_prog_put(prog);        <-- put

    if (ret < 0)
      bpf_prog_put(prog);            <-- put

Removing bpf_prog_put call from bpf_tracing_prog_attach
and making sure its callers call it instead.

Fixes: 4a1e7c0c63 ("bpf: Support attaching freplace programs to multiple attach points")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210111191650.1241578-1-jolsa@kernel.org
2021-01-12 00:17:34 +01:00
David S. Miller 4bfc471484 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-12-28

The following pull-request contains BPF updates for your *net* tree.

There is a small merge conflict between bpf tree commit 69ca310f34
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:

  [...]
                curr_task = task_seq_get_next(ns, &curr_tid, true);
                if (!curr_task) {
                        info->task = NULL;
                        info->tid = curr_tid;
                        return NULL;
                }

                /* set info->task and info->tid */
  [...]

We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).

The main changes are:

1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
   fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.

2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
   point to BPF hashtab initialization, from Eric Dumazet.

3) Fix a splat in task iterator by saving the correct stopping point in the
   seq file iteration, from Jonathan Lemon.

4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
   errors on update/deletes, from Andrii Nakryiko.

5) Fix BPF selftest error reporting to something more user friendly if the
   vmlinux BTF cannot be found, from Kamal Mostafa.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 15:26:11 -08:00
Tian Tao d467d80dc3 bpf: Remove unused including <linux/version.h>
Remove the inclusion of <linux/version.h> where it is not needed.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/1608086835-54523-1-git-send-email-tiantao6@hisilicon.com
2020-12-18 16:17:59 +01:00
Linus Torvalds faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Eric W. Biederman b48845af01 bpf: In bpf_task_fd_query use fget_task
Use the helper fget_task to simplify bpf_task_fd_query.

As well as simplifying the code this removes one unnecessary increment of
struct files_struct.  This unnecessary increment of files_struct.count can
result in exec unnecessarily unsharing files_struct and breaking posix
locks, and it can result in fget_light having to fallback to fget reducing
performance.

This simplification comes from the observation that none of the
callers of get_files_struct actually need to call get_files_struct
that was made when discussing[1] exec and posix file locks.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-5-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-5-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:44 -06:00
Andrii Nakryiko 8bdd8e275e bpf: Return -ENOTSUPP when attaching to non-kernel BTF
Return -ENOTSUPP if an attempt is made to attach a tracing BPF program with
a specified attach_btf_obj_fd pointing to a non-kernel (neither vmlinux nor
module) BTF object. This scenario might be supported in the future and isn't
outright invalid, so -EINVAL isn't the most appropriate error code.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201208064326.667389-1-andrii@kernel.org
2020-12-08 17:14:27 +01:00
Andrii Nakryiko 290248a5b7 bpf: Allow to specify kernel module BTFs when attaching BPF programs
Add ability for user-space programs to specify non-vmlinux BTF when attaching
BTF-powered BPF programs: raw_tp, fentry/fexit/fmod_ret, LSM, etc. For this,
attach_prog_fd (now with the alias name attach_btf_obj_fd) should specify FD
of a module or vmlinux BTF object. For backwards compatibility reasons,
0 denotes vmlinux BTF. Only kernel BTF (vmlinux or module) can be specified.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-11-andrii@kernel.org
2020-12-03 17:38:21 -08:00
Andrii Nakryiko 22dc4a0f5e bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
Remove a permeating assumption throughout the BPF verifier of vmlinux BTF. Instead,
wherever BTF type IDs are involved, also track the instance of struct btf that
goes along with the type ID. This allows gradually adding support for kernel
module BTFs and using/tracking module types across BPF helper calls and
registers.

This patch also renames btf_id() function to btf_obj_id() to minimize naming
clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.

Also, although btf_vmlinux can't get destructed and thus doesn't need
refcounting, module BTFs need that, so apply BTF refcounting universally when
BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
makes for simpler clean up code.

Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
trampoline key to include BTF object ID. To differentiate that from target
program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
not allowed to take full 32 bits, so there is no danger of confusing that bit
with a valid BTF type ID.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
2020-12-03 17:38:21 -08:00
Roman Gushchin 3ac1f01b43 bpf: Eliminate rlimit-based memory accounting for bpf progs
Do not use rlimit-based memory accounting for bpf progs. It has been
replaced with memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin 80ee81e040 bpf: Eliminate rlimit-based memory accounting infra for bpf maps
Remove rlimit-based accounting infrastructure code, which is not used
anymore.

To provide backward compatibility, use an approximation of the
bpf map memory footprint as a "memlock" value, available to a user
via map info. The approximation is based on the maximal number of
elements and key and value sizes.
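
In rough terms (a sketch of the approximation, not the exact formula):

  memlock ~= round_up(key_size + value_size, 8) * max_entries;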

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin d5299b67dd bpf: Memcg-based memory accounting for bpf maps
This patch enables memcg-based memory accounting for memory allocated
by __bpf_map_area_alloc(), which is used by many types of bpf maps for
large initial memory allocations.

Please note, that __bpf_map_area_alloc() should not be used outside of
map creation paths without setting the active memory cgroup to the
map's memory cgroup.

Following patches in the series will refine the accounting for
some of the map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-8-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 48edc1f78a bpf: Prepare for memcg-based memory accounting for bpf maps
Bpf maps can be updated from an interrupt context, and in such a
case there is no process which can be charged. That makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and commit b87d8cefe4
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To make the ownership model simple and consistent, when the map
is created, the memory cgroup of the current process is recorded.
All subsequent allocations related to the bpf map are charged to
the same memory cgroup. It includes allocations made by any processes
(even if they do belong to a different cgroup) and from interrupts.

This commit introduces 3 new helpers, which will be used by following
commits to enable the accounting of bpf maps memory:
  - bpf_map_kmalloc_node()
  - bpf_map_kzalloc()
  - bpf_map_alloc_percpu()

They are wrapping popular memory allocation functions. They set
the active memory cgroup to the map's memory cgroup and add
__GFP_ACCOUNT to the passed gfp flags. Then they call into
the corresponding memory allocation function and restore
the original active memory cgroup.
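
A sketch of the wrapper pattern the helpers follow (not verbatim kernel code):

  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                             gfp_t flags, int node)
  {
          struct mem_cgroup *old_memcg = set_active_memcg(map->memcg);
          void *ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);

          set_active_memcg(old_memcg);
          return ptr;
  }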

These helpers are supposed to be used everywhere except the map creation
path. During map creation, when the map structure itself is allocated,
it cannot be passed to those helpers. In those cases the default
memory allocation functions are used with the __GFP_ACCOUNT flag.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-7-guro@fb.com
2020-12-02 18:32:34 -08:00
KP Singh 4cf1bc1f10 bpf: Implement task local storage
Similar to bpf_local_storage for sockets and inodes add local storage
for task_struct.

The life-cycle of storage is managed with the life-cycle of the
task_struct.  i.e. the storage is destroyed along with the owning task
with a callback to the bpf_task_storage_free from the task_free LSM
hook.

The BPF LSM allocates an __rcu pointer to the bpf_local_storage in
the security blob which are now stackable and can co-exist with other
LSMs.

The userspace map operations can be done by using a pid fd as a key
passed to the lookup, update and delete operations.
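
From user space the storage is addressed by a pid fd (sketch):

  int pidfd = syscall(SYS_pidfd_open, target_pid, 0);
  long value;

  /* map_fd refers to a BPF_MAP_TYPE_TASK_STORAGE map */
  bpf_map_lookup_elem(map_fd, &pidfd, &value);
  close(pidfd);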

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201106103747.2780972-3-kpsingh@chromium.org
2020-11-06 08:08:37 -08:00
Tom Rix 76702a2e72 bpf: Remove unneeded break
A break is not needed if it is preceded by a return.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201019173846.1021-1-trix@redhat.com
2020-10-19 20:40:21 +02:00
Stanislav Fomichev 1028ae4069 bpf: Deref map in BPF_PROG_BIND_MAP when it's already used
We are missing a deref for the case when we are doing BPF_PROG_BIND_MAP
on a map that's being already held by the program.
There is 'if (ret) bpf_map_put(map)' below which doesn't trigger
because we don't consider this an error.
Let's add missing bpf_map_put() for this specific condition.

Fixes: ef15314aa5 ("bpf: Add BPF_PROG_BIND_MAP syscall")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20201003002544.3601440-1-sdf@google.com
2020-10-02 19:21:25 -07:00
Toke Høiland-Jørgensen 4a1e7c0c63 bpf: Support attaching freplace programs to multiple attach points
This enables support for attaching freplace programs to multiple attach
points. It does this by amending the UAPI for bpf_link_create with a target
btf ID that can be used to supply the new attachment point along with the
target program fd. The target must be compatible with the target that was
supplied at program load time.

The implementation reuses the checks that were factored out of
check_attach_btf_id() to ensure compatibility between the BTF types of the
old and new attachment. If these match, a new bpf_tracing_link will be
created for the new attach target, allowing multiple attachments to
co-exist simultaneously.

The code could theoretically support multiple-attach of other types of
tracing programs as well, but since I don't have a use case for any of
those, there is no API support for doing so.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk
2020-09-29 13:09:24 -07:00
Toke Høiland-Jørgensen 3aac1ead5e bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach
In preparation for allowing multiple attachments of freplace programs, move
the references to the target program and trampoline into the
bpf_tracing_link structure when that is created. To do this atomically,
introduce a new mutex in prog->aux to protect writing to the two pointers
to target prog and trampoline, and rename the members to make it clear that
they are related.

With this change, it is no longer possible to attach the same tracing
program multiple times (detaching in-between), since the reference from the
tracing program to the target disappears on the first attach. However,
since the next patch will let the caller supply an attach target, that will
also make it possible to attach to the same place multiple times.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk
2020-09-29 13:09:23 -07:00
Song Liu 1b4d60ec16 bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs
the target program on a specific CPU. This is achieved by a new flag in
bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program
is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for
BPF programs that handle perf_event and other percpu resources, as the
program can access these resources locally.
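
From user space the new flag could be used like this (a sketch, assuming the
matching libbpf test_run opts added alongside this feature):

  DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
          .flags = BPF_F_TEST_RUN_ON_CPU,
          .cpu   = 3,                      /* trigger the program on CPU 3 */
  );

  bpf_prog_test_run_opts(prog_fd, &opts);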

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com
2020-09-28 21:52:36 +02:00
Alexei Starovoitov f00f2f7fe8 Revert "bpf: Fix potential call bpf_link_free() in atomic context"
This reverts commit 31f23a6a18.

This change made many selftests/bpf flaky: flow_dissector, sk_lookup, sk_assign and others.
There was no issue in the code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-09-23 19:14:11 -07:00
David S. Miller 6d772f328d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-09-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 95 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

The main changes are:

1) Full multi function support in libbpf, from Andrii.

2) Refactoring of function argument checks, from Lorenz.

3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

4) Program metadata support, from YiFei.

5) bpf iterator optimizations, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-23 13:11:11 -07:00
Muchun Song 31f23a6a18 bpf: Fix potential call bpf_link_free() in atomic context
The in_atomic() macro cannot always detect atomic context, in particular,
it cannot know about held spinlocks in non-preemptible kernels. Although
there is currently no caller of bpf_link_put() that holds a spinlock, be on
the safe side so we can avoid this problem in the future.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200917074453.20621-1-songmuchun@bytedance.com
2020-09-21 21:20:17 +02:00
YiFei Zhu ef15314aa5 bpf: Add BPF_PROG_BIND_MAP syscall
This syscall binds a map to a program. Returns success if the map is
already bound to the program.
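
A minimal sketch of the new command (prog_fd and map_fd are assumed to come
from earlier BPF_PROG_LOAD and BPF_MAP_CREATE calls):

  union bpf_attr attr = {};

  attr.prog_bind_map.prog_fd = prog_fd;   /* program that should hold a reference */
  attr.prog_bind_map.map_fd = map_fd;     /* map to add to the program's used_maps */

  err = syscall(__NR_bpf, BPF_PROG_BIND_MAP, &attr, sizeof(attr));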

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-3-sdf@google.com
2020-09-15 18:28:27 -07:00
YiFei Zhu 984fe94f94 bpf: Mutex protect used_maps array and count
To support modifying the used_maps array, we use a mutex to protect
the use of the counter and the array. The mutex is initialized right
after the prog aux is allocated, and destroyed right before prog
aux is freed. This way we guarantee it's initialized for both cBPF
and eBPF.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
2020-09-15 18:28:27 -07:00
Jakub Kicinski 44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Linus Torvalds 3e8d3bdc2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
    Kivilinna.

 2) Fix loss of RTT samples in rxrpc, from David Howells.

 3) Memory leak in hns_nic_dev_probe(), from Dinghao Liu.

 4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.

 5) We disable BH for too long in sctp_get_port_local(), add a
    cond_resched() here as well, from Xin Long.

 6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.

 7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
    Yonghong Song.

 8) Missing of_node_put() in mt7530 DSA driver, from Sumera
    Priyadarsini.

 9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.

10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.

11) Memory leak in rxkad_verify_response, from Dinghao Liu.

12) In tipc, don't use smp_processor_id() in preemptible context. From
    Tuong Lien.

13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.

14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.

15) Fix ABI mismatch between driver and firmware in nfp, from Louis
    Peens.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
  net/smc: fix sock refcounting in case of termination
  net/smc: reset sndbuf_desc if freed
  net/smc: set rx_off for SMCR explicitly
  net/smc: fix toleration of fake add_link messages
  tg3: Fix soft lockup when tg3_reset_task() fails.
  doc: net: dsa: Fix typo in config code sample
  net: dp83867: Fix WoL SecureOn password
  nfp: flower: fix ABI mismatch between driver and firmware
  tipc: fix shutdown() of connectionless socket
  ipv6: Fix sysctl max for fib_multipath_hash_policy
  drivers/net/wan/hdlc: Change the default of hard_header_len to 0
  net: gemini: Fix another missing clk_disable_unprepare() in probe
  net: bcmgenet: fix mask check in bcmgenet_validate_flow()
  amd-xgbe: Add support for new port mode
  net: usb: dm9601: Add USB ID of Keenetic Plus DSL
  vhost: fix typo in error message
  net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
  pktgen: fix error message with wrong function name
  net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
  cxgb4: fix thermal zone device registration
  ...
2020-09-03 18:50:48 -07:00
Alexei Starovoitov 1e6c62a882 bpf: Introduce sleepable BPF programs
Introduce sleepable BPF programs that can request such property for themselves
via the BPF_F_SLEEPABLE flag at program load time. In such a case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.
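
As a rough illustration, a sleepable program could look like the sketch
below. It assumes the ".s" section suffix that libbpf uses to set
BPF_F_SLEEPABLE and an attach point that is on the sleepable allowlist;
names are illustrative:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char buf[64];

  SEC("lsm.s/bprm_committed_creds")       /* ".s" marks the program sleepable */
  int BPF_PROG(handle_exec, struct linux_binprm *bprm)
  {
          /* bpf_copy_from_user() may fault and sleep, so it is only
           * available to sleepable programs */
          bpf_copy_from_user(buf, sizeof(buf), (void *)bprm->p);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";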

The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect the lifetime of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the lifetime of sleepable progs.

There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the lifetime of the program.
The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types wait for programs to complete via
synchronize_rcu_tasks_trace().

Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.

This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
2020-08-28 21:20:33 +02:00
Martin KaFai Lau f4d0525921 bpf: Add map_meta_equal map ops
Some properties of the inner map are used at verification time.
When an inner map is inserted into an outer map at runtime,
bpf_map_meta_equal() is currently used to ensure those properties
of the inserted inner map stay the same as at verification
time.

In particular, the current bpf_map_meta_equal() checks max_entries which
turns out to be too restrictive for most of the maps which do not use
max_entries during verification time.  It limits the use case that
wants to replace a smaller inner map with a larger inner map.  There are
some maps that do use max_entries during verification though.  For example,
the map_gen_lookup in array_map_ops uses the max_entries to generate
the inline lookup code.

To accommodate differences between maps, the map_meta_equal is added
to bpf_map_ops.  Each map-type can decide what to check when its
map is used as an inner map during runtime.
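
For most map types the opt-in is a one-line addition to the ops struct,
roughly as in the abbreviated sketch below (the real ops structs carry
many more callbacks):

  const struct bpf_map_ops array_map_ops = {
          .map_meta_equal = bpf_map_meta_equal,   /* opt in to inner-map use */
          .map_alloc_check = array_map_alloc_check,
          .map_alloc = array_map_alloc,
          /* ... remaining ops unchanged ... */
  };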

Also, some map types cannot be used as an inner map and they are
currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
It is not unusual that new map types may not be aware that such a
blacklist exists.  This patch enforces an explicit opt-in
and only allows a map to be used as an inner map if it has
implemented the map_meta_equal ops.  It is based on the
discussion in [1].

All maps that support being used as an inner map have their map_meta_equal
pointing to bpf_map_meta_equal in this patch.  A later patch will
relax the max_entries check for most maps.  bpf_types.h
counts 28 map types.  This patch adds 23 ".map_meta_equal"
by using coccinelle.  -5 for
	BPF_MAP_TYPE_PROG_ARRAY
	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
	BPF_MAP_TYPE_STRUCT_OPS
	BPF_MAP_TYPE_ARRAY_OF_MAPS
	BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
KP Singh 8ea636848a bpf: Implement bpf_local_storage for inodes
Similar to bpf_local_storage for sockets, add local storage for inodes.
The life-cycle of storage is managed with the life-cycle of the inode.
i.e. the storage is destroyed along with the owning inode.

The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the
security blob, which is now stackable and can co-exist with other LSMs.
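
A rough sketch of how a BPF LSM program might use the new map type (map,
program and hook names are illustrative; the flag name follows the shared
local-storage uapi):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct {
          __uint(type, BPF_MAP_TYPE_INODE_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, __u64);
  } inode_storage SEC(".maps");

  SEC("lsm/inode_unlink")
  int BPF_PROG(unlink_hook, struct inode *dir, struct dentry *victim)
  {
          __u64 *cnt;

          /* storage is created on first access and freed with the inode */
          cnt = bpf_inode_storage_get(&inode_storage, victim->d_inode, 0,
                                      BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (cnt)
                  __sync_fetch_and_add(cnt, 1);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";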

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-6-kpsingh@chromium.org
2020-08-25 15:00:04 -07:00
Yonghong Song b474959d5a bpf: Fix a buffer out-of-bound access when filling raw_tp link_info
Commit f2e10bff16 ("bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link")
added link query for raw_tp. One of the fields in link_info is used to
fill a user buffer with tp_name. The current check only
declares "ulen && !ubuf" as invalid. So "!ulen && ubuf" will be
valid. Later on, we do "copy_to_user(ubuf, tp_name, ulen - 1)" which
may overwrite user memory incorrectly.

This patch fixed the problem by disallowing "!ulen && ubuf" case as well.

Fixes: f2e10bff16 ("bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200821191054.714731-1-yhs@fb.com
2020-08-24 21:03:07 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and their variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
markings where they are unnecessary.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
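
For illustration, a typical conversion looks like this (state and handler
names are made up):

  switch (state) {
  case STATE_PREP:
          prep();
          fallthrough;    /* used to be a "fall through" comment */
  case STATE_RUN:
          run();
          break;
  default:
          break;
  }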

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Lorenz Bauer 13b79d3ffb bpf: sockmap: Call sock_map_update_elem directly
Don't go via map->ops to call sock_map_update_elem, since we know
what function to call in bpf_map_update_value. Since we currently
don't allow calling map_update_elem from BPF context, we can remove
ops->map_update_elem and rename the function to sock_map_update_elem_sys.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-4-lmb@cloudflare.com
2020-08-21 15:16:11 -07:00
Alexei Starovoitov 005142b8a1 bpf: Factor out bpf_link_by_id() helper.
Refactor the code a bit to extract bpf_link_by_id() helper.
It's similar to existing bpf_prog_by_id().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200819042759.51280-2-alexei.starovoitov@gmail.com
2020-08-20 16:02:36 +02:00
Yonghong Song 5e7b30205c bpf: Change uapi for bpf iterator map elements
Commit a5cbe05a66 ("bpf: Implement bpf iterator for
map elements") added bpf iterator support for
map elements. The map element bpf iterator requires
info to identify a particular map. In the above
commit, the attr->link_create.target_fd is used
to carry map_fd and an enum bpf_iter_link_info
is added to uapi to specify the target_fd actually
representing a map_fd:
    enum bpf_iter_link_info {
	BPF_ITER_LINK_UNSPEC = 0,
	BPF_ITER_LINK_MAP_FD = 1,

	MAX_BPF_ITER_LINK_INFO,
    };

This is an extensible approach as we can grow
enumerator for pid, cgroup_id, etc. and we can
unionize target_fd for pid, cgroup_id, etc.
But in the future, there are chances that
more complex customization may happen, e.g.,
for tasks, it could be filtered based on
both cgroup_id and user_id.

This patch changed the uapi to have fields
	__aligned_u64	iter_info;
	__u32		iter_info_len;
for additional iter_info for link_create.
The iter_info is defined as
	union bpf_iter_link_info {
		struct {
			__u32   map_fd;
		} map;
	};

So future extension for additional customization
will be easier. The bpf_iter_link_info will be
passed to the target callback to validate, and the
generic bpf_iter framework does not need to deal
with it any more.

Note that map_fd = 0 will be considered invalid
and -EBADF will be returned to user space.
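
A sketch of the resulting user-space call for a map-element iterator
(field names per this uapi; variable names are illustrative):

  union bpf_iter_link_info linfo = {};
  union bpf_attr attr = {};

  linfo.map.map_fd = map_fd;                       /* map whose elements to iterate */

  attr.link_create.prog_fd = iter_prog_fd;
  attr.link_create.attach_type = BPF_TRACE_ITER;
  attr.link_create.iter_info = (__u64)(unsigned long)&linfo;
  attr.link_create.iter_info_len = sizeof(linfo);

  link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));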

Fixes: a5cbe05a66 ("bpf: Implement bpf iterator for map elements")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200805055056.1457463-1-yhs@fb.com
2020-08-06 16:39:14 -07:00
Andrii Nakryiko 73b11c2ab0 bpf: Add support for forced LINK_DETACH command
Add LINK_DETACH command to force-detach bpf_link without destroying it. It has
the same behavior as auto-detaching of bpf_link due to cgroup dying for
bpf_cgroup_link or net_device being destroyed for bpf_xdp_link. In such a case,
bpf_link is still a valid kernel object, but is defunct and doesn't hold a BPF
program attached to the corresponding BPF hook. This functionality allows users
with enough access rights to manually force-detach an attached bpf_link without
killing the respective owner process.
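
A minimal sketch of the new command (link_fd is assumed to reference an
attached bpf_link):

  union bpf_attr attr = {};

  attr.link_detach.link_fd = link_fd;     /* link to force-detach */
  err = syscall(__NR_bpf, BPF_LINK_DETACH, &attr, sizeof(attr));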

This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly
re-using existing link release handling code.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com
2020-08-01 20:38:28 -07:00
Andrii Nakryiko 310ad7970a bpf: Fix build without CONFIG_NET when using BPF XDP link
The entire net/core subsystem is not built without CONFIG_NET. linux/netdevice.h
just assumes that it's always there, so the easiest way to fix this is to
conditionally compile out bpf_xdp_link_attach() use in bpf/syscall.c.

Fixes: aa8d3a716b ("bpf, xdp: Add bpf_link-based XDP attachment API")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728190527.110830-1-andriin@fb.com
2020-07-29 00:29:00 +02:00
Andrii Nakryiko aa8d3a716b bpf, xdp: Add bpf_link-based XDP attachment API
Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through
BPF_LINK_CREATE command.
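
A rough sketch of the user-space side in raw syscall form (ifindex
identifies the target net_device; variable names are illustrative):

  union bpf_attr attr = {};

  attr.link_create.prog_fd = xdp_prog_fd;
  attr.link_create.target_fd = ifindex;            /* netdev to attach to */
  attr.link_create.attach_type = BPF_XDP;

  link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));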

bpf_xdp_link is mutually exclusive with direct BPF program attachment; a
previously attached BPF program should be detached prior to attempting to
create a new
bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it
can't be replaced by other BPF program attachment or link attachment. It will
be detached only when the last BPF link FD is closed.

bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to
how other BPF links behave (cgroup, flow_dissector). At that point bpf_link
will become defunct, but won't be destroyed until last FD is closed.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com
2020-07-25 20:37:02 -07:00
Alexei Starovoitov a228a64fc1 bpf: Add bpf_prog iterator
It's mostly a copy paste of commit 6086d29def ("bpf: Add bpf_map iterator")
that is used to implement bpf_seq_file operations to traverse all bpf programs.

v1->v2: Tweak to use build time btf_id

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2020-07-25 20:16:32 -07:00
Jakub Sitnicki e9ddbb7707 bpf: Introduce SK_LOOKUP program type with a dedicated attach point
Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
when looking up a listening socket for a new connection request for
connection oriented protocols, or when looking up an unconnected socket for
a packet for connection-less protocols.

When called, the SK_LOOKUP BPF program can select a socket that will receive
the packet. This serves as a mechanism to overcome the limits of what the
bind() API allows to express. Two use-cases driving this work are:

 (1) steer packets destined to an IP range, on fixed port to a socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, on any port to a socket

     198.51.100.1, any port -> L7 proxy socket

In its run-time context the program receives information about the packet that
triggered the socket lookup, namely IP version, L4 protocol identifier, and
address 4-tuple. The context can be further extended to include ingress
interface identifier.

To select a socket BPF program fetches it from a map holding socket
references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
helper to record the selection. Transport layer then uses the selected
socket as a result of socket lookup.

In its basic form, SK_LOOKUP acts as a filter and hence must return either
SK_PASS or SK_DROP. If the program returns with SK_PASS, transport should
look for a socket to receive the packet, or use the one selected by the
program if available, while SK_DROP informs the transport layer that the
lookup should fail.
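
A rough sketch of use-case (1), assuming the serving socket has already
been inserted into a SOCKMAP by user space (names, constants and section
naming are illustrative):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  struct {
          __uint(type, BPF_MAP_TYPE_SOCKMAP);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  } redir_map SEC(".maps");

  SEC("sk_lookup")
  int steer_http(struct bpf_sk_lookup *ctx)
  {
          struct bpf_sock *sk;
          __u32 key = 0;

          /* only steer packets to 192.0.2.0/24, port 80 */
          if (ctx->family != 2 /* AF_INET */ || ctx->local_port != 80 ||
              (ctx->local_ip4 & bpf_htonl(0xffffff00)) != bpf_htonl(0xc0000200))
                  return SK_PASS;

          sk = bpf_map_lookup_elem(&redir_map, &key);
          if (!sk)
                  return SK_PASS;

          bpf_sk_assign(ctx, sk, 0);
          bpf_sk_release(sk);
          return SK_PASS;
  }

  char LICENSE[] SEC("license") = "GPL";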

This patch only enables the user to attach an SK_LOOKUP program to a
network namespace. Subsequent patches hook it up to run on local delivery
path in ipv4 and ipv6 stacks.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717103536.397595-3-jakub@cloudflare.com
2020-07-17 20:18:16 -07:00
David S. Miller 07dd1b7e68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-07-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 36 non-merge commits during the last 7 day(s) which contain
a total of 62 files changed, 2242 insertions(+), 468 deletions(-).

The main changes are:

1) Avoid trace_printk warning banner by switching bpf_trace_printk to use
   its own tracing event, from Alan.

2) Better libbpf support on older kernels, from Andrii.

3) Additional AF_XDP stats, from Ciara.

4) build time resolution of BTF IDs, from Jiri.

5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13 18:04:05 -07:00
Linus Torvalds 5a764898af Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking
    BPF programs, from Maciej Żenczykowski.

 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu
    Mariappan.

 3) Slay memory leak in nl80211 bss color attribute parsing code, from
    Luca Coelho.

 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin.

 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric
    Dumazet.

 6) xsk code dips too deeply into DMA mapping implementation internals.
    Add dma_need_sync and use it. From Christoph Hellwig

 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko.

 8) Check for disallowed attributes when loading flow dissector BPF
    programs. From Lorenz Bauer.

 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from
    Jason A. Donenfeld.

10) Don't advertise checksum offload on ipa devices that don't support
    it. From Alex Elder.

11) Resolve several issues in TCP MD5 signature support. Missing memory
    barriers, bogus options emitted when using syncookies, and failure
    to allow md5 key changes in established states. All from Eric
    Dumazet.

12) Fix interface leak in hsr code, from Taehee Yoo.

13) VF reset fixes in hns3 driver, from Huazhong Tan.

14) Make loopback work again with ipv6 anycast, from David Ahern.

15) Fix TX starvation under high load in fec driver, from Tobias
    Waldekranz.

16) MLD2 payload lengths not checked properly in bridge multicast code,
    from Linus Lüssing.

17) Packet scheduler code that wants to find the inner protocol
    currently only works for one level of VLAN encapsulation. Allow
    Q-in-Q situations to work properly here, from Toke
    Høiland-Jørgensen.

18) Fix route leak in l2tp, from Xin Long.

19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport
    support and various protocols. From Martin KaFai Lau.

20) Fix socket cgroup v2 reference counting in some situations, from
    Cong Wang.

21) Cure memory leak in mlx5 connection tracking offload support, from
    Eli Britstein.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
  mlxsw: pci: Fix use-after-free in case of failed devlink reload
  mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON()
  net: macb: fix call to pm_runtime in the suspend/resume functions
  net: macb: fix macb_suspend() by removing call to netif_carrier_off()
  net: macb: fix macb_get/set_wol() when moving to phylink
  net: macb: mark device wake capable when "magic-packet" property present
  net: macb: fix wakeup test in runtime suspend/resume routines
  bnxt_en: fix NULL dereference in case SR-IOV configuration fails
  libbpf: Fix libbpf hashmap on (I)LP32 architectures
  net/mlx5e: CT: Fix memory leak in cleanup
  net/mlx5e: Fix port buffers cell size value
  net/mlx5e: Fix 50G per lane indication
  net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash
  net/mlx5e: Fix VXLAN configuration restore after function reload
  net/mlx5e: Fix usage of rcu-protected pointer
  net/mxl5e: Verify that rpriv is not NULL
  net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode
  net/mlx5: Fix eeprom support for SFP module
  cgroup: Fix sock_cgroup_data on big-endian.
  selftests: bpf: Fix detach from sockmap tests
  ...
2020-07-10 18:16:22 -07:00
Kees Cook 6396026045 bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok()
When evaluating access control over kallsyms visibility, credentials at
open() time need to be used, not the "current" creds (though in BPF's
case, this has likely always been the same). Plumb access to associated
file->f_cred down through bpf_dump_raw_ok() and its callers now that
kallsyms_show_value() has been refactored to take struct cred.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 7105e828c0 ("bpf: allow for correlation of maps and helpers in dump")
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08 16:01:21 -07:00
Stanislav Fomichev f5836749c9 bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook
Sometimes it's handy to know when the socket gets freed. In
particular, we'd like to try to use a smarter allocation of
ports for bpf_bind and explore the possibility of limiting
the number of SOCK_DGRAM sockets the process can have.

Implement BPF_CGROUP_INET_SOCK_RELEASE hook that triggers on
inet socket release. It triggers only for userspace sockets
(not in-kernel ones) and therefore has the same semantics as
the existing BPF_CGROUP_INET_SOCK_CREATE.
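
A minimal sketch of a program for the new hook (the section name follows
libbpf convention; the counter is illustrative):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  __u64 dgram_released;

  SEC("cgroup/sock_release")
  int count_release(struct bpf_sock *ctx)
  {
          if (ctx->type == 2 /* SOCK_DGRAM */)
                  __sync_fetch_and_add(&dgram_released, 1);
          return 1;
  }

  char LICENSE[] SEC("license") = "GPL";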

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200706230128.4073544-2-sdf@google.com
2020-07-08 01:03:31 +02:00