Commit Graph

596 Commits

Viktor Malik 1cf517d9a1
bpf: introduce BPF token object
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 4527358b76861dfd64ee34aba45d81648fbc8a61
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Nov 30 10:52:15 2023 -0800

    bpf: introduce BPF token object
    
    Add a new kind of BPF kernel object, the BPF token. A BPF token is meant
    to allow delegating privileged BPF functionality, like loading a BPF
    program or creating a BPF map, from a privileged process to a *trusted*
    unprivileged process, all while retaining a good amount of control over
    which privileged operations may be performed using the provided BPF token.
    
    This is achieved through mounting a BPF FS instance with extra delegation
    mount options, which determine what operations are delegatable, and also
    constraining it to the owning user namespace (as mentioned in the
    previous patch).
    
    The BPF token itself is just a derivative of BPF FS and can be created
    through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts a
    BPF FS FD obtainable through the open() API by opening the BPF FS mount
    point. Currently, a BPF token "inherits" the delegated command, map type,
    prog type, and attach type bit sets from BPF FS as is. In the future,
    having a BPF token as a separate object with its own FD, we can further
    restrict the BPF token's allowable set of operations either at creation
    time or after the fact, allowing the process to guard itself further
    from unintentionally trying to load undesired kinds of BPF programs.
    But for now we keep things simple and just copy the bit sets as is.
    
    When a BPF token is created from a BPF FS mount, we take a reference to
    the BPF super block's owning user namespace, and then use that namespace
    for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
    capabilities that are normally only checked against the init userns (using
    capable()); now we check them using ns_capable() instead (if a BPF
    token is provided). See bpf_token_capable() for details.
    
    Such a setup means that a BPF token in itself is not sufficient to grant
    BPF functionality. A user-namespaced process has to *also* have the
    necessary combination of capabilities inside that user namespace. So
    while previously CAP_BPF was useless when granted within a user
    namespace, now it gains a meaning and allows container managers and
    sysadmins flexible control over which processes can and need to use BPF
    functionality within the user namespace (i.e., a container in practice).
    And BPF FS delegation mount options and derived BPF tokens serve as
    a per-container "flag" to grant the overall ability to use bpf() (plus
    further restriction of which parts of the bpf() syscall are treated as
    namespaced).
    
    Note also that the BPF_TOKEN_CREATE command itself requires
    ns_capable(CAP_BPF) within the BPF FS owning user namespace, rounding
    out the ns_capable() story of the BPF token.
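    
    For illustration, a minimal user-space sketch of the flow described
    above. BPF_TOKEN_CREATE and the token_create.bpffs_fd attribute only
    exist with this series applied, so treat the field name as an
    assumption rather than settled UAPI:
    
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>
    
      /* Open the delegating BPF FS mount point and derive a token from it. */
      static int bpf_token_create_from(const char *bpffs_path)
      {
              union bpf_attr attr = {};
              int bpffs_fd, token_fd;
    
              bpffs_fd = open(bpffs_path, O_RDONLY);
              if (bpffs_fd < 0)
                      return -1;
    
              /* field name per this series; assumption */
              attr.token_create.bpffs_fd = bpffs_fd;
              token_fd = syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
    
              close(bpffs_fd);
              return token_fd;
      }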
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:08 +02:00
Viktor Malik 6e7fe1a96c
bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 909fa05dd3c181e5b403912889057f7cdbf3906c
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Nov 30 10:52:13 2023 -0800

    bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
    
    Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
    compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
    CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
    takes over and grants whatever part of BPF syscall is required.
    
    Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
    One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
    BPF_PROG_LOAD, if the type of BPF program is "network-related" either
    CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
    
    But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
    is set:
      - when creating DEVMAP/XSKMAP/CPU_MAP maps;
      - when attaching CGROUP_SKB programs;
      - when handling BPF_PROG_QUERY command.
    
    This patch changes the latter three cases to follow the BPF_PROG_LOAD
    model, that is, allowing them to proceed under either CAP_NET_ADMIN or
    CAP_SYS_ADMIN.
    
    This also makes it cleaner in subsequent BPF token patches to switch
    wholesale to a generic bpf_token_capable(int cap) check, which always
    falls back to CAP_SYS_ADMIN if the requested capability is missing.
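    
    A minimal sketch of the unified check pattern this converges on (the
    exact call sites differ per command; this is illustrative only):
    
      /* Proceed if the caller has either CAP_NET_ADMIN or CAP_SYS_ADMIN,
       * mirroring how CAP_BPF/CAP_PERFMON already fall back to CAP_SYS_ADMIN.
       */
      if (!capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
              return -EPERM;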
    
    Cc: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:08 +02:00
Viktor Malik a7312d7fb3
bpf: Optimize the free of inner map
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit af66bfd3c8538ed21cf72af18426fc4a408665cf
Author: Hou Tao <houtao1@huawei.com>
Date:   Mon Dec 4 22:04:23 2023 +0800

    bpf: Optimize the free of inner map
    
    When removing the inner map from the outer map, the inner map will be
    freed after one RCU grace period and one RCU tasks trace grace period,
    so it is certain that any bpf program which may access the inner map
    has exited before the inner map is freed.
    
    However, there is no need to wait for an RCU tasks trace grace period if
    the outer map is only accessed by non-sleepable programs. So add
    sleepable_refcnt to bpf_map and increase sleepable_refcnt when adding
    the outer map into env->used_maps for a sleepable program. Although the
    max number of loaded bpf programs is INT_MAX - 1, the number of bpf
    programs which are being loaded may be greater than INT_MAX, so use
    atomic64_t instead of atomic_t for sleepable_refcnt. When removing the
    inner map from the outer map, use sleepable_refcnt to decide whether or
    not an RCU tasks trace grace period is needed before freeing the inner map.
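    
    An illustrative sketch of that decision on the inner-map release path
    (the callback names are assumptions; only the sleepable_refcnt test is
    the point here):
    
      /* Only wait for an RCU tasks trace grace period if a sleepable
       * program could still be accessing the outer map.
       */
      if (atomic64_read(&map->sleepable_refcnt))
              call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
      else
              call_rcu(&map->rcu, bpf_map_free_rcu_gp);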
    
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:04 +02:00
Viktor Malik e9ec28c956
bpf: Defer the free of inner map when necessary
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 876673364161da50eed6b472d746ef88242b2368
Author: Hou Tao <houtao1@huawei.com>
Date:   Mon Dec 4 22:04:22 2023 +0800

    bpf: Defer the free of inner map when necessary
    
    When updating or deleting an inner map in a map array or a map htab, the
    map may still be accessed by a non-sleepable or a sleepable program.
    However, bpf_map_fd_put_ptr() decreases the ref-counter of the inner map
    directly through bpf_map_put(); if the ref-counter is the last one
    (which is true in most cases), the inner map will be freed by
    ops->map_free() in a kworker. But for now, most .map_free() callbacks
    don't use synchronize_rcu() or its variants to wait for the elapse of an
    RCU grace period, so after the invocation of ops->map_free completes,
    a bpf program which is accessing the inner map may incur a
    use-after-free problem.
    
    Fix the free of the inner map by invoking bpf_map_free_deferred() after
    both one RCU grace period and one tasks trace RCU grace period if the
    inner map has been removed from the outer map before. The deferment is
    accomplished by using call_rcu() or call_rcu_tasks_trace() when
    releasing the last ref-counter of the bpf map. The newly-added rcu_head
    field in bpf_map shares the same storage space with the work field to
    reduce the size of bpf_map.
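    
    A sketch of the storage-sharing arrangement described above (member
    placement within struct bpf_map is illustrative):
    
      struct bpf_map {
              ...
              union {
                      struct work_struct work;  /* deferred free via kworker */
                      struct rcu_head rcu;      /* reused for the RCU grace periods */
              };
              ...
      };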
    
    Fixes: bba1dc0b55 ("bpf: Remove redundant synchronize_rcu.")
    Fixes: 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:04 +02:00
Viktor Malik 20f6073121
bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit ff8867af01daa7ea770bebf5f91199b7434b74e5
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Fri Nov 17 09:14:04 2023 -0800

    bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
    
    Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
    BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].
    
    A few selftests and veristat need to be adjusted in the same patch as
    well.
    
      [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:51:50 +02:00
Viktor Malik 0d8497b0c8
bpf: add register bounds sanity checks and sanitization
JIRA: https://issues.redhat.com/browse/RHEL-23644

Conflicts: changed context due to missing upstream commit
           91721c2d02d3 ("netfilter: bpf: Support
           BPF_F_NETFILTER_IP_DEFRAG in netfilter link")

commit 5f99f312bd3bedb3b266b0d26376a8c500cdc97f
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sat Nov 11 17:06:00 2023 -0800

    bpf: add register bounds sanity checks and sanitization

    Add simple sanity checks that validate well-formed ranges (min <= max)
    across u64, s64, u32, and s32 ranges. Also for cases when the value is
    constant (either 64-bit or 32-bit), we validate that ranges and tnums
    are in agreement.

    These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
    operations, on conditional jumps, and for LDX instructions (where subreg
    zero/sign extension is probably the most important to check). This
    covers most of the interesting cases.

    Also, we validate the sanity of the return register when manually
    adjusting it for some special helpers.

    By default, a sanity violation will trigger a warning in the verifier
    log and reset the register bounds to "unbounded" ones. But to aid
    development and debugging, a BPF_F_TEST_SANITY_STRICT flag is added,
    which will trigger a hard verification failure with -EFAULT on register
    bounds violations. This allows selftests to catch such issues. veristat
    will also gain a CLI option to enable this behavior.
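    
    A simplified sketch of the range checks described above; the helper
    name, the strict-flag plumbing, and the log message are assumptions,
    while the bpf_reg_state fields are the existing ones:
    
      static int reg_bounds_sanity_check(struct bpf_verifier_env *env,
                                         struct bpf_reg_state *reg, bool strict)
      {
              if (reg->umin_value > reg->umax_value ||
                  reg->smin_value > reg->smax_value ||
                  reg->u32_min_value > reg->u32_max_value ||
                  reg->s32_min_value > reg->s32_max_value) {
                      verbose(env, "verifier bug: register bounds violation\n");
                      if (strict)                 /* BPF_F_TEST_SANITY_STRICT */
                              return -EFAULT;
                      __mark_reg_unbounded(reg);  /* default: warn and reset */
              }
              return 0;
      }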

    Acked-by: Eduard Zingerman <eddyz87@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
    Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:51:47 +02:00
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped hunks touching code we don't support when they didn't apply cleanly, left the rest for the sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.
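    
    A before/after sketch of the conversion this patch performs (the VM_*
    flags are chosen purely for illustration):
    
      /* before: open-coded modification of vma->vm_flags */
      vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
      vma->vm_flags &= ~VM_MAYWRITE;
    
      /* after: tracked modifier calls */
      vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
      vm_flags_clear(vma, VM_MAYWRITE);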

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Felix Maurer 20b979d4f1 bpf: Refuse unused attributes in bpf_prog_{attach,detach}
JIRA: https://issues.redhat.com/browse/RHEL-28590

commit ba62d61128bda71fd02622f320ac59d861fc4baa
Author: Lorenz Bauer <lmb@isovalent.com>
Date:   Sat Oct 7 00:06:51 2023 +0200

    bpf: Refuse unused attributes in bpf_prog_{attach,detach}
    
    The recently added tcx attachment extended the BPF UAPI for attaching and
    detaching by a couple of fields. Those fields are currently only supported
    for tcx; other types like cgroups and flow dissector silently ignore the
    new fields except for the new flags.
    
    This is problematic once we extend bpf_mprog to older attachment types, since
    it's hard to figure out whether the syscall really was successful if the
    kernel silently ignores non-zero values.
    
    Explicitly reject non-zero fields relevant to bpf_mprog for attachment types
    which don't use the latter yet.
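    
    A sketch of the kind of check this adds for attach types that don't use
    bpf_mprog yet (the fields shown are the tcx-era additions to union
    bpf_attr; the exact set rejected per command is an assumption):
    
      /* Attach types that don't use bpf_mprog must not accept its fields. */
      if (attr->relative_fd ||
          attr->expected_revision)
              return -EINVAL;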
    
    Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
    Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
    Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:30:18 +02:00
Felix Maurer cbdf6d7deb bpf: Add fd-based tcx multi-prog infra with link support
JIRA: https://issues.redhat.com/browse/RHEL-28590
Conflicts:
- MAINTAINERS: The file has been restructured upstream, but this is not
  relevant for us. All paths are already covered.
- include/linux/netdevice.h: We have excluded TC from kABI with
  845ad79d11 ("net: exclude TC from kABI"). Keep this exclusion.
- include/linux/skbuff.h: The order of the fields has been changed upstream
  in c0ba861117c3 ("net: skbuff: move the fields BPF cares about directly
  next to the offset marker"). The actual change is just changing config
  options. Do this instead of picking the field reordering to make
  backporting easier.
- include/uapi/linux/bpf.h and tools/include/uapi/linux/bpf.h: The changes
  to these files were already backported through 1d5bff6a09 ("bpf: Add
  fd-based tcx multi-prog infra with link support") to keep UAPI close to
  upstream.
- kernel/bpf/syscall.c: Already backported 58ff9f1ec9 ("bpf: Add
  attach_type checks under bpf_prog_attach_check_attach_type") moves one
  switch block around. The case BPF_PROG_TYPE_SCHED_CLS was added during
  that backport, therefore this hunk is missing now. This also causes
  context differences.
- kernel/bpf/syscall.c: Already backported 81b5cf0a11 ("bpf: Fix
  BPF_PROG_QUERY last field check") fixed the QUERY_LAST_FIELD.

commit e420bed025071a623d2720a92bc2245c84757ecb
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Wed Jul 19 16:08:52 2023 +0200

    bpf: Add fd-based tcx multi-prog infra with link support

    This work refactors and adds a lightweight extension ("tcx") to the tc BPF
    ingress and egress data path side for allowing BPF program management based
    on fds via bpf() syscall through the newly added generic multi-prog API.
    The main goal behind this work which we also presented at LPC [0] last year
    and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
    BPF link functionality for tc BPF programs, which allows for a model of safe
    ownership and program detachment.

    Given the rise in tc BPF users in cloud native environments, this becomes
    necessary to avoid hard-to-debug incidents either through stale leftover
    programs or 3rd party applications accidentally stepping on each other's toes.
    As a recap, a BPF link represents the attachment of a BPF program to a BPF
    hook point. The BPF link holds a single reference to keep BPF program alive.
    Moreover, hook points do not reference a BPF link, only the application's
    fd or pinning does. A BPF link holds meta-data specific to attachment and
    implements operations for link creation, (atomic) BPF program update,
    detachment and introspection. The motivation for BPF links for tc BPF programs
    is multi-fold, for example:

      - From Meta: "It's especially important for applications that are deployed
        fleet-wide and that don't "control" hosts they are deployed to. If such
        application crashes and no one notices and does anything about that, BPF
        program will keep running draining resources or even just, say, dropping
        packets. We at FB had outages due to such permanent BPF attachment
        semantics. With fd-based BPF link we are getting a framework, which allows
        safe, auto-detachable behavior by default, unless application explicitly
        opts in by pinning the BPF link." [1]

      - From Cilium-side the tc BPF programs we attach to host-facing veth devices
        and phys devices build the core datapath for Kubernetes Pods, and they
        implement forwarding, load-balancing, policy, EDT-management, etc, within
        BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
        experienced hard-to-debug issues in a user's staging environment where
        another Kubernetes application using tc BPF attached to the same prio/handle
        of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
        it. The goal is to establish a clear/safe ownership model via links which
        cannot accidentally be overridden. [0,2]

    BPF links for tc can co-exist with non-link attachments, and the semantics are
    in line also with XDP links: BPF links cannot replace other BPF links, BPF
    links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
    lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
    would solve mentioned issue of safe ownership model as 3rd party applications
    would not be able to accidentally wipe Cilium programs, even if they are not
    BPF link aware.

    Earlier attempts [4] have tried to integrate BPF links into core tc machinery
    to solve cls_bpf, which has been intrusive to the generic tc kernel API with
    extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
    be wiped from the qdisc also. Locking a tc BPF program in place this way, is
    getting into layering hacks given the two object models are vastly different.

    We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
    attach API, so that the BPF link implementation blends in naturally similar to
    other link types which are fd-based and without the need for changing core tc
    internal APIs. BPF programs for tc can then be successively migrated from classic
    cls_bpf to the new tc BPF link without needing to change the program's source
    code; adapting just the BPF loader mechanics for attaching is sufficient.

    For the current tc framework, there is no change in behavior with this change
    and neither does this change touch on tc core kernel APIs. The gist of this
    patch is that the ingress and egress hook have a lightweight, qdisc-less
    extension for BPF to attach its tc BPF programs, in other words, a minimal
    entry point for tc BPF. The name tcx has been suggested from discussion of
    earlier revisions of this work as a good fit, and to more easily differ between
    the classic cls_bpf attachment and the fd-based one.

    For the ingress and egress tcx points, the device holds a cache-friendly array
    with program pointers which is separated from control plane (slow-path) data.
    Earlier versions of this work used priority to determine ordering and expression
    of dependencies similar as with classic tc, but it was challenged that for
    something more future-proof a better user experience is required. Hence this
    resulted in the design and development of the generic attach/detach/query API
    for multi-progs. See prior patch with its discussion on the API design. tcx is
    the first user and later we plan to integrate also others, for example, one
    candidate is multi-prog support for XDP which would benefit and have the same
    'look and feel' from API perspective.

    The goal with tcx is to have maximum compatibility with existing tc BPF programs,
    so they don't need to be rewritten specifically. Compatibility to call into
    classic tcf_classify() is also provided in order to allow successive migration,
    or for both to cleanly co-exist where needed, given it's all one logical tc layer
    and the tcx plus classic tc cls/act build one logical overall processing pipeline.

    tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
    to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
    The fd-based API is behind a static key, so that when unused the code is also
    not entered. The struct tcx_entry's program array is currently static, but
    could be made dynamic if necessary at a point in future. The a/b pair swap
    design has been chosen so that for detachment there are no allocations which
    otherwise could fail.
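    
    For illustration, a minimal tc BPF program using the new return codes;
    the "tcx/ingress" section name is the libbpf convention for the
    fd-based hook and is assumed here:
    
      #include <linux/bpf.h>
      #include <linux/pkt_cls.h>   /* TCX_NEXT, TCX_PASS, TCX_DROP, TCX_REDIRECT */
      #include <bpf/bpf_helpers.h>
    
      SEC("tcx/ingress")
      int tcx_noop(struct __sk_buff *skb)
      {
              /* Non-terminating: let the next program in the mprog array run. */
              return TCX_NEXT;
      }
    
      char LICENSE[] SEC("license") = "GPL";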

    The work has been tested with tc-testing selftest suite which all passes, as
    well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

    Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
    of this work.

      [0] https://lpc.events/event/16/contributions/1353/
      [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
      [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
      [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
      [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:21:22 +02:00
Artem Savkov c10287edfe bpf: Add missed value to kprobe perf link info
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 3acf8ace68230e9558cf916847f1cc9f208abdf1
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Sep 20 23:31:39 2023 +0200

    bpf: Add missed value to kprobe perf link info
    
    Add a missed value to the info of kprobes attached through perf links,
    to hold the stats of missed kprobe handler executions.
    
    The kprobe's missed counter gets incremented when the kprobe handler
    is not executed due to another kprobe running on the same cpu.
    
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 11:23:41 +01:00
Artem Savkov b703028f69 bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit e383a45902337356d9ccad797094a27c6b2150f9
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Oct 20 21:32:01 2023 +0800

    bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()
    
    The following warning was reported when running "./test_progs -t
    test_bpf_ma/percpu_free_through_map_free":
    
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342
      CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      Workqueue: events_unbound bpf_map_free_deferred
      RIP: 0010:bpf_mem_refill+0x21c/0x2a0
      ......
      Call Trace:
       <IRQ>
       ? bpf_mem_refill+0x21c/0x2a0
       irq_work_single+0x27/0x70
       irq_work_run_list+0x2a/0x40
       irq_work_run+0x18/0x40
       __sysvec_irq_work+0x1c/0xc0
       sysvec_irq_work+0x73/0x90
       </IRQ>
       <TASK>
       asm_sysvec_irq_work+0x1b/0x20
      RIP: 0010:unit_free+0x50/0x80
       ......
       bpf_mem_free+0x46/0x60
       __bpf_obj_drop_impl+0x40/0x90
       bpf_obj_free_fields+0x17d/0x1a0
       array_map_free+0x6b/0x170
       bpf_map_free_deferred+0x54/0xa0
       process_scheduled_works+0xba/0x370
       worker_thread+0x16d/0x2e0
       kthread+0x105/0x140
       ret_from_fork+0x39/0x60
       ret_from_fork_asm+0x1b/0x30
       </TASK>
      ---[ end trace 0000000000000000 ]---
    
    The reason is simple: __bpf_obj_drop_impl() does not know that the field
    being freed is a per-cpu pointer and it uses bpf_global_ma to free the
    pointer. Because bpf_global_ma is not a per-cpu allocator, ksize() is
    used to select the corresponding cache. The bpf_mem_cache with 16-byte
    unit_size will always be selected to do the unmatched free and it will
    trigger the warning in free_bulk() eventually.
    
    Because per-cpu kptr doesn't support list or rb-tree now, fix the
    problem by only checking whether or not the type of the kptr is per-cpu
    in bpf_obj_free_fields(), and using bpf_global_percpu_ma to free these
    kptrs.
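    
    A sketch of the shape of the fix; the percpu flag plumbing and the
    exact free variant are assumptions, the allocator selection is the
    point:
    
      void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu)
      {
              if (rec)
                      bpf_obj_free_fields(rec, p);
    
              if (percpu)
                      bpf_mem_free(&bpf_global_percpu_ma, p); /* per-cpu allocator */
              else
                      bpf_mem_free(&bpf_global_ma, p);
      }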
    
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:55 +01:00
Artem Savkov 51fc10f0a1 bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.h
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit e581a3461de3f129cfe888a67d9f31086328271f
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Oct 20 21:32:00 2023 +0800

    bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.h
    
    Both syscall.c and helpers.c have the declaration of
    __bpf_obj_drop_impl(), so just move it to a common header file.
    
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20231020133202.4043247-6-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:55 +01:00
Artem Savkov f2a52b618d bpf: Implement support for adding hidden subprogs
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 335d1c5b545284d75ef96ee42e461eacefe865bb
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Wed Sep 13 01:32:00 2023 +0200

    bpf: Implement support for adding hidden subprogs
    
    Introduce support in the verifier for generating a subprogram and
    include it as part of a BPF program dynamically after the do_check phase
    is complete. The first user will be the next patch which generates
    default exception callbacks if none are set for the program. The phase
    of invocation will be do_misc_fixups. Note that this is an internal
    verifier function, and should be used with instruction blocks which
    uphold the invariants stated in check_subprogs.
    
    Since these subprogs are always appended to the end of the instruction
    sequence of the program, it becomes relatively inexpensive to do the
    related adjustments to the subprog_info of the program. Only the fake
    exit subprogram is shifted forward, making room for our new subprog.
    
    This is useful to insert a new subprogram, get it JITed, and obtain its
    function pointer. The next patch will use this functionality to insert a
    default exception callback which will be invoked after unwinding the
    stack.
    
    Note that these added subprograms are invisible to userspace, and never
    reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
    subprogram is supported, but more can be easily supported in the future.
    
    To this end, two function counts are introduced now, the existing
    func_cnt, and real_func_cnt, the latter including hidden programs. This
    allows us to convert the JIT code to use the real_func_cnt for management
    of resources while the syscall path continues working with the existing
    func_cnt.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230912233214.1518551-4-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:47 +01:00
Artem Savkov d0231b546b bpf: Add BPF_KPTR_PERCPU as a field type
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 55db92f42fe4a4ef7b4c2b4960c6212c8512dd53
Author: Yonghong Song <yonghong.song@linux.dev>
Date:   Sun Aug 27 08:27:39 2023 -0700

    bpf: Add BPF_KPTR_PERCPU as a field type
    
    BPF_KPTR_PERCPU represents a percpu field type like below
    
      struct val_t {
        ... fields ...
      };
      struct t {
        ...
        struct val_t __percpu_kptr *percpu_data_ptr;
        ...
      };
    
    where
      #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))
    
    While BPF_KPTR_REF points to a trusted kernel object or a trusted
    local object, BPF_KPTR_PERCPU points to a trusted local
    percpu object.
    
    This patch added basic support for BPF_KPTR_PERCPU
    related to percpu_kptr field parsing, recording and free operations.
    BPF_KPTR_PERCPU also supports the same map types
    as BPF_KPTR_REF does.
    
    Note that unlike a local kptr, it is possible that
    a BPF_KPTR_PERCPU struct may not contain any
    special fields like other kptr, bpf_spin_lock, bpf_list_head, etc.
    
    Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:45 +01:00
Jerome Marchand 81b5cf0a11 bpf: Fix BPF_PROG_QUERY last field check
JIRA: https://issues.redhat.com/browse/RHEL-10691

Conflicts: Change from the partially backported commit e420bed02507
("bpf: Add fd-based tcx multi-prog infra with link support")
While this patch is tagged as fixing e420bed02507, whose change to
BPF_PROG_QUERY_LAST_FIELD we didn't backport, the changes in the query
structure were actually introduced by commit 053c8e1f235d ("bpf: Add
generic attach/detach/query API for multi-progs") which we backported.

commit a4fe78386afb94780f8e6fcd10a67c4d4dfe4da8
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Sat Oct 7 00:06:49 2023 +0200

    bpf: Fix BPF_PROG_QUERY last field check

    While working on the ebpf-go [0] library integration for bpf_mprog and tcx,
    Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A
    typical workflow is to first gather the bpf_mprog count without passing program/
    link arrays, followed by the second request which contains the actual array
    pointers.

    The initial call populates count and revision fields. The second call gets
    rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to
    query.revision instead of query.link_attach_flags since the former is really
    the last member.

    It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with
    an on-stack bpf_attr that is memset() each time (and therefore query.revision
    was reset to zero).

      [0] https://ebpf-go.dev
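    
    The fix itself is a one-line macro change in kernel/bpf/syscall.c,
    roughly:
    
      -#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags
      +#define BPF_PROG_QUERY_LAST_FIELD query.revision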

    Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
    Reported-by: Lorenz Bauer <lmb@isovalent.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:05 +01:00
Jerome Marchand 2830473d37 bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 6764e767f4af1e35f87f3497e1182d945de37f93
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Aug 30 10:04:05 2023 +0200

    bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.

    __bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
    performing the recursion check which means in case of a recursion
    __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
    value.

    __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
    after the recursion check which means in case of a recursion
    __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
    look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
    isn't initialized upfront.

    Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
    set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
    Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
    a few cycles later.

    Fixes: e384c7b7b46d0 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:03 +01:00
Jerome Marchand e2cb82c5f0 bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 7645629f7dc88cd777f98970134bf1a54c8d77e3
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Aug 30 10:04:04 2023 +0200

    bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().

    If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
    0 without undoing rcu_read_lock_trace(), migrate_disable() or
    decrementing the recursion counter. This is fine in the JIT case because
    the JIT code will jump in the 0 case to the end and invoke the matching
    exit trampoline (__bpf_prog_exit_sleepable_recur()).

    This is not the case in kern_sys_bpf() which returns directly to the
    caller with an error code.

    Add __bpf_prog_exit_sleepable_recur() as cleanup in the recursion case.
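    
    A sketch of the resulting error path in kern_sys_bpf() (surrounding
    code elided; the exact return value handling is an assumption):
    
      if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) {
              /* recursion detected: undo rcu_read_lock_trace(),
               * migrate_disable() and the recursion counter before returning.
               */
              __bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx);
              bpf_prog_put(prog);
              return -EBUSY;
      }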

    Fixes: b1d18a7574d0d ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:03 +01:00
Jerome Marchand 222da990b2 bpf: Remove a WARN_ON_ONCE warning related to local kptr
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 393dc4bd92de3711012d791a93072c8cc94c4c57
Author: Yonghong Song <yonghong.song@linux.dev>
Date:   Wed Aug 23 23:34:17 2023 -0700

    bpf: Remove a WARN_ON_ONCE warning related to local kptr

    Currently, in function bpf_obj_free_fields(), for local kptr,
    a warning will be issued if the struct does not contain any
    special fields. But actually the kernel seems totally okay
    with a local kptr without any special fields. Permitting
    no special fields also aligns with future percpu kptr which
    also allows no special fields.

    Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230824063417.201925-1-yonghong.song@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:01 +01:00
Jerome Marchand b5f84f6a73 bpf: Add pid filter support for uprobe_multi link
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit b733eeade4204423711793595c3c8d78a2fa8b2e
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Aug 9 10:34:17 2023 +0200

    bpf: Add pid filter support for uprobe_multi link

    Adding support to specify a pid for the uprobe_multi link, so that the
    uprobes are created only for the task with the given pid value.
    
    Using the consumer.filter callback for that, so the task gets
    filtered during the uprobe installation.
    
    We still need to check the task at runtime in the uprobe handler,
    because the handler could get executed if there's another system-wide
    consumer on the same uprobe (thanks Oleg for the insight).

    Cc: Oleg Nesterov <oleg@redhat.com>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:28:59 +01:00
Jerome Marchand 5e4ac3ed41 bpf: Add cookies support for uprobe_multi link
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 0b779b61f651851df5c5c42938a6c441eb1b5100
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Aug 9 10:34:16 2023 +0200

    bpf: Add cookies support for uprobe_multi link

    Adding support to specify a cookies array for the uprobe_multi link.
    
    The cookies array shares indexes and length with the other uprobe_multi
    arrays (offsets/ref_ctr_offsets).
    
    The cookies[i] value defines the cookie for the i-th uprobe and will be
    returned by the bpf_get_attach_cookie helper when called from an eBPF
    program hooked to that specific uprobe.

    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:28:59 +01:00
Jerome Marchand bfe606d61a bpf: Add multi uprobe link
JIRA: https://issues.redhat.com/browse/RHEL-10691

Conflicts: Context change from missing commit 91721c2d02d3
("netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter
link")

commit 89ae89f53d201143560f1e9ed4bfa62eee34f88e
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Aug 9 10:34:15 2023 +0200

    bpf: Add multi uprobe link

    Adding a new multi uprobe link that allows attaching a bpf program
    to multiple uprobes.

    Uprobes to attach are specified via new link_create uprobe_multi
    union:

      struct {
        __aligned_u64   path;
        __aligned_u64   offsets;
        __aligned_u64   ref_ctr_offsets;
        __u32           cnt;
        __u32           flags;
      } uprobe_multi;

    Uprobes are defined for a single binary specified in path and multiple
    calling sites specified in the offsets array, with optional reference
    counters specified in the ref_ctr_offsets array. All specified arrays
    have a length of 'cnt'.
    
    The 'flags' field supports a single bit for now that marks the uprobe
    as a return probe.

    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:28:59 +01:00
Jerome Marchand 58ff9f1ec9 bpf: Add attach_type checks under bpf_prog_attach_check_attach_type
JIRA: https://issues.redhat.com/browse/RHEL-10691

Conflicts: Some changes from the missing parts of the partially
backported commit e420bed02507 ("bpf: Add fd-based tcx multi-prog infra
with link support")

commit 3505cb9fa26cfec9512744466e754a8cbc2365b0
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Aug 9 10:34:14 2023 +0200

    bpf: Add attach_type checks under bpf_prog_attach_check_attach_type

    Add extra attach_type checks from link_create under
    bpf_prog_attach_check_attach_type.

    Suggested-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230809083440.3209381-3-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:28:59 +01:00
Jerome Marchand 2842f68679 bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 0aa35162d2a1ed7ae5303b8d91f7290d3b8b9219
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Aug 13 14:18:59 2023 +0000

    bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()

    The commit 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") leads
    to the following Smatch static checker warning:

        kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe()
        error: uninitialized symbol 'type'.

    That can happen when uname is NULL. So fix it by verifying uname only when
    we really need to fill it.

    Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event")
    Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain
    Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:28:57 +01:00
Jerome Marchand 2e128f3948 bpf: Support ->fill_link_info for perf_event
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 1b715e1b0ec531fae72cd6698fe1c98affa436f8
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Jul 9 02:56:28 2023 +0000

    bpf: Support ->fill_link_info for perf_event

    By introducing support for ->fill_link_info to the perf_event link, users
    gain the ability to inspect it using `bpftool link show`. While the current
    approach involves accessing this information via `bpftool perf show`,
    consolidating link information for all link types in one place offers
    greater convenience. Additionally, this patch extends support to the
    generic perf event, which is not currently accommodated by
    `bpftool perf show`. While only the perf type and config are exposed to
    userspace, other attributes such as sample_period and sample_freq are
    ignored. It's important to note that if kptr_restrict is not permitted, the
    probed address will not be exposed, maintaining security measures.

    A new enum bpf_perf_event_type is introduced to help the user understand
    which struct is relevant.
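    
    The new UAPI enum looks roughly like this (enumerator names and values
    as introduced by this series; treat them as illustrative):
    
      enum bpf_perf_event_type {
              BPF_PERF_EVENT_UNSPEC = 0,
              BPF_PERF_EVENT_UPROBE = 1,
              BPF_PERF_EVENT_URETPROBE = 2,
              BPF_PERF_EVENT_KPROBE = 3,
              BPF_PERF_EVENT_KRETPROBE = 4,
              BPF_PERF_EVENT_TRACEPOINT = 5,
              BPF_PERF_EVENT_EVENT = 6,
      };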

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-14 15:22:25 +01:00
Jerome Marchand 1a3f7eddc7 bpf: Add a common helper bpf_copy_to_user()
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 57d485376552480602ab83bf7499b451bae5a1b9
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Jul 9 02:56:27 2023 +0000

    bpf: Add a common helper bpf_copy_to_user()

    Add a common helper bpf_copy_to_user(), which will be used at multiple
    places.
    No functional change.

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20230709025630.3735-8-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-14 15:22:25 +01:00
Viktor Malik 46b8ca8e88
bpf: Keep BPF_PROG_LOAD permission checks clear of validations
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 7f6719f7a8662a40afed367a685516f9f34e7bc2
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jun 13 15:35:33 2023 -0700

    bpf: Keep BPF_PROG_LOAD permission checks clear of validations
    
    Move flags validation and license checks out of the permission
    checks. They were intermingled, which makes subsequent changes harder.
    Clean this up: perform straightforward flag validation upfront, and
    fetch and check license later, right where we use it. Also consolidate
    capabilities check in one block, right after basic attribute sanity
    checks.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230613223533.3689589-5-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:20 +02:00
Viktor Malik daad46d419
bpf: Centralize permissions checks for all BPF map types
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 6c3eba1c5e283fd2bb1c076dbfcb47f569c3bfde
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jun 13 15:35:32 2023 -0700

    bpf: Centralize permissions checks for all BPF map types
    
    This allows to do more centralized decisions later on, and generally
    makes it very explicit which maps are privileged and which are not
    (e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants,
    as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit
    and easy to verify).
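    
    An abbreviated sketch of the shape of the centralized check in
    map_create(); the map types shown are a small illustrative subset and
    the grouping here is an assumption:
    
      switch (attr->map_type) {
      case BPF_MAP_TYPE_ARRAY:
      case BPF_MAP_TYPE_HASH:
      case BPF_MAP_TYPE_PERCPU_HASH:
              break;                    /* allowed without capabilities */
      case BPF_MAP_TYPE_LRU_HASH:
      case BPF_MAP_TYPE_LRU_PERCPU_HASH:
              if (!bpf_capable())
                      return -EPERM;    /* privileged HASH variants */
              break;
      default:
              return -EINVAL;
      }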
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:20 +02:00
Viktor Malik 74d7d090cf
bpf: Inline map creation logic in map_create() function
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 22db41226b679768df8f0a4ff5de8e58f625f45b
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jun 13 15:35:31 2023 -0700

    bpf: Inline map creation logic in map_create() function
    
    Currently find_and_alloc_map() performs two separate functions: some
    argument sanity checking and partial map creation workflow hanling.
    Neither of those functions are self-sufficient and are augmented by
    further checks and initialization logic in the caller (map_create()
    function). So unify all the sanity checks, permission checks, and
    creation and initialization logic in one linear piece of code in
    map_create() instead. This also make it easier to further enhance
    permission checks and keep them located in one place.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230613223533.3689589-3-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:20 +02:00
Viktor Malik 84b07be16e
bpf: Move unprivileged checks into map_create() and bpf_prog_load()
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 1d28635abcf1914425d6516e641978011984c58a
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jun 13 15:35:30 2023 -0700

    bpf: Move unprivileged checks into map_create() and bpf_prog_load()
    
    Make each bpf() syscall command a bit more self-contained, making it
    easier to further enhance it. We move sysctl_unprivileged_bpf_disabled
    handling down to map_create() and bpf_prog_load(), two special commands
    in this regard.
    
    Also swap the order of checks, calling bpf_capable() only if
    sysctl_unprivileged_bpf_disabled is true, avoiding unnecessary audit
    messages.
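    
    A sketch of the reordered check now living in map_create() and
    bpf_prog_load() (illustrative; the point is that bpf_capable() is only
    consulted when the sysctl is set):
    
      /* Check bpf_capable() only when unprivileged BPF is actually
       * disabled, avoiding unnecessary audit messages otherwise.
       */
      if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
              return -EPERM;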
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230613223533.3689589-2-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:20 +02:00
Viktor Malik 9bc25bf7fc
bpf: Remove in_atomic() from bpf_link_put().
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit ab5d47bd41b1db82c295b0e751e2b822b43a4b5a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Jun 14 10:34:30 2023 +0200

    bpf: Remove in_atomic() from bpf_link_put().
    
    bpf_free_inode() is invoked as an RCU callback. Usually RCU callbacks are
    invoked within softirq context. By setting the rcutree.use_softirq=0 boot
    option, the RCU callbacks will be invoked in a per-CPU kthread with
    bottom halves disabled, which implies an RCU read section.
    
    On PREEMPT_RT the context remains fully preemptible. The RCU read
    section however does not allow schedule() invocation. The latter happens
    in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
    from bpf_link_put().
    
    It was pointed out that the bpf_link_put() invocation should not be
    delayed if originated from close(). It was also pointed out that other
    invocations from within a syscall should also avoid the workqueue.
    Everyone else should use workqueue by default to remain safe in the
    future (while auditing the code, every caller was preemptible except for
    the RCU case).
    
    Let bpf_link_put() use the worker unconditionally. Add
    bpf_link_put_direct() which will directly free the resources and is used
    by close() and from within __sys_bpf().
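    
    A sketch of the resulting pair of functions (close to the shape
    described above; details are illustrative):
    
      static void bpf_link_put_deferred(struct work_struct *work)
      {
              struct bpf_link *link = container_of(work, struct bpf_link, work);
    
              bpf_link_free(link);
      }
    
      /* Safe to call from any context: always defers to a workqueue. */
      void bpf_link_put(struct bpf_link *link)
      {
              if (!atomic64_dec_and_test(&link->refcnt))
                      return;
    
              INIT_WORK(&link->work, bpf_link_put_deferred);
              schedule_work(&link->work);
      }
    
      /* Used by close() and __sys_bpf(), where sleeping is allowed. */
      void bpf_link_put_direct(struct bpf_link *link)
      {
              if (!atomic64_dec_and_test(&link->refcnt))
                      return;
              bpf_link_free(link);
      }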
    
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20230614083430.oENawF8f@linutronix.de

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:19 +02:00
Viktor Malik 1eac8d32f1
bpf: Fix bad unlock balance on freeze_mutex
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 4266f41feaeee2521749ce2cfb52eafd4e2947c5
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri May 26 12:13:56 2023 +0200

    bpf: Fix bad unlock balance on freeze_mutex
    
    Commit c4c84f6fb2c4 ("bpf: drop unnecessary bpf_capable() check in
    BPF_MAP_FREEZE command") moved the permissions check outside of the
    freeze_mutex in the map_freeze() handler. The error paths still jump
    to err_put, which tries to unlock the freeze_mutex even though it
    was not locked in the first place. Fix it.
    
    Fixes: c4c84f6fb2c4 ("bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command")
    Reported-by: syzbot+8982e75c2878b9ffeac5@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:15 +02:00
Viktor Malik e7497c6a4e
bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit c4c84f6fb2c4dc4c0f5fd927b3c3d3fd28b7030e
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Wed May 24 15:54:19 2023 -0700

    bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
    
    Seems like that extra bpf_capable() check in BPF_MAP_FREEZE handler was
    unintentionally left when we switched to a model that all BPF map
    operations should be allowed regardless of CAP_BPF (or any other
    capabilities), as long as process got BPF map FD somehow.
    
    This patch replaces the bpf_capable() check in the BPF_MAP_FREEZE handler
    with a writable access check, given that conceptually freezing the map is
    modifying it: the map becomes unmodifiable for subsequent updates.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20230524225421.1587859-2-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:15 +02:00
Viktor Malik 9358711447
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit cb8edce28073a906401c9e421eca7c99f3396da1
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Mon May 15 16:48:06 2023 -0700

    bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
    
    Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
    forces users to specify pinning location as a string-based absolute or
    relative (to current working directory) path. This has various
    implications related to security (e.g., symlink-based attacks) and forces
    BPF FS to be exposed in the file system, which can cause races with
    other applications.
    
    One of the feedbacks we got from folks working with containers heavily
    was that inability to use purely FD-based location specification was an
    unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
    commands. This patch closes this oversight, adding path_fd field to
    BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
    *at() syscalls for dirfd + pathname combinations.
    
    This now allows interesting possibilities like working with detached BPF
    FS mount (e.g., to perform multiple pinnings without running a risk of
    someone interfering with them), and generally making pinning/getting
    more secure and not prone to any races and/or security attacks.
    
    This is demonstrated by a selftest added in subsequent patch that takes
    advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
    creating detached BPF FS mount, pinning, and then getting BPF map out of
    it, all while never exposing this private instance of BPF FS to outside
    worlds.
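    
    A user-space sketch of pinning into a detached BPF FS instance via the
    new path_fd field; error handling is omitted, and the fsopen/fsconfig/
    fsmount wrappers assume a libc that provides them (otherwise use raw
    syscall()):
    
      int fs_fd  = fsopen("bpf", 0);
      fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      int mnt_fd = fsmount(fs_fd, 0, 0);          /* detached bpffs instance */
    
      union bpf_attr attr = {};
    
      attr.pathname   = (__u64)(unsigned long)"my_map"; /* relative to path_fd */
      attr.bpf_fd     = map_fd;
      attr.file_flags = BPF_F_PATH_FD;
      attr.path_fd    = mnt_fd;
    
      syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));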
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:14 +02:00
Viktor Malik 6568d0fe6a
bpf: Show target_{obj,btf}_id in tracing link fdinfo
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit e859e429511afb21d3f784bd0ccdf500d40b73ef
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed May 17 10:31:25 2023 +0000

    bpf: Show target_{obj,btf}_id in tracing link fdinfo
    
    The target_btf_id can help us understand which kernel function is
    linked by a tracing prog. The target_btf_id and target_obj_id have
    already been exposed to userspace, so we just need to show them.
    
    The result as follows,
    
    $ cat /proc/10673/fdinfo/10
    pos:    0
    flags:  02000000
    mnt_id: 15
    ino:    2094
    link_type:      tracing
    link_id:        2
    prog_tag:       a04f5eef06a7f555
    prog_id:        13
    attach_type:    24
    target_obj_id:  1
    target_btf_id:  13964
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20230517103126.68372-2-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:13 +02:00
Viktor Malik 10374fbbf1
bpf: Print a warning only if writing to unprivileged_bpf_disabled.
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit fedf99200ab086c42a572fca1d7266b06cdc3e3f
Author: Kui-Feng Lee <thinker.li@gmail.com>
Date:   Tue May 2 11:14:18 2023 -0700

    bpf: Print a warning only if writing to unprivileged_bpf_disabled.
    
    Only print the warning message if you are writing to
    "/proc/sys/kernel/unprivileged_bpf_disabled".
    
    The kernel may print an annoying warning when you read
    "/proc/sys/kernel/unprivileged_bpf_disabled" saying
    
      WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible
      via Spectre v2 BHB attacks!
    
    However, this message is only meaningful when the feature is being
    disabled or enabled (i.e., on a write), not when the file is merely read.
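
    A rough sketch of the resulting handler behaviour (helper and variable
    names assumed from the sysctl code this patch touches):

      ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
      if (write && !ret) {
              *(int *)table->data = unpriv_enable;
              unpriv_ebpf_notify(unpriv_enable); /* may print the Spectre v2 warning */
      }
      /* plain reads of the sysctl no longer trigger the notifier */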
    
    Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20230502181418.308479-1-kuifeng@meta.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-12 11:40:56 +02:00
Artem Savkov 46feefc504 bpf: Force kprobe multi expected_attach_type for kprobe_multi link
Bugzilla: https://bugzilla.redhat.com/2221599

commit db8eae6bc5c702d8e3ab2d0c6bb5976c131576eb
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Sun Jun 18 15:14:14 2023 +0200

    bpf: Force kprobe multi expected_attach_type for kprobe_multi link
    
    We currently allow to create perf link for program with
    expected_attach_type == BPF_TRACE_KPROBE_MULTI.
    
    This will cause crash when we call helpers like get_attach_cookie or
    get_func_ip in such program, because it will call the kprobe_multi's
    version (current->bpf_ctx context setup) of those helpers while it
    expects perf_link's current->bpf_ctx context setup.
    
    Making sure that we use BPF_TRACE_KPROBE_MULTI expected_attach_type
    only for programs attaching through kprobe_multi link.
    
    Fixes: ca74823c6e16 ("bpf: Add cookie support to programs attached with kprobe multi link")
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20230618131414.75649-1-jolsa@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:36 +02:00
Artem Savkov a8f0051ae7 bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
Bugzilla: https://bugzilla.redhat.com/2221599

commit 132328e8e85174ea788faf8f627c33258c88fbad
Author: Florian Westphal <fw@strlen.de>
Date:   Mon Jun 5 15:14:45 2023 +0200

    bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
    
    Andrii Nakryiko writes:
    
     And we currently don't have an attach type for NETLINK BPF link.
     Thankfully it's not too late to add it. I see that link_create() in
     kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
     have done that. Instead we need to add BPF_NETLINK attach type to enum
     bpf_attach_type. And wire all that properly throughout the kernel and
     libbpf itself.
    
    This adds BPF_NETFILTER and uses it.  This breaks uabi but this
    wasn't in any non-rc release yet, so it should be fine.
    
    v2: check link_attach prog type in link_create too
    
    Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs")
    Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
    Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:36 +02:00
Artem Savkov 90a22b78b4 bpf: add bpf_link support for BPF_NETFILTER programs
Bugzilla: https://bugzilla.redhat.com/2221599

commit 84601d6ee68ae820dec97450934797046d62db4b
Author: Florian Westphal <fw@strlen.de>
Date:   Fri Apr 21 19:02:54 2023 +0200

    bpf: add bpf_link support for BPF_NETFILTER programs
    
    Add bpf_link support skeleton.  To keep this reviewable, no bpf program
    can be invoked yet; if a program is attached, only a C stub is called and
    not the actual bpf program.
    
    Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig.
    
    Uapi example usage:
    	union bpf_attr attr = { };
    
    	attr.link_create.prog_fd = progfd;
    	attr.link_create.attach_type = 0; /* unused */
    	attr.link_create.netfilter.pf = PF_INET;
    	attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN;
    	attr.link_create.netfilter.priority = -128;
    
    	err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr));
    
    ... this would attach progfd to ipv4:input hook.
    
    Such hook gets removed automatically if the calling program exits.
    
    BPF_NETFILTER program invocation is added in followup change.
    
    The NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook; it
    allows telling userspace which program is attached at the given hook
    when the user runs the 'nft hook list' command, rather than just the priority
    and not-very-helpful 'this hook runs a bpf prog but I can't tell which
    one'.
    
    It will also be used to disallow registration of two bpf programs with
    the same priority in a followup patch.
    
    v4: arm32 cmpxchg only supports 32bit operand
        s/prio/priority/
    v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if
        more use cases pop up (arptables, ebtables, netdev ingress/egress etc).
    
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:33 +02:00
Artem Savkov 3a4f990c68 bpf: lirc program type should not require SYS_CAP_ADMIN
Bugzilla: https://bugzilla.redhat.com/2221599

commit 69a8c792cd9518071dc801bb110e0f2210d9f958
Author: Sean Young <sean@mess.org>
Date:   Mon Apr 17 09:17:48 2023 +0100

    bpf: lirc program type should not require SYS_CAP_ADMIN
    
    Make it possible to load lirc program type with just CAP_BPF. There is
    nothing exceptional about lirc programs that means they require
    SYS_CAP_ADMIN.
    
    In order to attach or detach a lirc program type you need permission to
    open /dev/lirc0; if you have permission to do that, you can alter all
    sorts of lirc receiving options. Changing the IR protocol decoder is no
    different.
    
    Right now on a typical distribution /dev/lirc devices are only
    read/write by root. Ideally we would make them group read/write like
    other devices so that local users can use them without becoming root.
    
    Signed-off-by: Sean Young <sean@mess.org>
    Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:31 +02:00
Artem Savkov 5c11bd9376 bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
Bugzilla: https://bugzilla.redhat.com/2221599

commit d54730b50bae1f3119bd686d551d66f0fcc387ca
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Sat Apr 15 13:18:04 2023 -0700

    bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
    
    A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
    meant for use in BPF programs. Similarly to other opaque types like
    bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in
    user-defined struct types a bpf_refcount can be located, so necessary
    btf_record plumbing is added to enable this. bpf_refcount is sized to
    hold a refcount_t.
    
    Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
    btf_record as refcount_off in addition to being in the field array.
    Caching refcount_off makes sense for this field because further patches
    in the series will modify functions that take local kptrs (e.g.
    bpf_obj_drop) to change their behavior if the type they're operating on
    is refcounted. So enabling fast "is this type refcounted?" checks is
    desirable.
    
    No such verifier behavior changes are introduced in this patch, just
    logic to recognize 'struct bpf_refcount' in btf_record.
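
    A hedged BPF-program-side sketch of where the new field would sit
    (struct and field names here are illustrative only):

      struct node_data {
              long key;
              struct bpf_refcount ref;   /* opaque, sized to hold a refcount_t */
              struct bpf_rb_node node;
      };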
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:30 +02:00
Artem Savkov 201f7b639f bpf: Remove btf_field_offs, use btf_record's fields instead
Bugzilla: https://bugzilla.redhat.com/2221599

commit cd2a8079014aced27da9b2e669784f31680f1351
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Sat Apr 15 13:18:03 2023 -0700

    bpf: Remove btf_field_offs, use btf_record's fields instead
    
    The btf_field_offs struct contains (offset, size) for btf_record fields,
    sorted by offset. btf_field_offs is always used in conjunction with
    btf_record, which has btf_field 'fields' array with (offset, type), the
    latter of which btf_field_offs' size is derived from via
    btf_field_type_size.
    
    This patch adds a size field to struct btf_field and sorts btf_record's
    fields by offset, making it possible to get rid of btf_field_offs. Less
    data duplication and less code complexity results.
    
    Since btf_field_offs' lifetime closely followed the btf_record used to
    populate it, most complexity wins are from removal of initialization
    code like:
    
      if (btf_record_successfully_initialized) {
        foffs = btf_parse_field_offs(rec);
        if (IS_ERR_OR_NULL(foffs))
          // free the btf_record and return err
      }
    
    Other changes in this patch are pretty mechanical:
    
      * foffs->field_off[i] -> rec->fields[i].offset
      * foffs->field_sz[i] -> rec->fields[i].size
      * Sort rec->fields in btf_parse_fields before returning
        * It's possible that this is necessary independently of other
          changes in this patch. btf_record_find in syscall.c expects
          btf_record's fields to be sorted by offset, yet there's no
          explicit sorting of them before this patch, record's fields are
          populated in the order they're read from BTF struct definition.
          BTF docs don't say anything about the sortedness of struct fields.
      * All functions taking struct btf_field_offs * input now instead take
        struct btf_record *. All callsites of these functions already have
        access to the correct btf_record.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:30 +02:00
Artem Savkov 475e5e02ee bpf: Add log_true_size output field to return necessary log buffer size
Bugzilla: https://bugzilla.redhat.com/2221599

commit 47a71c1f9af0a334c9dfa97633c41de4feda4287
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Apr 6 16:41:58 2023 -0700

    bpf: Add log_true_size output field to return necessary log buffer size
    
    Add output-only log_true_size and btf_log_true_size field to
    BPF_PROG_LOAD and BPF_BTF_LOAD commands, respectively. It will return
    the size of log buffer necessary to fit in all the log contents at
    specified log_level. This is very useful for BPF loader libraries like
    libbpf to be able to size log buffer correctly, but could be used by
    users directly, if necessary, as well.
    
    This patch plumbs all this through the code, taking into account actual
    bpf_attr size provided by user to determine if these new fields are
    expected by users. And if they are, set them from kernel on return.
    
    We refactor the btf_parse() function to accommodate this, moving attr and
    uattr handling inside it. The rest is very straightforward code, which
    is split from the logging accounting changes in the previous patch to
    make it simpler to review logic vs UAPI changes.
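
    A hedged userspace sketch of the intended two-pass usage, assuming the
    load fails with -ENOSPC when the supplied log buffer is too small:

      attr.log_buf   = (__u64)(unsigned long)buf;
      attr.log_size  = sizeof(buf);
      attr.log_level = 1;
      fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
      if (fd < 0 && errno == ENOSPC) {
              /* attr.log_true_size now holds the size needed for the full log */
              char *big = malloc(attr.log_true_size);
              /* ... retry the load with the bigger buffer ... */
      }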
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Lorenz Bauer <lmb@isovalent.com>
    Link: https://lore.kernel.org/bpf/20230406234205.323208-13-andrii@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:28 +02:00
Artem Savkov f54944e86a bpf: Only invoke kptr dtor following non-NULL xchg
Bugzilla: https://bugzilla.redhat.com/2221599

commit 1431d0b584a673ea690c88a5f7e1aedd9caf0e84
Author: David Vernet <void@manifault.com>
Date:   Sat Mar 25 16:31:42 2023 -0500

    bpf: Only invoke kptr dtor following non-NULL xchg
    
    When a map value is being freed, we loop over all of the fields of the
    corresponding BPF object and issue the appropriate cleanup calls
    corresponding to the field's type. If the field is a referenced kptr, we
    atomically xchg the value out of the map, and invoke the kptr's
    destructor on whatever was there before (or bpf_obj_drop() it if it was
    a local kptr).
    
    Currently, we always invoke the destructor (either bpf_obj_drop() or the
    kptr's registered destructor) on any KPTR_REF-type field in a map, even
    if there wasn't a value in the map. This means that any function serving
    as the kptr's KF_RELEASE destructor must always treat the argument as
    possibly NULL, as the following can and regularly does happen:
    
    void *xchgd_field;
    
    /* No value was in the map, so xchgd_field is NULL */
    xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
    field->kptr.dtor(xchgd_field);
    
    These are odd semantics to impose on KF_RELEASE kfuncs -- BPF programs
    are prohibited by the verifier from passing NULL pointers to KF_RELEASE
    kfuncs, so it doesn't make sense to require this of BPF programs, but
    not the main kernel destructor path. It's also unnecessary to invoke any
    cleanup logic for local kptrs. If there is no object there, there's
    nothing to drop.
    
    So as to allow KF_RELEASE kfuncs to fully assume that an argument is
    non-NULL, this patch updates a KPTR_REF's destructor to only be invoked
    when a non-NULL value is xchg'd out of the kptr map field.
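
    A minimal sketch of the updated cleanup path, simplified from the
    description above:

      xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
      if (!xchgd_field)
              continue;                  /* nothing was stored, skip the dtor */
      field->kptr.dtor(xchgd_field);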
    
    Signed-off-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/r/20230325213144.486885-2-void@manifault.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
Artem Savkov 54175d4877 bpf: Check IS_ERR for the bpf_map_get() return value
Bugzilla: https://bugzilla.redhat.com/2221599

commit 55fbae05476df65e5eee8be54f61d0257af0240b
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Fri Mar 24 11:42:41 2023 -0700

    bpf: Check IS_ERR for the bpf_map_get() return value
    
    This patch fixes a mistake in checking NULL instead of
    checking IS_ERR for the bpf_map_get() return value.
    
    It also fixes the return value in link_update_map() from -EINVAL
    to PTR_ERR(*_map).
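
    The corrected pattern, roughly (variable names assumed):

      new_map = bpf_map_get(attr->link_update.new_map_fd);
      if (IS_ERR(new_map))
              return PTR_ERR(new_map);   /* previously: if (!new_map) return -EINVAL; */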
    
    Reported-by: syzbot+71ccc0fe37abb458406b@syzkaller.appspotmail.com
    Fixes: 68b04864ca42 ("bpf: Create links for BPF struct_ops maps.")
    Fixes: aef56f2e918b ("bpf: Update the struct_ops of a bpf_link.")
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Acked-by: Kui-Feng Lee <kuifeng@meta.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230324184241.1387437-1-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
Artem Savkov 84a3b67362 bpf: Update the struct_ops of a bpf_link.
Bugzilla: https://bugzilla.redhat.com/2221599

commit aef56f2e918bf8fc8de25f0b36e8c2aba44116ec
Author: Kui-Feng Lee <kuifeng@meta.com>
Date:   Wed Mar 22 20:24:02 2023 -0700

    bpf: Update the struct_ops of a bpf_link.
    
    By improving the BPF_LINK_UPDATE command of bpf(), it should allow you
    to conveniently switch between different struct_ops on a single
    bpf_link. This would enable smoother transitions from one struct_ops
    to another.
    
    The struct_ops maps passing along with BPF_LINK_UPDATE should have the
    BPF_F_LINK flag.
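
    A hedged userspace sketch of switching struct_ops implementations on an
    existing link (field names as in the link_update part of union bpf_attr):

      union bpf_attr attr = {};

      attr.link_update.link_fd    = link_fd;     /* existing struct_ops link */
      attr.link_update.new_map_fd = new_map_fd;  /* replacement map, created with BPF_F_LINK */
      err = syscall(__NR_bpf, BPF_LINK_UPDATE, &attr, sizeof(attr));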
    
    Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20230323032405.3735486-6-kuifeng@meta.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
Artem Savkov 1206412454 bpf: Create links for BPF struct_ops maps.
Bugzilla: https://bugzilla.redhat.com/2221599

Conflicts: missing d48567c9a0d1 mm: Introduce set_memory_rox()

commit 68b04864ca425d1894c96b8141d4fba1181f11cb
Author: Kui-Feng Lee <kuifeng@meta.com>
Date:   Wed Mar 22 20:24:00 2023 -0700

    bpf: Create links for BPF struct_ops maps.

    Make bpf_link support struct_ops.  Previously, struct_ops were always
    used alone without any associated links. Upon updating its value, a
    struct_ops would be activated automatically. Yet other BPF program
    types required to make a bpf_link with their instances before they
    could become active. Now, however, you can create an inactive
    struct_ops, and create a link to activate it later.

    With bpf_links, struct_ops has a behavior similar to other BPF program
    types. You can pin/unpin them from their links and the struct_ops will
    be deactivated when its link is removed, while previously someone needed
    to delete the value for it to be deactivated.

    bpf_links are responsible for registering their associated
    struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag
    set to create a bpf_link, while a struct_ops without this flag behaves in
    the same manner as before and is registered upon updating its value.

    The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it
    used to craft the links for BPF struct_ops programs, but also to
    create links for BPF struct_ops themselves.  Since the links of BPF
    struct_ops programs are only used to create trampolines internally,
    they are never seen in other contexts. Thus, they can be reused for
    struct_ops themselves.

    To maintain a reference to the map supporting this link, we add
    bpf_struct_ops_link as an additional type. The pointer of the map is
    RCU and won't be necessary until later in the patchset.
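
    A hedged userspace sketch of activating a BPF_F_LINK struct_ops map
    through a link (the map value is assumed to have been updated already):

      union bpf_attr attr = {};

      attr.link_create.map_fd      = ops_map_fd;   /* struct_ops map created with BPF_F_LINK */
      attr.link_create.attach_type = BPF_STRUCT_OPS;
      link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));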

    Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
    Link: https://lore.kernel.org/r/20230323032405.3735486-4-kuifeng@meta.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
Artem Savkov 67465a1fdb bpf: Retire the struct_ops map kvalue->refcnt.
Bugzilla: https://bugzilla.redhat.com/2221599

Conflicts: missing d48567c9a0d1 mm: Introduce set_memory_rox()

commit b671c2067a04c0668df174ff5dfdb573d1f9b074
Author: Kui-Feng Lee <kuifeng@meta.com>
Date:   Wed Mar 22 20:23:58 2023 -0700

    bpf: Retire the struct_ops map kvalue->refcnt.

    We have replaced kvalue-refcnt with synchronize_rcu() to wait for an
    RCU grace period.

    Maintenance of kvalue->refcnt was a complicated task, as we had to
    simultaneously keep track of two reference counts: kvalue->refcnt itself
    and the reference count of bpf_map. When the kvalue->refcnt reaches zero, we
    also have to reduce the reference count on bpf_map - yet these steps
    are not performed in an atomic manner and require us to be vigilant
    when managing them. By eliminating kvalue->refcnt, we can make our
    maintenance more straightforward as the refcount of bpf_map is now
    solely managed!

    To prevent the trampoline image of a struct_ops from being released
    while it is still in use, we wait for an RCU grace period. The
    setsockopt(TCP_CONGESTION, "...") command allows you to change your
    socket's congestion control algorithm and can result in releasing the
    old struct_ops implementation. That is fine. However, since this function is
    exposed through bpf_setsockopt(), it may be accessed by BPF programs
    as well. To ensure that the trampoline image belonging to a struct_ops
    can be safely called while its method is in use, the trampoline
    safeguards the BPF program with rcu_read_lock(). Doing so prevents any
    destruction of the associated images before returning from a
    trampoline and requires us to wait for an RCU grace period.

    Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
    Link: https://lore.kernel.org/r/20230323032405.3735486-2-kuifeng@meta.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
Artem Savkov 9e105169e6 bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules
Bugzilla: https://bugzilla.redhat.com/2221599

Conflicts: missing commits moving things into kernel/module/internal.h

commit 31bf1dbccfb0a9861d4846755096b3fff5687f8a
Author: Viktor Malik <vmalik@redhat.com>
Date:   Fri Mar 10 08:40:59 2023 +0100

    bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules

    This resolves two problems with attachment of fentry/fexit/fmod_ret/lsm
    to functions located in modules:

    1. The verifier tries to find the address to attach to in kallsyms. This
       is always done by searching the entire kallsyms, not respecting the
       module in which the function is located. Such approach causes an
       incorrect attachment address to be computed if the function to attach
       to is shadowed by a function of the same name located earlier in
       kallsyms.

    2. If the address to attach to is located in a module, the module
       reference is only acquired in register_fentry. If the module is
       unloaded between the place where the address is found
       (bpf_check_attach_target in the verifier) and register_fentry, it is
       possible that another module is loaded to the same address which may
       lead to potential errors.

    Since the attachment must contain the BTF of the program to attach to,
    we extract the module from it and search for the function address in the
    correct module (resolving problem no. 1). Then, the module reference is
    taken directly in bpf_check_attach_target and stored in the bpf program
    (in bpf_prog_aux). The reference is only released when the program is
    unloaded (resolving problem no. 2).

    Signed-off-by: Viktor Malik <vmalik@redhat.com>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
    Link: https://lore.kernel.org/r/3f6a9d8ae850532b5ef864ef16327b0f7a669063.1678432753.git.vmalik@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:17 +02:00
Artem Savkov b5247b9d85 bpf: Disable migration when freeing stashed local kptr using obj drop
Bugzilla: https://bugzilla.redhat.com/2221599

commit 9e36a204bd43553a9cd4bd574612cd9a5df791ea
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Mon Mar 13 14:46:41 2023 -0700

    bpf: Disable migration when freeing stashed local kptr using obj drop
    
    When a local kptr is stashed in a map and freed when the map goes away,
    currently an error like the below appears:
    
    [   39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875
    [   39.196549] caller is bpf_mem_free+0x56/0xc0
    [   39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G           O       6.2.0-13016-g22df776a9a86 #4477
    [   39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    [   39.198949] Workqueue: events_unbound bpf_map_free_deferred
    [   39.199470] Call Trace:
    [   39.199703]  <TASK>
    [   39.199911]  dump_stack_lvl+0x60/0x70
    [   39.200267]  check_preemption_disabled+0xbf/0xe0
    [   39.200704]  bpf_mem_free+0x56/0xc0
    [   39.201032]  ? bpf_obj_new_impl+0xa0/0xa0
    [   39.201430]  bpf_obj_free_fields+0x1cd/0x200
    [   39.201838]  array_map_free+0xad/0x220
    [   39.202193]  ? finish_task_switch+0xe5/0x3c0
    [   39.202614]  bpf_map_free_deferred+0xea/0x210
    [   39.203006]  ? lockdep_hardirqs_on_prepare+0xe/0x220
    [   39.203460]  process_one_work+0x64f/0xbe0
    [   39.203822]  ? pwq_dec_nr_in_flight+0x110/0x110
    [   39.204264]  ? do_raw_spin_lock+0x107/0x1c0
    [   39.204662]  ? lockdep_hardirqs_on_prepare+0xe/0x220
    [   39.205107]  worker_thread+0x74/0x7a0
    [   39.205451]  ? process_one_work+0xbe0/0xbe0
    [   39.205818]  kthread+0x171/0x1a0
    [   39.206111]  ? kthread_complete_and_exit+0x20/0x20
    [   39.206552]  ret_from_fork+0x1f/0x30
    [   39.206886]  </TASK>
    
    This happens because the call to __bpf_obj_drop_impl I added in the patch
    adding support for stashing local kptrs doesn't disable migration. Prior
    to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF
    program, whereas now it can be called from the map free path, so it's
    necessary to explicitly disable migration.
    
    Also, refactor a bit to just call __bpf_obj_drop_impl directly instead
    of bothering w/ dtor union and setting pointer-to-obj_drop.
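
    A minimal sketch of the fixed call in the map-free path (argument names
    assumed):

      migrate_disable();
      __bpf_obj_drop_impl(xchgd_field, pointee_struct_meta ?
                                       pointee_struct_meta->record : NULL);
      migrate_enable();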
    
    Fixes: c8e187540914 ("bpf: Support __kptr to local kptrs")
    Reported-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:16 +02:00
Artem Savkov 400701606d bpf: Support __kptr to local kptrs
Bugzilla: https://bugzilla.redhat.com/2221599

commit c8e18754091479fac3f5b6c053c6bc4be0b7fb11
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Fri Mar 10 15:07:41 2023 -0800

    bpf: Support __kptr to local kptrs
    
    If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module
    BTF - it must have been allocated by bpf_obj_new and therefore must be
    free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local
    kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new.
    
    This patch adds support for treating __kptr-tagged pointers to "local
    kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr
    acquire / release semantics. Consider the following example:
    
      struct node_data {
              long key;
              long data;
              struct bpf_rb_node node;
      };
    
      struct map_value {
              struct node_data __kptr *node;
      };
    
      struct {
              __uint(type, BPF_MAP_TYPE_ARRAY);
              __type(key, int);
              __type(value, struct map_value);
              __uint(max_entries, 1);
      } some_nodes SEC(".maps");
    
    If struct node_data had a matching definition in kernel BTF, the verifier would
    expect a destructor for the type to be registered. Since struct node_data does
    not match any type in kernel BTF, the verifier knows that there is no kfunc
    that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can
    only come from bpf_obj_new. So instead of searching for a registered dtor,
    a bpf_obj_drop dtor can be assumed.
    
    This allows the runtime to properly destruct such kptrs in
    bpf_obj_free_fields, which enables maps to clean up map_vals w/ such
    kptrs when going away.
    
    Implementation notes:
      * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr.
        Before this patch, the variable would only ever point to vmlinux or
        module BTFs, but now it can point to some program BTF for local kptr
        type. It's later used to populate the (btf, btf_id) pair in kptr btf
        field.
      * It's necessary to btf_get the program BTF when populating btf_field
        for local kptr. btf_record_free later does a btf_put.
      * Behavior for non-local referenced kptrs is not modified, as
        bpf_find_btf_id helper only searches vmlinux and module BTFs for
        matching BTF type. If such a type is found, btf_field_kptr's btf will
        pass btf_is_kernel check, and the associated release function is
        some one-argument dtor. If btf_is_kernel check fails, associated
        release function is two-arg bpf_obj_drop_impl. Before this patch
        only btf_field_kptr's w/ kernel or module BTFs were created.
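
    A hedged BPF-program sketch of stashing a local kptr into the map value
    from the example above (bpf_obj_new/bpf_kptr_xchg/bpf_obj_drop
    declarations assumed to come from the selftests' bpf_experimental.h):

      SEC("tc")
      int stash_node(void *ctx)
      {
              struct map_value *v;
              struct node_data *n, *old;
              int key = 0;

              v = bpf_map_lookup_elem(&some_nodes, &key);
              if (!v)
                      return 0;

              n = bpf_obj_new(typeof(*n));       /* allocate a local kptr */
              if (!n)
                      return 0;

              old = bpf_kptr_xchg(&v->node, n);  /* stash it in the map value */
              if (old)
                      bpf_obj_drop(old);         /* drop whatever was there before */
              return 0;
      }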
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:16 +02:00
Artem Savkov a441724b52 bpf: Change btf_record_find enum parameter to field_mask
Bugzilla: https://bugzilla.redhat.com/2221599

commit 74843b57ec70af7b67b7e6153374834ee18d139f
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Thu Mar 9 10:01:08 2023 -0800

    bpf: Change btf_record_find enum parameter to field_mask
    
    btf_record_find's 3rd parameter can be multiple enum btf_field_type's
    masked together. The function is called with BPF_KPTR in two places in
    verifier.c, so it works with masked values already.
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230309180111.1618459-4-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:16 +02:00
Artem Savkov abbb94d5b6 bpf: enforce all maps having memory usage callback
Bugzilla: https://bugzilla.redhat.com/2221599

commit 6b4a6ea2c62d34272d64161d43a19c02355576e2
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:15 2023 +0000

    bpf: enforce all maps having memory usage callback
    
    We have implemented the memory usage callback for all maps, and we enforce
    that any newly added map has a callback as well. We check this callback at
    map creation time. If it doesn't have the callback, we will return
    EINVAL.
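
    A rough sketch of the creation-time check described above:

      /* at map creation time */
      if (!map->ops->map_mem_usage)
              return -EINVAL;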
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-19-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:13 +02:00
Artem Savkov 7fc5796a7e bpf: offload map memory usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 9629363cd05642fe43aded44938adec067ad1da3
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:14 2023 +0000

    bpf: offload map memory usage
    
    A new helper is introduced to calculate offload map memory usage. But
    currently the memory dynamically allocated in netdev dev_ops, like
    nsim_map_update_elem, is not counted. Let's just put it aside now.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-18-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:13 +02:00
Artem Savkov 0378252f0c bpf: add new map ops ->map_mem_usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 90a5527d7686d3ebe0dd2a831356a6c7d7dc31bc
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:45:58 2023 +0000

    bpf: add new map ops ->map_mem_usage
    
    Add a new map ops ->map_mem_usage to print the memory usage of a
    bpf map.
    
    This is a preparation for the followup change.
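
    A hedged sketch of what an implementation of the new callback could look
    like (the estimate below is illustrative, not the real array-map math):

      static u64 array_map_mem_usage(const struct bpf_map *map)
      {
              struct bpf_array *array = container_of(map, struct bpf_array, map);

              /* struct overhead plus one value slot per entry, rounded up */
              return sizeof(*array) +
                     (u64)map->max_entries * round_up(map->value_size, 8);
      }

      const struct bpf_map_ops array_map_ops = {
              /* ... other callbacks ... */
              .map_mem_usage = array_map_mem_usage,
      };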
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-2-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:12 +02:00
Artem Savkov ef8295203b bpf: Support kptrs in local storage maps
Bugzilla: https://bugzilla.redhat.com/2221599

commit 9db44fdd8105da00669d425acab887c668df75f6
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sat Feb 25 16:40:09 2023 +0100

    bpf: Support kptrs in local storage maps
    
    Enable support for kptrs in local storage maps by wiring up the freeing
    of these kptrs from map value. Freeing of bpf_local_storage_map is only
    delayed in case there are special fields, therefore bpf_selem_free_*
    path can also only dereference smap safely in that case. This is
    recorded using a bool utilizing a hole in bpf_local_storage_elem. It
    could have been tagged in the pointer value smap using the lowest bit
    (since alignment > 1), but since there was already a hole I went with
    the simpler option. Only the map structure freeing is delayed using RCU
    barriers, as the buckets aren't used when selem is being freed, so they
    can be freed once all readers of the bucket lists can no longer access
    it.
    
    Cc: Martin KaFai Lau <martin.lau@kernel.org>
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230225154010.391965-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Artem Savkov 6d840b01af bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
Bugzilla: https://bugzilla.redhat.com/2221599

commit 65334e64a493c6a0976de7ad56bf8b7a9ff04b4a
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sat Feb 25 16:40:08 2023 +0100

    bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
    
    Enable support for kptrs in percpu BPF hashmap and percpu BPF LRU
    hashmap by wiring up the freeing of these kptrs from percpu map
    elements.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20230225154010.391965-2-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:09 +02:00
Jan Stancek e341c7e709 Merge: bpf, xdp: update to 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583

Rebase bpf and xdp to 6.3.

Bugzilla: https://bugzilla.redhat.com/2178930

Signed-off-by: Viktor Malik <vmalik@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jason Wang <jasowang@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-28 07:52:45 +02:00
Viktor Malik 7e487d11fc bpf: Add basic bpf_rb_{root,node} support
Bugzilla: https://bugzilla.redhat.com/2178930

commit 9c395c1b99bd23f74bc628fa000480c49593d17f
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Mon Feb 13 16:40:10 2023 -0800

    bpf: Add basic bpf_rb_{root,node} support
    
    This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
    BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
    types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
    map_values.
    
    structs bpf_rb_root and bpf_rb_node are opaque types meant to
    obscure structs rb_root_cached and rb_node, respectively.
    
    btf_struct_access will prevent BPF programs from touching these special
    fields automatically now that they're recognized.
    
    btf_check_and_fixup_fields now groups list_head and rb_root together as
    "graph root" fields and {list,rb}_node as "graph node", and does same
    ownership cycle checking as before. Note that this function does _not_
    prevent ownership type mixups (e.g. rb_root owning list_node) - that's
    handled by btf_parse_graph_root.
    
    After this patch, a bpf program can have a struct bpf_rb_root in a
    map_value, but not add anything to nor do anything useful with it.
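
    A hedged BPF-program sketch of declaring such a field (the __contains
    annotation follows the selftests' convention for graph roots):

      struct node_data {
              long key;
              struct bpf_rb_node node;
      };

      struct map_value {
              struct bpf_spin_lock lock;
              struct bpf_rb_root root __contains(node_data, node);
      };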
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:29 +02:00
Viktor Malik c4b5de4021 bpf: allow to disable bpf map memory accounting
Bugzilla: https://bugzilla.redhat.com/2178930

commit ee53cbfb1ebf990de0d084a7cd6b67b05fe1f7ac
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Feb 10 15:47:33 2023 +0000

    bpf: allow to disable bpf map memory accounting
    
    We can simply set root memcg as the map's memcg to disable bpf memory
    accounting. bpf_map_area_alloc is a little special as it gets the memcg
    from current rather than from the map, so we need to disable GFP_ACCOUNT
    specifically for it.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:27 +02:00
Viktor Malik 17c5fbcb2c bpf: use bpf_map_kvcalloc in bpf_local_storage
Bugzilla: https://bugzilla.redhat.com/2178930

commit ddef81b5fd1da4d7c3cc8785d2043b73b72f38ef
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Feb 10 15:47:32 2023 +0000

    bpf: use bpf_map_kvcalloc in bpf_local_storage
    
    Introduce new helper bpf_map_kvcalloc() for the memory allocation in
    bpf_local_storage(). Then the allocation will charge the memory from the
    map instead of from current, though currently they are the same thing as
    it is only used in the map creation path now. Charging the allocation to
    the map's memcg makes the accounting clearer.
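
    A hedged sketch of the intended call site in bpf_local_storage map
    allocation (field names assumed; error handling simplified):

      smap->buckets = bpf_map_kvcalloc(&smap->map, nbuckets,
                                       sizeof(*smap->buckets),
                                       GFP_USER | __GFP_NOWARN);
      if (!smap->buckets)
              return -ENOMEM;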
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:27 +02:00
Viktor Malik 9c40fe23e7 bpf: Drop always true do_idr_lock parameter to bpf_map_free_id
Bugzilla: https://bugzilla.redhat.com/2178930

commit 158e5e9eeaa0d7a86f2278313746ef6c8521790d
Author: Tobias Klauser <tklauser@distanz.ch>
Date:   Thu Feb 2 15:19:21 2023 +0100

    bpf: Drop always true do_idr_lock parameter to bpf_map_free_id
    
    The do_idr_lock parameter to bpf_map_free_id was introduced by commit
    bd5f5f4ecb ("bpf: Add BPF_MAP_GET_FD_BY_ID"). However, all callers set
    do_idr_lock = true since commit 1e0bd5a091 ("bpf: Switch bpf_map ref
    counter to atomic64_t so bpf_map_inc() never fails").
    
    While at it also inline __bpf_map_put into its only caller bpf_map_put
    now that do_idr_lock can be dropped from its signature.
    
    Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
    Link: https://lore.kernel.org/r/20230202141921.4424-1-tklauser@distanz.ch
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:21 +02:00
Felix Maurer e96fdaf0aa bpf: Support consuming XDP HW metadata from fext programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit fd7c211d6875013f81acc09868effe199b5d2c0c
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Thu Jan 19 14:15:27 2023 -0800

    bpf: Support consuming XDP HW metadata from fext programs

    Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP
    programs that consume HW metadata, implement support for propagating the
    offload information. The extension program doesn't need to set a flag or
    ifindex, these will just be propagated from the target by the verifier.
    We need to create a separate offload object for the extension program,
    though, since it can be reattached to a different program later (which
    means we can't just inherit the offload information from the target).

    An additional check is added on attach that the new target is compatible
    with the offload information in the extension prog.

    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-9-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:14 +02:00
Felix Maurer e630642b6b bpf: Introduce device-bound XDP programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:24 2023 -0800

    bpf: Introduce device-bound XDP programs

    New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
    to associate a netdev with a BPF program at load time.

    netdevsim checks are dropped in favor of generic check in dev_xdp_attach.
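
    A hedged libbpf-side sketch of loading such a device-bound (but not
    offloaded) program; opts field names as in bpf_prog_load_opts:

      LIBBPF_OPTS(bpf_prog_load_opts, opts,
              .prog_ifindex = ifindex,                  /* netdev to bind to */
              .prog_flags   = BPF_F_XDP_DEV_BOUND_ONLY, /* bound, not offloaded */
      );

      prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "xdp_bound", "GPL",
                              insns, insn_cnt, &opts);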

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:13 +02:00
Felix Maurer c0febc32b2 bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 9d03ebc71a027ca495c60f6e94d3cda81921791f
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:21 2023 -0800

    bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded

    BPF offloading infra will be reused to implement
    bound-but-not-offloaded bpf programs. Rename existing
    helpers for clarity. No functional changes.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:12 +02:00
Viktor Malik 4364d1c4d0 bpf: Remove unused field initialization in bpf's ctl_table
Bugzilla: https://bugzilla.redhat.com/2178930

commit cfca00767febba5f4f5e300fab10e0974491dd4b
Author: Ricardo Ribalda <ribalda@chromium.org>
Date:   Wed Dec 21 20:55:29 2022 +0100

    bpf: Remove unused field initialization in bpf's ctl_table
    
    Maxlen is used by standard proc_handlers such as proc_dointvec(), but in this
    case we have our own proc_handler via bpf_stats_handler(). Therefore, remove
    the initialization.
    
    Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20221221-bpf-syscall-v1-0-9550f5f2c3fc@chromium.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:44:27 +02:00
Ivan Vecera c3640c0d84 bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170

commit 97c4090badca743451c3798f1c1846e9f3f252de
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Oct 26 14:31:10 2022 +0200

    bpf: Remove the obsolte u64_stats_fetch_*_irq() users.

    Now that the 32bit UP oddity is gone and 32bit always uses a sequence
    count, there is no need for the fetch_irq() variants anymore.

    Convert to the regular interface.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/bpf/20221026123110.331690-1-bigeasy@linutronix.de

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-06-08 13:37:01 +02:00
Jerome Marchand af27d51cf6 bpf: remove the do_idr_lock parameter from bpf_prog_free_id()
Bugzilla: https://bugzilla.redhat.com/2177177

commit e7895f017b79410bf4591396a733b876dc1e0e9d
Author: Paul Moore <paul@paul-moore.com>
Date:   Fri Jan 6 10:44:00 2023 -0500

    bpf: remove the do_idr_lock parameter from bpf_prog_free_id()

    It was determined that the do_idr_lock parameter to
    bpf_prog_free_id() was not necessary as it should always be true.

    Suggested-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230106154400.74211-2-paul@paul-moore.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:20 +02:00
Jerome Marchand 1b2a174b5c bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD
Bugzilla: https://bugzilla.redhat.com/2177177

commit ef01f4e25c1760920e2c94f1c232350277ace69b
Author: Paul Moore <paul@paul-moore.com>
Date:   Fri Jan 6 10:43:59 2023 -0500

    bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD

    When changing the ebpf program put() routines to support being called
    from within IRQ context the program ID was reset to zero prior to
    calling the perf event and audit UNLOAD record generators, which
    resulted in problems as the ebpf program ID was bogus (always zero).
    This patch addresses this problem by removing an unnecessary call to
    bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting
    __bpf_prog_put() to only call bpf_prog_free_id() after audit and perf
    have finished their bpf program unload tasks in
    bpf_prog_put_deferred().  For the record, no one can determine, or
    remember, why it was necessary to free the program ID, and remove it
    from the IDR, prior to executing bpf_prog_put_deferred();
    regardless, both Stanislav and Alexei agree that the approach in this
    patch should be safe.

    It is worth noting that when moving the bpf_prog_free_id() call, the
    do_idr_lock parameter was forced to true as the ebpf devs determined
    this was correct, as do_idr_lock should always be true.  The
    do_idr_lock parameter will be removed in a follow-up patch, but it
    was kept here to keep the patch small in an effort to ease any stable
    backports.

    I also modified the bpf_audit_prog() logic used to associate the
    AUDIT_BPF record with other associated records, e.g. @ctx != NULL.
    Instead of keying off the operation, it now keys off the execution
    context, e.g. '!in_irq && !irqs_disabled()', which is much more
    appropriate and should help better connect the UNLOAD operations with
    the associated audit state (other audit records).

    Cc: stable@vger.kernel.org
    Fixes: d809e134be7a ("bpf: Prepare bpf_prog_put() to be called from irq context.")
    Reported-by: Burn Alting <burn.alting@iinet.net.au>
    Reported-by: Jiri Olsa <olsajiri@gmail.com>
    Suggested-by: Stanislav Fomichev <sdf@google.com>
    Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230106154400.74211-1-paul@paul-moore.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:20 +02:00
Jerome Marchand de6eb19233 bpf: Add comments for map BTF matching requirement for bpf_list_head
Bugzilla: https://bugzilla.redhat.com/2177177

commit c22dfdd21592c5d56b49d5fba8de300ad7bf293c
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:26:08 2022 +0530

    bpf: Add comments for map BTF matching requirement for bpf_list_head

    The old behavior of bpf_map_meta_equal was that it compared timer_off
    to be equal (but not spin_lock_off, because that was not allowed), and
    did memcmp of kptr_off_tab.

    Now, we memcmp the btf_record of two bpf_map structs, which has all
    fields.

    We preserve backwards compat as we kzalloc the array, so if only spin
    lock and timer exist in map, we only compare offset while the rest of
    unused members in the btf_field struct are zeroed out.

    In case of kptr, btf and everything else is of vmlinux or module, so as
    long as the type is the same it will match, since kernel btf, module, dtor pointer
    will be same across maps.

    Now with list_head in the mix, things are a bit complicated. We
    implicitly add a requirement that both BTFs are same, because struct
    btf_field_list_head has btf and value_rec members.

    We obviously shouldn't force BTFs to be equal by default, as that breaks
    backwards compatibility.

    Currently it is only implicitly required due to list_head matching
    struct btf and value_rec member. value_rec points back into a btf_record
    stashed in the map BTF (btf member of btf_field_list_head). So that
    pointer and btf member has to match exactly.

    Document all these subtle details so that things don't break in the
    future when touching this code.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:07 +02:00
Jerome Marchand 5fb8030979 bpf: Verify ownership relationships for user BTF types
Bugzilla: https://bugzilla.redhat.com/2177177

commit 865ce09a49d79d2b2c1d980f4c05ffc0b3517bdc
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:57 2022 +0530

    bpf: Verify ownership relationships for user BTF types

    Ensure that there can be no ownership cycles among different types by
    way of having owning objects that can hold some other type as their
    element. For instance, a map value can only hold allocated objects, but
    these are allowed to have another bpf_list_head. To prevent unbounded
    recursion while freeing resources, elements of bpf_list_head in local
    kptrs can never have a bpf_list_head which are part of list in a map
    value. Later patches will verify this by having dedicated BTF selftests.

    Also, to make runtime destruction easier, once btf_struct_metas is fully
    populated, we can stash the metadata of the value type directly in the
    metadata of the list_head fields, as that allows easier access to the
    value type's layout to destruct it at runtime from the btf_field entry
    of the list head itself.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand ef745b384b bpf: Recognize lock and list fields in allocated objects
Bugzilla: https://bugzilla.redhat.com/2177177

commit 8ffa5cc142137a59d6a10eb5273fa2ba5dcd4947
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:56 2022 +0530

    bpf: Recognize lock and list fields in allocated objects

    Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in a
    allocated object.

    Also update btf_struct_access to reject direct access to these special
    fields.

    A bpf_list_head allows implementing map-in-map style use cases, where an
    allocated object with bpf_list_head is linked into a list in a map
    value. This would require embedding a bpf_list_node, support for which
    is also included. The bpf_spin_lock is used to protect the bpf_list_head
    and other data.

    While we don't strictly require holding a bpf_spin_lock while touching
    the bpf_list_head in such objects, as when we have access to it, we have
    complete ownership of the object, the locking constraint is still kept
    and may be conditionally lifted in the future.

    Note that the specification of such types can be done just like map
    values, e.g.:

    struct bar {
    	struct bpf_list_node node;
    };

    struct foo {
    	struct bpf_spin_lock lock;
    	struct bpf_list_head head __contains(bar, node);
    	struct bpf_list_node node;
    };

    struct map_value {
    	struct bpf_spin_lock lock;
    	struct bpf_list_head head __contains(foo, node);
    };

    To recognize such types in user BTF, we build a btf_struct_metas array
    of metadata items corresponding to each BTF ID. This is done once during
    the btf_parse stage to avoid having to do it each time during the
    verification process's requirement to inspect the metadata.

    Moreover, the computed metadata needs to be passed to some helpers in
    future patches which requires allocating them and storing them in the
    BTF that is pinned by the program itself, so that valid access can be
    assumed to such data during program runtime.

    A key thing to note is that once a btf_struct_meta is available for a
    type, both the btf_record and btf_field_offs should be available. It is
    critical that btf_field_offs is available in case special fields are
    present, as we extensively rely on special fields being zeroed out in
    map values and allocated objects in later patches. The code ensures that
    by bailing out in case of errors and ensuring both are available
    together. If the record is not available, the special fields won't be
    recognized, so not having both is also fine (in terms of being a
    verification error and not a runtime bug).

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand 1a6464875f bpf: Do btf_record_free outside map_free callback
Bugzilla: https://bugzilla.redhat.com/2177177

commit d7f5ef653c3dd0c0d649cae6ef2708053bb1fb2b
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:52 2022 +0530

    bpf: Do btf_record_free outside map_free callback

    Since the commit being fixed, we now miss freeing btf_record for local
    storage maps which will have a btf_record populated in case they have
    bpf_spin_lock element.

    This was missed because I made the choice of offloading the job to free
    kptr_off_tab (now btf_record) to the map_free callback when adding
    support for kptrs.

    Revisiting the reason for this decision, there is the possibility that
    the btf_record gets used inside the map_free callback (e.g. in case of
    maps embedding kptrs) to iterate over the fields and free them, hence
    freeing it before the map_free callback would leak special field memory
    and cause invalid memory accesses. The btf_record keeps module
    references, which is critical to ensure the dtor call made for a
    referenced kptr is safe to do.

    If we do it after the map_free callback, the map area is already freed,
    so we cannot access the bpf_map structure anymore.

    To fix this and prevent such lapses in the future, move
    bpf_map_free_record out of the map_free callback, and do it after
    map_free by remembering the btf_record pointer. There is no need to
    access the bpf_map structure in that case, and we avoid missing this
    case when support for new map types with other special fields is added.

    Since a btf_record and its btf_field_offs are used together, for
    consistency delay freeing of field_offs as well. While not a problem
    right now, a lot of code assumes that either both record and field_offs
    are set or neither is.

    Note that in case of map of maps (outer maps), inner_map_meta->record is
    only used during verification, not to free fields in map value, hence we
    simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
    never touch map->inner_map_meta in bpf_map_free_deferred.

    Add a comment making note of these details.
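
    A sketch of the resulting teardown ordering (simplified, not the
    verbatim upstream code):

    static void bpf_map_free_deferred(struct work_struct *work)
    {
            struct bpf_map *map = container_of(work, struct bpf_map, work);
            struct btf_field_offs *foffs = map->field_offs;
            struct btf_record *rec = map->record;

            security_bpf_map_free(map);
            bpf_map_release_memcg(map);
            /* map_free may still walk rec, e.g. to free kptrs embedded in values */
            map->ops->map_free(map);
            /* The bpf_map area is gone now; free the remembered pointers last. */
            kfree(foffs);
            btf_record_free(rec);
    }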

    Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record")
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand 774d15cf41 bpf: Fix early return in map_check_btf
Bugzilla: https://bugzilla.redhat.com/2177177

commit c237bfa5283a562cd5d74dd74b2d9016acd97f45
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:25:51 2022 +0530

    bpf: Fix early return in map_check_btf

    Instead of returning directly with -EOPNOTSUPP for the timer case, we
    need to free the btf_record before returning to userspace.

    Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record")
    Reported-by: Dan Carpenter <error27@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:06 +02:00
Jerome Marchand b928a49814 bpf: Pass map file to .map_update_batch directly
Bugzilla: https://bugzilla.redhat.com/2177177

commit 3af43ba4c6019b29c048921eb8147eb010165329
Author: Hou Tao <houtao1@huawei.com>
Date:   Wed Nov 16 15:50:58 2022 +0800

    bpf: Pass map file to .map_update_batch directly

    Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
    the target map file, then it invokes generic_map_update_batch() to do
    batch update. generic_map_update_batch() will get the target map file
    by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().

    The problem is that the map file returned by the second fdget() may be
    NULL or a totally different file compared to the map file obtained in
    bpf_map_do_batch(). The reason is that the first fdget() only guarantees
    the liveness of the struct file, not of the file descriptor, and the
    file descriptor may be released by a concurrent close() through
    pick_file().

    This doesn't cause any problem for now, because maps with batch update
    support don't use the map file in the .map_fd_get_ptr() ops. But it is
    better to fix the potential access of an invalid map file.

    Using __bpf_map_get() again in generic_map_update_batch() cannot fix the
    problem, because batch.map_fd may be closed and reopened, and the
    returned map file may be different from the map file obtained in
    bpf_map_do_batch(), so just pass the map file directly to
    .map_update_batch() in bpf_map_do_batch().
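
    One plausible shape of the resulting interface (illustrative; the exact
    parameter order in the tree may differ):

    /* in struct bpf_map_ops */
    int (*map_update_batch)(struct bpf_map *map, struct file *map_file,
                            const union bpf_attr *attr,
                            union bpf_attr __user *uattr);

    /* bpf_map_do_batch() forwards the file it already holds from fdget(): */
    err = map->ops->map_update_batch(map, f.file, attr, uattr);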

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:05 +02:00
Jerome Marchand d03c51f6bc bpf: Support bpf_list_head in map values
Bugzilla: https://bugzilla.redhat.com/2177177

commit f0c5941ff5b255413d31425bb327c2aec3625673
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Nov 15 00:45:25 2022 +0530

    bpf: Support bpf_list_head in map values

    Add the support on the map side to parse, recognize, verify, and build
    metadata table for a new special field of the type struct bpf_list_head.
    To parameterize the bpf_list_head for a certain value type and the
    list_node member it will accept in that value type, we use BTF
    declaration tags.

    The definition of bpf_list_head in a map value will be done as follows:

    struct foo {
    	struct bpf_list_node node;
    	int data;
    };

    struct map_value {
    	struct bpf_list_head head __contains(foo, node);
    };

    Then, the bpf_list_head only allows adding to the list 'head' using the
    bpf_list_node 'node' for the type struct foo.

    The 'contains' annotation is a BTF declaration tag of the form
    "contains:name:node", where the name is used to look up the type in the
    map BTF, with its kind hardcoded to BTF_KIND_STRUCT during the lookup.
    The node part gives the name of the member in that type which has the
    type struct bpf_list_node and is actually used for linking into the
    linked list. For now, the 'kind' part is hardcoded as struct.
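
    A minimal sketch of how such a tag can be emitted from C source (this
    mirrors the __contains helper used in the example above; treat the exact
    definition as illustrative):

    #define __contains(name, node) \
            __attribute__((btf_decl_tag("contains:" #name ":" #node)))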

    This allows building intrusive linked lists in BPF, using container_of
    to obtain pointer to entry, while being completely type safe from the
    perspective of the verifier. The verifier knows exactly the type of the
    nodes, and knows that list helpers return that type at some fixed offset
    where the bpf_list_node member used for this list exists. The verifier
    also uses this information to disallow adding types that are not
    accepted by a certain list.

    For now, no elements can be added to such lists. Support for that is
    coming in future patches, hence draining and freeing items is done with
    a TODO that will be resolved in a future patch.

    Note that the bpf_list_head_free function moves the list out to a local
    variable under the lock and then releases the lock, doing the actual
    draining of the list items outside of it. Besides not holding the lock
    for too long and pessimizing other concurrent list operations, this is
    also necessary for deadlock prevention: unless every function called in
    the critical section were notrace, a fentry/fexit program could attach
    and call bpf_map_update_elem again on the map, leading to the same lock
    being acquired if the key matches and thus to a deadlock. While this
    requires some special effort on the part of the BPF programmer to
    trigger and is highly unlikely to occur in practice, it is always better
    if we can avoid such a condition.

    While notrace would prevent this, doing the draining outside the lock
    has advantages of its own, hence it is used to also fix the deadlock
    related problem.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:04 +02:00
Jerome Marchand e9b5bda40b bpf: Refactor map->off_arr handling
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Minor changes from already backported commit 1f6e04a1c7b8
("bpf: Fix offset calculation error in __copy_map_value and
zero_map_value")

commit f71b2f64177a199d5b1d2047e155d45fd98f564a
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:57 2022 +0530

    bpf: Refactor map->off_arr handling

    Refactor map->off_arr handling into generic functions that can work on
    their own without hardcoding map-specific code. The btf_field_offs
    structure is now returned from btf_parse_field_offs, which can be reused
    later for types in program BTF.

    All functions like copy_map_value, zero_map_value call generic
    underlying functions so that they can also be reused later for copying
    to values allocated in programs which encode specific fields.

    Later, some helper functions will also require access to this
    btf_field_offs structure to be able to skip over special fields at
    runtime.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand 2b8a340165 bpf: Consolidate spin_lock, timer management into btf_record
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from already backported commit 997849c4b969
("bpf: Zeroing allocated object from slab in bpf memory allocator")

commit db559117828d2448fe81ada051c60bcf39f822e9
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:56 2022 +0530

    bpf: Consolidate spin_lock, timer management into btf_record

    Now that kptr_off_tab has been refactored into btf_record, and can hold
    more than one specific field type, accommodate bpf_spin_lock and
    bpf_timer as well.

    While they don't require any more metadata than offset, having all
    special fields in one place allows us to share the same code for
    allocated user defined types and handle both map values and these
    allocated objects in a similar fashion.

    As an optimization, we still keep spin_lock_off and timer_off offsets in
    the btf_record structure, just to avoid having to find the btf_field
    struct each time their offset is needed. This is mostly needed to
    manipulate such objects in a map value at runtime. It's ok to hardcode
    just one offset as more than one field is disallowed.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand 40100e4a5a bpf: Refactor kptr_off_tab into btf_record
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts:
 - Context change from already backported commit 997849c4b969 ("bpf:
Zeroing allocated object from slab in bpf memory allocator")
 - Minor changes from already backported commit 1f6e04a1c7b8 ("bpf:
Fix offset calculation error in __copy_map_value and zero_map_value")

commit aa3496accc412b3d975e4ee5d06076d73394d8b5
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 4 00:39:55 2022 +0530

    bpf: Refactor kptr_off_tab into btf_record

    To prepare the BPF verifier to handle special fields in both map values
    and program allocated types coming from program BTF, we need to refactor
    the kptr_off_tab handling code into something more generic and reusable
    across both cases to avoid code duplication.

    Later patches also require passing this data to helpers at runtime, so
    that they can work on user defined types, initialize them, destruct
    them, etc.

    The main observation is that both map values and such allocated types
    point to a type in program BTF, hence they can be handled similarly. We
    can prepare a field metadata table for both cases and store them in
    struct bpf_map or struct btf depending on the use case.

    Hence, refactor the code into generic btf_record and btf_field member
    structs. The btf_record represents the fields of a specific btf_type in
    user BTF. The cnt indicates the number of special fields we successfully
    recognized, and field_mask is a bitmask of fields that were found, to
    enable quick determination of availability of a certain field.
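
    A simplified sketch of the generic structures described here (per-field
    data trimmed for illustration):

    struct btf_field {
            u32 offset;               /* offset of the special field within the value */
            enum btf_field_type type; /* e.g. kptr, and later spin_lock, timer, ... */
            /* per-type data (such as kptr BTF info) lives in a union here */
    };

    struct btf_record {
            u32 cnt;        /* number of special fields recognized */
            u32 field_mask; /* bitmask of which special field types are present */
            struct btf_field fields[];
    };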

    Subsequently, refactor the rest of the code to work with these generic
    types, remove assumptions about kptr and kptr_off_tab, rename variables
    to more meaningful names, etc.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:01 +02:00
Jerome Marchand dcf538d57d bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from missing commit 7f203bc89eb6 ("cgroup:
Replace cgroup->ancestor_ids[] with ->ancestors[]")

commit c4bcfb38a95edb1021a53f2d0356a78120ecfbe4
Author: Yonghong Song <yhs@fb.com>
Date:   Tue Oct 25 21:28:50 2022 -0700

    bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

    Similar to sk/inode/task storage, implement similar cgroup local storage.

    There already exists a local storage implementation for cgroup-attached
    bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
    bpf_get_local_storage(). But there are use cases where non-cgroup
    attached bpf progs want to access cgroup local storage data. For example,
    a tc egress prog has access to sk and cgroup. It is possible to use
    sk local storage to emulate cgroup local storage by storing data in the socket.
    But this is wasteful, as there could be lots of sockets belonging to a particular
    cgroup. Alternatively, a separate map can be created with cgroup id as the key.
    But this will introduce additional overhead to manipulate the new map.
    A cgroup local storage, similar to existing sk/inode/task storage,
    should help for this use case.

    The life-cycle of storage is managed with the life-cycle of the
    cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
    with a call to bpf_cgrp_storage_free() when cgroup itself
    is deleted.

    The userspace map operations can be done by using a cgroup fd as a key
    passed to the lookup, update and delete operations.

    Typically, the following code is used to get the current cgroup:
        struct task_struct *task = bpf_get_current_task_btf();
        ... task->cgroups->dfl_cgrp ...
    and in structure task_struct definition:
        struct task_struct {
            ....
            struct css_set __rcu            *cgroups;
            ....
        }
    With a sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
    So the current implementation only supports non-sleepable programs; supporting
    sleepable programs will be the next step, together with adding rcu_read_lock
    protection for rcu-tagged structures.

    Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for the old cgroup local
    storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
    for cgroup storage available to non-cgroup-attached bpf programs. The old
    cgroup storage supports the bpf_get_local_storage() helper to get the cgroup data.
    The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
    functionality. While the old cgroup storage pre-allocates storage memory, the new
    mechanism can also pre-allocate with a user space bpf_map_update_elem() call
    to avoid potential run-time memory allocation failure.
    Therefore, the new cgroup storage can provide all the functionality of
    the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased to
    BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup storage can
    be deprecated, since the new one provides the same functionality.
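
    A short BPF-program-side usage sketch (the map type, helper and flag
    names follow the description above; the program itself is illustrative):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    struct {
            __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, long);
    } cgrp_counter SEC(".maps");

    SEC("tp_btf/sys_enter")
    int BPF_PROG(count_syscalls, struct pt_regs *regs, long id)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            long *val;

            /* non-sleepable program, so task->cgroups->dfl_cgrp is safe as noted above */
            val = bpf_cgrp_storage_get(&cgrp_counter, task->cgroups->dfl_cgrp,
                                       0, BPF_LOCAL_STORAGE_GET_F_CREATE);
            if (val)
                    __sync_fetch_and_add(val, 1);
            return 0;
    }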

    Acked-by: David Vernet <void@manifault.com>
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:58 +02:00
Jerome Marchand 5a72475fc0 bpf: Remove prog->active check for bpf_lsm and bpf_iter
Bugzilla: https://bugzilla.redhat.com/2177177

commit 271de525e1d7f564e88a9d212c50998b49a54476
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Tue Oct 25 11:45:16 2022 -0700

    bpf: Remove prog->active check for bpf_lsm and bpf_iter

    The commit 64696c40d03c ("bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline")
    removed the prog->active check for struct_ops progs.  The bpf_lsm
    and bpf_iter programs also use trampolines.  Like struct_ops, the bpf_lsm
    and bpf_iter programs have fixed hooks to attach to.  The
    kernel does not call the same hook in a recursive way.
    This patch also removes the prog->active check for
    bpf_lsm and bpf_iter.

    A later patch has a test to reproduce the recursion issue
    for a sleepable bpf_lsm program.

    This patch appends the '_recur' naming to the existing
    enter and exit functions that track the prog->active counter.
    New __bpf_prog_{enter,exit}[_sleepable] functions are
    added to skip the prog->active tracking. The '_struct_ops'
    version is also removed.

    It also moves the decision on picking the enter and exit function to
    the new bpf_trampoline_{enter,exit}().  It returns the '_recur' ones
    for all tracing progs to use.  For bpf_lsm, bpf_iter,
    struct_ops (no prog->active tracking after 64696c40d03c), and
    bpf_lsm_cgroup (no prog->active tracking after 69fd337a975c7),
    it will return the functions that don't track the prog->active.

    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20221025184524.3526117-2-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:57 +02:00
Artem Savkov fee78f87aa bpf: Prevent bpf program recursion for raw tracepoint probes
Bugzilla: https://bugzilla.redhat.com/2166911

commit 05b24ff9b2cfabfcfd951daaa915a036ab53c9e1
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Fri Sep 16 09:19:14 2022 +0200

    bpf: Prevent bpf program recursion for raw tracepoint probes
    
    We got a report from syzbot [1] about warnings that were caused by
    a bpf program attached to the contention_begin raw tracepoint triggering
    the same tracepoint by using the bpf_trace_printk helper, which takes
    the trace_printk_lock lock.
    
     Call Trace:
      <TASK>
      ? trace_event_raw_event_bpf_trace_printk+0x5f/0x90
      bpf_trace_printk+0x2b/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      bpf_trace_printk+0x3f/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      bpf_trace_printk+0x3f/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      bpf_trace_printk+0x3f/0xe0
      bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
      bpf_trace_run2+0x26/0x90
      native_queued_spin_lock_slowpath+0x1c6/0x2b0
      _raw_spin_lock_irqsave+0x44/0x50
      __unfreeze_partials+0x5b/0x160
      ...
    
    This can be reproduced by attaching a bpf program as a raw tracepoint on
    the contention_begin tracepoint. The bpf prog calls the bpf_trace_printk
    helper. Then by running perf bench, the spin lock code is forced to
    take the slow path and call the contention_begin tracepoint.
    
    Fix this by skipping execution of the bpf program if it's
    already running, using the bpf prog 'active' field, which is
    currently used by trampoline programs for the same reason.
    
    Move bpf_prog_inc_misses_counter to syscall.c because
    trampoline.c is only compiled when the CONFIG_BPF_JIT option is set.
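
    A sketch of the guard on the raw tracepoint run path (assuming the
    per-CPU prog->active counter mentioned above; simplified):

            if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                    bpf_prog_inc_misses_counter(prog);
                    goto out;
            }
            rcu_read_lock();
            (void) bpf_prog_run(prog, args);
            rcu_read_unlock();
    out:
            this_cpu_dec(*(prog->active));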
    
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com
    [1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:39 +01:00
Artem Savkov 42bf3eaa55 bpf: use kvmemdup_bpfptr helper
Bugzilla: https://bugzilla.redhat.com/2166911

commit a02c118ee9e898612cbae42121b9e8663455b515
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Tue Sep 13 16:40:33 2022 +0800

    bpf: use kvmemdup_bpfptr helper
    
    Use kvmemdup_bpfptr helper instead of open-coding to
    simplify the code.
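
    An illustrative before/after of the pattern being replaced (a sketch,
    not the exact diff):

    /* before: open-coded */
    value = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
    if (!value)
            return -ENOMEM;
    if (copy_from_bpfptr(value, uvalue, value_size)) {
            kvfree(value);
            return -EFAULT;
    }

    /* after: helper */
    value = kvmemdup_bpfptr(uvalue, value_size);
    if (IS_ERR(value))
            return PTR_ERR(value);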
    
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/1663058433-14089-1-git-send-email-wangyufen@huawei.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:13 +01:00
Artem Savkov 7f2b6b92f7 bpf: Ensure correct locking around vulnerable function find_vpid()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 83c10cc362d91c0d8d25e60779ee52fdbbf3894d
Author: Lee Jones <lee@kernel.org>
Date:   Mon Sep 12 14:38:55 2022 +0100

    bpf: Ensure correct locking around vulnerable function find_vpid()
    
    The documentation for find_vpid() clearly states:
    
      "Must be called with the tasklist_lock or rcu_read_lock() held."
    
    Presently we do neither for find_vpid() instance in bpf_task_fd_query().
    Add proper rcu_read_lock/unlock() to fix the issue.
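
    A sketch of the fixed lookup in bpf_task_fd_query() (simplified):

    rcu_read_lock();
    task = get_pid_task(find_vpid(pid), PIDTYPE_PID);
    rcu_read_unlock();
    /* get_pid_task() takes a reference, so task stays valid after unlock */
    if (!task)
            return -ENOENT;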
    
    Fixes: 41bdc4b40e ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY")
    Signed-off-by: Lee Jones <lee@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20220912133855.1218900-1-lee@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:12 +01:00
Artem Savkov 4b73ddbd13 bpf: Support kptrs in percpu arraymap
Bugzilla: https://bugzilla.redhat.com/2166911

commit 6df4ea1ff0ff70798ff1e7eed79f98ccb7b5b0a2
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sun Sep 4 22:41:15 2022 +0200

    bpf: Support kptrs in percpu arraymap
    
    Enable support for kptrs in percpu BPF arraymap by wiring up the freeing
    of these kptrs from percpu map elements.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:10 +01:00
Artem Savkov ade3f4aa53 bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
Bugzilla: https://bugzilla.redhat.com/2166911

commit 8d5a8011b35d387c490a5c977b1d9eb4798aa071
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Fri Sep 2 14:10:51 2022 -0700

    bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
    
    SLAB_TYPESAFE_BY_RCU makes kmem_caches non-mergeable and slows down
    kmem_cache_destroy. All bpf_mem_cache instances are safe to share across different maps
    and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change
    solves the memory consumption issue, avoids kmem_cache_destroy latency and
    keeps bpf hash map performance the same.
    
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220902211058.60789-10-alexei.starovoitov@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:07 +01:00
Artem Savkov 5c9c04c9ae bpf: prepare for more bpf syscall to be used from kernel and user space.
Bugzilla: https://bugzilla.redhat.com/2166911

commit b88df6979682333815536a0bf43bd56f9499f071
Author: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Date:   Wed Aug 24 15:40:36 2022 +0200

    bpf: prepare for more bpf syscall to be used from kernel and user space.
    
    Add BPF_MAP_GET_FD_BY_ID and BPF_MAP_DELETE_PROG.
    
    Only BPF_MAP_GET_FD_BY_ID needs to be amended to be able
    to access the bpf pointer either from the userspace or the kernel.
    
    Acked-by: Yonghong Song <yhs@fb.com>
    Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
    Link: https://lore.kernel.org/r/20220824134055.1328882-7-benjamin.tissoires@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:04 +01:00
Artem Savkov ad5fbfae98 bpf: prevent leak of lsm program after failed attach
Bugzilla: https://bugzilla.redhat.com/2137876

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit e89f3edffb860a0f54a9ed16deadb7a4a1fa3862
Author: Milan Landaverde <milan@mdaverde.com>
Date:   Tue Dec 13 12:57:14 2022 -0500

    bpf: prevent leak of lsm program after failed attach

    In [0], we added the ability to bpf_prog_attach LSM programs to cgroups,
    but in our validation to make sure the prog is meant to be attached to
    BPF_LSM_CGROUP, we return too early if the check fails. This results in
    the prog's refcnt not being decremented (through bpf_prog_put),
    leaving the LSM program alive past the point of its expected lifecycle.
    This fix allows for the decrement to take place.

    [0] https://lore.kernel.org/all/20220628174314.1216643-4-sdf@google.com/

    Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
    Signed-off-by: Milan Landaverde <milan@mdaverde.com>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20221213175714.31963-1-milan@mdaverde.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:50:16 +01:00
Artem Savkov 1ec0942151 bpf: Restrict bpf_sys_bpf to CAP_PERFMON
Bugzilla: https://bugzilla.redhat.com/2137876

commit 14b20b784f59bdd95f6f1cfb112c9818bcec4d84
Author: YiFei Zhu <zhuyifei@google.com>
Date:   Tue Aug 16 20:55:16 2022 +0000

    bpf: Restrict bpf_sys_bpf to CAP_PERFMON
    
    The verifier cannot perform sufficient validation of any pointers passed
    into bpf_attr and treats them as integers rather than pointers. The helper
    will then read from arbitrary pointers passed into it. Restrict the helper
    to CAP_PERFMON since the security model in BPF of arbitrary kernel read is
    CAP_BPF + CAP_PERFMON.
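
    A sketch of the gating, assuming the check sits where the helper proto
    is handed out to BPF_PROG_TYPE_SYSCALL programs:

    case BPF_FUNC_sys_bpf:
            return !perfmon_capable() ? NULL : &bpf_sys_bpf_proto;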
    
    Fixes: af2ac3e13e ("bpf: Prepare bpf syscall to be used from kernel and user space.")
    Signed-off-by: YiFei Zhu <zhuyifei@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220816205517.682470-1-zhuyifei@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:48 +01:00
Artem Savkov 74fbc8786e bpf: Shut up kern_sys_bpf warning.
Bugzilla: https://bugzilla.redhat.com/2137876

commit 4e4588f1c4d2e67c993208f0550ef3fae33abce4
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Aug 10 23:52:28 2022 -0700

    bpf: Shut up kern_sys_bpf warning.
    
    Shut up this warning:
    kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
    int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
    
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:47 +01:00
Artem Savkov 602a48c545 bpf: Use proper target btf when exporting attach_btf_obj_id
Bugzilla: https://bugzilla.redhat.com/2137876

commit 6644aabbd8973a9f8008cabfd054a36b69a3a3f5
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Aug 4 13:11:39 2022 -0700

    bpf: Use proper target btf when exporting attach_btf_obj_id
    
    When attaching to a program, the program itself might not be attached
    to anything (and, hence, might not have attach_btf), so we can't
    unconditionally use 'prog->aux->dst_prog->aux->attach_btf'.
    
    Instead, use bpf_prog_get_target_btf to pick proper target BTF:
    
      * when attached to dst_prog, use dst_prog->aux->btf
      * when attached to kernel btf, use prog->aux->attach_btf
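
    A sketch of that selection (simplified):

    static struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog)
    {
            if (prog->aux->dst_prog)
                    return prog->aux->dst_prog->aux->btf;
            return prog->aux->attach_btf;
    }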
    
    Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP")
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Hao Luo <haoluo@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220804201140.1340684-1-sdf@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:46 +01:00
Artem Savkov 97e5df6f0b bpf: reparent bpf maps on memcg offlining
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: conflict with rhel-only commit 1f50357d24 "mm/memcg:
Exclude mem_cgroup pointer from kABI signature computation". mem_cgroup
is replaced by obj_cgroup, excluding it the same way as previous struct
was excluded.

commit 4201d9ab3e42d9e2a20320b751a931e6239c0df2
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Mon Jul 11 09:28:27 2022 -0700

    bpf: reparent bpf maps on memcg offlining

    The memory consumed by a bpf map is always accounted to the memory
    cgroup of the process which created the map. The map can outlive
    the memory cgroup if it's used by processes in other cgroups or
    is pinned on bpffs. In this case the map pins the original cgroup
    in the dying state.

    For other types of objects (slab objects, non-slab kernel allocations,
    percpu objects and recently LRU pages) there is a reparenting process
    implemented: on cgroup offlining charged objects are getting
    reassigned to the parent cgroup. Because all charges and statistics
    are fully recursive it's a fairly cheap operation.

    For efficiency and consistency with other types of objects, let's do
    the same for bpf maps. Fortunately thanks to the objcg API, the
    required changes are minimal.

    Please note that individual allocations (slabs, percpu and large
    kmallocs) already have the reparenting mechanism. This commit adds
    it for the saved map->memcg pointer by replacing it with map->objcg.
    Because dying cgroups are not visible to a user and all charges are
    recursive, this commit doesn't bring any behavior changes for a user.

    v2:
      added a missing const qualifier

    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Link: https://lore.kernel.org/r/20220711162827.184743-1-roman.gushchin@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:38 +01:00
Artem Savkov 0e3c3201fc bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
Bugzilla: https://bugzilla.redhat.com/2137876

commit b79c9fc9551b45953a94abf550b7bd3b00e3a0f9
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:08 2022 -0700

    bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
    
    We have two options:
    1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
    2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point
    
    I was doing (2) in the original patch, but switching to (1) here:
    
    * bpf_prog_query returns all attached BPF_LSM_CGROUP programs
    regardless of attach_btf_id
    * attach_btf_id is exported via bpf_prog_info
    
    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 9a33161b25 bpf: per-cgroup lsm flavor
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: already applied 65d9ecfe0ca73 "bpf: Fix ref_obj_id for dynptr
data slices in verifier"

commit 69fd337a975c7e690dfe49d9cb4fe5ba1e6db44e
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:06 2022 -0700

    bpf: per-cgroup lsm flavor

    Allow attaching to lsm hooks in the cgroup context.

    Attaching to per-cgroup LSM works exactly like attaching
    to other per-cgroup hooks. A new BPF_LSM_CGROUP attach type is added
    to trigger the new mode; the actual lsm hook we attach to is
    signaled via the existing attach_btf_id.

    For the hooks that have 'struct socket' or 'struct sock' as its first
    argument, we use the cgroup associated with that socket. For the rest,
    we use 'current' cgroup (this is all on default hierarchy == v2 only).
    Note that for some hooks that work on 'struct sock' we still
    take the cgroup from 'current' because some of them work on the socket
    that hasn't been properly initialized yet.
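
    A minimal program-side sketch (hook name and return convention as used
    in the selftests; the usual vmlinux.h/bpf_tracing.h includes and license
    section are assumed):

    SEC("lsm_cgroup/socket_bind")
    int BPF_PROG(socket_bind, struct socket *sock, struct sockaddr *address,
                 int addrlen)
    {
            return 1; /* 1 allows the operation for this cgroup, 0 rejects it */
    }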

    Behind the scenes, we allocate a shim program that is attached
    to the trampoline and runs cgroup effective BPF programs array.
    This shim has some rudimentary ref counting and can be shared
    between several programs attaching to the same lsm hook from
    different cgroups.

    Note that this patch bloats cgroup size because we add 211
    cgroup_bpf_attach_type(s) for simplicity's sake. This will be
    addressed in the subsequent patch.

    Also note that we only add non-sleepable flavor for now. To enable
    sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
    shim programs have to be freed via trace rcu, cgroup_bpf.effective
    should be also trace-rcu-managed + maybe some other changes that
    I'm not aware of.

    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 00639ab4bd bpf: Unify data extension operation of jited_ksyms and jited_linfo
Bugzilla: https://bugzilla.redhat.com/2137876

commit 2cd008522707a59bf38c1f45d5c654eddbb86c20
Author: Pu Lehui <pulehui@huawei.com>
Date:   Mon May 30 17:28:10 2022 +0800

    bpf: Unify data extension operation of jited_ksyms and jited_linfo
    
    We found that a 32-bit environment cannot print BPF line info due to a data
    inconsistency between jited_ksyms[0] and jited_linfo[0].
    
    For example:
    
      jited_ksyms[0] = 0xb800067c, jited_linfo[0] = 0xffffffffb800067c
    
    We know that both of them store the BPF func address, but due to the different
    data extension operations when extended to u64, they may not be the same.
    We need to unify their data extension operations.
    
    Signed-off-by: Pu Lehui <pulehui@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/CAEf4BzZ-eDcdJZgJ+Np7Y=V-TVjDDvOMqPwzKjyWrh=i5juv4w@mail.gmail.com
    Link: https://lore.kernel.org/bpf/20220530092815.1112406-2-pulehui@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:28 +01:00
Yauheni Kaliuta 20067c525d bpf: Fix non-static bpf_func_proto struct definitions
Bugzilla: http://bugzilla.redhat.com/2120968

commit dc368e1c658e4f478a45e8d1d5b0c8392ca87506
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Thu Jun 16 15:54:07 2022 -0700

    bpf: Fix non-static bpf_func_proto struct definitions

    This patch does two things:

    1) Marks the dynptr bpf_func_proto structs that were added in [1]
       as static, as pointed out by the kernel test robot in [2].

    2) There are some bpf_func_proto structs marked as extern which can
       instead be statically defined.

      [1] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
      [2] https://lore.kernel.org/bpf/62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:09 +02:00
Yauheni Kaliuta ff04690df8 bpf: Fix resetting logic for unreferenced kptrs
Bugzilla: https://bugzilla.redhat.com/2120968

commit 9fad7fe5b29803584c7f17a2abe6c2936fec6828
Author: Jules Irenge <jbi.octave@gmail.com>
Date:   Wed Sep 7 16:24:20 2022 +0100

    bpf: Fix resetting logic for unreferenced kptrs
    
    Sparse reported a warning at bpf_map_free_kptrs()
    "warning: Using plain integer as NULL pointer"
    During the process of fixing this warning, it was discovered that the current
    code erroneously writes to the pointer variable instead of dereferencing and
    writing to the actual kptr. Hence, the Sparse tool accidentally helped to uncover
    this problem. Fix this by doing WRITE_ONCE(*p, 0) instead of WRITE_ONCE(p, 0).
    
    Note that the effect of this bug is that unreferenced kptrs will not be cleared
    during check_and_free_fields. It is not a problem if the clearing is not done
    during map_free stage, as there is nothing to free for them.
    
    Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
    Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
    Link: https://lore.kernel.org/r/Yxi3pJaK6UDjVJSy@playground
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:08 +02:00
Yauheni Kaliuta d0d03e9325 bpf: refine kernel.unprivileged_bpf_disabled behaviour
Bugzilla: https://bugzilla.redhat.com/2120968

commit c8644cd0efe719608ddcb341bcf087d4bc0bf6b8
Author: Alan Maguire <alan.maguire@oracle.com>
Date:   Thu May 19 15:25:33 2022 +0100

    bpf: refine kernel.unprivileged_bpf_disabled behaviour
    
    With unprivileged BPF disabled, all cmds associated with the BPF syscall
    are blocked for users without CAP_BPF/CAP_SYS_ADMIN.  However, there are
    use cases where we may wish to allow interactions with BPF programs
    without being able to load and attach them.  So for example, a process
    with required capabilities loads/attaches a BPF program, and a process
    with less capabilities interacts with it; retrieving perf/ring buffer
    events, modifying map-specified config etc.  With all BPF syscall
    commands blocked as a result of unprivileged BPF being disabled,
    this mode of interaction becomes impossible for processes without
    CAP_BPF.
    
    As Alexei notes
    
    "The bpf ACL model is the same as traditional file's ACL.
    The creds and ACLs are checked at open().  Then during file's write/read
    additional checks might be performed. BPF has such functionality already.
    Different map_creates have capability checks while map_lookup has:
    map_get_sys_perms(map, f) & FMODE_CAN_READ.
    In other words it's enough to gate FD-receiving parts of bpf
    with unprivileged_bpf_disabled sysctl.
    The rest is handled by availability of FD and access to files in bpffs."
    
    So key fd creation syscall commands BPF_PROG_LOAD and BPF_MAP_CREATE
    are blocked with unprivileged BPF disabled and no CAP_BPF.
    
    And as Alexei notes, with unprivileged_bpf_disabled off, map creation is
    still blocked for map types other than array, hash and ringbuf maps.
    
    Programs responsible for loading and attaching the BPF program
    can still control access to its pinned representation by restricting
    permissions on the pin path, as with normal files.
    
    Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
    Acked-by: KP Singh <kpsingh@kernel.org>
    Link: https://lore.kernel.org/r/1652970334-30510-2-git-send-email-alan.maguire@oracle.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
Yauheni Kaliuta 1c3a7dd065 bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm.
Bugzilla: https://bugzilla.redhat.com/2120968

commit 2fcc82411e74e5e6aba336561cf56fb899bfae4e
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Tue May 10 13:59:21 2022 -0700

    bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm.
    
    Pass a cookie along with BPF_LINK_CREATE requests.
    
    Add a bpf_cookie field to struct bpf_tracing_link to attach a cookie.
    The cookie of a bpf_tracing_link is available by calling
    bpf_get_attach_cookie when running the BPF program of the attached
    link.
    
    The value of the cookie will be set in the bpf_tramp_run_ctx by the
    trampoline of the link.
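
    A program-side usage sketch (the usual vmlinux.h/bpf_tracing.h includes
    and GPL license section are assumed; the cookie itself is supplied at
    BPF_LINK_CREATE time as described):

    SEC("fentry/bpf_fentry_test1")
    int BPF_PROG(trace_test1, int a)
    {
            __u64 cookie = bpf_get_attach_cookie(ctx);

            bpf_printk("fentry cookie %llu", cookie);
            return 0;
    }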
    
    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220510205923.3206889-4-kuifeng@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:03 +02:00
Yauheni Kaliuta 2755884830 bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack
Bugzilla: https://bugzilla.redhat.com/2120968

commit e384c7b7b46d0a5f4bf3c554f963e6e9622d0ab1
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Tue May 10 13:59:20 2022 -0700

    bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack

    BPF trampolines will create a bpf_tramp_run_ctx, a bpf_run_ctx, on
    the caller's stack and set/reset the current bpf_run_ctx before/after
    calling a bpf_prog.

    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220510205923.3206889-3-kuifeng@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:03 +02:00
Yauheni Kaliuta 503bec2387 bpf, x86: Generate trampolines from bpf_tramp_links
Bugzilla: https://bugzilla.redhat.com/2120968
Conflicts: already applied
  1d5f82d9dd47 ("bpf, x86: fix freeing of not-finalized bpf_prog_pack")

commit f7e0beaf39d3868dc700d4954b26cf8443c5d423
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Tue May 10 13:59:19 2022 -0700

    bpf, x86: Generate trampolines from bpf_tramp_links

    Replace struct bpf_tramp_progs with struct bpf_tramp_links to collect
    struct bpf_tramp_link(s) for a trampoline.  struct bpf_tramp_link
    extends bpf_link to act as a linked list node.
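
    A simplified sketch of the two structures (member layout trimmed):

    struct bpf_tramp_link {
            struct bpf_link link;
            struct hlist_node tramp_hlist; /* links this prog into its trampoline */
    };

    struct bpf_tramp_links {
            struct bpf_tramp_link *links[BPF_MAX_TRAMP_LINKS];
            int nr_links;
    };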

    arch_prepare_bpf_trampoline() accepts a struct bpf_tramp_links to
    collect all bpf_tramp_link(s) that a trampoline should call.

    Change BPF trampoline and bpf_struct_ops to pass bpf_tramp_links
    instead of bpf_tramp_progs.

    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220510205923.3206889-2-kuifeng@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:03 +02:00