Commit Graph

4 Commits

Author SHA1 Message Date
Felix Maurer 7b9efe0d0d bpf, tcx: Get rid of tcx_link_const
JIRA: https://issues.redhat.com/browse/RHEL-28590

commit b63dadd6f97522513fe9497b5fde84a154e39a0b
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Mon Oct 23 20:50:15 2023 +0200

    bpf, tcx: Get rid of tcx_link_const
    
    Small clean up to get rid of the extra tcx_link_const() and only retain
    the tcx_link().
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20231023185015.21152-1-daniel@iogearbox.net
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:30:18 +02:00
Felix Maurer 260b83b5fc bpf: Handle bpf_mprog_query with NULL entry
JIRA: https://issues.redhat.com/browse/RHEL-28590
Conflicts:
- kernel/bpf/mprog.c: The changes to this file were already backported in
  773bccce1d ("bpf: Handle bpf_mprog_query with NULL entry").

commit edfa9af0a73ecc2000d7bb81d0b0fd3158cc9a65
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Sat Oct 7 00:06:50 2023 +0200

    bpf: Handle bpf_mprog_query with NULL entry

    Improve consistency for bpf_mprog_query() API and let the latter also handle
    a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we
    copy a count of 0 and revision of 1 to user space, so that this can be fed
    into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self-
    test as part of this series has been added to assert this case.

    Suggested-by: Lorenz Bauer <lmb@isovalent.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:30:18 +02:00
Felix Maurer 7fbb619831 tcx: Fix splat during dev unregister
JIRA: https://issues.redhat.com/browse/RHEL-28590

commit 079082c60affefeb9d2bd4176a4f2b390a9ccfda
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Fri Jul 28 23:47:17 2023 +0200

    tcx: Fix splat during dev unregister
    
    During unregister_netdevice_many_notify(), the ordering of our concerned
    function calls is like this:
    
      unregister_netdevice_many_notify
        dev_shutdown
            qdisc_put
                clsact_destroy
        tcx_uninstall
    
    The syzbot reproducer triggered a case that the qdisc refcnt is not
    zero during dev_shutdown().
    
    tcx_uninstall() will then WARN_ON_ONCE(tcx_entry(entry)->miniq_active)
    because the miniq is still active and the entry should not be freed.
    The latter assumed that qdisc destruction happens before tcx teardown.
    
    This fix is to avoid tcx_uninstall() doing tcx_entry_free() when the
    miniq is still alive and let the clsact_destroy() do the free later, so
    that we do not assume any specific ordering for either of them.
    
    If still active, tcx_uninstall() does clear the entry when flushing out
    the prog/link. clsact_destroy() will then notice the "!tcx_entry_is_active()"
    and then does the tcx_entry_free() eventually.
    
    Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
    Reported-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
    Reported-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
    Tested-by: Leon Romanovsky <leonro@nvidia.com>
    Link: https://lore.kernel.org/r/222255fe07cb58f15ee662e7ee78328af5b438e4.1690549248.git.daniel@iogearbox.net
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:29:42 +02:00
Felix Maurer cbdf6d7deb bpf: Add fd-based tcx multi-prog infra with link support
JIRA: https://issues.redhat.com/browse/RHEL-28590
Conflicts:
- MAINTAINERS: The file has been restructured upstream, but this is not
  relevant for us. All paths are already covered.
- include/linux/netdevice.h: We have excluded TC from kABI with
  845ad79d11 ("net: exclude TC from kABI"). Keep this exclusion.
- include/linux/skbuff.h: The order of the fields has been changed upstream
  in c0ba861117c3 ("net: skbuff: move the fields BPF cares about directly
  next to the offset marker"). The actual change is just changing config
  options. Do this instead of picking the field reordering to make
  backporting easier.
- include/uapi/linux/bpf.h and tools/include/uapi/linux/bpf.h: The changes
  to these files were already backported through 1d5bff6a09 ("bpf: Add
  fd-based tcx multi-prog infra with link support") to keep UAPI close to
  upstream.
- kernel/bpf/syscall.c: Already backported 58ff9f1ec9 ("bpf: Add
  attach_type checks under bpf_prog_attach_check_attach_type") moves one
  switch block around. The case BPF_PROG_TYPE_SCHED_CLS was added during
  that backport, therefore this hunk is missing now. This also causes
  context differences.
- kernel/bpf/syscall.c: Already backported 81b5cf0a11 ("bpf: Fix
  BPF_PROG_QUERY last field check") fixed the QUERY_LAST_FIELD.

commit e420bed025071a623d2720a92bc2245c84757ecb
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Wed Jul 19 16:08:52 2023 +0200

    bpf: Add fd-based tcx multi-prog infra with link support

    This work refactors and adds a lightweight extension ("tcx") to the tc BPF
    ingress and egress data path side for allowing BPF program management based
    on fds via bpf() syscall through the newly added generic multi-prog API.
    The main goal behind this work which we also presented at LPC [0] last year
    and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
    BPF link functionality for tc BPF programs, which allows for a model of safe
    ownership and program detachment.

    Given the rise in tc BPF users in cloud native environments, this becomes
    necessary to avoid hard to debug incidents either through stale leftover
    programs or 3rd party applications accidentally stepping on each others toes.
    As a recap, a BPF link represents the attachment of a BPF program to a BPF
    hook point. The BPF link holds a single reference to keep BPF program alive.
    Moreover, hook points do not reference a BPF link, only the application's
    fd or pinning does. A BPF link holds meta-data specific to attachment and
    implements operations for link creation, (atomic) BPF program update,
    detachment and introspection. The motivation for BPF links for tc BPF programs
    is multi-fold, for example:

      - From Meta: "It's especially important for applications that are deployed
        fleet-wide and that don't "control" hosts they are deployed to. If such
        application crashes and no one notices and does anything about that, BPF
        program will keep running draining resources or even just, say, dropping
        packets. We at FB had outages due to such permanent BPF attachment
        semantics. With fd-based BPF link we are getting a framework, which allows
        safe, auto-detachable behavior by default, unless application explicitly
        opts in by pinning the BPF link." [1]

      - From Cilium-side the tc BPF programs we attach to host-facing veth devices
        and phys devices build the core datapath for Kubernetes Pods, and they
        implement forwarding, load-balancing, policy, EDT-management, etc, within
        BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
        experienced hard-to-debug issues in a user's staging environment where
        another Kubernetes application using tc BPF attached to the same prio/handle
        of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
        it. The goal is to establish a clear/safe ownership model via links which
        cannot accidentally be overridden. [0,2]

    BPF links for tc can co-exist with non-link attachments, and the semantics are
    in line also with XDP links: BPF links cannot replace other BPF links, BPF
    links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
    lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
    would solve mentioned issue of safe ownership model as 3rd party applications
    would not be able to accidentally wipe Cilium programs, even if they are not
    BPF link aware.

    Earlier attempts [4] have tried to integrate BPF links into core tc machinery
    to solve cls_bpf, which has been intrusive to the generic tc kernel API with
    extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
    be wiped from the qdisc also. Locking a tc BPF program in place this way, is
    getting into layering hacks given the two object models are vastly different.

    We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
    attach API, so that the BPF link implementation blends in naturally similar to
    other link types which are fd-based and without the need for changing core tc
    internal APIs. BPF programs for tc can then be successively migrated from classic
    cls_bpf to the new tc BPF link without needing to change the program's source
    code, just the BPF loader mechanics for attaching is sufficient.

    For the current tc framework, there is no change in behavior with this change
    and neither does this change touch on tc core kernel APIs. The gist of this
    patch is that the ingress and egress hook have a lightweight, qdisc-less
    extension for BPF to attach its tc BPF programs, in other words, a minimal
    entry point for tc BPF. The name tcx has been suggested from discussion of
    earlier revisions of this work as a good fit, and to more easily differ between
    the classic cls_bpf attachment and the fd-based one.

    For the ingress and egress tcx points, the device holds a cache-friendly array
    with program pointers which is separated from control plane (slow-path) data.
    Earlier versions of this work used priority to determine ordering and expression
    of dependencies similar as with classic tc, but it was challenged that for
    something more future-proof a better user experience is required. Hence this
    resulted in the design and development of the generic attach/detach/query API
    for multi-progs. See prior patch with its discussion on the API design. tcx is
    the first user and later we plan to integrate also others, for example, one
    candidate is multi-prog support for XDP which would benefit and have the same
    'look and feel' from API perspective.

    The goal with tcx is to have maximum compatibility to existing tc BPF programs,
    so they don't need to be rewritten specifically. Compatibility to call into
    classic tcf_classify() is also provided in order to allow successive migration
    or both to cleanly co-exist where needed given its all one logical tc layer and
    the tcx plus classic tc cls/act build one logical overall processing pipeline.

    tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
    to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
    The fd-based API is behind a static key, so that when unused the code is also
    not entered. The struct tcx_entry's program array is currently static, but
    could be made dynamic if necessary at a point in future. The a/b pair swap
    design has been chosen so that for detachment there are no allocations which
    otherwise could fail.

    The work has been tested with tc-testing selftest suite which all passes, as
    well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

    Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
    of this work.

      [0] https://lpc.events/event/16/contributions/1353/
      [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
      [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
      [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
      [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:21:22 +02:00