Commit Graph

329 Commits

Author SHA1 Message Date
Viktor Malik 262886d961 bpf: allow to disable bpf prog memory accounting
Bugzilla: https://bugzilla.redhat.com/2178930

commit bf3965082491601bf9cd6d9a0ce2d88cb219168a
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Fri Feb 10 15:47:34 2023 +0000

    bpf: allow to disable bpf prog memory accounting
    
    We can simply disable the bpf prog memory accounting by not setting the
    GFP_ACCOUNT.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Link: https://lore.kernel.org/r/20230210154734.4416-5-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-06-13 22:45:28 +02:00
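
For illustration, a minimal sketch of the flag selection described in the
commit above; the helper and the memcg_bpf_enabled() toggle are assumptions,
not necessarily the exact upstream code.

    /* Add __GFP_ACCOUNT to BPF allocations only when memcg-based
     * accounting is enabled (helper name assumed for this sketch). */
    static inline gfp_t bpf_memcg_flags(gfp_t flags)
    {
            if (memcg_bpf_enabled())
                    return flags | __GFP_ACCOUNT;
            return flags;
    }

    /* usage sketch: prog = kvzalloc(size, bpf_memcg_flags(GFP_USER)); */
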
Felix Maurer 58671712a5 bpf: XDP metadata RX kfuncs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- include/linux/netdevice.h: Context difference due to missing 97dc7cd92ac6
  ("ptp: Support late timestamp determination")

commit 3d76a4d3d4e591af3e789698affaad88a5a8e8ab
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:26 2023 -0800

    bpf: XDP metadata RX kfuncs

    Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
    XDP metadata kfuncs. Not all devices have to implement them. If a kfunc is not
    supported by the target device, the default implementation is called instead.
    The verifier, at load time, replaces a call to the generic kfunc with a call
    to the per-device one. Per-device kfunc pointers are stored in separate
    struct xdp_metadata_ops.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:14 +02:00
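
A hypothetical XDP program using the RX-timestamp kfunc from this series; the
kfunc declaration follows the commit's naming, and the return convention
(0 on success, an error such as -EOPNOTSUPP from the default implementation)
is assumed.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                             __u64 *timestamp) __ksym;

    SEC("xdp")
    int rx_ts(struct xdp_md *ctx)
    {
            __u64 ts = 0;

            /* The verifier patches this call to the per-device kfunc if the
             * driver provides one; otherwise the generic default runs. */
            if (!bpf_xdp_metadata_rx_timestamp(ctx, &ts))
                    bpf_printk("rx timestamp: %llu", ts);
            return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";
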
Felix Maurer e630642b6b bpf: Introduce device-bound XDP programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:24 2023 -0800

    bpf: Introduce device-bound XDP programs

    New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
    to associate a netdev with a BPF program at load time.

    netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:13 +02:00
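
A userspace sketch (libbpf) of loading a device-bound XDP program with the new
flag; object, program and interface names are illustrative.

    #include <net/if.h>
    #include <bpf/libbpf.h>

    int load_dev_bound(void)
    {
            struct bpf_object *obj = bpf_object__open_file("xdp_prog.bpf.o", NULL);
            struct bpf_program *prog = bpf_object__find_program_by_name(obj, "rx_ts");

            bpf_program__set_ifindex(prog, if_nametoindex("eth0"));
            bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
            return bpf_object__load(obj);   /* error handling omitted */
    }
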
Felix Maurer c0febc32b2 bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 9d03ebc71a027ca495c60f6e94d3cda81921791f
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:21 2023 -0800

    bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded

    BPF offloading infra will be reused to implement
    bound-but-not-offloaded bpf programs. Rename existing
    helpers for clarity. No functional changes.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:12 +02:00
Jerome Marchand 27bd10243d bpf: Resolve fext program type when checking map compatibility
Bugzilla: https://bugzilla.redhat.com/2177177

commit 1c123c567fb138ebd187480b7fc0610fcb0851f5
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Thu Dec 15 00:02:53 2022 +0100

    bpf: Resolve fext program type when checking map compatibility
    
    The bpf_prog_map_compatible() check makes sure that BPF program types are
    not mixed inside BPF map types that can contain programs (tail call maps,
    cpumaps and devmaps). It does this by setting the fields of the map->owner
    struct to the values of the first program being checked against, and
    rejecting any subsequent programs if the values don't match.
    
    One of the values being set in the map owner struct is the program type,
    and since the code did not resolve the prog type for fext programs, the map
    owner type would be set to PROG_TYPE_EXT and subsequent loading of programs
    of the target type into the map would fail.
    
    This bug is seen in particular for XDP programs that are loaded as
    PROG_TYPE_EXT using libxdp; these cannot insert programs into devmaps and
    cpumaps because the check fails as described above.
    
    Fix the bug by resolving the fext program type to its target program type
    as elsewhere in the verifier.
    
    v3:
    - Add Yonghong's ACK
    
    Fixes: f45d5b6ce2e8 ("bpf: generalise tail call map compatibility check")
    Acked-by: Yonghong Song <yhs@fb.com>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/20221214230254.790066-1-toke@redhat.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:19 +02:00
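
A minimal sketch of the fix's core idea: compare the resolved program type, so
an fext program is checked as its target's type. Field and helper names are
approximations of the verifier code, not a verbatim excerpt.

    static bool owner_type_matches(const struct bpf_map *map,
                                   const struct bpf_prog *fp)
    {
            /* resolve_prog_type() maps BPF_PROG_TYPE_EXT to the target type */
            return map->owner.type == resolve_prog_type(fp);
    }
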
Jerome Marchand 27b1b8aed6 bpf: Introduce bpf_obj_new
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts: Context change from already backported commit 997849c4b969
("bpf: Zeroing allocated object from slab in bpf memory allocator"

commit 958cf2e273f0929c66169e0788031310e8118722
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Nov 18 07:26:03 2022 +0530

    bpf: Introduce bpf_obj_new

    Introduce type safe memory allocator bpf_obj_new for BPF programs. The
    kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
    to kfuncs still requires having them in prototype, unlike BPF helpers
    which always take 5 arguments and have them checked using bpf_func_proto
    in verifier, ignoring unset argument types.

    Introduce __ign suffix to ignore a specific kfunc argument during type
    checks, then use this to introduce support for passing type metadata to
    the bpf_obj_new_impl kfunc.

    The user passes the BTF ID of the type it wants to allocate in program BTF,
    the verifier then rewrites the first argument as the size of this type,
    after performing some sanity checks (to ensure it exists and it is a
    struct type).

    The second argument is also fixed up and passed by the verifier. This is
    the btf_struct_meta for the type being allocated. It would be needed
    mostly for the offset array which is required for zero initializing
    special fields while leaving the rest of storage in an uninitialized state.

    It would also be needed in the next patch to perform proper destruction
    of the object's special fields.

    Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
    using the any context BPF memory allocator introduced recently. To this
    end, a global instance of the BPF memory allocator is initialized on
    boot to be used for this purpose. This 'bpf_global_ma' serves all
    allocations for bpf_obj_new. In the future, bpf_obj_new variants will
    allow specifying a custom allocator.

    Note that now that bpf_obj_new can be used to allocate objects that can
    be linked to BPF linked list (when future linked list helpers are
    available), we need to also free the elements using bpf_mem_free.
    However, since the draining of elements is done outside the
    bpf_spin_lock, we need to do migrate_disable around the call since
    bpf_list_head_free can be called from map free path where migration is
    enabled. Otherwise, when called from BPF programs migration is already
    disabled.

    A convenience macro is included in the bpf_experimental.h header to hide
    over the ugly details of the implementation, leading to user code
    looking similar to a language level extension which allocates and
    constructs fields of a user type.

    struct bar {
            struct bpf_list_node node;
    };

    struct foo {
            struct bpf_spin_lock lock;
            struct bpf_list_head head __contains(bar, node);
    };

    void prog(void) {
            struct foo *f;

            f = bpf_obj_new(typeof(*f));
            if (!f)
                    return;
            ...
    }

    A key piece of this story is still missing, i.e. the free function,
    which will come in the next patch.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:07 +02:00
Jerome Marchand 6806cc5bf2 bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
Bugzilla: https://bugzilla.redhat.com/2177177

commit 4835f9ee980c1867584018e69cbf1f62d7844cb3
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Oct 14 19:39:46 2022 +0800

    bpf: Use rcu_trace_implies_rcu_gp() for program array freeing

    To support both sleepable and normal uprobe bpf program, the freeing of
    trace program array chains a RCU-tasks-trace grace period and a normal
    RCU grace period one after the other.

    With the introduction of rcu_trace_implies_rcu_gp(),
    __bpf_prog_array_free_sleepable_cb() can check whether or not a normal
    RCU grace period has also passed after a RCU-tasks-trace grace period
    has passed. If it is true, it is safe to invoke kfree() directly.

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20221014113946.965131-5-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:42:53 +02:00
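
A sketch of the freeing callback after this change; the structure follows the
commit description and may differ from the exact upstream code.

    static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
    {
            struct bpf_prog_array *progs =
                    container_of(rcu, struct bpf_prog_array, rcu);

            if (rcu_trace_implies_rcu_gp())
                    kfree(progs);           /* trace GP already implies an RCU GP */
            else
                    kfree_rcu(progs, rcu);  /* otherwise chain a normal RCU GP */
    }
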
Artem Savkov d33eb78da0 treewide: use get_random_u32() when possible
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: taking bpf parts only

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit a251c17aa558d8e3128a528af5cf8b9d7caae4fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 17:43:22 2022 +0200

    treewide: use get_random_u32() when possible

    The prandom_u32() function has been a deprecated inline wrapper around
    get_random_u32() for several releases now, and compiles down to the
    exact same code. Replace the deprecated wrapper with a direct call to
    the real function. The same also applies to get_random_int(), which is
    just a wrapper around get_random_u32(). This was done as a basic find
    and replace.

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
    Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
    Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Acked-by: Helge Deller <deller@gmx.de> # for parisc
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:27 +01:00
Artem Savkov fb99c831ac treewide: use prandom_u32_max() when possible, part 1
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: taking only bpf parts

Upstream Status: linux.git

commit 81895a65ec63ee1daec3255dc1a06675d2fbe915
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 16:43:38 2022 +0200

    treewide: use prandom_u32_max() when possible, part 1

    Rather than incurring a division or requesting too many random bytes for
    the given range, use the prandom_u32_max() function, which only takes
    the minimum required bytes from the RNG and avoids divisions. This was
    done mechanically with this coccinelle script:

    @basic@
    expression E;
    type T;
    identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
    typedef u64;
    @@
    (
    - ((T)get_random_u32() % (E))
    + prandom_u32_max(E)
    |
    - ((T)get_random_u32() & ((E) - 1))
    + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
    |
    - ((u64)(E) * get_random_u32() >> 32)
    + prandom_u32_max(E)
    |
    - ((T)get_random_u32() & ~PAGE_MASK)
    + prandom_u32_max(PAGE_SIZE)
    )

    @multi_line@
    identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
    identifier RAND;
    expression E;
    @@

    -       RAND = get_random_u32();
            ... when != RAND
    -       RAND %= (E);
    +       RAND = prandom_u32_max(E);

    // Find a potential literal
    @literal_mask@
    expression LITERAL;
    type T;
    identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
    position p;
    @@

            ((T)get_random_u32()@p & (LITERAL))

    // Add one to the literal.
    @script:python add_one@
    literal << literal_mask.LITERAL;
    RESULT;
    @@

    value = None
    if literal.startswith('0x'):
            value = int(literal, 16)
    elif literal[0] in '123456789':
            value = int(literal, 10)
    if value is None:
            print("I don't know how to handle %s" % (literal))
            cocci.include_match(False)
    elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
            print("Skipping 0x%x for cleanup elsewhere" % (value))
            cocci.include_match(False)
    elif value & (value + 1) != 0:
            print("Skipping 0x%x because it's not a power of two minus one" % (value))
            cocci.include_match(False)
    elif literal.startswith('0x'):
            coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
    else:
            coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

    // Replace the literal mask with the calculated result.
    @plus_one@
    expression literal_mask.LITERAL;
    position literal_mask.p;
    expression add_one.RESULT;
    identifier FUNC;
    @@

    -       (FUNC()@p & (LITERAL))
    +       prandom_u32_max(RESULT)

    @collapse_ret@
    type T;
    identifier VAR;
    expression E;
    @@

     {
    -       T VAR;
    -       VAR = (E);
    -       return VAR;
    +       return E;
     }

    @drop_var@
    type T;
    identifier VAR;
    @@

     {
    -       T VAR;
            ... when != VAR
     }

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: KP Singh <kpsingh@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
    Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:27 +01:00
Artem Savkov 0ef7ec3cf6 bpf: kmsan: initialize BPF registers with zeroes
Bugzilla: https://bugzilla.redhat.com/2166911

commit a6a7aaba7f39ee439f3d42e4b5bfc6e7f762d126
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:04:15 2022 +0200

    bpf: kmsan: initialize BPF registers with zeroes
    
    When executing BPF programs, certain registers may get passed
    uninitialized to helper functions.  E.g.  when performing a JMP_CALL,
    registers BPF_R1-BPF_R5 are always passed to the helper, no matter how
    many of them are actually used.
    
    Passing uninitialized values as function parameters is technically
    undefined behavior, so we work around it by always initializing the
    registers.
    
    Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:21 +01:00
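
The change described above amounts to zero-initializing the interpreter's
register file (a one-line sketch; the array lives in the bpf_prog_run
wrappers):

    u64 regs[MAX_BPF_EXT_REG] = {};   /* previously left uninitialized */
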
Artem Savkov e7b770ccac bpf: use bpf_prog_pack for bpf_dispatcher
Bugzilla: https://bugzilla.redhat.com/2166911

commit 19c02415da2345d0dda2b5c4495bc17cc14b18b5
Author: Song Liu <song@kernel.org>
Date:   Mon Sep 26 11:47:38 2022 -0700

    bpf: use bpf_prog_pack for bpf_dispatcher
    
    Allocate bpf_dispatcher with bpf_prog_pack_alloc so that bpf_dispatcher
    can share pages with bpf programs.
    
    arch_prepare_bpf_dispatcher() is updated to provide a RW buffer as working
    area for arch code to write to.
    
    This also fixes a CPA W^X warning like:
    
    CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ...
    
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20220926184739.3512547-2-song@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:19 +01:00
Artem Savkov b341ead9ab bpf: Add BPF-helper for accessing CLOCK_TAI
Bugzilla: https://bugzilla.redhat.com/2166911

commit c8996c98f703b09afe77a1d247dae691c9849dc1
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Tue Aug 9 08:08:02 2022 +0200

    bpf: Add BPF-helper for accessing CLOCK_TAI
    
    Commit 3dc6ffae2da2 ("timekeeping: Introduce fast accessor to clock tai")
    introduced a fast and NMI-safe accessor for CLOCK_TAI. Especially in time
    sensitive networks (TSN), where all nodes are synchronized by Precision Time
    Protocol (PTP), it's helpful to have the possibility to generate timestamps
    based on CLOCK_TAI instead of CLOCK_MONOTONIC. With a BPF helper for TAI in
    place, it becomes very convenient to correlate activity across different
    machines in the network.
    
    Use cases for such a BPF helper include functionalities such as Tx launch
    time (e.g. ETF and TAPRIO Qdiscs) and timestamping.
    
    Note: CLOCK_TAI is nothing new per se, only the NMI-safe variant of it is.
    
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    [Kurt: Wrote changelog and renamed helper]
    Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
    Link: https://lore.kernel.org/r/20220809060803.5773-2-kurt@linutronix.de
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:53:59 +01:00
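
A hypothetical tc program using the new TAI helper (the helper name
bpf_ktime_get_tai_ns follows this series; program details are illustrative).

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    SEC("tc")
    int stamp(struct __sk_buff *skb)
    {
            __u64 tai = bpf_ktime_get_tai_ns();   /* NMI-safe CLOCK_TAI read */

            bpf_printk("tai ns: %llu", tai);
            return TC_ACT_OK;
    }

    char LICENSE[] SEC("license") = "GPL";
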
Artem Savkov a52835f503 bpf: Fix a data-race around bpf_jit_limit.
Bugzilla: https://bugzilla.redhat.com/2137876

commit 0947ae1121083d363d522ff7518ee72b55bd8d29
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 14:58:04 2022 -0700

    bpf: Fix a data-race around bpf_jit_limit.
    
    While reading bpf_jit_limit, it can be changed concurrently via sysctl,
    WRITE_ONCE() in __do_proc_doulongvec_minmax(). The size of bpf_jit_limit
    is long, so we need to add a paired READ_ONCE() to avoid load-tearing.
    
    Fixes: ede95a63b5 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220823215804.2177-1-kuniyu@amazon.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:48 +01:00
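
The read side after the fix, sketched (surrounding context abridged): pair the
sysctl writer's WRITE_ONCE() with READ_ONCE() to avoid load tearing on the
long-sized limit.

    if (atomic_long_add_return(size, &bpf_jit_current) > READ_ONCE(bpf_jit_limit))
            goto err;   /* over the limit: undo the charge and fail */
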
Artem Savkov b57e7d9b57 bpf: Simplify bpf_prog_pack_[size|mask]
Bugzilla: https://bugzilla.redhat.com/2137876

commit ea2babac63d40e59926dc5de4550dac94cc3c6d2
Author: Song Liu <song@kernel.org>
Date:   Wed Jul 13 13:49:50 2022 -0700

    bpf: Simplify bpf_prog_pack_[size|mask]
    
    Simplify the logic that selects bpf_prog_pack_size, and always use
    (PMD_SIZE * num_possible_nodes()). This is a good tradeoff, as most of
    the performance benefit observed is from less direct map fragmentation [0].
    
    Also, module_alloc(4MB) may not allocate 4MB aligned memory. Therefore,
    we cannot use (ptr & bpf_prog_pack_mask) to find the correct address of
    bpf_prog_pack. Fix this by checking the header address falls in the range
    of pack->ptr and (pack->ptr + bpf_prog_pack_size).
    
      [0] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
    
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20220713204950.3015201-1-song@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:44 +01:00
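
A sketch of the two changes described above; macro and field names may differ
slightly from the final code.

    #define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())

    /* A header belongs to a pack iff it lies inside the pack's range;
     * masking with an alignment-derived mask is no longer assumed valid. */
    static bool hdr_in_pack(const struct bpf_prog_pack *pack, const void *hdr)
    {
            return hdr >= pack->ptr && hdr < pack->ptr + BPF_PROG_PACK_SIZE;
    }
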
Artem Savkov ee4f4249cd bpf: minimize number of allocated lsm slots per program
Bugzilla: https://bugzilla.redhat.com/2137876

commit c0e19f2c9a3edd38e4b1bdae98eb44555d02bc31
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:07 2022 -0700

    bpf: minimize number of allocated lsm slots per program
    
    Previous patch adds 1:1 mapping between all 211 LSM hooks
    and bpf_cgroup program array. Instead of reserving a slot per
    possible hook, reserve 10 slots per cgroup for lsm programs.
    Those slots are dynamically allocated on demand and reclaimed.
    
    struct cgroup_bpf {
    	struct bpf_prog_array *    effective[33];        /*     0   264 */
    	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
    	struct hlist_head          progs[33];            /*   264   264 */
    	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
    	u8                         flags[33];            /*   528    33 */
    
    	/* XXX 7 bytes hole, try to pack */
    
    	struct list_head           storages;             /*   568    16 */
    	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
    	struct bpf_prog_array *    inactive;             /*   584     8 */
    	struct percpu_ref          refcnt;               /*   592    16 */
    	struct work_struct         release_work;         /*   608    72 */
    
    	/* size: 680, cachelines: 11, members: 7 */
    	/* sum members: 673, holes: 1, sum holes: 7 */
    	/* last cacheline: 40 bytes */
    };
    
    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 9a33161b25 bpf: per-cgroup lsm flavor
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: already applied 65d9ecfe0ca73 "bpf: Fix ref_obj_id for dynptr
data slices in verifier"

commit 69fd337a975c7e690dfe49d9cb4fe5ba1e6db44e
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:06 2022 -0700

    bpf: per-cgroup lsm flavor

    Allow attaching to lsm hooks in the cgroup context.

    Attaching to per-cgroup LSM works exactly like attaching
    to other per-cgroup hooks. New BPF_LSM_CGROUP is added
    to trigger new mode; the actual lsm hook we attach to is
    signaled via existing attach_btf_id.

    For the hooks that have 'struct socket' or 'struct sock' as its first
    argument, we use the cgroup associated with that socket. For the rest,
    we use 'current' cgroup (this is all on default hierarchy == v2 only).
    Note that for some hooks that work on 'struct sock' we still
    take the cgroup from 'current' because some of them work on the socket
    that hasn't been properly initialized yet.

    Behind the scenes, we allocate a shim program that is attached
    to the trampoline and runs cgroup effective BPF programs array.
    This shim has some rudimentary ref counting and can be shared
    between several programs attaching to the same lsm hook from
    different cgroups.

    Note that this patch bloats cgroup size because we add 211
    cgroup_bpf_attach_type(s) for simplicity's sake. This will be
    addressed in the subsequent patch.

    Also note that we only add non-sleepable flavor for now. To enable
    sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
    shim programs have to be freed via trace rcu, cgroup_bpf.effective
    should be also trace-rcu-managed + maybe some other changes that
    I'm not aware of.

    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 9b9402d121 bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: already applied 1d5f82d9dd47 ("bpf, x86: fix freeing of
not-finalized bpf_prog_pack")

commit 95acd8817e66d031d2e6ee7def3f1e1874819317
Author: Tony Ambardar <tony.ambardar@gmail.com>
Date:   Fri Jun 17 12:57:34 2022 +0200

    bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT

    The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail
    calls for only x86-64. Change the logic to instead rely on a new weak
    function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable
    JIT backend can override.

    Update the x86-64 eBPF JIT to reflect this.

    Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
    [jakub: drop MIPS bits and tweak patch subject]
    Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:31 +01:00
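
The mechanism described above, sketched from the commit text: a weak default
that reports "not supported", overridden by JITs that can mix bpf2bpf calls
with tail calls.

    /* kernel/bpf/core.c: conservative default */
    bool __weak bpf_jit_supports_subprog_tailcalls(void)
    {
            return false;
    }

    /* arch/x86/net/bpf_jit_comp.c: x86-64 supports the combination */
    bool bpf_jit_supports_subprog_tailcalls(void)
    {
            return true;
    }
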
Artem Savkov bee9a85dbb bpf: implement sleepable uprobes by chaining gps
Bugzilla: https://bugzilla.redhat.com/2137876

commit 8c7dcb84e3b744b2b70baa7a44a9b1881c33a9c9
Author: Delyan Kratunov <delyank@fb.com>
Date:   Tue Jun 14 23:10:46 2022 +0000

    bpf: implement sleepable uprobes by chaining gps
    
    uprobes work by raising a trap, setting a task flag from within the
    interrupt handler, and processing the actual work for the uprobe on the
    way back to userspace. As a result, uprobe handlers already execute in a
    might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe
    programs is therefore on the bpf side.
    
    Namely, the bpf_prog_array attached to the uprobe is protected by normal
    rcu. In order for uprobe bpf programs to become sleepable, it has to be
    protected by the tasks_trace rcu flavor instead (and kfree() called after
    a corresponding grace period).
    
    Therefore, the free path for bpf_prog_array now chains a tasks_trace and
    normal grace periods one after the other.
    
    Users who iterate under tasks_trace read section would
    be safe, as would users who iterate under normal read sections (from
    non-sleepable locations).
    
    The downside is that the tasks_trace latency affects all perf_event-attached
    bpf programs (and not just uprobe ones). This is deemed safe given the
    possible attach rates for kprobe/uprobe/tp programs.
    
    Separately, non-sleepable programs need access to dynamically sized
    rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes
    an rcu read section, in addition to the overarching tasks_trace section.
    
    Signed-off-by: Delyan Kratunov <delyank@fb.com>
    Link: https://lore.kernel.org/r/ce844d62a2fd0443b08c5ab02e95bc7149f9aeb1.1655248076.git.delyank@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:30 +01:00
Artem Savkov 3d52fa1414 bpf: Correct the comment about insn_to_jit_off
Bugzilla: https://bugzilla.redhat.com/2137876

commit cc1685546df87d9872e1ccef5bf56ac5262be0b1
Author: Pu Lehui <pulehui@huawei.com>
Date:   Mon May 30 17:28:12 2022 +0800

    bpf: Correct the comment about insn_to_jit_off
    
    The insn_to_jit_off passed to bpf_prog_fill_jited_linfo should be the
    first byte of the next instruction, or the byte off to the end of the
    current instruction.
    
    Signed-off-by: Pu Lehui <pulehui@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220530092815.1112406-4-pulehui@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:28 +01:00
Yauheni Kaliuta 6d9bc53da1 bpf: Make sure mac_header was set before using it
Bugzilla: https://bugzilla.redhat.com/2120968

commit 0326195f523a549e0a9d7fd44c70b26fd7265090
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jul 7 12:39:00 2022 +0000

    bpf: Make sure mac_header was set before using it
    
    Classic BPF has a way to load bytes starting from the mac header.
    
    Some skbs do not have a mac header, and skb_mac_header()
    in this case returns a pointer that is 65535 bytes after
    skb->head.
    
    Existing range check in bpf_internal_load_pointer_neg_helper()
    was properly kicking in and no illegal access was happening.
    
    New sanity check in skb_mac_header() is firing, so we need
    to avoid it.
    
    WARNING: CPU: 1 PID: 28990 at include/linux/skbuff.h:2785 skb_mac_header include/linux/skbuff.h:2785 [inline]
    WARNING: CPU: 1 PID: 28990 at include/linux/skbuff.h:2785 bpf_internal_load_pointer_neg_helper+0x1b1/0x1c0 kernel/bpf/core.c:74
    Modules linked in:
    CPU: 1 PID: 28990 Comm: syz-executor.0 Not tainted 5.19.0-rc4-syzkaller-00865-g4874fb9484be #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/29/2022
    RIP: 0010:skb_mac_header include/linux/skbuff.h:2785 [inline]
    RIP: 0010:bpf_internal_load_pointer_neg_helper+0x1b1/0x1c0 kernel/bpf/core.c:74
    Code: ff ff 45 31 f6 e9 5a ff ff ff e8 aa 27 40 00 e9 3b ff ff ff e8 90 27 40 00 e9 df fe ff ff e8 86 27 40 00 eb 9e e8 2f 2c f3 ff <0f> 0b eb b1 e8 96 27 40 00 e9 79 fe ff ff 90 41 57 41 56 41 55 41
    RSP: 0018:ffffc9000309f668 EFLAGS: 00010216
    RAX: 0000000000000118 RBX: ffffffffffeff00c RCX: ffffc9000e417000
    RDX: 0000000000040000 RSI: ffffffff81873f21 RDI: 0000000000000003
    RBP: ffff8880842878c0 R08: 0000000000000003 R09: 000000000000ffff
    R10: 000000000000ffff R11: 0000000000000001 R12: 0000000000000004
    R13: ffff88803ac56c00 R14: 000000000000ffff R15: dffffc0000000000
    FS: 00007f5c88a16700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fdaa9f6c058 CR3: 000000003a82c000 CR4: 00000000003506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    ____bpf_skb_load_helper_32 net/core/filter.c:276 [inline]
    bpf_skb_load_helper_32+0x191/0x220 net/core/filter.c:264
    
    Fixes: f9aefd6b2aa3 ("net: warn if mac header was not set")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220707123900.945305-1-edumazet@google.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:08 +02:00
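
A sketch of the guard added on the mac-header load path; its exact placement
inside bpf_internal_load_pointer_neg_helper() is assumed.

    if (!skb_mac_header_was_set(skb))
            return NULL;                    /* no mac header: refuse the load */
    ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
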
Yauheni Kaliuta 5f5e60a97e bpf: Fix probe read error in ___bpf_prog_run()
Bugzilla: https://bugzilla.redhat.com/2120968

commit caff1fa4118cec4dfd4336521ebd22a6408a1e3e
Author: Menglong Dong <imagedong@tencent.com>
Date:   Tue May 24 10:12:27 2022 +0800

    bpf: Fix probe read error in ___bpf_prog_run()
    
    I think there is something wrong with BPF_PROBE_MEM in ___bpf_prog_run()
    in big-endian machine. Let's make a test and see what will happen if we
    want to load a 'u16' with BPF_PROBE_MEM.
    
    Let's make the src value '0x0001', the value of dest register will become
    0x0001000000000000, as the value will be loaded into the first 2 bytes of
    DST with the following code:
    
      bpf_probe_read_kernel(&DST, SIZE, (const void *)(long) (SRC + insn->off));
    
    Obviously, the value in DST is not correct. In fact, we can compare
    BPF_PROBE_MEM with LDX_MEM_H:
    
      DST = *(SIZE *)(unsigned long) (SRC + insn->off);
    
    If the memory load is done by LDX_MEM_H, the value in DST will be 0x1 now.
    
    And I think this error results in the test case 'test_bpf_sk_storage_map'
    failing:
    
      test_bpf_sk_storage_map:PASS:bpf_iter_bpf_sk_storage_map__open_and_load 0 nsec
      test_bpf_sk_storage_map:PASS:socket 0 nsec
      test_bpf_sk_storage_map:PASS:map_update 0 nsec
      test_bpf_sk_storage_map:PASS:socket 0 nsec
      test_bpf_sk_storage_map:PASS:map_update 0 nsec
      test_bpf_sk_storage_map:PASS:socket 0 nsec
      test_bpf_sk_storage_map:PASS:map_update 0 nsec
      test_bpf_sk_storage_map:PASS:attach_iter 0 nsec
      test_bpf_sk_storage_map:PASS:create_iter 0 nsec
      test_bpf_sk_storage_map:PASS:read 0 nsec
      test_bpf_sk_storage_map:FAIL:ipv6_sk_count got 0 expected 3
      $10/26 bpf_iter/bpf_sk_storage_map:FAIL
    
    The code of the test case is simple: it loads sk->sk_family into the
    register with BPF_PROBE_MEM and checks whether it is AF_INET6. With this patch,
    now the test case 'bpf_iter' can pass:
    
      $10  bpf_iter:OK
    
    Fixes: 2a02759ef5 ("bpf: Add support for BTF pointers to interpreter")
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Jiang Biao <benbjiang@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Link: https://lore.kernel.org/bpf/20220524021228.533216-1-imagedong@tencent.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:07 +02:00
Yauheni Kaliuta aef3783eaf bpf: Fix combination of jit blinding and pointers to bpf subprogs.
Bugzilla: https://bugzilla.redhat.com/2120968

commit 4b6313cf99b0d51b49aeaea98ec76ca8161ecb80
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Thu May 12 18:10:24 2022 -0700

    bpf: Fix combination of jit blinding and pointers to bpf subprogs.
    
    The combination of jit blinding and pointers to bpf subprogs causes:
    [   36.989548] BUG: unable to handle page fault for address: 0000000100000001
    [   36.990342] #PF: supervisor instruction fetch in kernel mode
    [   36.990968] #PF: error_code(0x0010) - not-present page
    [   36.994859] RIP: 0010:0x100000001
    [   36.995209] Code: Unable to access opcode bytes at RIP 0xffffffd7.
    [   37.004091] Call Trace:
    [   37.004351]  <TASK>
    [   37.004576]  ? bpf_loop+0x4d/0x70
    [   37.004932]  ? bpf_prog_3899083f75e4c5de_F+0xe3/0x13b
    
    The jit blinding logic didn't recognize that ld_imm64 with an address
    of bpf subprogram is a special instruction and proceeded to randomize it.
    By itself it wouldn't have been an issue, but jit_subprogs() logic
    relies on two step process to JIT all subprogs and then JIT them
    again when addresses of all subprogs are known.
    Blinding process in the first JIT phase caused second JIT to miss
    adjustment of special ld_imm64.
    
    Fix this issue by ignoring special ld_imm64 instructions that don't have
    user controlled constants and shouldn't be blinded.
    
    Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
    Reported-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220513011025.13344-1-alexei.starovoitov@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:04 +02:00
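
The idea, sketched only (the exact condition and hook point in the blinding
code may differ): a ld_imm64 whose src_reg is BPF_PSEUDO_FUNC holds a
subprogram address rather than a user-controlled constant, so blinding must
skip it.

    if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
        insn->src_reg == BPF_PSEUDO_FUNC)
            continue;               /* do not blind this instruction */
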
Yauheni Kaliuta c0c280946f bpf: add bpf_map_lookup_percpu_elem for percpu map
Bugzilla: https://bugzilla.redhat.com/2120968

commit 07343110b293456d30393e89b86c4dee1ac051c8
Author: Feng Zhou <zhoufeng.zf@bytedance.com>
Date:   Wed May 11 17:38:53 2022 +0800

    bpf: add bpf_map_lookup_percpu_elem for percpu map
    
    Add new ebpf helpers bpf_map_lookup_percpu_elem.
    
    The implementation is relatively simple: following the map_lookup_elem
    implementation of the percpu map, add a cpu parameter and look up the
    element on the specified cpu.
    
    Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
    Link: https://lore.kernel.org/r/20220511093854.411-2-zhoufeng.zf@bytedance.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:04 +02:00
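
A hypothetical BPF program summing a per-CPU map value across CPUs with the
new helper; map layout, section name and CPU count are illustrative.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } counts SEC(".maps");

    SEC("raw_tp")
    int sum_counts(void *ctx)
    {
            __u32 key = 0, cpu;
            __u64 total = 0;

            for (cpu = 0; cpu < 8; cpu++) {
                    __u64 *v = bpf_map_lookup_percpu_elem(&counts, &key, cpu);

                    if (v)
                            total += *v;
            }
            bpf_printk("total across CPUs: %llu", total);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
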
Jerome Marchand 5b89b8e979 bpf, x86: fix freeing of not-finalized bpf_prog_pack
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
Context change from missing commits 95acd8817e66 ("bpf, x64: Add
predicate for bpf2bpf with tailcalls support in JIT") and f7e0beaf39d3
("bpf, x86: Generate trampolines from bpf_tramp_links")

commit 1d5f82d9dd477d5c66e0214a68c3e4f308eadd6d
Author: Song Liu <song@kernel.org>
Date:   Tue Jul 5 17:26:12 2022 -0700

    bpf, x86: fix freeing of not-finalized bpf_prog_pack

    syzbot reported a few issues with bpf_prog_pack [1], [2]. This only happens
    with multiple subprogs. In jit_subprogs(), we first call bpf_int_jit_compile()
    on each sub program. And then, we call it on each sub program again. jit_data
    is not freed in the first call of bpf_int_jit_compile(). Similarly we don't
    call bpf_jit_binary_pack_finalize() in the first call of bpf_int_jit_compile().

    If bpf_int_jit_compile() failed for one sub program, we will call
    bpf_jit_binary_pack_finalize() for this sub program. However, we don't have a
    chance to call it for other sub programs. Then we will hit "goto out_free" in
    jit_subprogs(), and call bpf_jit_free on some subprograms that haven't got
    bpf_jit_binary_pack_finalize() yet.

    At this point, bpf_jit_binary_pack_free() is called and the whole 2MB page is
    freed erroneously.

    Fix this with a custom bpf_jit_free() for x86_64, which calls
    bpf_jit_binary_pack_finalize() if necessary. Also, with custom
    bpf_jit_free(), bpf_prog_aux->use_bpf_prog_pack is not needed any more,
    remove it.

    Fixes: 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
    [1] https://syzkaller.appspot.com/bug?extid=2f649ec6d2eea1495a8f
    [2] https://syzkaller.appspot.com/bug?extid=87f65c75f4a72db05445
    Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
    Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20220706002612.4013790-1-song@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:10 +02:00
Jerome Marchand 3a88af3003 bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
Bugzilla: https://bugzilla.redhat.com/2120966

commit fe736565efb775620dbcf3c459c1cd80d3e868da
Author: Song Liu <song@kernel.org>
Date:   Fri May 20 16:57:53 2022 -0700

    bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack

    Introduce bpf_arch_text_invalidate and use it to fill unused part of the
    bpf_prog_pack with illegal instructions when a BPF program is freed.

    Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
    Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220520235758.1858153-4-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:09 +02:00
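
A sketch of the weak default introduced here; x86 overrides it with a
text-poke based implementation, and the pack allocator calls it on freed
ranges (call site abridged).

    int __weak bpf_arch_text_invalidate(void *dst, size_t len)
    {
            return -ENOTSUPP;
    }

    /* on free (sketch): bpf_arch_text_invalidate(hdr, hdr->size); */
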
Jerome Marchand 0039278276 bpf: Fill new bpf_prog_pack with illegal instructions
Bugzilla: https://bugzilla.redhat.com/2120966

commit d88bb5eed04ce50cc20e7f9282977841728be798
Author: Song Liu <song@kernel.org>
Date:   Fri May 20 16:57:51 2022 -0700

    bpf: Fill new bpf_prog_pack with illegal instructions

    bpf_prog_pack enables sharing huge pages among multiple BPF programs.
    These pages are marked as executable before the JIT engine fills them with
    BPF programs. To make these pages safe, fill the whole bpf_prog_pack with
    illegal instructions before making it executable.

    Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
    Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220520235758.1858153-2-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:09 +02:00
Jerome Marchand d4f3e612de bpf: Fix bpf_prog_pack when PMU_SIZE is not defined
Bugzilla: https://bugzilla.redhat.com/2120966

commit e581094167beb674c8a3bc2c27362f50dc5dd617
Author: Song Liu <song@kernel.org>
Date:   Mon Mar 21 11:00:09 2022 -0700

    bpf: Fix bpf_prog_pack when PMU_SIZE is not defined

    PMD_SIZE is not available in some special config, e.g. ARCH=arm with
    CONFIG_MMU=n. Use bpf_prog_pack of PAGE_SIZE in these cases.

    Fixes: ef078600eec2 ("bpf: Select proper size for bpf_prog_pack")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220321180009.1944482-3-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:07 +02:00
Jerome Marchand 67520d7745 bpf: Fix bpf_prog_pack for multi-node setup
Bugzilla: https://bugzilla.redhat.com/2120966

commit 96805674e5624b3c79780a2b41c7a3d6bc38dc76
Author: Song Liu <song@kernel.org>
Date:   Mon Mar 21 11:00:08 2022 -0700

    bpf: Fix bpf_prog_pack for multi-node setup

    module_alloc requires num_online_nodes * PMD_SIZE to allocate huge pages.
    bpf_prog_pack uses pack of size num_online_nodes * PMD_SIZE.
    OTOH, module_alloc returns addresses that are PMD_SIZE aligned (instead of
    num_online_nodes * PMD_SIZE aligned). Therefore, PMD_MASK should be used
    to calculate pack_ptr in bpf_prog_pack_free().

    Fixes: ef078600eec2 ("bpf: Select proper size for bpf_prog_pack")
    Reported-by: syzbot+c946805b5ce6ab87df0b@syzkaller.appspotmail.com
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220321180009.1944482-2-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:07 +02:00
Jerome Marchand 6282380046 bpf: Select proper size for bpf_prog_pack
Bugzilla: https://bugzilla.redhat.com/2120966

commit ef078600eec20f20eb7833cf597d4a5edf2953c1
Author: Song Liu <song@kernel.org>
Date:   Fri Mar 11 12:11:35 2022 -0800

    bpf: Select proper size for bpf_prog_pack

    Using HPAGE_PMD_SIZE as the size for bpf_prog_pack is not ideal in some
    cases. Specifically, for NUMA systems, __vmalloc_node_range requires
    PMD_SIZE * num_online_nodes() to allocate huge pages. Also, if the system
    does not support huge pages (i.e., with cmdline option nohugevmalloc), it
    is better to use PAGE_SIZE packs.

    Add logic to select proper size for bpf_prog_pack. This solution is not
    ideal, as it makes assumption about the behavior of module_alloc and
    __vmalloc_node_range. However, it appears to be the easiest solution as
    it doesn't require changes in module_alloc and vmalloc code.

    Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220311201135.3573610-1-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:06 +02:00
Jerome Marchand 4caaec1e32 bpf: Fix net.core.bpf_jit_harden race
Bugzilla: https://bugzilla.redhat.com/2120966

commit d2a3b7c5becc3992f8e7d2b9bf5eacceeedb9a48
Author: Hou Tao <houtao1@huawei.com>
Date:   Wed Mar 9 20:33:20 2022 +0800

    bpf: Fix net.core.bpf_jit_harden race

    It is the bpf_jit_harden counterpart to commit 60b58afc96 ("bpf: fix
    net.core.bpf_jit_enable race"). bpf_jit_harden will be tested twice
    for each subprog if there are subprogs in bpf program and constant
    blinding may increase the length of program, so when running
    "./test_progs -t subprogs" and toggling bpf_jit_harden between 0 and 2,
    jit_subprogs may fail because constant blinding increases the length
    of subprog instructions during the extra pass.

    So cache the value of bpf_jit_blinding_enabled() during program
    allocation, and use the cached value during constant blinding, subprog
    JITing and args tracking of tail call.

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220309123321.2400262-4-houtao1@huawei.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:03 +02:00
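
The fix described above, sketched: sample the sysctl once when the program is
allocated and consult the cached value in every later pass (the field name is
an assumption).

    /* at allocation time: snapshot the knob */
    fp->blinding_requested = bpf_jit_blinding_enabled(fp);

    /* later, during constant blinding and subprog JITing, rely on the
     * snapshot instead of re-reading the sysctl, so all passes agree */
    if (!fp->blinding_requested)
            return fp;
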
Jerome Marchand af540c52c6 bpf, x86: Set header->size properly before freeing it
Bugzilla: https://bugzilla.redhat.com/2120966

commit 676b2daabaf9a993db0e02a5ce79b984aaa0388b
Author: Song Liu <song@kernel.org>
Date:   Wed Mar 2 09:51:26 2022 -0800

    bpf, x86: Set header->size properly before freeing it

    On do_jit failure path, the header is freed by bpf_jit_binary_pack_free.
    While bpf_jit_binary_pack_free doesn't require proper ro_header->size,
    bpf_prog_pack_free still uses it. Set header->size in bpf_int_jit_compile
    before calling bpf_jit_binary_pack_free.

    Fixes: 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
    Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
    Reported-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20220302175126.247459-3-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:52 +02:00
Jerome Marchand 4dc7d047ff bpf: bpf_prog_pack: Set proper size before freeing ro_header
Bugzilla: https://bugzilla.redhat.com/2120966

commit d24d2a2b0a81dd5e9bb99aeb4559ec9734e1416f
Author: Song Liu <song@kernel.org>
Date:   Thu Feb 17 10:30:01 2022 -0800

    bpf: bpf_prog_pack: Set proper size before freeing ro_header

    bpf_prog_pack_free() uses header->size to decide whether the header
    should be freed with module_memfree() or the bpf_prog_pack logic.
    However, in kvmalloc() failure path of bpf_jit_binary_pack_alloc(),
    header->size is not set yet. As a result, bpf_prog_pack_free() may treat
    a slice of a pack as a standalone kvmalloc'd header and call
    module_memfree() on the whole pack. This in turn causes use-after-free by
    other users of the pack.

    Fix this by setting ro_header->size before freeing ro_header.

    Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
    Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
    Reported-by: syzbot+ecb1e7e51c52f68f7481@syzkaller.appspotmail.com
    Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220217183001.1876034-1-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:50 +02:00
Jerome Marchand db903be260 bpf: Fix bpf_prog_pack build for ppc64_defconfig
Bugzilla: https://bugzilla.redhat.com/2120966

commit 4cc0991abd3954609a6929234bbb8c0fe7a0298d
Author: Song Liu <song@kernel.org>
Date:   Thu Feb 10 18:49:39 2022 -0800

    bpf: Fix bpf_prog_pack build for ppc64_defconfig

    bpf_prog_pack causes build error with powerpc ppc64_defconfig:

    kernel/bpf/core.c:830:23: error: variably modified 'bitmap' at file scope
      830 |         unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
          |                       ^~~~~~

    This is because the macro expands as:

    unsigned long bitmap[((((((1UL) << (16 + __pte_index_size)) / (1 << 6))) \
         + ((sizeof(long) * 8)) - 1) / ((sizeof(long) * 8)))];

    where __pte_index_size is a global variable.

    Fix it by turning bitmap into a 0-length array.

    Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220211024939.2962537-1-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:49 +02:00
Jerome Marchand 9bcf072a89 bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
Bugzilla: https://bugzilla.redhat.com/2120966

commit c1b13a9451ab9d46eefb80a2cc4b8b3206460829
Author: Song Liu <song@kernel.org>
Date:   Tue Feb 8 14:05:09 2022 -0800

    bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE

    Fix build with CONFIG_TRANSPARENT_HUGEPAGE=n with BPF_PROG_PACK_SIZE as
    PAGE_SIZE.

    Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220208220509.4180389-3-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:48 +02:00
Jerome Marchand e99fab9d65 bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]
Bugzilla: https://bugzilla.redhat.com/2120966

commit 33c9805860e584b194199cab1a1e81f4e6395408
Author: Song Liu <songliubraving@fb.com>
Date:   Fri Feb 4 10:57:41 2022 -0800

    bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]

    This is the jit binary allocator built on top of bpf_prog_pack.

    bpf_prog_pack allocates RO memory, which cannot be used directly by the
    JIT engine. Therefore, a temporary rw buffer is allocated for the JIT
    engine. Once JIT is done, bpf_jit_binary_pack_finalize is used to copy
    the program to the RO memory.

    bpf_jit_binary_pack_alloc reserves 16 bytes of extra space for illegal
    instructions, which is smaller than the 128 bytes of space reserved by
    bpf_jit_binary_alloc. This change is necessary for bpf_jit_binary_hdr
    to find the correct header. Also, flag use_bpf_prog_pack is added to
    differentiate a program allocated by bpf_jit_binary_pack_alloc.

    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220204185742.271030-9-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:48 +02:00
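
A sketch of the JIT-side flow described above; parameter lists are abridged
and their ordering is approximate.

    struct bpf_binary_header *ro_header, *rw_header;
    u8 *ro_image, *rw_image;

    ro_header = bpf_jit_binary_pack_alloc(proglen, &ro_image, align,
                                          &rw_header, &rw_image,
                                          jit_fill_hole);
    /* ... emit instructions into rw_image, the writable buffer ... */
    bpf_jit_binary_pack_finalize(prog, ro_header, rw_header);
    /* finalize copies the finished program into the RO pack memory */
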
Jerome Marchand 82c7f40057 bpf: Introduce bpf_prog_pack allocator
Bugzilla: https://bugzilla.redhat.com/2120966

commit 57631054fae6dcc9c892ae6310b58bbb6f6e5048
Author: Song Liu <song@kernel.org>
Date:   Fri Feb 4 10:57:40 2022 -0800

    bpf: Introduce bpf_prog_pack allocator

    Most BPF programs are small, but they consume a page each. For systems
    with busy traffic and many BPF programs, this could add significant
    pressure to instruction TLB. High iTLB pressure usually causes slow down
    for the whole system, which includes visible performance degradation for
    production workloads.

    Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
    page. The memory is then allocated in 64 byte chunks.

    Memory allocated by bpf_prog_pack allocator is RO protected after initial
    allocation. To write to it, the user (jit engine) needs to use the text
    poke API.

    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220204185742.271030-8-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:48 +02:00
Jerome Marchand dc294c8eaa bpf: Introduce bpf_arch_text_copy
Bugzilla: https://bugzilla.redhat.com/2120966

commit ebc1415d9b4f043cef5a1fb002ec316e32167e7a
Author: Song Liu <song@kernel.org>
Date:   Fri Feb 4 10:57:39 2022 -0800

    bpf: Introduce bpf_arch_text_copy

    This will be used to copy JITed text to RO protected module memory. On
    x86, bpf_arch_text_copy is implemented with text_poke_copy.

    bpf_arch_text_copy returns pointer to dst on success, and ERR_PTR(errno)
    on errors.

    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220204185742.271030-7-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:48 +02:00
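
A sketch of the weak default described above (x86 implements it with
text_poke_copy()); the return convention is dst on success or ERR_PTR(errno).

    void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
    {
            return ERR_PTR(-ENOTSUPP);
    }
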
Jerome Marchand 4018c83849 bpf: Use prog->jited_len in bpf_prog_ksym_set_addr()
Bugzilla: https://bugzilla.redhat.com/2120966

commit d00c6473b1ee9050cc36d008c6d30bf0d3de0524
Author: Song Liu <song@kernel.org>
Date:   Fri Feb 4 10:57:37 2022 -0800

    bpf: Use prog->jited_len in bpf_prog_ksym_set_addr()

    Using prog->jited_len is simpler and more accurate than current
    estimation (header + header->size).

    Also, fix missing prog->jited_len with multi function program. This hasn't
    been a real issue before this.

    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220204185742.271030-5-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:47 +02:00
Jerome Marchand b04fc85369 bpf: Use size instead of pages in bpf_binary_header
Bugzilla: https://bugzilla.redhat.com/2120966

commit ed2d9e1a26cca963ff5ed3b76326d70f7d8201a9
Author: Song Liu <songliubraving@fb.com>
Date:   Fri Feb 4 10:57:36 2022 -0800

    bpf: Use size instead of pages in bpf_binary_header

    This is necessary to charge sub-page memory for the BPF program.

    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220204185742.271030-4-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:47 +02:00
Jerome Marchand fb3b56065b bpf: Use bytes instead of pages for bpf_jit_[charge|uncharge]_modmem
Bugzilla: https://bugzilla.redhat.com/2120966

commit 3486bedd99196ecdfe99c0ab5b67ad3c47e8a8fa
Author: Song Liu <songliubraving@fb.com>
Date:   Fri Feb 4 10:57:35 2022 -0800

    bpf: Use bytes instead of pages for bpf_jit_[charge|uncharge]_modmem

    This enables sub-page memory charge and allocation.

    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220204185742.271030-3-song@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:47 +02:00
Jerome Marchand 35afa2bc38 cgroup/bpf: fast path skb BPF filtering
Bugzilla: https://bugzilla.redhat.com/2120966

commit 46531a30364bd483bfa1b041c15d42a196e77e93
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Jan 27 14:09:13 2022 +0000

    cgroup/bpf: fast path skb BPF filtering

    Even though there is a static key protecting against the overhead of
    cgroup-bpf skb filtering when nothing is attached, in many cases it's
    not enough, as registering a filter for one type ruins the fast path
    for all others. This is observed on production servers I've looked at,
    but also on laptops, where registration is done during init by systemd
    or something similar.

    Add a per-socket fast path check guarding against such overhead. This
    affects both the receive and transmit paths of TCP, UDP and other
    protocols. It showed a ~1% tx/s improvement in small-payload UDP send
    benchmarks using a real NIC in a server environment, and the number
    jumps to 2-3% for preemptible kernels.
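
    The guard has roughly the following shape (recalled from the patch and
    approximate; the macro name and the empty-array sentinel may differ):

        /* Skip the effective prog-array walk when nothing is attached to
         * this socket's cgroup for the given attach type.
         */
        #define cgroup_bpf_sock_enabled(sk, atype)                              \
        ({                                                                      \
                struct cgroup *__cgrp = sock_cgroup_ptr(&(sk)->sk_cgrp_data);  \
                struct bpf_prog_array *__a;                                     \
                __a = rcu_access_pointer(__cgrp->bpf.effective[(atype)]);       \
                __a != &bpf_empty_prog_array.hdr;                               \
        })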

    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:44 +02:00
Jiri Benc d1647a95d0 bpf: generalise tail call map compatibility check
Bugzilla: https://bugzilla.redhat.com/2120966

commit f45d5b6ce2e835834c94b8b700787984f02cd662
Author: Toke Hoiland-Jorgensen <toke@redhat.com>
Date:   Fri Jan 21 11:10:02 2022 +0100

    bpf: generalise tail call map compatibility check

    The check for tail call map compatibility ensures that tail calls only
    happen between maps of the same type. To ensure backwards compatibility for
    XDP frags we need a similar type of check for cpumap and devmap
    programs, so move the state from bpf_array_aux into bpf_map, add
    xdp_has_frags to the check, and apply the same check to cpumap and devmap.

    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Toke Hoiland-Jorgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/f19fd97c0328a39927f3ad03e1ca6b43fd53cdfd.1642758637.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:42 +02:00
Artem Savkov 20f48d42ec bpf, docs: Prune all references to "internal BPF"
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 06edc59c1fd7aababc8361655b20f4cc9870aef2
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 19 17:32:13 2021 +0100

    bpf, docs: Prune all references to "internal BPF"

    The eBPF name has completely taken over from "internal BPF" in general usage
    for the actual eBPF representation, or just BPF for any general in-kernel use.
    Prune all remaining references to "internal BPF".

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:41 +02:00
Artem Savkov 900322337a bpf: Remove a redundant comment on bpf_prog_free
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit ccb00292eb2dbb58a55850639356d07630cd3c46
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 19 17:32:12 2021 +0100

    bpf: Remove a redundant comment on bpf_prog_free

    The comment saying that the prog_free helper is freeing the program is
    not exactly useful, so just remove it.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20211119163215.971383-3-hch@lst.de

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:41 +02:00
Yauheni Kaliuta f76af43005 bpf: Introduce BPF support for kernel module function calls
Bugzilla: http://bugzilla.redhat.com/2069045

Conflicts: already applied
  3990ed4c4266 ("bpf: Stop caching subprog index in the bpf_pseudo_func insn")

commit 2357672c54c3f748f675446f8eba8b0432b1e7e2
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Sat Oct 2 06:47:49 2021 +0530

    bpf: Introduce BPF support for kernel module function calls

    This change adds support on the kernel side to allow for BPF programs to
    call kernel module functions. Userspace will prepare an array of module
    BTF fds that is passed in during BPF_PROG_LOAD using the fd_array parameter.
    In the kernel, the module BTFs are placed in the auxiliary struct for
    bpf_prog, and loaded as needed.

    The verifier then uses insn->off to index into the fd_array. insn->off
    0 is reserved for vmlinux BTF (for backwards compat), so userspace must
    use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
    sorted based on offset in an array, and each offset corresponds to one
    descriptor, with a max limit up to 256 such module BTFs.

    We also change existing kfunc_tab to distinguish each element based on
    imm, off pair as each such call will now be distinct.

    Another change is to the check_kfunc_call callback, which now includes a
    struct module * pointer. This is to be used in a later patch such that the
    kfunc_id and module pointer are matched for dynamically registered BTF
    sets from loadable modules, so that the same kfunc_id in two modules doesn't
    lead to check_kfunc_call succeeding. For the duration of the
    check_kfunc_call, the reference to struct module exists, as it returns
    the pointer stored in kfunc_btf_tab.
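
    A hypothetical userspace sketch of the loading side (raw syscall form;
    module_btf_fd, insns and insn_cnt are placeholders):

        union bpf_attr attr = {};
        int fd_array[2] = { 0, module_btf_fd };  /* index 0: vmlinux BTF (reserved) */

        attr.prog_type = BPF_PROG_TYPE_TRACING;
        attr.insns     = (__u64)(unsigned long)insns;
        attr.insn_cnt  = insn_cnt;
        attr.license   = (__u64)(unsigned long)"GPL";
        attr.fd_array  = (__u64)(unsigned long)fd_array;

        prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));

    A kfunc call insn in 'insns' would then carry the target's BTF id in
    insn->imm and the fd_array index of its module BTF in insn->off.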

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:38 +03:00
Yauheni Kaliuta 3c3c123ddb bpf: Add bpf_trace_vprintk helper
Bugzilla: http://bugzilla.redhat.com/2069045

commit 10aceb629e198429c849d5e995c3bb1ba7a9aaa3
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Fri Sep 17 11:29:05 2021 -0700

    bpf: Add bpf_trace_vprintk helper
    
    This helper is meant to be "bpf_trace_printk, but with proper vararg
    support". Follow bpf_snprintf's example and take a u64 pseudo-vararg
    array. Write to /sys/kernel/debug/tracing/trace_pipe using the same
    mechanism as bpf_trace_printk. The functionality of this helper was
    requested in the libbpf issue tracker [0].
    
    [0] Closes: https://github.com/libbpf/libbpf/issues/315
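
    A hypothetical BPF-side usage sketch (the format string and values are made
    up; the helper takes the u64 array plus its size in bytes):

        static const char fmt[] = "pid %d comm %s val %lu";
        __u64 args[] = { pid, (__u64)(long)comm, val };

        bpf_trace_vprintk(fmt, sizeof(fmt), args, sizeof(args));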
    
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210917182911.2426606-4-davemarchevsky@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:16:14 +03:00
Jerome Marchand dc7ed28382 bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33
Bugzilla: https://bugzilla.redhat.com/2041365

Conflicts: Missing commit 29eef85be2f6 ("bpf/tests: Add tail call
limit test with external function call") and commit eb63cfcd2ee8
("mips, bpf: Add eBPF JIT for 32-bit MIPS").

commit ebf7f6f0a6cdcc17a3da52b81e4b3a98c4005028
Author: Tiezhu Yang <yangtiezhu@loongson.cn>
Date:   Fri Nov 5 09:30:00 2021 +0800

    bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33

    In the current code, the actual max tail call count is 33 which is greater
    than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
    with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
    We can see the historical evolution from commit 04fd61ab36 ("bpf: allow
    bpf programs to tail-call other bpf programs") and commit f9dabe016b63
    ("bpf: Undo off-by-one in interpreter tail call count limit"). In order
    to avoid changing existing behavior, the actual limit is 33 now, this is
    reasonable.

    After commit 874be05f525e ("bpf, tests: Add tail call test suite"), we can
    see that a failing testcase exists.

    On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
     # echo 0 > /proc/sys/net/core/bpf_jit_enable
     # modprobe test_bpf
     # dmesg | grep -w FAIL
     Tail call error path, max count reached jited:0 ret 34 != 33 FAIL

    On some archs:
     # echo 1 > /proc/sys/net/core/bpf_jit_enable
     # modprobe test_bpf
     # dmesg | grep -w FAIL
     Tail call error path, max count reached jited:1 ret 34 != 33 FAIL

    Although the above failed testcase has been fixed in commit 18935a72eb25
    ("bpf/tests: Fix error in tail call limit tests"), it would still be good
    to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
    more readable.

    The 32-bit x86 JIT was using a limit of 32; fix the wrong comments and the
    limit to 33 tail calls as the constant MAX_TAIL_CALL_CNT is updated. For the
    mips64 JIT, use "ori" instead of "addiu" as suggested by Johan Almbladh.
    For the riscv JIT, use RV_REG_TCC directly to save one register move as
    suggested by Björn Töpel. For the other implementations there are no functional
    changes: the current limit of 33 is unchanged, the new value of MAX_TAIL_CALL_CNT
    reflects the actual max tail call count, and the related tail call testcases
    in the test_bpf module and selftests work for both the interpreter and the
    JIT.

    Here are the test results on x86_64:

     # uname -m
     x86_64
     # echo 0 > /proc/sys/net/core/bpf_jit_enable
     # modprobe test_bpf test_suite=test_tail_calls
     # dmesg | tail -1
     test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
     # rmmod test_bpf
     # echo 1 > /proc/sys/net/core/bpf_jit_enable
     # modprobe test_bpf test_suite=test_tail_calls
     # dmesg | tail -1
     test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
     # rmmod test_bpf
     # ./test_progs -t tailcalls
     #142 tailcalls:OK
     Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED

    Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
    Acked-by: Björn Töpel <bjorn@kernel.org>
    Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
    Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:17 +02:00
Jerome Marchand 483ae4a299 bpf: Fix potential race in tail call compatibility check
Bugzilla: http://bugzilla.redhat.com/2041365

commit 54713c85f536048e685258f880bf298a74c3620d
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Tue Oct 26 13:00:19 2021 +0200

    bpf: Fix potential race in tail call compatibility check

    Lorenzo noticed that the code testing for program type compatibility of
    tail call maps is potentially racy in that two threads could encounter a
    map with an unset type simultaneously and both return true even though they
    are inserting incompatible programs.

    The race window is quite small, but artificially enlarging it by adding a
    usleep_range() inside the check in bpf_prog_array_compatible() makes it
    trivial to trigger from userspace with a program that does, essentially:

            map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
            pid = fork();
            if (pid) {
                    key = 0;
                    value = xdp_fd;
            } else {
                    key = 1;
                    value = tc_fd;
            }
            err = bpf_map_update_elem(map_fd, &key, &value, 0);

    While the race window is small, it has potentially serious ramifications in
    that triggering it would allow a BPF program to tail call to a program of a
    different type. So let's get rid of it by protecting the update with a
    spinlock. The commit in the Fixes tag is the last commit that touches the
    code in question.

    v2:
    - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
    v3:
    - Put lock and the members it protects into an embedded 'owner' struct (Daniel)
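
    The resulting check is roughly the following (reconstructed from the
    description above, so field names are approximate):

        spin_lock(&array->aux->owner.lock);
        if (!array->aux->owner.type) {
                /* No owner yet; the first program decides the compatible type. */
                array->aux->owner.type  = fp->type;
                array->aux->owner.jited = fp->jited;
                ret = true;
        } else {
                ret = array->aux->owner.type  == fp->type &&
                      array->aux->owner.jited == fp->jited;
        }
        spin_unlock(&array->aux->owner.lock);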

    Fixes: 3324b584b6 ("ebpf: misc core cleanup")
    Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:15 +02:00
Jerome Marchand b2ef65c25f bpf: Prevent increasing bpf_jit_limit above max
Bugzilla: http://bugzilla.redhat.com/2041365

commit fadb7ff1a6c2c565af56b4aacdd086b067eed440
Author: Lorenz Bauer <lmb@cloudflare.com>
Date:   Thu Oct 14 15:25:53 2021 +0100

    bpf: Prevent increasing bpf_jit_limit above max

    Restrict bpf_jit_limit to the maximum supported by the arch's JIT.

    Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:15 +02:00
Jerome Marchand 0f25a832ce bpf: Exempt CAP_BPF from checks against bpf_jit_limit
Bugzilla: http://bugzilla.redhat.com/2041365

commit 8a98ae12fbefdb583a7696de719a1d57e5e940a2
Author: Lorenz Bauer <lmb@cloudflare.com>
Date:   Wed Sep 22 12:11:52 2021 +0100

    bpf: Exempt CAP_BPF from checks against bpf_jit_limit

    When introducing CAP_BPF, bpf_jit_charge_modmem() was not changed to treat
    programs with CAP_BPF as privileged for the purpose of JIT memory allocation.
    This means that a program without CAP_BPF can block a program with CAP_BPF
    from loading a program.

    Fix this by checking bpf_capable() in bpf_jit_charge_modmem().
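
    The charge path then looks roughly like this (sizes were still expressed
    in pages at the time; treat this as an approximation of the patched
    bpf_jit_charge_modmem()):

        if (atomic_long_add_return(pages, &bpf_jit_current) >
            (bpf_jit_limit >> PAGE_SHIFT)) {
                if (!bpf_capable()) {
                        atomic_long_sub(pages, &bpf_jit_current);
                        return -EPERM;
                }
        }
        return 0;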

    Fixes: 2c78ee898d ("bpf: Implement CAP_BPF")
    Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210922111153.19843-1-lmb@cloudflare.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:13 +02:00
Jerome Marchand 37c7fd4359 bpf: Undo off-by-one in interpreter tail call count limit
Bugzilla: http://bugzilla.redhat.com/2041365

commit f9dabe016b63c9629e152bf876c126c29de223cb
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Thu Aug 19 15:59:33 2021 +0200

    bpf: Undo off-by-one in interpreter tail call count limit

    The BPF interpreter as well as x86-64 BPF JIT were both in line by allowing
    up to 33 tail calls (however odd that number may be!). Recently, this was
    changed for the interpreter to reduce it down to 32 with the assumption that
    this should have been the actual limit "which is in line with the behavior of
    the x86 JITs" according to b61a28cf11d61 ("bpf: Fix off-by-one in tail call
    count limiting").

    Paul recently reported:

      I'm a bit surprised by this because I had previously tested the tail call
      limit of several JIT compilers and found it to be 33 (i.e., allowing chains
      of up to 34 programs). I've just extended a test program I had to validate
      this again on the x86-64 JIT, and found a limit of 33 tail calls again [1].

      Also note we had previously changed the RISC-V and MIPS JITs to allow up to
      33 tail calls [2, 3], for consistency with other JITs and with the interpreter.
      We had decided to increase these two to 33 rather than decrease the other
      JITs to 32 for backward compatibility, though that probably doesn't matter
      much as I'd expect few people to actually use 33 tail calls.

      [1] ae78874829
      [2] 96bc4432f5 ("bpf, riscv: Limit to 33 tail calls")
      [3] e49e6f6db0 ("bpf, mips: Limit to 33 tail calls")

    Therefore, revert b61a28cf11d61 to re-align the interpreter to a limit of
    33 tail calls. While the vast majority is unlikely to hit the limit,
    programs in the wild could one way or another depend on this, so let's rather
    be a bit more conservative, and let's align the small remainder of JITs to 33.
    If needed in future, this limit could be slightly increased, but not decreased.

    Fixes: b61a28cf11d61 ("bpf: Fix off-by-one in tail call count limiting")
    Reported-by: Paul Chaignon <paul@cilium.io>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/CAO5pjwTWrC0_dzTbTHFPSqDwA56aVH+4KFGVqdq8=ASs0MqZGQ@mail.gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:43 +02:00
Jerome Marchand b0371ec3e5 bpf: Allow to specify user-provided bpf_cookie for BPF perf links
Bugzilla: http://bugzilla.redhat.com/2041365

commit 82e6b1eee6a8875ef4eacfd60711cce6965c6b04
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sun Aug 15 00:05:58 2021 -0700

    bpf: Allow to specify user-provided bpf_cookie for BPF perf links

    Add ability for users to specify custom u64 value (bpf_cookie) when creating
    BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
    tracepoints).

    This is useful for cases when the same BPF program is used for attaching to and
    processing invocations of different tracepoints/kprobes/uprobes in a generic
    fashion, but such that each invocation is distinguished from the others (e.g.,
    a BPF program can look up additional information associated with a specific
    kernel function without having to rely on function IP lookups). This enables
    new use cases to be implemented simply and efficiently that previously were
    possible only through code generation (and thus multiple instances of almost
    identical BPF programs) or compilation at runtime (BCC-style) on target hosts
    (even more expensive resource-wise). For uprobes it is not even possible in
    some cases to know the function IP beforehand (e.g., when attaching to a shared
    library without PID filtering, in which case the base load address is not known
    for the library).

    This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
    corresponding to each attached and run BPF program. Given cgroup BPF programs
    already use two 8-byte pointers for their needs and cgroup BPF programs don't
    have (yet?) support for bpf_cookie, reuse that space through union of
    cgroup_storage and new bpf_cookie field.

    Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
    This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
    program execution code, which luckily is now also split from
    BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
    giving access to this user-provided cookie value from inside a BPF program.
    Generic perf_event BPF programs will access this value from the perf_event
    itself through the passed-in BPF program context.
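
    A hypothetical usage sketch, with the libbpf option struct and helper names
    approximated (the cookie-reading helper itself lands in a follow-up patch of
    this series):

        /* userspace: attach the same program with a distinguishing cookie */
        LIBBPF_OPTS(bpf_kprobe_opts, opts, .bpf_cookie = 42);
        link = bpf_program__attach_kprobe_opts(prog, "tcp_v4_connect", &opts);

        /* BPF side: recover the value set at attach time */
        __u64 cookie = bpf_get_attach_cookie(ctx);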

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:41 +02:00
Jerome Marchand 5b5312cf91 bpf: Refactor BPF_PROG_RUN into a function
Bugzilla: http://bugzilla.redhat.com/2041365

Conflicts: Missing commit 879af96ffd72 ("net, core: Add support for XDP redirection to slave device")

commit fb7dd8bca0139fd73d3f4a6cd257b11731317ded
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sun Aug 15 00:05:54 2021 -0700

    bpf: Refactor BPF_PROG_RUN into a function

    Turn BPF_PROG_RUN into a proper always-inlined function. No functional or
    performance changes are intended, but it makes it much easier to understand
    what's going on with how BPF programs actually get executed. It's more
    obvious what types and callbacks are expected. Also, extra () around input
    parameters can be dropped, as well as `__` variable prefixes intended to avoid
    naming collisions, which makes the code simpler to read and write.

    This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
    a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
    BPF_PROG_RUN into a function causes a naming conflict compilation error. So
    rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
    bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
    BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:40 +02:00
Jerome Marchand 325180400b bpf: Fix off-by-one in tail call count limiting
Bugzilla: http://bugzilla.redhat.com/2041365

commit b61a28cf11d61f512172e673b8f8c4a6c789b425
Author: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Date:   Wed Jul 28 18:47:41 2021 +0200

    bpf: Fix off-by-one in tail call count limiting

    Before, the interpreter allowed up to MAX_TAIL_CALL_CNT + 1 tail calls.
    Now precisely MAX_TAIL_CALL_CNT is allowed, which is in line with the
    behavior of the x86 JITs.

    Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210728164741.350370-1-johan.almbladh@anyfinetworks.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:37 +02:00
Yauheni Kaliuta 1028699c17 bpf: Stop caching subprog index in the bpf_pseudo_func insn
Bugzilla: http://bugzilla.redhat.com/2033596

Conflicts: context (add_kfunc_call()) due to missing
  2357672c54c3 ("bpf: Introduce BPF support for kernel module function calls")

commit 3990ed4c426652fcd469f8c9dc08156294b36c28
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Fri Nov 5 18:40:14 2021 -0700

    bpf: Stop caching subprog index in the bpf_pseudo_func insn

    This patch fixes an out-of-bounds access issue when jit-ing the
    bpf_pseudo_func insn (i.e. ld_imm64 with src_reg == BPF_PSEUDO_FUNC).

    In jit_subprog(), it currently reuses the subprog index cached in
    insn[1].imm.  This subprog index is an index into a few arrays related
    to subprogs.  For example, in jit_subprog(), it is an index into the newly
    allocated 'struct bpf_prog **func' array.

    The subprog index was cached in insn[1].imm after add_subprog().  However,
    this could become outdated (and too big in this case) if some subprogs
    are completely removed during dead code elimination (in
    adjust_subprog_starts_after_remove).  The cached index in insn[1].imm
    is not updated accordingly, causing an out-of-bounds issue in the later
    jit_subprog().

    Unlike the bpf_pseudo_'func' insn, the current bpf_pseudo_'call' insn
    handles the DCE properly by calling find_subprog(insn->imm) to
    figure out the index instead of caching the subprog index.
    The existing bpf_adj_branches() will adjust the insn->imm
    whenever an insn is added or removed.

    Instead of having two ways of handling the subprog index,
    this patch makes bpf_pseudo_func work more like
    bpf_pseudo_call.

    The first change is to stop caching the subprog index result
    in insn[1].imm after add_subprog().  The verification
    process will use find_subprog(insn->imm) to figure
    out the subprog index.

    The second change is in bpf_adj_branches(): have it
    adjust the insn->imm for the bpf_pseudo_func insn as well,
    whenever an insn is added or removed.

    The third change is in jit_subprog().  Like the bpf_pseudo_call handling,
    bpf_pseudo_func temporarily stores the find_subprog() result
    in insn->off.  It is fine because the prog's insn has been finalized
    at this point.  insn->off will be reset back to 0 later to avoid
    confusing the userspace prog dump tool.

    Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211106014014.651018-1-kafai@fb.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-02-09 00:39:20 +02:00
Jiri Olsa 4a4798ed0a [kernel] bpf: set default value for bpf_jit_harden
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028734

Upstream Status: RHEL only

The patch for configuring a boot-time value for these
options has been proposed [1] and rejected upstream.

[1] https://lkml.org/lkml/2018/5/23/449

Set the default value for the net.bpf_jit_harden sysctl.

 - net.bpf_jit_harden is set to 1: it's a compromise between the fact that
   by default we do not have unprivileged BPF enabled (and there's little
   reason for enforcing constant blinding for root programs by default,
   considering performance tradeoffs), and providing some sane default for
   users that still want unprivileged BPF (and enable it via the boot
   option).

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
2021-12-05 01:16:42 +01:00
Randy Dunlap 019d0454c6 bpf, core: Fix kernel-doc notation
Fix kernel-doc warnings in kernel/bpf/core.c (found by scripts/kernel-doc
and W=1 builds). That is, correct a function name in a comment and add
return descriptions for 2 functions.

Fixes these kernel-doc warnings:

  kernel/bpf/core.c:1372: warning: expecting prototype for __bpf_prog_run(). Prototype was for ___bpf_prog_run() instead
  kernel/bpf/core.c:1372: warning: No description found for return value of '___bpf_prog_run'
  kernel/bpf/core.c:1883: warning: No description found for return value of 'bpf_prog_select_runtime'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809215229.7556-1-rdunlap@infradead.org
2021-08-10 13:09:28 +02:00
Daniel Borkmann f5e81d1117 bpf: Introduce BPF nospec instruction for mitigating Spectre v4
In case of JITs, each of the JIT backends compiles the BPF nospec instruction
/either/ to a machine instruction which emits a speculation barrier /or/ to
/no/ machine instruction in case the underlying architecture is not affected
by Speculative Store Bypass or has different mitigations in place already.

This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
instruction for mitigation. In case of arm64, we rely on the firmware mitigation
as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
it works for all of the kernel code with no need to provide any additional
instructions here (hence only comment in arm64 JIT). Other archs can follow
as needed. The BPF nospec instruction is specifically targeting Spectre v4
since i) we don't use a serialization barrier for the Spectre v1 case, and
ii) mitigation instructions for v1 and v4 might be different on some archs.

The BPF nospec is required for a future commit, where the BPF verifier
annotate intermediate BPF programs with speculation barriers.

Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-07-29 00:20:56 +02:00
John Fastabend f263a81451 bpf: Track subprog poke descriptors correctly and fix use-after-free
Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():

  [...]
  [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
  [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
  [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
  [  402.824715] Call Trace:
  [  402.824719]  dump_stack+0x93/0xc2
  [  402.824727]  print_address_description.constprop.0+0x1a/0x140
  [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824744]  kasan_report.cold+0x7c/0xd8
  [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
  [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
  [...]

The elements concerned are walked as follows:

    for (i = 0; i < elem->aux->size_poke_tab; i++) {
           poke = &elem->aux->poke_tab[i];
    [...]

The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:

  [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                 which belongs to the cache kmalloc-1k of size 1024
  [  402.825008] The buggy address is located 320 bytes inside of
                 1024-byte region [ffff8881905a7800, ffff8881905a7c00)

The pahole output of bpf_prog_aux:

  struct bpf_prog_aux {
    [...]
    /* --- cacheline 5 boundary (320 bytes) --- */
    u32                        size_poke_tab;        /*   320     4 */
    [...]

In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects specially; I agree, it's unnecessary. The fix here removes the per-
subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main program's poke table. This way, map
tracking is simplified to the main program and we do not need to manage them
per subprogram.

This also means bpf_prog_free_deferred(), which unwinds the program reference
counting and kfrees objects, needs to ensure that we don't try to double free
the poke_tab when free'ing the subprog structures. This is easily solved by
NULL'ing the poke_tab pointer. The second detail is to ensure that the per-
subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check while walking the poke_tab structure if the current
entry belongs to the current program. The aux pointer is stable and therefore
suitable for such comparison. On the jit_subprogs() error path, we omit
cleaning up the poke->aux field because these are only ever referenced from
the JIT side, but on error we will never make it to the JIT, so it's fine to
leave them dangling. Removing these pointers would complicate the error path
for no reason. However, we do need to untrack all poke descriptors from the
main program as otherwise they could race with the freeing of JIT memory from
the subprograms. Lastly, a748c6975d ("bpf: propagate poke descriptors to
subprograms") had an off-by-one on the subprogram instruction index range
check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
However, subprog_end is the next subprogram's start instruction.

Fixes: a748c6975d ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
2021-07-09 12:08:27 +02:00
Daniel Borkmann 28131e9d93 bpf: Fix up register-based shifts in interpreter to silence KUBSAN
syzbot reported a shift-out-of-bounds that KUBSAN observed in the
interpreter:

  [...]
  UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2
  shift exponent 255 is too large for 64-bit type 'long long unsigned int'
  CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:79 [inline]
   dump_stack+0x141/0x1d7 lib/dump_stack.c:120
   ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
   __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
   ___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420
   __bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735
   bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
   bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline]
   bpf_prog_run_clear_cb include/linux/filter.h:755 [inline]
   run_filter+0x1a1/0x470 net/packet/af_packet.c:2031
   packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104
   dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387
   xmit_one net/core/dev.c:3588 [inline]
   dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609
   __dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182
   __bpf_tx_skb net/core/filter.c:2116 [inline]
   __bpf_redirect_no_mac net/core/filter.c:2141 [inline]
   __bpf_redirect+0x548/0xc80 net/core/filter.c:2164
   ____bpf_clone_redirect net/core/filter.c:2448 [inline]
   bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420
   ___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523
   __bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737
   bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
   bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50
   bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582
   bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline]
   __do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406
   do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Generally speaking, KUBSAN reports from the kernel should be fixed.
However, in case of BPF, this particular report caused concerns since
the large shift is not wrong from BPF point of view, just undefined.
In the verifier, K-based shifts that are >= {64,32} (depending on the
bitwidth of the instruction) are already rejected. The register-based
cases were not, given that their content might not be known at verification
time. Ideas such as verifier instruction rewrite with an additional
AND instruction for the source register were brought up, but regularly
rejected due to the additional runtime overhead they incur.

As Edward Cree rightly put it:

  Shifts by more than insn bitness are legal in the BPF ISA; they are
  implementation-defined behaviour [of the underlying architecture],
  rather than UB, and have been made legal for performance reasons.
  Each of the JIT backends compiles the BPF shift operations to machine
  instructions which produce implementation-defined results in such a
  case; the resulting contents of the register may be arbitrary but
  program behaviour as a whole remains defined.

  Guard checks in the fast path (i.e. affecting JITted code) will thus
  not be accepted.

  The case of division by zero is not truly analogous here, as division
  instructions on many of the JIT-targeted architectures will raise a
  machine exception / fault on division by zero, whereas (to the best
  of my knowledge) none will do so on an out-of-bounds shift.

Given the KUBSAN report only affects the BPF interpreter, but not the JITs,
one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run().
That would make the shifts defined, and thus shut up KUBSAN, and the
compiler would optimize out the AND on any CPU that interprets the shift
amounts modulo the width anyway (e.g., confirmed from disassembly that
on x86-64 and arm64 the generated interpreter code is the same before
and after this fix).

The BPF interpreter is the slow path, and most likely compiled out anyway,
as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of
BPF instructions by the interpreter. Given the main argument was to
avoid sacrificing performance, the fact that the AND is optimized away
from compiler for mainstream archs helps as well as a solution moving
forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors
to provide guidance when they see the ___bpf_prog_run() interpreter
code and use it as a model for a new JIT backend.
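
The gist of the interpreter change, as far as the shift handling goes (a sketch
of the relevant ___bpf_prog_run() cases, not a verbatim diff):

  ALU64_LSH_X:
          DST = DST << (SRC & 63);
          CONT;
  ALU_LSH_X:
          DST = (u32) DST << (SRC & 31);
          CONT;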

Reported-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Reported-by: Kurt Manucredo <fuzzybritches0@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Co-developed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Cc: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/0000000000008f912605bd30d5d7@google.com
Link: https://lore.kernel.org/bpf/bac16d8d-c174-bdc4-91bd-bfa62b410190@gmail.com
2021-06-17 12:04:37 +02:00
He Fengqing 2ec9898e9c bpf: Remove unused parameter from ___bpf_prog_run
The 'stack' parameter is not used in ___bpf_prog_run() after f696b8f471
("bpf: split bpf core interpreter"), as the base address has been set in
the FP reg. So remove it.

Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210331075135.3850782-1-hefengqing@huawei.com
2021-04-03 01:38:52 +02:00
Martin KaFai Lau e6ac2450d6 bpf: Support bpf program calling kernel function
This patch adds support to the BPF verifier to allow bpf programs to call
kernel functions directly.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation. For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be whitelisted
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The whitelisted functions are not bound to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them has changed, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs which have to be adjusted accordingly.

This patch is to make the required changes in the bpf verifier.

The first change is in btf.c; it adds a case in "btf_check_func_arg_match()".
When the passed-in "btf->kernel_btf == true", it means matching the
verifier regs' states with a kernel function.  This will handle the
PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
and PTR_TO_TCP_SOCK to its kernel's btf_id.

In the later libbpf patch, the insn calling a kernel function will
look like:

insn->code == (BPF_JMP | BPF_CALL)
insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
insn->imm == func_btf_id /* btf_id of the running kernel */

[ For the future calling function-in-kernel-module support, an array
  of module btf_fds can be passed at the load time and insn->off
  can be used to index into this array. ]

At the early stage of verification, the verifier will collect all kernel
function calls into "struct bpf_kfunc_desc".  Those
descriptors are stored in "prog->aux->kfunc_tab" and will
be available to the JIT.  Since this "add" operation is similar
to the current "add_subprog()" and looking for the same insn->code,
they are done together in the new "add_subprog_and_kfunc()".

In the "do_check()" stage, the new "check_kfunc_call()" is added
to verify the kernel function call instruction:
1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
   A new bpf_verifier_ops "check_kfunc_call" is added to do that.
   The bpf-tcp-cc struct_ops program will implement this function in
   a later patch.
2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
   used as the args of a kernel function.
3. Mark the regs' type, subreg_def, and zext_dst.

At the later do_misc_fixups() stage, the new fixup_kfunc_call()
will replace the insn->imm with the function address (relative
to __bpf_call_base).  If needed, the jit can find the btf_func_model
by calling the new bpf_jit_find_kfunc_model(prog, insn).
With the imm set to the function address, "bpftool prog dump xlated"
will be able to display the kernel function calls the same way as
it displays other bpf helper calls.

A gpl_compatible program is required to call kernel functions.

This feature currently requires JIT.

The verifier selftests are adjusted because of the changes in
the verbose log in add_subprog_and_kfunc().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
2021-03-26 20:41:51 -07:00
Martin KaFai Lau e16301fbe1 bpf: Simplify freeing logic in linfo and jited_linfo
This patch simplifies the linfo freeing logic by combining
"bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()"
into the new "bpf_prog_jit_attempt_done()".
It is a prep work for the kernel function call support.  In a later
patch, freeing the kernel function call descriptors will also
be done in the "bpf_prog_jit_attempt_done()".

"bpf_prog_free_linfo()" is removed since it is only called by
"__bpf_prog_put_noref()".  The kvfree() are directly called
instead.

It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo
allocation.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015130.1544323-1-kafai@fb.com
2021-03-26 20:41:50 -07:00
Alexei Starovoitov e21aa34178 bpf: Fix fexit trampoline.
The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
The synchronize_rcu_tasks() will not wait for such tasks to complete.
In such a case the trampoline image will be freed and when the task
wakes up the return IP will point to freed memory, causing a crash.
Solve this by adding percpu_ref_get/put for the duration of the trampoline
and separating the trampoline's lifetime from that of its image.
The "half page" optimization has to be removed, since
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. The image with
fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
will wait for both sleepable and non-sleepable progs to complete.

Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
2021-03-18 00:22:51 +01:00
Brendan Jackman 39491867ac bpf: Explicitly zero-extend R0 after 32-bit cmpxchg
As pointed out by Ilya and explained in the new comment, there's a
discrepancy between x86 and BPF CMPXCHG semantics: BPF always loads
the value from memory into r0, while x86 only does so when r0 and the
value in memory are different. The same issue affects s390.

At first this might sound like pure semantics, but it makes a real
difference when the comparison is 32-bit, since the load will
zero-extend r0/rax.

The fix is to explicitly zero-extend rax after doing such a
CMPXCHG. Since this problem affects multiple archs, this is done in
the verifier by patching in a BPF_ZEXT_REG instruction after every
32-bit cmpxchg. Any archs that don't need such manual zero-extension
can do a look-ahead with insn_is_zext to skip the unnecessary mov.
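
A sketch of what the patched sequence ends up looking like (names follow the
description above; the exact verifier plumbing is omitted):

  /* A 32-bit cmpxchg always writes its result into r0, so the verifier
   * appends an explicit zero-extension of r0 rather than of dst_reg.
   */
  zext_patch[0] = *insn;                     /* the original 32-bit cmpxchg */
  zext_patch[1] = BPF_ZEXT_REG(BPF_REG_0);   /* mov32 r0, r0 (zero-extend)  */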

Note this still goes on top of Ilya's patch:

https://lore.kernel.org/bpf/20210301154019.129110-1-iii@linux.ibm.com/T/#u

Differences v5->v6[1]:
 - Moved is_cmpxchg_insn and ensured it can be safely re-used. Also renamed it
   and removed 'inline' to match the style of the is_*_function helpers.
 - Fixed up comments in verifier test (thanks for the careful review, Martin!)

Differences v4->v5[1]:
 - Moved the logic entirely into opt_subreg_zext_lo32_rnd_hi32, thanks to Martin
   for suggesting this.

Differences v3->v4[1]:
 - Moved the optimization against pointless zext into the correct place:
   opt_subreg_zext_lo32_rnd_hi32 is called _after_ fixup_bpf_calls.

Differences v2->v3[1]:
 - Moved patching into fixup_bpf_calls (patch incoming to rename this function)
 - Added extra commentary on bpf_jit_needs_zext
 - Added check to avoid adding a pointless zext(r0) if there's already one there.

Difference v1->v2[1]: Now solved centrally in the verifier instead of
  specifically for the x86 JIT. Thanks to Ilya and Daniel for the suggestions!

[1] v5: https://lore.kernel.org/bpf/CA+i-1C3ytZz6FjcPmUg5s4L51pMQDxWcZNvM86w4RHZ_o2khwg@mail.gmail.com/T/#t
    v4: https://lore.kernel.org/bpf/CA+i-1C3ytZz6FjcPmUg5s4L51pMQDxWcZNvM86w4RHZ_o2khwg@mail.gmail.com/T/#t
    v3: https://lore.kernel.org/bpf/08669818-c99d-0d30-e1db-53160c063611@iogearbox.net/T/#t
    v2: https://lore.kernel.org/bpf/08669818-c99d-0d30-e1db-53160c063611@iogearbox.net/T/#t
    v1: https://lore.kernel.org/bpf/d7ebaefb-bfd6-a441-3ff2-2fdfe699b1d2@iogearbox.net/T/#t

Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Fixes: 5ffa25502b ("bpf: Add instructions for atomic_[cmp]xchg")
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-03-04 19:06:03 -08:00
Cong Wang 53f523f305 bpf: Clear percpu pointers in bpf_prog_clone_free()
Similar to bpf_prog_realloc(), bpf_prog_clone_create() also copies
the percpu pointers, but the clone still shares them with the original
prog, so we have to clear these two percpu pointers in
bpf_prog_clone_free(). Otherwise we would get a double free:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 13 PID: 8140 Comm: kworker/13:247 Kdump: loaded Tainted: G                W    OE
  5.11.0-rc4.bm.1-amd64+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
 test_bpf: #1 TXA
 Workqueue: events bpf_prog_free_deferred
 RIP: 0010:percpu_ref_get_many.constprop.97+0x42/0xf0
 Code: [...]
 RSP: 0018:ffffa6bce1f9bda0 EFLAGS: 00010002
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000021dfc7b
 RDX: ffffffffae2eeb90 RSI: 867f92637e338da5 RDI: 0000000000000046
 RBP: ffffa6bce1f9bda8 R08: 0000000000000000 R09: 0000000000000001
 R10: 0000000000000046 R11: 0000000000000000 R12: 0000000000000280
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9b5f3ffdedc0
 FS:    0000000000000000(0000) GS:ffff9b5f2fb40000(0000) knlGS:0000000000000000
 CS:    0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000027c36c002 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
    refill_obj_stock+0x5e/0xd0
    free_percpu+0xee/0x550
    __bpf_prog_free+0x4d/0x60
    process_one_work+0x26a/0x590
    worker_thread+0x3c/0x390
    ? process_one_work+0x590/0x590
    kthread+0x130/0x150
    ? kthread_park+0x80/0x80
    ret_from_fork+0x1f/0x30

This bug is 100% reproducible with test_kmod.sh.

Fixes: 700d4796ef ("bpf: Optimize program stats")
Fixes: ca06f55b90 ("bpf: Add per-program recursion prevention mechanism")
Reported-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210218001647.71631-1-xiyou.wangcong@gmail.com
2021-02-22 18:08:35 +01:00
Alexei Starovoitov 1336c66247 bpf: Clear per_cpu pointers during bpf_prog_realloc
bpf_prog_realloc copies the contents of struct bpf_prog.
The pointers have to be cleared before freeing the old struct.

Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Fixes: 700d4796ef ("bpf: Optimize program stats")
Fixes: ca06f55b90 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-02-11 19:35:00 -08:00
Alexei Starovoitov ca06f55b90 bpf: Add per-program recursion prevention mechanism
Since both sleepable and non-sleepable programs execute under migrate_disable,
add a recursion prevention mechanism to both types of programs when they're
executed via a bpf trampoline.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-5-alexei.starovoitov@gmail.com
2021-02-11 16:19:13 +01:00
Alexei Starovoitov 700d4796ef bpf: Optimize program stats
Move bpf_prog_stats from prog->aux into prog to avoid one extra load
in critical path of program execution.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-2-alexei.starovoitov@gmail.com
2021-02-11 16:17:50 +01:00
Brendan Jackman 981f94c3e9 bpf: Add bitwise atomic instructions
This adds instructions for

atomic[64]_[fetch_]and
atomic[64]_[fetch_]or
atomic[64]_[fetch_]xor

All these operations are isomorphic enough to implement with the same
verifier, interpreter, and x86 JIT code, hence being a single commit.

The main interesting thing here is that x86 doesn't directly support
the fetch_ version of these operations, so we need to generate a CMPXCHG
loop in the JIT. This requires the use of two temporary registers;
IIUC it's safe to use BPF_REG_AX and x86's AUX_REG for this purpose.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-10-jackmanb@google.com
2021-01-14 18:34:29 -08:00
Brendan Jackman 462910670e bpf: Pull out a macro for interpreting atomic ALU operations
Since the atomic operations that are added in subsequent commits are
all isomorphic with BPF_ADD, pull out a macro to avoid the
interpreter becoming dominated by lines of atomic-related code.

Note that this sacrifices interpreter performance (combining
STX_ATOMIC_W and STX_ATOMIC_DW into a single switch case means that we
need an extra conditional branch to differentiate them) in favour of
compact and (relatively!) simple C code.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-9-jackmanb@google.com
2021-01-14 18:34:29 -08:00
Brendan Jackman 5ffa25502b bpf: Add instructions for atomic_[cmp]xchg
This adds two atomic opcodes, both of which include the BPF_FETCH
flag. XCHG without the BPF_FETCH flag would naturally encode
atomic_set. This is not supported because it would be of limited
value to userspace (it doesn't imply any barriers). CMPXCHG without
BPF_FETCH would be an atomic compare-and-write. We don't have such
an operation in the kernel so it isn't provided to BPF either.

There are two significant design decisions made for the CMPXCHG
instruction:

 - This operation fundamentally has 3 operands, but we only have two
   register fields. To solve this, the operand we compare against (the
   kernel's API calls it 'old') is hard-coded to be R0. x86 has a
   similar design (and A64 doesn't have this problem).

   A potential alternative might be to encode the other operand's
   register number in the immediate field.

 - The kernel's atomic_cmpxchg returns the old value, while the C11
   userspace APIs return a boolean indicating the comparison
   result. Which should BPF do? A64 returns the old value. x86 returns
   the old value in the hard-coded register (and also sets a
   flag). That means return-old-value is easier to JIT, so that's
   what we use.
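
As a rough illustration of the resulting encodings (using the BPF_ATOMIC_OP
convenience macro added elsewhere in this series, name recalled from memory;
the semantics in the comments follow the description above):

  /* r0 = cmpxchg(dst_reg + off, r0, src_reg) */
  BPF_ATOMIC_OP(BPF_DW, BPF_CMPXCHG, dst_reg, src_reg, off);

  /* src_reg = xchg(dst_reg + off, src_reg) */
  BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, dst_reg, src_reg, off);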

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-8-jackmanb@google.com
2021-01-14 18:34:29 -08:00
Brendan Jackman 5ca419f286 bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
The BPF_FETCH field can be set in bpf_insn.imm, for BPF_ATOMIC
instructions, in order to have the previous value of the
atomically-modified memory location loaded into the src register
after an atomic op is carried out.

Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-7-jackmanb@google.com
2021-01-14 18:34:29 -08:00
Brendan Jackman 91c960b005 bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
A subsequent patch will add additional atomic operations. These new
operations will use the same opcode field as the existing XADD, with
the immediate discriminating different operations.

In preparation, rename the instruction mode to BPF_ATOMIC and start
calling the zero immediate BPF_ADD.

This is possible (doesn't break existing valid BPF progs) because the
immediate field is currently reserved MBZ and BPF_ADD is zero.

All uses are removed from the tree but the BPF_XADD definition is
kept around to avoid breaking builds for people including kernel
headers.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-5-jackmanb@google.com
2021-01-14 18:34:29 -08:00
Andrii Nakryiko 541c3bad8d bpf: Support BPF ksym variables in kernel modules
Add support for directly accessing kernel module variables from BPF programs
using special ldimm64 instructions. This functionality builds upon vmlinux
ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
specifying kernel module BTF's FD in insn[1].imm field.

During BPF program load time, the verifier will resolve the FD to a BTF object
and will take a reference on the BTF object itself and, for module BTFs, the
corresponding module as well, to make sure it won't be unloaded from under a
running BPF program. The mechanism used is similar to how bpf_prog keeps track
of used bpf_maps.

One interesting change is also in how a per-CPU variable is determined. The
logic is to find the .data..percpu data section in the provided BTF, but both
vmlinux and each module have their own .data..percpu entries in BTF. So for
the module case, the search for the DATASEC record needs to look only at the
module's added BTF types. This is implemented with a custom search function.
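
A rough usage sketch from the BPF program side (the variable name is made
up; the pattern follows the selftests and assumes <bpf/bpf_helpers.h> for
SEC()/__ksym):

	/* libbpf resolves the __ksym extern against the module's BTF and
	 * emits the ldimm64 with src_reg=BPF_PSEUDO_BTF_ID and the module
	 * BTF FD in insn[1].imm. */
	extern const int my_module_percpu_cnt __ksym;

	SEC("raw_tp/sys_enter")
	int read_module_var(const void *ctx)
	{
		int *v = bpf_this_cpu_ptr(&my_module_percpu_cnt);

		return *v;
	}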

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/bpf/20210112075520.4103414-6-andrii@kernel.org
2021-01-12 17:24:30 -08:00
Roman Gushchin 3ac1f01b43 bpf: Eliminate rlimit-based memory accounting for bpf progs
Do not use rlimit-based memory accounting for bpf progs. It has been
replaced with memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin ddf8503c7c bpf: Memcg-based memory accounting for bpf progs
Include memory used by bpf programs into the memcg-based accounting.
This includes the memory used by the programs themselves, auxiliary data,
statistics and bpf line info. The memory cgroup containing the process
which loads the program gets charged.
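
The gist of the change, sketched (the flags shown convey the general idea,
not the full patch):

	/* Charging boils down to allocating prog memory with the accounting
	 * flag set, so the loading process's memcg picks up the cost: */
	gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags;

	fp = __vmalloc(size, gfp_flags);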

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-6-guro@fb.com
2020-12-02 18:28:06 -08:00
Dmitrii Banshchikov d055126180 bpf: Add bpf_ktime_get_coarse_ns helper
The helper uses the CLOCK_MONOTONIC_COARSE time source, which is less
accurate but more performant.

We have a BPF CGROUP_SKB firewall that supports event logging through
bpf_perf_event_output(). Each event has a timestamp and currently we use
bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20
ns in time required for event logging.

bpf_ktime_get_ns():
EgressLogByRemoteEndpoint                              113.82ns    8.79M

bpf_ktime_get_coarse_ns():
EgressLogByRemoteEndpoint                               95.40ns   10.48M
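
A sketch of the logging path described above (map, struct and section names
are illustrative, not taken from our firewall):

	struct {
		__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
		__uint(key_size, sizeof(int));
		__uint(value_size, sizeof(int));
	} events SEC(".maps");

	struct event {
		__u64 ts;
		__u32 daddr;
	};

	SEC("cgroup_skb/egress")
	int log_egress(struct __sk_buff *skb)
	{
		struct event e = {
			/* coarse clock: cheaper, tick-level resolution */
			.ts    = bpf_ktime_get_coarse_ns(),
			.daddr = skb->remote_ip4,
		};

		bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU,
				      &e, sizeof(e));
		return 1;
	}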

Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201117184549.257280-1-me@ubique.spb.ru
2020-11-18 23:25:32 +01:00
Ard Biesheuvel 080b6f4076 bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
Commit 3193c0836 ("bpf: Disable GCC -fgcse optimization for
___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
function scope __attribute__((optimize("-fno-gcse"))), to disable a
GCC specific optimization that was causing trouble on x86 builds, and
was not expected to have any positive effect in the first place.

However, as the GCC manual documents, __attribute__((optimize))
is not for production use, and results in all other optimization
options being forgotten for the function in question. This can
cause all kinds of trouble, but in one particular reported case,
it causes -fno-asynchronous-unwind-tables to be disregarded,
resulting in .eh_frame info being emitted for the function.

This reverts commit 3193c0836, and instead, it disables the -fgcse
optimization for the entire source file, but only when building for
X86 using GCC with CONFIG_BPF_JIT_ALWAYS_ON disabled. Note that the
original commit states that CONFIG_RETPOLINE=n triggers the issue,
whereas CONFIG_RETPOLINE=y performs better without the optimization,
so it is kept disabled in both cases.

Fixes: 3193c0836f ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=GFHpzoj_hCoBQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20201028171506.15682-2-ardb@kernel.org
2020-10-29 20:01:46 -07:00
Linus Torvalds 9ff9b0d392 networking changes for the 5.10 merge window
Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multipath TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require much offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings, reportedly the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
2020-10-15 18:42:13 -07:00
Toke Høiland-Jørgensen 3aac1ead5e bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach
In preparation for allowing multiple attachments of freplace programs, move
the references to the target program and trampoline into the
bpf_tracing_link structure when that is created. To do this atomically,
introduce a new mutex in prog->aux to protect writing to the two pointers
to target prog and trampoline, and rename the members to make it clear that
they are related.

With this change, it is no longer possible to attach the same tracing
program multiple times (detaching in-between), since the reference from the
tracing program to the target disappears on the first attach. However,
since the next patch will let the caller supply an attach target, that will
also make it possible to attach to the same place multiple times.
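
A rough sketch of the shape of the change (member names are approximate,
not the full patch):

	struct bpf_tracing_link {
		struct bpf_link link;
		enum bpf_attach_type attach_type;
		struct bpf_trampoline *trampoline;	/* ref moved here on attach */
		struct bpf_prog *tgt_prog;		/* ref moved here on attach */
	};

	/* and in struct bpf_prog_aux, guarded by the new mutex: */
	struct mutex dst_mutex;			/* protects the two pointers below */
	struct bpf_prog *dst_prog;		/* was aux->linked_prog */
	struct bpf_trampoline *dst_trampoline;	/* was aux->trampoline */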

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk
2020-09-29 13:09:23 -07:00
Alan Maguire eb411377ae bpf: Add bpf_seq_printf_btf helper
A helper is added to allow seq file writing of kernel data
structures using vmlinux BTF.  Its signature is

long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
                        u32 btf_ptr_size, u64 flags);

Flags and struct btf_ptr definitions/use are identical to the
bpf_snprintf_btf helper, and the helper returns 0 on success
or a negative error value.
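
A usage sketch in a task iterator (this follows the style of the bpf_iter
selftests and is not taken from this commit):

	SEC("iter/task")
	int dump_task_struct(struct bpf_iter__task *ctx)
	{
		struct seq_file *seq = ctx->meta->seq;
		struct task_struct *task = ctx->task;
		struct btf_ptr ptr = {};

		if (!task)
			return 0;

		ptr.ptr = task;
		ptr.type_id = bpf_core_type_id_kernel(struct task_struct);
		bpf_seq_printf_btf(seq, &ptr, sizeof(ptr), 0);
		return 0;
	}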

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com
2020-09-28 18:26:58 -07:00
Alan Maguire c4d0bfb450 bpf: Add bpf_snprintf_btf helper
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF).  Its signature is

long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
		      u32 btf_ptr_size, u64 flags);

struct btf_ptr * specifies

- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field for future use; these flags are not to be
  confused with the BTF_F_* flags below, which control how
  the btf_ptr is displayed. The flags member of struct
  btf_ptr may be used to disambiguate types in kernel versus
  module BTF, etc. The main distinction is that these flags
  relate to the type and the information needed to identify
  it, not to how it is displayed.

For example a BPF program with a struct sk_buff *skb
could do the following:

	static struct btf_ptr b = { };

	b.ptr = skb;
	b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);

Default output looks like this:

(struct sk_buff){
 .transport_header = (__u16)65535,
 .mac_header = (__u16)65535,
 .end = (sk_buff_data_t)192,
 .head = (unsigned char *)0x000000007524fd8b,
 .data = (unsigned char *)0x000000007524fd8b,
 .truesize = (unsigned int)768,
 .users = (refcount_t){
  .refs = (atomic_t){
   .counter = (int)1,
  },
 },
}

Flags modifying display are as follows:

- BTF_F_COMPACT:	no formatting around type information
- BTF_F_NONAME:		no struct/union member names/types
- BTF_F_PTR_RAW:	show raw (unobfuscated) pointer values;
			equivalent to %px.
- BTF_F_ZERO:		show zero-valued struct/union members;
			they are not displayed by default

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
2020-09-28 18:26:58 -07:00
Maciej Fijalkowski ebf7d1f508 bpf, x64: rework pro/epilogue and tailcall handling in JIT
This commit serves two things:
1) it optimizes BPF prologue/epilogue generation
2) it makes possible to have tailcalls within BPF subprogram

Both points are related to each other since without 1), 2) could not be
achieved.

In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call
counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

Implement that suggestion.

Since the stack layout is changed, tail call counter handling can no
longer rely on popping it to rbx, as was done for the constant-prologue
case where rbx was later overwritten with the actual value pushed to the
stack. Therefore, let's use one of the registers considered
volatile/caller-saved (%rcx) and pop the value of the tail call counter
into it in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue and in
emit_bpf_tail_call_indirect where instruction layout is not constant
anymore.

Introduce new poke target, 'tailcall_bypass' to poke descriptor that is
dedicated for skipping the register pops and stack unwind that are
generated right before the actual jump to target program.
For case when the target program is not present, BPF program will skip
the pop instructions and nop5 dedicated for jmpq $target. An example of
such state when only R6 of callee saved registers is used by program:

ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
ffffffffc0513aa6:       5b                      pop    %rbx
ffffffffc0513aa7:       58                      pop    %rax
ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

When target program is inserted, the jump that was there to skip
pops/nop5 will become the nop5, so CPU will go over pops and do the
actual tailcall.

One might ask why the pushes simply cannot come after the nop5.
In the following example snippet:

ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
(...)
ffffffffc0370332:       5b                      pop    %rbx
ffffffffc0370333:       58                      pop    %rax
ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
ffffffffc0370347:       50                      push   %rax
ffffffffc0370348:       53                      push   %rbx
ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548

There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall,
and the jump target is not present. ctx is in the %rbx register and the
BPF subprogram that we call into at ffffffffc037034c relies on it,
e.g. it will pick up ctx from there. Such a code layout is therefore
broken, as we would overwrite the contents of %rbx with the value that
was pushed in the prologue. That is the reason for the 'bypass' approach.

Special care needs to be taken during the install/update/remove of
tailcall target. In case when target program is not present, the CPU
must not execute the pop instructions that precede the tailcall.

To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions between
tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to a jump, then we wait for an RCU
   grace period so that other programs finish their execution, and
   after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way CPU can never be exposed to "unwind, tail" state.

Last but not least, when tailcalls get mixed with bpf2bpf calls, it
would be possible to encounter the endless loop due to clearing the
tailcall counter if for example we would use the tailcall3-like from BPF
selftests program that would be subprogram-based, meaning the tailcall
would be present within the BPF subprogram.

This test, broken down to particular steps, would do:
entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
func0 -> call subprog_tail
(we are NOT skipping the first 11 bytes of prologue and this subprogram
has a tailcall, therefore we clear the counter...)
subprog -> do the same thing as entry

and then loop forever.

To address this, the idea is to go through the call chain of bpf2bpf progs
and look for a tailcall presence throughout the whole chain. If we see a
single tail call, then each node in this call chain needs to be marked as a
subprog that can reach the tailcall. We would later feed the JIT with this info
and:
- set eax to 0 only when tailcall is reachable and this is the entry prog
- if tailcall is reachable but there's no tailcall in insns of currently
  JITed prog then push rax anyway, so that it will be possible to
  propagate further down the call chain
- finally if tailcall is reachable, then we need to precede the 'call'
  insn with mov rax, [rbp - (stack_depth + 8)]

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-09-17 19:55:30 -07:00
Maciej Fijalkowski cf71b174d3 bpf: rename poke descriptor's 'ip' member to 'tailcall_target'
Reflect the actual purpose of poke->ip and rename it to
poke->tailcall_target so that it will not be confused with another
poke target that will be introduced in the next commit.

While at it, do the same thing with poke->ip_stable - rename it to
poke->tailcall_target_stable.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-09-17 12:59:31 -07:00
YiFei Zhu 984fe94f94 bpf: Mutex protect used_maps array and count
To support modifying the used_maps array, we use a mutex to protect
the use of the counter and the array. The mutex is initialized right
after the prog aux is allocated, and destroyed right before prog
aux is freed. This way we guarantee it's initialized for both cBPF
and eBPF.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
2020-09-15 18:28:27 -07:00
Julien Thierry 00089c048e objtool: Rename frame.h -> objtool.h
Header frame.h is getting more code annotations to help objtool analyze
object files.

Rename the file to objtool.h.

[ jpoimboe: add objtool.h to MAINTAINERS ]

Signed-off-by: Julien Thierry <jthierry@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2020-09-10 10:43:13 -05:00
Randy Dunlap b8c1a30907 bpf: Delete repeated words in comments
Drop repeated words in kernel/bpf/: {has, the}

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200807033141.10437-1-rdunlap@infradead.org
2020-08-07 18:57:24 +02:00
YiFei Zhu 7d9c342789 bpf: Make cgroup storages shared between programs on the same cgroup
This change comes in several parts:

One, the restriction that the CGROUP_STORAGE map can only be used
by one program is removed. This results in the removal of the field
'aux' in struct bpf_cgroup_storage_map, and removal of relevant
code associated with the field, and removal of now-noop functions
bpf_free_cgroup_storage and bpf_cgroup_storage_release.

Second, we permit a key of type u64 as the key to the map.
Providing such a key type indicates that the map should ignore
attach type when comparing map keys. However, for simplicity newly
linked storage will still have the attach type at link time in
its key struct. cgroup_storage_check_btf is adapted to accept
u64 as the type of the key.

Third, because the storages are now shared, the storages cannot
be unconditionally freed on program detach. There could be two
ways to solve this issue:
* A. Reference count the usage of the storages, and free when the
     last program is detached.
* B. Free only when the storage is impossible to be referred to
     again, i.e. when either the cgroup_bpf it is attached to, or
     the map itself, is freed.
Option A has the side effect that, when the user detaches and
reattaches a program, whether the program gets a fresh storage
depends on whether there is another program attached using that
storage. This could trigger races if the user is multi-threaded,
and since nondeterminism in data races is evil, go with option B.

Both the map and the cgroup_bpf now track their associated
storages, and the storage unlink and free are removed from
cgroup_bpf_detach and added to cgroup_bpf_release and
cgroup_storage_map_free. The latter now also holds the cgroup_mutex
to prevent any races with the former.

Fourth, on attach, we reuse the old storage if the key already
exists in the map, via cgroup_storage_lookup. If the storage
does not exist yet, we create a new one, and publish it at the
last step in the attach process. This does not create a race
condition because for the whole attach the cgroup_mutex is held.
We keep track of an array of newly allocated storages and, if the
process fails, only the new storages get freed.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d5401c6106728a00890401190db40020a1f84ff1.1595565795.git.zhuyifei@google.com
2020-07-25 20:16:35 -07:00
Jakub Sitnicki ce3aa9cc51 bpf, netns: Handle multiple link attachments
Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
prog_array at given position when link gets attached/updated/released.

This lets us lift the limit of having just one link attached for the new
attach type introduced by a subsequent patch.

No functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-2-jakub@cloudflare.com
2020-07-17 20:18:16 -07:00
Linus Torvalds cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghong Song.

22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Christoph Hellwig 88dca4ca5a mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Chris Packham 0142dddcbe bpf: Fix spelling in comment explaining ARG1 in ___bpf_prog_run
Change 'handeled' to 'handled'.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200525230025.14470-1-chris.packham@alliedtelesis.co.nz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:20 -07:00
Alexei Starovoitov 2c78ee898d bpf: Implement CAP_BPF
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three are currently equivalent to perfmon_capable(), since leaking
kernel pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
and subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in
the verifier and run time mitigations in bpf array, and enables indirect
variable access in bpf programs. 'bypass_spec_v4' disables emission of
sanitation code by the verifier.

That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2020-05-15 17:29:41 +02:00
Eric Biggers 6b0b0fa2bc crypto: lib/sha1 - rename "sha" to "sha1"
The library implementation of the SHA-1 compression function is
confusingly called just "sha_transform()".  Alongside it are some "SHA_"
constants and "sha_init()".  Presumably these are left over from a time
when SHA just meant SHA-1.  But now there are also SHA-2 and SHA-3, and
moreover SHA-1 is now considered insecure and thus shouldn't be used.

Therefore, rename these functions and constants to make it very clear
that they are for SHA-1.  Also add a comment to make it clear that these
shouldn't be used.

For the extra-misleadingly named "SHA_MESSAGE_BYTES", rename it to
SHA1_BLOCK_SIZE and define it to just '64' rather than '(512/8)' so that
it matches the same definition in <crypto/sha.h>.  This prepares for
merging <linux/cryptohash.h> into <crypto/sha.h>.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:17 +10:00
Maciej Żenczykowski 71d1921477 bpf: add bpf_ktime_get_boot_ns()
On a device like a cellphone, which is constantly suspending
and resuming, CLOCK_MONOTONIC is not particularly useful for
keeping track of or reacting to external network events.
Instead you want to use CLOCK_BOOTTIME.

Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.
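
A minimal sketch of the intended use (names made up): a timeout check that
stays correct across suspend, since CLOCK_BOOTTIME keeps advancing while the
device is suspended whereas CLOCK_MONOTONIC does not.

	__u64 deadline_boot_ns;		/* set from userspace */

	SEC("cgroup_skb/ingress")
	int check_deadline(struct __sk_buff *skb)
	{
		if (bpf_ktime_get_boot_ns() > deadline_boot_ns)
			return 0;	/* lease expired: drop  */
		return 1;		/* still valid: allow */
	}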

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26 09:43:05 -07:00
Stanislav Fomichev 6890896bd7 bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n
linux-next build bot reported compile issue [1] with one of its
configs. It looks like when we have CONFIG_NET=n and
CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto
definition (from net/core/filter.c) in cgroup_base_func_proto.

I'm reshuffling the code a bit to make it work. The common helpers
are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is
exported from there.
Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing
bpf_user_rnd_u32.

[1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72

Fixes: 0456ea170c ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
2020-04-26 08:53:13 -07:00
Daniel Borkmann 0f09abd105 bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context 'current' is
always valid and pointing to the app that is performing mentioned
syscalls if it's subject to a v2 cgroup. Also with same motivation of
commit 7723628101 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
enable retrieval of ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.
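
A sketch of such a policy check (the map name and ancestor level are made up):

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 128);
		__type(key, __u64);	/* cgroup v2 id */
		__type(value, __u8);
	} blocked_cgroups SEC(".maps");

	SEC("cgroup/connect4")
	int block_connect(struct bpf_sock_addr *ctx)
	{
		__u64 id = bpf_get_current_ancestor_cgroup_id(2);

		if (bpf_map_lookup_elem(&blocked_cgroups, &id))
			return 0;	/* reject connect() */
		return 1;		/* allow */
	}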

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:39 -07:00
Jiri Olsa dba122fb5e bpf: Add bpf_ksym_add/del functions
Separating /proc/kallsyms add/del code and adding bpf_ksym_add/del
functions for that.

Moving bpf_prog_ksym_node_add/del functions to __bpf_ksym_add/del
and changing their argument to 'struct bpf_ksym' object. This way
we can call them for other bpf objects types like trampoline and
dispatcher.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-10-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-13 12:49:52 -07:00
Jiri Olsa cbd76f8d5a bpf: Add prog flag to struct bpf_ksym object
Adding 'prog' bool flag to 'struct bpf_ksym' to mark that
this object belongs to bpf_prog object.

This change allows having bpf_prog objects together with
other types (trampolines and dispatchers) in the single
bpf_tree. It's used when searching for bpf_prog exception
tables by the bpf_prog_ksym_find function, where we need
to get the bpf_prog pointer.

From now on we can safely add bpf_ksym support for trampoline
or dispatcher objects, because we can differentiate them
from bpf_prog objects.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-9-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-13 12:49:52 -07:00