Commit Graph

30 Commits

Author SHA1 Message Date
Jerome Marchand 4275e2f620 bpf: Add MEM_WRITE attribute
JIRA: https://issues.redhat.com/browse/RHEL-63880

commit 6fad274f06f038c29660aa53fbad14241c9fd976
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Mon Oct 21 17:28:05 2024 +0200

    bpf: Add MEM_WRITE attribute

    Add a MEM_WRITE attribute for BPF helper functions which can be used in
    bpf_func_proto to annotate an argument type in order to let the verifier
    know that the helper writes into the memory passed as an argument. In
    the past MEM_UNINIT has been (ab)used for this purpose, but the latter
    merely tells the verifier that the passed memory can be uninitialized.

    There have been bugs with overloading the latter but aside from that
    there are also cases where the passed memory is read + written which
    currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former
    ARG_PTR_TO_{LONG,INT} args in case of error").

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2025-01-21 11:27:08 +01:00
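
A rough sketch of the new annotation in a helper's bpf_func_proto (the helper name and argument layout here are illustrative, not taken from the patch):

    /* Illustrative only: a helper that fills the buffer in arg1 with up to
     * arg2 bytes. MEM_WRITE tells the verifier that the helper writes into
     * the memory; MEM_UNINIT additionally permits passing uninitialized
     * memory. */
    static const struct bpf_func_proto bpf_example_fill_proto = {
            .func           = bpf_example_fill,
            .ret_type       = RET_INTEGER,
            .arg1_type      = ARG_PTR_TO_MEM | MEM_UNINIT | MEM_WRITE,
            .arg2_type      = ARG_CONST_SIZE_OR_ZERO,
    };
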
Luis Claudio R. Goncalves 99a1d014e8 bpf: Use raw_spinlock_t in ringbuf
JIRA: https://issues.redhat.com/browse/RHEL-20608

commit 8b62645b09f870d70c7910e7550289d444239a46
Author: Wander Lairson Costa <wander.lairson@gmail.com>
Date:   Fri Sep 20 16:06:59 2024 -0300

    bpf: Use raw_spinlock_t in ringbuf

    The function __bpf_ringbuf_reserve is invoked from a tracepoint, which
    disables preemption. Using spinlock_t in this context can lead to a
    "sleep in atomic" warning in the RT variant. This issue is illustrated
    in the example below:

    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556208, name: test_progs
    preempt_count: 1, expected: 0
    RCU nest depth: 1, expected: 1
    INFO: lockdep is turned off.
    Preemption disabled at:
    [<ffffd33a5c88ea44>] migrate_enable+0xc0/0x39c
    CPU: 7 PID: 556208 Comm: test_progs Tainted: G
    Hardware name: Qualcomm SA8775P Ride (DT)
    Call trace:
     dump_backtrace+0xac/0x130
     show_stack+0x1c/0x30
     dump_stack_lvl+0xac/0xe8
     dump_stack+0x18/0x30
     __might_resched+0x3bc/0x4fc
     rt_spin_lock+0x8c/0x1a4
     __bpf_ringbuf_reserve+0xc4/0x254
     bpf_ringbuf_reserve_dynptr+0x5c/0xdc
     bpf_prog_ac3d15160d62622a_test_read_write+0x104/0x238
     trace_call_bpf+0x238/0x774
     perf_call_bpf_enter.isra.0+0x104/0x194
     perf_syscall_enter+0x2f8/0x510
     trace_sys_enter+0x39c/0x564
     syscall_trace_enter+0x220/0x3c0
     do_el0_svc+0x138/0x1dc
     el0_svc+0x54/0x130
     el0t_64_sync_handler+0x134/0x150
     el0t_64_sync+0x17c/0x180

    Switch the spinlock to raw_spinlock_t to avoid this error.

    Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
    Reported-by: Brian Grech <bgrech@redhat.com>
    Signed-off-by: Wander Lairson Costa <wander.lairson@gmail.com>
    Signed-off-by: Wander Lairson Costa <wander@redhat.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20240920190700.617253-1-wander@redhat.com

Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
2024-11-05 12:06:33 -03:00
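
The shape of the change, as a sketch (simplified, not the verbatim diff; the struct has more fields):

    /* kernel/bpf/ringbuf.c: the reservation lock becomes a raw spinlock,
     * which remains a true spinning lock on PREEMPT_RT instead of being
     * converted to a sleeping lock. */
    struct bpf_ringbuf {
            wait_queue_head_t waitq;
            struct irq_work work;
            raw_spinlock_t spinlock;        /* was: spinlock_t spinlock; */
            /* ... */
    };

    /* call sites change accordingly, e.g. in __bpf_ringbuf_reserve(): */
    raw_spin_lock_irqsave(&rb->spinlock, flags);   /* was: spin_lock_irqsave() */
    /* ... reserve the record ... */
    raw_spin_unlock_irqrestore(&rb->spinlock, flags);
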
CKI Backport Bot 51eaf868c4 bpf: Fix overrunning reservations in ringbuf
JIRA: https://issues.redhat.com/browse/RHEL-62881
CVE: CVE-2024-41009

commit cfa1a2329a691ffd991fcf7248a57d752e712881
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri Jun 21 16:08:27 2024 +0200

    bpf: Fix overrunning reservations in ringbuf

    The BPF ring buffer internally is implemented as a power-of-2 sized circular
    buffer, with two logical and ever-increasing counters: consumer_pos, the
    consumer counter, shows up to which logical position the consumer has
    consumed the data, and producer_pos, the producer counter, denotes the
    amount of data reserved by all producers.

    Each time a record is reserved, the producer that "owns" the record will
    successfully advance the producer counter. In user space, each time a record
    is read, the consumer advances the consumer counter once it has finished
    processing. Both counters are stored in separate pages so that from user
    space, the producer counter is read-only and the consumer counter is read-write.

    One aspect that simplifies and thus speeds up the implementation of both
    producers and consumers is how the data area is mapped twice contiguously
    back-to-back in virtual memory, avoiding any special measures
    for samples that have to wrap around at the end of the circular buffer data
    area, because the next page after the last data page would be the first data page
    again, and thus the sample will still appear completely contiguous in virtual
    memory.

    Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
    book-keeping the length and offset, and is inaccessible to the BPF program.
    Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
    for the BPF program to use. Bing-Jhong and Muhammad reported that it is however
    possible to make a second allocated memory chunk overlapping with the first
    chunk and as a result, the BPF program is now able to edit first chunk's
    header.

    For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
    of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
    bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
    [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, let's
    allocate a chunk B with size 0x3000. This will succeed because consumer_pos
    was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
    check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
    to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
    earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
    pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
    bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
    locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
    B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
    page and could cause a crash.

    Fix it by calculating the oldest pending_pos and checking whether the range
    from the oldest outstanding record to the newest would span beyond the ring
    buffer size. If that is the case, then reject the request. We've tested with
    the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
    before/after the fix, and while it seems a bit slower on some benchmarks, the
    slowdown is not significant enough to matter.

    Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
    Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
    Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
    Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
    Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-10-16 16:01:31 +00:00
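
The core of the fix, condensed into a sketch (variable names follow the description above; the actual patch also walks record headers to advance the pending position past already-committed records):

    /* __bpf_ringbuf_reserve() (sketch): track the oldest not-yet-committed
     * record and reject the reservation if the span from that record to the
     * new producer position would exceed the ring buffer size. */
    pend_pos = rb->pending_pos;
    if (new_prod_pos - cons_pos > rb->mask ||
        new_prod_pos - pend_pos > rb->mask) {
            spin_unlock_irqrestore(&rb->spinlock, flags);
            return NULL;
    }
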
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped hunks touching code we don't support where the patch did not apply cleanly; kept the rest for the sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
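
The pattern of the replacement (vm_flags_set()/vm_flags_clear() are the modifier helpers introduced earlier in the same series):

    /* before: direct modification of vma->vm_flags */
    vma->vm_flags |= VM_DONTEXPAND;
    vma->vm_flags &= ~VM_MAYWRITE;

    /* after: modifier calls that can track changes and assert locking */
    vm_flags_set(vma, VM_DONTEXPAND);
    vm_flags_clear(vma, VM_MAYWRITE);
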
Artem Savkov 39672f48df bpf: Fold smp_mb__before_atomic() into atomic_set_release()
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 06646da01458682023321bdc7553b8140e95d077
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Oct 18 15:28:32 2023 -0700

    bpf: Fold smp_mb__before_atomic() into atomic_set_release()
    
    The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set()
    immediately preceded by smp_mb__before_atomic() so as to order storing
    of ring-buffer consumer and producer positions prior to the atomic_set()
    call's clearing of the ->busy flag, as follows:
    
            smp_mb__before_atomic();
            atomic_set(&rb->busy, 0);
    
    This works given current architectures and implementations, and given
    that this only needs to order prior writes against a later write. However,
    it does so by accident, because smp_mb__before_atomic() is only guaranteed
    to work with read-modify-write atomic operations, and not at all with
    things like atomic_set() and atomic_read().
    
    Note especially that smp_mb__before_atomic() will not, repeat *not*,
    order the prior write to "a" before the subsequent non-read-modify-write
    atomic read from "b", even on strongly ordered systems such as x86:
    
            WRITE_ONCE(a, 1);
            smp_mb__before_atomic();
            r1 = atomic_read(&b);
    
    Therefore, replace the smp_mb__before_atomic() and atomic_set() with
    atomic_set_release() as follows:
    
            atomic_set_release(&rb->busy, 0);
    
    This is no slower (and sometimes is faster) than the original, and also
    provides a formal guarantee of ordering that the original lacks.
    
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:56 +01:00
Jerome Marchand 91b2eeaa04 bpf: Remove unnecessary ring buffer size check
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit cf6eeb8f9dace014f63a3b2e959d0922bf737233
Author: Hou Tao <houtao1@huawei.com>
Date:   Tue Jul 4 15:40:14 2023 +0800

    bpf: Remove unnecessary ring buffer size check

    The theoretical maximum size of the ring buffer is about 64GB, but the
    size of the ring buffer is specified by max_entries in bpf_attr, whose
    maximum value is (4GB - 1), so an overflow is not possible.

    So just remove the unnecessary size check in ringbuf_map_alloc() but
    keep the comments for possible extension in future.

    Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Closes: https://lore.kernel.org/bpf/9c636a63-1f3d-442d-9223-96c2dccb9469@moroto.mountain
    Link: https://lore.kernel.org/bpf/20230704074014.216616-1-houtao@huaweicloud.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-14 15:22:23 +01:00
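
For reference, the size comes from the 32-bit max_entries field of the UAPI map-creation attributes, which for ringbuf maps holds the data size in bytes, hence the (4GB - 1) cap:

    /* include/uapi/linux/bpf.h (excerpt; comments abridged) */
    union bpf_attr {
            struct { /* anonymous struct used by BPF_MAP_CREATE command */
                    __u32   map_type;
                    __u32   key_size;
                    __u32   value_size;
                    __u32   max_entries;    /* for ringbuf: size in bytes */
                    /* ... */
            };
            /* ... */
    };
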
Artem Savkov 128dd7c7f8 bpf: return long from bpf_map_ops funcs
Bugzilla: https://bugzilla.redhat.com/2221599

commit d7ba4cc900bf1eea2d8c807c6b1fc6bd61f41237
Author: JP Kobryn <inwardvessel@gmail.com>
Date:   Wed Mar 22 12:47:54 2023 -0700

    bpf: return long from bpf_map_ops funcs
    
    This patch changes the return types of bpf_map_ops functions to long, where
    previously int was returned. Using long allows for bpf programs to maintain
    the sign bit in the absence of sign extension during situations where
    inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative
    error is returned.
    
    The definitions of the helper funcs are generated from comments in the bpf
    uapi header at `include/uapi/linux/bpf.h`. The return type of these
    helpers was previously changed from int to long in commit bdb7b79b4c. For
    any case where one of the map helpers calls a bpf_map_ops func that is
    still returning 32-bit int, a compiler might not include sign extension
    instructions to properly convert the 32-bit negative value to a 64-bit
    negative value.
    
    For example:
    bpf assembly excerpt of an inlined helper calling a kernel function and
    checking for a specific error:
    
    ; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
      ...
      46:	call   0xffffffffe103291c	; htab_map_update_elem
    ; if (err && err != -EEXIST) {
      4b:	cmp    $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
    
    kernel function assembly excerpt of return value from
    `htab_map_update_elem` returning 32-bit int:
    
    movl $0xffffffef, %r9d
    ...
    movl %r9d, %eax
    
    ...results in the comparison:
    cmp $0xffffffffffffffef, $0x00000000ffffffef
    
    Fixes: bdb7b79b4c ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
    Tested-by: Eduard Zingerman <eddyz87@gmail.com>
    Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
    Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
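
The signature change itself, sketched for two of the affected ops:

    /* include/linux/bpf.h, struct bpf_map_ops (sketch of the change) */
    struct bpf_map_ops {
            /* ... */
            /* was: int (*map_update_elem)(...); */
            long (*map_update_elem)(struct bpf_map *map, void *key,
                                    void *value, u64 flags);
            /* was: int (*map_delete_elem)(...); */
            long (*map_delete_elem)(struct bpf_map *map, void *key);
            /* ... */
    };
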
Artem Savkov eb5e051f30 bpf: ringbuf memory usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit 2f7e4ab2caa9a692ffc843f801cea90632cdc782
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:04 2023 +0000

    bpf: ringbuf memory usage
    
    A new helper ringbuf_map_mem_usage() is introduced to calculate ringbuf
    memory usage.
    
    The result is as follows:
    - before
    15: ringbuf  name count_map  flags 0x0
            key 0B  value 0B  max_entries 65536  memlock 0B
    
    - after
    15: ringbuf  name count_map  flags 0x0
            key 0B  value 0B  max_entries 65536  memlock 78424B
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20230305124615.12358-8-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:12 +02:00
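
A minimal sketch of what the new callback accounts for (the actual implementation also counts the struct page pointer array; treat the exact fields as assumptions):

    static u64 ringbuf_map_mem_usage(const struct bpf_map *map)
    {
            struct bpf_ringbuf *rb;
            u64 usage = sizeof(struct bpf_ringbuf_map);

            rb = container_of(map, struct bpf_ringbuf_map, map)->rb;
            /* meta pages (consumer/producer counters) plus data pages */
            usage += (u64)rb->nr_pages << PAGE_SHIFT;
            return usage;
    }
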
Jerome Marchand 990c05047c bpf: Rename MEM_ALLOC to MEM_RINGBUF
Bugzilla: https://bugzilla.redhat.com/2177177

commit 894f2a8b1673a355a1a7507a4dfa6a3c836d07c1
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Tue Nov 15 00:45:27 2022 +0530

    bpf: Rename MEM_ALLOC to MEM_RINGBUF

    Currently, verifier uses MEM_ALLOC type tag to specially tag memory
    returned from bpf_ringbuf_reserve helper. However, this is currently
    only used for this purpose and there is an implicit assumption that it
    only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM
    in check_func_arg_reg_off).

    Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special
    relationship and instead open the use of MEM_ALLOC for more generic
    allocations made for user types.

    Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it.

    Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:04 +02:00
Artem Savkov d4045a2578 bpf: Add bpf_user_ringbuf_drain() helper
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: fixing previously incorrect order of cases in verifier.c

commit 20571567384428dfc9fe5cf9f2e942e1df13c2dd
Author: David Vernet <void@manifault.com>
Date:   Mon Sep 19 19:00:58 2022 -0500

    bpf: Add bpf_user_ringbuf_drain() helper

    In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which
    will allow user-space applications to publish messages to a ring buffer
    that is consumed by a BPF program in kernel-space. In order for this
    map-type to be useful, it will require a BPF helper function that BPF
    programs can invoke to drain samples from the ring buffer, and invoke
    callbacks on those samples. This change adds that capability via a new BPF
    helper function:

    bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx,
                           u64 flags)

    BPF programs may invoke this function to run callback_fn() on a series of
    samples in the ring buffer. callback_fn() has the following signature:

    long callback_fn(struct bpf_dynptr *dynptr, void *context);

    Samples are provided to the callback in the form of struct bpf_dynptr *'s,
    which the program can read using BPF helper functions for querying
    struct bpf_dynptr's.

    In order to support bpf_user_ringbuf_drain(), a new PTR_TO_DYNPTR register
    type is added to the verifier to reflect a dynptr that was allocated by
    a helper function and passed to a BPF program. Unlike PTR_TO_STACK
    dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR
    dynptrs need not use reference tracking, as the BPF helper is trusted to
    properly free the dynptr before returning. The verifier currently only
    supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL.

    Note that while the corresponding user-space libbpf logic will be added
    in a subsequent patch, this patch does contain an implementation of the
    .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This
    .map_poll() callback guarantees that an epoll-waiting user-space
    producer will receive at least one event notification whenever at least
    one sample is drained in an invocation of bpf_user_ringbuf_drain(),
    provided that the function is not invoked with the BPF_RB_NO_WAKEUP
    flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup
    notification is sent even if no sample was drained.

    Signed-off-by: David Vernet <void@manifault.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:27 +01:00
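
A rough BPF-side usage sketch (BPF C; the map name, section, and callback body are illustrative):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_USER_RINGBUF);
            __uint(max_entries, 4096);
    } user_rb SEC(".maps");

    static long handle_sample(struct bpf_dynptr *dynptr, void *ctx)
    {
            /* read the sample via bpf_dynptr_read()/bpf_dynptr_data() */
            return 0;       /* 0 = keep draining, 1 = stop */
    }

    SEC("tracepoint/syscalls/sys_enter_getpgid")
    int drain_user_samples(void *ctx)
    {
            bpf_user_ringbuf_drain(&user_rb, handle_sample, NULL, 0);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
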
Artem Savkov eded417814 bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map type
Bugzilla: https://bugzilla.redhat.com/2166911

commit 583c1f420173f7d84413a1a1fbf5109d798b4faa
Author: David Vernet <void@manifault.com>
Date:   Mon Sep 19 19:00:57 2022 -0500

    bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map type
    
    We want to support a ringbuf map type where samples are published from
    user-space, to be consumed by BPF programs. BPF currently supports a
    kernel -> user-space circular ring buffer via the BPF_MAP_TYPE_RINGBUF
    map type.  We'll need to define a new map type for user-space -> kernel,
    as none of the helpers exported for BPF_MAP_TYPE_RINGBUF will apply
    to a user-space producer ring buffer, and we'll want to add one or
    more helper functions that would not apply for a kernel-producer
    ring buffer.
    
    This patch therefore adds a new BPF_MAP_TYPE_USER_RINGBUF map type
    definition. The map type is useless in its current form, as there is no
    way to access or use it for anything until we add one or more BPF helpers. A
    follow-on patch will therefore add a new helper function that allows BPF
    programs to run callbacks on samples that are published to the ring
    buffer.
    
    Signed-off-by: David Vernet <void@manifault.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220920000100.477320-2-void@manifault.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:14 +01:00
Artem Savkov 954e5bcd83 bpf: Use bpf_map_area_alloc consistently on bpf map creation
Bugzilla: https://bugzilla.redhat.com/2166911

commit 73cf09a36bf7bfb3e5a3ff23755c36d49137c44d
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Aug 10 15:18:29 2022 +0000

    bpf: Use bpf_map_area_alloc consistently on bpf map creation
    
    Let's use the generic helper bpf_map_area_alloc() instead of the
    open-coded kzalloc helpers in the bpf map creation path.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20220810151840.16394-5-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
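
For the ringbuf map this is essentially a one-line swap, sketched:

    /* kernel/bpf/ringbuf.c, ringbuf_map_alloc() (sketch) */
    - rb_map = kzalloc(sizeof(*rb_map), GFP_USER | __GFP_ACCOUNT | __GFP_NOWARN);
    + rb_map = bpf_map_area_alloc(sizeof(*rb_map), NUMA_NO_NODE);
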
Artem Savkov c87afb33ed bpf: Make __GFP_NOWARN consistent in bpf map creation
Bugzilla: https://bugzilla.redhat.com/2166911

commit 992c9e13f5939437037627c67bcb51e674b64265
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Aug 10 15:18:28 2022 +0000

    bpf: Make __GFP_NOWARN consistent in bpf map creation
    
    Some of the bpf maps are created with __GFP_NOWARN, i.e. arraymap,
    bloom_filter, bpf_local_storage, bpf_struct_ops, lpm_trie,
    queue_stack_maps, reuseport_array, stackmap and xskmap, while others are
    created without __GFP_NOWARN, i.e. cpumap, devmap, hashtab,
    local_storage, offload, ringbuf and sock_map. But there are no key
    differences between how these maps are created. So let's make this
    allocation flag consistent across all bpf map creation. Then we can use a
    generic helper to allocate all bpf maps.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20220810151840.16394-4-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
Artem Savkov 1be541f058 bpf: Use bpf_map_area_free instead of kvfree
Bugzilla: https://bugzilla.redhat.com/2166911

commit 8f58ee54c2eae790f50c51dfa64a153601451f08
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Aug 10 15:18:27 2022 +0000

    bpf: Use bpf_map_area_free instead of kvfree
    
    bpf_map_area_alloc() should be paired with bpf_map_area_free().
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20220810151840.16394-3-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
Yauheni Kaliuta b08a7de6e2 bpf: Dynptr support for ring buffers
Bugzilla: https://bugzilla.redhat.com/2120968

commit bc34dee65a65e9c920c420005b8a43f2a721a458
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon May 23 14:07:09 2022 -0700

    bpf: Dynptr support for ring buffers
    
    Currently, our only way of writing dynamically-sized data into a ring
    buffer is through bpf_ringbuf_output but this incurs an extra memcpy
    cost. bpf_ringbuf_reserve + bpf_ringbuf_commit avoids this extra
    memcpy, but it can only safely support reservation sizes that are
    statically known since the verifier cannot guarantee that the bpf
    program won’t access memory outside the reserved space.
    
    The bpf_dynptr abstraction allows for dynamically-sized ring buffer
    reservations without the extra memcpy.
    
    There are 3 new APIs:
    
    long bpf_ringbuf_reserve_dynptr(void *ringbuf, u32 size, u64 flags, struct bpf_dynptr *ptr);
    void bpf_ringbuf_submit_dynptr(struct bpf_dynptr *ptr, u64 flags);
    void bpf_ringbuf_discard_dynptr(struct bpf_dynptr *ptr, u64 flags);
    
    These closely follow the functionalities of the original ringbuf APIs.
    For example, all ringbuffer dynptrs that have been reserved must be
    either submitted or discarded before the program exits.
    
    Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: David Vernet <void@manifault.com>
    Link: https://lore.kernel.org/bpf/20220523210712.3641569-4-joannelkoong@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:06 +02:00
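
A rough usage sketch (BPF C; names and the runtime-computed size are illustrative). Note that the dynptr must be submitted or discarded on every path, including when the reserve fails:

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 4096);
    } rb SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_write")
    int produce(void *ctx)
    {
            struct bpf_dynptr ptr;
            __u32 size = get_sample_size();  /* hypothetical, known at runtime */

            if (!bpf_ringbuf_reserve_dynptr(&rb, size, 0, &ptr)) {
                    /* fill via bpf_dynptr_write() or bpf_dynptr_data() */
                    bpf_ringbuf_submit_dynptr(&ptr, 0);
            } else {
                    bpf_ringbuf_discard_dynptr(&ptr, 0);
            }
            return 0;
    }
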
Yauheni Kaliuta 11fec2f10e bpf: Compute map_btf_id during build time
Bugzilla: https://bugzilla.redhat.com/2120968

commit c317ab71facc2cd0a94145973318a4c914e11acc
Author: Menglong Dong <imagedong@tencent.com>
Date:   Mon Apr 25 21:32:47 2022 +0800

    bpf: Compute map_btf_id during build time
    
    For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map
    types is computed during vmlinux-btf init:
    
      btf_parse_vmlinux() -> btf_vmlinux_map_ids_init()
    
    It will look up the btf_type according to the 'map_btf_name' field in
    'struct bpf_map_ops'. This process can be done during build time,
    thanks to Jiri's resolve_btfids.
    
    selftest of map_ptr has passed:
    
      $96 map_ptr:OK
      Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
    
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:00 +02:00
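
The resulting build-time pattern, sketched for the ringbuf map:

    /* kernel/bpf/ringbuf.c (sketch): resolve_btfids fills the ID in at
     * build time, replacing the runtime 'map_btf_name' lookup. */
    BTF_ID_LIST_SINGLE(ringbuf_map_btf_ids, struct, bpf_ringbuf_map)

    const struct bpf_map_ops ringbuf_map_ops = {
            /* ... */
            .map_btf_id = &ringbuf_map_btf_ids[0],
    };
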
Yauheni Kaliuta 6e36eccfc4 bpf: Tag argument to be released in bpf_func_proto
Bugzilla: https://bugzilla.redhat.com/2120968

commit 8f14852e89113d738c99c375b4c8b8b7e1073df1
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Mon Apr 25 03:18:50 2022 +0530

    bpf: Tag argument to be released in bpf_func_proto
    
    Add a new type flag for bpf_arg_type that when set tells verifier that
    for a release function, that argument's register will be the one for
    which meta.ref_obj_id will be set, and which will then be released
    using release_reference. To capture the regno, introduce a new field
    release_regno in bpf_call_arg_meta.
    
    This would be required in the next patch, where we may either pass NULL
    or a refcounted pointer as an argument to the release function
    bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not
    enough, as there is a case where the type of argument needed matches,
    but the ref_obj_id is set to 0. Hence, we must enforce that whenever
    meta.ref_obj_id is zero, the register that is to be released can only
    be NULL for a release function.
    
    Since we now indicate whether an argument is to be released in
    bpf_func_proto itself, the is_release_function helper has lost its utility,
    hence refactor code to work without it, and just rely on
    meta.release_regno to know when to release state for a ref_obj_id.
    Still, the restriction of one release argument and only one ref_obj_id
    passed to BPF helper or kfunc remains. This may be lifted in the future.
    
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220424214901.2743946-3-memxor@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:52:11 +02:00
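
A sketch of the tagging for one of the ringbuf release helpers (proto simplified):

    static const struct bpf_func_proto bpf_ringbuf_submit_proto = {
            .func           = bpf_ringbuf_submit,
            .ret_type       = RET_VOID,
            /* OBJ_RELEASE: this argument's ref_obj_id is released here */
            .arg1_type      = ARG_PTR_TO_ALLOC_MEM | OBJ_RELEASE,
            .arg2_type      = ARG_ANYTHING,
    };
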
Artem Savkov e056cdd2ba bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit b293dcc473d22a62dc6d78de2b15e4f49515db56
Author: Hou Tao <hotforest@gmail.com>
Date:   Wed Feb 2 14:01:58 2022 +0800

    bpf: Use VM_MAP instead of VM_ALLOC for ringbuf

    After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
    after mapping"), non-VM_ALLOC mappings will be marked as accessible
    in __get_vm_area_node() when KASAN is enabled. But now the flag for the
    ringbuf area is VM_ALLOC, so KASAN will complain about out-of-bounds access
    after vmap() returns. Because the ringbuf area is created by mapping
    allocated pages, use VM_MAP instead.

    After the change, info in /proc/vmallocinfo also changes from
      [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
    to
      [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmap user

    Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
    Reported-by: syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:55 +02:00
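
The change itself, sketched:

    /* kernel/bpf/ringbuf.c, bpf_ringbuf_area_alloc() (sketch) */
    - rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
    -           VM_ALLOC | VM_USERMAP, PAGE_KERNEL);
    + rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
    +           VM_MAP | VM_USERMAP, PAGE_KERNEL);
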
Artem Savkov 7f76bfc54f bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 216e3cd2f28dbbf1fe86848e0e29e6693b9f0a20
Author: Hao Luo <haoluo@google.com>
Date:   Thu Dec 16 16:31:51 2021 -0800

    bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.

    Some helper functions may modify their arguments, for example,
    bpf_d_path, bpf_get_stack etc. Previously, their argument types
    were marked as ARG_PTR_TO_MEM, which is compatible with read-only
    mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
    but technically incorrect, to modify read-only memory by passing
    it into one of these helper functions.

    This patch tags the bpf_args compatible with immutable memory with the
    MEM_RDONLY flag. The arguments that don't have this flag will only be
    compatible with mutable memory types, preventing the helper
    from modifying a read-only memory. The bpf_args that have
    MEM_RDONLY are compatible with both mutable memory and immutable
    memory.

    Signed-off-by: Hao Luo <haoluo@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:50 +02:00
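
The tagging pattern, sketched (argument layouts illustrative):

    /* a helper that only reads its memory argument advertises compatibility
     * with read-only memory explicitly: */
    .arg1_type = ARG_PTR_TO_MEM | MEM_RDONLY,

    /* a helper that writes into the argument (e.g. an output buffer) keeps
     * the plain type and therefore only accepts mutable memory: */
    .arg1_type = ARG_PTR_TO_MEM,
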
Rustam Kovhaev ccff81e1d0 bpf: Fix false positive kmemleak report in bpf_ringbuf_area_alloc()
kmemleak scans struct page, but it does not scan the page content. If we
allocate some memory with kmalloc(), then allocate a page with alloc_page(),
and put the kmalloc pointer somewhere inside that page, kmemleak will
falsely report the kmalloc'ed object as a leak.

We can instruct kmemleak to scan the memory area by calling kmemleak_alloc()
and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space,
and if struct bpf_ringbuf changes we would have to revisit and review the size
argument in kmemleak_alloc(), because we do not want kmemleak to scan the
user space memory. Let's simplify things and use kmemleak_not_leak() here.

For posterity, also adding additional prior analysis from Andrii:

  I think either kmemleak or syzbot are misreporting this. I've added a
  bunch of printks around all allocations performed by BPF ringbuf. [...]
  On repro side I get these two warnings:

  [vmuser@archvm bpf]$ sudo ./repro
  BUG: memory leak
  unreferenced object 0xffff88810d538c00 (size 64):
    comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
    hex dump (first 32 bytes):
      00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff  ................
      80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff  .........)......
    backtrace:
      [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
      [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
      [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
      [<00000000f601d565>] do_syscall_64+0x2d/0x40
      [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

  BUG: memory leak
  unreferenced object 0xffff88810d538c80 (size 64):
    comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
    hex dump (first 32 bytes):
      80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff  ................
      c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff  .........D(.....
    backtrace:
      [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
      [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
      [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
      [<00000000f601d565>] do_syscall_64+0x2d/0x40
      [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

  Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00)
  correspond to pages array bpf_ringbuf is allocating and tracking properly
  internally. Note also that syzbot repro doesn't close FD of created BPF
  ringbufs, and even when ./repro itself exits with error, there are still
  two forked processes hanging around in my system. So clearly ringbuf maps
  are alive at that point. So reporting any memory leak looks weird at that
  point, because that memory is being used by active referenced BPF ringbuf.

  It's also a question why repro doesn't clean up its forks. But if I do a
  `pkill repro`, I do see that all the allocated memory is /properly/ cleaned
  up [and the] "leaks" are deallocated properly.

  BTW, if I add close() right after bpf() syscall in syzbot repro, I see that
  everything is immediately deallocated, like designed. And no memory leak
  is reported. So I don't think the problem is anywhere in bpf_ringbuf code,
  rather in the leak detection and/or repro itself.

Reported-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
[ Daniel: also included analysis from Andrii to the commit log ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com
Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10
Link: https://lore.kernel.org/lkml/20210615101515.GC26027@arm.com
Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
Link: https://lore.kernel.org/bpf/20210626181156.1873604-1-rkovhaev@gmail.com
2021-06-28 15:57:46 +02:00
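
The fix itself is a single annotation at the allocation site (sketch; surrounding code abridged):

    /* bpf_ringbuf_area_alloc() (sketch): the pages array pointer is stored
     * inside the vmap'ed ringbuf area, which kmemleak does not scan, so
     * tell kmemleak the kvcalloc'ed array is not a leak. */
    rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
              VM_ALLOC | VM_USERMAP, PAGE_KERNEL);
    if (rb) {
            kmemleak_not_leak(pages);
            rb->pages = pages;
    }
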
Andrii Nakryiko 04ea3086c4 bpf: Prevent writable memory-mapping of read-only ringbuf pages
Only the very first page of the BPF ringbuf, which contains the consumer
position counter, is supposed to be mapped as writable by user space. The
producer position is read-only and can be modified only by kernel code. BPF
ringbuf data pages are read-only as well and are not meant to be modified by
user code, to maintain the integrity of per-record headers.

This patch allows mapping only the consumer position page as writable;
everything else is restricted to read-only. remap_vmalloc_range()
internally adds VM_DONTEXPAND, so all the established memory mappings can't be
extended, which prevents any future violations through mremap()'ing.

Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Ryota Shiga (Flatt Security)
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-11 13:31:10 +02:00
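
The enforcement, sketched (close to the shape of the actual fix):

    static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
    {
            struct bpf_ringbuf_map *rb_map;

            rb_map = container_of(map, struct bpf_ringbuf_map, map);

            if (vma->vm_flags & VM_WRITE) {
                    /* allow writable mapping for the consumer_pos page only */
                    if (vma->vm_pgoff != 0 ||
                        vma->vm_end - vma->vm_start != PAGE_SIZE)
                            return -EPERM;
            } else {
                    vma->vm_flags &= ~VM_MAYWRITE;
            }
            /* remap_vmalloc_range() checks size and offset constraints */
            return remap_vmalloc_range(vma, rb_map->rb,
                                       vma->vm_pgoff + RINGBUF_PGOFF);
    }
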
Thadeu Lima de Souza Cascardo 4b81ccebae bpf, ringbuf: Deny reserve of buffers larger than ringbuf
A BPF program might try to reserve a buffer larger than the ringbuf size.
If the consumer pointer is way ahead of the producer, such a reservation
would succeed, allowing the BPF program to read or write outside the
allocated ringbuf area.

Reported-by: Ryota Shiga (Flatt Security)
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-11 13:30:45 +02:00
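
The added check, sketched:

    /* __bpf_ringbuf_reserve() (sketch): a record, including its header and
     * rounded up to 8 bytes, can never be larger than the buffer itself. */
    len = round_up(size + BPF_RINGBUF_HDR_SZ, 8);
    if (len > rb->mask + 1)
            return NULL;
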
Roman Gushchin abbdd0813f bpf: Eliminate rlimit-based memory accounting for bpf ringbuffer
Do not use rlimit-based memory accounting for bpf ringbuffer.
It has been replaced with the memcg-based memory accounting.

bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM)
and a valid pointer, so to simplify the code make it return NULL
in the first case. This allows dropping a couple of lines in
ringbuf_map_alloc() and also makes it look similar to other
memory-allocating functions like kmalloc().

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-28-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin be4035c734 bpf: Memcg-based memory accounting for bpf ringbuffer
Enable the memcg-based memory accounting for the memory used by
the bpf ringbuffer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-15-guro@fb.com
2020-12-02 18:32:45 -08:00
Martin KaFai Lau f4d0525921 bpf: Add map_meta_equal map ops
Some properties of the inner map are used at verification time.
When an inner map is inserted into an outer map at runtime,
bpf_map_meta_equal() is currently used to ensure those properties
of the inserted inner map stay the same as at verification
time.

In particular, the current bpf_map_meta_equal() checks max_entries which
turns out to be too restrictive for most of the maps which do not use
max_entries during the verification time.  It limits the use case that
wants to replace a smaller inner map with a larger inner map.  There are
some maps that do use max_entries during verification, though.  For example,
the map_gen_lookup in array_map_ops uses the max_entries to generate
the inline lookup code.

To accommodate differences between maps, the map_meta_equal is added
to bpf_map_ops.  Each map-type can decide what to check when its
map is used as an inner map during runtime.

Also, some map types cannot be used as an inner map and they are
currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
It is not unusual that new map types may not be aware that such a
blacklist exists.  This patch enforces an explicit opt-in
and only allows a map to be used as an inner map if it has
implemented the map_meta_equal ops.  It is based on the
discussion in [1].

All maps that support inner maps have their map_meta_equal pointing
to bpf_map_meta_equal in this patch.  A later patch will
relax the max_entries check for most maps.  bpf_types.h
counts 28 map types.  This patch adds 23 ".map_meta_equal"
by using coccinelle.  -5 for
	BPF_MAP_TYPE_PROG_ARRAY
	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
	BPF_MAP_TYPE_STRUCT_OPS
	BPF_MAP_TYPE_ARRAY_OF_MAPS
	BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
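
The opt-in, sketched for the ringbuf map (this is the pattern the coccinelle script applied across map types):

    const struct bpf_map_ops ringbuf_map_ops = {
            .map_meta_equal = bpf_map_meta_equal,
            /* ... */
    };
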
David S. Miller 71930d6102 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
All conflicts seemed rather trivial, with some guidance from
Saeed Mahameed on the tc_ct.c one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11 00:46:00 -07:00
Alexei Starovoitov bba1dc0b55 bpf: Remove redundant synchronize_rcu.
bpf_free_used_maps() or close(map_fd) will trigger map_free callback.
bpf_free_used_maps() is called after bpf prog is no longer executing:
bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps.
Hence there is no need to call synchronize_rcu() to protect map elements.

Note that hash_of_maps and array_of_maps update/delete inner maps via
sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com
2020-07-01 08:07:13 -07:00
Andrii Nakryiko 517bbe1994 bpf: Enforce BPF ringbuf size to be the power of 2
BPF ringbuf assumes the size to be a multiple of page size and a power-of-2
value. The latter is important to avoid division while calculating position
inside the ring buffer, using an (N-1) mask instead. This patch fixes the
omission to enforce the power-of-2 size rule.

Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com
2020-06-30 16:31:55 +02:00
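
The enforced rule, sketched:

    /* ringbuf_map_alloc() (sketch): the size must be a page-aligned
     * power-of-2 number of bytes, so position math can use an (N-1)
     * mask instead of division. */
    if (!is_power_of_2(attr->max_entries) ||
        !PAGE_ALIGNED(attr->max_entries))
            return ERR_PTR(-EINVAL);
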
Andrey Ignatov 2872e9ac33 bpf: Set map_btf_{name, id} for all map types
Set map_btf_name and map_btf_id for all map types so that map fields can
be accessed by bpf programs.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
2020-06-22 22:22:58 +02:00
Andrii Nakryiko 457f44363a bpf: Implement BPF ring buffer and verifier support for it
This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
which allows multiple CPUs to submit data to a single shared ring buffer. On
the consumption side, only single consumer is assumed.

Motivation
----------
There are two distinctive motivators for this work that are not satisfied by
the existing perf buffer, and which prompted the creation of a new ring buffer
implementation.
  - more efficient memory utilization by sharing ring buffer across CPUs;
  - preserving ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer fails to satisfy both.
Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
also solved by having an MPSC implementation of ring buffer. The ordering
problem could technically be solved for perf buffer with some in-kernel
counting, but given the first one requires an MPSC buffer, the same solution
would solve the second problem automatically.

Semantics and APIs
------------------
A single ring buffer is presented to BPF programs as an instance of a BPF map
of type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but
ultimately rejected.

One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, have
BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but not enforce the
"same CPU only" rule. This would be a more familiar interface, compatible with
existing perf buffer use in BPF, but would fail if the application needed more
advanced logic to look up a ring buffer by an arbitrary key. HASH_OF_MAPS
addresses this with the current approach. Additionally, given the performance
of BPF ringbuf, many use cases would just opt into a simple single ring buffer
shared among all CPUs, for which the current approach would be overkill.

Another approach could introduce a new concept, alongside BPF map, to
represent generic "container" object, which doesn't necessarily have key/value
interface with lookup/update/delete operations. This approach would add a lot
of extra infrastructure that has to be built for observability and verifier
support. It would also add another concept that BPF developers would have to
familiarize themselves with, new syntax in libbpf, etc. But it would really
provide no additional benefits over the approach of using a map.
BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but
neither do a few other map types (e.g., queue and stack; array doesn't support
delete, etc.).

The approach chosen has an advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc), being
familiar concept (no need to teach users a new type of object in BPF program),
and utilizing existing tooling (bpftool). For common scenario of using
a single ring buffer for all CPUs, it's as simple and straightforward, as
would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
implement a wide variety of topologies, from one ring buffer for each CPU
(e.g., as a replacement for perf buffer use cases), to a complicated
application hashing/sharding of ring buffers (e.g., having a small pool of
ring buffers with hashed task's tgid being a look up key to preserve order,
but reduce contention).

Key and value sizes are enforced to be zero. max_entries is used to specify
the size of the ring buffer and has to be a power-of-2 value.

There are a bunch of similarities between perf buffer
(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
  - variable-length records;
  - if there is no more space left in ring buffer, reservation fails, no
    blocking;
  - memory-mappable data area for user-space applications for ease of
    consumption and high performance;
  - epoll notifications for new incoming data;
  - but still the ability to do busy polling for new data to achieve the
    lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:
  - bpf_ringbuf_output() allows to *copy* data from one place to a ring
    buffer, similarly to bpf_perf_event_output();
  - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
    split the whole process into two steps. First, a fixed amount of space is
    reserved. If successful, a pointer to a data inside ring buffer data area
    is returned, which BPF programs can use similarly to a data inside
    array/hash maps. Once ready, this piece of memory is either committed or
    discarded. Discard is similar to commit, but makes consumer ignore the
    record.

bpf_ringbuf_output() has the disadvantage of incurring an extra memory copy,
because the record has to be prepared in some other place first. But it allows
submitting records of a length that's not known to the verifier beforehand. It
also closely matches bpf_perf_event_output(), so it will simplify migration
significantly.

bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory
pointer directly to ring buffer memory. In a lot of cases records are larger
than BPF stack space allows, so many programs have to use an extra per-CPU
array as a temporary heap for preparing a sample. bpf_ringbuf_reserve() avoids
this need completely. But in exchange, it only allows a known constant size of
memory to be reserved, such that the verifier can verify that the BPF program
can't access memory outside its reserved record space. bpf_ringbuf_output(),
while slightly slower due to the extra memory copy, covers some use cases that
are not suitable for bpf_ringbuf_reserve().
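
A rough sketch of the reserve/submit flow from a BPF program (BPF C; the event layout and section name are illustrative):

    struct event {
            int pid;
            char comm[16];
    };

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024);  /* power-of-2 size in bytes */
    } rb SEC(".maps");

    SEC("tracepoint/sched/sched_process_exec")
    int log_exec(void *ctx)
    {
            struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;  /* buffer full: reservation fails, no blocking */
            e->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(e->comm, sizeof(e->comm));
            bpf_ringbuf_submit(e, 0);  /* or bpf_ringbuf_discard(e, 0) */
            return 0;
    }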

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary malloc()/free()
within single BPF program invocation.

Each reserved record is tracked by verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record, but forget to submit (or discard) it.

bpf_ringbuf_query() helper allows to query various properties of ring buffer.
Currently 4 are supported:
  - BPF_RB_AVAIL_DATA returns amount of unconsumed data in ring buffer;
  - BPF_RB_RING_SIZE returns the size of ring buffer;
  - BPF_RB_CONS_POS/BPF_RB_PROD_POS returns the current logical position of
    consumer/producer, respectively.
Returned values are momentary snapshots of ring buffer state and could be
off by the time the helper returns, so this should be used only for
debugging/reporting reasons or for implementing various heuristics that take
into account the highly changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in ring buffer. Together with
BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for output/commit/discard helpers,
it allows the BPF program a high degree of control and, e.g., more efficient
batched notifications. The default self-balancing strategy, though, should be
adequate for most applications and will work reliably and efficiently already.

Design and implementation
-------------------------
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock,
in which case reservation will fail even if ring buffer is not full.

The ring buffer itself internally is implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, that's not a problem):
  - consumer counter shows up to which logical position consumer consumed the
    data;
  - producer counter denotes amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, the data is still
not yet ready to be consumed, though. Each record has an 8-byte header, which
contains the length of the reserved record, as well as two extra bits: a busy
bit to denote that the record is still being worked on, and a discard bit,
which might be set at commit time if the record is discarded. In the latter
case, the consumer is supposed
to skip the record and move on to the next one. Record header also encodes
record's relative offset from the beginning of ring buffer data area (in
pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only
the pointer to the record itself, without requiring also the pointer to ring
buffer itself. Ring buffer memory location will be restored from record
metadata header. This significantly simplifies verifier, as well as improving
API usability.
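
The record header described here, for reference (the same struct is quoted in the reservation-overrun fix earlier in this log):

    struct bpf_ringbuf_hdr {
            u32 len;        /* record length; top bits carry the busy and
                             * discard flags */
            u32 pg_off;     /* offset back to the ringbuf header, in pages,
                             * so commit/discard can find the bpf_ringbuf
                             * from the record pointer alone */
    };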

Producer counter increments are serialized under spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to consumer
in the order of reservations, but only after all previous records were
already committed. It is thus possible for slow producers to temporarily hold
off submitted records that were reserved later.

Reservation/commit/consumer protocol is verified by litmus tests in
Documentation/litmus-test/bpf-rb.

One interesting implementation bit that significantly simplifies (and thus
speeds up as well) the implementation of both producers and consumers is how
the data area is mapped twice contiguously back-to-back in virtual memory.
This avoids taking any special measures for samples that have to wrap around
at the end of the circular buffer data area, because the next page after the
last data page would be the first data page again, and thus the sample will still
appear completely contiguous in virtual memory. See comment and a simple ASCII
diagram showing this visually in bpf_ringbuf_area_alloc().

Another feature that distinguishes BPF ringbuf from perf ring buffer is
its self-pacing notification of new data availability.
bpf_ringbuf_commit() implementation will send a notification of new record
being available after commit only if consumer has already caught up right up
to the record being committed. If not, consumer still has to catch up and thus
will see new data anyways without needing an extra poll notification.
Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that
this allows achieving very high throughput without having to resort to
tricks like "notify only every Nth sample", which are necessary with perf
buffer. For extreme cases, when BPF program wants more manual control of
notifications, commit/discard/output helpers accept BPF_RB_NO_WAKEUP and
BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data
availability, but require extra caution and diligence in using this API.

Comparison to alternatives
--------------------------
Before considering implementing BPF ring buffer from scratch, existing
alternatives in the kernel were evaluated, but they didn't seem to meet the
needs. They largely fell into a few categories:
  - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
    outlined above (ordering and memory consumption);
  - linked list-based implementations; while some were multi-producer designs,
    consuming these from user-space would be very complicated and most
    probably not performant; memory-mapping contiguous piece of memory is
    simpler and more performant for user-space consumers;
  - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
    SPSC queue into MPSC w/ lock would have subpar performance compared to
    locked reserve + lockless commit, as with BPF ring buffer. Fixed-sized
    elements would be too limiting for BPF programs, given that existing BPF
    programs heavily rely on the variable-sized perf buffer already;
  - specialized implementations (like a new printk ring buffer, [0]) with lots
    of printk-specific limitations and implications, that didn't seem to fit
    well for intended use with BPF programs.

  [0] https://lwn.net/Articles/779550/

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:22 -07:00