JIRA: https://issues.redhat.com/browse/RHEL-63880
Upstream Status: RHEL-Only
This a very partial backport of 2ddec2c80b44 ("riscv, bpf: inline
bpf_get_smp_processor_id()") It doesn't backport any of the riscv
part, only the bpf_jit_inlines_helper_call() that is needed for the
next patch. I marked it RHEL-Only to prevent kerneloscope to believe
that 2ddec2c80b44 has been backported.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-63880
commit 28ead3eaabc16ecc907cfb71876da028080f6356
Author: Xu Kuohai <xukuohai@huawei.com>
Date: Fri Jul 19 19:00:53 2024 +0800
bpf: Prevent tail call between progs attached to different hooks
bpf progs can be attached to kernel functions, and the attached functions
can take different parameters or return different return values. If
prog attached to one kernel function tail calls prog attached to another
kernel function, the ctx access or return value verification could be
bypassed.
For example, if prog1 is attached to func1 which takes only 1 parameter
and prog2 is attached to func2 which takes two parameters. Since verifier
assumes the bpf ctx passed to prog2 is constructed based on func2's
prototype, verifier allows prog2 to access the second parameter from
the bpf ctx passed to it. The problem is that verifier does not prevent
prog1 from passing its bpf ctx to prog2 via tail call. In this case,
the bpf ctx passed to prog2 is constructed from func1 instead of func2,
that is, the assumption for ctx access verification is bypassed.
Another example, if BPF LSM prog1 is attached to hook file_alloc_security,
and BPF LSM prog2 is attached to hook bpf_lsm_audit_rule_known. Verifier
knows the return value rules for these two hooks, e.g. it is legal for
bpf_lsm_audit_rule_known to return positive number 1, and it is illegal
for file_alloc_security to return positive number. So verifier allows
prog2 to return positive number 1, but does not allow prog1 to return
positive number. The problem is that verifier does not prevent prog1
from calling prog2 via tail call. In this case, prog2's return value 1
will be used as the return value for prog1's hook file_alloc_security.
That is, the return value rule is bypassed.
This patch adds restriction for tail call to prevent such bypasses.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240719110059.797546-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30774
commit ab224b9ef7c4eaa752752455ea79bd7022209d5d
Author: Rafael Passos <rafael@rcpassos.me>
Date: Fri Jun 14 23:24:09 2024 -0300
bpf: remove unused parameter in __bpf_free_used_btfs
Fixes a compiler warning. The __bpf_free_used_btfs function
was taking an extra unused struct bpf_prog_aux *aux param
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-3-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30774
Conflicts: omitting bits for unsupported arch (RISC-V)
commit 9919c5c98cb25dbf7e76aadb9beab55a2a25f830
Author: Rafael Passos <rafael@rcpassos.me>
Date: Fri Jun 14 23:24:08 2024 -0300
bpf: remove unused parameter in bpf_jit_binary_pack_finalize
Fixes a compiler warning. the bpf_jit_binary_pack_finalize function
was taking an extra bpf_prog parameter that went unused.
This removves it and updates the callers accordingly.
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-2-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
commit a3034872cd90a6881ad4e10ca6d30e1215a99ada
Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Mon Apr 29 15:00:05 2024 +0300
bpf: Switch to krealloc_array()
Let the krealloc_array() copy the original data and
check for a multiplication overflow.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
commit cb01621b6d91567ac74c8b95e4db731febdbdec3
Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Mon Apr 29 15:13:22 2024 +0300
bpf: Use struct_size()
Use struct_size() instead of hand writing it.
This is less verbose and more robust.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429121323.3818497-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
commit a7de265cb2d849f8986a197499ad58dca0a4f209
Author: Rafael Passos <rafael@rcpassos.me>
Date: Wed Apr 17 15:49:14 2024 -0300
bpf: Fix typos in comments
Found the following typos in comments, and fixed them:
s/unpriviledged/unprivileged/
s/reponsible/responsible/
s/possiblities/possibilities/
s/Divison/Division/
s/precsion/precision/
s/havea/have a/
s/reponsible/responsible/
s/responsibile/responsible/
s/tigher/tighter/
s/respecitve/respective/
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/6af7deb4-bb24-49e8-b3f1-8dd410597337@smtp-relay.sendinblue.com
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
Conflicts: There's a conflict with already applied upstream commit
66e13b615a0ce ("bpf: verifier: prevent userspace memory access").
It was probably resolved in a merge commit so adjust now to
align with upstream.
commit d503a04f8bc0c75dc9db9452d8cc79d748afb752
Author: Alexei Starovoitov <ast@kernel.org>
Date: Fri Apr 5 16:11:33 2024 -0700
bpf: Add support for certain atomics in bpf_arena to x86 JIT
Support atomics in bpf_arena that can be JITed as a single x86 instruction.
Instructions that are JITed as loops are not supported at the moment,
since they require more complex extable and loop logic.
JITs can choose to do smarter things with bpf_jit_supports_insn().
Like arm64 may decide to support all bpf atomics instructions
when emit_lse_atomic is available and none in ll_sc mode.
bpf_jit_supports_percpu_insn(), bpf_jit_supports_ptr_xchg() and
other such callbacks can be replaced with bpf_jit_supports_insn()
in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240405231134.17274-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
Conflicts: Context changes in arch/x86/net/bpf_jit_comp.c as commits
1d5f82d9dd47 ("bpf, x86: fix freeing of not-finalized bpf_prog_pack") and
95acd8817e66 ("bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT")
were taken out-of-order.
Move bpf_jit_supports_subprog_tailcalls() to its correct
place to align with upstream and prevent future conflics.
commit 7bdbf7446305cb65c510c16d57cde82bc76b234a
Author: Andrii Nakryiko <andrii@kernel.org>
Date: Mon Apr 1 19:13:02 2024 -0700
bpf: add special internal-only MOV instruction to resolve per-CPU addrs
Add a new BPF instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF JITs.
We use a special BPF_MOV | BPF_ALU64 | BPF_X form with insn->off field
set to BPF_ADDR_PERCPU = -1. I used negative offset value to distinguish
them from positive ones used by user-exposed instructions.
Such instruction performs a resolution of a per-CPU offset stored in
a register to a valid kernel address which can be dereferenced. It is
useful in any use case where absolute address of a per-CPU data has to
be resolved (e.g., in inlining bpf_map_lookup_elem()).
BPF disassembler is also taught to recognize them to support dumping
final BPF assembly code (non-JIT'ed version).
Add arch-specific way for BPF JITs to mark support for this instructions.
This patch also adds support for these instructions in x86-64 BPF JIT.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
commit 2e114248e086fb376405ed3f89b220f8586a2541
Author: Justin Stitt <justinstitt@google.com>
Date: Tue Apr 2 23:52:50 2024 +0000
bpf: Replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
bpf sym names get looked up and compared/cleaned with various string
apis. This suggests they need to be NUL-terminated (strncpy() suggests
this but does not guarantee it).
| static int compare_symbol_name(const char *name, char *namebuf)
| {
| cleanup_symbol_name(namebuf);
| return strcmp(name, namebuf);
| }
| static void cleanup_symbol_name(char *s)
| {
| ...
| res = strstr(s, ".llvm.");
| ...
| }
Use strscpy() as this method guarantees NUL-termination on the
destination buffer.
This patch also replaces two uses of strncpy() used in log.c. These are
simple replacements as postfix has been zero-initialized on the stack
and has source arguments with a size less than the destination's size.
Note that this patch uses the new 2-argument version of strscpy
introduced in commit e6584c3964f2f ("string: Allow 2-argument strscpy()").
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/bpf/20240402-strncpy-kernel-bpf-core-c-v1-1-7cb07a426e78@google.com
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
commit c733239f8f530872a1f80d8c45dcafbaff368737
Author: Christophe Leroy <christophe.leroy@csgroup.eu>
Date: Sat Mar 16 08:35:41 2024 +0100
bpf: Check return from set_memory_rox()
arch_protect_bpf_trampoline() and alloc_new_pack() call
set_memory_rox() which can fail, leading to unprotected memory.
Take into account return from set_memory_rox() function and add
__must_check flag to arch_protect_bpf_trampoline().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30773
commit 7d2cc63eca0c993c99d18893214abf8f85d566d8
Author: Christophe Leroy <christophe.leroy@csgroup.eu>
Date: Fri Mar 8 06:38:07 2024 +0100
bpf: Take return from set_memory_ro() into account with bpf_prog_lock_ro()
set_memory_ro() can fail, leaving memory unprotected.
Check its return and take it into account as an error.
Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linux-hardening@vger.kernel.org <linux-hardening@vger.kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-ID: <286def78955e04382b227cb3e4b6ba272a7442e3.1709850515.git.christophe.leroy@csgroup.eu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
commit 66c8473135c62f478301a0e5b3012f203562dfa6
Author: Andrii Nakryiko <andrii@kernel.org>
Date: Fri Mar 8 16:47:39 2024 -0800
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
prog->aux->sleepable is checked very frequently as part of (some) BPF
program run hot paths. So this extra aux indirection seems wasteful and
on busy systems might cause unnecessary memory cache misses.
Let's move sleepable flag into prog itself to eliminate unnecessary
pointer dereference.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Message-ID: <20240309004739.2961431-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
commit d6170e4aaf86424c24ce06e355b4573daa891b17
Author: Puranjay Mohan <puranjay12@gmail.com>
Date: Mon Mar 11 12:27:22 2024 +0000
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
On some architectures like ARM64, PMD_SIZE can be really large in some
configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is
512MB.
Use 2MB * num_possible_nodes() as the size for allocations done through
the prog pack allocator. On most architectures, PMD_SIZE will be equal
to 2MB in case of 4KB pages and will be greater than 2MB for bigger page
sizes.
Fixes: ea2babac63d4 ("bpf: Simplify bpf_prog_pack_[size|mask]")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Closes: https://lore.kernel.org/all/7e216c88-77ee-47b8-becc-a0f780868d3c@sirena.org.uk/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403092219.dhgcuz2G-lkp@intel.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Message-ID: <20240311122722.86232-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
commit 142fd4d2dcf58b1720a6af644f31de1a5551f219
Author: Alexei Starovoitov <ast@kernel.org>
Date: Thu Mar 7 17:08:02 2024 -0800
bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.
LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))).
The addr_space=1 is reserved as bpf_arena address space.
rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY
rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:
aux_reg = upper_32_bits of arena->user_vm_start
aux_reg <<= 32
wX = wY // clear upper 32 bits of dst register
if (wX) // if not zero add upper bits of user_vm_start
wX |= aux_reg
JIT can do it more efficiently:
mov dst_reg32, src_reg32 // 32-bit move
shl dst_reg, 32
or dst_reg, user_vm_start
rol dst_reg, 32
xor r11, r11
test dst_reg32, dst_reg32 // check if lower 32-bit are zero
cmove r11, dst_reg // if so, set dst_reg to zero
// Intel swapped src/dst register encoding in CMOVcc
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-5-alexei.starovoitov@gmail.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
commit 317460317a02a1af512697e6e964298dedd8a163
Author: Alexei Starovoitov <ast@kernel.org>
Date: Thu Mar 7 17:07:59 2024 -0800
bpf: Introduce bpf_arena.
Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.
Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
anonymous region, like memcached or any key/value storage. The bpf
program implements an in-kernel accelerator. XDP prog can search for
a key in bpf_arena and return a value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash
tables, rb-trees, sparse arrays), while user space consumes it.
3. bpf_arena is a "heap" of memory from the bpf program's point of view.
The user space may mmap it, but bpf program will not convert pointers
to user base at run-time to improve bpf program speed.
Initially, the kernel vm_area and user vma are not populated. User space
can fault in pages within the range. While servicing a page fault,
bpf_arena logic will insert a new page into the kernel and user vmas. The
bpf program can allocate pages from that region via
bpf_arena_alloc_pages(). This kernel function will insert pages into the
kernel vm_area. The subsequent fault-in from user space will populate that
page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
can be used to prevent fault-in from user space. In such a case, if a page
is not allocated by the bpf program and not present in the kernel vm_area,
the user process will segfault. This is useful for use cases 2 and 3 above.
bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
pages and removes the range from the kernel vm_area and from user process
vmas.
bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
bpf program is more important than ease of sharing with user space. This is
use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
instruction as a 32-bit move wX = wY, which will improve bpf prog
performance. Otherwise, bpf_arena_cast_user is translated by JIT to
conditionally add the upper 32 bits of user vm_start (if the pointer is not
NULL) to arena pointers before they are stored into memory. This way, user
space sees them as valid 64-bit pointers.
Diff https://github.com/llvm/llvm-project/pull/84410 enables LLVM BPF
backend generate the bpf_addr_space_cast() instruction to cast pointers
between address_space(1) which is reserved for bpf_arena pointers and
default address space zero. All arena pointers in a bpf program written in
C language are tagged as __attribute__((address_space(1))). Hence, clang
provides helpful diagnostics when pointers cross address space. Libbpf and
the kernel support only address_space == 1. All other address space
identifiers are reserved.
rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
verifier that rX->type = PTR_TO_ARENA. Any further operations on
PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
to copy_from_kernel_nofault() except that no address checks are necessary.
The address is guaranteed to be in the 4GB range. If the page is not
present, the destination register is zeroed on read, and the operation is
ignored on write.
rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
verifier converts such cast instructions to mov32. Otherwise, JIT will emit
native code equivalent to:
rX = (u32)rY;
if (rY)
rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list
built by a bpf program can be walked natively by user space.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Barret Rhoden <brho@google.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
commit 011832b97b311bb9e3c27945bc0d1089a14209c9
Author: Alexei Starovoitov <ast@kernel.org>
Date: Tue Mar 5 19:19:26 2024 -0800
bpf: Introduce may_goto instruction
Introduce may_goto instruction that from the verifier pov is similar to
open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it
doesn't iterate any objects.
In assembly 'may_goto' is a nop most of the time until bpf runtime has to
terminate the program for whatever reason. In the current implementation
may_goto has a hidden counter, but other mechanisms can be used.
For programs written in C the later patch introduces 'cond_break' macro
that combines 'may_goto' with 'break' statement and has similar semantics:
cond_break is a nop until bpf runtime has to break out of this loop.
It can be used in any normal "for" or "while" loop, like
for (i = zero; i < cnt; cond_break, i++) {
The verifier recognizes that may_goto is used in the program, reserves
additional 8 bytes of stack, initializes them in subprog prologue, and
replaces may_goto instruction with:
aux_reg = *(u64 *)(fp - 40)
if aux_reg == 0 goto pc+off
aux_reg -= 1
*(u64 *)(fp - 40) = aux_reg
may_goto instruction can be used by LLVM to implement __builtin_memcpy,
__builtin_strcmp.
may_goto is not a full substitute for bpf_for() macro.
bpf_for() doesn't have induction variable that verifiers sees,
so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded.
But when the code is written as:
for (i = 0; i < 100; cond_break, i++)
the verifier see 'i' as precise constant zero,
hence cond_break (aka may_goto) doesn't help to converge the loop.
A static or global variable can be used as a workaround:
static int zero = 0;
for (i = zero; i < 100; cond_break, i++) // works!
may_goto works well with arena pointers that don't need to be bounds
checked on access. Load/store from arena returns imprecise unbounded
scalar and loops with may_goto pass the verifier.
Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn.
JCOND stands for conditional pseudo jump.
Since goto_or_nop insn was proposed, it may use the same opcode.
may_goto vs goto_or_nop can be distinguished by src_reg:
code = BPF_JMP | BPF_JCOND
src_reg = 0 - may_goto
src_reg = 1 - goto_or_nop
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
commit caf8f28e036c4ba1e823355da6c0c01c39e70ab9
Author: Andrii Nakryiko <andrii@kernel.org>
Date: Tue Jan 23 18:21:03 2024 -0800
bpf: Add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.
Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23649
Conflicts:
Context change. The bpf_arch_poke_desc_update() function has been
placed before bpf_jit_supports_exceptions() and
bpf_jit_supports_exceptions() by a previous backport (commit
fb0a7b0e48 "bpf: Fix prog_array_map_poke_run map poke update").
Reorder the function as upstream to help with furture backport.
commit 7c05e7f3e74e7e550534d524e04d7e6f78d6fa24
Author: Hou Tao <houtao1@huawei.com>
Date: Fri Jan 5 18:48:17 2024 +0800
bpf: Support inlining bpf_kptr_xchg() helper
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.
bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
atomic_xchg() on pointer-sized words.
It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.
For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23644
Conflicts: changed context due to missing upstream commit
89245600941e4 ("cfi: Switch to -fsanitize=kcfi")
commit 4f9087f16651aca4a5f32da840a53f6660f0579a
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Dec 15 10:12:18 2023 +0100
x86/cfi,bpf: Fix BPF JIT call
The current BPF call convention is __nocfi, except when it calls !JIT things,
then it calls regular C functions.
It so happens that with FineIBT the __nocfi and C calling conventions are
incompatible. Specifically __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target. Causing #CP.
Notably this only triggers on IBT enabled hardware, which is probably why this
hasn't been reported (also, most people will have JIT on anyway).
Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for
x86.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23644
commit f08a1c658257c73697a819c4ded3a84b6f0ead74
Author: Song Liu <song@kernel.org>
Date: Wed Dec 6 14:40:48 2023 -0800
bpf: Let bpf_prog_pack_free handle any pointer
Currently, bpf_prog_pack_free only can only free pointer to struct
bpf_binary_header, which is not flexible. Add a size argument to
bpf_prog_pack_free so that it can handle any pointer.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23644
commit 8062fb12de99b2da33754c6a3be1bfc30d9a35f4
Author: Andrii Nakryiko <andrii@kernel.org>
Date: Thu Nov 30 10:52:20 2023 -0800
bpf: consistently use BPF token throughout BPF verifier logic
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23644
commit e1cef620f598853a90f17701fcb1057a6768f7b8
Author: Andrii Nakryiko <andrii@kernel.org>
Date: Thu Nov 30 10:52:18 2023 -0800
bpf: add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
allowed BPF program types and attach types, derived from BPF FS at BPF
token creation time. Then make sure we perform bpf_token_capable()
checks everywhere where it's relevant.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23644
commit af66bfd3c8538ed21cf72af18426fc4a408665cf
Author: Hou Tao <houtao1@huawei.com>
Date: Mon Dec 4 22:04:23 2023 +0800
bpf: Optimize the free of inner map
When removing the inner map from the outer map, the inner map will be
freed after one RCU grace period and one RCU tasks trace grace
period, so it is certain that the bpf program, which may access the
inner map, has exited before the inner map is freed.
However there is no need to wait for one RCU tasks trace grace period if
the outer map is only accessed by non-sleepable program. So adding
sleepable_refcnt in bpf_map and increasing sleepable_refcnt when adding
the outer map into env->used_maps for sleepable program. Although the
max number of bpf program is INT_MAX - 1, the number of bpf programs
which are being loaded may be greater than INT_MAX, so using atomic64_t
instead of atomic_t for sleepable_refcnt. When removing the inner map
from the outer map, using sleepable_refcnt to decide whether or not a
RCU tasks trace grace period is needed before freeing the inner map.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23643
Conflicts: added required include that was added upstream by missing
commit 680ee0456a571 ("net: invert the netdevice.h vs xdp.h
dependency")
commit 1fda5bb66ad8fb24ecb3858e61a13a6548428898
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Fri Nov 10 17:39:28 2023 -0800
bpf: Do not allocate percpu memory at init stage
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d8161 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.
I tried to reproduce the issue with my local VM which at most supports upto
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.
This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
signicantly.
[1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/
Fixes: 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23643
commit 66d9111f3517f85ef2af0337ece02683ce0faf21
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date: Wed Sep 13 01:32:08 2023 +0200
bpf: Detect IP == ksym.end as part of BPF program
Now that bpf_throw kfunc is the first such call instruction that has
noreturn semantics within the verifier, this also kicks in dead code
elimination in unprecedented ways. For one, any instruction following
a bpf_throw call will never be marked as seen. Moreover, if a callchain
ends up throwing, any instructions after the call instruction to the
eventually throwing subprog in callers will also never be marked as
seen.
The tempting way to fix this would be to emit extra 'int3' instructions
which bump the jited_len of a program, and ensure that during runtime
when a program throws, we can discover its boundaries even if the call
instruction to bpf_throw (or to subprogs that always throw) is emitted
as the final instruction in the program.
An example of such a program would be this:
do_something():
...
r0 = 0
exit
foo():
r1 = 0
call bpf_throw
r0 = 0
exit
bar(cond):
if r1 != 0 goto pc+2
call do_something
exit
call foo
r0 = 0 // Never seen by verifier
exit //
main(ctx):
r1 = ...
call bar
r0 = 0
exit
Here, if we do end up throwing, the stacktrace would be the following:
bpf_throw
foo
bar
main
In bar, the final instruction emitted will be the call to foo, as such,
the return address will be the subsequent instruction (which the JIT
emits as int3 on x86). This will end up lying outside the jited_len of
the program, thus, when unwinding, we will fail to discover the return
address as belonging to any program and end up in a panic due to the
unreliable stack unwinding of BPF programs that we never expect.
To remedy this case, make bpf_prog_ksym_find treat IP == ksym.end as
part of the BPF program, so that is_bpf_text_address returns true when
such a case occurs, and we are able to unwind reliably when the final
instruction ends up being a call instruction.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23643
commit f18b03fabaa9b7c80e80b72a621f481f0d706ae0
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date: Wed Sep 13 01:32:01 2023 +0200
bpf: Implement BPF exceptions
This patch implements BPF exceptions, and introduces a bpf_throw kfunc
to allow programs to throw exceptions during their execution at runtime.
A bpf_throw invocation is treated as an immediate termination of the
program, returning back to its caller within the kernel, unwinding all
stack frames.
This allows the program to simplify its implementation, by testing for
runtime conditions which the verifier has no visibility into, and assert
that they are true. In case they are not, the program can simply throw
an exception from the other branch.
BPF exceptions are explicitly *NOT* an unlikely slowpath error handling
primitive, and this objective has guided design choices of the
implementation of the them within the kernel (with the bulk of the cost
for unwinding the stack offloaded to the bpf_throw kfunc).
The implementation of this mechanism requires use of add_hidden_subprog
mechanism introduced in the previous patch, which generates a couple of
instructions to move R1 to R0 and exit. The JIT then rewrites the
prologue of this subprog to take the stack pointer and frame pointer as
inputs and reset the stack frame, popping all callee-saved registers
saved by the main subprog. The bpf_throw function then walks the stack
at runtime, and invokes this exception subprog with the stack and frame
pointers as parameters.
Reviewers must take note that currently the main program is made to save
all callee-saved registers on x86_64 during entry into the program. This
is because we must do an equivalent of a lightweight context switch when
unwinding the stack, therefore we need the callee-saved registers of the
caller of the BPF program to be able to return with a sane state.
Note that we have to additionally handle r12, even though it is not used
by the program, because when throwing the exception the program makes an
entry into the kernel which could clobber r12 after saving it on the
stack. To be able to preserve the value we received on program entry, we
push r12 and restore it from the generated subprogram when unwinding the
stack.
For now, bpf_throw invocation fails when lingering resources or locks
exist in that path of the program. In a future followup, bpf_throw will
be extended to perform frame-by-frame unwinding to release lingering
resources for each stack frame, removing this limitation.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23643
commit 335d1c5b545284d75ef96ee42e461eacefe865bb
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date: Wed Sep 13 01:32:00 2023 +0200
bpf: Implement support for adding hidden subprogs
Introduce support in the verifier for generating a subprogram and
include it as part of a BPF program dynamically after the do_check phase
is complete. The first user will be the next patch which generates
default exception callbacks if none are set for the program. The phase
of invocation will be do_misc_fixups. Note that this is an internal
verifier function, and should be used with instruction blocks which
uphold the invariants stated in check_subprogs.
Since these subprogs are always appended to the end of the instruction
sequence of the program, it becomes relatively inexpensive to do the
related adjustments to the subprog_info of the program. Only the fake
exit subprogram is shifted forward, making room for our new subprog.
This is useful to insert a new subprogram, get it JITed, and obtain its
function pointer. The next patch will use this functionality to insert a
default exception callback which will be invoked after unwinding the
stack.
Note that these added subprograms are invisible to userspace, and never
reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
subprogram is supported, but more can be easily supported in the future.
To this end, two function counts are introduced now, the existing
func_cnt, and real_func_cnt, the latter including hidden programs. This
allows us to conver the JIT code to use the real_func_cnt for management
of resources while syscall path continues working with existing
func_cnt.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-23643
commit 41a5db8d8161457b121a03fde999ff6e00090ee2
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Sun Aug 27 08:27:34 2023 -0700
bpf: Add support for non-fix-size percpu mem allocation
This is needed for later percpu mem allocation when the
allocation is done by bpf program. For such cases, a global
bpf_global_percpu_ma is added where a flexible allocation
size is needed.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25415
Conflicts: Minor drift issues.
commit fd5d27b70188379bb441d404c29a0afb111e1753
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date: Wed Sep 13 01:31:59 2023 +0200
arch/x86: Implement arch_bpf_stack_walk
The plumbing for offline unwinding when we throw an exception in
programs would require walking the stack, hence introduce a new
arch_bpf_stack_walk function. This is provided when the JIT supports
exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
code is really minimal, hence it should be straightforward to extend
this support to other architectures as well, as it reuses the logic of
arch_stack_walk, but allowing access to unwind_state data.
Once the stack pointer and frame pointer are known for the main subprog
during the unwinding, we know the stack layout and location of any
callee-saved registers which must be restored before we return back to
the kernel. This handling will be added in the subsequent patches.
Note that while we primarily unwind through BPF frames, which are
effectively CONFIG_UNWINDER_FRAME_POINTER, we still need one of this or
CONFIG_UNWINDER_ORC to be able to unwind through the bpf_throw frame
from which we begin walking the stack. We also require both sp and bp
(stack and frame pointers) from the unwind_state structure, which are
only available when one of these two options are enabled.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25415
Conflicts: Minor drift issues, and not worried about unsupported arches.
Changes to arch/arm/mach-omap[12] are made in arch/arm/plat-omap which
is unified in RHEL9.
commit d48567c9a0d1e605639f8a8705a61bbb55fb4e84
Author: Peter Zijlstra <peterz@infradead.org>
Date: Wed Oct 26 12:13:03 2022 +0200
mm: Introduce set_memory_rox()
Because endlessly repeating:
set_memory_ro()
set_memory_x()
is getting tedious.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit dfce9cb3140592b886838e06f3e0c25fea2a9cae
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Thu Nov 30 18:46:40 2023 -0800
bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
Bpf cpu=v4 support is introduced in [1] and Commit 4cd58e9af8b9
("bpf: Support new 32bit offset jmp instruction") added support for new
32bit offset jmp instruction. Unfortunately, in function
bpf_adj_delta_to_off(), for new branch insn with 32bit offset, the offset
(plus/minor a small delta) compares to 16-bit offset bound
[S16_MIN, S16_MAX], which caused the following verification failure:
$ ./test_progs-cpuv4 -t verif_scale_pyperf180
...
insn 10 cannot be patched due to 16-bit range
...
libbpf: failed to load object 'pyperf180.bpf.o'
scale_test:FAIL:expect_success unexpected error: -12 (errno 12)
#405 verif_scale_pyperf180:FAIL
Note that due to recent llvm18 development, the patch [2] (already applied
in bpf-next) needs to be applied to bpf tree for testing purpose.
The fix is rather simple. For 32bit offset branch insn, the adjusted
offset compares to [S32_MIN, S32_MAX] and then verification succeeded.
[1] https://lore.kernel.org/all/20230728011143.3710005-1-yonghong.song@linux.dev
[2] https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev
Fixes: 4cd58e9af8b9 ("bpf: Support new 32bit offset jmp instruction")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231201024640.3417057-1-yonghong.song@linux.dev
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit 20e490adea279d49d57b800475938f5b67926d98
Author: Puranjay Mohan <puranjay12@gmail.com>
Date: Thu Aug 31 13:12:26 2023 +0000
bpf: make bpf_prog_pack allocator portable
The bpf_prog_pack allocator currently uses module_alloc() and
module_memfree() to allocate and free memory. This is not portable
because different architectures use different methods for allocating
memory for BPF programs. Like ARM64 and riscv use vmalloc()/vfree().
Use bpf_jit_alloc_exec() and bpf_jit_free_exec() for memory management
in bpf_prog_pack allocator. Other architectures can override these with
their implementation and will be able to use bpf_prog_pack directly.
On architectures that don't override bpf_jit_alloc/free_exec() this is
basically a NOP.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230831131229.497941-2-puranjay12@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit 6a5a148aaf14747570cc634f9cdfcb0393f5617f
Author: Arnd Bergmann <arnd@arndb.de>
Date: Tue Aug 1 13:13:58 2023 +0200
bpf: fix bpf_probe_read_kernel prototype mismatch
bpf_probe_read_kernel() has a __weak definition in core.c and another
definition with an incompatible prototype in kernel/trace/bpf_trace.c,
when CONFIG_BPF_EVENTS is enabled.
Since the two are incompatible, there cannot be a shared declaration in
a header file, but the lack of a prototype causes a W=1 warning:
kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]
On 32-bit architectures, the local prototype
u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
passes arguments in other registers as the one in bpf_trace.c
BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
const void *, unsafe_ptr)
which uses 64-bit arguments in pairs of registers.
As both versions of the function are fairly simple and only really
differ in one line, just move them into a header file as an inline
function that does not add any overhead for the bpf_trace.c callers
and actually avoids a function call for the other one.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit 09fedc731874123e0f6e5e5e3572db0c60378c2a
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Thu Jul 27 22:57:40 2023 -0700
bpf: Fix compilation warning with -Wparentheses
The kernel test robot reported compilation warnings when -Wparentheses is
added to KBUILD_CFLAGS with gcc compiler. The following is the error message:
.../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
.../bpf-next/kernel/bpf/verifier.c:5901:14:
error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
if (s64_max >= 0 == s64_min >= 0) {
~~~~~~~~^~~~
.../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
.../bpf-next/kernel/bpf/verifier.c:5965:14:
error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
if (s32_min >= 0 == s32_max >= 0) {
~~~~~~~~^~~~
To fix the issue, add proper parentheses for the above '>=' condition
to silence the warning/error.
I tried a few clang compilers like clang16 and clang18 and they do not emit
such warnings with -Wparentheses.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit 4cd58e9af8b9d9fff6b7145e742abbfcda0af4af
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Thu Jul 27 18:12:31 2023 -0700
bpf: Support new 32bit offset jmp instruction
Add interpreter/jit/verifier support for 32bit offset jmp instruction.
If a conditional jmp instruction needs more than 16bit offset,
it can be simulated with a conditional jmp + a 32bit jmp insn.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit ec0e2da95f72d4a46050a4d994e4fe471474fd80
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Thu Jul 27 18:12:19 2023 -0700
bpf: Support new signed div/mod instructions.
Add interpreter/jit support for new signed div/mod insns.
The new signed div/mod instructions are encoded with
unsigned div/mod instructions plus insn->off == 1.
Also add basic verifier support to ensure new insns get
accepted.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
Conflicts: Context change from missing commit 291d044fd51f ("bpf: Fix
precision tracking for BPF_ALU | BPF_TO_BE | BPF_END")
commit 8100928c881482a73ed8bd499d602bab0fe55608
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Thu Jul 27 18:12:02 2023 -0700
bpf: Support new sign-extension mov insns
Add interpreter/jit support for new sign-extension mov insns.
The original 'MOV' insn is extended to support reg-to-reg
signed version for both ALU and ALU64 operations. For ALU mode,
the insn->off value of 8 or 16 indicates sign-extension
from 8- or 16-bit value to 32-bit value. For ALU64 mode,
the insn->off value of 8/16/32 indicates sign-extension
from 8-, 16- or 32-bit value to 64-bit value.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-10691
commit 1f9a1ea821ff25353a0e80d971e7958cd55b47a3
Author: Yonghong Song <yonghong.song@linux.dev>
Date: Thu Jul 27 18:11:56 2023 -0700
bpf: Support new sign-extension load insns
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9957
commit ba49f976885869835a1783863376221dc24f1817
Author: Arnd Bergmann <arnd@arndb.de>
Date: Fri Jun 2 15:50:18 2023 +0200
bpf: Hide unused bpf_patch_call_args
This function is only used when CONFIG_BPF_JIT_ALWAYS_ON is disabled, but
CONFIG_BPF_SYSCALL is enabled. When both are turned off, the prototype is
missing but the unused function is still compiled, as seen from this W=1
warning:
[...]
kernel/bpf/core.c:2075:6: error: no previous prototype for 'bpf_patch_call_args' [-Werror=missing-prototypes]
[...]
Add a matching #ifdef for the definition to leave it out.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230602135128.1498362-1-arnd@kernel.org
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2221599
Conflicts: already present dd16dc6e35 "x86/speculation: Include
unprivileged eBPF status in Spectre v2 mitigation reporting"
commit 1cf3bfc60f9836f44da951f58b6ae24680484b35
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date: Thu Apr 13 01:06:32 2023 +0200
bpf: Support 64-bit pointers to kfuncs
test_ksyms_module fails to emit a kfunc call targeting a module on
s390x, because the verifier stores the difference between kfunc
address and __bpf_call_base in bpf_insn.imm, which is s32, and modules
are roughly (1 << 42) bytes away from the kernel on s390x.
Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs,
and storing the absolute address in bpf_kfunc_desc.
Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new
behavior to the s390x JIT. Otherwise other JITs need to be modified,
which is not desired.
Introduce bpf_get_kfunc_addr() instead of exposing both
find_kfunc_desc() and struct bpf_kfunc_desc.
In addition to sorting kfuncs by imm, also sort them by offset, in
order to handle conflicting imms from different modules. Do this on
all architectures in order to simplify code.
Factor out resolving specialized kfuncs (XPD and dynptr) from
fixup_kfunc_call(). This was required in the first place, because
fixup_kfunc_call() uses find_kfunc_desc(), which returns a const
pointer, so it's not possible to modify kfunc addr without stripping
const, which is not nice. It also removes repetition of code like:
if (bpf_jit_supports_far_kfunc_call())
desc->addr = func;
else
insn->imm = BPF_CALL_IMM(func);
and separates kfunc_desc_tab fixups from kfunc_call fixups.
Suggested-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2178930
commit 10ec8ca8ec1a2f04c4ed90897225231c58c124a7
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon Mar 20 15:37:25 2023 +0100
bpf: Adjust insufficient default bpf_jit_limit
We've seen recent AWS EKS (Kubernetes) user reports like the following:
After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
clusters after a few days a number of the nodes have containers stuck
in ContainerCreating state or liveness/readiness probes reporting the
following error:
Readiness probe errored: rpc error: code = Unknown desc = failed to
exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
OCI runtime exec failed: exec failed: unable to start container process:
unable to init seccomp: error loading seccomp filter into kernel:
error loading seccomp filter: errno 524: unknown
However, we had not been seeing this issue on previous AMIs and it only
started to occur on v20230217 (following the upgrade from kernel 5.4 to
5.10) with no other changes to the underlying cluster or workloads.
We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
which helped to immediately allow containers to be created and probes to
execute but after approximately a day the issue returned and the value
returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
was steadily increasing.
I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem
their sizes passed in as well as bpf_jit_current under tcpdump BPF filter,
seccomp BPF and native (e)BPF programs, and the behavior all looks sane
and expected, that is nothing "leaking" from an upstream perspective.
The bpf_jit_limit knob was originally added in order to avoid a situation
where unprivileged applications loading BPF programs (e.g. seccomp BPF
policies) consuming all the module memory space via BPF JIT such that loading
of kernel modules would be prevented. The default limit was defined back in
2018 and while good enough back then, we are generally seeing far more BPF
consumers today.
Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
module memory space to better reflect today's needs and avoid more users
running into potentially hard to debug issues.
Fixes: fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Reported-by: Stephen Haynes <sh@synk.net>
Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/awslabs/amazon-eks-ami/issues/1179
Link: https://github.com/awslabs/amazon-eks-ami/issues/1219
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2178930
commit f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Feb 22 09:52:32 2023 -0800
bpf: add missing header file include
Commit 74e19ef0ff80 ("uaccess: Add speculation barrier to
copy_from_user()") built fine on x86-64 and arm64, and that's the extent
of my local build testing.
It turns out those got the <linux/nospec.h> include incidentally through
other header files (<linux/kvm_host.h> in particular), but that was not
true of other architectures, resulting in build errors
kernel/bpf/core.c: In function ‘___bpf_prog_run’:
kernel/bpf/core.c:1913:3: error: implicit declaration of function ‘barrier_nospec’
so just make sure to explicitly include the proper <linux/nospec.h>
header file to make everybody see it.
Fixes: 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Reported-by: Huacai Chen <chenhuacai@loongson.cn>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>