Commit Graph

110 Commits

Author SHA1 Message Date
Felix Maurer 1496583229 bpf: fix OOB devmap writes when deleting elements
JIRA: https://issues.redhat.com/browse/RHEL-68071

commit ab244dd7cf4c291f82faacdc50b45cc0f55b674d
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date:   Fri Nov 22 13:10:30 2024 +0100

    bpf: fix OOB devmap writes when deleting elements
    
    Jordy reported issue against XSKMAP which also applies to DEVMAP - the
    index used for accessing map entry, due to being a signed integer,
    causes the OOB writes. Fix is simple as changing the type from int to
    u32, however, when compared to XSKMAP case, one more thing needs to be
    addressed.
    
    When map is released from system via dev_map_free(), we iterate through
    all of the entries and an iterator variable is also an int, which
    implies OOB accesses. Again, change it to be u32.
    
    Example splat below:
    
    [  160.724676] BUG: unable to handle page fault for address: ffffc8fc2c001000
    [  160.731662] #PF: supervisor read access in kernel mode
    [  160.736876] #PF: error_code(0x0000) - not-present page
    [  160.742095] PGD 0 P4D 0
    [  160.744678] Oops: Oops: 0000 [#1] PREEMPT SMP
    [  160.749106] CPU: 1 UID: 0 PID: 520 Comm: kworker/u145:12 Not tainted 6.12.0-rc1+ #487
    [  160.757050] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
    [  160.767642] Workqueue: events_unbound bpf_map_free_deferred
    [  160.773308] RIP: 0010:dev_map_free+0x77/0x170
    [  160.777735] Code: 00 e8 fd 91 ed ff e8 b8 73 ed ff 41 83 7d 18 19 74 6e 41 8b 45 24 49 8b bd f8 00 00 00 31 db 85 c0 74 48 48 63 c3 48 8d 04 c7 <48> 8b 28 48 85 ed 74 30 48 8b 7d 18 48 85 ff 74 05 e8 b3 52 fa ff
    [  160.796777] RSP: 0018:ffffc9000ee1fe38 EFLAGS: 00010202
    [  160.802086] RAX: ffffc8fc2c001000 RBX: 0000000080000000 RCX: 0000000000000024
    [  160.809331] RDX: 0000000000000000 RSI: 0000000000000024 RDI: ffffc9002c001000
    [  160.816576] RBP: 0000000000000000 R08: 0000000000000023 R09: 0000000000000001
    [  160.823823] R10: 0000000000000001 R11: 00000000000ee6b2 R12: dead000000000122
    [  160.831066] R13: ffff88810c928e00 R14: ffff8881002df405 R15: 0000000000000000
    [  160.838310] FS:  0000000000000000(0000) GS:ffff8897e0c40000(0000) knlGS:0000000000000000
    [  160.846528] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  160.852357] CR2: ffffc8fc2c001000 CR3: 0000000005c32006 CR4: 00000000007726f0
    [  160.859604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  160.866847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  160.874092] PKRU: 55555554
    [  160.876847] Call Trace:
    [  160.879338]  <TASK>
    [  160.881477]  ? __die+0x20/0x60
    [  160.884586]  ? page_fault_oops+0x15a/0x450
    [  160.888746]  ? search_extable+0x22/0x30
    [  160.892647]  ? search_bpf_extables+0x5f/0x80
    [  160.896988]  ? exc_page_fault+0xa9/0x140
    [  160.900973]  ? asm_exc_page_fault+0x22/0x30
    [  160.905232]  ? dev_map_free+0x77/0x170
    [  160.909043]  ? dev_map_free+0x58/0x170
    [  160.912857]  bpf_map_free_deferred+0x51/0x90
    [  160.917196]  process_one_work+0x142/0x370
    [  160.921272]  worker_thread+0x29e/0x3b0
    [  160.925082]  ? rescuer_thread+0x4b0/0x4b0
    [  160.929157]  kthread+0xd4/0x110
    [  160.932355]  ? kthread_park+0x80/0x80
    [  160.936079]  ret_from_fork+0x2d/0x50
    [  160.943396]  ? kthread_park+0x80/0x80
    [  160.950803]  ret_from_fork_asm+0x11/0x20
    [  160.958482]  </TASK>
    
    Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
    CC: stable@vger.kernel.org
    Reported-by: Jordy Zomer <jordyzomer@google.com>
    Suggested-by: Jordy Zomer <jordyzomer@google.com>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Link: https://lore.kernel.org/r/20241122121030.716788-3-maciej.fijalkowski@intel.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2025-01-09 17:43:24 +01:00
Rado Vrbovsky 56ce48fda2 Merge: BPF 6.10 rebase
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5693

Rebase BPF subsystem to upstream version 6.10

JIRA: https://issues.redhat.com/browse/RHEL-30773  
JIRA: https://issues.redhat.com/browse/RHEL-64874

CVE: CVE-2024-38564

Omitted-fix: 2897b1e2a2f4 ("selftests/bpf: Fix arena_atomics failure due to llvm change")  
Fix for a bug caused by a not yet released version of LLVM. Will be backported with 6.12 rebase where it belongs.

Omitted-fix: ff9fbcafbaf1 ("selftests/hid: fix bpf_wq new API")  
The fixed test is not yet present in RHEL 9.

Signed-off-by: Viktor Malik <vmalik@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Derek Barbosa <debarbos@redhat.com>
Approved-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:24:52 +00:00
Viktor Malik 109fea232b
bpf, devmap: Remove unnecessary if check in for loop
JIRA: https://issues.redhat.com/browse/RHEL-30773

commit 2317dc2c22cc353b699c7d1db47b2fe91f54055c
Author: Thorsten Blum <thorsten.blum@toblux.com>
Date:   Wed May 29 12:19:01 2024 +0200

    bpf, devmap: Remove unnecessary if check in for loop
    
    The iterator variable dst cannot be NULL and the if check can be removed.
    Remove it and fix the following Coccinelle/coccicheck warning reported
    by itnull.cocci:
    
    	ERROR: iterator variable bound on line 762 cannot be NULL
    
    Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/bpf/20240529101900.103913-2-thorsten.blum@toblux.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-11-13 09:39:21 +01:00
Felix Maurer 0b7b4054a5 bpf: devmap: provide rxq after redirect
JIRA: https://issues.redhat.com/browse/RHEL-65205

commit ca9984c5f0ab3690d98b13937b2485a978c8dd73
Author: Florian Kauer <florian.kauer@linutronix.de>
Date:   Wed Sep 11 10:41:18 2024 +0200

    bpf: devmap: provide rxq after redirect
    
    rxq contains a pointer to the device from where
    the redirect happened. Currently, the BPF program
    that was executed after a redirect via BPF_MAP_TYPE_DEVMAP*
    does not have it set.
    
    This is particularly bad since accessing ingress_ifindex, e.g.
    
    SEC("xdp")
    int prog(struct xdp_md *pkt)
    {
            return bpf_redirect_map(&dev_redirect_map, 0, 0);
    }
    
    SEC("xdp/devmap")
    int prog_after_redirect(struct xdp_md *pkt)
    {
            bpf_printk("ifindex %i", pkt->ingress_ifindex);
            return XDP_PASS;
    }
    
    depends on access to rxq, so a NULL pointer gets dereferenced:
    
    <1>[  574.475170] BUG: kernel NULL pointer dereference, address: 0000000000000000
    <1>[  574.475188] #PF: supervisor read access in kernel mode
    <1>[  574.475194] #PF: error_code(0x0000) - not-present page
    <6>[  574.475199] PGD 0 P4D 0
    <4>[  574.475207] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
    <4>[  574.475217] CPU: 4 UID: 0 PID: 217 Comm: kworker/4:1 Not tainted 6.11.0-rc5-reduced-00859-g780801200300 #23
    <4>[  574.475226] Hardware name: Intel(R) Client Systems NUC13ANHi7/NUC13ANBi7, BIOS ANRPL357.0026.2023.0314.1458 03/14/2023
    <4>[  574.475231] Workqueue: mld mld_ifc_work
    <4>[  574.475247] RIP: 0010:bpf_prog_5e13354d9cf5018a_prog_after_redirect+0x17/0x3c
    <4>[  574.475257] Code: cc cc cc cc cc cc cc 80 00 00 00 cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 0f 1e fa 48 8b 57 20 <48> 8b 52 00 8b 92 e0 00 00 00 48 bf f8 a6 d5 c4 5d a0 ff ff be 0b
    <4>[  574.475263] RSP: 0018:ffffa62440280c98 EFLAGS: 00010206
    <4>[  574.475269] RAX: ffffa62440280cd8 RBX: 0000000000000001 RCX: 0000000000000000
    <4>[  574.475274] RDX: 0000000000000000 RSI: ffffa62440549048 RDI: ffffa62440280ce0
    <4>[  574.475278] RBP: ffffa62440280c98 R08: 0000000000000002 R09: 0000000000000001
    <4>[  574.475281] R10: ffffa05dc8b98000 R11: ffffa05f577fca40 R12: ffffa05dcab24000
    <4>[  574.475285] R13: ffffa62440280ce0 R14: ffffa62440549048 R15: ffffa62440549000
    <4>[  574.475289] FS:  0000000000000000(0000) GS:ffffa05f4f700000(0000) knlGS:0000000000000000
    <4>[  574.475294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    <4>[  574.475298] CR2: 0000000000000000 CR3: 000000025522e000 CR4: 0000000000f50ef0
    <4>[  574.475303] PKRU: 55555554
    <4>[  574.475306] Call Trace:
    <4>[  574.475313]  <IRQ>
    <4>[  574.475318]  ? __die+0x23/0x70
    <4>[  574.475329]  ? page_fault_oops+0x180/0x4c0
    <4>[  574.475339]  ? skb_pp_cow_data+0x34c/0x490
    <4>[  574.475346]  ? kmem_cache_free+0x257/0x280
    <4>[  574.475357]  ? exc_page_fault+0x67/0x150
    <4>[  574.475368]  ? asm_exc_page_fault+0x26/0x30
    <4>[  574.475381]  ? bpf_prog_5e13354d9cf5018a_prog_after_redirect+0x17/0x3c
    <4>[  574.475386]  bq_xmit_all+0x158/0x420
    <4>[  574.475397]  __dev_flush+0x30/0x90
    <4>[  574.475407]  veth_poll+0x216/0x250 [veth]
    <4>[  574.475421]  __napi_poll+0x28/0x1c0
    <4>[  574.475430]  net_rx_action+0x32d/0x3a0
    <4>[  574.475441]  handle_softirqs+0xcb/0x2c0
    <4>[  574.475451]  do_softirq+0x40/0x60
    <4>[  574.475458]  </IRQ>
    <4>[  574.475461]  <TASK>
    <4>[  574.475464]  __local_bh_enable_ip+0x66/0x70
    <4>[  574.475471]  __dev_queue_xmit+0x268/0xe40
    <4>[  574.475480]  ? selinux_ip_postroute+0x213/0x420
    <4>[  574.475491]  ? alloc_skb_with_frags+0x4a/0x1d0
    <4>[  574.475502]  ip6_finish_output2+0x2be/0x640
    <4>[  574.475512]  ? nf_hook_slow+0x42/0xf0
    <4>[  574.475521]  ip6_finish_output+0x194/0x300
    <4>[  574.475529]  ? __pfx_ip6_finish_output+0x10/0x10
    <4>[  574.475538]  mld_sendpack+0x17c/0x240
    <4>[  574.475548]  mld_ifc_work+0x192/0x410
    <4>[  574.475557]  process_one_work+0x15d/0x380
    <4>[  574.475566]  worker_thread+0x29d/0x3a0
    <4>[  574.475573]  ? __pfx_worker_thread+0x10/0x10
    <4>[  574.475580]  ? __pfx_worker_thread+0x10/0x10
    <4>[  574.475587]  kthread+0xcd/0x100
    <4>[  574.475597]  ? __pfx_kthread+0x10/0x10
    <4>[  574.475606]  ret_from_fork+0x31/0x50
    <4>[  574.475615]  ? __pfx_kthread+0x10/0x10
    <4>[  574.475623]  ret_from_fork_asm+0x1a/0x30
    <4>[  574.475635]  </TASK>
    <4>[  574.475637] Modules linked in: veth br_netfilter bridge stp llc iwlmvm x86_pkg_temp_thermal iwlwifi efivarfs nvme nvme_core
    <4>[  574.475662] CR2: 0000000000000000
    <4>[  574.475668] ---[ end trace 0000000000000000 ]---
    
    Therefore, provide it to the program by setting rxq properly.
    
    Fixes: cb261b594b ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Florian Kauer <florian.kauer@linutronix.de>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20240911-devel-koalo-fix-ingress-ifindex-v4-1-5c643ae10258@linutronix.de
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-11-06 19:29:48 +01:00
Jerome Marchand b64c7f96d7 bpf: Fix DEVMAP_HASH overflow check on 32-bit arches
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit 281d464a34f540de166cee74b723e97ac2515ec3
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Thu Mar 7 13:03:35 2024 +0100

    bpf: Fix DEVMAP_HASH overflow check on 32-bit arches

    The devmap code allocates a number hash buckets equal to the next power
    of two of the max_entries value provided when creating the map. When
    rounding up to the next power of two, the 32-bit variable storing the
    number of buckets can overflow, and the code checks for overflow by
    checking if the truncated 32-bit value is equal to 0. However, on 32-bit
    arches the rounding up itself can overflow mid-way through, because it
    ends up doing a left-shift of 32 bits on an unsigned long value. If the
    size of an unsigned long is four bytes, this is undefined behaviour, so
    there is no guarantee that we'll end up with a nice and tidy 0-value at
    the end.

    Syzbot managed to turn this into a crash on arm32 by creating a
    DEVMAP_HASH with max_entries > 0x80000000 and then trying to update it.
    Fix this by moving the overflow check to before the rounding up
    operation.

    Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Link: https://lore.kernel.org/r/000000000000ed666a0611af6818@google.com
    Reported-and-tested-by: syzbot+8cd36f6b65f3cafd400a@syzkaller.appspotmail.com
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Message-ID: <20240307120340.99577-2-toke@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:14 +02:00
Jerome Marchand bdd2bacb15 bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 1ea66e89f68cd463ae189fb5d491df52e330136a
Author: Hou Tao <houtao1@huawei.com>
Date:   Fri Jul 28 09:49:42 2023 +0800

    bpf, devmap: Remove unused dtab field from bpf_dtab_netdev

    Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
    instances") removes the use of bpf_dtab_netdev::dtab in bq_enqueue(),
    so just remove dtab from bpf_dtab_netdev.

    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Link: https://lore.kernel.org/r/20230728014942.892272-3-houtao@huaweicloud.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:28:55 +01:00
Viktor Malik daad46d419
bpf: Centralize permissions checks for all BPF map types
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 6c3eba1c5e283fd2bb1c076dbfcb47f569c3bfde
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jun 13 15:35:32 2023 -0700

    bpf: Centralize permissions checks for all BPF map types
    
    This allows to do more centralized decisions later on, and generally
    makes it very explicit which maps are privileged and which are not
    (e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants,
    as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit
    and easy to verify).
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:20 +02:00
Artem Savkov 128dd7c7f8 bpf: return long from bpf_map_ops funcs
Bugzilla: https://bugzilla.redhat.com/2221599

commit d7ba4cc900bf1eea2d8c807c6b1fc6bd61f41237
Author: JP Kobryn <inwardvessel@gmail.com>
Date:   Wed Mar 22 12:47:54 2023 -0700

    bpf: return long from bpf_map_ops funcs
    
    This patch changes the return types of bpf_map_ops functions to long, where
    previously int was returned. Using long allows for bpf programs to maintain
    the sign bit in the absence of sign extension during situations where
    inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative
    error is returned.
    
    The definitions of the helper funcs are generated from comments in the bpf
    uapi header at `include/uapi/linux/bpf.h`. The return type of these
    helpers was previously changed from int to long in commit bdb7b79b4c. For
    any case where one of the map helpers call the bpf_map_ops funcs that are
    still returning 32-bit int, a compiler might not include sign extension
    instructions to properly convert the 32-bit negative value a 64-bit
    negative value.
    
    For example:
    bpf assembly excerpt of an inlined helper calling a kernel function and
    checking for a specific error:
    
    ; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
      ...
      46:	call   0xffffffffe103291c	; htab_map_update_elem
    ; if (err && err != -EEXIST) {
      4b:	cmp    $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
    
    kernel function assembly excerpt of return value from
    `htab_map_update_elem` returning 32-bit int:
    
    movl $0xffffffef, %r9d
    ...
    movl %r9d, %eax
    
    ...results in the comparison:
    cmp $0xffffffffffffffef, $0x00000000ffffffef
    
    Fixes: bdb7b79b4c ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
    Tested-by: Eduard Zingerman <eddyz87@gmail.com>
    Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
    Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:19 +02:00
Artem Savkov 6d4376f73f bpf: devmap memory usage
Bugzilla: https://bugzilla.redhat.com/2221599

commit fa5e83df173beb6c1334737a463fc2b4ef4efa8b
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Mar 5 12:46:07 2023 +0000

    bpf: devmap memory usage
    
    A new helper is introduced to calculate the memory usage of devmap and
    devmap_hash. The number of dynamically allocated elements are recored
    for devmap_hash already, but not for devmap. To track the memory size of
    dynamically allocated elements, this patch also count the numbers for
    devmap.
    
    The result as follows,
    - before
    40: devmap  name count_map  flags 0x80
            key 4B  value 4B  max_entries 65536  memlock 524288B
    41: devmap_hash  name count_map  flags 0x80
            key 4B  value 4B  max_entries 65536  memlock 524288B
    
    - after
    40: devmap  name count_map  flags 0x80  <<<< no elements
            key 4B  value 4B  max_entries 65536  memlock 524608B
    41: devmap_hash  name count_map  flags 0x80 <<<< no elements
            key 4B  value 4B  max_entries 65536  memlock 524608B
    
    Note that the number of buckets is same with max_entries for devmap_hash
    in this case.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20230305124615.12358-11-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:12 +02:00
Felix Maurer aaeec9dda9 bpf: devmap: check XDP features in __xdp_enqueue routine
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit b9d460c9245541b13de2369e79688f8e0acc0c3d
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date:   Wed Feb 1 11:24:22 2023 +0100

    bpf: devmap: check XDP features in __xdp_enqueue routine

    Check if the destination device implements ndo_xdp_xmit callback relying
    on NETDEV_XDP_ACT_NDO_XMIT flags. Moreover, check if the destination device
    supports XDP non-linear frame in __xdp_enqueue and is_valid_dst routines.
    This patch allows to perform XDP_REDIRECT on non-linear XDP buffers.

    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Link: https://lore.kernel.org/r/26a94c33520c0bfba021b3fbb2cb8c1e69bf53b8.1675245258.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-14 10:44:19 +02:00
Jerome Marchand 861012d6dd bpf: Expand map key argument of bpf_redirect_map to u64
Bugzilla: https://bugzilla.redhat.com/2177177

commit 32637e33003f36e75e9147788cc0e2f21706ef99
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Tue Nov 8 15:06:00 2022 +0100

    bpf: Expand map key argument of bpf_redirect_map to u64

    For queueing packets in XDP we want to add a new redirect map type with
    support for 64-bit indexes. To prepare fore this, expand the width of the
    'key' argument to the bpf_redirect_map() helper. Since BPF registers are
    always 64-bit, this should be safe to do after the fact.

    Acked-by: Song Liu <song@kernel.org>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:05 +02:00
Artem Savkov 954e5bcd83 bpf: Use bpf_map_area_alloc consistently on bpf map creation
Bugzilla: https://bugzilla.redhat.com/2166911

commit 73cf09a36bf7bfb3e5a3ff23755c36d49137c44d
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Aug 10 15:18:29 2022 +0000

    bpf: Use bpf_map_area_alloc consistently on bpf map creation
    
    Let's use the generic helper bpf_map_area_alloc() instead of the
    open-coded kzalloc helpers in bpf maps creation path.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20220810151840.16394-5-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
Artem Savkov c87afb33ed bpf: Make __GFP_NOWARN consistent in bpf map creation
Bugzilla: https://bugzilla.redhat.com/2166911

commit 992c9e13f5939437037627c67bcb51e674b64265
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Wed Aug 10 15:18:28 2022 +0000

    bpf: Make __GFP_NOWARN consistent in bpf map creation
    
    Some of the bpf maps are created with __GFP_NOWARN, i.e. arraymap,
    bloom_filter, bpf_local_storage, bpf_struct_ops, lpm_trie,
    queue_stack_maps, reuseport_array, stackmap and xskmap, while others are
    created without __GFP_NOWARN, i.e. cpumap, devmap, hashtab,
    local_storage, offload, ringbuf and sock_map. But there are not key
    differences between the creation of these maps. So let make this
    allocation flag consistent in all bpf maps creation. Then we can use a
    generic helper to alloc all bpf maps.
    
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20220810151840.16394-4-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:00 +01:00
Artem Savkov 23f4e366b0 bpf, devmap: Compute proper xdp_frame len redirecting frames
Bugzilla: https://bugzilla.redhat.com/2137876

commit bd82ea52f0ee2890b698155a6d7bf9ea5bd8930d
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date:   Sat Jul 23 19:17:10 2022 +0200

    bpf, devmap: Compute proper xdp_frame len redirecting frames
    
    Even if it is currently forbidden to XDP_REDIRECT a multi-frag xdp_frame into
    a devmap, compute proper xdp_frame length in __xdp_enqueue and is_valid_dst
    routines running xdp_get_frame_len().
    
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/894d99c01139e921bdb6868158ff8e67f661c072.1658596075.git.lorenzo@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:44 +01:00
Artem Savkov b40b6237e2 bpf: Make non-preallocated allocation low priority
Bugzilla: https://bugzilla.redhat.com/2137876

commit ace2bee839e08df324cb320763258dfd72e6120e
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sat Jul 9 15:44:56 2022 +0000

    bpf: Make non-preallocated allocation low priority
    
    GFP_ATOMIC doesn't cooperate well with memcg pressure so far, especially
    if we allocate too much GFP_ATOMIC memory. For example, when we set the
    memcg limit to limit a non-preallocated bpf memory, the GFP_ATOMIC can
    easily break the memcg limit by force charge. So it is very dangerous to
    use GFP_ATOMIC in non-preallocated case. One way to make it safe is to
    remove __GFP_HIGH from GFP_ATOMIC, IOW, use (__GFP_ATOMIC |
    __GFP_KSWAPD_RECLAIM) instead, then it will be limited if we allocate
    too much memory. There's a plan to completely remove __GFP_ATOMIC in the
    mm side[1], so let's use GFP_NOWAIT instead.
    
    We introduced BPF_F_NO_PREALLOC is because full map pre-allocation is
    too memory expensive for some cases. That means removing __GFP_HIGH
    doesn't break the rule of BPF_F_NO_PREALLOC, but has the same goal with
    it-avoiding issues caused by too much memory. So let's remove it.
    
    This fix can also apply to other run-time allocations, for example, the
    allocation in lpm trie, local storage and devmap. So let fix it
    consistently over the bpf code
    
    It also fixes a typo in the comment.
    
    [1]. https://lore.kernel.org/linux-mm/163712397076.13692.4727608274002939094@noble.neil.brown.name/
    
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: NeilBrown <neilb@suse.de>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Link: https://lore.kernel.org/r/20220709154457.57379-2-laoar.shao@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:38 +01:00
Yauheni Kaliuta 11fec2f10e bpf: Compute map_btf_id during build time
Bugzilla: https://bugzilla.redhat.com/2120968

commit c317ab71facc2cd0a94145973318a4c914e11acc
Author: Menglong Dong <imagedong@tencent.com>
Date:   Mon Apr 25 21:32:47 2022 +0800

    bpf: Compute map_btf_id during build time
    
    For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map
    types are computed during vmlinux-btf init:
    
      btf_parse_vmlinux() -> btf_vmlinux_map_ids_init()
    
    It will lookup the btf_type according to the 'map_btf_name' field in
    'struct bpf_map_ops'. This process can be done during build time,
    thanks to Jiri's resolve_btfids.
    
    selftest of map_ptr has passed:
    
      $96 map_ptr:OK
      Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
    
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:00 +02:00
Jiri Benc d1647a95d0 bpf: generalise tail call map compatibility check
Bugzilla: https://bugzilla.redhat.com/2120966

commit f45d5b6ce2e835834c94b8b700787984f02cd662
Author: Toke Hoiland-Jorgensen <toke@redhat.com>
Date:   Fri Jan 21 11:10:02 2022 +0100

    bpf: generalise tail call map compatibility check

    The check for tail call map compatibility ensures that tail calls only
    happen between maps of the same type. To ensure backwards compatibility for
    XDP frags we need a similar type of check for cpumap and devmap
    programs, so move the state from bpf_array_aux into bpf_map, add
    xdp_has_frags to the check, and apply the same check to cpumap and devmap.

    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Toke Hoiland-Jorgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/f19fd97c0328a39927f3ad03e1ca6b43fd53cdfd.1642758637.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:42 +02:00
Felix Maurer 8a6c336726 xdp: Move conversion to xdp_frame out of map functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit d53ad5d8b218a885e95080d4d3d556b16b91b1b9
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Mon Jan 3 16:08:09 2022 +0100

    xdp: Move conversion to xdp_frame out of map functions

    All map redirect functions except XSK maps convert xdp_buff to xdp_frame
    before enqueueing it. So move this conversion of out the map functions
    and into xdp_do_redirect(). This removes a bit of duplicated code, but more
    importantly it makes it possible to support caller-allocated xdp_frame
    structures, which will be added in a subsequent commit.

    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 12:53:57 +02:00
Ivan Vecera ca7c7d9c0c bpf: Let bpf_warn_invalid_xdp_action() report more info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454

Conflicts:
- N/A hunk for unsupported octeontx2 driver omitted

commit c8064e5b4adac5e1255cf4f3b374e75b5376e7ca
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Nov 30 11:08:07 2021 +0100

    bpf: Let bpf_warn_invalid_xdp_action() report more info

    In non trivial scenarios, the action id alone is not sufficient to
    identify the program causing the warning. Before the previous patch,
    the generated stack-trace pointed out at least the involved device
    driver.

    Let's additionally include the program name and id, and the relevant
    device name.

    If the user needs additional infos, he can fetch them via a kernel
    probe, leveraging the arguments added here.

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 16:13:14 +02:00
Jiri Benc 780af5a81f bpf, devmap: Exclude XDP broadcast to master device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit aeea1b86f9363f3feabb496534d886f082a89f21
Author: Jussi Maki <joamaki@gmail.com>
Date:   Sat Jul 31 05:57:35 2021 +0000

    bpf, devmap: Exclude XDP broadcast to master device

    If the ingress device is bond slave, do not broadcast back through it or
    the bond master.

    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210731055738.16820-5-joamaki@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:47 +02:00
Jerome Marchand 850123ac6a bpf: devmap: Implement devmap prog execution for generic XDP
Bugzilla: http://bugzilla.redhat.com/2041365

commit 2ea5eabaf04a1829383aefe98ac38a2e5ae2d698
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:24 2021 +0530

    bpf: devmap: Implement devmap prog execution for generic XDP

    This lifts the restriction on running devmap BPF progs in generic
    redirect mode. To match native XDP behavior, it is invoked right before
    generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
    XDP_DROP actions.

    We also return 0 even if devmap program drops the packet, as
    semantically redirect has already succeeded and the devmap prog is the
    last point before TX of the packet to device where it can deliver a
    verdict on the packet.

    This also means it must take care of freeing the skb, as
    xdp_do_generic_redirect callers only do that in case an error is
    returned.

    Since devmap entry prog is supported, remove the check in
    generic_xdp_install entirely.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Toke Høiland-Jørgensen 0fc4dcc13f bpf, devmap: Convert remaining READ_ONCE() to rcu_dereference_check()
There were a couple of READ_ONCE()-invocations left-over by the devmap
RCU conversion. Convert these to rcu_dereference_check() as well to avoid
complaints from sparse.

Fixes: 782347b6bc ("xdp: Add proper __rcu annotations to redirect map entries")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210629093907.573598-1-toke@redhat.com
2021-07-01 09:28:38 +02:00
Jakub Kicinski b6df00789e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29 15:45:27 -07:00
Toke Høiland-Jørgensen 782347b6bc xdp: Add proper __rcu annotations to redirect map entries
XDP_REDIRECT works by a three-step process: the bpf_redirect() and
bpf_redirect_map() helpers will lookup the target of the redirect and store
it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
Next, when the program returns the XDP_REDIRECT return code, the driver
will call xdp_do_redirect() which will use the information thus stored to
actually enqueue the frame into a bulk queue structure (that differs
slightly by map type, but shares the same principle). Finally, before
exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
flush all the different bulk queues, thus completing the redirect.

Pointers to the map entries will be kept around for this whole sequence of
steps, protected by RCU. However, there is no top-level rcu_read_lock() in
the core code; instead drivers add their own rcu_read_lock() around the XDP
portions of the code, but somewhat inconsistently as Martin discovered[0].
However, things still work because everything happens inside a single NAPI
poll sequence, which means it's between a pair of calls to
local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
document this intention by using rcu_dereference_check() with
rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
lockdep to verify that everything is done correctly.

This patch does just that: we add an __rcu annotation to the map entry
pointers and remove the various comments explaining the NAPI poll assurance
strewn through devmap.c in favour of a longer explanation in filter.c. The
goal is to have one coherent documentation of the entire flow, and rely on
the RCU annotations as a "standard" way of communicating the flow in the
map code (which can additionally be understood by sparse and lockdep).

The RCU annotation replacements result in a fairly straight-forward
replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the
proper constructs to cast the pointer back and forth between __rcu and
__kernel address space (for the benefit of sparse). The one complication is
that xskmap has a few constructions where double-pointers are passed back
and forth; these simply all gain __rcu annotations, and only the final
reference/dereference to the inner-most pointer gets changed.

With this, everything can be run through sparse without eliciting
complaints, and lockdep can verify correctness even without the use of
rcu_read_lock() in the drivers. Subsequent patches will clean these up from
the drivers.

[0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
[1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
2021-06-24 19:41:15 +02:00
Bui Quang Minh 7dd5d437c2 bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc
In 32-bit architecture, the result of sizeof() is a 32-bit integer so
the expression becomes the multiplication between 2 32-bit integer which
can potentially leads to integer overflow. As a result,
bpf_map_area_alloc() allocates less memory than needed.

Fix this by casting 1 operand to u64.

Fixes: 0d2c4f9640 ("bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps")
Fixes: 99c51064fb ("devmap: Use bpf_map_area_alloc() for allocating hash buckets")
Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210613143440.71975-1-minhquangbui99@gmail.com
2021-06-22 10:14:29 -07:00
Hangbin Liu e8e0f0f484 bpf, devmap: Remove drops variable from bq_xmit_all()
As Colin pointed out, the first drops assignment after declaration will
be overwritten by the second drops assignment before using, which makes
it useless.

Since the drops variable will be used only once. Just remove it and
use "cnt - sent" in trace_xdp_devmap_xmit().

Fixes: cb261b594b ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210528024356.24333-1-liuhangbin@gmail.com
2021-05-28 22:14:18 +02:00
Hangbin Liu e624d4ed4a xdp: Extend xdp_redirect_map with broadcast support
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.

With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.

When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.

Function bpf_clear_redirect_map() was removed in
commit ee75aef23a ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.

With test topology:
  +-------------------+             +-------------------+
  | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
  +-------------------+             |                   |
                                    |   Host B          |
  +-------------------+             |                   |
  | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
  +-------------------+             |                   |
                                    |          +------+ |
                                    | veth0 -- | Peer | |
                                    | veth1 -- |      | |
                                    | veth2 -- |  NS  | |
                                    |          +------+ |
                                    +-------------------+

On Host A:
 # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
All the veth peers in the NS have a XDP_DROP program loaded. The
forward_map max_entries in xdp_redirect_map_multi is modify to 4.

Testing the performance impact on the regular xdp_redirect path with and
without patch (to check impact of additional check for broadcast mode):

5.12 rc4         | redirect_map        i40e->i40e      |    2.0M |  9.7M
5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.8M
5.12 rc4 + patch | redirect_map        i40e->i40e      |    2.0M |  9.6M
5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.7M

Testing the performance when cloning packets with the redirect_map_multi
test, using a redirect map size of 4, filled with 1-3 devices:

5.12 rc4 + patch | redirect_map multi  i40e->veth (x1) |    1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi  i40e->veth (x2) |    1.1M |  4.3M
5.12 rc4 + patch | redirect_map multi  i40e->veth (x3) |    0.8M |  2.6M

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
2021-05-26 09:46:16 +02:00
Jesper Dangaard Brouer cb261b594b bpf: Run devmap xdp_prog on flush instead of bulk enqueue
This changes the devmap XDP program support to run the program when the
bulk queue is flushed instead of before the frame is enqueued. This has
a couple of benefits:

- It "sorts" the packets by destination devmap entry, and then runs the
  same BPF program on all the packets in sequence. This ensures that we
  keep the XDP program and destination device properties hot in I-cache.

- It makes the multicast implementation simpler because it can just
  enqueue packets using bq_enqueue() without having to deal with the
  devmap program at all.

The drawback is that if the devmap program drops the packet, the enqueue
step is redundant. However, arguably this is mostly visible in a
micro-benchmark, and with more mixed traffic the I-cache benefit should
win out. The performance impact of just this patch is as follows:

Using 2 10Gb i40e NIC, redirecting one to another, or into a veth interface,
which do XDP_DROP on veth peer. With xdp_redirect_map in sample/bpf, send
pkts via pktgen cmd:
./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

There are about +/- 0.1M deviation for native testing, the performance
improved for the base-case, but some drop back with xdp devmap prog attached.

Version          | Test                           | Generic | Native | Native + 2nd xdp_prog
5.12 rc4         | xdp_redirect_map   i40e->i40e  |    1.9M |   9.6M |  8.4M
5.12 rc4         | xdp_redirect_map   i40e->veth  |    1.7M |  11.7M |  9.8M
5.12 rc4 + patch | xdp_redirect_map   i40e->i40e  |    1.9M |   9.8M |  8.0M
5.12 rc4 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  12.0M |  9.4M

When bq_xmit_all() is called from bq_enqueue(), another packet will
always be enqueued immediately after, so clearing dev_rx, xdp_prog and
flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
and only check them once in bq_enqueue() since they are all modified
together.

This change also has the side effect of extending the lifetime of the
RCU-protected xdp_prog that lives inside the devmap entries: Instead of
just living for the duration of the XDP program invocation, the
reference now lives all the way until the bq is flushed. This is safe
because the bq flush happens at the end of the NAPI poll loop, so
everything happens between a local_bh_disable()/local_bh_enable() pair.
However, this is by no means obvious from looking at the call sites; in
particular, some drivers have an additional rcu_read_lock() around only
the XDP program invocation, which only confuses matters further.
Cleaning this up will be done in a separate patch series.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-2-liuhangbin@gmail.com
2021-05-26 09:46:16 +02:00
Zhen Lei 8fb33b6055 bpf: Fix spelling mistakes
Fix some spelling mistakes in comments:
aother ==> another
Netiher ==> Neither
desribe ==> describe
intializing ==> initializing
funciton ==> function
wont ==> won't and move the word 'the' at the end to the next line
accross ==> across
pathes ==> paths
triggerred ==> triggered
excute ==> execute
ether ==> either
conervative ==> conservative
convetion ==> convention
markes ==> marks
interpeter ==> interpreter

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210525025659.8898-2-thunder.leizhen@huawei.com
2021-05-24 21:13:05 -07:00
Lorenzo Bianconi fdc13979f9 bpf, devmap: Move drop error path to devmap for XDP_REDIRECT
We want to change the current ndo_xdp_xmit drop semantics because it will
allow us to implement better queue overflow handling. This is working
towards the larger goal of a XDP TX queue-hook. Move XDP_REDIRECT error
path handling from each XDP ethernet driver to devmap code. According to
the new APIs, the driver running the ndo_xdp_xmit pointer, will break tx
loop whenever the hw reports a tx error and it will just return to devmap
caller the number of successfully transmitted frames. It will be devmap
responsibility to free dropped frames.

Move each XDP ndo_xdp_xmit capable driver to the new APIs:

- veth
- virtio-net
- mvneta
- mvpp2
- socionext
- amazon ena
- bnxt
- freescale (dpaa2, dpaa)
- xen-frontend
- qede
- ice
- igb
- ixgbe
- i40e
- mlx5
- ti (cpsw, cpsw-new)
- tun
- sfc

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org
2021-03-18 16:38:51 +01:00
Björn Töpel ee75aef23a bpf, xdp: Restructure redirect actions
The XDP_REDIRECT implementations for maps and non-maps are fairly
similar, but obviously need to take different code paths depending on
if the target is using a map or not. Today, the redirect targets for
XDP either uses a map, or is based on ifindex.

Here, the map type and id are added to bpf_redirect_info, instead of
the actual map. Map type, map item/ifindex, and the map_id (if any) is
passed to xdp_do_redirect().

For ifindex-based redirect, used by the bpf_redirect() XDP BFP helper,
a special map type/id are used. Map type of UNSPEC together with map id
equal to INT_MAX has the special meaning of an ifindex based
redirect. Note that valid map ids are 1 inclusive, INT_MAX exclusive
([1,INT_MAX[).

In addition to making the code easier to follow, using explicit type
and id in bpf_redirect_info has a slight positive performance impact
by avoiding a pointer indirection for the map type lookup, and instead
use the cacheline for bpf_redirect_info.

Since the actual map is not passed via bpf_redirect_info anymore, the
map lookup is only done in the BPF helper. This means that the
bpf_clear_redirect_map() function can be removed. The actual map item
is RCU protected.

The bpf_redirect_info flags member is not used by XDP, and not
read/written any more. The map member is only written to when
required/used, and not unconditionally.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-3-bjorn.topel@gmail.com
2021-03-10 01:06:34 +01:00
Björn Töpel e6a4750ffe bpf, xdp: Make bpf_redirect_map() a map operation
Currently the bpf_redirect_map() implementation dispatches to the
correct map-lookup function via a switch-statement. To avoid the
dispatching, this change adds bpf_redirect_map() as a map
operation. Each map provides its bpf_redirect_map() version, and
correct function is automatically selected by the BPF verifier.

A nice side-effect of the code movement is that the map lookup
functions are now local to the map implementation files, which removes
one additional function call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-2-bjorn.topel@gmail.com
2021-03-10 01:06:34 +01:00
Jun'ichi Nomura 7d4553b69f bpf, devmap: Use GFP_KERNEL for xdp bulk queue allocation
The devmap bulk queue is allocated with GFP_ATOMIC and the allocation
may fail if there is no available space in existing percpu pool.

Since commit 75ccae62cb ("xdp: Move devmap bulk queue into struct net_device")
moved the bulk queue allocation to NETDEV_REGISTER callback, whose context
is allowed to sleep, use GFP_KERNEL instead of GFP_ATOMIC to let percpu
allocator extend the pool when needed and avoid possible failure of netdev
registration.

As the required alignment is natural, we can simply use alloc_percpu().

Fixes: 75ccae62cb ("xdp: Move devmap bulk queue into struct net_device")
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210209082451.GA44021@jeru.linux.bs1.fc.nec.co.jp
2021-02-13 00:11:26 +01:00
Roman Gushchin 844f157f6c bpf: Eliminate rlimit-based memory accounting for devmap maps
Do not use rlimit-based memory accounting for devmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-23-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 1440290adf bpf: Refine memcg-based memory accounting for devmap maps
Include map metadata and the node size (struct bpf_dtab_netdev)
into the accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-12-guro@fb.com
2020-12-02 18:32:45 -08:00
Björn Töpel ebc4ecd48c bpf: {cpu,dev}map: Change various functions return type from int to void
The functions bq_enqueue(), bq_flush_to_queue(), and bq_xmit_all() in
{cpu,dev}map.c always return zero. Changing the return type from int
to void makes the code easier to follow.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200901083928.6199-1-bjorn.topel@gmail.com
2020-09-01 15:45:58 +02:00
Martin KaFai Lau f4d0525921 bpf: Add map_meta_equal map ops
Some properties of the inner map is used in the verification time.
When an inner map is inserted to an outer map at runtime,
bpf_map_meta_equal() is currently used to ensure those properties
of the inserting inner map stays the same as the verification
time.

In particular, the current bpf_map_meta_equal() checks max_entries which
turns out to be too restrictive for most of the maps which do not use
max_entries during the verification time.  It limits the use case that
wants to replace a smaller inner map with a larger inner map.  There are
some maps do use max_entries during verification though.  For example,
the map_gen_lookup in array_map_ops uses the max_entries to generate
the inline lookup code.

To accommodate differences between maps, the map_meta_equal is added
to bpf_map_ops.  Each map-type can decide what to check when its
map is used as an inner map during runtime.

Also, some map types cannot be used as an inner map and they are
currently black listed in bpf_map_meta_alloc() in map_in_map.c.
It is not unusual that the new map types may not aware that such
blacklist exists.  This patch enforces an explicit opt-in
and only allows a map to be used as an inner map if it has
implemented the map_meta_equal ops.  It is based on the
discussion in [1].

All maps that support inner map has its map_meta_equal points
to bpf_map_meta_equal in this patch.  A later patch will
relax the max_entries check for most maps.  bpf_types.h
counts 28 map types.  This patch adds 23 ".map_meta_equal"
by using coccinelle.  -5 for
	BPF_MAP_TYPE_PROG_ARRAY
	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
	BPF_MAP_TYPE_STRUCT_OPS
	BPF_MAP_TYPE_ARRAY_OF_MAPS
	BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
David S. Miller f91c031e65 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-07-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 17 day(s) which contain
a total of 106 files changed, 5233 insertions(+), 1283 deletions(-).

The main changes are:

1) bpftool ability to show PIDs of processes having open file descriptors
   for BPF map/program/link/BTF objects, relying on BPF iterator progs
   to extract this info efficiently, from Andrii Nakryiko.

2) Addition of BPF iterator progs for dumping TCP and UDP sockets to
   seq_files, from Yonghong Song.

3) Support access to BPF map fields in struct bpf_map from programs
   through BTF struct access, from Andrey Ignatov.

4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack
   via seq_file from BPF iterator progs, from Song Liu.

5) Make SO_KEEPALIVE and related options available to bpf_setsockopt()
   helper, from Dmitry Yakunin.

6) Optimize BPF sk_storage selection of its caching index, from Martin
   KaFai Lau.

7) Removal of redundant synchronize_rcu()s from BPF map destruction which
   has been a historic leftover, from Alexei Starovoitov.

8) Several improvements to test_progs to make it easier to create a shell
   loop that invokes each test individually which is useful for some CIs,
   from Jesper Dangaard Brouer.

9) Fix bpftool prog dump segfault when compiled without skeleton code on
   older clang versions, from John Fastabend.

10) Bunch of cleanups and minor improvements, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:48:34 -07:00
Andrey Ignatov 2872e9ac33 bpf: Set map_btf_{name, id} for all map types
Set map_btf_name and map_btf_id for all map types so that map fields can
be accessed by bpf programs.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
2020-06-22 22:22:58 +02:00
Toke Høiland-Jørgensen 99c51064fb devmap: Use bpf_map_area_alloc() for allocating hash buckets
Syzkaller discovered that creating a hash of type devmap_hash with a large
number of entries can hit the memory allocator limit for allocating
contiguous memory regions. There's really no reason to use kmalloc_array()
directly in the devmap code, so just switch it to the existing
bpf_map_area_alloc() function that is used elsewhere.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200616142829.114173-1-toke@redhat.com
2020-06-17 10:01:19 -07:00
Jesper Dangaard Brouer 281920b7e0 bpf: Devmap adjust uapi for attach bpf program
V2:
- Defer changing BPF-syscall to start at file-descriptor 1
- Use {} to zero initialise struct.

The recent commit fbee97feed ("bpf: Add support to attach bpf program to a
devmap entry"), introduced ability to attach (and run) a separate XDP
bpf_prog for each devmap entry. A bpf_prog is added via a file-descriptor.
As zero were a valid FD, not using the feature requires using value minus-1.
The UAPI is extended via tail-extending struct bpf_devmap_val and using
map->value_size to determine the feature set.

This will break older userspace applications not using the bpf_prog feature.
Consider an old userspace app that is compiled against newer kernel
uapi/bpf.h, it will not know that it need to initialise the member
bpf_prog.fd to minus-1. Thus, users will be forced to update source code to
get program running on newer kernels.

This patch remove the minus-1 checks, and have zero mean feature isn't used.

Followup patches either for kernel or libbpf should handle and avoid
returning file-descriptor zero in the first place.

Fixes: fbee97feed ("bpf: Add support to attach bpf program to a devmap entry")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159170950687.2102545.7235914718298050113.stgit@firesoul
2020-06-09 11:36:18 -07:00
David Ahern 26afa0a4eb bpf: Reset data_meta before running programs attached to devmap entry
This is a new context that does not handle metadata at the moment, so
mark data_meta invalid.

Fixes: fbee97feed ("bpf: Add support to attach bpf program to a devmap entry")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200608151723.9539-1-dsahern@kernel.org
2020-06-09 11:19:35 -07:00
Lorenzo Bianconi 1b698fa5d8 xdp: Rename convert_to_xdp_frame in xdp_convert_buff_to_frame
In order to use standard 'xdp' prefix, rename convert_to_xdp_frame
utility routine in xdp_convert_buff_to_frame and replace all the
occurrences

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org
2020-06-01 15:02:53 -07:00
David Ahern 64b59025c1 xdp: Add xdp_txq_info to xdp_buff
Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.

>From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.

Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern fbee97feed bpf: Add support to attach bpf program to a devmap entry
Add BPF_XDP_DEVMAP attach type for use with programs associated with a
DEVMAP entry.

Allow DEVMAPs to associate a program with a device entry by adding
a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
id, so the fd and id are a union. bpf programs can get access to the
struct via vmlinux.h.

The program associated with the fd must have type XDP with expected
attach type BPF_XDP_DEVMAP. When a program is associated with a device
index, the program is run on an XDP_REDIRECT and before the buffer is
added to the per-cpu queue. At this point rxq data is still valid; the
next patch adds tx device information allowing the prorgam to see both
ingress and egress device indices.

XDP generic is skb based and XDP programs do not work with skb's. Block
the use case by walking maps used by a program that is to be attached
via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with

Block attach of BPF_XDP_DEVMAP programs to devices.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern 7f1c04269f devmap: Formalize map value as a named struct
Add 'struct bpf_devmap_val' to formalize the expected values that can
be passed in for a DEVMAP. Update devmap code to use the struct.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-2-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:31 -07:00
Ioana Ciornei 788f87ac60 xdp: export the DEV_MAP_BULK_SIZE macro
Export the DEV_MAP_BULK_SIZE macro to the header file so that drivers
can directly use it as the maximum number of xdp_frames received in the
.ndo_xdp_xmit() callback.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 20:11:29 -07:00
John Fastabend b23bfa5633 bpf, xdp: Remove no longer required rcu_read_{un}lock()
Now that we depend on rcu_call() and synchronize_rcu() to also wait
for preempt_disabled region to complete the rcu read critical section
in __dev_map_flush() is no longer required. Except in a few special
cases in drivers that need it for other reasons.

These originally ensured the map reference was safe while a map was
also being free'd. And additionally that bpf program updates via
ndo_bpf did not happen while flush updates were in flight. But flush
by new rules can only be called from preempt-disabled NAPI context.
The synchronize_rcu from the map free path and the rcu_call from the
delete path will ensure the reference there is safe. So lets remove
the rcu_read_lock and rcu_read_unlock pair to avoid any confusion
around how this is being protected.

If the rcu_read_lock was required it would mean errors in the above
logic and the original patch would also be wrong.

Now that we have done above we put the rcu_read_lock in the driver
code where it is needed in a driver dependent way. I think this
helps readability of the code so we know where and why we are
taking read locks. Most drivers will not need rcu_read_locks here
and further XDP drivers already have rcu_read_locks in their code
paths for reading xdp programs on RX side so this makes it symmetric
where we don't have half of rcu critical sections define in driver
and the other half in devmap.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com
2020-01-27 11:16:25 +01:00
John Fastabend 42a84a8cd0 bpf, xdp: Update devmap comments to reflect napi/rcu usage
Now that we rely on synchronize_rcu and call_rcu waiting to
exit perempt-disable regions (NAPI) lets update the comments
to reflect this.

Fixes: 0536b85239 ("xdp: Simplify devmap cleanup")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-2-git-send-email-john.fastabend@gmail.com
2020-01-27 11:16:20 +01:00
Amol Grover 485ec2ea9c bpf, devmap: Pass lockdep expression to RCU lists
head is traversed using hlist_for_each_entry_rcu outside an RCU
read-side critical section but under the protection of dtab->index_lock.

Hence, add corresponding lockdep expression to silence false-positive
lockdep warnings, and harden RCU lists.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com
2020-01-23 23:01:16 +01:00