Commit Graph

702 Commits

Author SHA1 Message Date
Toke Høiland-Jørgensen e9b6f5c14c bpf: Add bpf_sock_destroy kfunc
JIRA: https://issues.redhat.com/browse/RHEL-65787

Conflicts: Context difference due to missing af9784d007d8 ("tcp: diag:
add support for TIME_WAIT sockets to tcp_abort()") and out-of-order
backport of bac76cf89816 ("tcp: fix forever orphan socket caused by
tcp_abort")

commit 4ddbcb886268af8d12a23e6640b39d1d9c652b1b
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:55 2023 +0000

    bpf: Add bpf_sock_destroy kfunc

    The socket destroy kfunc is used to forcefully terminate sockets from
    certain BPF contexts. We plan to use the capability in Cilium
    load-balancing to terminate client sockets that continue to connect to
    deleted backends.  The other use case is on-the-fly policy enforcement
    where existing socket connections prevented by policies need to be
    forcefully terminated.  The kfunc also allows terminating sockets that may
    or may not be actively sending traffic.

    The kfunc can currently be called only from BPF TCP and UDP iterators,
    where users can filter and terminate selected sockets. More
    specifically, it can only be called from BPF contexts that ensure
    socket locking in order to allow synchronous execution of protocol
    specific `diag_destroy` handlers. The previous commit that batches UDP
    sockets during iteration facilitated a synchronous invocation of the UDP
    destroy callback from BPF context by skipping socket locks in
    `udp_abort`. TCP iterator already supported batching of sockets being
    iterated. To that end, `tracing_iter_filter` callback filter is added so
    that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
    attach type, and reject other programs.
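
As a rough illustration of the filtering idea in the paragraph above, the following userspace C model rejects a restricted kfunc unless the calling program has an iterator attach type. All `model_*` names and enum values are hypothetical stand-ins invented for this sketch, not kernel definitions; the real check is performed by the verifier.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for program attach types (illustrative only). */
enum model_attach_type {
	MODEL_TRACE_FENTRY,
	MODEL_TRACE_ITER,
};

/* Hypothetical kfunc identifiers used only by this model. */
enum model_kfunc_id {
	MODEL_KFUNC_OTHER,
	MODEL_KFUNC_SOCK_DESTROY,
};

/* Return 0 if the program may call the kfunc, -EACCES otherwise. */
static int model_iter_filter(enum model_attach_type attach_type,
			     enum model_kfunc_id kfunc_id)
{
	bool restricted = (kfunc_id == MODEL_KFUNC_SOCK_DESTROY);

	if (restricted && attach_type != MODEL_TRACE_ITER)
		return -EACCES;
	return 0;
}
```

In this model an iterator program passes the check, a non-iterator program calling the restricted kfunc gets `-EACCES`, and unrelated kfuncs are unaffected.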

    The kfunc takes a `sock_common` type argument, even though it expects a
    `sock` pointer and casts the argument to one. This enables the verifier to
    allow the sock_destroy kfunc to be called for TCP with `sock_common` and
    UDP with `sock` structs. Furthermore, as `sock_common` only has a subset
    of the fields of `sock`, casting a pointer to the latter type might not
    always be safe for certain sockets like request sockets, but these have
    special handling in the diag_destroy handlers.

    Additionally, the kfunc is defined with the `KF_TRUSTED_ARGS` flag to avoid
    the cases where a `PTR_TO_BTF_ID` sk is obtained by following another
    pointer, e.g. getting a sk pointer (maybe even NULL) by following another
    sk pointer. The pointer socket argument passed in TCP and UDP iterators is
    tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info.  The TRUSTED arg changes
    are contributed by Martin KaFai Lau <martin.lau@kernel.org>.

    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen 692dba9fd0 bpf: Avoid iter->offset making backward progress in bpf_iter_udp
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit 2242fd537fab52d5f4d2fbb1845f047c01fad0cf
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Fri Jan 12 11:05:29 2024 -0800

    bpf: Avoid iter->offset making backward progress in bpf_iter_udp

    There is a bug in the bpf_iter_udp_batch() function that stops
    the userspace from making forward progress.

    The case that triggers the bug is the userspace passed in
    a very small read buffer. When the bpf prog does bpf_seq_printf,
    the userspace read buffer is not enough to capture the whole bucket.

    When the read buffer is not large enough, the kernel will remember
    the offset of the bucket in iter->offset such that the next userspace
    read() can continue from where it left off.

    The kernel will skip the number (== "iter->offset") of sockets in
    the next read(). However, the code directly decrements the
    "--iter->offset". This is incorrect because the next read() may
    not consume the whole bucket either and then the next-next read()
    will start from offset 0. The net effect is the userspace will
    keep reading from the beginning of a bucket and the process will
    never finish. "iter->offset" must always go forward until the
    whole bucket is consumed.

    This patch fixes it by using a local variable "resume_offset"
    and "resume_bucket". "iter->offset" is always reset to 0 before
    it may be used. "iter->offset" will be advanced to the
    "resume_offset" when it continues from the "resume_bucket" (i.e.
    "state->bucket == resume_bucket"). This brings it closer to
    the bpf_iter_tcp's offset handling which does not suffer
    the same bug.
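
The offset handling described above can be sketched with a small userspace model. This is only an approximation under stated assumptions: one bucket, a read that can show at most `cap` sockets, and hypothetical `model_*` names that do not exist in the kernel. The `fixed` flag switches between consuming the offset while skipping (the bug) and keeping the resume point in a local variable (the fix).

```c
#include <stdbool.h>
#include <stddef.h>

/* Model iterator state: sockets of the current bucket already shown. */
struct model_iter {
	size_t offset;
};

/*
 * One read(): skip the sockets already shown, then show at most `cap`
 * more out of `bucket_len`. Returns the index of the first socket shown
 * (bucket_len if nothing was left).
 */
static size_t model_read(struct model_iter *it, size_t bucket_len,
			 size_t cap, bool fixed)
{
	size_t start, shown;

	if (fixed) {
		/* Resume point lives in a local; offset only moves forward. */
		size_t resume_offset = it->offset;

		start = resume_offset < bucket_len ? resume_offset : bucket_len;
		it->offset = start;
	} else {
		/* Buggy: skipping decrements the stored offset down to zero. */
		start = it->offset < bucket_len ? it->offset : bucket_len;
		it->offset = 0;
	}

	shown = bucket_len - start < cap ? bucket_len - start : cap;
	it->offset += shown;
	return start;
}

/* Count the reads needed to drain one bucket; -1 if max_reads is hit. */
static int model_reads_to_finish(size_t bucket_len, size_t cap,
				 bool fixed, int max_reads)
{
	struct model_iter it = { .offset = 0 };
	int reads;

	for (reads = 1; reads <= max_reads; reads++) {
		size_t start = model_read(&it, bucket_len, cap, fixed);

		if (bucket_len - start <= cap)
			return reads; /* this read reached the bucket's end */
	}
	return -1; /* no forward progress within max_reads */
}
```

With a bucket of 10 sockets and room for 3 per read, the fixed variant finishes in four reads, while the buggy variant keeps restarting from index 3 and never reaches the end.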

    Cc: Aditi Ghag <aditi.ghag@isovalent.com>
    Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20240112190530.3751661-3-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen 61c014209a bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit 19ca0823f6eaad01d18f664a00550abe912c034c
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Fri Jan 12 11:05:28 2024 -0800

    bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket

    The current logic is to use a default size 16 to batch the whole bucket.
    If it is too small, it will retry with a larger batch size.

    The current code accidentally does a state->bucket-- before retrying.
    This goes back to retry with the previous bucket which has already
    been done. This patch fixed it.

    It is hard to create a selftest. I added a WARN_ON(state->bucket < 0),
    forced a particular port to be hashed to the first bucket,
    created >16 sockets, and observed the for-loop went back
    to the "-1" bucket.
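
The retry behaviour described above can be modelled in userspace as follows. This is a simplified sketch, not the kernel loop: bucket contents are reduced to socket counts, the `model_*` names are hypothetical, and a negative bucket index is skipped rather than dereferenced so the model stays well defined.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_INIT_BATCH 16 /* default batch size named in the text */

struct model_result {
	int first_retry_bucket; /* bucket the retry started from, -2 if none */
	int batched_bucket;     /* bucket whose sockets ended up batched */
	size_t batch_cap;       /* final batch capacity */
};

static struct model_result model_batch(const size_t *bucket_len, int nbuckets,
				       int start_bucket, bool fixed)
{
	struct model_result res = {
		.first_retry_bucket = -2,
		.batched_bucket = -1,
		.batch_cap = MODEL_INIT_BATCH,
	};
	int bucket = start_bucket;
	bool resized = false;

again:
	for (; bucket < nbuckets; bucket++) {
		if (bucket < 0)
			continue; /* real code would index out of bounds here */
		if (bucket_len[bucket])
			break;
	}
	if (bucket >= nbuckets)
		return res; /* no sockets left */

	if (bucket_len[bucket] <= res.batch_cap || resized) {
		res.batched_bucket = bucket;
		return res;
	}

	/* Batch too small: grow it and retry once. */
	res.batch_cap = bucket_len[bucket] * 3 / 2;
	resized = true;
	if (!fixed)
		bucket--; /* the accidental step back to the previous bucket */
	res.first_retry_bucket = bucket;
	goto again;
}
```

With more than 16 sockets in the first bucket, the buggy variant restarts from bucket -1, matching the observation in the commit message. When the oversized bucket is not the first one, the buggy variant batches the previous bucket a second time.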

    Cc: Aditi Ghag <aditi.ghag@isovalent.com>
    Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20240112190530.3751661-2-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen e08ca1e323 bpf: udp: Implement batching for sockets iterator
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit c96dac8d369ffd713a45f4e5c30f23c47a1671f0
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:53 2023 +0000

    bpf: udp: Implement batching for sockets iterator

    Batch UDP sockets in the BPF iterator, which allows for overlapping locking
    semantics in BPF/kernel helpers executed in BPF programs.  This enables the
    BPF socket destroy kfunc (introduced by follow-up patches) to execute from
    BPF iterator programs.

    Previously, BPF iterators acquired the sock lock and sockets hash table
    bucket lock while executing BPF programs. This prevented BPF helpers that
    acquire these locks again from being executed from BPF iterators.  With the
    batching approach, we acquire a bucket lock, batch all the bucket sockets,
    and then release the bucket lock. This enables BPF or kernel helpers to
    skip sock locking when invoked in the supported BPF contexts.
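
A minimal userspace sketch of the batching approach described above, under loud assumptions: the lock is a plain boolean, sockets are integers, reference counting is omitted, and every `model_*` name is hypothetical. It only shows why copying the bucket under the lock and then releasing it lets a helper that needs the same lock run afterwards.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_BATCH 8

struct model_bucket {
	bool locked;
	int socks[MODEL_MAX_BATCH];
	size_t len;
};

struct model_snapshot {
	int socks[MODEL_MAX_BATCH];
	size_t len;
};

/* Copy the bucket's sockets while holding the lock, then release it. */
static void model_fill_batch(struct model_bucket *b, struct model_snapshot *out)
{
	size_t i;

	b->locked = true;
	out->len = b->len;
	for (i = 0; i < b->len; i++)
		out->socks[i] = b->socks[i];
	b->locked = false;
}

/* A helper that needs the bucket lock; fails if it is already held. */
static bool model_helper(struct model_bucket *b)
{
	if (b->locked)
		return false; /* would deadlock in a real system */
	b->locked = true;
	b->locked = false;
	return true;
}

/* Run the helper once per batched socket; return how many calls succeeded. */
static size_t model_iterate(struct model_bucket *b, bool batched)
{
	struct model_snapshot batch;
	size_t i, ok = 0;

	model_fill_batch(b, &batch);
	if (!batched)
		b->locked = true; /* old behaviour: lock held while programs run */
	for (i = 0; i < batch.len; i++)
		if (model_helper(b))
			ok++;
	b->locked = false;
	return ok;
}
```

In the batched mode every helper call succeeds; with the lock held across the loop none of them can run.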

    The batching logic is similar to the logic implemented in TCP iterator:
    https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.

    Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-6-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen cde5da29d8 udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit e4fe1bf13e09019578b9b93b942fff3d76ed5793
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:52 2023 +0000

    udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state

    This is a preparatory commit to remove the field. The field was
    previously shared between proc fs and BPF UDP socket iterators. As the
    follow-up commits will decouple the implementation for the iterators,
    remove the field. As for BPF socket iterator, filtering of sockets is
    expected to be done in BPF programs.

    Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-5-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen dcbb4b88e1 bpf: udp: Encapsulate logic to get udp table
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit 7625d2e9741c1f6e08ee79c28a1e27bbb5071805
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:51 2023 +0000

    bpf: udp: Encapsulate logic to get udp table

    This is a preparatory commit that encapsulates the logic
    to get udp table in iterator inside udp_get_table_afinfo, and
    renames the function to `udp_get_table_seq` accordingly.

    Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-4-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen 6a228086f4 udp: seq_file: Helper function to match socket attributes
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit f44b1c515833c59701c86f92d47b4edd478fb0f3
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:50 2023 +0000

    udp: seq_file: Helper function to match socket attributes

    This is a preparatory commit to refactor code that matches socket
    attributes in iterators to a helper function, and use it in the
    proc fs iterator.

    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-3-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen 25cf0b2b68 udp: Access &udp_table via net.
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit ba6aac1516779dd0ced22c136a2c2c4a9c70cf29
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Nov 14 13:57:56 2022 -0800

    udp: Access &udp_table via net.

    We will soon introduce an optional per-netns hash table
    for UDP.

    This means we cannot use udp_table directly in most places.

    Instead, access it via net->ipv4.udp_table.

    The access will be valid only while initialising udp_table
    itself and creating/destroying each netns.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:53 +01:00
Toke Høiland-Jørgensen c17fe8b439 udp: Set NULL to udp_seq_afinfo.udp_table.
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit 478aee5d6bf617c932f4e9c2981f17e86e093fc5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Nov 14 13:57:55 2022 -0800

    udp: Set NULL to udp_seq_afinfo.udp_table.

    We will soon introduce an optional per-netns hash table
    for UDP.

    This means we cannot use the global udp_seq_afinfo.udp_table
    to fetch a UDP hash table.

    Instead, set NULL to udp_seq_afinfo.udp_table for UDP and get
    a proper table from net->ipv4.udp_table.

    Note that we still need udp_seq_afinfo.udp_table for UDP LITE.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:53 +01:00
Toke Høiland-Jørgensen 80f3b66d62 udp: Set NULL to sk->sk_prot->h.udp_table.
JIRA: https://issues.redhat.com/browse/RHEL-65787

Conflicts: Context difference due to already backported
7a7160edf1bf ("net: Return errno in sk->sk_prot->get_port().")

commit 67fb43308f4b354f13aabcc66dd5d99bfbb7e838
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Nov 14 13:57:54 2022 -0800

    udp: Set NULL to sk->sk_prot->h.udp_table.

    We will soon introduce an optional per-netns hash table
    for UDP.

    This means we cannot use the global sk->sk_prot->h.udp_table
    to fetch a UDP hash table.

    Instead, set NULL to sk->sk_prot->h.udp_table for UDP and get
    a proper table from net->ipv4.udp_table.

    Note that we still need sk->sk_prot->h.udp_table for UDP LITE.
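
One way to picture the lookup described above is a fallback from a per-protocol table pointer to the per-netns table. The sketch below is a userspace model with hypothetical `model_*` types; it assumes only what the message states, namely that the protocol-level pointer is NULL for UDP and still set for UDP-Lite.

```c
#include <stddef.h>

struct model_table {
	const char *name;
};

/* Stands in for the per-netns table (net->ipv4.udp_table in the text). */
struct model_net {
	struct model_table *udp_table;
};

/* Stands in for the protocol-level pointer: NULL for UDP, set for UDP-Lite. */
struct model_proto {
	struct model_table *udp_table;
};

/* Prefer the protocol's own table; otherwise use the netns table. */
static struct model_table *model_get_table(const struct model_proto *prot,
					   const struct model_net *net)
{
	return prot->udp_table ? prot->udp_table : net->udp_table;
}
```

With this shape, plain UDP resolves to the netns table, while UDP-Lite keeps using its own table.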

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:53 +01:00
Toke Høiland-Jørgensen c9f94d6c3a udp: Clean up some functions.
JIRA: https://issues.redhat.com/browse/RHEL-65787

Conflicts: Context difference due to already backported
7a7160edf1bf ("net: Return errno in sk->sk_prot->get_port().")

commit 919dfa0b20ae56060dce0436eb710717f8987d18
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Nov 14 13:57:53 2022 -0800

    udp: Clean up some functions.

    This patch adds no functional change and cleans up some functions
    that the following patches touch around so that we make them tidy
    and easy to review/revert.  The change is mainly to keep reverse
    christmas tree order.
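
For readers unfamiliar with the convention, reverse christmas tree order means local variable declarations are sorted from the longest line to the shortest. A small illustrative function follows; the names are hypothetical and unrelated to the functions this commit touches.

```c
#include <stddef.h>

struct model_slot {
	unsigned int count;
};

/* Local declarations run from the longest line down to the shortest. */
static unsigned int model_total(const struct model_slot *slots, size_t nslots)
{
	const struct model_slot *slot;
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < nslots; i++) {
		slot = &slots[i];
		total += slot->count;
	}
	return total;
}
```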

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:52 +01:00
Lucas Zampieri 55f96777fb Merge: net: backport visibility improvements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765

JIRA: https://issues.redhat.com/browse/RHEL-48648

Various visibility improvements; mainly around drop reasons, reset reason and improved tracepoints this time.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-12 16:18:50 +00:00
CKI Backport Bot ca2a05a2e6 udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
JIRA: https://issues.redhat.com/browse/RHEL-51033
CVE: CVE-2024-41041

commit 5c0b485a8c6116516f33925b9ce5b6104a6eadfd
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Jul 9 12:13:56 2024 -0700

    udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().

    syzkaller triggered the warning [0] in udp_v4_early_demux().

    In udp_v[46]_early_demux() and sk_lookup(), we do not touch the refcount
    of the looked-up sk and use sock_pfree() as skb->destructor, so we check
    SOCK_RCU_FREE to ensure that the sk is safe to access during the RCU grace
    period.

    Currently, SOCK_RCU_FREE is flagged for a bound socket after being put
    into the hash table.  Moreover, the SOCK_RCU_FREE check is done too early
    in udp_v[46]_early_demux() and sk_lookup(), so there could be a small race
    window:

      CPU1                                 CPU2
      ----                                 ----
      udp_v4_early_demux()                 udp_lib_get_port()
      |                                    |- hlist_add_head_rcu()
      |- sk = __udp4_lib_demux_lookup()    |
      |- DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
                                           `- sock_set_flag(sk, SOCK_RCU_FREE)

    We had the same bug in TCP and fixed it in commit 871019b22d1b ("net:
    set SOCK_RCU_FREE before inserting socket into hashtable").

    Let's apply the same fix for UDP.
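
The ordering issue can be illustrated with a single-threaded userspace model. It is a sketch under assumptions: two booleans stand in for the flag and for hash-table visibility, the "concurrent lookup" is simulated by sampling the state at the instant the socket becomes visible, and all `model_*` names are hypothetical.

```c
#include <stdbool.h>

struct model_sock {
	bool rcu_free; /* stands in for SOCK_RCU_FREE */
	bool hashed;   /* visible to lookups once true */
};

/* What a lookup on another CPU could see right after publication. */
struct model_observation {
	bool visible;
	bool rcu_free;
};

static struct model_observation model_get_port(struct model_sock *sk,
					       bool flag_first)
{
	struct model_observation seen;

	if (flag_first)
		sk->rcu_free = true; /* fixed order: flag, then publish */

	sk->hashed = true;

	/* A concurrent lookup may run at this point. */
	seen.visible = sk->hashed;
	seen.rcu_free = sk->rcu_free;

	if (!flag_first)
		sk->rcu_free = true; /* old order: flag set after publishing */

	return seen;
}
```

In the old order the simulated lookup sees a visible socket without the flag, which corresponds to the race window drawn above. Setting the flag first closes that window in this model.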

    [0]:
    WARNING: CPU: 0 PID: 11198 at net/ipv4/udp.c:2599 udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
    Modules linked in:
    CPU: 0 PID: 11198 Comm: syz-executor.1 Not tainted 6.9.0-g93bda33046e7 #13
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
    Code: c5 7a 15 fe bb 01 00 00 00 44 89 e9 31 ff d3 e3 81 e3 bf ef ff ff 89 de e8 2c 74 15 fe 85 db 0f 85 02 06 00 00 e8 9f 7a 15 fe <0f> 0b e8 98 7a 15 fe 49 8d 7e 60 e8 4f 39 2f fe 49 c7 46 60 20 52
    RSP: 0018:ffffc9000ce3fa58 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8318c92c
    RDX: ffff888036ccde00 RSI: ffffffff8318c2f1 RDI: 0000000000000001
    RBP: ffff88805a2dd6e0 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0001ffffffffffff R12: ffff88805a2dd680
    R13: 0000000000000007 R14: ffff88800923f900 R15: ffff88805456004e
    FS:  00007fc449127640(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc449126e38 CR3: 000000003de4b002 CR4: 0000000000770ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    PKRU: 55555554
    Call Trace:
     <TASK>
     ip_rcv_finish_core.constprop.0+0xbdd/0xd20 net/ipv4/ip_input.c:349
     ip_rcv_finish+0xda/0x150 net/ipv4/ip_input.c:447
     NF_HOOK include/linux/netfilter.h:314 [inline]
     NF_HOOK include/linux/netfilter.h:308 [inline]
     ip_rcv+0x16c/0x180 net/ipv4/ip_input.c:569
     __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5624
     __netif_receive_skb+0x21/0xd0 net/core/dev.c:5738
     netif_receive_skb_internal net/core/dev.c:5824 [inline]
     netif_receive_skb+0x271/0x300 net/core/dev.c:5884
     tun_rx_batched drivers/net/tun.c:1549 [inline]
     tun_get_user+0x24db/0x2c50 drivers/net/tun.c:2002
     tun_chr_write_iter+0x107/0x1a0 drivers/net/tun.c:2048
     new_sync_write fs/read_write.c:497 [inline]
     vfs_write+0x76f/0x8d0 fs/read_write.c:590
     ksys_write+0xbf/0x190 fs/read_write.c:643
     __do_sys_write fs/read_write.c:655 [inline]
     __se_sys_write fs/read_write.c:652 [inline]
     __x64_sys_write+0x41/0x50 fs/read_write.c:652
     x64_sys_call+0xe66/0x1990 arch/x86/include/generated/asm/syscalls_64.h:2
     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
     do_syscall_64+0x4b/0x110 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x4b/0x53
    RIP: 0033:0x7fc44a68bc1f
    Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 cf f5 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 3c d0 f5 ff 48
    RSP: 002b:00007fc449126c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007fc44a68bc1f
    RDX: 0000000000000032 RSI: 00000000200000c0 RDI: 00000000000000c8
    RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000032 R11: 0000000000000293 R12: 0000000000000000
    R13: 000000000000000b R14: 00007fc44a5ec530 R15: 0000000000000000
     </TASK>

    Fixes: 6acc9b432e ("bpf: Add helper to retrieve socket in BPF")
    Reported-by: syzkaller <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20240709191356.24010-1-kuniyu@amazon.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-07-30 09:48:30 +00:00
Antoine Tenart 32669d8760 udp: use sk_skb_reason_drop to free rx packets
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit fc0cc9248843b37243fa5fd3287a121ec41d291f
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:24 2024 -0700

    udp: use sk_skb_reason_drop to free rx packets

    Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
    socket to the tracepoint.

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/r/202406011751.NpVN0sSk-lkp@intel.com/
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 4e410b55fd net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit e9669a00bba79442dd4862c57761333d6a020c24
Author: Balazs Scheidler <bazsi77@gmail.com>
Date:   Tue Mar 26 19:05:47 2024 +0100

    net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb

    The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
    and destination IP/port whereas this information can be critical in case
    of UDP/syslog.

    Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
    Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
    Link: https://lore.kernel.org/r/0c8b3e33dbf679e190be6f4c6736603a76988a20.1711475011.git.balazs.scheidler@axoflow.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 5d9b38e8a9 udp: do not accept non-tunnel GSO skbs landing in a tunnel
JIRA: https://issues.redhat.com/browse/RHEL-19729
Upstream Status: net.git

commit 3d010c8031e39f5fa1e8b13ada77e0321091011f
Author: Antoine Tenart <atenart@kernel.org>
Date:   Tue Mar 26 12:33:58 2024 +0100

    udp: do not accept non-tunnel GSO skbs landing in a tunnel

    When rx-udp-gro-forwarding is enabled UDP packets might be GROed when
    being forwarded. If such packets might land in a tunnel this can cause
    various issues and udp_gro_receive makes sure this isn't the case by
    looking for a matching socket. This is performed in
    udp4/6_gro_lookup_skb but only in the current netns. This is an issue
    with tunneled packets when the endpoint is in another netns. In such
    cases the packets will be GROed at the UDP level, which leads to various
    issues later on. The same thing can happen with rx-gro-list.

    We saw this with geneve packets being GROed at the UDP level. In such a
    case gso_size is set; later the packet goes through the geneve rx path,
    the geneve header is pulled, the offsets are adjusted and frag_list skbs
    are not adjusted with regard to geneve. When those skbs hit
    skb_fragment, it will misbehave. Different outcomes are possible
    depending on what the GROed skbs look like; from corrupted packets to
    kernel crashes.

    One example is a BUG_ON[1] triggered in skb_segment while processing the
    frag_list. Because gso_size is wrong (geneve header was pulled)
    skb_segment thinks there is "geneve header size" of data in frag_list,
    although it's in fact the next packet. The BUG_ON itself has nothing to
    do with the issue. This is only one of the potential issues.

    Looking up a matching socket in udp_gro_receive is fragile: the
    lookup could be extended to all netns (performance aside)
    but nothing prevents those packets from being modified in between and we
    could still not find a matching socket. It's OK to keep the current
    logic there as it should cover most cases but we also need to make sure
    we handle tunnel packets being GROed too early.

    This is done by extending the checks in udp_unexpected_gso: GSO packets
    lacking the SKB_GSO_UDP_TUNNEL/_CSUM bits and landing in a tunnel must
    be segmented.
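
The added condition can be sketched as a predicate in userspace C. This models only the new tunnel check described above and leaves out the checks that udp_unexpected_gso already performed. The flag values and `model_*` names are hypothetical and do not match the kernel's SKB_GSO_* bits.

```c
#include <stdbool.h>

/* Hypothetical flag values for this model only. */
enum {
	MODEL_GSO_UDP_L4          = 1 << 0,
	MODEL_GSO_FRAGLIST        = 1 << 1,
	MODEL_GSO_UDP_TUNNEL      = 1 << 2,
	MODEL_GSO_UDP_TUNNEL_CSUM = 1 << 3,
};

struct model_skb {
	bool is_gso;
	unsigned int gso_type;
};

struct model_udp_sock {
	bool is_tunnel; /* socket belongs to a tunnel endpoint */
};

/* True if a GSO packet without tunnel bits lands on a tunnel socket. */
static bool model_must_segment(const struct model_udp_sock *sk,
			       const struct model_skb *skb)
{
	const unsigned int tunnel_bits =
		MODEL_GSO_UDP_TUNNEL | MODEL_GSO_UDP_TUNNEL_CSUM;

	if (!skb->is_gso)
		return false;

	return sk->is_tunnel && !(skb->gso_type & tunnel_bits);
}
```

A packet aggregated at the plain UDP level that reaches a tunnel socket is flagged for segmentation, while a packet carrying either tunnel bit is left alone.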

    [1] kernel BUG at net/core/skbuff.c:4408!
        RIP: 0010:skb_segment+0xd2a/0xf70
        __udp_gso_segment+0xaa/0x560

    Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
    Fixes: 36707061d6 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
    Signed-off-by: Antoine Tenart <atenart@kernel.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-03-29 13:55:00 +01:00
Jeff Moyer 092f5d645a net: ioctl: Use kernel memory on protocol ioctl callbacks
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
  559260fd9d9a ("ipmr: do not acquire mrt_lock in
  ioctl(SIOCGETVIFCNT)").  I also pulled in header changes from commit
  949d6b405e61 ("net: add missing includes and forward declarations
  under net/") to address a build failure with this patch applied.

commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Jun 9 08:27:42 2023 -0700

    net: ioctl: Use kernel memory on protocol ioctl callbacks
    
    Most of the ioctls to net protocols operate directly on the userspace
    argument (arg), usually doing get_user()/put_user() directly in the
    ioctl callback.  This is not flexible, because it is hard to reuse these
    functions without passing userspace buffers.

    Change the "struct proto" ioctls to avoid touching userspace memory and
    operate on kernel buffers, i.e., all protocols' ioctl callbacks are
    adapted to operate on kernel memory rather than on userspace (so, no
    more {put,get}_user() and friends being called in the ioctl callback).
    
    This changes the "struct proto" ioctl format in the following way:
    
        int                     (*ioctl)(struct sock *sk, int cmd,
    -                                        unsigned long arg);
    +                                        int *karg);
    
    (Important to say that this patch does not touch the "struct proto_ops"
    protocols)
    
    So, the "karg" argument, which is passed to the ioctl callback, is a
    pointer to kernel space memory (allocated inside a function wrapper).
    This buffer (karg) may contain an input argument (copied from userspace in
    a prep function) and it might return a value/buffer, which is copied
    back to userspace if necessary. There is no one-size-fits-all format
    (that is why I am using 'may' above), but basically, there are three types
    of ioctls:
    
    1) Do not read from userspace, returns a result to userspace
    2) Read an input parameter from userspace, and does not return anything
      to userspace
    3) Read an input from userspace, and return a buffer to userspace.
    
    The default case (1) (where no input parameter is given, and an "int" is
    returned to userspace) encompasses more than 90% of the cases, but there
    are two other exceptions. Here is a list of exceptions:
    
    * Protocol RAW:
       * cmd = SIOCGETVIFCNT:
         * input and output = struct sioc_vif_req
       * cmd = SIOCGETSGCNT
         * input and output = struct sioc_sg_req
       * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
         argument, which is struct sioc_vif_req. Then the callback populates
         the struct, which is copied back to userspace.
    
    * Protocol RAW6:
       * cmd = SIOCGETMIFCNT_IN6
         * input and output = struct sioc_mif_req6
       * cmd = SIOCGETSGCNT_IN6
         * input and output = struct sioc_sg_req6
    
    * Protocol PHONET:
      * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
         * input int (4 bytes)
      * Nothing is copied back to userspace.
    
    For the exception cases, the function sock_sk_ioctl_inout() will
    copy the userspace input into kernel space, and copy the result back to
    userspace.

    The wrapper that prepares the buffer and puts the buffer back to userspace
    is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the caller now
    calls sk_ioctl(), which will handle all cases.
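
The split between a wrapper that moves data and a callback that only sees kernel memory can be sketched in userspace. This is a model under assumptions: `memcpy` stands in for the user copy helpers, the two commands and all `model_*` names are hypothetical, and only cases 1 and 2 from the list above are covered.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical commands for this model. */
enum model_cmd {
	MODEL_CMD_GET_QUEUE,    /* case 1: no input, returns an int */
	MODEL_CMD_ADD_RESOURCE, /* case 2: int input, nothing returned */
};

/* Protocol callback: touches only the kernel-side int it is handed. */
static int model_proto_ioctl(int cmd, int *karg)
{
	switch (cmd) {
	case MODEL_CMD_GET_QUEUE:
		*karg = 42; /* arbitrary result for the sketch */
		return 0;
	case MODEL_CMD_ADD_RESOURCE:
		return *karg > 0 ? 0 : -EINVAL;
	}
	return -EINVAL;
}

/* Wrapper: copies between the "user" buffer and a kernel-side copy. */
static int model_sk_ioctl(int cmd, void *user_arg)
{
	int karg = 0;
	int ret;

	if (cmd == MODEL_CMD_ADD_RESOURCE)
		memcpy(&karg, user_arg, sizeof(karg)); /* copy in */

	ret = model_proto_ioctl(cmd, &karg);

	if (ret == 0 && cmd == MODEL_CMD_GET_QUEUE)
		memcpy(user_arg, &karg, sizeof(karg)); /* copy out */

	return ret;
}
```

Because the callback never sees the caller's buffer, it can be reused from contexts that have no userspace pointer at all, which is the flexibility the message argues for.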
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:32:16 -04:00
Scott Weaver 9ec000dabc Merge: UDP: stable backports for rhel 9.4 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3249

JIRA: https://issues.redhat.com/browse/RHEL-14356
UDP: stable backports for rhel 9.4 phase 1
Tested: LNST, Tier1

A bunch of fixlets for the UDP protocol

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-30 15:41:53 -04:00
Paolo Abeni 4b1fe55101 udp: re-score reuseport groups when connected sockets are present
JIRA: https://issues.redhat.com/browse/RHEL-14356
Tested: LNST, Tier1

Upstream commit:
commit f0ea27e7bfe1c34e1f451a63eb68faa1d4c3a86d
Author: Lorenz Bauer <lmb@isovalent.com>
Date:   Thu Jul 20 17:30:05 2023 +0200

    udp: re-score reuseport groups when connected sockets are present

    Contrary to TCP, UDP reuseport groups can contain TCP_ESTABLISHED
    sockets. To support these properly we remember whether a group has
    a connected socket and skip the fast reuseport early-return. In
    effect we continue scoring all reuseport sockets and then choose the
    one with the highest score.

    The current code fails to re-calculate the score for the result of
    lookup_reuseport. According to Kuniyuki Iwashima:

        1) SO_INCOMING_CPU is set
           -> selected sk might have +1 score

        2) BPF prog returns ESTABLISHED and/or SO_INCOMING_CPU sk
           -> selected sk will have more than 8

      Using the old score could trigger more lookups depending on the
      order that sockets are created.

        sk -> sk (SO_INCOMING_CPU) -> sk (ESTABLISHED)
        |     |
        `-> select the next SO_INCOMING_CPU sk
              |
              `-> select itself (We should save this lookup)

    Fixes: efc6b6f6c3 ("udp: Improve load balancing for SO_REUSEPORT.")
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
    Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-1-7021b683cdae@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 13:05:37 +02:00
Chris von Recklinghausen 1f619343f6 treewide: use get_random_u32() when possible
Conflicts:
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c - We already have
		ce28ab1380e8 ("drm/tests: Add back seed value information")
		so keep calls to kunit_info.
	drop changes to drivers/misc/habanalabs/gaudi2/gaudi2.c
		fs/ntfs3/fslog.c - files not in CS9
	net/sunrpc/auth_gss/gss_krb5_wrap.c - We already have
		7f675ca7757b ("SUNRPC: Improve Kerberos confounder generation")
		so code to change is gone.
	drivers/gpu/drm/i915/i915_gem_gtt.c
	drivers/gpu/drm/i915/selftests/i915_selftest.c
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c
		change added under
		4cb818386e ("Merge DRM changes from upstream v6.0.8..v6.1")

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a251c17aa558d8e3128a528af5cf8b9d7caae4fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 17:43:22 2022 +0200

    treewide: use get_random_u32() when possible

    The prandom_u32() function has been a deprecated inline wrapper around
    get_random_u32() for several releases now, and compiles down to the
    exact same code. Replace the deprecated wrapper with a direct call to
    the real function. The same also applies to get_random_int(), which is
    just a wrapper around get_random_u32(). This was done as a basic find
    and replace.

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
    Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
    Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Acked-by: Helge Deller <deller@gmx.de> # for parisc
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:03 -04:00
Ivan Vecera 497f645693 net: move gso declarations and functions to their own files
JIRA: https://issues.redhat.com/browse/RHEL-12679

commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 8 19:17:37 2023 +0000

    net: move gso declarations and functions to their own files

    Move declarations into include/net/gso.h and code into net/core/gso.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Stanislav Fomichev <sdf@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 13:35:27 +02:00
Felix Maurer 2d92cf1f17 bpf, sockmap: Pass skb ownership through read_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Context difference due to missing ec095263a965 ("net:
  remove noblock parameter from recvmsg() entities") and db39dfdc1c3b
  ("udp: Use WARN_ON_ONCE() in udp_read_skb()"); 31f1fbcb346c ("udp:
  Refactor udp_read_skb()") was adapted to reflect this
- net/vmw_vsock/virtio_transport_common.c: Skipped, because the relevant
  code is not there, missing 634f1a7110b4 ("vsock: support sockmap")

commit 78fa0d61d97a728d306b0c23d353c0e340756437
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Mon May 22 19:56:05 2023 -0700

    bpf, sockmap: Pass skb ownership through read_skb

    The read_skb hook calls consume_skb() now, but this means that if the
    recv_actor program wants to use the skb it needs to inc the ref cnt
    so that the consume_skb() doesn't kfree the sk_buff.

    This is problematic because in some error cases under memory pressure
    we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
    Then we get this,

     skb_linearize()
       __pskb_pull_tail()
         pskb_expand_head()
           BUG_ON(skb_shared(skb))

    Because we incremented the users refcnt in sk_psock_verdict_recv(), the
    skb is shared (refcnt > 1) and we trip the BUG_ON.

    To fix this, let's simply pass ownership of the sk_buff through the
    read_skb call. Then we can drop the consume from the read_skb handlers
    and assume the verdict recv does any required kfree.

    Bug found while testing in our CI which runs in VMs that hit memory
    constraints rather regularly. William tested TCP read_skb handlers.

    [  106.536188] ------------[ cut here ]------------
    [  106.536197] kernel BUG at net/core/skbuff.c:1693!
    [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
    [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
    [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
    [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
    [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
    [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
    [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
    [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
    [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
    [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
    [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
    [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  106.542255] Call Trace:
    [  106.542383]  <IRQ>
    [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
    [  106.542681]  skb_ensure_writable+0x85/0xa0
    [  106.542882]  sk_skb_pull_data+0x18/0x20
    [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
    [  106.543536]  ? migrate_disable+0x66/0x80
    [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
    [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
    [  106.544561]  tcp_read_skb+0x7b/0x120
    [  106.544740]  tcp_data_queue+0x904/0xee0
    [  106.544931]  tcp_rcv_established+0x212/0x7c0
    [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
    [  106.545326]  tcp_v4_rcv+0xe70/0xf60
    [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
    [  106.545744]  ip_local_deliver_finish+0xa7/0x150

    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Reported-by: William Findlay <will@isovalent.com>
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: William Findlay <will@isovalent.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-29 15:45:40 +02:00
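The ownership change in the commit above can be modelled with a toy reference count. This is a sketch only: `struct toy_skb` and the `toy_*` helpers are hypothetical and merely mimic the idea that a buffer with more than one user counts as shared, which is the condition the BUG_ON in the trace checks. The old scheme makes the buffer shared while the actor runs; the new scheme does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an sk_buff with a user refcount. */
struct toy_skb {
    int users;
    bool freed;
};

static bool toy_skb_shared(const struct toy_skb *skb)
{
    return skb->users > 1;
}

static void toy_skb_put(struct toy_skb *skb)
{
    assert(skb->users > 0);
    if (--skb->users == 0)
        skb->freed = true;
}

/*
 * Old scheme: the actor takes an extra reference because the read_skb
 * handler consumes the skb afterwards. Returns whether the skb was
 * shared while the actor ran.
 */
bool toy_read_skb_old(struct toy_skb *skb)
{
    bool shared;

    skb->users++;                 /* actor grabs its own reference */
    shared = toy_skb_shared(skb); /* actor works on a shared skb */
    toy_skb_put(skb);             /* actor is done with the skb */
    toy_skb_put(skb);             /* handler consumes its reference */
    return shared;
}

/*
 * New scheme: the handler hands its only reference to the actor,
 * which is then responsible for freeing the skb.
 */
bool toy_read_skb_new(struct toy_skb *skb)
{
    bool shared = toy_skb_shared(skb); /* sole owner, not shared */

    toy_skb_put(skb);                  /* actor frees the skb */
    return shared;
}
```

In both variants the buffer ends up freed exactly once; only the old one exposes a shared buffer to the actor.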
Felix Maurer 7f95976aed udp: Refactor udp_read_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Code difference due to missing ec095263a965 ("net: remove
  noblock parameter from recvmsg() entities"): keep the existing parameters
  to skb_recv_udp(); and missing db39dfdc1c3b ("udp: Use WARN_ON_ONCE() in
  udp_read_skb()"): keep WARN_ON

commit 31f1fbcb346c9342f6860c322b3f33b2acbc640b
Author: Peilin Ye <peilin.ye@bytedance.com>
Date:   Thu Sep 22 21:59:13 2022 -0700

    udp: Refactor udp_read_skb()

    Delete the unnecessary while loop in udp_read_skb() for readability.
    Additionally, since recv_actor() cannot return a value greater than
    skb->len (see sk_psock_verdict_recv()), remove the redundant check.

    Suggested-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
    Link: https://lore.kernel.org/r/343b5d8090a3eb764068e9f1d392939e2b423747.1663909008.git.peilin.ye@bytedance.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-29 15:45:40 +02:00
Jan Stancek dea08a5636 Merge: net: mptcp: rebase to latest net-next
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2479

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193330
Upstream Status: All mainline in net.git.
Tested: boot+kselftest
Conflicts: see individual commits

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-19 08:29:21 +02:00
Davide Caratti c6f30ffe1a net: cache align tcp_memory_allocated, tcp_sockets_allocated
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193330
Upstream Status: net.git commit 91b6d3256356

commit 91b6d325635617540b6a1646ddb138bb17cbd569
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:39 2021 -0800

    net: cache align tcp_memory_allocated, tcp_sockets_allocated

    tcp_memory_allocated and tcp_sockets_allocated often share
    a common cache line, source of false sharing.

    Also take care of udp_memory_allocated and mptcp_sockets_allocated.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-05-09 11:08:43 +02:00
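The false-sharing problem named in the commit above can be shown with plain C11 alignment. This is a user-space sketch, not the kernel change itself: the struct names, the `toy_*` helpers and the 64-byte line size are assumptions, and the kernel uses its own annotations rather than `alignas`. The helpers compare member offsets, so they assume the struct itself starts on a cache-line boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CACHELINE 64 /* assumed cache line size in bytes */

/* Two hot counters packed next to each other. */
struct toy_packed {
    long memory_allocated;
    long sockets_allocated;
};

/* The same counters, each forced onto its own cache line. */
struct toy_aligned {
    alignas(TOY_CACHELINE) long memory_allocated;
    alignas(TOY_CACHELINE) long sockets_allocated;
};

static_assert(sizeof(struct toy_aligned) == 2 * TOY_CACHELINE,
              "each counter occupies a full cache line");

/* Cache line index of a byte offset, relative to an aligned base. */
static size_t toy_line_of(size_t offset)
{
    return offset / TOY_CACHELINE;
}

bool toy_packed_shares_line(void)
{
    return toy_line_of(offsetof(struct toy_packed, memory_allocated)) ==
           toy_line_of(offsetof(struct toy_packed, sockets_allocated));
}

bool toy_aligned_shares_line(void)
{
    return toy_line_of(offsetof(struct toy_aligned, memory_allocated)) ==
           toy_line_of(offsetof(struct toy_aligned, sockets_allocated));
}
```

The cost of the aligned layout is the padding: the struct grows to two full cache lines.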
Jeff Moyer 87aedebebc net: flag sockets supporting msghdr originated zerocopy
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Conflicts: include/linux/net.h - upstream there was a conflict between
  SOCK_CUSTOM_SOCKOPT and SOCK_SUPPORT_ZC.  There, it was resolved
  with the former getting defined as 6, and the latter as 5.  However,
  in the RHEL backport of a5ef058dc4d9 ("net: introduce and use custom
  sockopt socket flag"), 5 was chosen for SOCK_CUSTOM_SOCKOPT.  I
  could renumber it to 6 to match upstream, but that risks introducing
  unnecessary incompatibilities for 3rd party modules, so I opted to
  differ from upstream.  net/ipv4/udp.c - RHEL has a backport of
  commit 8a3854c7b8e4 ("udp: track the forward memory release
  threshold in an hot cacheline") out of order with this commit.  It's
  a simple fixup.

commit e993ffe3da4bcddea0536b03be1031bf35cd8d85
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Oct 21 11:16:39 2022 +0100

    net: flag sockets supporting msghdr originated zerocopy
    
    We need an efficient way in io_uring to check whether a socket supports
    zerocopy with msghdr provided ubuf_info. Add a new flag into the struct
    socket flags fields.
    
    Cc: <stable@vger.kernel.org> # 6.0
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:24:12 -04:00
Paolo Abeni 8912bbc04a net: Return errno in sk->sk_prot->get_port().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2166482
Tested: vs bz reproducer
Conflicts: different context in inet_csk_get_port and udp_lib_get_port,
  as rhel-9 lacks the upstream commit 08eaef904031 ("tcp: Clean up
  some functions.") and upstream commit 919dfa0b20ae ("udp: Clean up
  some functions.")

Upstream commit:
commit 7a7160edf1bfde25422262fb26851cef65f695d3
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Nov 18 10:25:06 2022 -0800

    net: Return errno in sk->sk_prot->get_port().

    We assume the correct errno is -EADDRINUSE when sk->sk_prot->get_port()
    fails, so some ->get_port() functions return just 1 on failure and the
    callers return -EADDRINUSE instead.

    However, mptcp_get_port() can return -EINVAL.  Let's not ignore the error.

    Note the only exception is inet_autobind(), all of whose callers return
    -EAGAIN instead.

    Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-02 00:03:37 +01:00
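The return-value convention changed by the commit above can be sketched as follows. Everything here is hypothetical (`enum toy_port_state`, `toy_get_port_old()`, `toy_get_port_new()`, `toy_bind_old()`, `toy_bind_new()`); the point is only that a callee returning 1 forces the caller to guess an errno, while a callee returning a negative errno lets the caller pass it through.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical failure modes of a toy ->get_port() implementation. */
enum toy_port_state { TOY_PORT_FREE, TOY_PORT_BUSY, TOY_PORT_INVALID };

/* Old convention: 0 on success, 1 on any failure. */
static int toy_get_port_old(enum toy_port_state st)
{
    return st == TOY_PORT_FREE ? 0 : 1;
}

/* New convention: 0 on success, negative errno on failure. */
static int toy_get_port_new(enum toy_port_state st)
{
    switch (st) {
    case TOY_PORT_FREE:
        return 0;
    case TOY_PORT_BUSY:
        return -EADDRINUSE;
    default:
        return -EINVAL;
    }
}

/* Old caller: has to guess the errno, so -EINVAL is lost. */
int toy_bind_old(enum toy_port_state st)
{
    if (toy_get_port_old(st))
        return -EADDRINUSE;
    return 0;
}

/* New caller: passes the callee's errno through unchanged. */
int toy_bind_new(enum toy_port_state st)
{
    int err = toy_get_port_new(st);

    assert(err <= 0);
    return err;
}
```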
Herton R. Krzesinski ee17c5d305 Merge: bpf, xdp: update to 6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1742

bpf, xdp: update to 6.0

Bugzilla: https://bugzilla.redhat.com/2137876

Signed-off-by: Artem Savkov <asavkov@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Michael Petlan <mpetlan@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-12 16:01:19 +00:00
Felix Maurer 6aa8ccbbdc skmsg: Get rid of skb_clone()
Bugzilla: https://bugzilla.redhat.com/2137876

commit 57452d767feaeab405de3bff0d240c3ac84bfe0d
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:13 2022 -0700

    skmsg: Get rid of skb_clone()
    
    With ->read_skb() we now have an entire skb dequeued from the
    receive queue, so we just need to grab an additional refcnt
    before passing its ownership to recv actors.
    
    And we should not touch them any more, particularly for
    skb->sk. Fortunately, skb->sk is already set for most of
    the protocols except UDP where skb->sk has been stolen,
    so we have to fix it up for UDP case.
    
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-4-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer 09faf01cb9 net: Introduce a new proto_ops ->read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: Context difference due to not yet applied 314001f0bf927
("af_unix: Add OOB support") and already applied 3f92a64e44e5 ("tcp:
allow tls to decrypt directly from the tcp rcv queue")

commit 965b57b469a589d64d81b1688b38dcb537011bb0
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:12 2022 -0700

    net: Introduce a new proto_ops ->read_skb()

    Currently both splice() and sockmap use ->read_sock() to
    read skb from receive queue, but for sockmap we only read
    one entire skb at a time, so ->read_sock() is too conservative
    to use. Introduce a new proto_ops ->read_skb() which supports
    this semantic; with this we can finally pass the ownership of
    skb to recv actors.

    For non-TCP protocols, all ->read_sock() can be simply
    converted to ->read_skb().

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Guillaume Nault 996e10a048 inet: rename INET_MATCH()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit eda090c31fe923ab9463b884469744ec903ab0cc
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 13 11:55:50 2022 -0700

    inet: rename INET_MATCH()

    This is no longer a macro, but an inlined function.

    INET_MATCH() -> inet_match()

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:47 +01:00
Guillaume Nault 97f5ffd267 inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 4915d50e300e96929d2462041d6f6c6f061167fd
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu May 12 09:56:01 2022 -0700

    inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()

    INET_MATCH() runs without holding a lock on the socket.

    We probably need to annotate most reads.

    This patch makes INET_MATCH() an inline function
    to ease our changes.

    v2:

    We remove the 32bit version of it, as modern compilers
    should generate the same code really, no need to
    try to be smarter.

    Also make 'struct net *net' the first argument.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:45 +01:00
Herton R. Krzesinski 09736a3a30 Merge: udp: some performance optimizations
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test

This series improves UDP protocol RX tput, to keep it on equal footing with the rhel-8 one.

Patches 1, 3 and 4 are there just to reduce conflicts, and patch 4 is a very partial
backport, to avoid pulling in unrelated features.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-13 17:35:03 +00:00
Frantisek Hrbata f76fa64ae5 Merge: udp: backports from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1501

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135958
Tested: compile only

Signed-off-by: Xin Long <lxin@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-29 07:44:19 -05:00
Davide Caratti 9aac6c4346 net: add per_cpu_fw_alloc field to struct proto
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 0defbb0af775
Conflicts:
 - net/core/sock.c: context mismatch because of missing backport of
   upstream commit f20cfd662a62 ("net: add sanity check in proto_register()")

commit 0defbb0af775ef037913786048d099bbe8b9a2c2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:08 2022 -0700

    net: add per_cpu_fw_alloc field to struct proto

    Each protocol having a ->memory_allocated pointer gets a corresponding
    per-cpu reserve, that following patches will use.

    Instead of having reserved bytes per socket,
    we want to have per-cpu reserves.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti 543f426b27 net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 100fdd1faf50

commit 100fdd1faf50557558e2911af4be32e515cb8036
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:07 2022 -0700

    net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT

    Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.

    This might change in the future, but it seems better to avoid the
    confusion.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
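With the accounting quantum equal to the page size, as the commit above states, converting a byte amount to accounting units is a plain round-up to whole pages. The sketch below shows that arithmetic in user space; `toy_mem_pages()` and the 4 KiB page size are assumptions made for illustration, not the kernel helper.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                    /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE  (1L << TOY_PAGE_SHIFT)

/* Round a byte amount up to whole pages. */
long toy_mem_pages(long amt)
{
    assert(amt >= 0);
    return (amt + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
}
```

Under these assumptions a single byte and a full 4096 bytes both cost one page, and 4097 bytes cost two.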
Paolo Abeni 2657483f26 udp: track the forward memory release threshold in an hot cacheline
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1
Conflicts: use lock_sock()/release_sock() instead of
 sockopt_lock_sock()/sockopt_release_sock() as rhel-9 lacks the
 upstream commit 24426654ed3a ("bpf: net: Avoid sk_setsockopt() taking
 sk lock when called from bpf")

Upstream commit:
commit 8a3854c7b8e4532063b14bed34115079b7d0cb36
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Thu Oct 20 19:48:52 2022 +0200

    udp: track the forward memory release threshold in an hot cacheline

    When the receiver process and the BH run on different cores,
    udp_rmem_release() experiences a cache miss while accessing sk_rcvbuf,
    as the latter shares the same cacheline with sk_forward_alloc, written
    by the BH.

    With this patch, UDP tracks the rcvbuf value and its update via custom
    SOL_SOCKET socket options, and copies the forward memory threshold value
    used by udp_rmem_release() in a different cacheline, already accessed by
    the above function and uncontended.

    Since the UDP socket init operation has grown a bit, factor out the
    common code between v4 and v6 in a shared helper.

    Overall the above gives a 10% peak throughput increase under UDP flood.

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-28 09:46:32 +02:00
Frantisek Hrbata 0c3a22328a Merge: IPv6: 9.2 P1 backport from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1488

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-27 08:26:02 -04:00
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```
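
The `protocol` field in the output above is the packet's EtherType printed in decimal: 34525 is 0x86DD, which is IPv6. A small helper for reading such trace lines might look like the sketch below; the `TOY_*` names and both functions are hypothetical and cover only a few common values.

```c
#include <assert.h>

/* A few well-known EtherType values. */
#define TOY_ETH_P_IP   0x0800
#define TOY_ETH_P_ARP  0x0806
#define TOY_ETH_P_IPV6 0x86DD

static_assert(TOY_ETH_P_IPV6 == 34525, "0x86DD is 34525 in decimal");

enum toy_l3 { TOY_L3_IPV4, TOY_L3_ARP, TOY_L3_IPV6, TOY_L3_OTHER };

/* Classify the decimal protocol= value from a kfree_skb trace line. */
enum toy_l3 toy_classify_protocol(unsigned int protocol)
{
    switch (protocol) {
    case TOY_ETH_P_IP:
        return TOY_L3_IPV4;
    case TOY_ETH_P_ARP:
        return TOY_L3_ARP;
    case TOY_ETH_P_IPV6:
        return TOY_L3_IPV6;
    default:
        return TOY_L3_OTHER;
    }
}

/* Human-readable name for the classification. */
const char *toy_l3_name(enum toy_l3 l3)
{
    switch (l3) {
    case TOY_L3_IPV4:
        return "IPv4";
    case TOY_L3_ARP:
        return "ARP";
    case TOY_L3_IPV6:
        return "IPv6";
    default:
        return "other";
    }
}
```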

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
Xin Long 41b7fb44a4 udp: Update reuse->has_conns under reuseport_lock.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135958
Tested: compile only

commit 69421bf98482d089e50799f45e48b25ce4a8d154
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Oct 14 11:26:25 2022 -0700

    udp: Update reuse->has_conns under reuseport_lock.

    When we call connect() for a UDP socket in a reuseport group, we have
    to update sk->sk_reuseport_cb->has_conns to 1.  Otherwise, the kernel
    could wrongly select an unconnected socket for packets sent to the
    connected socket.

    However, the current way to set has_conns is illegal and can
    trigger that problem.  reuseport_has_conns() changes has_conns under
    rcu_read_lock(), which upgrades the RCU reader to the updater.  Then,
    it must do the update under the updater's lock, reuseport_lock, but
    it doesn't for now.

    For this reason, there is a race below where we fail to set has_conns
    resulting in the wrong socket selection.  To avoid the race, let's split
    the reader and updater with proper locking.

     cpu1                               cpu2
    +----+                             +----+

    __ip[46]_datagram_connect()        reuseport_grow()
    .                                  .
    |- reuseport_has_conns(sk, true)   |- more_reuse = __reuseport_alloc(more_socks_size)
    |  .                               |
    |  |- rcu_read_lock()
    |  |- reuse = rcu_dereference(sk->sk_reuseport_cb)
    |  |
    |  |                               |  /* reuse->has_conns == 0 here */
    |  |                               |- more_reuse->has_conns = reuse->has_conns
    |  |- reuse->has_conns = 1         |  /* more_reuse->has_conns SHOULD BE 1 HERE */
    |  |                               |
    |  |                               |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
    |  |                               |                     more_reuse)
    |  `- rcu_read_unlock()            `- kfree_rcu(reuse, rcu)
    |
    |- sk->sk_state = TCP_ESTABLISHED

    Note the likely(reuse) in reuseport_has_conns_set() is always true,
    but we put the test there for ease of review.  [0]

    For the record, usually, sk_reuseport_cb is changed under lock_sock().
    The only exception is reuseport_grow() & TCP reqsk migration case.

      1) shutdown() TCP listener, which is moved into the latter part of
         reuse->socks[] to migrate reqsk.

      2) New listen() overflows reuse->socks[] and calls reuseport_grow().

      3) reuse->max_socks overflows u16 with the new listener.

      4) reuseport_grow() pops the old shutdown()ed listener from the array
         and updates its sk->sk_reuseport_cb to NULL without lock_sock().

    shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
    but reuseport_has_conns_set() is called only for UDP under lock_sock(),
    so likely(reuse) is never false in reuseport_has_conns_set().

    [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/

    Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Xin Long <lxin@redhat.com>
2022-10-18 19:20:11 -04:00
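The lost update in the cpu1/cpu2 diagram of the commit above can be replayed deterministically in a single thread. This is a sketch of the interleaving only: `struct toy_reuse` and both functions are hypothetical, there is no real RCU or locking here, and the "serialized" variant simply plays the two sides back to back, which is what taking a common lock would enforce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-group reuseport state. */
struct toy_reuse {
    int has_conns;
};

/*
 * Replay the racy interleaving: the connecting CPU writes to the old
 * structure after the growing CPU has already copied has_conns out of
 * it. Returns has_conns as seen through the published pointer.
 */
int toy_racy_interleaving(void)
{
    struct toy_reuse old = { 0 };
    struct toy_reuse grown = { 0 };
    struct toy_reuse *cb = &old;     /* published pointer */
    struct toy_reuse *seen;

    seen = cb;                       /* cpu1: dereference the pointer */
    grown.has_conns = cb->has_conns; /* cpu2: copy, still 0 */
    seen->has_conns = 1;             /* cpu1: write to the old copy */
    cb = &grown;                     /* cpu2: publish the new copy */

    assert(cb != NULL);
    return cb->has_conns;            /* the update was lost */
}

/*
 * Replay a serialized order: the grow completes first, then the
 * update runs and lands in the structure that is actually published.
 */
int toy_serialized(void)
{
    struct toy_reuse old = { 0 };
    struct toy_reuse grown = { 0 };
    struct toy_reuse *cb = &old;
    struct toy_reuse *seen;

    grown.has_conns = cb->has_conns; /* cpu2: copy */
    cb = &grown;                     /* cpu2: publish the new copy */
    seen = cb;                       /* cpu1: sees the new copy */
    seen->has_conns = 1;             /* cpu1: update is kept */

    assert(cb != NULL);
    return cb->has_conns;
}
```

The opposite serialized order, update first and grow second, also keeps the flag, because the copy then reads 1.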
Xin Long bcde996ce0 udp: Remove redundant __udp_sysctl_init() call from udp_init().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135958
Tested: compile only

commit 02a7cb2866dd6e3ac7645b594289e1c308b68c4e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Thu Jul 28 20:21:37 2022 -0700

    udp: Remove redundant __udp_sysctl_init() call from udp_init().

    __udp_sysctl_init() is called for init_net via udp_sysctl_ops.

    While at it, we can rename __udp_sysctl_init() to udp_sysctl_init().

    Fixes: 1e80295158 ("udp: Move the udp sysctl to namespace.")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2022-10-18 19:20:11 -04:00
Xin Long 83ed64626f net: udp: fix alignment problem in udp4_seq_show()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135958
Tested: compile only

commit 6c25449e1a32c594d743df8e8258e8ef870b6a77
Author: yangxingwu <xingwu.yang@gmail.com>
Date:   Mon Dec 27 16:29:51 2021 +0800

    net: udp: fix alignment problem in udp4_seq_show()

    $ cat /proc/net/udp

    before:

      sl  local_address rem_address   st tx_queue rx_queue tr tm->when
    26050: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000
    26320: 0100007F:0143 00000000:0000 07 00000000:00000000 00:00000000
    27135: 00000000:8472 00000000:0000 07 00000000:00000000 00:00000000

    after:

       sl  local_address rem_address   st tx_queue rx_queue tr tm->when
    26050: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000
    26320: 0100007F:0143 00000000:0000 07 00000000:00000000 00:00000000
    27135: 00000000:8472 00000000:0000 07 00000000:00000000 00:00000000

    Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2022-10-18 19:20:10 -04:00
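The address columns in the listing above are hexadecimal `ADDR:PORT` pairs. A user-space reader could decode one field as sketched below; `toy_parse_udp_field()` is a hypothetical helper, and it assumes the file was produced on a little-endian machine, as the sample `0100007F` for 127.0.0.1 suggests.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one "ADDR:PORT" field such as "0100007F:0035".
 * Assumes little-endian output, so the lowest byte of the printed
 * word is the first octet of the address. The port is taken as a
 * plain hexadecimal number.
 * Returns 0 on success, -1 on malformed input.
 */
int toy_parse_udp_field(const char *field, unsigned char ip[4],
                        unsigned int *port)
{
    unsigned int addr;
    unsigned int p;

    assert(field != NULL && ip != NULL && port != NULL);
    if (sscanf(field, "%8x:%4x", &addr, &p) != 2)
        return -1;

    ip[0] = addr & 0xff;
    ip[1] = (addr >> 8) & 0xff;
    ip[2] = (addr >> 16) & 0xff;
    ip[3] = (addr >> 24) & 0xff;
    *port = p;
    return 0;
}
```

Applied to the first row of the listing, `0100007F:0035` decodes to 127.0.0.1 port 53.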
Hangbin Liu 4b10ce48ea tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319
Upstream Status: net.git commit d38afeec26ed

commit d38afeec26ed4739c640bf286c270559aab2ba5f
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Thu Oct 6 11:53:47 2022 -0700

    tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().

    Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were
    able to clean them up by calling inet6_destroy_sock() during the IPv6 ->
    IPv4 conversion by IPV6_ADDRFORM.  However, commit 03485f2adc ("udpv6:
    Add lockless sendmsg() support") added a lockless memory allocation path,
    which could cause a memory leak:

    setsockopt(IPV6_ADDRFORM)                 sendmsg()
    +-----------------------+                 +-------+
    - do_ipv6_setsockopt(sk, ...)             - udpv6_sendmsg(sk, ...)
      - sockopt_lock_sock(sk)                   ^._ called via udpv6_prot
        - lock_sock(sk)                             before WRITE_ONCE()
      - WRITE_ONCE(sk->sk_prot, &tcp_prot)
      - inet6_destroy_sock()                    - if (!corkreq)
      - sockopt_release_sock(sk)                  - ip6_make_skb(sk, ...)
        - release_sock(sk)                          ^._ lockless fast path for
                                                        the non-corking case

                                                    - __ip6_append_data(sk, ...)
                                                      - ipv6_local_rxpmtu(sk, ...)
                                                        - xchg(&np->rxpmtu, skb)
                                                          ^._ rxpmtu is never freed.

                                                    - goto out_no_dst;

                                                - lock_sock(sk)

    For now, rxpmtu is the only such case, but to avoid missing future
    changes and bugs similar to the one fixed in commit e27326009a3d
    ("net: ping6: Fix memleak in ipv6_renew_options()."), let's set a new function to IPv6
    sk->sk_destruct() and call inet6_cleanup_sock() there.  Since the
    conversion does not change sk->sk_destruct(), we can guarantee that
    we can clean up IPv6 resources finally.

    We can now remove all inet6_destroy_sock() calls from IPv6 protocol
    specific ->destroy() functions, but such changes are invasive to
    backport.  So they can be posted as a follow-up later for net-next.

    Fixes: 03485f2adc ("udpv6: Add lockless sendmsg() support")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-10-18 11:41:13 +08:00
Antoine Tenart 854953ce3e net: udp: use kfree_skb_reason() in __udp_queue_rcv_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 08d4c0370c400fa6ef2194f9ee2e8dccc4a7ab39
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sat Feb 5 15:47:39 2022 +0800

    net: udp: use kfree_skb_reason() in __udp_queue_rcv_skb()

    Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
    The following new drop reasons are introduced:

    SKB_DROP_REASON_SOCKET_RCVBUFF
    SKB_DROP_REASON_PROTO_MEM

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
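How an enqueue error might be turned into one of the two drop reasons named in the commit above can be sketched as follows. The enum and `toy_udp_drop_reason()` are hypothetical, and the mapping itself is an assumption of this sketch (-ENOMEM taken to mean the socket receive buffer is full, any other error attributed to protocol memory); it should be checked against the tree before being relied on.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirror of the two drop reasons named in the commit. */
enum toy_drop_reason {
    TOY_DROP_SOCKET_RCVBUFF, /* receive buffer of the socket is full */
    TOY_DROP_PROTO_MEM,      /* protocol-wide memory limit was hit */
};

/* Map a negative enqueue error to a drop reason (assumed mapping). */
enum toy_drop_reason toy_udp_drop_reason(int rc)
{
    assert(rc < 0);
    return rc == -ENOMEM ? TOY_DROP_SOCKET_RCVBUFF : TOY_DROP_PROTO_MEM;
}
```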
Antoine Tenart b9868e0eb2 net: udp: use kfree_skb_reason() in udp_queue_rcv_one_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 1379a92d38e31132e87d9b653e9343c7841a7348
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sat Feb 5 15:47:38 2022 +0800

    net: udp: use kfree_skb_reason() in udp_queue_rcv_one_skb()

    Replace kfree_skb() with kfree_skb_reason() in udp_queue_rcv_one_skb().

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Chris von Recklinghausen 59baaa91f0 include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h
Bugzilla: https://bugzilla.redhat.com/2120352

commit a1554c002699cbc9ced2e9f44f9c1357181bead3
Author: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Date:   Fri Nov 5 13:45:21 2021 -0700

    include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h

    nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
    The advantage of this change is that it reduces obsolete includes.
    For example, net/ipv4/tcp.c no longer needs swap.h, since it already
    includes mm.h.  Similarly, after checking all the other files, it
    turns out that tcp.c, udp.c, meter.c, ... follow the same rule, so
    these files can have swap.h removed too.

    Moreover, after preprocessing all the files that use
    nr_free_buffer_pages, it turns out that those files already include
    mm.h.  Thus, we can move nr_free_buffer_pages from swap.h to mm.h
    safely.  This change will not affect the compilation of other files.

    Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
    Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Ulf Hansson <ulf.hansson@linaro.org>
    Cc: "David S . Miller" <davem@davemloft.net>
    Cc: Simon Horman <horms@verge.net.au>
    Cc: Pravin B Shelar <pshelar@ovn.org>
    Cc: Vlad Yasevich <vyasevich@gmail.com>
    Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:30 -04:00
Felix Maurer de20724127 net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 91a760b26926265a60c77ddf016529bcf3e17a04
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Jan 6 21:20:20 2022 +0800

    net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()

    The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
    __inet_bind() is not handled properly. When the return value is
    non-zero, __inet_bind() sets inet_saddr and inet_rcv_saddr to 0
    and exits:

            err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
            if (err) {
                    inet->inet_saddr = inet->inet_rcv_saddr = 0;
                    goto out_release_sock;
            }

    Take UDP as an example and see what happens. A UDP socket is
    added to 'udp_prot.h.udp_table->hash' and
    'udp_prot.h.udp_table->hash2' after sk->sk_prot->get_port()
    succeeds. If 'inet->inet_rcv_saddr' was specified, 'sk' ends up
    in an 'hslot2' of 'hash2' that it doesn't belong to (because
    inet_saddr has been changed to 0), and received UDP packets are
    not passed to this sock. If 'inet->inet_rcv_saddr' was not
    specified, the sock works fine and can receive packets properly,
    which is weird, as the 'bind()' has already failed.

    To undo the get_port() operation, introduce the 'put_port' field
    for 'struct proto'. For the TCP proto, it is inet_put_port(); for
    the UDP proto, it is udp_lib_unhash(); for the icmp proto, it is
    ping_unhash().

    Therefore, after sys_bind() fails because of
    BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket is unbound,
    which means that it can try to bind to another port.
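
The rollback described above can be sketched in userspace as follows. This is a simplified model, not the kernel code: `struct sock`, `struct proto`, `demo_bind()`, the `demo_*` callbacks and the two hook functions are hypothetical names introduced only for illustration. It assumes that the optional `put_port` callback is the exact inverse of `get_port`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sock;

/* Simplified model of a protocol's port callbacks. */
struct proto {
	int  (*get_port)(struct sock *sk, unsigned short port);
	void (*put_port)(struct sock *sk);	/* undoes get_port(), may be NULL */
};

/* Simplified model of a socket; "hashed" stands for hash table membership. */
struct sock {
	const struct proto *prot;
	unsigned int rcv_saddr;
	unsigned short port;
	bool hashed;
};

static int demo_get_port(struct sock *sk, unsigned short port)
{
	sk->port = port;
	sk->hashed = true;
	return 0;
}

static void demo_put_port(struct sock *sk)
{
	sk->hashed = false;
	sk->port = 0;
}

const struct proto demo_proto = {
	.get_port = demo_get_port,
	.put_port = demo_put_port,
};

/* Hypothetical post-bind hooks: one accepts the bind, one rejects it. */
int hook_allow(struct sock *sk) { (void)sk; return 0; }
int hook_deny(struct sock *sk)  { (void)sk; return -1; }

/* Bind, run the post-bind hook, and roll everything back if it fails. */
int demo_bind(struct sock *sk, unsigned int addr, unsigned short port,
	      int (*post_bind_hook)(struct sock *sk))
{
	int err;

	assert(sk != NULL && sk->prot != NULL);

	sk->rcv_saddr = addr;
	err = sk->prot->get_port(sk, port);
	if (err) {
		sk->rcv_saddr = 0;
		return err;
	}

	err = post_bind_hook(sk);
	if (err) {
		/* Clear the address AND release the port reservation. */
		sk->rcv_saddr = 0;
		if (sk->prot->put_port)
			sk->prot->put_port(sk);
		return err;
	}
	return 0;
}
```

In this model, a bind rejected by the hook leaves the socket unhashed with no port, so a later `demo_bind()` to another port starts from a clean state.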

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 16:53:48 +02:00
Artem Savkov 75a645a56c add missing bpf-cgroup.h includes
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit aef2feda97b840ec38e9fa53d0065188453304e8
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Dec 15 18:55:37 2021 -0800

    add missing bpf-cgroup.h includes

    We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency,
    make sure those who actually need more than the definition of
    struct cgroup_bpf include bpf-cgroup.h explicitly.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:49 +02:00
Artem Savkov 070349c7ea bpf: Add ingress_ifindex to bpf_sk_lookup
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit f89315650ba34ec6c91a8bded72796980bee2a4d
Author: Mark Pashmfouroush <markpash@cloudflare.com>
Date:   Wed Nov 10 11:10:15 2021 +0000

    bpf: Add ingress_ifindex to bpf_sk_lookup

    It may be helpful to have access to the ifindex during bpf socket
    lookup. An example may be to scope certain socket lookup logic to
    specific interfaces, i.e. an interface may be made exempt from custom
    lookup code.

    Add the ifindex of the arriving connection to the bpf_sk_lookup API.

    Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:35 +02:00
Patrick Talbert 8c5b3f7fd9 Merge: XDP and networking eBPF rebase to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Depends: !572

Tested: Using bpf selftests, everything passes.

This rebases XDP and networking eBPF to upstream kernel version 5.15.

Signed-off-by: Jiri Benc <jbenc@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-03 09:26:25 +02:00