784 Commits

Author SHA1 Message Date
Benjamin Coddington 82eb8441b8 net: add a refcount tracker for kernel sockets
JIRA: https://issues.redhat.com/browse/RHEL-73723
Conflicts: the __netns_tracker_alloc interface has been updated upstream
b6d7c0eb2dcbd, but in RHEL the hunk for notrefcnt_tracker was not included
(See RHEL commit 3b0a87ad0e, RHEL-24101).  We merge it in here.  Also,
we've dropped the rds hunk, as that seems unmaintained in RHEL and is missing
the path where that hunk should operate.

commit 0cafd77dcd032d1687efaba5598cf07bce85997f
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Oct 20 23:20:18 2022 +0000

    net: add a refcount tracker for kernel sockets

    Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
    added a tracker to sockets, but did not track kernel sockets.

    We still have syzbot reports hinting about netns being destroyed
    while some kernel TCP sockets had not been dismantled.

    This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
    call to net_free() right before the netns is freed.

    Normally, each layer is responsible for properly releasing its
    kernel sockets before last call to net_free().

    This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
2025-01-31 06:45:48 -05:00
Rado Vrbovsky 745fc07ced Merge: net-core: stable backport for 9.6 phase 2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6112

JIRA: https://issues.redhat.com/browse/RHEL-73121
CVE: CVE-2024-56658
Tested: LNST, Tier1

A couple of stable backports for core networking, addressing critical issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2025-01-14 14:18:30 +00:00
Paolo Abeni ab4de5449b net: restrict SO_REUSEPORT to inet sockets
JIRA: https://issues.redhat.com/browse/RHEL-73121
Tested: LNST, Tier1

Upstream commit:
commit 5b0af621c3f6ef9261cf6067812f2fd9943acb4b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Dec 31 16:05:27 2024 +0000

    net: restrict SO_REUSEPORT to inet sockets

    After blamed commit, crypto sockets could accidentally be destroyed
    from RCU callback, as spotted by syzbot [1].

    Trying to acquire a mutex in RCU callback is not allowed.

    Restrict SO_REUSEPORT socket option to inet sockets.

    v1 of this patch supported TCP, UDP and SCTP sockets,
    but fcnal-test.sh test needed RAW and ICMP support.

    [1]
    BUG: sleeping function called from invalid context at kernel/locking/mutex.c:562
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 24, name: ksoftirqd/1
    preempt_count: 100, expected: 0
    RCU nest depth: 0, expected: 0
    1 lock held by ksoftirqd/1/24:
      #0: ffffffff8e937ba0 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
      #0: ffffffff8e937ba0 (rcu_callback){....}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2561 [inline]
      #0: ffffffff8e937ba0 (rcu_callback){....}-{0:0}, at: rcu_core+0xa37/0x17a0 kernel/rcu/tree.c:2823
    Preemption disabled at:
     [<ffffffff8161c8c8>] softirq_handle_begin kernel/softirq.c:402 [inline]
     [<ffffffff8161c8c8>] handle_softirqs+0x128/0x9b0 kernel/softirq.c:537
    CPU: 1 UID: 0 PID: 24 Comm: ksoftirqd/1 Not tainted 6.13.0-rc3-syzkaller-00174-ga024e377efed #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
    Call Trace:
     <TASK>
      __dump_stack lib/dump_stack.c:94 [inline]
      dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
      __might_resched+0x5d4/0x780 kernel/sched/core.c:8758
      __mutex_lock_common kernel/locking/mutex.c:562 [inline]
      __mutex_lock+0x131/0xee0 kernel/locking/mutex.c:735
      crypto_put_default_null_skcipher+0x18/0x70 crypto/crypto_null.c:179
      aead_release+0x3d/0x50 crypto/algif_aead.c:489
      alg_do_release crypto/af_alg.c:118 [inline]
      alg_sock_destruct+0x86/0xc0 crypto/af_alg.c:502
      __sk_destruct+0x58/0x5f0 net/core/sock.c:2260
      rcu_do_batch kernel/rcu/tree.c:2567 [inline]
      rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
      handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:561
      run_ksoftirqd+0xca/0x130 kernel/softirq.c:950
      smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
      kthread+0x2f0/0x390 kernel/kthread.c:389
      ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
      ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
     </TASK>

    Fixes: 8c7138b33e ("net: Unpublish sk from sk_reuseport_cb before call_rcu")
    Reported-by: syzbot+b3e02953598f447d4d2a@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/netdev/6772f2f4.050a0220.2f3838.04cb.GAE@google.com/T/#u
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Martin KaFai Lau <kafai@fb.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://patch.msgid.link/20241231160527.3994168-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-08 18:55:28 +01:00
Jeff Moyer c46aaba751 net: change proto and proto_ops accept type
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: RHEL is missing commit 1ded5e5a5931 ("net: annotate
data-races around sock->ops"), which accounts for the differences in
ops structure dereferencing.

commit 92ef0fd55ac80dfc2e4654edfe5d1ddfa6e070fe
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu May 9 09:20:08 2024 -0600

    net: change proto and proto_ops accept type
    
    Rather than pass in flags, error pointer, and whether this is a kernel
    invocation or not, add a struct proto_accept_arg struct as the argument.
    This then holds all of these arguments, and prepares accept for being
    able to pass back more information.
    
    No functional changes in this patch.
    
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-12-02 11:12:33 -05:00
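
The refactoring pattern described above (bundling several parameters into one argument struct) can be sketched in plain C. Everything below is hypothetical and for illustration only: `struct accept_arg_sketch`, its field names, and both functions are stand-ins and are not copied from the kernel's `struct proto_accept_arg`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an accept argument struct (names are assumptions). */
struct accept_arg_sketch {
	int flags;  /* caller-supplied flags */
	int err;    /* error code reported back by the callee */
	bool kern;  /* true when invoked on behalf of the kernel */
};

/* Before: flags, error pointer and kernel marker are separate parameters. */
int accept_old_style(int queue_len, int flags, int *err, bool kern)
{
	(void)flags;
	(void)kern;
	if (queue_len == 0) {
		*err = -EAGAIN;   /* nothing queued to accept */
		return -1;
	}
	*err = 0;
	return queue_len - 1;     /* toy result: index of the accepted entry */
}

/* After: one struct carries the same inputs and outputs. */
int accept_new_style(int queue_len, struct accept_arg_sketch *arg)
{
	assert(arg != NULL);
	return accept_old_style(queue_len, arg->flags, &arg->err, arg->kern);
}
```

The behaviour is unchanged, in line with "No functional changes in this patch"; the benefit is that a new field can later be added to the struct without touching every function signature again.
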
Lucas Zampieri b0d4d6bf2e Merge: io_uring: update to 6.8 + fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4429

io_uring: update to 6.8 + fixes

JIRA: https://issues.redhat.com/browse/RHEL-27755  
JIRA: https://issues.redhat.com/browse/RHEL-36928  
JIRA: https://issues.redhat.com/browse/RHEL-36926  
JIRA: https://issues.redhat.com/browse/RHEL-37250  
JIRA: https://issues.redhat.com/browse/RHEL-37293  
CVE: CVE-2024-35831  
CVE: CVE-2024-35827  
CVE: CVE-2024-35880  
CVE: CVE-2024-35923  

This update pulls in 6.7 and 6.8 patches along with any missing fixes
and stable Cc:s.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-07 16:50:15 +00:00
Scott Weaver 19af9f11eb Merge: net: fix __dst_negative_advice() race
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4673

JIRA: https://issues.redhat.com/browse/RHEL-41185
CVE: CVE-2024-36971
Tested: compile only

Patch 1/2 is to solve a context difference in include/net/sock.h.

v1->v2:
  - add the missing CVE tag.

Signed-off-by: Xin Long <lxin@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Scott Weaver <scweaver@redhat.com>
2024-08-02 10:38:22 -04:00
Xin Long a6a99aa0fb net: annotate data-races around sk->sk_dst_pending_confirm
JIRA: https://issues.redhat.com/browse/RHEL-41185
Tested: compile only

commit eb44ad4e635132754bfbcb18103f1dcb7058aedd
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Sep 21 20:28:18 2023 +0000

    net: annotate data-races around sk->sk_dst_pending_confirm

    This field can be read or written without socket lock being held.

    Add annotations to avoid load-store tearing.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2024-07-10 15:11:11 -04:00
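
The kind of annotation described above can be approximated in user space. This is a rough sketch, assuming GCC or Clang (for `__typeof__`); the `_SKETCH` macros are simplified stand-ins for the kernel's `READ_ONCE()` and `WRITE_ONCE()`, and `struct sock_sketch` is a hypothetical type with a single flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified approximations: force a single volatile access to x. */
#define READ_ONCE_SKETCH(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical socket-like object with a flag accessed without a lock. */
struct sock_sketch {
	uint8_t dst_pending_confirm;
};

/* Set the flag, skipping the store when it is already set. */
void confirm_set(struct sock_sketch *sk)
{
	assert(sk != NULL);
	if (!READ_ONCE_SKETCH(sk->dst_pending_confirm))
		WRITE_ONCE_SKETCH(sk->dst_pending_confirm, 1);
}

/* Return the current flag value and clear it if it was set. */
int confirm_test_and_clear(struct sock_sketch *sk)
{
	int pending;

	assert(sk != NULL);
	pending = READ_ONCE_SKETCH(sk->dst_pending_confirm);
	if (pending)
		WRITE_ONCE_SKETCH(sk->dst_pending_confirm, 0);
	return pending;
}
```

The volatile access keeps the compiler from splitting, merging, or repeating the load or store. It does not make the read-then-write sequence atomic.
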
Paolo Abeni f690f594cf net: do not leave a dangling sk pointer, when socket creation fails
JIRA: https://issues.redhat.com/browse/RHEL-46610
Tested: LNST

Upstream commit:
commit 6cd4a78d962bebbaf8beb7d2ead3f34120e3f7b2
Author: Ignat Korchagin <ignat@cloudflare.com>
Date:   Mon Jun 17 22:02:05 2024 +0100

    net: do not leave a dangling sk pointer, when socket creation fails

    It is possible to trigger a use-after-free by:
      * attaching an fentry probe to __sock_release() and the probe calling the
        bpf_get_socket_cookie() helper
      * running traceroute -I 1.1.1.1 on a freshly booted VM

    A KASAN enabled kernel will log something like below (decoded and stripped):
    ==================================================================
    BUG: KASAN: slab-use-after-free in __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
    Read of size 8 at addr ffff888007110dd8 by task traceroute/299

    CPU: 2 PID: 299 Comm: traceroute Tainted: G            E      6.10.0-rc2+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
    Call Trace:
     <TASK>
    dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1))
    print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
    ? __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
    kasan_report (mm/kasan/report.c:603)
    ? __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
    kasan_check_range (mm/kasan/generic.c:183 mm/kasan/generic.c:189)
    __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
    bpf_get_socket_ptr_cookie (./arch/x86/include/asm/preempt.h:94 ./include/linux/sock_diag.h:42 net/core/filter.c:5094 net/core/filter.c:5092)
    bpf_prog_875642cf11f1d139___sock_release+0x6e/0x8e
    bpf_trampoline_6442506592+0x47/0xaf
    __sock_release (net/socket.c:652)
    __sock_create (net/socket.c:1601)
    ...
    Allocated by task 299 on cpu 2 at 78.328492s:
    kasan_save_stack (mm/kasan/common.c:48)
    kasan_save_track (mm/kasan/common.c:68)
    __kasan_slab_alloc (mm/kasan/common.c:312 mm/kasan/common.c:338)
    kmem_cache_alloc_noprof (mm/slub.c:3941 mm/slub.c:4000 mm/slub.c:4007)
    sk_prot_alloc (net/core/sock.c:2075)
    sk_alloc (net/core/sock.c:2134)
    inet_create (net/ipv4/af_inet.c:327 net/ipv4/af_inet.c:252)
    __sock_create (net/socket.c:1572)
    __sys_socket (net/socket.c:1660 net/socket.c:1644 net/socket.c:1706)
    __x64_sys_socket (net/socket.c:1718)
    do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

    Freed by task 299 on cpu 2 at 78.328502s:
    kasan_save_stack (mm/kasan/common.c:48)
    kasan_save_track (mm/kasan/common.c:68)
    kasan_save_free_info (mm/kasan/generic.c:582)
    poison_slab_object (mm/kasan/common.c:242)
    __kasan_slab_free (mm/kasan/common.c:256)
    kmem_cache_free (mm/slub.c:4437 mm/slub.c:4511)
    __sk_destruct (net/core/sock.c:2117 net/core/sock.c:2208)
    inet_create (net/ipv4/af_inet.c:397 net/ipv4/af_inet.c:252)
    __sock_create (net/socket.c:1572)
    __sys_socket (net/socket.c:1660 net/socket.c:1644 net/socket.c:1706)
    __x64_sys_socket (net/socket.c:1718)
    do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

    Fix this by clearing the struct socket reference in sk_common_release() to cover
    all protocol families' create functions, which may have already attached the
    reference to the sk object with sock_init_data().

    Fixes: c5dbb89fc2 ("bpf: Expose bpf_get_socket_cookie to tracing programs")
    Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/netdev/20240613194047.36478-1-kuniyu@amazon.com/T/
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
    Link: https://lore.kernel.org/r/20240617210205.67311-1-ignat@cloudflare.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-10 09:15:04 +02:00
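
The idea behind the fix above, clearing the back reference before the object is freed, can be modelled in user space. This is a minimal sketch with hypothetical types and functions (`struct socket_model`, `struct sk_model`, `attach_sk()`, `release_sk()`); it is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sk_model;

/* Hypothetical outer object that holds a pointer to the inner one. */
struct socket_model {
	struct sk_model *sk;
};

/* Hypothetical inner object with a back reference to its owner. */
struct sk_model {
	struct socket_model *sk_socket;
};

/* Allocate the inner object and link both directions. */
struct sk_model *attach_sk(struct socket_model *sock)
{
	struct sk_model *sk = malloc(sizeof(*sk));

	if (!sk)
		return NULL;
	sk->sk_socket = sock;
	sock->sk = sk;
	return sk;
}

/* Release the inner object, clearing the owner's pointer first so the
 * owner is never left pointing at freed memory. */
void release_sk(struct sk_model *sk)
{
	assert(sk != NULL);
	if (sk->sk_socket)
		sk->sk_socket->sk = NULL;
	free(sk);
}
```

After `release_sk()`, any later code that inspects the owner sees a NULL pointer instead of a stale one, which is what prevents the use-after-free in this model.
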
Jeff Moyer 8d70cdbc58 net/socket: Break down __sys_getsockopt
JIRA: https://issues.redhat.com/browse/RHEL-27755
Conflicts: RHEL does not include commit 1ded5e5a5931 ("net: annotate
data-races around sock->ops"), which converts proto_ops to a const
accessed with READ_ONCE.  Fix up the patch to apply, but keep the
READ_ONCE from 1ded5e5a5931.

commit 0b05b0cd78c92371fdde6333d006f39eaf9e0860
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Oct 16 06:47:42 2023 -0700

    net/socket: Break down __sys_getsockopt
    
    Split __sys_getsockopt() into two functions by removing the core
    logic into a sub-function (do_sock_getsockopt()). This will avoid
    code duplication when doing the same operation in other callers, for
    instance.
    
    do_sock_getsockopt() will be called by io_uring getsockopt() command
    operation in the following patch.
    
    The same was done for the setsockopt pair.
    
    Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20231016134750.1381153-5-leitao@debian.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 09:57:34 -04:00
Lucas Zampieri 35e08b2a2f Merge: net/other: phase-1 stable backports for RHEL-9.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4078

JIRA: https://issues.redhat.com/browse/RHEL-33410  
JIRA: https://issues.redhat.com/browse/RHEL-30875  
Upstream Status: All mainline in net.git.  
Depends: !3968  
Tested: boot-tested only  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-27 14:03:58 +00:00
Lucas Zampieri b6c99fa14e Merge: tcp: fix accounted memory leak on ARM
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4195

JIRA: https://issues.redhat.com/browse/RHEL-34070
Tested: LNST, Tier1

The issue is addressed by the last commit in the merge. The previous 2 changes are pre-requisites to avoid any conflict in the last one.
Technically neither of the pre-reqs is strictly needed: we could adapt the fix with moderate effort and limited risk, but both pre-reqs are small and clean enough to be a valuable alternative. Additionally, the first one provides a new sysctl that will very likely be used or requested by the users affected by this issue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-19 18:00:02 +00:00
Davide Caratti 4f983c4399 af_unix: Fix data race around sk->sk_err.
JIRA: https://issues.redhat.com/browse/RHEL-33410
Upstream Status: net.git commit b192812905e4b134f7b7994b079eb647e9d2d37e

commit b192812905e4b134f7b7994b079eb647e9d2d37e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Sep 1 17:27:08 2023 -0700

    af_unix: Fix data race around sk->sk_err.

    As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be
    read locklessly by unix_dgram_sendmsg().

    Let's use READ_ONCE() for sk_err as well.

    Note that the writer side is marked by commit cc04410af7de ("af_unix:
    annotate lockless accesses to sk->sk_err").

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-06 11:39:13 +02:00
Davide Caratti 4f44d35947 af_unix: Fix data-races around sk->sk_shutdown.
JIRA: https://issues.redhat.com/browse/RHEL-33410
Upstream Status: net.git commit afe8764f76346ba838d4f162883e23d2fcfaa90e

commit afe8764f76346ba838d4f162883e23d2fcfaa90e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Sep 1 17:27:07 2023 -0700

    af_unix: Fix data-races around sk->sk_shutdown.

    sk->sk_shutdown is changed under unix_state_lock(sk), but
    unix_dgram_sendmsg() calls two functions to read sk_shutdown locklessly.

      sock_alloc_send_pskb
      `- sock_wait_for_wmem

    Let's use READ_ONCE() there.

    Note that the writer side was marked by commit e1d09c2c2f57 ("af_unix:
    Fix data races around sk->sk_shutdown.").

    BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock

    write (marked) to 0xffff8880069af12c of 1 bytes by task 1 on cpu 1:
     unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
     unix_release+0x59/0x80 net/unix/af_unix.c:1053
     __sock_release+0x7d/0x170 net/socket.c:654
     sock_close+0x19/0x30 net/socket.c:1386
     __fput+0x2a3/0x680 fs/file_table.c:384
     ____fput+0x15/0x20 fs/file_table.c:412
     task_work_run+0x116/0x1a0 kernel/task_work.c:179
     resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
     exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
     __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
     syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
     do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
     entry_SYSCALL_64_after_hwframe+0x6e/0xd8

    read to 0xffff8880069af12c of 1 bytes by task 28650 on cpu 0:
     sock_alloc_send_pskb+0xd2/0x620 net/core/sock.c:2767
     unix_dgram_sendmsg+0x2f8/0x14f0 net/unix/af_unix.c:1944
     unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
     unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
     sock_sendmsg_nosec net/socket.c:725 [inline]
     sock_sendmsg+0x148/0x160 net/socket.c:748
     ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
     ___sys_sendmsg+0xc6/0x140 net/socket.c:2548
     __sys_sendmsg+0x94/0x140 net/socket.c:2577
     __do_sys_sendmsg net/socket.c:2586 [inline]
     __se_sys_sendmsg net/socket.c:2584 [inline]
     __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x6e/0xd8

    value changed: 0x00 -> 0x03

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 28650 Comm: systemd-coredum Not tainted 6.4.0-11989-g6843306689af #6
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzkaller <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-06 11:39:13 +02:00
Ivan Vecera 6d5dcfe050 net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
JIRA: https://issues.redhat.com/browse/RHEL-36217

commit b534dc46c8ae0165b1b2509be24dbea4fa9c4011
Author: Willem de Bruijn <willemb@google.com>
Date:   Wed Dec 7 09:37:01 2022 -0500

    net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP

    Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
    write_seq sockets instead of snd_una.

    This should have been the behavior from the start. Because processes
    may now exist that rely on the established behavior, do not change
    behavior of the existing option, but add the right behavior with a new
    flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on
    stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID.

    Intuitively the contract is that the counter is zero after the
    setsockopt, so that the next write N results in a notification for
    the last byte N - 1.

    On idle sockets snd_una == write_seq and this holds for both. But on
    sockets with data in transmission, snd_una records the unacked offset
    in the stream. This depends on the ACK response from the peer. A
    process cannot learn this in a race free manner (ioctl SIOCOUTQ is one
    racy approach).

    write_seq records the offset at the last byte written by the process.
    This is a better starting point. It matches the intuitive contract in
    all circumstances, unaffected by external behavior.

    The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
    Move the field in struct sock to avoid growing the socket (for some
    common CONFIG variants). The UAPI interface so_timestamping.flags is
    already int, so 32 bits wide.

    Reported-by: Sotirios Delimanolis <sotodel@meta.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-16 18:34:21 +02:00
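
The difference between the two starting points described above can be worked through with a small arithmetic model. This is a sketch under stated assumptions: `struct stream_sketch` and `next_write_id()` are hypothetical, offsets are plain byte counts, and `write_seq` is modelled as the offset where the next written byte will land.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a TCP stream's send state (byte offsets only). */
struct stream_sketch {
	uint32_t snd_una;   /* first byte not yet acknowledged by the peer */
	uint32_t write_seq; /* offset where the next written byte will land */
};

/*
 * ID reported for the last byte of the next write of `len` bytes, given
 * the `base` offset the counter was initialised from at setsockopt time.
 */
uint32_t next_write_id(const struct stream_sketch *s, uint32_t base,
		       uint32_t len)
{
	assert(len > 0);
	return s->write_seq + len - 1 - base;
}
```

With 600 bytes in flight (`snd_una` = 400, `write_seq` = 1000), a write of N = 100 bytes gets ID 99 (N - 1) when the base is `write_seq`, but 699 when the base is `snd_una`. On an idle stream the two bases are equal and both give 99.
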
Paolo Abeni 3c85017f3c tcp: sk_forced_mem_schedule() optimization
JIRA: https://issues.redhat.com/browse/RHEL-34070
Tested: LNST, Tier1

Upstream commit:
commit 219160be496f7f9cd105c5708e37cf22ab4ce0c7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Jun 10 20:30:16 2022 -0700

    tcp: sk_forced_mem_schedule() optimization

    sk_memory_allocated_add() has three callers, and returns
    to them @memory_allocated.

    sk_forced_mem_schedule() is one of them, and ignores
    the returned value.

    Change sk_memory_allocated_add() to return void.

    Change sock_reserve_memory() and __sk_mem_raise_allocated()
    to call sk_memory_allocated().

    This removes one cache line miss [1] for RPC workloads,
    as first skbs in TCP write queue and receive queue go through
    sk_forced_mem_schedule().

    [1] Cache line holding tcp_memory_allocated.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-08 11:52:36 +02:00
Paolo Abeni bd6b15c986 net: make SK_MEMORY_PCPU_RESERV tunable
JIRA: https://issues.redhat.com/browse/RHEL-34070
Tested: LNST, Tier1
Conflicts: different context in sysctl_net_core.c, as rhel-9 lacks \
  the upstream series cb636b3e372b ("Merge branch 'use-standard-sysctl-macro'")

Upstream commit:
commit 12a686c2e761f1f1f6e6e2117a9ab9c6de2ac8a7
Author: Adam Li <adamli@os.amperecomputing.com>
Date:   Mon Feb 26 02:24:52 2024 +0000

    net: make SK_MEMORY_PCPU_RESERV tunable

    This patch adds /proc/sys/net/core/mem_pcpu_rsv sysctl file,
    to make SK_MEMORY_PCPU_RESERV tunable.

    Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
    memory_allocated") introduced per-cpu forward alloc cache:

    "Implement a per-cpu cache of +1/-1 MB, to reduce number
    of changes to sk->sk_prot->memory_allocated, which
    would otherwise be cause of false sharing."

    sk_prot->memory_allocated points to global atomic variable:
    atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;

    If increasing the per-cpu cache size from 1MB to e.g. 16MB,
    changes to sk->sk_prot->memory_allocated can be further reduced.
    Performance may be improved on system with many cores.

    Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
    Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-08 11:44:38 +02:00
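
The effect of a larger per-cpu reserve can be illustrated with a single-CPU model that counts writes to the shared counter. This is a simplified sketch, not the kernel implementation: all names are hypothetical, amounts are in pages, and 4 KiB pages are assumed, so 1MB is 256 pages and 16MB is 4096 pages.

```c
#include <assert.h>

/* Hypothetical single-CPU model of a per-cpu reserve in front of a
 * shared counter. */
struct reserve_sketch {
	long global_allocated; /* models the shared atomic counter */
	long local_reserve;    /* models this CPU's cached amount */
	long global_updates;   /* how often the shared counter was written */
	long reserve_limit;    /* models the tunable reserve, in pages */
};

/* Charge pages locally; flush to the shared counter at the limit. */
void charge_pages(struct reserve_sketch *r, long pages)
{
	assert(pages > 0);
	r->local_reserve += pages;
	if (r->local_reserve >= r->reserve_limit) {
		r->global_allocated += r->local_reserve;
		r->local_reserve = 0;
		r->global_updates++;
	}
}

/* Count shared-counter writes caused by `charges` one-page charges. */
long count_global_updates(long reserve_limit, long charges)
{
	struct reserve_sketch r = { 0, 0, 0, reserve_limit };
	long i;

	for (i = 0; i < charges; i++)
		charge_pages(&r, 1);
	return r.global_updates;
}
```

In this model, 4096 one-page charges write the shared counter 16 times with a 256-page reserve and once with a 4096-page reserve, which is the reduction in shared cache line traffic the commit message refers to.
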
Sabrina Dubroca 7bc5eeb384 net: skbuff: generalize the skb->decrypted bit
JIRA: https://issues.redhat.com/browse/RHEL-29306

commit 9f06f87fef689d28588cde8c7ebb00a67da34026
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 3 13:21:39 2024 -0700

    net: skbuff: generalize the skb->decrypted bit

    The ->decrypted bit can be reused for other crypto protocols.
    Remove the direct dependency on TLS, add helpers to clean up
    the ifdefs leaking out everywhere.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-05-01 17:48:16 +02:00
Lucas Zampieri f112e4de2c Merge: CNB95: netlink/devlink: update devlink & netlink to the v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3939

JIRA: https://issues.redhat.com/browse/RHEL-30656  
Tested: LNST  
Depends: !3918 

The series updates netlink and devlink core to upstream version v6.6. Both have to be updated at once due to circular dependencies.

Omitted-fix: 83f2df9d66bc
            The fix needs additional devlink dependencies and will be applied in the next rebase series, covered by RHEL-30145

Commits:
```
6978052448f9 ("netlink: remove unused 'compare' function")
74bf6477c18b ("netlink-specs: add partial specification for devlink")
82b3297009b6 ("netlink: specs: allow uapi-header in genetlink")
56c874f7dbca ("tools: ynl: skip the explicit op array size when not needed")
8da3a5598f75 ("ynl: allow to encode u8 attr")
bc77f7318da8 ("tools: ynl: add the Python requirements.txt file")
dd3a7d58dcc2 ("tools: ynl: Add missing types to encode/decode")
4c6170d1ae2c ("tools: ynl: default to treating enums as flags for mask generation")
bec0b7a2db35 ("tools: ynl: Add struct parsing to nlspec")
b423c3c86325 ("tools: ynl: Add C array attribute decoding to ynl")
2607191395bd ("tools: ynl: Add struct attr decoding to ynl")
f036d936ca57 ("tools: ynl: Add fixed-header support to ynl")
643ef4a676e3 ("netlink: specs: add partial specification for openvswitch")
88e288968412 ("docs: netlink: document struct support for genetlink-legacy")
04eac39361d3 ("docs: netlink: document the sub-type attribute property")
9f7cc57fe550 ("tools: ynl: support byte-order in cli")
a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
48993e22d23a ("tools: ynl: replace print with NlError")
f3d07b02b2b8 ("tools: ynl: ethtool testing tool")
ebe3bdc4359e ("tools: ynl: throw a more meaningful exception if family not supported")
3ea31e66644b ("tools: ynl: Remove absolute paths to yaml files from ethtool testing tool")
85a4abed1554 ("tools: ynl: Rename ethtool to ethtool.py")
d913d32cc270 ("netlink: Use copy_to_user() for optval in netlink_getsockopt().")
a939d14919b7 ("netlink: annotate accesses to nlk->cb_running")
7c2435ef76e5 ("tools: ynl: Use dict of predefined Structs to decode scalar types")
bddd2e561b0a ("tools: ynl: Handle byte-order in struct members")
081e8df68199 ("tools: ynl: avoid dict errors on older Python versions")
9b66ee06e5ca ("net: ynl: prefix uAPI header include with uapi/")
0684f29a89e5 ("netlink: specs: correct types of legacy arrays")
6d6bae63053d ("doc: ynl: Add doc attr to struct members in genetlink-legacy spec")
5ac18889bde0 ("tools: ynl: Initialise fixed headers to 0 in genetlink-legacy")
313a7a808ca8 ("tools: ynl: Support enums in struct members in genetlink-legacy")
93b230b549bc ("netlink: specs: add ynl spec for ovs_flow")
f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
91dfaef243cd ("tools: ynl-gen: add extra headers for user space")
6ad49839ba9b ("tools: ynl-gen: fix unused / pad attribute handling")
67c65ce762ad ("tools: ynl-gen: don't override pure nested struct")
5605f102378f ("tools: ynl-gen: loosen type consistency check for events")
eef9b794eac8 ("tools: ynl-gen: add error checking for nested structs")
21b6e302789c ("tools: ynl-gen: generate enum-to-string helpers")
dc0956c98f11 ("tools: ynl-gen: move the response reading logic into YNL")
5d58f911c755 ("tools: ynl-gen: generate alloc and free helpers for req")
8cb6afb33541 ("tools: ynl-gen: switch to family struct")
59d814f0f285 ("tools: ynl-gen: generate static descriptions of notifications")
a99bfdf64795 ("tools: ynl-gen: clean up stray new lines at the end of reply-less requests")
86878f14d71a ("tools: ynl: user space helpers")
d75fdfbc6f26 ("tools: ynl: support fou and netdev in C")
ee0202e2e731 ("tools: ynl: add sample for netdev")
f6ca5baf2a86 ("netlink: specs: ethtool: fix random typos")
2cc9671a82e3 ("tools: ynl-gen: fill in support for MultiAttr scalars")
58da455b31ba ("tools: ynl-gen: improve unwind on parsing errors")
7a11f70ce882 ("tools: ynl: generate code for the handshake family")
8947e5037371 ("netlink: specs: devlink: fill in some details important for C")
9858bfc271de ("tools: ynl-gen: use enum names in op strmap more carefully")
6f115d4575ab ("tools: ynl-gen: refactor strmap helper generation")
ff6db4b58c93 ("tools: ynl-gen: enable code gen for directional specs")
6afaa0ef9b0e ("tools: ynl-gen: try to sort the types more intelligently")
37487f93b125 ("tools: ynl-gen: inherit struct use info")
eae7af21bdb9 ("tools: ynl-gen: walk nested types in depth")
168dea20ecef ("tools: ynl-gen: don't generate forward declarations for policies")
0a9471219672 ("tools: ynl-gen: don't generate forward declarations for policies - regen")
5d1a30eb989a ("tools: ynl: generate code for the devlink family")
fff8660b5425 ("tools: ynl: add sample for devlink")
30b5c720e1a9 ("tools: ynl-gen: cleanup user space header includes")
9b52fd4b6305 ("tools: ynl: regen: cleanup user space header includes")
820343ccbb2e ("tools: ynl-gen: complete the C keyword list")
2c0f1466867c ("tools: ynl-gen: combine else with closing bracket")
e4ea3cc68472 ("tools: ynl-gen: get attr type outside of if()")
7234415b8f86 ("tools: ynl: regen: regenerate the if ladders")
f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
d0915d64c3a6 ("tools: ynl: regen: stop generating common notification handlers")
ced1568862bd ("tools: ynl-gen: sanitize notification tracking")
6da3424fd629 ("tools: ynl-gen: support code gen for events")
6f96ec73cb5a ("tools: ynl-gen: don't pass op_name to RenderInfo")
76abff37f0d7 ("tools: ynl-gen: support / skip pads on the way to kernel")
008bcd6835a2 ("tools: ynl-gen: support excluding tricky ops")
33eedb0071c8 ("tools: ynl-gen: record extra args for regen")
ed2042cc77f1 ("netlink: specs: support setting prefix-name per attribute")
d4813b11d679 ("netlink: specs: ethtool: add C render hints")
dddc9f53da3e ("tools: ynl-gen: don't generate enum types if unnamed")
2c9d47a095f7 ("tools: ynl-gen: resolve enum vs struct name conflicts")
180ad455273a ("netlink: specs: ethtool: add empty enum stringset")
37c852222712 ("netlink: specs: ethtool: untangle UDP tunnels and cable test a bit")
709d0c3b3d4c ("netlink: specs: ethtool: untangle stats-get")
68335713d2ea ("netlink: specs: ethtool: mark pads as pads")
2d7be507d65e ("tools: ynl: generate code for the ethtool family")
f561ff232a6b ("tools: ynl: add sample for ethtool")
10c4d2a7b88d ("tools: ynl-gen: correct enum policies")
be093a80dff0 ("tools: ynl-gen: inherit policy in multi-attr")
fa0e21fa4443 ("rtnetlink: extend RTEXT_FILTER_SKIP_STATS to IFLA_VF_INFO")
89da780aa4c7 ("rtnetlink: move validate_linkmsg out of do_setlink")
f0ec58d557d6 ("tools: ynl: work around stale system headers")
6907217a8054 ("netlink: specs: fixup openvswitch specs for code generation")
8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
0c3d6fd4b89c ("tools: ynl: improve the direct-include header guard logic")
737eab775d36 ("netlink: specs: add display-hint to schema definitions")
d8eea68d913c ("tools: ynl: add display-hint support to ynl")
334f39ce17ef ("netlink: specs: add display hints to ovs_flow")
25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
b8e39b38487e ("netlink: Make use of __assign_bit() API")
633d76ad01ad ("devlink: remove reload failed checks in params get/set callbacks")
4a59cdfd6699 ("rtnetlink: Move nesting cancellation rollback to proper function")
5766946ea511 ("genetlink: add explicit ordering break check for split ops")
a3377386b564 ("netlink: Reverse the patch which removed filtering")
a4c9a56e6a2c ("netlink: Add new netlink_release function")
d7ddf5f4269f ("tools: ynl-gen: fix enum index in _decode_enum(..)")
df15c15e6c98 ("tools: ynl-gen: fix parse multi-attr enum attribute")
5fac9b7c16c5 ("netlink: allow be16 and be32 types in all uint policy checks")
e5c157f081ab ("ynl: expose xdp-zc-max-segs")
37844828d290 ("ynl: mark max/mask as private for kdoc")
25b5a2a1905f ("ynl: regenerate all headers")
26fdb67e8b4a ("ynl: print xdp-zc-max-segs in the sample")
759ab1edb56c ("net: store netdevs in an xarray")
84e00d9bd4e4 ("net: convert some netlink netdev iterators to depend on the xarray")
2628d40899d1 ("devlink: Remove unused extern declaration devlink_port_region_destroy()")
78c96d7b7c9a ("netlink: specs: add dump-strict flag for dont-validate property")
dc7b81a828db ("ynl-gen-c.py: filter rendering of validate field values for split ops")
eab7be688b44 ("ynl-gen-c.py: allow directional model for kernel mode")
fa8ba3502ade ("ynl-gen-c.py: render netlink policies static for split ops")
ba0f66c95fa6 ("devlink: rename devlink_nl_ops to devlink_nl_small_ops")
d61aedcf628e ("devlink: rename couple of doit netlink callbacks to match generated names")
491a24872a64 ("devlink: introduce couple of dumpit callbacks for split ops")
8300dce542e4 ("devlink: un-static devlink_nl_pre/post_doit()")
759f661012d1 ("netlink: specs: devlink: add info-get dump op")
6b7c486cae81 ("devlink: add split ops generated according to spec")
b2551b1517d8 ("devlink: include the generated netlink header")
6e067d0cab68 ("devlink: use generated split ops and remove duplicated commands from small ops")
b876b71a6ac2 ("devlink: Remove unused devlink_dpipe_table_resource_set() declaration")
2c0e9f3806c4 ("tools: ynl-gen: avoid rendering empty validate field")
832140804e3b ("devlink: clear flag on port register error path")
cd3112ebbaf4 ("tools: ynl-gen: add missing empty line between policies")
8fe08d70a2b6 ("netlink: convert nlk->flags to atomic flags")
63618463cb94 ("devlink: parse linecard attr in doit() callbacks")
41a1d4d1399a ("devlink: parse rate attrs in doit() callbacks")
ee6d78ac28c7 ("devlink: introduce devlink_nl_pre_doit_port*() helper functions")
8fa995ad1f7f ("devlink: rename doit callbacks for per-instance dump commands")
24c8e56d4f98 ("devlink: introduce dumpit callbacks for split ops")
7d3c6fec6135 ("devlink: pass flags as an arg of dump_one() callback")
7199c86247e9 ("netlink: specs: devlink: add commands that do per-instance dump")
ddff283280ba ("devlink: remove duplicate temporary netlink callback prototypes")
833e479d330c ("devlink: remove converted commands from small ops")
4a1b5aa8b5c7 ("devlink: allow user to narrow per-instance dumps by passing handle attrs")
34493336e7d3 ("netlink: specs: devlink: extend per-instance dump commands to accept instance attributes")
b03f13cb67a5 ("devlink: extend health reporter dump selector by port index")
0149bca17262 ("netlink: specs: devlink: extend health reporter dump attributes by port index")
84817d8c6042 ("genetlink: push conditional locking into dumpit/done")
fde9bd4a4d41 ("genetlink: make genl_info->nlhdr const")
bffcc6882a1b ("genetlink: remove userhdr from struct genl_info")
9272af109fe6 ("genetlink: add struct genl_info to struct genl_dumpit_info")
7288dd2fd488 ("genetlink: use attrs from struct genl_info")
5c670a010de4 ("genetlink: add a family pointer to struct genl_info")
5aa51d9f889c ("genetlink: add genlmsg_iput() API")
0e19d3108aea ("netdev-genl: use struct genl_info for reply construction")
ec0e5b09b834 ("ethtool: netlink: simplify arguments to ethnl_default_parse()")
f946270d05c2 ("ethtool: netlink: always pass genl_info to .prepare_data")
956db0a13b47 ("net: warn about attempts to register negative ifindex")
ded67d90815a ("netlink: specs: add ovs_vport new command")
7582113c6917 ("tools: ynl: add more info to KeyErrors on missing attrs")
d56b699d76d1 ("Documentation: Fix typos")
f65f305ae008 ("tools: ynl-gen: use temporary file for rendering")
f534f6581ec0 ("net: validate veth and vxcan peer ifindexes")
649bde9004ac ("tools: ynl: allow passing binary data")
a149a3a13bbc ("tools: ynl-gen: set length of binary fields")
dc2ef94d8926 ("tools: ynl-gen: fix collecting global policy attrs")
4c8c24e801e6 ("tools: ynl-gen: support empty attribute lists")
e83d4e9b2d0f ("netlink: specs: fix indent in fou")
a02430c06f56 ("tools: ynl-gen: fix uAPI generation after tempfile changes")
52d08fda3516 ("doc/netlink: Add delete operation to ovs_vport spec")
ed68c58c0eb4 ("doc/netlink: Add a schema for netlink-raw families")
294f37fc8772 ("doc/netlink: Update genetlink-legacy documentation")
2db8abf0b455 ("doc/netlink: Document the netlink-raw schema extensions")
88901b967958 ("tools/ynl: Add mcast-group schema parsing to ynl")
fb0a06d455d6 ("tools/net/ynl: Fix extack parsing with fixed header genlmsg")
e46dd903efe3 ("tools/net/ynl: Add support for netlink-raw families")
0493e56d021d ("tools/net/ynl: Implement nlattr array-nest decoding in ynl")
1768d8a767f8 ("tools/net/ynl: Add support for create flags")
dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
b2f63d904e72 ("doc/netlink: Add spec for rt link messages")
023289b4f582 ("doc/netlink: Add spec for rt route messages")
56e65312830e ("devlink: push object register/unregister notifications into separate helpers")
eec1e5ea1d71 ("devlink: push port related code into separate file")
2b4d8bb08889 ("devlink: push shared buffer related code into separate file")
2475ed158c47 ("devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper")
a9fd44b15fc5 ("devlink: push dpipe related code into separate file")
a9f960074ecd ("devlink: push resource related code into separate file")
830c41e1e987 ("devlink: push param related code into separate file")
1aa47ca1f52e ("devlink: push region related code into separate file")
85facf94fd80 ("devlink: use tracepoint_enabled() helper")
4bbdec80ff27 ("devlink: push trap related code into separate file")
7cc7194e85ca ("devlink: push rate related code into separate file")
9edbe6f36c5f ("devlink: push linecard related code into separate file")
890c55667437 ("devlink: move tracepoint definitions into core.c")
29a390d17748 ("devlink: move small_ops definition into netlink.c")
71179ac5c211 ("devlink: move devlink_notify_register/unregister() to dev.c")
ee940b57a929 ("doc/netlink: Fix missing classic_netlink doc reference")
d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
0f4d44f6ee04 ("netlink: specs: devlink: fix reply command values")
69844e335d8c ("selftests/bpf: Fix sockopt_sk selftest")
e4fe082c38cd ("tools: ynl: make sure we always pass yarg to mnl_cb_run")
5d78b73e8514 ("tools: ynl: don't leak mcast_groups on init error")
b6c65eb20ffa ("tools: ynl: fix handling of multiple mcast groups")
ceaac91dcd06 ("net: make sure we never create ifindex = 0")
0e0939c0adf9 ("net-procfs: use xarray iterator to implement /proc/net/dev")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-26 12:33:53 +00:00
Ivan Vecera 764a373f7a netlink: Add __sock_i_ino() for __netlink_diag_dump().
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 25a9c8a4431c364f97f75558cb346d2ad3f53fbb
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jun 26 09:43:13 2023 -0700

    netlink: Add __sock_i_ino() for __netlink_diag_dump().

    syzbot reported a warning in __local_bh_enable_ip(). [0]

    Commit 8d61f926d420 ("netlink: fix potential deadlock in
    netlink_set_err()") converted read_lock(&nl_table_lock) to
    read_lock_irqsave() in __netlink_diag_dump() to prevent a deadlock.

    However, __netlink_diag_dump() calls sock_i_ino() that uses
    read_lock_bh() and read_unlock_bh().  If CONFIG_TRACE_IRQFLAGS=y,
    read_unlock_bh() finally enables IRQ even though it should stay
    disabled until the following read_unlock_irqrestore().

    Using read_lock() in sock_i_ino() would trigger a lockdep splat
    in another place that was fixed in commit f064af1e50 ("net: fix
    a lockdep splat"), so let's add __sock_i_ino() that would be safe
    to use under BH disabled.

    [0]:
    WARNING: CPU: 0 PID: 5012 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
    Modules linked in:
    CPU: 0 PID: 5012 Comm: syz-executor487 Not tainted 6.4.0-rc7-syzkaller-00202-g6f68fc395f49 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
    RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
    Code: 45 bf 01 00 00 00 e8 91 5b 0a 00 e8 3c 15 3d 00 fb 65 8b 05 ec e9 b5 7e 85 c0 74 58 5b 5d c3 65 8b 05 b2 b6 b4 7e 85 c0 75 a2 <0f> 0b eb 9e e8 89 15 3d 00 eb 9f 48 89 ef e8 6f 49 18 00 eb a8 0f
    RSP: 0018:ffffc90003a1f3d0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff1cf5996
    RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffff8805c6f3
    RBP: ffffffff8805c6f3 R08: 0000000000000001 R09: ffff8880152b03a3
    R10: ffffed1002a56074 R11: 0000000000000005 R12: 00000000000073e4
    R13: dffffc0000000000 R14: 0000000000000002 R15: 0000000000000000
    FS:  0000555556726300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000045ad50 CR3: 000000007c646000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     sock_i_ino+0x83/0xa0 net/core/sock.c:2559
     __netlink_diag_dump+0x45c/0x790 net/netlink/diag.c:171
     netlink_diag_dump+0xd6/0x230 net/netlink/diag.c:207
     netlink_dump+0x570/0xc50 net/netlink/af_netlink.c:2269
     __netlink_dump_start+0x64b/0x910 net/netlink/af_netlink.c:2374
     netlink_dump_start include/linux/netlink.h:329 [inline]
     netlink_diag_handler_dump+0x1ae/0x250 net/netlink/diag.c:238
     __sock_diag_cmd net/core/sock_diag.c:238 [inline]
     sock_diag_rcv_msg+0x31e/0x440 net/core/sock_diag.c:269
     netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2547
     sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
     netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
     netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
     netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1914
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg+0xde/0x190 net/socket.c:747
     ____sys_sendmsg+0x71c/0x900 net/socket.c:2503
     ___sys_sendmsg+0x110/0x1b0 net/socket.c:2557
     __sys_sendmsg+0xf7/0x1c0 net/socket.c:2586
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f5303aaabb9
    Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffc7506e548 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5303aaabb9
    RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
    RBP: 00007f5303a6ed60 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5303a6edf0
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
     </TASK>

    Fixes: 8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
    Reported-by: syzbot+5da61cf6a9bc1902d422@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?extid=5da61cf6a9bc1902d422
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230626164313.52528-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:27 +02:00
Paolo Abeni 279c8c2ceb udp: fix busy polling
JIRA: https://issues.redhat.com/browse/RHEL-32270
Tested: LNST, Tier1

Upstream commit:
commit a54d51fb2dfb846aedf3751af501e9688db447f5
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jan 18 20:17:49 2024 +0000

    udp: fix busy polling

    Generic sk_busy_loop_end() only looks at sk->sk_receive_queue
    for presence of packets.

    The problem is that, for UDP sockets after the blamed commit, some
    packets could be present in another queue: udp_sk(sk)->reader_queue

    In some cases, a busy poller could spin until timeout expiration,
    even if some packets are available in udp_sk(sk)->reader_queue.
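
A minimal userspace model of the check described above, not the kernel code. All names (`model_udp_sock`, `model_busy_loop_end_generic`, `model_busy_loop_end_udp`) are hypothetical, the queues are reduced to packet counts, and the timeout handling of the real busy-poll loop is left out:

```c
#include <stdbool.h>

/* Hypothetical model: each queue is reduced to a packet count. */
struct model_udp_sock {
	int receive_queue_len;	/* models sk->sk_receive_queue */
	int reader_queue_len;	/* models udp_sk(sk)->reader_queue */
};

/* Behaviour described as the problem: only the generic queue is
 * inspected, so packets parked in the reader queue go unnoticed. */
static bool model_busy_loop_end_generic(const struct model_udp_sock *sk)
{
	return sk->receive_queue_len > 0;
}

/* UDP-aware variant: stop polling if either queue holds a packet. */
static bool model_busy_loop_end_udp(const struct model_udp_sock *sk)
{
	return sk->receive_queue_len > 0 || sk->reader_queue_len > 0;
}
```

With one packet in the reader queue and none in the receive queue, the generic check keeps the poller spinning while the UDP-aware check ends the loop.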

    v3: - make sk_busy_loop_end() nicer (Willem)

    v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats.
        - add a sk_is_inet() check in sk_is_udp() (Willem feedback)
        - add a sk_is_inet() check in sk_is_tcp().

    Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-09 18:19:35 +02:00
Paolo Abeni 251dcf6d75 udp6: Fix race condition in udp6_sendmsg & connect
JIRA: https://issues.redhat.com/browse/RHEL-32270
Tested: LNST, Tier1

Upstream commit:
commit 448a5ce1120c5bdbce1f1ccdabcd31c7d029f328
Author: Vladislav Efanov <VEfanov@ispras.ru>
Date:   Tue May 30 14:39:41 2023 +0300

    udp6: Fix race condition in udp6_sendmsg & connect

    Syzkaller got the following report:
    BUG: KASAN: use-after-free in sk_setup_caps+0x621/0x690 net/core/sock.c:2018
    Read of size 8 at addr ffff888027f82780 by task syz-executor276/3255

    The function sk_setup_caps (called by ip6_sk_dst_store_flow->
    ip6_dst_store) referenced already freed memory as this memory was
    freed by parallel task in udpv6_sendmsg->ip6_sk_dst_lookup_flow->
    sk_dst_check.

              task1 (connect)              task2 (udp6_sendmsg)
            sk_setup_caps->sk_dst_set |
                                      |  sk_dst_check->
                                      |      sk_dst_set
                                      |      dst_release
            sk_setup_caps references  |
            to already freed dst_entry|

    The reason for this race condition is: sk_setup_caps() keeps using
    the dst after transferring the ownership to the dst cache.
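
A single-threaded sketch of the ordering issue described above, under the assumption that the remedy is to finish every use of the dst before handing it to the cache. The types and functions (`model_dst`, `model_sock`, `model_setup_caps`) are hypothetical and only model the idea; they are not the kernel implementation:

```c
/* Hypothetical stand-ins for dst_entry and struct sock. */
struct model_dst {
	unsigned int features;	/* data the setup code needs to read */
};

struct model_sock {
	struct model_dst *dst_cache;	/* owner of the dst after hand-over */
	unsigned int route_caps;
};

/* Models transferring ownership of the dst to the socket's cache.
 * After this call another task may replace and release the dst. */
static void model_dst_set(struct model_sock *sk, struct model_dst *dst)
{
	sk->dst_cache = dst;
}

/* Read everything needed from the dst first, publish it last, so the
 * function never touches the dst once ownership has been given away. */
static void model_setup_caps(struct model_sock *sk, struct model_dst *dst)
{
	sk->route_caps = dst->features;
	model_dst_set(sk, dst);
}
```

The model cannot reproduce the race itself; it only shows the order of operations that keeps the setup function from dereferencing a pointer it no longer owns.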

    Found by Linux Verification Center (linuxtesting.org) with syzkaller.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-09 18:19:27 +02:00
Bastien Nocera 73ed5e4957 net: annotate data-races around sk->sk_lingertime
JIRA: https://issues.redhat.com/browse/RHEL-17138

commit bc1fb82ae11753c5dec53c667a055dc37796dbd2
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Aug 19 04:06:46 2023 +0000

    net: annotate data-races around sk->sk_lingertime

    sk_getsockopt() runs locklessly. This means sk->sk_lingertime
    can be read while other threads are changing its value.

    Other reads also happen without socket lock being held,
    and must be annotated.
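
A userspace sketch of what such an annotation looks like, assuming a GCC- or Clang-compatible compiler for `__typeof__`. The `MODEL_READ_ONCE()`/`MODEL_WRITE_ONCE()` macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), and `model_sock` is a hypothetical structure:

```c
/* Simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE():
 * the volatile access makes the compiler emit each load or store
 * once, as written, instead of re-reading or caching the value. */
#define MODEL_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical socket with a linger time that readers access
 * without holding the socket lock. */
struct model_sock {
	unsigned long lingertime;
};

/* Writer side: annotated store. */
static void model_set_lingertime(struct model_sock *sk, unsigned long t)
{
	MODEL_WRITE_ONCE(sk->lingertime, t);
}

/* Lockless reader side: annotated load. */
static unsigned long model_get_lingertime(const struct model_sock *sk)
{
	return MODEL_READ_ONCE(sk->lingertime);
}
```

The annotations do not add locking or ordering; they only document the intentional lockless access and keep the compiler from tearing or duplicating it.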

    Remove preprocessor logic using BITS_PER_LONG, compilers
    are smart enough to figure this by themselves.

    v2: fixed a clang W=1 (-Wtautological-constant-out-of-range-compare) warning
        (Jakub)

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-01-11 16:47:24 +01:00
Scott Weaver c6519990cd Merge: net: visibility patches
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3447

JIRA: https://issues.redhat.com/browse/RHEL-17413

A set of various visibility / debuggability improvements related to the net stack.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-02 10:35:00 -05:00
Scott Weaver 8d95883db0 Merge: io_uring: update to upstream v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3318

Update io_uring and its dependencies to upstream kernel version 6.6.

JIRA: https://issues.redhat.com/browse/RHEL-12076
JIRA: https://issues.redhat.com/browse/RHEL-14998
JIRA: https://issues.redhat.com/browse/RHEL-4447
CVE: CVE-2023-46862

Omitted-Fix: ab69838e7c75 ("io_uring/kbuf: Fix check of BID wrapping in provided buffers")
Omitted-Fix: f74c746e476b ("io_uring/kbuf: Allow the full buffer id space for provided buffers")

This is the list of new features available (includes upstream kernel versions 6.3-6.6):

    User-specified ring buffer
    Provided Buffers allocated by the kernel
    Ability to register the ring fd
    Multi-shot timeouts
    Ability to pass custom flags to the completion queue entry for ring messages

All of these features are covered by the liburing tests.

In my testing, no-mmap-inval.t failed because of a broken test.  socket-uring-cmd.t also failed because of a missing selinux policy rule.  Try running audit2allow if you see a failure in that test.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-12-16 14:38:47 -05:00
Antoine Tenart 5e0d04b8ef net/sock: Introduce trace_sk_data_ready()
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git
Conflicts:
- drivers/infiniband/hw/erdma/erdma_cm.c chunk missing due to missing
  upstream commit 920d93eac8b9 ("RDMA/erdma: Add connection management
  (CM) support") in c9s.
- Context diff in fs/dlm/lowcomms.c due to missing upstream commit
  dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") in c9s.
- Context diff in net/core/net-traces.c as 8139dccd464a ("udp6: add a
  missing call into udp_fail_queue_rcv_skb tracepoint") was backported
  earlier in c9s.
- Context diff in net/tls/tls_sw.c as 74836ec828fe ("tls: rx: strp:
  don't use GFP_KERNEL in softirq context") was backported earlier in
  c9s.
- Context diff in net/sunrpc/svcsock.c as upstream commit fc80fc2d4e39
  ("SUNRPC: Fix UAF in svc_tcp_listen_data_ready()") was backported
  before in c9s.

commit 40e0b09081420853542571c38875b48b60404ebb
Author: Peilin Ye <peilin.ye@bytedance.com>
Date:   Thu Jan 19 16:45:16 2023 -0800

    net/sock: Introduce trace_sk_data_ready()

    As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
    callback implementations.  For example:

    <...>
      iperf-609  [002] .....  70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
      iperf-609  [002] .....  70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
    <...>

    Suggested-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-12-11 11:15:00 +01:00
Jeff Moyer d2f4f99945 net: remove sk_is_ipmr() and sk_is_icmpv6() helpers
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 634236b34d7a8c9e11c12b0746b83b8942fc8f2e
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jun 19 12:43:35 2023 +0000

    net: remove sk_is_ipmr() and sk_is_icmpv6() helpers
    
    The blamed commit added these helpers to detect RAW-socket-specific
    ioctls.
    
    syzbot complained about it [1].
    
    The issue here is that RAW sockets could pretend there was no need
    to call ipmr_sk_ioctl().
    
    Regardless of inet_sk(sk)->inet_num, we must be prepared
    for ipmr_ioctl() being called later. This must happen
    from ipmr_sk_ioctl() context only.
    
    We could add a safety check in ipmr_ioctl() at the risk of breaking
    applications.
    
    Instead, remove sk_is_ipmr() and sk_is_icmpv6() because their
    name would be misleading, once we change their implementation.
    
    [1]
    BUG: KASAN: stack-out-of-bounds in ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
    Read of size 4 at addr ffffc90003aefae4 by task syz-executor105/5004
    
    CPU: 0 PID: 5004 Comm: syz-executor105 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
    Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
    print_report mm/kasan/report.c:462 [inline]
    kasan_report+0x11c/0x130 mm/kasan/report.c:572
    ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
    raw_ioctl+0x4e/0x1e0 net/ipv4/raw.c:881
    sock_ioctl_out net/core/sock.c:4186 [inline]
    sk_ioctl+0x151/0x440 net/core/sock.c:4214
    inet_ioctl+0x18c/0x380 net/ipv4/af_inet.c:1001
    sock_do_ioctl+0xcc/0x230 net/socket.c:1189
    sock_ioctl+0x1f8/0x680 net/socket.c:1306
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:870 [inline]
    __se_sys_ioctl fs/ioctl.c:856 [inline]
    __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f2944bf6ad9
    Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffd8897a028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2944bf6ad9
    RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003
    RBP: 00007f2944bbac80 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2944bbad10
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    </TASK>
    
    The buggy address belongs to stack of task syz-executor105/5004
    and is located at offset 36 in frame:
    sk_ioctl+0x0/0x440 net/core/sock.c:4172
    
    This frame has 2 objects:
    [32, 36) 'karg'
    [48, 88) 'buffer'
    
    Fixes: e1d001fa5b47 ("net: ioctl: Use kernel memory on protocol ioctl callbacks")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Breno Leitao <leitao@debian.org>
    Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/r/20230619124336.651528-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-03 15:39:16 -04:00
Jeff Moyer 092f5d645a net: ioctl: Use kernel memory on protocol ioctl callbacks
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
  559260fd9d9a ("ipmr: do not acquire mrt_lock in
  ioctl(SIOCGETVIFCNT)").  I also pulled in header changes from commit
  949d6b405e61 ("net: add missing includes and forward declarations
  under net/") to address a build failure with this patch applied.

commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Jun 9 08:27:42 2023 -0700

    net: ioctl: Use kernel memory on protocol ioctl callbacks
    
    Most ioctls for net protocols operate directly on the userspace
    argument (arg), usually calling get_user()/put_user() directly in the
    ioctl callback.  This is not flexible, because it is hard to reuse these
    functions without passing userspace buffers.
    
    Change the "struct proto" ioctls to avoid touching userspace memory and
    operate on kernel buffers, i.e., all protocol ioctl callbacks are
    adapted to operate on kernel memory rather than on userspace memory (so,
    no more {put,get}_user() and friends being called in the ioctl callback).
    
    This changes the "struct proto" ioctl format in the following way:
    
        int                     (*ioctl)(struct sock *sk, int cmd,
    -                                        unsigned long arg);
    +                                        int *karg);
    
    (Important to say that this patch does not touch the "struct proto_ops"
    protocols)
    
    So, the "karg" argument, which is passed to the ioctl callback, is a
    pointer to kernel space memory (allocated inside a wrapper function).
    This buffer (karg) may contain an input argument (copied from userspace
    in a prep function) and it might return a value/buffer, which is copied
    back to userspace if necessary. There is no one-size-fits-all format
    (which is why I am using 'may' above), but basically, there are three
    types of ioctls:
    
    1) Do not read from userspace, and return a result to userspace.
    2) Read an input parameter from userspace, and do not return anything
      to userspace.
    3) Read an input from userspace, and return a buffer to userspace.
    
    The default case (1) (where no input parameter is given, and an "int" is
    returned to userspace) encompasses more than 90% of the cases, but there
    are two other exceptions. Here is a list of exceptions:
    
    * Protocol RAW:
       * cmd = SIOCGETVIFCNT:
         * input and output = struct sioc_vif_req
       * cmd = SIOCGETSGCNT
         * input and output = struct sioc_sg_req
       * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
         argument, which is struct sioc_vif_req. Then the callback populates
         the struct, which is copied back to userspace.
    
    * Protocol RAW6:
       * cmd = SIOCGETMIFCNT_IN6
         * input and output = struct sioc_mif_req6
       * cmd = SIOCGETSGCNT_IN6
         * input and output = struct sioc_sg_req6
    
    * Protocol PHONET:
      * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
         * input int (4 bytes)
      * Nothing is copied back to userspace.
    
    For the exception cases, the function sock_sk_ioctl_inout() will
    copy the userspace input into kernel space, and copy the result back
    to userspace.
    
    The wrapper that prepares the buffer and puts the buffer back to
    userspace is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(),
    the caller now calls sk_ioctl(), which will handle all cases.
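
The wrapper pattern described above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: every `model_*` name is hypothetical, `memcpy()` stands in for copy_from_user()/copy_to_user(), and only ioctl types 1 and 3 from the list above are shown:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type following the new "struct proto" ioctl
 * shape: the callback only ever sees a kernel-side buffer (karg). */
typedef int (*model_ioctl_cb)(int cmd, int *karg);

/* Type 1: nothing is read from userspace, an int is returned. */
static int model_out_cb(int cmd, int *karg)
{
	(void)cmd;
	*karg = 42;	/* stand-in for e.g. a queue length */
	return 0;
}

/* Type 3: a struct is both input and output. */
struct model_req {
	int index;	/* input from userspace */
	int count;	/* output filled in by the callback */
};

static int model_inout_cb(int cmd, int *karg)
{
	struct model_req *req = (struct model_req *)karg;

	(void)cmd;
	req->count = req->index * 10;
	return 0;
}

/* Wrapper for type 1: run the callback on a kernel buffer, then copy
 * the int result out to the "user" buffer. */
static int model_sk_ioctl_out(model_ioctl_cb cb, int cmd, void *uarg)
{
	int karg = 0;
	int ret = cb(cmd, &karg);

	if (!ret)
		memcpy(uarg, &karg, sizeof(karg));	/* models copy_to_user() */
	return ret;
}

/* Wrapper for type 3: copy the input in, run the callback, copy the
 * result back out. */
static int model_sk_ioctl_inout(model_ioctl_cb cb, int cmd, void *uarg,
				size_t size)
{
	void *karg = malloc(size);
	int ret;

	if (!karg)
		return -1;
	memcpy(karg, uarg, size);		/* models copy_from_user() */
	ret = cb(cmd, karg);
	if (!ret)
		memcpy(uarg, karg, size);	/* models copy_to_user() */
	free(karg);
	return ret;
}
```

The point of the split is that the callbacks contain no userspace access at all, so they can be reused by in-kernel callers that already hold the data in kernel memory.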
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:32:16 -04:00
Paolo Abeni 3317959251 net: use sk_forward_alloc_get() in sk_get_meminfo()
JIRA: https://issues.redhat.com/browse/RHEL-14364
Tested: LNST, Tier1

Upstream commit:
commit 66d58f046c9d3a8f996b7138d02e965fd0617de0
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Aug 31 13:52:08 2023 +0000

    net: use sk_forward_alloc_get() in sk_get_meminfo()

    inet_sk_diag_fill() has been changed to use sk_forward_alloc_get(),
    but sk_get_meminfo() was forgotten.

    Fixes: 292e6077b040 ("net: introduce sk_forward_alloc_get()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 13:48:54 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve.  Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Paolo Abeni 5107c808b3 net: add sock_init_data_uid()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit 584f3742890e966d2f0a1f3c418c9ead70b2d99e
Author: Pietro Borrello <borrello@diag.uniroma1.it>
Date:   Sat Feb 4 17:39:20 2023 +0000

    net: add sock_init_data_uid()

    Add sock_init_data_uid() to explicitly initialize the socket uid.
    To initialise the socket uid, sock_init_data() assumes that the struct
    socket *sock is always embedded in a struct socket_alloc, which it uses
    to access the corresponding inode uid. This may not be true.
    Examples are sockets created in tun_chr_open() and tap_open().
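
A userspace sketch of the embedding assumption described above. All `model_*` types and functions are hypothetical simplifications; they only illustrate why deriving the uid through container_of-style arithmetic is valid for an embedded socket and why an explicit-uid variant is needed otherwise:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_socket { int type; };
struct model_inode { unsigned int uid; };
struct model_socket_alloc {
	struct model_socket socket;
	struct model_inode vfs_inode;
};
struct model_sock { unsigned int uid; };

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Only valid when sock really lives inside a model_socket_alloc;
 * for a standalone model_socket this would read unrelated memory. */
static unsigned int model_uid_from_inode(struct model_socket *sock)
{
	return model_container_of(sock, struct model_socket_alloc,
				  socket)->vfs_inode.uid;
}

/* Explicit variant: the caller supplies the uid, so no assumption
 * about how the socket was allocated is needed. */
static void model_sock_init_data_uid(struct model_sock *sk, unsigned int uid)
{
	sk->uid = uid;
}

/* Convenience variant for sockets known to be embedded. */
static void model_sock_init_data(struct model_socket *sock,
				 struct model_sock *sk)
{
	model_sock_init_data_uid(sk, model_uid_from_inode(sock));
}
```

A caller that allocates its socket outside a `model_socket_alloc` would call the `_uid` variant directly with a uid it already knows.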

    Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
    Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 19:07:41 +02:00
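The embedding assumption described in the commit above can be illustrated with a small userspace model. This is only a sketch: every `uid_*` name is hypothetical, the uid is a plain integer rather than a `kuid_t`, and the default of 0 for a NULL socket is a simplification. It shows why deriving the uid through a container-of lookup is only valid for sockets that really live inside the wrapper struct, and why callers such as tun/tap need a variant that takes the uid explicitly.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct socket, struct socket_alloc, struct sock. */
struct uid_socket { int type; };
struct uid_socket_alloc {
	struct uid_socket socket;
	unsigned int inode_uid;		/* models the inode's uid */
};
struct uid_sock { unsigned int sk_uid; };

#define uid_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Explicit variant: the caller supplies the uid, no embedding is assumed. */
static void uid_sock_init_data_uid(struct uid_socket *sock,
				   struct uid_sock *sk, unsigned int uid)
{
	(void)sock;
	sk->sk_uid = uid;
}

/*
 * Implicit variant: only valid when sock is embedded in a
 * struct uid_socket_alloc. For any other socket the container-of
 * lookup would read unrelated memory.
 */
static void uid_sock_init_data(struct uid_socket *sock, struct uid_sock *sk)
{
	unsigned int uid = sock ?
		uid_container_of(sock, struct uid_socket_alloc, socket)->inode_uid : 0;

	uid_sock_init_data_uid(sock, sk, uid);
}
```

A caller holding a standalone `struct uid_socket` would call `uid_sock_init_data_uid()` directly with the uid it already knows.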
Paolo Abeni 916713b856 txhash: fix sk->sk_txrehash default
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit c11204c78d6966c5bda6dd05c3ac5cbb193f93e3
Author: Kevin Yang <yyd@google.com>
Date:   Tue Feb 7 02:08:20 2023 +0000

    txhash: fix sk->sk_txrehash default

    This fixes a bug where sk->sk_txrehash gets its default enable
    value from sysctl_txrehash only when the socket is a TCP listener.

    sysctl_txrehash should set the default sk->sk_txrehash regardless of
    protocol (TCP or not) and role (listener or connector).

    Tested by following packetdrill:
      0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 4
      // SO_TXREHASH == 74, default to sysctl_txrehash == 1
      +0 getsockopt(3, SOL_SOCKET, 74, [1], [4]) = 0
      +0 getsockopt(4, SOL_SOCKET, 74, [1], [4]) = 0

    Fixes: 26859240e4ee ("txhash: Add socket option to control TX hash rethink behavior")
    Signed-off-by: Kevin Yang <yyd@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 19:07:41 +02:00
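The packetdrill check in the commit above can be approximated from a plain C program. The sketch below assumes a Linux host; `query_txrehash()` is a hypothetical helper, and the fallback value 74 for `SO_TXREHASH` is the one quoted in the commit message (some architectures use a different number, in which case the libc header definition takes precedence).

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Value quoted in the commit message; older libc headers may lack it. */
#ifndef SO_TXREHASH
#define SO_TXREHASH 74
#endif

/*
 * Hypothetical helper: create a socket of the given type/protocol and
 * read its SO_TXREHASH setting. Returns 0 and stores the value in *val
 * on success, or a negative errno (for example -ENOPROTOOPT on kernels
 * that do not know the option).
 */
static int query_txrehash(int type, int protocol, int *val)
{
	socklen_t len = sizeof(*val);
	int fd, ret = 0;

	fd = socket(AF_INET, type, protocol);
	if (fd < 0)
		return -errno;

	if (getsockopt(fd, SOL_SOCKET, SO_TXREHASH, val, &len) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

On a kernel with the fix, calling this with `(SOCK_STREAM, IPPROTO_TCP)` and with `(SOCK_DGRAM, IPPROTO_UDP)` should report the same default, taken from the txrehash sysctl.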
Paolo Abeni cc88f2ea1c soreuseport: Fix socket selection for SO_INCOMING_CPU.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit b261eda84ec136240a9ca753389853a3a1bccca2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Oct 21 13:44:34 2022 -0700

    soreuseport: Fix socket selection for SO_INCOMING_CPU.

    Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
    with setsockopt(SO_REUSEPORT) since v4.6.

    With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
    build a highly efficient server application.

    setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
    or UDP socket, and then incoming packets processed on the CPU will
    likely be distributed to the socket.  Technically, a socket could
    even receive packets handled on another CPU if no sockets in the
    reuseport group have the same CPU receiving the flow.

    The logic exists in compute_score() so that a socket will get a higher
    score if it has the same CPU with the flow.  However, the score gets
    ignored after the blamed two commits, which introduced a faster socket
    selection algorithm for SO_REUSEPORT.

    This patch introduces a counter of sockets with SO_INCOMING_CPU in
    a reuseport group to check if we should iterate all sockets to find
    a proper one.  We increment the counter when

      * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

      * enabling SO_INCOMING_CPU if the socket is in a reuseport group

    Also, we decrement it when

      * detaching a socket out of the group to apply SO_INCOMING_CPU to
        migrated TCP requests

      * disabling SO_INCOMING_CPU if the socket is in a reuseport group

    When the counter reaches 0, we can get back to the O(1) selection
    algorithm.

    The overall changes are negligible for the non-SO_INCOMING_CPU case,
    and the only notable thing is that we have to update sk_incoming_cpu
    under reuseport_lock.  Otherwise, the race prevents transitioning to
    the O(n) algorithm and results in the wrong socket selection.

     cpu1 (setsockopt)               cpu2 (listen)
    +-----------------+             +-------------+

    lock_sock(sk1)                  lock_sock(sk2)

    reuseport_update_incoming_cpu(sk1, val)
    .
    |  /* set CPU as 0 */
    |- WRITE_ONCE(sk1->incoming_cpu, val)
    |
    |                               spin_lock_bh(&reuseport_lock)
    |                               reuseport_grow(sk2, reuse)
    |                               .
    |                               |- more_socks_size = reuse->max_socks * 2U;
    |                               |- if (more_socks_size > U16_MAX &&
    |                               |       reuse->num_closed_socks)
    |                               |  .
    |                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
    |                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
    |                               |     .
    |                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
    |                               |        .
    |                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
    |                               |        |   * without lock_sock().
    |                               |        |   */
    |                               |        `- if (sk1->sk_incoming_cpu >= 0)
    |                               |           .
    |                               |           |  /* decrement not-yet-incremented
    |                               |           |   * count, which is never incremented.
    |                               |           |   */
    |                               |           `- __reuseport_put_incoming_cpu(reuse);
    |                               |
    |                               `- spin_unlock_bh(&reuseport_lock)
    |
    |- spin_lock_bh(&reuseport_lock)
    |
    |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
    |- if (!reuse)
    |  .
    |  |  /* Cannot increment reuse->incoming_cpu. */
    |  `- goto out;
    |
    `- spin_unlock_bh(&reuseport_lock)

    Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
    Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
    Reported-by: Kazuho Oku <kazuhooku@gmail.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 18:56:58 +02:00
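The counter-based selection described in the commit above can be sketched as a userspace model. This is not the kernel code: all `rp_*` names are hypothetical, locking (the kernel updates the counter under reuseport_lock) is omitted, and a plain modulo replaces the kernel's hash scaling. The point is the shape of the algorithm: O(1) selection while no socket in the group has SO_INCOMING_CPU set, and a scan for a CPU match otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#define RP_MAX_SOCKS 8

struct rp_sock {
	int incoming_cpu;		/* -1 means SO_INCOMING_CPU is unset */
};

struct rp_group {
	struct rp_sock *socks[RP_MAX_SOCKS];
	unsigned int num_socks;
	unsigned int incoming_cpu;	/* sockets in the group with a CPU set */
};

static int rp_add(struct rp_group *g, struct rp_sock *sk)
{
	if (g->num_socks >= RP_MAX_SOCKS)
		return -1;
	g->socks[g->num_socks++] = sk;
	if (sk->incoming_cpu >= 0)
		g->incoming_cpu++;
	return 0;
}

/* Assumes sk is already a member of g. */
static void rp_update_incoming_cpu(struct rp_group *g, struct rp_sock *sk,
				   int val)
{
	int old = sk->incoming_cpu;

	sk->incoming_cpu = val;
	if (old < 0 && val >= 0)
		g->incoming_cpu++;
	else if (old >= 0 && val < 0)
		g->incoming_cpu--;
}

static struct rp_sock *rp_select(const struct rp_group *g, uint32_t hash,
				 int cpu)
{
	unsigned int i, start;

	if (g->num_socks == 0)
		return NULL;

	start = hash % g->num_socks;
	if (g->incoming_cpu == 0)
		return g->socks[start];		/* O(1) fast path */

	/* O(n) path: prefer a socket bound to the CPU handling the flow. */
	for (i = 0; i < g->num_socks; i++) {
		struct rp_sock *sk = g->socks[(start + i) % g->num_socks];

		if (sk->incoming_cpu == cpu)
			return sk;
	}
	return g->socks[start];			/* no match: hash-based fallback */
}
```

Keeping the counter accurate is what makes the fast path safe, which is why the race in the diagram above matters: a lost increment would leave the group on the O(1) path and ignore the CPU preference.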
Xin Long 5a01b46698 net: add support for ipv4 big tcp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: big tcp selftest

commit b1a78b9b98862cda167b643690e43662ea060625
Author: Xin Long <lucien.xin@gmail.com>
Date:   Sat Jan 28 10:58:39 2023 -0500

    net: add support for ipv4 big tcp

    Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.

    Firstly, allow sk->sk_gso_max_size to be set to a value greater than
    GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
    for IPv4 TCP sockets.

    Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
    in __ip_local_out() to allow to send BIG TCP packets, and this implies
    that skb->len is the length of a IPv4 packet; On RX path, use skb->len
    as the length of the IPv4 packet when the IP header tot_len is 0 and
    skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
    skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
    need to update these APIs.

    Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
    the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
    GRO complete, set IP header tot_len to 0 when the merged packet size is
    greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
    on RX path.

    Note that checking skb_is_gso_tcp() in iph_totlen() makes it safe
    for this implementation to treat iph->tot_len == 0 as indicating an
    IPv4 BIG TCP packet.

    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-05-02 10:36:11 -04:00
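The tot_len convention from the commit above can be modelled in a few lines of userspace C. The `bigtcp_*` names are hypothetical and the structs are reduced to the fields that matter; in particular the model assumes the buffer length is measured from the start of the IP header, whereas the kernel helpers work on a real skb.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define BIGTCP_IP_MAX_MTU 0xFFFFU	/* largest value that fits in tot_len */

struct bigtcp_iphdr {
	uint16_t tot_len;		/* network byte order */
};

struct bigtcp_skb {
	unsigned int len;		/* bytes from the IP header onwards */
	bool is_gso_tcp;		/* models the GSO TCP check */
};

/* TX side: a length that does not fit in 16 bits is encoded as 0. */
static void bigtcp_set_totlen(struct bigtcp_iphdr *iph, unsigned int len)
{
	iph->tot_len = len <= BIGTCP_IP_MAX_MTU ? htons((uint16_t)len) : 0;
}

/*
 * RX side: tot_len == 0 is only trusted as "BIG TCP" for GSO TCP
 * buffers; otherwise the header value is returned unchanged.
 */
static unsigned int bigtcp_totlen(const struct bigtcp_skb *skb,
				  const struct bigtcp_iphdr *iph)
{
	unsigned int len = ntohs(iph->tot_len);

	return (len || !skb->is_gso_tcp) ? len : skb->len;
}
```

Restricting the zero-length interpretation to GSO TCP buffers keeps a malformed packet with tot_len 0 from being treated as a valid oversized one.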
Jeff Moyer 1eef062be8 net: inline sock_alloc_send_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit de32bc6aad09131a30b4a9a738e2bf2ba5a9a5aa
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Apr 28 11:58:44 2022 +0100

    net: inline sock_alloc_send_skb
    
    sock_alloc_send_skb() is simple and just proxying to another function,
    so we can inline it and cut associated overhead.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:54:02 -04:00
Guillaume Nault e7de28d517 net: Introduce sk_use_task_frag in struct sock.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2183213
Upstream Status: linux.git

commit fb87bd47516d9a26b6d549231aa743b20fd4a569
Author: Guillaume Nault <gnault@redhat.com>
Date:   Fri Dec 16 07:45:26 2022 -0500

    net: Introduce sk_use_task_frag in struct sock.

    Sockets that can be used while recursing into memory reclaim, like
    those used by network block devices and file systems, mustn't use
    current->task_frag: if the current process is already using it, then
    the inner memory reclaim call would corrupt the task_frag structure.

    To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
    that mustn't use current->task_frag, assuming that those used during
    memory reclaim had their allocation constraints reflected in
    ->sk_allocation.

    This unfortunately doesn't cover all cases: in an attempt to remove all
    usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
    ->sk_allocation, and used memalloc_nofs critical sections instead.
    This breaks the sk_page_frag() heuristic since the allocation
    constraints are now stored in current->flags, which sk_page_frag()
    can't read without risking triggering a cache miss and slowing down
    TCP's fast path.

    This patch creates a new field in struct sock, named sk_use_task_frag,
    which sockets with memory reclaim constraints can set to false if they
    can't safely use current->task_frag. In such cases, sk_page_frag() now
    always returns the socket's page_frag (->sk_frag). The first user is
    sunrpc, which needs to avoid using current->task_frag but can keep
    ->sk_allocation set to GFP_KERNEL otherwise.

    Eventually, it might be possible to simplify sk_page_frag() by only
    testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
    heuristic entirely (assuming other sockets will set ->sk_use_task_frag
    according to their constraints in the future).

    The new ->sk_use_task_frag field is placed in a hole in struct sock and
    belongs to a cache line shared with ->sk_shutdown. Therefore it should
    be hot and shouldn't have negative performance impacts on TCP's fast
    path (sk_shutdown is tested just before the while() loop in
    tcp_sendmsg_locked()).

    Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-04-03 11:41:09 +02:00
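The page_frag selection described in the commit above can be sketched as follows. This is a simplified userspace model with hypothetical `frag_*` names: the existing ->sk_allocation heuristic is collapsed into a single boolean, and the current task is passed explicitly instead of being read from `current`.

```c
#include <stdbool.h>

struct frag_page_frag { unsigned int offset; };

struct frag_task { struct frag_page_frag task_frag; };

struct frag_sock {
	bool allocation_ok;	/* stand-in for the ->sk_allocation heuristic */
	bool use_task_frag;	/* models ->sk_use_task_frag */
	struct frag_page_frag sk_frag;
};

/*
 * Use the per-task frag only when both the allocation heuristic and the
 * explicit opt-in allow it; otherwise fall back to the per-socket frag.
 */
static struct frag_page_frag *frag_sk_page_frag(struct frag_sock *sk,
						struct frag_task *cur)
{
	if (sk->allocation_ok && sk->use_task_frag)
		return &cur->task_frag;
	return &sk->sk_frag;
}
```

A socket such as the sunrpc one in the commit message would set `use_task_frag` to false while leaving its allocation flags unrestricted, and would then always get its own `sk_frag`.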
Jan Stancek c3f4ddd4dd Merge: Merge tag 'kernel-5.14.0-284.5.1.el9_2' from 9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2250

Bring in changes from 9.2 tag kernel-5.14.0-284.5.1.el9_2.

The change to Makefile.rhelver was dropped since it is not applicable to
centos stream 9.

The change to block/blk-mq.h was re-done based on current
centos-stream-9 tree content. Since the c9s tree already has
80bd4a7aab4c blk-mq: move the srcu_struct used for quiescing to the tagset
I applied the original upstream change instead*,
rather than the 9.2-specific version.

*blk-mq: fix "bad unlock balance detected" on q->srcu in __blk_mq_run_dispatch_ops

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-03-29 12:11:12 +02:00
Artem Savkov e4a4c99666 lsm: make security_socket_getpeersec_stream() sockptr_t safe
Bugzilla: https://bugzilla.redhat.com/2166911

commit b10b9c342f7571f287fd422be5d5c0beb26ba974
Author: Paul Moore <paul@paul-moore.com>
Date:   Mon Oct 10 12:31:21 2022 -0400

    lsm: make security_socket_getpeersec_stream() sockptr_t safe
    
    Commit 4ff09db1b79b ("bpf: net: Change sk_getsockopt() to take the
    sockptr_t argument") made it possible to call sk_getsockopt()
    with both user and kernel address space buffers through the use of
    the sockptr_t type.  Unfortunately at the time of conversion the
    security_socket_getpeersec_stream() LSM hook was written to only
    accept userspace buffers, and in a desire to avoid having to change
    the LSM hook the commit author simply passed the sockptr_t's
    userspace buffer pointer.  Since the only sk_getsockopt() callers
    at the time of conversion which used kernel sockptr_t buffers did
    not allow SO_PEERSEC, and hence the
    security_socket_getpeersec_stream() hook, this was acceptable but
    also very fragile as future changes presented the possibility of
    silently passing kernel space pointers to the LSM hook.
    
    There are several ways to protect against this, including careful
    code review of future commits, but since relying on code review to
    catch bugs is a recipe for disaster and the upstream eBPF maintainer
    is "strongly against defensive programming", this patch updates the
    LSM hook, and all of the implementations to support sockptr_t and
    safely handle both user and kernel space buffers.
    
    Acked-by: Casey Schaufler <casey@schaufler-ca.com>
    Acked-by: John Johansen <john.johansen@canonical.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:40 +01:00
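The idea behind making a hook "sockptr_t safe", as in the commit above, is that the destination carries a tag saying which address space it belongs to, and every copy dispatches on that tag. The model below is a userspace sketch with hypothetical `sp_*` names; both branches use memcpy() here, whereas the kernel would use a user-copy primitive (which can fail) for the user branch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Tagged pointer: one buffer address plus the address space it lives in. */
struct sp_sockptr {
	union {
		void *kernel;
		void *user;	/* a __user pointer in the kernel */
	};
	bool is_kernel;
};

/* Counters so the example can show which path was taken. */
static unsigned int sp_user_copies, sp_kernel_copies;

static int sp_copy_to(struct sp_sockptr dst, const void *src, size_t len)
{
	if (dst.is_kernel) {
		memcpy(dst.kernel, src, len);
		sp_kernel_copies++;
	} else {
		/* Kernel code would do a checked copy to userspace here. */
		memcpy(dst.user, src, len);
		sp_user_copies++;
	}
	return 0;
}
```

A hook that accepts the tagged pointer, rather than a raw userspace pointer extracted from it, cannot silently be handed a kernel address through the user path.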
Felix Maurer 0f771c8999 bpf: Change bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 65ddc82d3b96be5555a36de4e2b4547433a00532
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:29:12 2022 -0700

    bpf: Change bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt()
    
    This patch changes bpf_getsockopt(SOL_SOCKET) to reuse
    sk_getsockopt().  It removes all duplicated code from
    bpf_getsockopt(SOL_SOCKET).
    
    Before this patch, there were some optnames available to
    bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET).
    It surprises users from time to time.  For example, SO_REUSEADDR,
    SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE.  This patch
    automatically closes this gap without duplicating more code.
    The only exception is SO_BINDTODEVICE because it needs to acquire a
    blocking lock.  Thus, SO_BINDTODEVICE is not supported.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:32 +01:00
Felix Maurer cee9644119 bpf: net: Change sk_getsockopt() to take the sockptr_t argument
Bugzilla: https://bugzilla.redhat.com/2166911

commit 4ff09db1b79b98b4a2a7511571c640b76cab3beb
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:28:02 2022 -0700

    bpf: net: Change sk_getsockopt() to take the sockptr_t argument
    
    This patch changes sk_getsockopt() to take the sockptr_t argument
    such that it can be used by bpf_getsockopt(SOL_SOCKET) in a
    later patch.
    
    security_socket_getpeersec_stream() is not changed.  It stays
    with the __user ptr (optval.user and optlen.user) to avoid changes
    to other security hooks.  bpf_getsockopt(SOL_SOCKET) also does not
    support SO_PEERSEC.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:31 +01:00
Felix Maurer cfd2975ca7 net: Change sock_getsockopt() to take the sk ptr instead of the sock ptr
Bugzilla: https://bugzilla.redhat.com/2166911

commit ba74a7608dc12fbbd8ea36e660087f08a81ef26a
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:27:56 2022 -0700

    net: Change sock_getsockopt() to take the sk ptr instead of the sock ptr
    
    A later patch refactors bpf_getsockopt(SOL_SOCKET) with the
    sock_getsockopt() to avoid code duplication and code
    drift between the two duplicates.
    
    The current sock_getsockopt() takes sock ptr as the argument.
    The very first thing of this function is to get back the sk ptr
    by 'sk = sock->sk'.
    
    bpf_getsockopt() could be called when the sk does not have
    the sock ptr created, meaning sk->sk_socket is NULL.  For example,
    when a passive tcp connection has just been established but has not yet
    been accept()-ed.  Thus, it cannot use the sock_getsockopt(sk->sk_socket)
    or else it will pass a NULL ptr.
    
    This patch moves all sock_getsockopt implementation to the newly
    added sk_getsockopt().  The new sk_getsockopt() takes a sk ptr
    and immediately gets the sock ptr by 'sock = sk->sk_socket'
    
    The existing sock_getsockopt(sock) is changed to call
    sk_getsockopt(sock->sk).  All existing callers have both sock->sk
    and sk->sk_socket pointer.
    
    A later patch will make bpf_getsockopt(SOL_SOCKET) call
    sk_getsockopt(sk) directly.  The bpf_getsockopt(SOL_SOCKET) does
    not use the optnames that require sk->sk_socket, so it will
    be safe.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002756.2887884-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:31 +01:00
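The refactoring in the commit above amounts to moving the implementation onto the lower-level object and turning the old entry point into a thin wrapper. A minimal userspace sketch, with hypothetical `gs_*` names and a single made-up option, might look like this:

```c
#include <stddef.h>

struct gs_sock;

struct gs_socket { struct gs_sock *sk; };

struct gs_sock {
	struct gs_socket *sk_socket;	/* may be NULL before accept() */
	int sk_rcvbuf;
};

enum { GS_OPT_RCVBUF = 1 };		/* hypothetical option number */

/* All of the logic lives here and needs only the sk pointer. */
static int gs_sk_getsockopt(struct gs_sock *sk, int optname, int *val)
{
	switch (optname) {
	case GS_OPT_RCVBUF:
		*val = sk->sk_rcvbuf;
		return 0;
	default:
		return -1;		/* unsupported option in this model */
	}
}

/* The old entry point becomes a wrapper for callers that hold a socket. */
static int gs_sock_getsockopt(struct gs_socket *sock, int optname, int *val)
{
	return gs_sk_getsockopt(sock->sk, optname, val);
}
```

A caller that only has the `gs_sock`, like the BPF case with a not yet accepted connection, can call `gs_sk_getsockopt()` directly without ever dereferencing `sk_socket`.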
Felix Maurer 88d5be6b23 bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts:
- Has been partially backported in df06f0d362 ("bpf: Change
  bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()"): export of
  sk_setsockopt was done there.
- net/core/filter.c: difference in removed code due to already applied
  38e724189c ("net: Fix data-races around
  sysctl_[rw]mem_(max|default)."). sk_setsockopt in net/core/sock.c does
  already have the READ_ONCEs.

commit 29003875bd5bab262a29d1c6e76a2124bd07e4c2
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Tue Aug 16 23:18:04 2022 -0700

    bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()

    After the prep work in the previous patches,
    this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
    and reuses them from sk_setsockopt().

    The sock ptr test is added to the SO_RCVLOWAT because
    the sk->sk_socket could be NULL in some of the bpf hooks.

    The existing optname white-list is refactored into a new
    function sol_socket_setsockopt().

    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:30 +01:00
Felix Maurer 99a9a726f4 bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts:
- net/core/sock.c: Code difference because SO_PRIORITY and SO_MARK were
  not allowed with CAP_NET_RAW, but only with CAP_NET_ADMIN. So change
  only this check to use sockopt_capable(). Missing commits are
  a1b519b74548 ("net: allow CAP_NET_RAW to setsockopt SO_PRIORITY") and
  079925cce1d0 ("net: allow SO_MARK with CAP_NET_RAW").
- net/core/sock.c: We do not have SO_RCVMARK, so leave out this hunk.

commit e42c7beee71d0d84a6193357e3525d0cf2a3e168
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Tue Aug 16 23:17:23 2022 -0700

    bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()

    When a bpf program calls bpf_setsockopt(SOL_SOCKET),
    it could be running in softirq, where it doesn't make sense to do the
    capable check.  There was a similar situation in bpf_setsockopt(TCP_CONGESTION).
    In commit 8d650cdeda ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
    tcp_set_congestion_control(..., cap_net_admin) was added to skip
    the cap check for bpf prog.

    This patch adds sockopt_ns_capable() and sockopt_capable() for
    the sk_setsockopt() to use.  They will consider the
    has_current_bpf_ctx() before doing the ns_capable() and capable() test.
    They are in EXPORT_SYMBOL for the ipv6 module to use in a later patch.

    Suggested-by: Stanislav Fomichev <sdf@google.com>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:29 +01:00
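The capability shortcut described in the commit above can be reduced to a one-line predicate. The sketch below is a userspace model only: the `capm_*` names, the capability bitmask, and the explicit task argument are all hypothetical stand-ins for the kernel's `current`-based checks.

```c
#include <stdbool.h>

enum { CAPM_NET_ADMIN = 1 };	/* hypothetical capability id for this model */

struct capm_task {
	bool has_bpf_ctx;	/* models "called from a bpf program" */
	unsigned int caps;	/* bitmask of capabilities held by the task */
};

static bool capm_capable(const struct capm_task *t, int cap)
{
	return (t->caps & (1U << cap)) != 0;
}

/*
 * When called from a bpf program the task that happens to be current is
 * unrelated to the program, so the capability test is skipped.
 */
static bool capm_sockopt_capable(const struct capm_task *t, int cap)
{
	return t->has_bpf_ctx || capm_capable(t, cap);
}
```

The ordinary syscall path is unchanged by this: without a bpf context the result is exactly what the plain capability check returns.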
Felix Maurer 3562cfdaca bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
Bugzilla: https://bugzilla.redhat.com/2166911

commit 24426654ed3ae83d1127511891fb782c54f49203
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Tue Aug 16 23:17:17 2022 -0700

    bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
    
    Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
    sk_setsockopt().  The number of supported optnames keeps
    increasing, and so does the duplicated code.
    
    One issue in reusing sk_setsockopt() is that the bpf prog
    has already acquired the sk lock.  This patch adds a
    has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
    a bpf prog.  The bpf prog calling bpf_setsockopt() is either running
    in_task() or in_serving_softirq().  Both cases have the current->bpf_ctx
    initialized.  Thus, the has_current_bpf_ctx() only needs to
    test !!current->bpf_ctx.
    
    This patch also adds sockopt_{lock,release}_sock() helpers
    for sk_setsockopt() to use.  These helpers will test
    has_current_bpf_ctx() before acquiring/releasing the lock.  They are
    in EXPORT_SYMBOL for the ipv6 module to use in a later patch.
    
    Note on the change in sock_setbindtodevice().  sockopt_lock_sock()
    is done in sock_setbindtodevice() instead of doing the lock_sock
    in sock_bindtoindex(..., lock_sk = true).
    
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:29 +01:00
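The conditional locking helpers from the commit above can be modelled in userspace as follows. All `lk_*` names are hypothetical, a counter stands in for the socket lock, and a file-scope variable stands in for `current`. The model shows the intended invariant: the option is always written with the lock held exactly once, whether the lock was taken by the helper or by the BPF caller beforehand.

```c
#include <stdbool.h>
#include <stddef.h>

struct lk_task { void *bpf_ctx; };	/* models current->bpf_ctx */
struct lk_sock { int lock_depth; };	/* counter standing in for the sk lock */

static struct lk_task lk_current;

static bool lk_has_current_bpf_ctx(void)
{
	return lk_current.bpf_ctx != NULL;
}

static void lk_sockopt_lock_sock(struct lk_sock *sk)
{
	if (lk_has_current_bpf_ctx())
		return;			/* the bpf caller already holds the lock */
	sk->lock_depth++;
}

static void lk_sockopt_release_sock(struct lk_sock *sk)
{
	if (lk_has_current_bpf_ctx())
		return;
	sk->lock_depth--;
}

/* Sets *field under the lock and returns the lock depth seen while set. */
static int lk_set_option(struct lk_sock *sk, int *field, int val)
{
	int depth_seen;

	lk_sockopt_lock_sock(sk);
	*field = val;
	depth_seen = sk->lock_depth;
	lk_sockopt_release_sock(sk);
	return depth_seen;
}
```

In both the syscall path and the BPF path `lk_set_option()` observes a depth of 1, which is the double-lock avoidance the commit message describes.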
Florian Westphal 802a0caabe net: use indirect calls helpers for sk_exit_memory_pressure()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155182
Upstream Status: net commit 5c1ebbfabcd61

commit 5c1ebbfabcd61142a4551bfc0e51840f9bdae7af
Author: Brian Vazquez <brianvv@google.com>
Date:   Wed Mar 1 13:32:47 2023 +0000

    net: use indirect calls helpers for sk_exit_memory_pressure()

    Florian reported a regression and sent a patch with the following
    changelog:

    <quote>
     There is a noticeable tcp performance regression (loopback or cross-netns),
     seen with iperf3 -Z (sendfile mode) when generic retpolines are needed.

     With SK_RECLAIM_THRESHOLD checks gone number of calls to enter/leave
     memory pressure happen much more often. For TCP indirect calls are
     used.

     We can't remove the if-set-return short-circuit check in
     tcp_enter_memory_pressure because there are callers other than
     sk_enter_memory_pressure.  Doing a check in the sk wrapper too
     reduces the indirect calls enough to recover some performance.

     Before,
     0.00-60.00  sec   322 GBytes  46.1 Gbits/sec                  receiver

     After:
     0.00-60.04  sec   359 GBytes  51.4 Gbits/sec                  receiver

     "iperf3 -c $peer -t 60 -Z -f g", connected via veth in another netns.
    </quote>

    It seems we forgot to upstream this indirect call mitigation we
    had for years, lets do this instead.

    [edumazet] - It seems we forgot to upstream this indirect call
                 mitigation we had for years, let's do this instead.
               - Changed to INDIRECT_CALL_INET_1() to avoid bots reports.

    Fixes: 4890b686f408 ("net: keep sk->sk_forward_alloc as small as possible")
    Reported-by: Florian Westphal <fw@strlen.de>
    Link: https://lore.kernel.org/netdev/20230227152741.4a53634b@kernel.org/T/
    Signed-off-by: Brian Vazquez <brianvv@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230301133247.2346111-1-edumazet@google.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2023-03-03 14:17:45 +01:00
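The indirect-call helper pattern named in the commit above compares the function pointer against the most likely target and, on a match, calls that target directly so the compiler can emit a direct call. The sketch below is a userspace illustration with hypothetical `mp_*` names; it shows only the dispatch shape, since the retpoline cost being avoided does not exist in an ordinary userspace build.

```c
#include <stddef.h>

struct mp_sock;

struct mp_proto {
	void (*leave_memory_pressure)(struct mp_sock *sk);
};

struct mp_sock { const struct mp_proto *prot; };

/* If f is the expected target, call it directly; else go through f. */
#define MP_INDIRECT_CALL_1(f, f1, arg) ((f) == (f1) ? f1(arg) : (f)(arg))

static unsigned int mp_tcp_calls, mp_other_calls;

static void mp_tcp_leave(struct mp_sock *sk)
{
	(void)sk;
	mp_tcp_calls++;
}

static void mp_other_leave(struct mp_sock *sk)
{
	(void)sk;
	mp_other_calls++;
}

static void mp_sk_leave_memory_pressure(struct mp_sock *sk)
{
	if (sk->prot->leave_memory_pressure)
		MP_INDIRECT_CALL_1(sk->prot->leave_memory_pressure,
				   mp_tcp_leave, sk);
}
```

Protocols without the callback are simply skipped, and protocols with a different callback still work through the ordinary indirect call.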
Herton R. Krzesinski 621a3b0cfb Merge: net: Backport data race annotations in the networking stack (part 1).
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1722

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
Conflicts: Few minor conflicts, see description in affected commits.

Properly mark concurrent reads and writes with READ_ONCE() and
WRITE_ONCE() in various parts of the networking stack. This is a
backport of the following upstream patch series:
  
  * Patch set A: merge commit e97e68b56e78 ("Merge branch 'sk_bound_dev_if-annotations'")
  * Patch set B: merge commit 32b3ad1418ea ("Merge branch 'sysctl-data-races'")
  * Patch set C: merge commit 7d5424b26f17 ("Merge branch 'net-sysctl-races'")
  * Patch set D: merge commit 782d86fe44e3 ("Merge branch 'net-sysctl-races-round2'")
  * Patch set E: merge commit c9f21106d97b ("Merge branch 'net-ipv4-sysctl-races-part-3'")

Patch 1 is a standalone READ_ONCE() annotation for sk->sk_bound_dev_if.
It's a prerequisite for correctly backporting patch set A.

Patches 2-9 are backports of patch set A. The following upstream
patches have been omitted since they're already in Centos Stream:
  
  * Upstream commit a20ea298071f ("sctp: read sk->sk_bound_dev_if once
    in sctp_rcv()"), backported by Centos Stream commit 5d539b8523.

  * Upstream commit 70f87de9fa0d ("net_sched: em_meta: add READ_ONCE()
    in var_sk_bound_if()"), backported by Centos Stream commit
    866ca288f3.

Patch 10 was in the original upstream series of patch set B, but was
resubmitted independently as that series was reworked before being
applied. Therefore, it doesn't strictly belong to patch set B, but is
closely related to it and is thus backported here.

Patches 11-21 are backports of patch set B. The following upstream
patch has been omitted since it's already in Centos Stream:
  
  * Upstream commit 310731e2f161 ("net: Fix data-races around
    sysctl_mem."), backported by Centos Stream commit a99b2cb4eb.

Patches 22-36 are backports corresponding to patch set C.

Patches 37-51 are backports corresponding to patch set D.

Patches 52-66 are backports corresponding to patch set E.

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-09 15:37:27 +00:00
Herton R. Krzesinski c2ca5b8087 Merge: net: tls: rebase to 6.0+
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1724

Tested: selftests
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143700

Rebase kTLS to upstream, to get the recent fixes and performance
improvements. This will also help with some driver rebases.

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-22 21:08:50 +00:00
Guillaume Nault 25fe18780a net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit e5fccaa1eb7f6116deab0f708a787e2de915869f
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 13 11:55:44 2022 -0700

    net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if

    sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
    because other cpus/threads might locklessly read this field.

    sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
    because they run without socket lock held.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:46 +01:00
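For readers unfamiliar with the annotations used in the commit above, the following userspace approximation shows the pattern. The `RO_*` macros are simplified stand-ins built on a volatile cast (they rely on the GCC/Clang `__typeof__` extension and are not the kernel's definitions), and the `ro_*` functions are hypothetical models of a locked writer and a lockless reader.

```c
/* Force exactly one load or store, preventing compiler tearing/refetch. */
#define RO_READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define RO_WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

struct ro_sock { int sk_bound_dev_if; };

/* Writer: runs with the socket lock held, but readers may be lockless. */
static void ro_bindtoindex(struct ro_sock *sk, int ifindex)
{
	RO_WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
}

/* Reader: runs without the socket lock, so it must read the field once. */
static int ro_getbindtodevice(const struct ro_sock *sk)
{
	return RO_READ_ONCE(sk->sk_bound_dev_if);
}
```

The annotations do not add ordering or atomicity beyond a single access; they document the intentional lockless access and keep the compiler from splitting or repeating it.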
Herton R. Krzesinski 09736a3a30 Merge: udp: some performance optimizations
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test

This series improves UDP RX throughput, to keep it on par with RHEL 8.

Patches 1, 3 and 4 are there just to reduce conflicts, and patch 4 is a very partial
backport, to avoid pulling in unrelated features.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-13 17:35:03 +00:00
Sabrina Dubroca c92a443dc7 tls: rx: periodically flush socket backlog
Tested: selftests
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143700

commit c46b01839f7aad5889e23505bbfbeb5f4d7fde8e
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Jul 5 16:59:26 2022 -0700

    tls: rx: periodically flush socket backlog

    We continuously hold the socket lock during large reads and writes.
    This may inflate RTT and negatively impact TCP performance.
    Flush the backlog periodically. I tried to pick a flush period (128kB)
    which gives significant benefit but the max Bps rate is not yet visibly
    impacted.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2022-11-30 23:43:01 +01:00
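The periodic flush described in the commit above can be sketched as a byte counter that triggers every 128kB of progress. This is a deliberately simplified userspace model with hypothetical `fl_*` names: it ignores the receive-queue checks a real implementation would also make, and it only counts how many flushes a read of a given size would perform.

```c
#include <stdbool.h>
#include <stddef.h>

#define FL_FLUSH_PERIOD (128 * 1024)	/* flush period from the commit message */

/* Returns true when at least one period has passed since the last flush. */
static bool fl_should_flush(size_t done, size_t *flushed_at)
{
	if (done - *flushed_at < FL_FLUSH_PERIOD)
		return false;
	*flushed_at = done;
	return true;
}

/* Models a long read processed record by record under the socket lock. */
static unsigned int fl_read_loop(size_t total, size_t record_size)
{
	size_t done = 0, flushed_at = 0;
	unsigned int flushes = 0;

	if (record_size == 0)
		return 0;

	while (done < total) {
		size_t left = total - done;

		done += left < record_size ? left : record_size;
		/* No flush after the final record: the lock is about to drop. */
		if (done < total && fl_should_flush(done, &flushed_at))
			flushes++;
	}
	return flushes;
}
```

Under these assumptions a 1 MiB read in 16 KiB records flushes the backlog a handful of times instead of holding the lock for the whole transfer, while short reads never pay for a flush.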
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00