961 Commits

Author SHA1 Message Date
Jeff Moyer 092f5d645a net: ioctl: Use kernel memory on protocol ioctl callbacks
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
  559260fd9d9a ("ipmr: do not acquire mrt_lock in
  ioctl(SIOCGETVIFCNT)").  I also pulled in header changes from commit
  949d6b405e61 ("net: add missing includes and forward declarations
  under net/") to address a build failure with this patch applied.

commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Jun 9 08:27:42 2023 -0700

    net: ioctl: Use kernel memory on protocol ioctl callbacks
    
    Most of the ioctls to net protocols operate directly on the userspace
    argument (arg), usually doing get_user()/put_user() directly in the
    ioctl callback.  This is not flexible, because it is hard to reuse these
    functions without passing userspace buffers.
    
    Change the "struct proto" ioctls to avoid touching userspace memory and
    operate on kernel buffers, i.e., all protocols' ioctl callbacks are
    adapted to operate on kernel memory rather than on userspace (so, no
    more {put,get}_user() and friends being called in the ioctl callback).
    
    This changes the "struct proto" ioctl format in the following way:
    
        int                     (*ioctl)(struct sock *sk, int cmd,
    -                                        unsigned long arg);
    +                                        int *karg);
    
    (Important to say that this patch does not touch the "struct proto_ops"
    protocols)
    
    So, the "karg" argument, which is passed to the ioctl callback, is a
    pointer to kernel space memory (allocated inside a function wrapper).
    This buffer (karg) may contain an input argument (copied from userspace
    in a prep function) and it might return a value/buffer, which is copied
    back to userspace if necessary. There is no one-size-fits-all format
    (that is why I am using 'may' above), but basically, there are three
    types of ioctls:
    
    1) Do not read from userspace, and return a result to userspace
    2) Read an input parameter from userspace, and do not return anything
      to userspace
    3) Read an input from userspace, and return a buffer to userspace.
    
    The default case (1) (where no input parameter is given, and an "int" is
    returned to userspace) encompasses more than 90% of the cases, but there
    are two other exceptions. Here is a list of exceptions:
    
    * Protocol RAW:
       * cmd = SIOCGETVIFCNT:
         * input and output = struct sioc_vif_req
       * cmd = SIOCGETSGCNT
         * input and output = struct sioc_sg_req
       * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
         argument, which is struct sioc_vif_req. Then the callback populates
         the struct, which is copied back to userspace.
    
    * Protocol RAW6:
       * cmd = SIOCGETMIFCNT_IN6
         * input and output = struct sioc_mif_req6
       * cmd = SIOCGETSGCNT_IN6
         * input and output = struct sioc_sg_req6
    
    * Protocol PHONET:
      * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
         * input int (4 bytes)
      * Nothing is copied back to userspace.
    
    For the exception cases, the function sock_sk_ioctl_inout() will
    copy the userspace input into kernel space, and copy the result back
    to userspace.
    
    The wrapper that prepares the buffer and puts the buffer back to the
    user is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the
    caller now calls sk_ioctl(), which will handle all cases.
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:32:16 -04:00
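
Note on 092f5d645a: the wrapper pattern described in the commit message above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `ioc_sock`, `ioc_proto_ioctl()`, `ioc_sk_ioctl()` and `IOC_GET_QUEUED` are hypothetical names, `memcpy()` stands in for the kernel's copy-to-user step, and only the default case (1) is modelled (no input read, one `int` returned).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical command number for this sketch; not a real ioctl value. */
#define IOC_GET_QUEUED 1

struct ioc_sock {
	int queued_bytes;
	/* New-style callback: receives a kernel-side buffer, never a user pointer. */
	int (*ioctl)(struct ioc_sock *sk, int cmd, int *karg);
};

/* Protocol callback: fills *karg and performs no userspace access. */
static int ioc_proto_ioctl(struct ioc_sock *sk, int cmd, int *karg)
{
	assert(karg != NULL);

	switch (cmd) {
	case IOC_GET_QUEUED:
		*karg = sk->queued_bytes;
		return 0;
	default:
		/* Error code chosen for the sketch only. */
		return -ENOTTY;
	}
}

/*
 * Wrapper for case (1): nothing is read from "userspace"; on success
 * the int result is copied out. memcpy() is a stand-in for the real
 * copy-to-user step.
 */
static int ioc_sk_ioctl(struct ioc_sock *sk, int cmd, void *uarg)
{
	int karg = 0;
	int ret = sk->ioctl(sk, cmd, &karg);

	if (ret)
		return ret;

	memcpy(uarg, &karg, sizeof(karg));
	return 0;
}
```

Because the callback only ever sees `karg`, an in-kernel caller could invoke it with its own buffer and skip the copy-out step entirely, which is the reuse argument the commit message makes.
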
Paolo Abeni 8ce7b9e432 tcp: get rid of sysctl_tcp_adv_win_scale
JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1

Upstream commit:
commit dfa2f0483360d4d6f2324405464c9f281156bd87
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jul 17 15:29:17 2023 +0000

    tcp: get rid of sysctl_tcp_adv_win_scale

    With modern NIC drivers shifting to full page allocations per
    received frame, we face the following issue:

    TCP has one per-netns sysctl used to tweak how to translate
    a memory use into an expected payload (RWIN), in RX path.

    tcp_win_from_space() implementation is limited to a few cases.

    For hosts dealing with various MSS, we either underestimate
    or overestimate the RWIN we send to the remote peers.

    For instance with the default sysctl_tcp_adv_win_scale value,
    we expect to store 50% of payload per allocated chunk of memory.

    For the typical use of MTU=1500 traffic, and order-0 page allocations
    by NIC drivers, we are sending too big a RWIN, leading to potential
    tcp collapse operations, which are extremely expensive and a source
    of latency spikes.

    This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
    uses a per socket scaling factor, so that we can precisely
    adjust the RWIN based on effective skb->len/skb->truesize ratio.

    This patch alone can double TCP receive performance when receivers
    are too slow to drain their receive queue, or by allowing
    a bigger RWIN when MSS is close to PAGE_SIZE.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-31 21:50:01 +01:00
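
Note on 8ce7b9e432: the difference between a fixed, sysctl-driven translation and a per-socket ratio can be illustrated with a small calculation. This is a sketch under assumptions, not the kernel implementation: the function names, the 8-bit fixed-point ratio, and the example frame sizes (1448 payload bytes held in a 4096-byte chunk) are all chosen for illustration. The fixed formula is written so that a scale of 1 yields the 50% figure quoted in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point shift for the per-socket ratio; an assumption of this sketch. */
#define WS_RATIO_SHIFT 8

/* Fixed translation: one global knob decides the payload share of memory. */
static int ws_win_from_space_fixed(int space, int adv_win_scale)
{
	return adv_win_scale <= 0 ? space >> (-adv_win_scale)
				  : space - (space >> adv_win_scale);
}

/* Per-socket ratio derived from an observed payload length and truesize. */
static uint8_t ws_scaling_ratio(uint32_t len, uint32_t truesize)
{
	uint64_t ratio;

	assert(truesize != 0 && len <= truesize);
	ratio = ((uint64_t)len << WS_RATIO_SHIFT) / truesize;
	return ratio > 255 ? 255 : (uint8_t)ratio;
}

/* Translate receive memory into a window using the per-socket ratio. */
static int ws_win_from_space_ratio(int space, uint8_t ratio)
{
	return (int)(((int64_t)space * ratio) >> WS_RATIO_SHIFT);
}
```

With these example numbers, the fixed 50% rule advertises more window than the memory can hold as payload, while the measured ratio (about 35%) gives a smaller, more accurate window.
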
Paolo Abeni 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
JIRA: https://issues.redhat.com/browse/RHEL-12593
Tested: vs bz reproducer
Conflicts: the tls_sw chunk is mangled and applied in \
  tls_rx_reader_acquire(), as rhel lacks the upstream commit \
  f9ae3204fb45 ("net/tls: split  tls_rx_reader_lock"). \
  the wait_on_pending_writer() chunk did not contain the ONCE \
  annotation, as rhel lacks the upstream commit d0ac89f6f987 ("net: \
  deal with most data-races in sk_wait_event()"). The same for \
  sk_stream_wait_memory() chunk.

Upstream commit:
commit 419ce133ab928ab5efd7b50b2ef36ddfd4eadbd2
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Oct 11 09:20:55 2023 +0200

    tcp: allow again tcp_disconnect() when threads are waiting

    As reported by Tom, .NET and applications built on top of it rely
    on connect(AF_UNSPEC) to async cancel pending I/O operations on a TCP
    socket.

    The blamed commit below caused a regression, as such cancellation
    can now fail.

    As suggested by Eric, this change addresses the problem by explicitly
    causing blocking I/O operations to terminate immediately (with an error)
    when a concurrent disconnect() is executed.

    Instead of tracking the number of threads blocked on a given socket,
    track the number of disconnect() calls issued on that socket. If the
    counter changes after a blocking operation releases and re-acquires the
    socket lock, error out the current operation.

    Fixes: 4faeee0cf8a5 ("tcp: deny tcp_disconnect() when threads are waiting")
    Reported-by: Tom Deseyn <tdeseyn@redhat.com>
    Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-23 16:47:41 +02:00
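
Note on 0240ed7c51: the counter-based scheme from the commit message can be modelled in a single-threaded sketch. Everything here is hypothetical: `dc_sock`, the helper names, and the `-EPIPE` return value are choices made for the illustration, and the `while_unlocked` callback merely simulates what another thread might do while the waiter has dropped the socket lock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dc_sock {
	int disconnects; /* incremented by every disconnect */
	int data_ready;
};

/* A concurrent disconnect only bumps the counter in this model. */
static void dc_disconnect(struct dc_sock *sk)
{
	sk->disconnects++;
}

/* Simulated peer activity: data shows up while the waiter sleeps. */
static void dc_data_arrives(struct dc_sock *sk)
{
	sk->data_ready = 1;
}

/*
 * Blocking wait: sample the counter before "releasing the lock", let
 * the simulated other thread run, then compare after "re-acquiring".
 */
static int dc_wait_for_data(struct dc_sock *sk,
			    void (*while_unlocked)(struct dc_sock *))
{
	int seen = sk->disconnects;

	if (while_unlocked != NULL)
		while_unlocked(sk); /* lock is considered dropped here */

	if (sk->disconnects != seen)
		return -EPIPE; /* error value chosen for the sketch */

	return sk->data_ready ? 0 : -EAGAIN;
}
```

The waiter never blocks the disconnect; it only notices afterwards that one happened and bails out, which is what lets connect(AF_UNSPEC) work as a cancellation mechanism again.
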
Jan Stancek 2747ad2ea7 Merge: bpf, xdp: backports from upstream (phase 2)
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2748

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483

Backporting safe upstream fixes to bpf that are missing so far.

Backporting 78fa0d61d97a ("bpf, sockmap: Pass skb ownership through
read_skb") required dependencies to be backported as well:
 - 31f1fbcb346c ("udp: Refactor udp_read_skb()")
 - d6e3b27cbd2d ("af_unix: Refactor unix_read_skb()")

Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-08-04 10:01:55 +02:00
Felix Maurer 2d92cf1f17 bpf, sockmap: Pass skb ownership through read_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Context difference due to missing ec095263a965 ("net:
  remove noblock parameter from recvmsg() entities") and db39dfdc1c3b
  ("udp: Use WARN_ON_ONCE() in udp_read_skb()"); 31f1fbcb346c ("udp:
  Refactor udp_read_skb()") was adapted to reflect this
- net/vmw_vsock/virtio_transport_common.c: Skipped, because the relevant
  code is not there, missing 634f1a7110b4 ("vsock: support sockmap")

commit 78fa0d61d97a728d306b0c23d353c0e340756437
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Mon May 22 19:56:05 2023 -0700

    bpf, sockmap: Pass skb ownership through read_skb

    The read_skb hook calls consume_skb() now, but this means that if the
    recv_actor program wants to use the skb it needs to inc the ref cnt
    so that the consume_skb() doesn't kfree the sk_buff.

    This is problematic because in some error cases under memory pressure
    we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
    Then we get this,

     skb_linearize()
       __pskb_pull_tail()
         pskb_expand_head()
           BUG_ON(skb_shared(skb))

    Because we incremented the users refcnt from sk_psock_verdict_recv(), we
    reach the BUG_ON() with refcnt > 1 and trip it.

    To fix this, let's simply pass ownership of the sk_buff through the
    read_skb call. Then we can drop the consume from the read_skb handlers
    and assume the verdict recv does any required kfree.

    Bug found while testing in our CI which runs in VMs that hit memory
    constraints rather regularly. William tested TCP read_skb handlers.

    [  106.536188] ------------[ cut here ]------------
    [  106.536197] kernel BUG at net/core/skbuff.c:1693!
    [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
    [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
    [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
    [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
    [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
    [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
    [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
    [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
    [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
    [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
    [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
    [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  106.542255] Call Trace:
    [  106.542383]  <IRQ>
    [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
    [  106.542681]  skb_ensure_writable+0x85/0xa0
    [  106.542882]  sk_skb_pull_data+0x18/0x20
    [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
    [  106.543536]  ? migrate_disable+0x66/0x80
    [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
    [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
    [  106.544561]  tcp_read_skb+0x7b/0x120
    [  106.544740]  tcp_data_queue+0x904/0xee0
    [  106.544931]  tcp_rcv_established+0x212/0x7c0
    [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
    [  106.545326]  tcp_v4_rcv+0xe70/0xf60
    [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
    [  106.545744]  ip_local_deliver_finish+0xa7/0x150

    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Reported-by: William Findlay <will@isovalent.com>
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: William Findlay <will@isovalent.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-29 15:45:40 +02:00
Paolo Abeni 560a4662e8 tcp: deny tcp_disconnect() when threads are waiting
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217511
Tested: LNST, Tier1

Upstream commit:
commit 4faeee0cf8a5d88d63cdbc3bab124fb0e6aed08c
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 26 16:34:58 2023 +0000

    tcp: deny tcp_disconnect() when threads are waiting

    Historically connect(AF_UNSPEC) has been abused by syzkaller
    and other fuzzers to trigger various bugs.

    A recent one triggers a divide-by-zero [1], and Paolo Abeni
    was able to diagnose the issue.

    tcp_recvmsg_locked() checks that sk_state is not TCP_LISTEN
    and that TCP REPAIR mode is not in use.

    Then later if socket lock is released in sk_wait_data(),
    another thread can call connect(AF_UNSPEC), then make this
    socket a TCP listener.

    When recvmsg() is resumed, it can eventually call tcp_cleanup_rbuf()
    and attempt a divide by 0 in tcp_rcv_space_adjust() [1]

    This patch adds a new socket field, counting number of threads
    blocked in sk_wait_event() and inet_wait_for_connect().

    If this counter is not zero, tcp_disconnect() returns an error.

    This patch adds code in blocking socket system calls, thus should
    not hurt performance of non blocking ones.

    Note that we probably could revert commit 499350a5a6 ("tcp:
    initialize rcv_mss to TCP_MIN_MSS instead of 0") to restore
    original tcpi_rcv_mss meaning (was 0 if no payload was ever
    received on a socket)

    [1]
    divide error: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 13832 Comm: syz-executor.5 Not tainted 6.3.0-rc4-syzkaller-00224-g00c7b5f4ddc5 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
    RIP: 0010:tcp_rcv_space_adjust+0x36e/0x9d0 net/ipv4/tcp_input.c:740
    Code: 00 00 00 00 fc ff df 4c 89 64 24 48 8b 44 24 04 44 89 f9 41 81 c7 80 03 00 00 c1 e1 04 44 29 f0 48 63 c9 48 01 e9 48 0f af c1 <49> f7 f6 48 8d 04 41 48 89 44 24 40 48 8b 44 24 30 48 c1 e8 03 48
    RSP: 0018:ffffc900033af660 EFLAGS: 00010206
    RAX: 4a66b76cbade2c48 RBX: ffff888076640cc0 RCX: 00000000c334e4ac
    RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000001
    RBP: 00000000c324e86c R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880766417f8
    R13: ffff888028fbb980 R14: 0000000000000000 R15: 0000000000010344
    FS: 00007f5bffbfe700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000001b32f25000 CR3: 000000007ced0000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    tcp_recvmsg_locked+0x100e/0x22e0 net/ipv4/tcp.c:2616
    tcp_recvmsg+0x117/0x620 net/ipv4/tcp.c:2681
    inet6_recvmsg+0x114/0x640 net/ipv6/af_inet6.c:670
    sock_recvmsg_nosec net/socket.c:1017 [inline]
    sock_recvmsg+0xe2/0x160 net/socket.c:1038
    ____sys_recvmsg+0x210/0x5a0 net/socket.c:2720
    ___sys_recvmsg+0xf2/0x180 net/socket.c:2762
    do_recvmmsg+0x25e/0x6e0 net/socket.c:2856
    __sys_recvmmsg net/socket.c:2935 [inline]
    __do_sys_recvmmsg net/socket.c:2958 [inline]
    __se_sys_recvmmsg net/socket.c:2951 [inline]
    __x64_sys_recvmmsg+0x20f/0x260 net/socket.c:2951
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f5c0108c0f9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f5bffbfe168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
    RAX: ffffffffffffffda RBX: 00007f5c011ac050 RCX: 00007f5c0108c0f9
    RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000003
    RBP: 00007f5c010e7b39 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007f5c012cfb1f R14: 00007f5bffbfe300 R15: 0000000000022000
    </TASK>

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Reported-by: Paolo Abeni <pabeni@redhat.com>
    Diagnosed-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Tested-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20230526163458.2880232-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 15:43:53 +02:00
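
Note on 560a4662e8: the waiter-counting approach described in the commit message can be sketched as follows. The names (`wp_sock`, `wp_wait_begin()`, `wp_wait_end()`, `wp_try_disconnect()`) and the `-EBUSY` error value are assumptions made for this illustration, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>

struct wp_sock {
	int wait_pending; /* number of threads currently blocked */
	int connected;
};

/* Called by a blocking operation just before it goes to sleep. */
static void wp_wait_begin(struct wp_sock *sk)
{
	sk->wait_pending++;
}

/* Called by a blocking operation once it has woken up. */
static void wp_wait_end(struct wp_sock *sk)
{
	assert(sk->wait_pending > 0);
	sk->wait_pending--;
}

/* Disconnect is refused while any thread is still blocked on the socket. */
static int wp_try_disconnect(struct wp_sock *sk)
{
	if (sk->wait_pending)
		return -EBUSY; /* error value chosen for the sketch */

	sk->connected = 0;
	return 0;
}
```

Only blocking paths touch the counter, which matches the commit message's point that non-blocking calls should not pay for the check.
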
Jan Stancek dea08a5636 Merge: net: mptcp: rebase to latest net-next
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2479

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193330
Upstream Status: All mainline in net.git.
Tested: boot+kselftest
Conflicts: see individual commits

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-19 08:29:21 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve.  Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Davide Caratti c6f30ffe1a net: cache align tcp_memory_allocated, tcp_sockets_allocated
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193330
Upstream Status: net.git commit 91b6d3256356

commit 91b6d325635617540b6a1646ddb138bb17cbd569
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:39 2021 -0800

    net: cache align tcp_memory_allocated, tcp_sockets_allocated

    tcp_memory_allocated and tcp_sockets_allocated often share
    a common cache line, source of false sharing.

    Also take care of udp_memory_allocated and mptcp_sockets_allocated.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-05-09 11:08:43 +02:00
Jeff Moyer 5fbf8901c6 net: shrink struct ubuf_info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit e7d2b510165fff6bedc9cca88c071ad846850c74
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Sep 23 17:39:04 2022 +0100

    net: shrink struct ubuf_info
    
    We can benefit from a smaller struct ubuf_info, so leave only mandatory
    fields and let users decide how they want to extend it. Convert
    MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields.
    This reduces the size from 48 bytes to just 16.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:25:02 -04:00
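
Note on 5fbf8901c6: the "small base struct, user-defined extension" idea from the commit message can be sketched with struct embedding. The field layout below is hypothetical and only mimics the shape of the change; `zb_ubuf`, `zb_ubuf_msgzc`, and `zb_container_of()` are names invented for this example, and the size of the base struct depends on the platform ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal base: only what every user needs (hypothetical layout). */
struct zb_ubuf {
	void (*callback)(struct zb_ubuf *ub);
	uint32_t refcnt;
	uint8_t flags;
};

/* One user's extension: embeds the base and adds its own fields. */
struct zb_ubuf_msgzc {
	struct zb_ubuf ubuf;
	uint32_t id;
	uint16_t len;
	uint8_t completed;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define zb_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Completion callback: receives the base, reaches the extension fields. */
static void zb_msgzc_complete(struct zb_ubuf *ub)
{
	struct zb_ubuf_msgzc *zc =
		zb_container_of(ub, struct zb_ubuf_msgzc, ubuf);

	zc->completed = 1;
}
```

Generic code only passes `struct zb_ubuf *` around, so each user pays only for the fields it actually adds.
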
Jeff Moyer 87aedebebc net: flag sockets supporting msghdr originated zerocopy
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Conflicts: include/linux/net.h - upstream there was a conflict between
  SOCK_CUSTOM_SOCKOPT and SOCK_SUPPORT_ZC.  There, it was resolved
  with the former getting defined as 6, and the latter as 5.  However,
  in the RHEL backport of a5ef058dc4d9 ("net: introduce and use custom
  sockopt socket flag"), 5 was chosen for SOCK_CUSTOM_SOCKOPT.  I
  could renumber it to 6 to match upstream, but that risks introducing
  unnecessary incompatibilities for 3rd party modules, so I opted to
  differ from upstream.  net/ipv4/udp.c - RHEL has a backport of
  commit 8a3854c7b8e4 ("udp: track the forward memory release
  threshold in an hot cacheline") out of order with this commit.  It's
  a simple fixup.

commit e993ffe3da4bcddea0536b03be1031bf35cd8d85
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Oct 21 11:16:39 2022 +0100

    net: flag sockets supporting msghdr originated zerocopy
    
    We need an efficient way in io_uring to check whether a socket supports
    zerocopy with msghdr provided ubuf_info. Add a new flag into the struct
    socket flags fields.
    
    Cc: <stable@vger.kernel.org> # 6.0
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:24:12 -04:00
Jeff Moyer abfc92436c tcp: support externally provided ubufs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit eb315a7d1396b1139fc7daea55f2d3191e8e7092
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Jul 12 21:52:35 2022 +0100

    tcp: support externally provided ubufs
    
    Teach tcp how to use external ubuf_info provided in msghdr and
    also prepare it for managed frags by sprinkling
    skb_zcopy_downgrade_managed() when it could mix managed and not managed
    frags.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:12:02 -04:00
Jeff Moyer ff18f5e1d8 tcp: take care of mixed splice()/sendmsg(MSG_ZEROCOPY) case
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit f8d9d938514f46c4892aff6bfe32f425e84d81cc
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 3 14:55:47 2022 -0800

    tcp: take care of mixed splice()/sendmsg(MSG_ZEROCOPY) case
    
    syzbot found that mixing sendpage() and sendmsg(MSG_ZEROCOPY)
    calls over the same TCP socket would again trigger the
    infamous warning in inet_sock_destruct()
    
            WARN_ON(sk_forward_alloc_get(sk));
    
    While Talal took into account a mix of regular copied data
    and MSG_ZEROCOPY one in the same skb, the sendpage() path
    has been forgotten.
    
    We want the charging to happen for sendpage(), because
    pages could be coming from a pipe. What is missing is the
    downgrading of pure zerocopy status to make sure
    sk_forward_alloc will stay synced.
    
    Add tcp_downgrade_zcopy_pure() helper so that we can
    use it from the two callers.
    
    Fixes: 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20220203225547.665114-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:11:02 -04:00
Jeff Moyer 6af0ee8427 tcp: fix mem under-charging with zerocopy sendmsg()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 479f5547239d970d3833f15f54a6481fffdb91ec
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jan 31 22:52:54 2022 -0800

    tcp: fix mem under-charging with zerocopy sendmsg()
    
    We got reports of following warning in inet_sock_destruct()
    
            WARN_ON(sk_forward_alloc_get(sk));
    
    Whenever we add a non zero-copy fragment to a pure zerocopy skb,
    we have to anticipate that whole skb->truesize will be uncharged
    when skb is finally freed.
    
    skb->data_len is the payload length. But the memory truesize
    estimated by __zerocopy_sg_from_iter() is page aligned.
    
    Fixes: 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20220201065254.680532-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:10:02 -04:00
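
Note on 6af0ee8427: the mismatch between payload length and a page-aligned truesize estimate is simple arithmetic. The sketch below assumes a 4096-byte page and uses hypothetical helper names; it shows how much would be left uncharged if only the payload length were charged while the page-aligned amount is later uncharged.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for the sketch; real systems may differ. */
#define UC_PAGE_SIZE 4096u

/* Round a length up to the next page boundary. */
static uint32_t uc_page_align(uint32_t len)
{
	return (len + UC_PAGE_SIZE - 1) & ~(UC_PAGE_SIZE - 1);
}

/* Bytes missing from the charge if only the payload length is charged. */
static uint32_t uc_undercharge(uint32_t data_len)
{
	return uc_page_align(data_len) - data_len;
}
```

For a 1000-byte payload the gap is 3096 bytes, and it disappears only when the payload is an exact multiple of the page size.
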
Jeff Moyer d19688b83d net: avoid double accounting for pure zerocopy skbs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 9b65b17db72313b7a4fe9bc9502928c88be57986
Author: Talal Ahmad <talalahmad@google.com>
Date:   Tue Nov 2 22:58:44 2021 -0400

    net: avoid double accounting for pure zerocopy skbs
    
    Track skbs containing only zerocopy data and avoid charging them to
    kernel memory to correctly account the memory utilization for
    msg_zerocopy. All of the data in such skbs is held in user pages which
    are already accounted to user. Before this change, they are charged
    again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
    excessive because data is not being copied into skb frags. This
    excessive charging can lead to kernel going into memory pressure
    state which impacts all sockets in the system adversely. Mark pure
    zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
    charge/uncharge for data in such skbs.
    
    Initially, an skb is marked pure zerocopy when it is empty and in
    zerocopy path. skb can then change from a pure zerocopy skb to mixed
    data skb (zerocopy and copy data) if it is at tail of write queue and
    there is room available in it and non-zerocopy data is being sent in
    the next sendmsg call. At this time sk_mem_charge is done for the pure
    zerocopied data and the pure zerocopy flag is unmarked. We found that
    this happens very rarely on workloads that pass MSG_ZEROCOPY.
    
    A pure zerocopy skb can later be coalesced into normal skb if they are
    next to each other in queue but this patch prevents coalescing from
    happening. This avoids complexity of charging when skb downgrades from
    pure zerocopy to mixed. This is also rare.
    
    In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
    for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
    tcp_skb_entail for an skb without data.
    
    Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
    with zerocopy showed that before this patch the 'sock' variable in
    memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
    sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
    change it is 0. This is due to no charge to sk_forward_alloc for
    zerocopy data and shows memory utilization for kernel is lowered.
    
    With this commit we don't see the warning we saw in previous commit
    which resulted in commit 84882cf72cd774cf16fd338bdbf00f69ac9f9194.
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:03:02 -04:00
Jeff Moyer 474b0b4e6c tcp: rename sk_wmem_free_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 03271f3a3594c0e88f68d8cfbec0ba250b2c538a
Author: Talal Ahmad <talalahmad@google.com>
Date:   Fri Oct 29 22:05:41 2021 -0400

    tcp: rename sk_wmem_free_skb
    
    sk_wmem_free_skb() is only used by TCP.
    
    Rename it to make this clear, and move its declaration to
    include/net/tcp.h
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:02:02 -04:00
Paolo Abeni 79d712b0c0 tcp: fix rate_app_limited to default to 1
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 300b655db1b5152d6101bcb6801d50899b20c2d6
Author: David Morley <morleyd@google.com>
Date:   Thu Jan 19 19:00:28 2023 +0000

    tcp: fix rate_app_limited to default to 1

    The initial default value of 0 for tp->rate_app_limited was incorrect,
    since a flow is indeed application-limited until it first sends
    data. Fixing the default to be 1 is generally correct but also
    specifically will help user-space applications avoid using the initial
    tcpi_delivery_rate value of 0 that persists until the connection has
    some non-zero bandwidth sample.

    Fixes: eb8329e0a0 ("tcp: export data delivery rate")
    Suggested-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: David Morley <morleyd@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Tested-by: David Morley <morleyd@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 10:00:53 +02:00
Paolo Abeni b2f3cdda1c tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 0c175da7b0378445f5ef53904247cfbfb87e0b78
Author: Lu Wei <luwei32@huawei.com>
Date:   Fri Nov 4 10:27:23 2022 +0800

    tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent

    If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
    of TCPOPT_SACK_PERM is called to enable sack after data is sent
    and dupacks are received, it will trigger a warning in function
    tcp_verify_left_out() as follows:

    ============================================
    WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
    tcp_timeout_mark_lost+0x154/0x160
    tcp_enter_loss+0x2b/0x290
    tcp_retransmit_timer+0x50b/0x640
    tcp_write_timer_handler+0x1c8/0x340
    tcp_write_timer+0xe5/0x140
    call_timer_fn+0x3a/0x1b0
    __run_timers.part.0+0x1bf/0x2d0
    run_timer_softirq+0x43/0xb0
    __do_softirq+0xfd/0x373
    __irq_exit_rcu+0xf6/0x140

    The warning is caused in the following steps:
    1. a socket named socketA is created
    2. socketA enters repair mode without building a connection
    3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
       directly
    4. socketA leaves repair mode
    5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
       ack receives) increase
    6. socketA enters repair mode again
    7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
    8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
       increases
    9. sack_outs + lost_out > packets_out triggers since lost_out and
       sack_outs increase repeatedly

    In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
    step 7 does not happen, and the warning will not be triggered. As suggested by
    Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
    already sent.

    The socket-tcp tests in CRIU have been run as follows:
    $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
           --ignore-taint

    socket-tcp* represent all socket-tcp tests in test/zdtm/static/.

    Fixes: b139ba4e90 ("tcp: Repair connection-time negotiated parameters")
    Signed-off-by: Lu Wei <luwei32@huawei.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:56:59 +02:00
Jan Stancek f302196b1b Merge: io_uring: update to v5.19
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2190

Sync up our io_uring code with upstream v5.19, but do not enable it.  The goal is to be bug-for-bug compatible with this version of the code.  I'll post further MRs that will sync to later releases, and then a final MR with remaining fixes.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123490

Omitted-fix: df6d3422d3ee ("io_uring/kbuf: fix not advancing READV kbuf ring")
	Fixes will be pulled in by later merge requests
Omitted-fix: 9d94c04c0db0 ("io_uring/filetable: fix file reference underflow")
	Fixes will be pulled in by later merge requests
Omitted-fix: 48ba08374e77 ("io_uring: fix size calculation when registering buf ring")
	Fixes will be pulled in by later merge requests
Omitted-fix: 36632d062975 ("io_uring: Replace 0-length array with flexible array")
	Fixes will be pulled in by later merge requests
Omitted-fix: 336d28a8f380 ("io_uring: recycle kbuf recycle on tw requeue")
	Fixes will be pulled in by later merge requests
Omitted-fix: 91482864768a ("io_uring: fix multishot accept request leaks")
	Fixes will be pulled in by later merge requests
Omitted-fix: dd9373402280 ("Smack: Provide read control for io_uring_cmd")
	Fixes will be pulled in by later merge requests
Omitted-fix: f4d653dcaa4e ("selinux: implement the security_uring_cmd() LSM hook")
	Fixes will be pulled in by later merge requests
Omitted-fix: 2a5840124009 ("lsm,io_uring: add LSM hooks for the new uring_cmd file op")
	Fixes will be pulled in by later merge requests
Omitted-fix: 3b8fdd1dc35e ("io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128")
	Fixes will be pulled in by later merge requests
Omitted-fix: 00927931cb63 ("io_uring: fix fdinfo sqe offsets calculation")
	Fixes will be pulled in by later merge requests
Omitted-fix: 9d2789ac9d60 ("block/io_uring: pass in issue_flags for uring_cmd task_work handling")
	Fixes will be pulled in by later merge requests
Omitted-fix: 02a4d923e440 ("io_uring/rsrc: fix null-ptr-deref in io_file_bitmap_get()")
	Fixes will be pulled in by later merge requests

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-13 07:46:41 +02:00
Jeff Moyer 1693856c78 tcp: pass back data left in socket after receive
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123490

commit f94fd25cb0aaf77fd7453f31c5d394a1a68ecf60
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Apr 28 18:45:06 2022 -0600

    tcp: pass back data left in socket after receive
    
    This is currently done for CMSG_INQ, add an ability to do so via struct
    msghdr as well and have CMSG_INQ use that too. If the caller sets
    msghdr->msg_get_inq, then we'll pass back the hint in msghdr->msg_inq.
    
    Rearrange struct msghdr a bit so we can add this member while shrinking
    it at the same time. On a 64-bit build, it was 96 bytes before this
    change and 88 bytes afterwards.
    
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Link: https://lore.kernel.org/r/650c22ca-cffc-0255-9a05-2413a1e20826@kernel.dk
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-03-16 08:37:43 -04:00
Felix Maurer ee5a18be02 bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 273b7f0fb44847c41814a59901977be284daa447
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:29:18 2022 -0700

    bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
    
    This patch changes bpf_getsockopt(SOL_TCP) to reuse
    do_tcp_getsockopt().  It removes the duplicated code from
    bpf_getsockopt(SOL_TCP).
    
    Before this patch, there were some optnames available to
    bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP).
    For example, TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
    and a few more.  It surprises users from time to time.  This patch
    automatically closes this gap without duplicating more code.
    
    bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn,
    so it stays in sol_tcp_sockopt().
    
    For string name value like TCP_CONGESTION, bpf expects it
    is always null terminated, so sol_tcp_sockopt() decrements
    optlen by one before calling do_tcp_getsockopt() and
    the 'if (optlen < saved_optlen) memset(..,0,..);'
    in __bpf_getsockopt() will always do a null termination.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:33 +01:00
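The null-termination rule described in the bpf_getsockopt(SOL_TCP) commit above can be sketched in user space. Everything here is a model under assumptions: model_copy_name() stands in for the shared getsockopt path, MODEL_CA_NAME_MAX is an assumed buffer size, and model_bpf_get_name() plays the role of the bpf-side caller that decrements optlen by one and then zeroes the tail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_CA_NAME_MAX 16  /* assumed size of the NUL-padded name buffer */

/* Stand-in for the shared getsockopt path: raw copy, no NUL guarantee. */
static void model_copy_name(const char name[MODEL_CA_NAME_MAX],
                            char *optval, size_t *optlen)
{
    size_t n = *optlen < MODEL_CA_NAME_MAX ? *optlen : MODEL_CA_NAME_MAX;

    memcpy(optval, name, n);
    *optlen = n;
}

/* bpf-side wrapper: reserve one byte, then zero everything past the copy. */
int model_bpf_get_name(const char name[MODEL_CA_NAME_MAX],
                       char *optval, size_t saved_optlen)
{
    size_t optlen;

    assert(name != NULL && optval != NULL);
    if (saved_optlen < 2)
        return -EINVAL;              /* no room for one char plus NUL */

    optlen = saved_optlen - 1;       /* decrement before the shared call */
    model_copy_name(name, optval, &optlen);
    if (optlen < saved_optlen)       /* always true after the decrement */
        memset(optval + optlen, 0, saved_optlen - optlen);
    return 0;
}
```

Because the shared path never sees the last byte of the caller's buffer, the final memset always has at least one byte to clear, which is what guarantees termination even when the name is truncated.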
Felix Maurer 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
Bugzilla: https://bugzilla.redhat.com/2166911

commit d51bbff2aba7880c669e3ed1b4a5a64fed684bf0
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:28:21 2022 -0700

    bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
    
    Similar to the earlier commit that changed sk_setsockopt() to
    use sockopt_{lock,release}_sock() such that it can avoid taking
    lock when called from bpf.  This patch also changes do_tcp_getsockopt()
    to use sockopt_{lock,release}_sock() such that a later patch can
    make bpf_getsockopt(SOL_TCP) reuse do_tcp_getsockopt().
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002821.2889765-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:31 +01:00
Felix Maurer e5a88caccb bpf: net: Change do_tcp_getsockopt() to take the sockptr_t argument
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts:
- net/ipv4/tcp.c: context difference due to already applied d109429414
  ("tcp: Fix data races around icsk->icsk_af_ops.")

commit 34704ef024ae6167c7ae9e67f671eb6bc1962c90
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:28:15 2022 -0700

    bpf: net: Change do_tcp_getsockopt() to take the sockptr_t argument

    Similar to the earlier patch that changes sk_getsockopt() to
    take the sockptr_t argument.  This patch also changes
    do_tcp_getsockopt() to take the sockptr_t argument such that
    a later patch can make bpf_getsockopt(SOL_TCP) reuse
    do_tcp_getsockopt().

    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002815.2889332-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:31 +01:00
Felix Maurer 72574a4403 bpf, net: Avoid loading module when calling bpf_setsockopt(TCP_CONGESTION)
Bugzilla: https://bugzilla.redhat.com/2166911

commit 84e5a0f208ca341ec1ea88a97c40849a2d541faa
Author: Martin KaFai Lau <martin.lau@linux.dev>
Date:   Tue Aug 30 16:19:46 2022 -0700

    bpf, net: Avoid loading module when calling bpf_setsockopt(TCP_CONGESTION)
    
    When bpf prog changes tcp-cc by calling bpf_setsockopt(TCP_CONGESTION),
    it should not try to load module which may be a blocking operation.
    This detail was correct in the v1 [0] but missed by mistake in the
    later revision in commit cb388e7ee3a8 ("bpf: net: Change do_tcp_setsockopt()
    to use the sockopt's lock_sock() and capable()"). This patch fixes it by
    checking the has_current_bpf_ctx().
    
      [0] https://lore.kernel.org/bpf/20220727060921.2373314-1-kafai@fb.com/
    
    Fixes: cb388e7ee3a8 ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()")
    Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220830231946.791504-1-martin.lau@linux.dev

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:31 +01:00
Felix Maurer 4eef215a5c bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 0c751f7071ef98d334ed06ca3f8f4cc1f7458cf5
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Tue Aug 16 23:18:19 2022 -0700

    bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
    
    After the prep work in the previous patches,
    this patch removes all the dup code from bpf_setsockopt(SOL_TCP)
    and reuses the do_tcp_setsockopt().
    
    The existing optname white-list is refactored into a new
    function sol_tcp_setsockopt().  The sol_tcp_setsockopt()
    also calls the bpf_sol_tcp_setsockopt() to handle
    the TCP_BPF_XXX specific optnames.
    
    bpf_setsockopt(TCP_SAVE_SYN) now also allows a value 2 to
    save the eth header also and it comes for free from
    do_tcp_setsockopt().
    
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:30 +01:00
Felix Maurer 4f156d8614 bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()
Bugzilla: https://bugzilla.redhat.com/2166911

commit cb388e7ee3a824250a66b854adae9f03b70956f1
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Tue Aug 16 23:17:30 2022 -0700

    bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()
    
    Similar to the earlier patch that avoids sk_setsockopt() from
    taking sk lock and doing capable test when called by bpf.  This patch
    changes do_tcp_setsockopt() to use the sockopt_{lock,release}_sock()
    and sockopt_[ns_]capable().
    
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220817061730.4176021-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:29 +01:00
Guillaume Nault 8eaa312967 tcp: Fix a data-race around sysctl_tcp_autocorking.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 85225e6f0a76e6745bc841c9f25169c509b573d8
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:25 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_autocorking.

    While reading sysctl_tcp_autocorking, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: f54b311142 ("tcp: auto corking")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
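The READ_ONCE() annotation pattern used by the sysctl data-race commits in this log can be sketched in user space as follows. The MODEL_READ_ONCE and MODEL_WRITE_ONCE macros are simplified stand-ins (the kernel versions are more elaborate), and model_sysctl_autocorking and both functions are hypothetical names. The volatile access only forces a single, non-repeated load or store; it does not add any ordering.

```c
#include <assert.h>

/* Simplified user-space stand-ins; the kernel macros are more elaborate. */
#define MODEL_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical tunable that another thread may rewrite at any time. */
int model_sysctl_autocorking = 1;

/* Writer side, for example a sysctl handler. */
void model_set_autocorking(int val)
{
    MODEL_WRITE_ONCE(model_sysctl_autocorking, val);
}

/* Reader side: load once, then use only the local snapshot. */
int model_should_autocork(int queued_bytes, int size_goal)
{
    int enabled = MODEL_READ_ONCE(model_sysctl_autocorking);

    assert(size_goal > 0);
    return enabled && queued_bytes < size_goal;
}
```

Taking one snapshot per decision keeps the reader from observing two different values of the tunable within the same code path.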
Herton R. Krzesinski ee17c5d305 Merge: bpf, xdp: update to 6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1742

bpf, xdp: update to 6.0

Bugzilla: https://bugzilla.redhat.com/2137876

Signed-off-by: Artem Savkov <asavkov@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Michael Petlan <mpetlan@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-12 16:01:19 +00:00
Herton R. Krzesinski 621a3b0cfb Merge: net: Backport data race annotations in the networking stack (part 1).
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1722

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
Conflicts: Few minor conflicts, see description in affected commits.

Properly mark concurrent reads and writes with READ_ONCE() and
WRITE_ONCE() in various parts of the networking stack. This is a
backport of the following upstream patch series:
  
  * Patch set A: merge commit e97e68b56e78 ("Merge branch 'sk_bound_dev_if-annotations'")
  * Patch set B: merge commit 32b3ad1418ea ("Merge branch 'sysctl-data-races'")
  * Patch set C: merge commit 7d5424b26f17 ("Merge branch 'net-sysctl-races'")
  * Patch set D: merge commit 782d86fe44e3 ("Merge branch 'net-sysctl-races-round2'")
  * Patch set E: merge commit c9f21106d97b ("Merge branch 'net-ipv4-sysctl-races-part-3'")

Patch 1 is a standalone READ_ONCE() annotation for sk->sk_bound_dev_if.
It's a prerequisite for correctly backporting patch set A.

Patches 2-9 are backports of patch set A. The following upstream
patches have been omitted since they're already in Centos Stream:
  
  * Upstream commit a20ea298071f ("sctp: read sk->sk_bound_dev_if once
    in sctp_rcv()"), backported by Centos Stream commit 5d539b8523.

  * Upstream commit 70f87de9fa0d ("net_sched: em_meta: add READ_ONCE()
    in var_sk_bound_if()"), backported by Centos Stream commit
    866ca288f3.

Patch 10 was in the original upstream series of patch set B, but was
resubmitted independently as that series was reworked before being
applied. Therefore, it doesn't strictly belong to patch set B, but is
closely related to it and is thus backported here.

Patches 11-21 are backports of patch set B. The following upstream
patch has been omitted since it's already in Centos Stream:
  
  * Upstream commit 310731e2f161 ("net: Fix data-races around
    sysctl_mem."), backported by Centos Stream commit a99b2cb4eb.

Patches 22-36 are backports corresponding to patch set C.

Patches 37-51 are backports corresponding to patch set D.

Patches 52-66 are backports corresponding to patch set E.

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-09 15:37:27 +00:00
Felix Maurer 619ca0e7c3 tcp: read multiple skbs in tcp_read_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137876

commit db4192a754ebd52300a28abe1a50dd18eae0eb12
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Mon Sep 12 10:35:53 2022 -0700

    tcp: read multiple skbs in tcp_read_skb()

    Before we switched to ->read_skb(), ->read_sock() was passed with
    desc.count=1, which technically indicates we only read one skb per
    ->sk_data_ready() call. However, for TCP, this is not true.

    TCP at least has sk_rcvlowat which intentionally holds skb's in
    receive queue until this watermark is reached. This means when
    ->sk_data_ready() is invoked there could be multiple skb's in the
    queue, therefore we have to read multiple skbs in tcp_read_skb()
    instead of one.

    Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
    Reported-by: Peilin Ye <peilin.ye@bytedance.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Link: https://lore.kernel.org/r/20220912173553.235838-1-xiyou.wangcong@gmail.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:50:16 +01:00
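The behaviour described in the tcp_read_skb() commit above (drain every queued skb per wakeup, and tell a zero-length skb apart from an error) can be modelled in user space. This is a sketch, not the kernel function: struct model_skb, model_actor_t, the two sample actors, and model_read_skbs() are hypothetical, and the array stands in for the socket receive queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued skb: only its length and a FIN flag. */
struct model_skb {
    int len;
    int fin;
};

/* Actor callback: returns bytes consumed (0 is valid) or a negative error. */
typedef int (*model_actor_t)(const struct model_skb *skb);

/* Sample actors for demonstration. */
int model_actor_take_all(const struct model_skb *skb) { return skb->len; }
int model_actor_fail(const struct model_skb *skb) { (void)skb; return -1; }

/*
 * Drain the whole queue in one call. Stops on FIN or on an actor error;
 * the error is reported only if nothing was copied before it.
 * *consumed receives the number of skbs taken off the queue.
 */
int model_read_skbs(const struct model_skb *queue, size_t n,
                    model_actor_t actor, size_t *consumed)
{
    int copied = 0;
    size_t i;

    assert(actor != NULL && consumed != NULL);
    for (i = 0; i < n; i++) {
        int used = actor(&queue[i]);

        if (used < 0) {
            if (copied == 0)
                copied = used;
            i++;            /* model assumption: failing skb was dequeued */
            break;
        }
        copied += used;
        if (queue[i].fin) {
            i++;
            break;
        }
    }
    *consumed = i;
    return copied;
}
```

A return value of 0 with a non-zero consumed count corresponds to the pure FIN case, while a negative return means the actor failed before any byte was copied.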
Felix Maurer 3325c6902b tcp: Use WARN_ON_ONCE() in tcp_read_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137876

commit 96628951869c0dedf0377adca01c8675172d8639
Author: Peilin Ye <peilin.ye@bytedance.com>
Date:   Thu Sep 8 16:15:23 2022 -0700

    tcp: Use WARN_ON_ONCE() in tcp_read_skb()

    Prevent tcp_read_skb() from flooding the syslog.

    Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:50:16 +01:00
Felix Maurer fdab1f5740 tcp: handle pure FIN case correctly
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137876

commit 2e23acd99efacfd2a63cb9725afbc65e4e964fb7
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Aug 17 12:54:45 2022 -0700

    tcp: handle pure FIN case correctly

    When skb->len==0, the recv_actor() returns 0 too, but we also use 0
    for error conditions. This patch amends this by propagating the errors
    to tcp_read_skb() so that we can distinguish skb->len==0 case from
    error cases.

    Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")
    Reported-by: Eric Dumazet <edumazet@google.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:50:16 +01:00
Felix Maurer ce685612f8 tcp: refactor tcp_read_skb() a bit
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137876

commit a8688821f3854f37fe0198b8945f9cfc051ab2cf
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Aug 17 12:54:44 2022 -0700

    tcp: refactor tcp_read_skb() a bit

    As tcp_read_skb() only reads one skb at a time, the while loop is
    unnecessary, we can turn it into an if. This also simplifies the
    code logic.

    Cc: Eric Dumazet <edumazet@google.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:50:16 +01:00
Felix Maurer 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137876
Conflicts: Context difference due to not yet applied 3df684c1a3d08 ("tcp:
avoid indirect calls to sock_rfree") and already applied 3f92a64e44e5
("tcp: allow tls to decrypt directly from the tcp rcv queue")

commit c457985aaa92e1fda2ce837cabf90bf687b92dcb
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Aug 17 12:54:43 2022 -0700

    tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()

    tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it
    assumes the skb is not yet dequeued. This is no longer true for
    tcp_read_skb() case where we dequeue the skb first.

    Fix this by introducing a helper __tcp_cleanup_rbuf() which does
    not require any skb and calling it in tcp_read_skb().

    Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:49:55 +01:00
Felix Maurer 72df553bf8 tcp: fix sock skb accounting in tcp_read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876

commit e9c6e79760265f019cde39d3f2c443dfbc1395b0
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Aug 17 12:54:42 2022 -0700

    tcp: fix sock skb accounting in tcp_read_skb()
    
    Before commit 965b57b469a5 ("net: Introduce a new proto_ops
    ->read_skb()"), skb was not dequeued from receive queue hence
    when we close TCP socket skb can be just flushed synchronously.
    
    After this commit, we have to uncharge skb immediately after being
    dequeued, otherwise it is still charged in the original sock. And we
    still need to retain skb->sk, as eBPF programs may extract sock
    information from skb->sk. Therefore, we have to call
    skb_set_owner_sk_safe() here.
    
    Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
    Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
    Tested-by: Stanislav Fomichev <sdf@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer 09faf01cb9 net: Introduce a new proto_ops ->read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: Context difference due to not yet applied 314001f0bf927
("af_unix: Add OOB support") and already applied 3f92a64e44e5 ("tcp:
allow tls to decrypt directly from the tcp rcv queue")

commit 965b57b469a589d64d81b1688b38dcb537011bb0
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:12 2022 -0700

    net: Introduce a new proto_ops ->read_skb()

    Currently both splice() and sockmap use ->read_sock() to
    read skb from receive queue, but for sockmap we only read
    one entire skb at a time, so ->read_sock() is too conservative
    to use. Introduce a new proto_ops ->read_skb() which supports
    this semantic, with this we can finally pass the ownership of
    skb to recv actors.

    For non-TCP protocols, all ->read_sock() can be simply
    converted to ->read_skb().

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer a4ae9a073c tcp: Introduce tcp_read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876
Conflicts: 3f92a64e44e5 "tcp: allow tls to decrypt directly from the
           tcp rcv queue" already backported.

commit 04919bed948dc22a0032a9da867b7dcb8aece4ca
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:11 2022 -0700

    tcp: Introduce tcp_read_skb()

    This patch introduces tcp_read_skb() based on tcp_read_sock(),
    a preparation for the next patch which actually introduces
    a new sock ops.

    TCP is special here, because it has tcp_read_sock() which is
    mainly used by splice(). tcp_read_sock() supports partial read
    and arbitrary offset, neither of them is needed for sockmap.

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Guillaume Nault 284b4d5d77 tcp: Fix data-races around sysctl_tcp_fastopen.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 5a54213318c43f4009ae158347aa6016e3b9b55a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:54 2022 -0700

    tcp: Fix data-races around sysctl_tcp_fastopen.

    While reading sysctl_tcp_fastopen, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 2100c8d2d9 ("net-tcp: Fast Open base")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:58 +01:00
Guillaume Nault 3b5433cc57 tcp: Fix data-races around some timeout sysctl knobs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 39e24435a776e9de5c6dd188836cf2523547804b
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:50 2022 -0700

    tcp: Fix data-races around some timeout sysctl knobs.

    While reading these sysctl knobs, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - tcp_retries1
      - tcp_retries2
      - tcp_orphan_retries
      - tcp_fin_timeout

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:58 +01:00
Guillaume Nault 8e09537936 tcp: Fix data-races around sysctl_tcp_reordering.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 46778cd16e6a5ad1b2e3a91f6c057c907379418e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:49 2022 -0700

    tcp: Fix data-races around sysctl_tcp_reordering.

    While reading sysctl_tcp_reordering, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:57 +01:00
Guillaume Nault fe19c398fb tcp: Fix data-races around sysctl_tcp_syn(ack)?_retries.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 20a3b1c0f603e8c55c3396abd12dfcfb523e4d3c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:46 2022 -0700

    tcp: Fix data-races around sysctl_tcp_syn(ack)?_retries.

    While reading sysctl_tcp_syn(ack)?_retries, they can be changed
    concurrently.  Thus, we need to add READ_ONCE() to their readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:57 +01:00
Guillaume Nault 6a8bdf1542 tcp: Fix a data-race around sysctl_tcp_max_orphans.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 47e6ab24e8c6e3ca10ceb5835413f401f90de4bf
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 6 16:39:58 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_max_orphans.

    While reading sysctl_tcp_max_orphans, it can be changed concurrently.
    So, we need to add READ_ONCE() to avoid a data-race.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:49 +01:00
Sabrina Dubroca adef18fac2 tcp: allow tls to decrypt directly from the tcp rcv queue
Tested: selftests
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143700

Conflicts: missing tcp_read_skb from commit 965b57b469a5 ("net:
    Introduce a new proto_ops ->read_skb()")

commit 3f92a64e44e5823a975cbf2c9f05ab1893fd4cb7
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Fri Jul 22 16:50:31 2022 -0700

    tcp: allow tls to decrypt directly from the tcp rcv queue

    Expose TCP rx queue accessor and cleanup, so that TLS can
    decrypt directly from the TCP queue. The expectation
    is that the caller can access the skb returned from
    tcp_recv_skb() and up to inq bytes worth of data (some
    of which may be in ->next skbs) and then call
    tcp_read_done() when data has been consumed.
    The socket lock must be held continuously across
    those two operations.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2022-12-02 08:54:44 +01:00
Sabrina Dubroca 64f998e588 tcp: avoid indirect calls to sock_rfree
Tested: selftests
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143700

commit 3df684c1a3d08a4f649689053a3d527b3b5fda9e
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:45 2021 -0800

    tcp: avoid indirect calls to sock_rfree

    TCP uses sk_eat_skb() when skbs can be removed from receive queue.
    However, the call to skb_orphan() from __kfree_skb() incurs
    an indirect call to sock_rfree(), which is more expensive than
    a direct call, especially for CONFIG_RETPOLINE=y.

    Add tcp_eat_recv_skb() function to make the call before
    __kfree_skb().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2022-12-02 08:54:44 +01:00
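The idea behind the sock_rfree commit above, replacing an indirect call with a pointer comparison followed by a direct call, can be shown with a user-space model. All names here (struct model_skb, model_sock_rfree(), model_kfree_skb(), model_eat_recv_skb()) are hypothetical stand-ins, and the sketch makes no claim about the exact kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
    void (*destructor)(struct model_skb *skb);
    int rfree_calls;   /* how many times model_sock_rfree() ran */
};

/* Stand-in for the common receive-path destructor. */
void model_sock_rfree(struct model_skb *skb)
{
    skb->rfree_calls++;
}

/* Generic free: always goes through the function pointer (indirect call). */
void model_kfree_skb(struct model_skb *skb)
{
    if (skb->destructor) {
        skb->destructor(skb);
        skb->destructor = NULL;
    }
}

/* Fast path: compare the pointer, then call the known function directly. */
void model_eat_recv_skb(struct model_skb *skb)
{
    assert(skb != NULL);
    if (skb->destructor == model_sock_rfree) {
        model_sock_rfree(skb);      /* direct call */
        skb->destructor = NULL;     /* so the generic path skips it */
    }
    model_kfree_skb(skb);
}
```

The destructor still runs exactly once; only the way it is reached changes, which matters most when indirect branches are expensive.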
Frantisek Hrbata 27a89b8946 Merge: tcp: BIG TCP implementation
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1560

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using netperf and veth driver. Results meet the assumptions. See https://bugzilla.redhat.com/show_bug.cgi?id=2139501#c1

The series introduces support for BIG TCP.

- Patch 1-2: Preliminary dependencies
- Patch 3-14: Commits from upstream series 7fa2e481ff2f ("Merge branch 'big-tcp'", 2022-05-16)
- Patch 15-19: Follow-ups

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-15 07:30:55 -05:00
Frantisek Hrbata e265d68e77 Merge: tcp: phase-1 backports for RHEL-9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1504

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: All mainline in net-next.git.
Tested: boot-tested only
Conflicts: see individual patches

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 02:40:21 -05:00
Davide Caratti 421bf30029 tcp: export tcp_sendmsg_fastopen
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 3242abeb8da7

commit 3242abeb8da7071fa40d32346ed36343bee33b80
Author: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Date:   Mon Sep 26 16:27:37 2022 -0700

    tcp: export tcp_sendmsg_fastopen

    It will be used to support TCP FastOpen with MPTCP in the following
    commit.

    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Co-developed-by: Dmytro Shytyi <dmytro@shytyi.net>
    Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
    Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
    Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:59 +01:00
Davide Caratti 7f603092eb net: Fix data-races around sysctl_max_skb_frags.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 657b991afb89

commit 657b991afb89d25fe6c4783b1b75a8ad4563670d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:54 2022 -0700

    net: Fix data-races around sysctl_max_skb_frags.

    While reading sysctl_max_skb_frags, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.
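
As a rough user-space illustration of the pattern, the sketch below takes a single snapshot of a tunable before using it. `DEMO_READ_ONCE`, `demo_max_skb_frags` and `demo_can_add_frag` are hypothetical names; the macro is only an approximation of the kernel's READ_ONCE() and relies on the GCC/Clang `__typeof__` extension.

```c
/* User-space approximation of READ_ONCE(); the kernel macro is more elaborate. */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical tunable standing in for sysctl_max_skb_frags. */
int demo_max_skb_frags = 17;

/* Read the limit exactly once, then decide using that snapshot. */
int demo_can_add_frag(int nr_frags)
{
	int limit = DEMO_READ_ONCE(demo_max_skb_frags);

	return nr_frags < limit;
}
```

The volatile access keeps the compiler from re-reading or splitting the load, so a concurrent writer cannot make a single decision observe two different limits.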

    Fixes: 5f74f82ea3 ("net:Add sysctl_max_skb_frags")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:58 +01:00
Davide Caratti bfb8959ee3 net: keep sk->sk_forward_alloc as small as possible
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 4890b686f408
Conflicts:
 - net/ipv4/tcp.c: context mismatch because we don't have upstream
   commit 8a794df69300 ("tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skb")
   and commit c4322884ed21 ("tcp: remove unneeded code from tcp_stream_alloc_skb()")

commit 4890b686f4088c90432149bd6de567e621266fa2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:11 2022 -0700

    net: keep sk->sk_forward_alloc as small as possible

    Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast.

    Each TCP socket can forward allocate up to 2 MB of memory, even after
    flow became less active.

    10,000 sockets can have reserved 20 GB of memory,
    and we have no shrinker in place to reclaim that.

    Instead of trying to reclaim the extra allocations in some places,
    just keep sk->sk_forward_alloc values as small as possible.

    This should not impact performance too much now we have per-cpu
    reserves: Changes to tcp_memory_allocated should not be too frequent.

    For sockets not using SO_RESERVE_MEM:
     - idle sockets (no packets in tx/rx queues) have zero forward alloc.
     - non idle sockets have a forward alloc smaller than one page.
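
A minimal sketch of the "smaller than one page" rule, assuming a 4096-byte page and a non-negative byte count. `demo_sock`, `DEMO_PAGE_SIZE` and `demo_mem_reclaim` are hypothetical names and this is not the kernel's reclaim code: whole pages go back to the shared pool and only the sub-page remainder stays with the socket.

```c
#define DEMO_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* Hypothetical socket with its forward allocation in bytes (assumed >= 0). */
struct demo_sock {
	int forward_alloc;
};

/* Give back every whole page; return how many pages were released. */
int demo_mem_reclaim(struct demo_sock *sk)
{
	int pages = sk->forward_alloc / DEMO_PAGE_SIZE;

	sk->forward_alloc -= pages * DEMO_PAGE_SIZE;
	return pages;
}
```

After the call, `forward_alloc` is always below `DEMO_PAGE_SIZE`, and an idle socket with nothing queued ends up at zero.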

    Note:

     - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
       is left to MPTCP maintainers as a follow up.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti 9aac6c4346 net: add per_cpu_fw_alloc field to struct proto
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 0defbb0af775
Conflicts:
 - net/core/sock.c: context mismatch because of missing backport of
   upstream commit f20cfd662a62 ("net: add sanity check in proto_register()")

commit 0defbb0af775ef037913786048d099bbe8b9a2c2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:08 2022 -0700

    net: add per_cpu_fw_alloc field to struct proto

    Each protocol having a ->memory_allocated pointer gets a corresponding
    per-cpu reserve, that following patches will use.

    Instead of having reserved bytes per socket,
    we want to have per-cpu reserves.
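
One way to picture a per-cpu reserve is sketched below: each CPU accumulates changes locally and touches the shared counter only when its reserve reaches a threshold. All `demo_*` names and the 256-page threshold are assumptions for illustration, not values taken from this patch.

```c
#define DEMO_NR_CPUS		4
#define DEMO_PCPU_RESERVE	256	/* hypothetical flush threshold, in pages */

long demo_memory_allocated;			/* shared, contended counter */
int demo_per_cpu_fw_alloc[DEMO_NR_CPUS];	/* one reserve per CPU */

/* Charge pages on one CPU; flush to the shared counter only when needed. */
void demo_memory_allocated_add(int cpu, int pages)
{
	int local = demo_per_cpu_fw_alloc[cpu] + pages;

	if (local >= DEMO_PCPU_RESERVE) {
		demo_memory_allocated += local;	/* rare shared update */
		local = 0;
	}
	demo_per_cpu_fw_alloc[cpu] = local;
}
```

Small charges stay in the per-cpu slot, which is why changes to the shared counter should be infrequent.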

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti 543f426b27 net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 100fdd1faf50

commit 100fdd1faf50557558e2911af4be32e515cb8036
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:07 2022 -0700

    net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT

    Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.

    This might change in the future, but it seems better to avoid the
    confusion.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti a3894ee946 net: inet: Retire port only listening_hash
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit cae3873c5b3a

commit cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed May 11 17:06:05 2022 -0700

    net: inet: Retire port only listening_hash

    The listen sk is currently stored in two hash tables,
    listening_hash (hashed by port) and lhash2 (hashed by port and address).

    After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
    and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
    the TCP-SYN lookup fast path does not use listening_hash.

    The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2")
    also moved the seq_file (/proc/net/tcp) iteration usage from
    listening_hash to lhash2.

    There are still a few listening_hash usages left.
    One of them is inet_reuseport_add_sock() which uses the listening_hash
    to search a listen sk during the listen() system call.  This turns
    out to be very slow on use cases that listen on many different
    VIPs at a popular port (e.g. 443).  [ On top of the slowness in
    adding to the tail in the IPv6 case ].  The latter patch has a
    selftest to demonstrate this case.
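
The slowness can be sketched with a toy model: when the bucket depends on the port alone, every listener on port 443 lands in the same chain, while mixing in the address spreads them out. The hash functions and `demo_*` names below are hypothetical and far simpler than the kernel's hashing.

```c
#include <stdint.h>

#define DEMO_BUCKETS 16

/* Port-only scheme: the bucket ignores the bound address. */
unsigned int demo_hash_port(uint16_t port)
{
	return port % DEMO_BUCKETS;
}

/* Port-and-address scheme: a toy mix, not the kernel's hash function. */
unsigned int demo_hash_portaddr(uint32_t addr, uint16_t port)
{
	return (addr ^ port) % DEMO_BUCKETS;
}

/* Longest chain when nr listeners bind consecutive addresses to one port. */
unsigned int demo_longest_chain(unsigned int nr, uint16_t port, int use_addr)
{
	unsigned int chain[DEMO_BUCKETS] = { 0 };
	unsigned int i, longest = 0;

	for (i = 0; i < nr; i++) {
		uint32_t addr = 0x0a000001u + i;	/* 10.0.0.1, 10.0.0.2, ... */
		unsigned int b = use_addr ? demo_hash_portaddr(addr, port)
					  : demo_hash_port(port);

		if (++chain[b] > longest)
			longest = chain[b];
	}
	return longest;
}
```

With 64 listeners on port 443, the port-only scheme yields one chain of 64 entries, whereas the port-and-address scheme yields chains of 4 in this toy setup.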

    This patch takes this chance to move all remaining listening_hash
    usages to lhash2 and then retire listening_hash.

    Since most changes need to be done together, it is hard to cut
    the listening_hash to lhash2 switch into small patches.  The
    changes in this patch are highlighted here for review
    purposes.

    1. Because of the listening_hash removal, lhash2 can use the
       sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
       This also keeps the sk_unhashed() check working as is
       once sk is no longer added to listening_hash.

       The union is removed from inet_listen_hashbucket because
       only nulls_head is needed.

    2. icsk->icsk_listen_portaddr_node and its helpers are removed.

    3. The current lhash2 users need to iterate with sk_nulls_node
       instead of icsk_listen_portaddr_node.

       One case is in the inet[6]_lhash2_lookup().

       Another case is the seq_file iterator in tcp_ipv4.c.
       One thing to note is sk_nulls_next() is needed
       because the old inet_lhash2_for_each_icsk_continue()
       does a "next" first before iterating.

    4. Move the remaining listening_hash usage to lhash2

       inet_reuseport_add_sock() which this series is
       trying to improve.

       inet_diag.c and mptcp_diag.c are the final two
       remaining use cases and are now moved to lhash2 as well.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:54 +01:00
Davide Caratti d2950bc221 tcp: switch orphan_count to bare per-cpu counters
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 19757cebf0c5

commit 19757cebf0c5016a1f36f7fe9810a9f0b33c0832
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Oct 14 06:41:26 2021 -0700

    tcp: switch orphan_count to bare per-cpu counters

    Use of percpu_counter structure to track count of orphaned
    sockets is causing problems on modern hosts with 256 cpus
    or more.

    Stefan Bach reported a serious spinlock contention in real workloads,
    that I was able to reproduce with a netfilter rule dropping
    incoming FIN packets.

        53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
                |
                ---queued_spin_lock_slowpath
                   |
                    --53.51%--_raw_spin_lock_irqsave
                              |
                               --53.51%--__percpu_counter_sum
                                         tcp_check_oom
                                         |
                                         |--39.03%--__tcp_close
                                         |          tcp_close
                                         |          inet_release
                                         |          inet6_release
                                         |          sock_close
                                         |          __fput
                                         |          ____fput
                                         |          task_work_run
                                         |          exit_to_usermode_loop
                                         |          do_syscall_64
                                         |          entry_SYSCALL_64_after_hwframe
                                         |          __GI___libc_close
                                         |
                                          --14.48%--tcp_out_of_resources
                                                    tcp_write_timeout
                                                    tcp_retransmit_timer
                                                    tcp_write_timer_handler
                                                    tcp_write_timer
                                                    call_timer_fn
                                                    expire_timers
                                                    __run_timers
                                                    run_timer_softirq
                                                    __softirqentry_text_start

    As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
    batch for dst entries accounting"), default batch size is too big
    for the default value of tcp_max_orphans (262144).

    But even if we reduce batch sizes, there would still be cases
    where the estimated count of orphans is beyond the limit,
    and where tcp_too_many_orphans() has to call the expensive
    percpu_counter_sum_positive().

    One solution is to use plain per-cpu counters, and have
    a timer to periodically refresh this cache.

    Updating this cache every 100ms seems about right, tcp pressure
    state is not radically changing over shorter periods.
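
A minimal sketch of plain per-cpu counters with a periodically refreshed cache follows. The `demo_*` names are hypothetical, and the refresh function is simply called by hand here where the kernel would use a timer. Hot paths only touch their own slot and read the cached total; the summation happens off the hot path.

```c
#define DEMO_NR_CPUS 4

int demo_orphan_count[DEMO_NR_CPUS];	/* each slot written by its own CPU */
int demo_orphan_cache;			/* approximate total, refreshed periodically */

/* Hot path: adjust only the local slot, no shared lock. */
void demo_orphan_add(int cpu, int delta)
{
	demo_orphan_count[cpu] += delta;
}

/* Slow path: would run from a periodic timer (every 100ms in the text). */
void demo_orphan_cache_refresh(void)
{
	int cpu, total = 0;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		total += demo_orphan_count[cpu];
	/* Individual slots may be negative; the total is clamped at zero. */
	demo_orphan_cache = total > 0 ? total : 0;
}

/* Hot path: compare against the cached estimate only. */
int demo_too_many_orphans(int max_orphans)
{
	return demo_orphan_cache > max_orphans;
}
```

The cached value can lag reality by up to one refresh period, which is the trade-off the text accepts.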

    percpu_counter was nice 15 years ago while hosts had less
    than 16 cpus, not anymore by current standards.

    v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
        reported by kernel test robot <lkp@intel.com>
        Remove unused socket argument from tcp_too_many_orphans()

    Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Stefan Bach <sfb@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:54 +01:00
Davide Caratti 7faf5c7e58 tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit f4ce91ce12a7

commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Sep 28 16:03:31 2022 -0400

    tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited

    This commit fixes a bug in the tracking of max_packets_out and
    is_cwnd_limited. This bug can cause the connection to fail to remember
    that is_cwnd_limited is true, causing the connection to fail to grow
    cwnd when it should, causing throughput to be lower than it should be.

    The following event sequence is an example that triggers the bug:

     (a) The connection is cwnd_limited, but packets_out is not at its
         peak due to TSO deferral deciding not to send another skb yet.
         In such cases the connection can advance max_packets_seq and set
         tp->is_cwnd_limited to true and max_packets_out to a small
         number.

     (b) Then later in the round trip the connection is pacing-limited (not
         cwnd-limited), and packets_out is larger. In such cases the
         connection would raise max_packets_out to a bigger number but
         (unexpectedly) flip tp->is_cwnd_limited from true to false.

    This commit fixes that bug.

    One straightforward fix would be to separately track (a) the next
    window after max_packets_out reaches a maximum, and (b) the next
    window after tp->is_cwnd_limited is set to true. But this would
    require consuming an extra u32 sequence number.

    Instead, to save space we track only the most important
    information. Specifically, we track the strongest available signal of
    the degree to which the cwnd is fully utilized:

    (1) If the connection is cwnd-limited then we remember that fact for
    the current window.

    (2) If the connection is not cwnd-limited then we track the maximum
    number of outstanding packets in the current window.

    In particular, note that the new logic cannot trigger the buggy
    (a)/(b) sequence above because with the new logic a condition where
    tp->packets_out > tp->max_packets_out can only trigger an update of
    tp->is_cwnd_limited if tp->is_cwnd_limited is false.
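
The rule described above can be sketched in user space as follows. `struct demo_tp`, `window_end_seq` and the function names are hypothetical, and the condition is reconstructed from the description in this message rather than copied from the patch, so treat it as an approximation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-connection state discussed above. */
struct demo_tp {
	uint32_t snd_una;		/* oldest unacknowledged sequence */
	uint32_t snd_nxt;		/* next sequence to send */
	uint32_t window_end_seq;	/* assumed name: end of the tracked window */
	uint32_t packets_out;
	uint32_t max_packets_out;
	bool is_cwnd_limited;
};

/* Wrap-safe "a is before b" comparison on 32-bit sequence numbers. */
static bool demo_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

void demo_cwnd_validate(struct demo_tp *tp, bool is_cwnd_limited)
{
	/*
	 * Update when a new window starts, when the connection is
	 * cwnd-limited, or when it was not cwnd-limited and packets_out
	 * reached a new maximum. A larger packets_out alone can never
	 * clear an already-set is_cwnd_limited within the same window.
	 */
	if (!demo_before(tp->snd_una, tp->window_end_seq) ||
	    is_cwnd_limited ||
	    (!tp->is_cwnd_limited &&
	     tp->packets_out > tp->max_packets_out)) {
		tp->is_cwnd_limited = is_cwnd_limited;
		tp->max_packets_out = tp->packets_out;
		tp->window_end_seq = tp->snd_nxt;
	}
}
```

Replaying the (a)/(b) sequence against this sketch leaves `is_cwnd_limited` set to true until the next window begins.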

    This first showed up in a testing of a BBRv2 dev branch, but this
    buggy behavior highlighted a general issue with the
    tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
    the proper rate for any TCP congestion control, including Reno or
    CUBIC.

    Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:19:56 +01:00
Davide Caratti 7c7c9c8466 tcp: TX zerocopy should not sense pfmemalloc status
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit 326140063946
Omitted-fix: ee15e1f38dc2 ("kcm: do not sense pfmemalloc status in kcm_sendpage()")
 - not backported as we don't have kcm in rhel-9

commit 3261400639463a853ba2b3be8bd009c2a8089775
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Aug 31 23:38:09 2022 +0000

    tcp: TX zerocopy should not sense pfmemalloc status

    We got a recent syzbot report [1] showing a possible misuse
    of pfmemalloc page status in TCP zerocopy paths.

    Indeed, for pages coming from user space or other layers,
    using page_is_pfmemalloc() is moot, and possibly could give
    false positives.

    There have been attempts to make page_is_pfmemalloc() more robust,
    but not using it in the first place in this context is probably better,
    removing cpu cycles.

    Note to stable teams :

    You need to backport 84ce071e38a6 ("net: introduce
    __skb_fill_page_desc_noacc") as a prereq.

    Race is more probable after commit c07aea3ef4
    ("mm: add a signature in struct page") because page_is_pfmemalloc()
    is now using low order bit from page->lru.next, which can change
    more often than page->index.

    Low order bit should never be set for lru.next (when used as an anchor
    in LRU list), so KCSAN report is mostly a false positive.

    Backporting to older kernel versions seems not necessary.

    [1]
    BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag

    write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0:
    __list_add include/linux/list.h:73 [inline]
    list_add include/linux/list.h:88 [inline]
    lruvec_add_folio include/linux/mm_inline.h:105 [inline]
    lru_add_fn+0x440/0x520 mm/swap.c:228
    folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
    folio_batch_add_and_move mm/swap.c:263 [inline]
    folio_add_lru+0xf1/0x140 mm/swap.c:490
    filemap_add_folio+0xf8/0x150 mm/filemap.c:948
    __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981
    pagecache_get_page+0x26/0x190 mm/folio-compat.c:104
    grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116
    ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988
    generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738
    ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270
    ext4_file_write_iter+0x2e3/0x1210
    call_write_iter include/linux/fs.h:2187 [inline]
    new_sync_write fs/read_write.c:491 [inline]
    vfs_write+0x468/0x760 fs/read_write.c:578
    ksys_write+0xe8/0x1a0 fs/read_write.c:631
    __do_sys_write fs/read_write.c:643 [inline]
    __se_sys_write fs/read_write.c:640 [inline]
    __x64_sys_write+0x3e/0x50 fs/read_write.c:640
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1:
    page_is_pfmemalloc include/linux/mm.h:1740 [inline]
    __skb_fill_page_desc include/linux/skbuff.h:2422 [inline]
    skb_fill_page_desc include/linux/skbuff.h:2443 [inline]
    tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018
    do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075
    tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline]
    tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150
    inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
    kernel_sendpage+0x184/0x300 net/socket.c:3561
    sock_sendpage+0x5a/0x70 net/socket.c:1054
    pipe_to_sendpage+0x128/0x160 fs/splice.c:361
    splice_from_pipe_feed fs/splice.c:415 [inline]
    __splice_from_pipe+0x222/0x4d0 fs/splice.c:559
    splice_from_pipe fs/splice.c:594 [inline]
    generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
    do_splice_from fs/splice.c:764 [inline]
    direct_splice_actor+0x80/0xa0 fs/splice.c:931
    splice_direct_to_actor+0x305/0x620 fs/splice.c:886
    do_splice_direct+0xfb/0x180 fs/splice.c:974
    do_sendfile+0x3bf/0x910 fs/read_write.c:1249
    __do_sys_sendfile64 fs/read_write.c:1317 [inline]
    __se_sys_sendfile64 fs/read_write.c:1303 [inline]
    __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    value changed: 0x0000000000000000 -> 0xffffea0004a1d288

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b5d05-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022

    Fixes: c07aea3ef4 ("mm: add a signature in struct page")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:17:43 +01:00
Ivan Vecera 9a3d61a7ce net: Adjust sk_gso_max_size once when set
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit ab14f1802cfb2d7ca120bbf48e3ba6712314ffc3
Author: David Ahern <dsahern@kernel.org>
Date:   Mon Jan 24 19:45:11 2022 -0700

    net: Adjust sk_gso_max_size once when set

    sk_gso_max_size is set based on the dst dev. Both users of it
    adjust the value by the same offset - (MAX_TCP_HEADER + 1). Rather
    than compute the same adjusted value on each call do the adjustment
    once when set.
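
In sketch form, the change moves a subtraction from the readers to the single place where the value is stored. `DEMO_MAX_TCP_HEADER` is a hypothetical constant (the real MAX_TCP_HEADER depends on kernel configuration), and the struct and function names are illustrative only.

```c
#define DEMO_MAX_TCP_HEADER 320	/* hypothetical; real value is config-dependent */

/* Hypothetical socket holding the already-adjusted limit. */
struct demo_sock {
	unsigned int gso_max_size;
};

/* Setter: apply the (MAX_TCP_HEADER + 1) offset once, when the value is set. */
void demo_setup_caps(struct demo_sock *sk, unsigned int dev_gso_max_size)
{
	sk->gso_max_size = dev_gso_max_size - (DEMO_MAX_TCP_HEADER + 1);
}

/* Readers use the stored value as is, with no per-call arithmetic. */
unsigned int demo_gso_size_limit(const struct demo_sock *sk)
{
	return sk->gso_max_size;
}
```

The sketch assumes the device limit is larger than the offset; it does not guard against underflow.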

    Signed-off-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220125024511.27480-1-dsahern@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:50 +01:00
Frantisek Hrbata 0c3a22328a Merge: IPv6: 9.2 P1 backport from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1488

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-27 08:26:02 -04:00
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing a bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains a bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE

RHEL commit b23c298982 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA")
is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86, which was not in RHEL until this series, so 6692c98c7df5 is
re-added after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Hangbin Liu d109429414 tcp: Fix data races around icsk->icsk_af_ops.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319
Upstream Status: net.git commit f49cd2f4d617

Conflicts: context conflicts due to missing upstream commit
34704ef024ae ("bpf: net: Change do_tcp_getsockopt() to take the sockptr_t
argument").

commit f49cd2f4d6170d27a2c61f1fecb03d8a70c91f57
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Thu Oct 6 11:53:49 2022 -0700

    tcp: Fix data races around icsk->icsk_af_ops.

    setsockopt(IPV6_ADDRFORM) and tcp_v6_connect() change icsk->icsk_af_ops
    under lock_sock(), but tcp_(get|set)sockopt() read it locklessly.  To
    avoid load/store tearing, we need to add READ_ONCE() and WRITE_ONCE()
    for the reads and writes.
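
As a rough userspace illustration of the lockless read described above: READ_ONCE() and WRITE_ONCE() are kernel macros, so this sketch substitutes relaxed C11 atomics, which likewise rule out load/store tearing on the pointer. The demo_* names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-family operations table. */
struct demo_af_ops {
	int family;
};

/* Hypothetical stand-in for the socket field holding the table pointer. */
struct demo_icsk {
	_Atomic(const struct demo_af_ops *) af_ops;
};

/* Writer side: plays the role of WRITE_ONCE(icsk->icsk_af_ops, ops). */
static void demo_set_af_ops(struct demo_icsk *icsk,
			    const struct demo_af_ops *ops)
{
	atomic_store_explicit(&icsk->af_ops, ops, memory_order_relaxed);
}

/* Lockless reader: plays the role of READ_ONCE(icsk->icsk_af_ops). */
static const struct demo_af_ops *demo_get_af_ops(struct demo_icsk *icsk)
{
	return atomic_load_explicit(&icsk->af_ops, memory_order_relaxed);
}
```

The reader always sees either the old or the new pointer in full, never a mix of the two.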

    Thanks to Eric Dumazet for providing the syzbot report:

    BUG: KCSAN: data-race in tcp_setsockopt / tcp_v6_connect

    write to 0xffff88813c624518 of 8 bytes by task 23936 on cpu 0:
    tcp_v6_connect+0x5b3/0xce0 net/ipv6/tcp_ipv6.c:240
    __inet_stream_connect+0x159/0x6d0 net/ipv4/af_inet.c:660
    inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:724
    __sys_connect_file net/socket.c:1976 [inline]
    __sys_connect+0x197/0x1b0 net/socket.c:1993
    __do_sys_connect net/socket.c:2003 [inline]
    __se_sys_connect net/socket.c:2000 [inline]
    __x64_sys_connect+0x3d/0x50 net/socket.c:2000
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    read to 0xffff88813c624518 of 8 bytes by task 23937 on cpu 1:
    tcp_setsockopt+0x147/0x1c80 net/ipv4/tcp.c:3789
    sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3585
    __sys_setsockopt+0x212/0x2b0 net/socket.c:2252
    __do_sys_setsockopt net/socket.c:2263 [inline]
    __se_sys_setsockopt net/socket.c:2260 [inline]
    __x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    value changed: 0xffffffff8539af68 -> 0xffffffff8539aff8

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 23937 Comm: syz-executor.5 Not tainted
    6.0.0-rc4-syzkaller-00331-g4ed9c1e971b1-dirty #0

    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 08/26/2022

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Reported-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-10-18 11:41:13 +08:00
Antoine Tenart 4e30665664 tcp: md5: fix IPv4-mapped support
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit e62d2e110356093c034998e093675df83057e511
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jul 26 11:57:43 2022 +0000

    tcp: md5: fix IPv4-mapped support

    After the blamed commit, IPv4 SYN packets handled
    by a dual stack IPv6 socket are dropped, even if
    perfectly valid.

    $ nstat | grep MD5
    TcpExtTCPMD5Failure             5                  0.0

    For a dual stack listener, an incoming IPv4 SYN packet
    would call tcp_inbound_md5_hash() with @family == AF_INET,
    while tp->af_specific is pointing to tcp_sock_ipv6_specific.

    Only later when an IPv4-mapped child is created, tp->af_specific
    is changed to tcp_sock_ipv6_mapped_specific.

    Fixes: 7bbb765b7349 ("net/tcp: Merge TCP-MD5 inbound callbacks")
    Reported-by: Brian Vazquez <brianvv@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Dmitry Safonov <dima@arista.com>
    Tested-by: Leonard Crestez <cdleonard@gmail.com>
    Link: https://lore.kernel.org/r/20220726115743.2759832-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart 04f4917aca skb: make drop reason booleanable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 1330b6ef3313fcec577d2b020c290dc8b9f11f1a
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Mar 7 16:44:21 2022 -0800

    skb: make drop reason booleanable

    We have a number of cases where a function returns its drop/no drop
    decision as a boolean. Now that we want to report the reason
    code as well, we have to pass extra output arguments.

    We can make the reason code evaluate correctly as bool.

    I believe we're good to reorder the reasons as they are
    reported to user space as strings.
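
A minimal userspace sketch of the idea, assuming the "not dropped" value is placed first in the enum so that it equals zero. The enum and function names are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reason codes; zero is reserved for "not dropped". */
enum demo_drop_reason {
	DEMO_NOT_DROPPED = 0,
	DEMO_DROP_NO_SOCKET,
	DEMO_DROP_BAD_CSUM,
};

/* The whole scheme relies on the "keep" value being zero. */
static_assert(DEMO_NOT_DROPPED == 0, "zero must mean not dropped");

/*
 * Returns a reason code that also reads correctly as a drop/no-drop
 * boolean, so no extra output argument is needed.
 */
static enum demo_drop_reason demo_check_packet(bool has_socket, bool csum_ok)
{
	if (!has_socket)
		return DEMO_DROP_NO_SOCKET;
	if (!csum_ok)
		return DEMO_DROP_BAD_CSUM;
	return DEMO_NOT_DROPPED;
}
```

A caller can write "if (demo_check_packet(...))" to test for a drop and still keep the returned value for reporting.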

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 997d93a49f net/tcp: Merge TCP-MD5 inbound callbacks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7bbb765b73496699a165d505ecdce962f903b422
Author: Dmitry Safonov <0x7f454c46@gmail.com>
Date:   Wed Feb 23 17:57:40 2022 +0000

    net/tcp: Merge TCP-MD5 inbound callbacks

    The functions do essentially the same work to verify the TCP-MD5
    signature. Code can be merged into one family-independent function in
    order to reduce copy'n'paste and generated code.
    Later with TCP-AO option added, this will allow creating one function
    that's responsible for segment verification, that will have all the
    different checks for MD5/AO/non-signed packets, which in turn will help
    to see checks for all corner-cases in one function, rather than spread
    around different families and functions.

    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
    Signed-off-by: Dmitry Safonov <dima@arista.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Paolo Abeni 519b3282c5 net: Fix data-races around sysctl_[rw]mem(_offset)?.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: different context in __tcp_grow_window() as rhel-9 \
 lacks upstream commit 240bfd134c592 ("tcp: tweak len/truesize \
 ratio for coalesce candidates")

Upstream commit:
commit 02739545951ad4c1215160db7fbf9b7a918d3c0b
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:00 2022 -0700

    net: Fix data-races around sysctl_[rw]mem(_offset)?.

    While reading these sysctl variables, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - .sysctl_rmem
      - .sysctl_wmem
      - .sysctl_rmem_offset
      - .sysctl_wmem_offset
      - sysctl_tcp_rmem[1, 2]
      - sysctl_tcp_wmem[1, 2]
      - sysctl_decnet_rmem[1]
      - sysctl_decnet_wmem[1]
      - sysctl_tipc_rmem[1]

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 12:59:36 +02:00
Chris von Recklinghausen 59baaa91f0 include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h
Bugzilla: https://bugzilla.redhat.com/2120352

commit a1554c002699cbc9ced2e9f44f9c1357181bead3
Author: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Date:   Fri Nov 5 13:45:21 2021 -0700

    include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h

    nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
    The advantage of this change is that it can reduce the obsolete
    includes.  For example, net/ipv4/tcp.c wouldn't need swap.h any more
    since it has already included mm.h.  Similarly, after checking all the
    other files, it turns out that tcp.c, udp.c, meter.c, ... follow the same
    rule, so these files can have swap.h removed too.

    Moreover, after preprocessing all the files that use
    nr_free_buffer_pages, it turns out that those files have already
    included mm.h. Thus, we can move nr_free_buffer_pages from swap.h to mm.h
    safely.  This change will not affect the compilation of other files.

    Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
    Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
    Cc: Jakub Kicinski <kuba@kernel.org>
    CC: Ulf Hansson <ulf.hansson@linaro.org>
    Cc: "David S . Miller" <davem@davemloft.net>
    Cc: Simon Horman <horms@verge.net.au>
    Cc: Pravin B Shelar <pshelar@ovn.org>
    Cc: Vlad Yasevich <vyasevich@gmail.com>
    Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:30 -04:00
Paolo Abeni 036c0e121e tcp: add accessors to read/set tp->snd_cwnd
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465
Tested: LNST, Tier1

Upstream commit:
commit 40570375356c874b1578e05c1dcc3ff7c1322dbe
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 5 16:35:38 2022 -0700

    tcp: add accessors to read/set tp->snd_cwnd

    We had various bugs over the years with code
    breaking the assumption that tp->snd_cwnd is greater
    than zero.

    Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
    in commit 8b8a321ff7 ("tcp: fix zero cwnd in tcp_cwnd_reduction")
    can trigger, and without a repro we would have to spend
    considerable time finding the bug.

    Instead of complaining too late, we want to catch where
    and when tp->snd_cwnd is set to an illegal value.
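
A minimal userspace sketch of such accessors, assuming every write is routed through a single setter that flags non-positive values. The demo_* names and the warning counter are hypothetical; the kernel would use its own warning facility rather than stderr.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the one field of interest in struct tcp_sock. */
struct demo_tcp_sock {
	uint32_t snd_cwnd;
};

/* Counts how many illegal assignments were observed. */
static unsigned int demo_cwnd_warnings;

/* Read accessor: the only place that reads the field directly. */
static inline uint32_t demo_snd_cwnd(const struct demo_tcp_sock *tp)
{
	return tp->snd_cwnd;
}

/*
 * Write accessor: a zero (or, viewed as signed, negative) value is
 * reported at the point of assignment, then stored unchanged.
 */
static inline void demo_snd_cwnd_set(struct demo_tcp_sock *tp, uint32_t val)
{
	if ((int32_t)val <= 0) {
		demo_cwnd_warnings++;
		fprintf(stderr, "illegal snd_cwnd value %d\n",
			(int)(int32_t)val);
	}
	tp->snd_cwnd = val;
}
```

Because the check sits in the setter, the report points at the code that stored the bad value rather than at a later consumer.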

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Suggested-by: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-27 16:43:55 +02:00
Patrick Talbert 8c5b3f7fd9 Merge: XDP and networking eBPF rebase to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Depends: !572

Tested: Using bpf selftests, everything passes.

This rebases XDP and networking eBPF to upstream kernel version 5.15.

Signed-off-by: Jiri Benc <jbenc@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-03 09:26:25 +02:00
Jiri Benc df10d51307 net: Rename ->stream_memory_read to ->sock_is_readable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Conflicts:
- [minor] Context difference in struct proto due to missing 6c302e799a0d
  "net: forward_alloc_get depends on CONFIG_MPTCP".
- [minor] Context difference in sock.h due to out of order backport of
  4c1e34c0dbff "vsock: Enable y2038 safe timeval for timeout".

commit 7b50ecfcc6cdfe87488576bc3ed443dc8d083b90
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Fri Oct 8 13:33:03 2021 -0700

    net: Rename ->stream_memory_read to ->sock_is_readable

    The proto ops ->stream_memory_read() is currently only used
    by TCP to check whether psock queue is empty or not. We need
    to rename it before reusing it for non-TCP protocols, and
    adjust the existing users accordingly.

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211008203306.37525-2-xiyou.wangcong@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
Paolo Abeni bae902a610 inet: fully convert sk->sk_rx_dst to RCU rules
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2079411
Tested: LNST, Tier1
Conflicts: \
  - sk_rx_dst location inside struct sock is slightly different
  from upstream as rhel-9 already has commit 43f51df41729 ("net:
   move early demux fields close to sk_refcnt")

Upstream commit:
commit 8f905c0e7354ef261360fb7535ea079b1082c105
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Dec 20 06:33:30 2021 -0800

    inet: fully convert sk->sk_rx_dst to RCU rules

    syzbot reported various issues around early demux,
    one being included in this changelog [1]

    sk->sk_rx_dst is using RCU protection without clearly
    documenting it.

    And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
    are not following standard RCU rules.

    [a]    dst_release(dst);
    [b]    sk->sk_rx_dst = NULL;

    They look wrong because a delete operation of RCU protected
    pointer is supposed to clear the pointer before
    the call_rcu()/synchronize_rcu() guarding actual memory freeing.

    In some cases indeed, dst could be freed before [b] is done.

    We could cheat by clearing sk_rx_dst before calling
    dst_release(), but this seems the right time to stick
    to standard RCU annotations and debugging facilities.
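
The ordering argument above (unpublish the pointer first, release the object second) can be sketched in userspace as follows. This models only the ordering, using a C11 atomic exchange; it does not model RCU grace periods or the kernel's RCU annotations, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted dst entry. */
struct demo_dst {
	int refcnt;
};

/* Hypothetical stand-in for the socket's cached rx dst pointer. */
struct demo_sock {
	_Atomic(struct demo_dst *) rx_dst;
};

/* Drop one reference; a real implementation would free at zero. */
static void demo_dst_release(struct demo_dst *dst)
{
	if (dst)
		dst->refcnt--;
}

/*
 * Step [b] before step [a]: clear the published pointer, and only then
 * release the object that is no longer reachable through the socket.
 */
static struct demo_dst *demo_clear_rx_dst(struct demo_sock *sk)
{
	struct demo_dst *old;

	old = atomic_exchange(&sk->rx_dst, (struct demo_dst *)NULL);
	demo_dst_release(old);
	return old;
}
```

With this order, a reader that loads the pointer after the exchange sees NULL instead of an object that may already have been released.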

    [1]
    BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
    BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
    Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204

    CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
     print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
     __kasan_report mm/kasan/report.c:433 [inline]
     kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
     dst_check include/net/dst.h:470 [inline]
     tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
     ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
     ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
     ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
     ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
     __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
     __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
     __netif_receive_skb_list net/core/dev.c:5608 [inline]
     netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
     gro_normal_list net/core/dev.c:5853 [inline]
     gro_normal_list net/core/dev.c:5849 [inline]
     napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
     virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
     virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
     __napi_poll+0xaf/0x440 net/core/dev.c:7023
     napi_poll net/core/dev.c:7090 [inline]
     net_rx_action+0x801/0xb40 net/core/dev.c:7177
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
     invoke_softirq kernel/softirq.c:432 [inline]
     __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
     irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
     common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
     asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
    RIP: 0033:0x7f5e972bfd57
    Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
    RSP: 002b:00007fff8a413210 EFLAGS: 00000283
    RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
    RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
    RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
    R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
    R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
     </TASK>

    Allocated by task 13:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:46 [inline]
     set_alloc_info mm/kasan/common.c:434 [inline]
     __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
     kasan_slab_alloc include/linux/kasan.h:259 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3234 [inline]
     slab_alloc mm/slub.c:3242 [inline]
     kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
     dst_alloc+0x146/0x1f0 net/core/dst.c:92
     rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
     ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
     ip_route_input_rcu net/ipv4/route.c:2470 [inline]
     ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
     ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
     ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
     ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
     ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
     __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
     __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
     __netif_receive_skb_list net/core/dev.c:5608 [inline]
     netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
     gro_normal_list net/core/dev.c:5853 [inline]
     gro_normal_list net/core/dev.c:5849 [inline]
     napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
     virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
     virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
     __napi_poll+0xaf/0x440 net/core/dev.c:7023
     napi_poll net/core/dev.c:7090 [inline]
     net_rx_action+0x801/0xb40 net/core/dev.c:7177
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Freed by task 13:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     kasan_set_track+0x21/0x30 mm/kasan/common.c:46
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free mm/kasan/common.c:328 [inline]
     __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
     kasan_slab_free include/linux/kasan.h:235 [inline]
     slab_free_hook mm/slub.c:1723 [inline]
     slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
     slab_free mm/slub.c:3513 [inline]
     kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
     dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
     rcu_do_batch kernel/rcu/tree.c:2506 [inline]
     rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Last potentially related work creation:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
     __call_rcu kernel/rcu/tree.c:2985 [inline]
     call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
     dst_release net/core/dst.c:177 [inline]
     dst_release+0x79/0xe0 net/core/dst.c:167
     tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
     sk_backlog_rcv include/net/sock.h:1030 [inline]
     __release_sock+0x134/0x3b0 net/core/sock.c:2768
     release_sock+0x54/0x1b0 net/core/sock.c:3300
     tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
     inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:704 [inline]
     sock_sendmsg+0xcf/0x120 net/socket.c:724
     sock_write_iter+0x289/0x3c0 net/socket.c:1057
     call_write_iter include/linux/fs.h:2162 [inline]
     new_sync_write+0x429/0x660 fs/read_write.c:503
     vfs_write+0x7cd/0xae0 fs/read_write.c:590
     ksys_write+0x1ee/0x250 fs/read_write.c:643
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff88807f1cb700
     which belongs to the cache ip_dst_cache of size 176
    The buggy address is located 58 bytes inside of
     176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
    The buggy address belongs to the page:
    page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
    flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
    raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
    raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
     prep_new_page mm/page_alloc.c:2418 [inline]
     get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
     __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
     alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
     alloc_slab_page mm/slub.c:1793 [inline]
     allocate_slab mm/slub.c:1930 [inline]
     new_slab+0x32d/0x4a0 mm/slub.c:1993
     ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
     __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
     slab_alloc_node mm/slub.c:3200 [inline]
     slab_alloc mm/slub.c:3242 [inline]
     kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
     dst_alloc+0x146/0x1f0 net/core/dst.c:92
     rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
     __mkroute_output net/ipv4/route.c:2564 [inline]
     ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
     ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
     __ip_route_output_key include/net/route.h:126 [inline]
     ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
     ip_route_output_key include/net/route.h:142 [inline]
     geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
     geneve_xmit_skb drivers/net/geneve.c:899 [inline]
     geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
     __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
     netdev_start_xmit include/linux/netdevice.h:5008 [inline]
     xmit_one net/core/dev.c:3590 [inline]
     dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
     __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
    page last free stack trace:
     reset_page_owner include/linux/page_owner.h:24 [inline]
     free_pages_prepare mm/page_alloc.c:1338 [inline]
     free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
     free_unref_page_prepare mm/page_alloc.c:3309 [inline]
     free_unref_page+0x19/0x690 mm/page_alloc.c:3388
     qlink_free mm/kasan/quarantine.c:146 [inline]
     qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
     kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
     __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
     kasan_slab_alloc include/linux/kasan.h:259 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3234 [inline]
     kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
     __alloc_skb+0x215/0x340 net/core/skbuff.c:414
     alloc_skb include/linux/skbuff.h:1126 [inline]
     alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
     sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
     mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
     add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
     add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
     mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
     mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
     mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
     process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
     worker_thread+0x658/0x11f0 kernel/workqueue.c:2445

    Memory state around the buggy address:
     ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
    >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                            ^
     ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
     ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-12 16:55:33 +02:00
Antoine Tenart e19e48206f net: remove sk_route_forced_caps
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382
Upstream Status: linux.git
Tested: ENRT

commit d0d598ca86bd9e595f16a39097707c90841afe80
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:34 2021 -0800

    net: remove sk_route_forced_caps

    We were only using one bit, and we can replace it by sk_is_tcp()

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 16:26:18 +01:00
Paolo Abeni 8262fea6a4 tcp: expose __tcp_sock_set_cork and __tcp_sock_set_nodelay
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028420
Tested: LNST, Tier1

Upstream commit:
commit 6fadaa565882cd7afc501de5921db6f5e45c784b
Author: Maxim Galaganov <max@internet.ru>
Date:   Fri Dec 3 14:35:39 2021 -0800

    tcp: expose __tcp_sock_set_cork and __tcp_sock_set_nodelay

    Expose __tcp_sock_set_cork() and __tcp_sock_set_nodelay() for use in
    MPTCP setsockopt code -- namely for syncing MPTCP socket options with
    subflows inside sync_socket_options() while already holding the subflow
    socket lock.

    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Signed-off-by: Maxim Galaganov <max@internet.ru>
    Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-01-12 10:49:56 +01:00
Paolo Abeni 61472a2f82 tcp: remove sk_{tr}x_skb_cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028420
Tested: LNST, Tier1

Upstream commit:
commit d8b81175e412c7abebdb5b37d8a84d5fd19b1aad
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 22 19:26:43 2021 +0200

    tcp: remove sk_{tr}x_skb_cache

    This reverts the following patches :

    - commit 2e05fcae83 ("tcp: fix compile error if !CONFIG_SYSCTL")
    - commit 4f661542a4 ("tcp: fix zerocopy and notsent_lowat issues")
    - commit 472c2e07ee ("tcp: add one skb cache for tx")
    - commit 8b27dae5a2 ("tcp: add one skb cache for rx")

    Having a cache of one skb (in each direction) per TCP socket is fragile,
    since it can cause a significant increase of memory needs,
    and not good enough for high speed flows anyway where more than one skb
    is needed.

    We want instead to add a generic infrastructure, with more flexible
    per-cpu caches, for alien NUMA nodes.

    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-01-12 10:49:53 +01:00
Paolo Abeni 44f22770d6 tcp: make tcp_build_frag() static
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028420
Tested: LNST, Tier1

Upstream commit:
commit ff6fb083a07f1b71fc6a9438f27113d73cf23381
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 22 19:26:42 2021 +0200

    tcp: make tcp_build_frag() static

    After the previous patch the mentioned helper is
    used only inside its compilation unit: let's make
    it static.

    RFC -> v1:
     - preserve the tcp_build_frag() helper (Eric)

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-01-12 10:49:53 +01:00
Paolo Abeni 0208c78680 tcp: expose the tcp_mark_push() and tcp_skb_entail() helpers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028420
Tested: LNST, Tier1

Upstream commit:
commit 04d8825c30b718781197c8f07b1915a11bfb8685
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 22 19:26:40 2021 +0200

    tcp: expose the tcp_mark_push() and tcp_skb_entail() helpers

    The tcp_skb_entail() helper is actually skb_entail(), renamed
    to provide proper scope.

    The two helpers will be used by the next patch.

    RFC -> v1:
     - rename skb_entail to tcp_skb_entail (Eric)

    Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-01-12 10:49:53 +01:00
Paolo Abeni 0b0733b86d tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028279
Tested: LNST, Tier1

Upstream commit:
commit cf12e6f9124629b18a6182deefc0315f0a73a199
Author: Jon Maxwell <jmaxwell37@gmail.com>
Date:   Mon Oct 25 10:59:03 2021 +1100

    tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()

    v1: Implement a more general statement as recommended by Eric Dumazet. The
    sequence number will be advanced, so this check will fix the FIN case and
    other cases.

    A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
    the write_queue was not empty as determined by tcp_write_queue_empty() but the
    sk_buff containing the FIN flag had been freed and the socket was zombied in
    that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.

    Some instrumentation was added to the kernel and it was found that there is a
    timing window where tcp_sendmsg() can run after tcp_send_fin().

    tcp_sendmsg() will hit an error, for example:

            if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
                    goto do_error;

    tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
    TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.

    If the other side sends a FIN packet the socket will transition to CLOSING and
    remain that way until the system is rebooted.

    Fix this by checking for the FIN flag in the sk_buff and don't free it if that
    is the case. Testing confirmed that fixed the issue.
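
A minimal sketch of the more general check mentioned in the v1 note, assuming the test compares sequence numbers instead of looking at the payload length: a FIN consumes one sequence number, so an skb carrying only a FIN has end_seq != seq even though len is 0. The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of interest in an skb. */
struct demo_skb {
	uint32_t seq;		/* first sequence number covered */
	uint32_t end_seq;	/* one past the last sequence number covered */
	uint32_t len;		/* payload bytes */
};

/*
 * An skb may be dropped from the write queue only if it covers no
 * sequence space at all. Testing len == 0 alone would also match a
 * FIN-only skb, which still has to be sent.
 */
static bool demo_skb_is_removable(const struct demo_skb *skb)
{
	return skb->seq == skb->end_seq;
}
```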

    Fixes: fdfc5c8594 ("tcp: remove empty skb from write queue in error cases")
    Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
    Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz>
    Reported-by: Simon Stier <simon.stier@mail.schwarz>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-02 12:04:01 +01:00
Paolo Abeni 7b94dec6c5 tcp: Fix uninitialized access in skb frags array for Rx 0cp.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028279
Tested: LNST, Tier1

Upstream commit:
commit 70701b83e208767f2720d8cd3e6a62cddafb3a30
Author: Arjun Roy <arjunroy@google.com>
Date:   Thu Nov 11 15:52:15 2021 -0800

    tcp: Fix uninitialized access in skb frags array for Rx 0cp.

    TCP Receive zerocopy iterates through the SKB queue via
    tcp_recv_skb(), acquiring a pointer to an SKB and an offset within
    that SKB to read from. From there, it iterates the SKB frags array to
    determine which offset to start remapping pages from.

    However, this is built on the assumption that the offset read so far
    within the SKB is smaller than the SKB length. If this assumption is
    violated, we can attempt to read an invalid frags array element, which
    would cause a fault.

    tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN
    flag is set. Therefore, we must guard against this occurrence inside
    skb_advance_to_frag().

    One way that we can reproduce this error follows:
    1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with:
    char some_array[32 * 1024];
    struct tcp_zerocopy_receive zc = {
      .copybuf_address  = (__u64) &some_array[0],
      .copybuf_len = 32 * 1024,
    };

    2) In a sender program, after a TCP handshake, send the following
    sequence of packets:
      i) Seq = [X, X+4000]
      ii) Seq = [X+4000, X+5000]
      iii) Seq = [X+4000, X+5000], Flags = FIN | URG, urgptr=1000

    (This can happen without URG, if we have a signal pending, but URG is
    a convenient way to reproduce the behaviour).

    In this case, the following event sequence will occur on the receiver:

    tcp_zerocopy_receive():
    -> receive_fallback_to_copy() // copybuf_len >= inq
    -> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG
    -> tcp_recv_skb() // yields skb with skb->len == offset
    -> tcp_zerocopy_set_hint_for_skb()
    -> skb_advance_to_frag() // will return a frags ptr. >= nr_frags
    -> find_next_mappable_frag() // will dereference this bad frags ptr.

    With this patch, skb_advance_to_frag() will no longer return an
    invalid frags pointer, and will return NULL instead, fixing the issue.
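
The guarded walk can be sketched in userspace as follows. This is an
illustration only: struct model_frag and model_advance_to_frag() are
hypothetical names, and the real skb_advance_to_frag() operates on skb
fragments rather than a plain array of sizes.

```c
#include <stddef.h>

/* Hypothetical fragment descriptor; only the size matters here. */
struct model_frag {
    size_t size;
};

/*
 * Walk the fragment array until offset_skb falls inside a fragment.
 * Returns NULL when the offset is at or past the end of all fragments.
 * On success, *offset_frag is the offset within the returned fragment.
 */
static const struct model_frag *
model_advance_to_frag(const struct model_frag *frags, size_t nr_frags,
                      size_t offset_skb, size_t *offset_frag)
{
    size_t i;

    for (i = 0; i < nr_frags; i++) {
        if (offset_skb < frags[i].size) {
            *offset_frag = offset_skb;
            return &frags[i];
        }
        offset_skb -= frags[i].size;
    }
    /* Offset equals or exceeds the total length: nothing left to map. */
    *offset_frag = 0;
    return NULL;
}
```

A caller in this model has to check for NULL before touching the fragment,
which corresponds to the skb->len == offset case in the sequence above.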

    Signed-off-by: Arjun Roy <arjunroy@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Fixes: 05255b823a ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
    Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-02 12:03:54 +01:00
Talal Ahmad 358ed62420 tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
sk_wmem_schedule makes sure that sk_forward_alloc has enough
bytes for charging that is going to be done by sk_mem_charge.

In the transmit zerocopy path, there is sk_mem_charge but there was
no call to sk_wmem_schedule. This change adds that call.

Without this call to sk_wmem_schedule, sk_forward_alloc can go
negative, which is a bug: sk_forward_alloc is per-socket space
that has already been forward charged, so it can't be negative.
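
The ordering can be illustrated with a small userspace model. Everything
here is an assumption made for the sketch (the struct, the 4096-byte
quantum, the shared pool); it is not the kernel's memory accounting code.

```c
#include <stdbool.h>

#define MODEL_QUANTUM 4096  /* hypothetical reservation granularity */

struct model_sock {
    int forward_alloc;   /* bytes reserved but not yet charged */
    long pool_left;      /* bytes still available in a shared pool */
};

/* Ensure forward_alloc covers size, reserving whole quanta from the pool. */
static bool model_wmem_schedule(struct model_sock *sk, int size)
{
    int need, amount;

    if (size <= sk->forward_alloc)
        return true;
    need = size - sk->forward_alloc;
    amount = (need + MODEL_QUANTUM - 1) / MODEL_QUANTUM * MODEL_QUANTUM;
    if (amount > sk->pool_left)
        return false;
    sk->pool_left -= amount;
    sk->forward_alloc += amount;
    return true;
}

/* Charge size bytes; only safe after a successful schedule. */
static void model_mem_charge(struct model_sock *sk, int size)
{
    sk->forward_alloc -= size;
}

/* Schedule first, then charge, so forward_alloc never drops below zero. */
static bool model_charge_checked(struct model_sock *sk, int size)
{
    if (!model_wmem_schedule(sk, size))
        return false;
    model_mem_charge(sk, size);
    return true;
}
```

Calling model_mem_charge() without the schedule step on an empty
reservation drives forward_alloc negative in this model, which mirrors the
bug described above.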

Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Talal Ahmad <talalahmad@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 11:25:24 -07:00
Linus Torvalds dbe69e4337 Networking changes for 5.14.
Core:
 
  - BPF:
    - add syscall program type and libbpf support for generating
      instructions and bindings for in-kernel BPF loaders (BPF loaders
      for BPF), this is a stepping stone for signed BPF programs
    - infrastructure to migrate TCP child sockets from one listener
      to another in the same reuseport group/map to improve flexibility
      of service hand-off/restart
    - add broadcast support to XDP redirect
 
  - allow bypass of the lockless qdisc to improve performance
    (for pktgen: +23% with one thread, +44% with 2 threads)
 
  - add a simpler version of "DO_ONCE()" which does not require
    jump labels, intended for slow-path usage
 
  - virtio/vsock: introduce SOCK_SEQPACKET support
 
  - add getsockopt to retrieve netns cookie
 
  - ip: treat lowest address of an IPv4 subnet as ordinary unicast address
        allowing reclaiming of precious IPv4 addresses
 
  - ipv6: use prandom_u32() for ID generation
 
  - ip: add support for more flexible field selection for hashing
        across multi-path routes (w/ offload to mlxsw)
 
  - icmp: add support for extended RFC 8335 PROBE (ping)
 
  - seg6: add support for SRv6 End.DT46 behavior
 
  - mptcp:
     - DSS checksum support (RFC 8684) to detect middlebox meddling
     - support Connection-time 'C' flag
     - time stamping support
 
  - sctp: packetization Layer Path MTU Discovery (RFC 8899)
 
  - xfrm: speed up state addition with seq set
 
  - WiFi:
     - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
     - aggregation handling improvements for some drivers
     - minstrel improvements for no-ack frames
     - deferred rate control for TXQs to improve reaction times
     - switch from round robin to virtual time-based airtime scheduler
 
  - add trace points:
     - tcp checksum errors
     - openvswitch - action execution, upcalls
     - socket errors via sk_error_report
 
 Device APIs:
 
  - devlink: add rate API for hierarchical control of max egress rate
             of virtual devices (VFs, SFs etc.)
 
  - don't require RCU read lock to be held around BPF hooks
    in NAPI context
 
  - page_pool: generic buffer recycling
 
 New hardware/drivers:
 
  - mobile:
     - iosm: PCIe Driver for Intel M.2 Modem
     - support for Qualcomm MSM8998 (ipa)
 
  - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
 
  - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
 
  - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
 
  - NXP SJA1110 Automotive Ethernet 10-port switch
 
  - Qualcomm QCA8327 switch support (qca8k)
 
  - Mikrotik 10/25G NIC (atl1c)
 
 Driver changes:
 
  - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP
    (our first foray into MAC/PHY description via ACPI)
 
  - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
 
  - Mellanox/Nvidia NIC (mlx5)
    - NIC VF offload of L2 bridging
    - support IRQ distribution to Sub-functions
 
  - Marvell (prestera):
     - add flower and match all
     - devlink trap
     - link aggregation
 
  - Netronome (nfp): connection tracking offload
 
  - Intel 1GE (igc): add AF_XDP support
 
  - Marvell DPU (octeontx2): ingress ratelimit offload
 
  - Google vNIC (gve): new ring/descriptor format support
 
  - Qualcomm mobile (rmnet & ipa): inline checksum offload support
 
  - MediaTek WiFi (mt76)
     - mt7915 MSI support
     - mt7915 Tx status reporting
     - mt7915 thermal sensors support
     - mt7921 decapsulation offload
     - mt7921 enable runtime pm and deep sleep
 
  - Realtek WiFi (rtw88)
     - beacon filter support
     - Tx antenna path diversity support
     - firmware crash information via devcoredump
 
  - Qualcomm WiFi (wcn36xx)
     - Wake-on-WLAN support with magic packets and GTK rekeying
 
  - Micrel PHY (ksz886x/ksz8081): add cable test support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
  tcp: change ICSK_CA_PRIV_SIZE definition
  tcp_yeah: check struct yeah size at compile time
  gve: DQO: Fix off by one in gve_rx_dqo()
  stmmac: intel: set PCI_D3hot in suspend
  stmmac: intel: Enable PHY WOL option in EHL
  net: stmmac: option to enable PHY WOL with PMT enabled
  net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
  net: use netdev_info in ndo_dflt_fdb_{add,del}
  ptp: Set lookup cookie when creating a PTP PPS source.
  net: sock: add trace for socket errors
  net: sock: introduce sk_error_report
  net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
  net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
  net: dsa: include fdb entries pointing to bridge in the host fdb list
  net: dsa: include bridge addresses which are local in the host fdb list
  net: dsa: sync static FDB entries on foreign interfaces to hardware
  net: dsa: install the host MDB and FDB entries in the master's RX filter
  net: dsa: reference count the FDB addresses at the cross-chip notifier level
  net: dsa: introduce a separate cross-chip notifier type for host FDBs
  net: dsa: reference count the MDB entries at the cross-chip notifier level
  ...
2021-06-30 15:51:09 -07:00
Alexander Aring e3ae2365ef net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 11:28:21 -07:00
Liam Howlett 47bdd1db16 net/ipv4/tcp: use vma_lookup() in tcp_zerocopy_receive()
Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Link: https://lkml.kernel.org/r/20210521174745.2219620-13-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:51 -07:00
Florian Westphal 892bfd3ded tcp: export timestamp helpers for mptcp
MPTCP is builtin, so no need to add EXPORT_SYMBOL()s.

It will be used to support SO_TIMESTAMP(NS) ancillary
messages in the mptcp receive path.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-04 14:08:09 -07:00
Arjun Roy a6f8ee58a8 tcp: Specify cmsgbuf is user pointer for receive zerocopy.
A prior change (1f466e1f15) introduces separate handling for
->msg_control depending on whether the pointer is a kernel or user
pointer. However, while tcp receive zerocopy is using this field, it
is not properly annotating that the buffer in this case is a user
pointer. This can cause faults when the improper mechanism is used
within put_cmsg().

This patch simply annotates tcp receive zerocopy's use as explicitly
being a user pointer.

Fixes: 7eeba1706e ("tcp: Add receive timestamp support for receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210506223530.2266456-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-06 18:05:35 -07:00
Eric Dumazet a7150e3822 Revert "tcp: Reset tcp connections in SYN-SENT state"
This reverts commit e880f8b3a2.

1) Patch has not been properly tested, and is wrong [1]
2) Patch submission did not include TCP maintainer (this is me)

[1]
divide error: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8426 Comm: syz-executor478 Not tainted 5.12.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tcp_select_window+0x56d/0xad0 net/ipv4/tcp_output.c:3015
Code: 44 89 ff e8 d5 cd f0 f9 45 39 e7 0f 8d 20 ff ff ff e8 f7 c7 f0 f9 44 89 e3 e9 13 ff ff ff e8 ea c7 f0 f9 44 89 e0 44 89 e3 99 <f7> 7c 24 04 29 d3 e9 fc fe ff ff e8 d3 c7 f0 f9 41 f7 dc bf 1f 00
RSP: 0018:ffffc9000184fac0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff87832e76 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff87832e14 R11: 0000000000000000 R12: 0000000000000000
R13: 1ffff92000309f5c R14: 0000000000000000 R15: 0000000000000000
FS:  00000000023eb300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc2b5f426c0 CR3: 000000001c5cf000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcp_select_window net/ipv4/tcp_output.c:264 [inline]
 __tcp_transmit_skb+0xa82/0x38f0 net/ipv4/tcp_output.c:1351
 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
 tcp_send_active_reset+0x475/0x8e0 net/ipv4/tcp_output.c:3449
 tcp_disconnect+0x15a9/0x1e60 net/ipv4/tcp.c:2955
 inet_shutdown+0x260/0x430 net/ipv4/af_inet.c:905
 __sys_shutdown_sock net/socket.c:2189 [inline]
 __sys_shutdown_sock net/socket.c:2183 [inline]
 __sys_shutdown+0xf1/0x1b0 net/socket.c:2201
 __do_sys_shutdown net/socket.c:2209 [inline]
 __se_sys_shutdown net/socket.c:2207 [inline]
 __x64_sys_shutdown+0x50/0x70 net/socket.c:2207
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: e880f8b3a2 ("tcp: Reset tcp connections in SYN-SENT state")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Manoj Basapathi <manojbm@codeaurora.org>
Cc: Sauvik Saha <ssaha@codeaurora.org>
Link: https://lore.kernel.org/r/20210409170237.274904-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 16:35:31 -07:00
Manoj Basapathi e880f8b3a2 tcp: Reset tcp connections in SYN-SENT state
Userspace sends a tcp connection (sock) destroy on network switch,
i.e., when switching the default network of the device between multiple
networks (Cellular/Wifi/Ethernet).

The kernel, though, doesn't send a reset for connections in the SYN-SENT
state, so these connections linger.
Even as per RFC 793, there is no hard rule to not send RST on ABORT in
this state.

Modify tcp_abort and tcp_disconnect behavior to send RST for connections
in syn-sent state to avoid lingering connections on network switch.

Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org>
Signed-off-by: Sauvik Saha <ssaha@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-06 16:17:20 -07:00
Yonghong Song 97a19caf1b bpf: net: Emit anonymous enum with BPF_TCP_CLOSE value explicitly
The selftest failed to compile with clang-built bpf-next.
Adding LLVM=1 to your vmlinux and selftest build will use clang.
The error message is:
  progs/test_sk_storage_tracing.c:38:18: error: use of undeclared identifier 'BPF_TCP_CLOSE'
          if (newstate == BPF_TCP_CLOSE)
                          ^
  1 error generated.
  make: *** [Makefile:423: /bpf-next/tools/testing/selftests/bpf/test_sk_storage_tracing.o] Error 1

The reason for the failure is that BPF_TCP_CLOSE, a value of
an anonymous enum defined in uapi bpf.h, is not defined in
vmlinux.h. gcc does not have this problem. Since vmlinux.h
is derived from BTF which is derived from vmlinux DWARF,
that means gcc-produced vmlinux DWARF has BPF_TCP_CLOSE
while llvm-produced vmlinux DWARF does not.

BPF_TCP_CLOSE is referenced in net/ipv4/tcp.c as
  BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
The following test mimics the above BUILD_BUG_ON, preprocessed
with clang compiler, and shows gcc DWARF contains BPF_TCP_CLOSE while
llvm DWARF does not.

  $ cat t.c
  enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_CLOSE = 7,
  };
  enum {
    TCP_ESTABLISHED = 1,
    TCP_CLOSE = 7,
  };

  int test() {
    do {
      extern void __compiletime_assert_767(void) ;
      if ((int)BPF_TCP_CLOSE != (int)TCP_CLOSE) __compiletime_assert_767();
    } while (0);
    return 0;
  }
  $ clang t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
  $ gcc t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
                    DW_AT_name    ("BPF_TCP_CLOSE")

Further inspection of the clang code shows that clang evaluates the
condition at compile time. If it is definitely true or false, clang
optimizes the whole if statement away before generating
IR/debuginfo.

This patch explicitly adds an expression after the
above mentioned BUILD_BUG_ON in net/ipv4/tcp.c like
  (void)BPF_TCP_ESTABLISHED
to enable generation of debuginfo for the anonymous
enum which also includes BPF_TCP_CLOSE.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210317174132.589276-1-yhs@fb.com
2021-03-17 18:45:40 -07:00
Eric Dumazet 8811f4a983 tcp: add sanity tests to TCP_QUEUE_SEQ
Qingyu Li reported a syzkaller bug where the repro
changes RCV SEQ _after_ restoring data in the receive queue.

mprotect(0x4aa000, 12288, PROT_READ)    = 0
mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
recvfrom(3, NULL, 20, 0, NULL, NULL)    = -1 ECONNRESET (Connection reset by peer)

syslog shows:
[  111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
[  111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0

This should not be allowed. TCP_QUEUE_SEQ should only be used
when queues are empty.

This patch fixes this case, and the tx path as well.
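
A minimal userspace sketch of the rule "TCP_QUEUE_SEQ only when queues are
empty". The struct, the enum and the error codes chosen here are assumptions
for illustration; they are not taken from the patch itself.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the repair-queue selector. */
enum model_repair_queue { MODEL_NO_QUEUE, MODEL_RECV_QUEUE, MODEL_SEND_QUEUE };

struct model_tcp {
    enum model_repair_queue repair_queue;
    size_t recv_queued;   /* buffers waiting in the receive queue */
    size_t send_queued;   /* buffers waiting in the write queue */
    uint32_t rcv_nxt;
    uint32_t write_seq;
};

/* Refuse to move a sequence number while the selected queue holds data. */
static int model_set_queue_seq(struct model_tcp *tp, uint32_t seq)
{
    switch (tp->repair_queue) {
    case MODEL_RECV_QUEUE:
        if (tp->recv_queued != 0)
            return -EPERM;
        tp->rcv_nxt = seq;
        return 0;
    case MODEL_SEND_QUEUE:
        if (tp->send_queued != 0)
            return -EPERM;
        tp->write_seq = seq;
        return 0;
    default:
        return -EINVAL;
    }
}
```

In the repro above, the sendmsg() call leaves data in the receive queue, so
a check of this kind would reject the later TCP_QUEUE_SEQ call.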

Fixes: ee9952831c ("tcp: Initial repair mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005
Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 15:32:05 -08:00
Arjun Roy 2107d45f17 tcp: Fix sign comparison bug in getsockopt(TCP_ZEROCOPY_RECEIVE)
getsockopt(TCP_ZEROCOPY_RECEIVE) has a bug where we read a
user-provided "len" field of type signed int, and then compare the
value to the result of an "offsetofend" operation, which is unsigned.

Negative values provided by the user will be promoted to large
positive numbers; thus checking that len < offsetofend() will return
false when the intention was that it return true.

Note that while len is originally checked for negative values earlier
on in do_tcp_getsockopt(), subsequent calls to get_user() re-read the
value from userspace which may have changed in the meantime.

Therefore, re-add the check for negative values after the call to
get_user in the handler code for TCP_ZEROCOPY_RECEIVE.
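
The conversion problem can be reproduced in plain C. The helper names below
are hypothetical, and the sketch assumes size_t is at least as wide as int,
as on common Linux targets.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Buggy form: the explicit cast reproduces the implicit conversion that
 * happens when a signed int is compared with an unsigned size_t, so a
 * negative len turns into a very large value and passes the check.
 */
static bool len_rejected_buggy(int len, size_t min_len)
{
    return (size_t)len < min_len;
}

/* Fixed form: reject negative lengths before the unsigned comparison. */
static bool len_rejected_fixed(int len, size_t min_len)
{
    if (len < 0)
        return true;
    return (size_t)len < min_len;
}
```

With len = -1 the buggy helper accepts the length while the fixed helper
rejects it, which is the behaviour the re-added check restores.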

Fixes: c8856c0514 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Arjun Roy <arjunroy@google.com>
Link: https://lore.kernel.org/r/20210225232628.4033281-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:47:15 -08:00
David S. Miller b8af417e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-02-16

The following pull-request contains BPF updates for your *net-next* tree.

There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

  [...]
                lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
                release_sock(sk);
  [...]

We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

The main changes are:

1) Add support for pointers to types with known size among global function
   args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.

2) Add bpf_iter for task_vma which can be used to generate information similar
   to /proc/pid/maps, from Song Liu.

3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
   rewriting bind user ports from BPF side below the ip_unprivileged_port_start
   range, both from Stanislav Fomichev.

4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
   as well as per-cpu maps for the latter, from Alexei Starovoitov.

5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
   for sleepable programs, both from KP Singh.

6) Extend verifier to enable variable offset read/write access to the BPF
   program stack, from Andrei Matei.

7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
   query device MTU from programs, from Jesper Dangaard Brouer.

8) Allow bpf_get_socket_cookie() helper to also be called from [sleepable] BPF
   tracing programs, from Florent Revest.

9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
   otherwise too many passes are required, from Gary Lin.

10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16 13:14:06 -08:00
Eric Dumazet 05dc72aba3 tcp: factorize logic into tcp_epollin_ready()
Both tcp_data_ready() and tcp_stream_is_readable() share the same logic.

Add tcp_epollin_ready() helper to avoid duplication.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12 17:28:26 -08:00
Arjun Roy 3c5a2fd042 tcp: Sanitize CMSG flags and reserved args in tcp_zerocopy_receive.
Explicitly define reserved field and require it and any subsequent
fields to be zero-valued for now. Additionally, limit the valid CMSG
flags that tcp_zerocopy_receive accepts.
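
A sketch of the validation pattern, assuming a made-up option struct and
flag mask; the field names and MODEL_VALID_FLAGS are hypothetical and do not
reflect the real struct tcp_zerocopy_receive layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option struct; not the real struct tcp_zerocopy_receive. */
struct model_opts {
    uint64_t address;
    uint32_t length;
    uint32_t flags;
    uint64_t reserved;  /* must be zero until a meaning is assigned */
};

#define MODEL_VALID_FLAGS 0x3u  /* hypothetical set of accepted flag bits */

/* Accept the request only if reserved is zero and no unknown flag is set. */
static bool model_opts_valid(const struct model_opts *o)
{
    if (o->reserved != 0)
        return false;
    if (o->flags & ~MODEL_VALID_FLAGS)
        return false;
    return true;
}
```

Rejecting non-zero reserved fields now keeps them available for later
extensions, since no existing caller can be relying on garbage values there.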

Fixes: 7eeba1706e ("tcp: Add receive timestamp support for receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: David Ahern <dsahern@gmail.com>
Suggested-by: Leon Romanovsky <leon@kernel.org>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-11 18:25:05 -08:00
Arjun Roy 7eeba1706e tcp: Add receive timestamp support for receive zerocopy.
tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:05:56 -08:00
Arjun Roy 925bba24e6 tcp: Remove CMSG magic numbers for tcp_recvmsg().
At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().

To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:05:56 -08:00
Yousuk Seung e7ed11ee94 tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 18:20:52 -08:00
Stanislav Fomichev 9cacf81f81 bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern

Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have conflicting
definitions.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
2021-01-20 14:23:00 -08:00
Jakub Kicinski 0fe2f273ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/can/dev.c
  commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
  commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")

  Code move.

drivers/net/dsa/b53/b53_common.c
 commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
 commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")

 Field rename.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 12:16:11 -08:00
Enke Chen 9d9b1ee0b2 tcp: fix TCP_USER_TIMEOUT with zero window
The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.

The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

    RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
    as long as the receiver continues to respond probes. We support
    this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.

In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.
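
The timestamp idea described above can be illustrated with a small userspace sketch. This is not the patch itself: the struct, its fields, and the zw_* helpers are hypothetical names, and time is passed in as plain milliseconds. It only shows why a start timestamp that survives zero-window ACKs measures how long the window has been closed, while a probe counter reset by every ACK does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-connection state; not the kernel's structs. */
struct zw_probe_state {
    uint32_t probes_out;      /* unanswered probes; reset by any ACK */
    uint64_t probe_start_ms;  /* when zero-window probing began; 0 = not probing */
    uint32_t user_timeout_ms; /* user timeout in ms; 0 = disabled */
};

/* A zero-window probe is sent: remember when probing started. */
static void zw_on_probe_sent(struct zw_probe_state *s, uint64_t now_ms)
{
    assert(now_ms != 0); /* 0 is reserved to mean "not probing" */
    s->probes_out++;
    if (s->probe_start_ms == 0)
        s->probe_start_ms = now_ms;
}

/* An ACK arrives: the counter always resets, the timestamp only
 * resets when the peer advertises a non-zero window. */
static void zw_on_ack(struct zw_probe_state *s, uint32_t window)
{
    s->probes_out = 0;
    if (window != 0)
        s->probe_start_ms = 0;
}

/* Abort once the window has stayed closed for at least the user timeout. */
static bool zw_should_abort(const struct zw_probe_state *s, uint64_t now_ms)
{
    if (s->user_timeout_ms == 0 || s->probe_start_ms == 0)
        return false;
    return now_ms - s->probe_start_ms >= s->user_timeout_ms;
}
```

In this sketch a peer that keeps answering probes with zero-window ACKs still lets the elapsed time accumulate, so zw_should_abort() eventually returns true even though probes_out never grows.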

Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 19:59:17 -08:00
Jonathan Lemon 8e04491724 skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}
Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
prefix from the naming to make things clearer.

Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon 06b4feb37e net: group skb_shinfo zerocopy related bits together.
In preparation for expanded zerocopy (TX and RX), move
the zerocopy related bits out of tx_flags into their own
flag word.
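
As a rough illustration of the layout change described above, the sketch below keeps zerocopy bits in their own flag word next to the transmit flags. All names here (demo_shinfo, the DEMO_* constants, the helpers) are hypothetical and do not reflect the kernel's actual field names or bit values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical transmit-related bits that stay in tx_flags. */
enum { DEMO_TX_HW_TSTAMP = 1 << 0, DEMO_TX_SW_TSTAMP = 1 << 1 };

/* Hypothetical zerocopy bits, now numbered in their own flag word. */
enum { DEMO_ZC_ENABLED = 1 << 0, DEMO_ZC_SHARED_FRAG = 1 << 1 };

/* Simplified stand-in for the shared info block of a buffer. */
struct demo_shinfo {
    uint8_t tx_flags; /* transmit-only bits */
    uint8_t zc_flags; /* zerocopy bits, usable by both TX and RX paths */
};

/* Mark the buffer as zerocopy without touching the transmit bits. */
static void demo_enable_zerocopy(struct demo_shinfo *sh)
{
    assert(sh != 0);
    sh->zc_flags |= DEMO_ZC_ENABLED;
}

/* Query the zerocopy state from the dedicated flag word. */
static int demo_is_zerocopy(const struct demo_shinfo *sh)
{
    return (sh->zc_flags & DEMO_ZC_ENABLED) != 0;
}
```

With separate words, a receive path can test or set zerocopy state without depending on bits whose name and meaning are transmit-specific.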

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon 8c793822c5 skbuff: rename sock_zerocopy_* to msg_zerocopy_*
At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:35 -08:00
Jonathan Lemon 236a6b1cd5 skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort
The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation.  Add a wrapper
which checks the callback and dispatches appropriately.
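
The wrapper pattern described above might look roughly like the following userspace sketch. The types and function names (demo_ubuf, demo_msg_callback, demo_put_abort and so on) are hypothetical stand-ins, and the fallback branch is an assumption made for illustration, not necessarily what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_ubuf;
typedef void (*demo_callback_t)(struct demo_ubuf *uarg, bool success);

/* Hypothetical stand-in for the zerocopy completion object. */
struct demo_ubuf {
    demo_callback_t callback; /* identifies the zerocopy implementation */
    int refcnt;
    int aborted;  /* set by the implementation-specific abort path */
    int notified; /* set when a generic callback is invoked */
};

/* Completion callback of the current (MSG_ZEROCOPY-style) implementation. */
static void demo_msg_callback(struct demo_ubuf *uarg, bool success)
{
    (void)success;
    uarg->notified = 1;
}

/* Completion callback of some other, hypothetical implementation. */
static void demo_other_callback(struct demo_ubuf *uarg, bool success)
{
    (void)success;
    uarg->notified = 1;
}

/* Abort logic that only makes sense for the current implementation. */
static void demo_msg_put_abort(struct demo_ubuf *uarg)
{
    assert(uarg->refcnt > 0);
    uarg->aborted = 1;
    uarg->refcnt--;
}

/* Generic wrapper: inspect the callback and dispatch accordingly. */
static void demo_put_abort(struct demo_ubuf *uarg)
{
    if (uarg == NULL)
        return;
    if (uarg->callback == demo_msg_callback)
        demo_msg_put_abort(uarg);
    else if (uarg->callback != NULL)
        uarg->callback(uarg, false); /* assumed fallback: report failure */
}
```

Callers use only demo_put_abort(), so implementation-specific abort handling stays behind the callback check and other zerocopy users are not forced through it.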

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00