Commit Graph

805 Commits

Author SHA1 Message Date
CKI Backport Bot e057f6c3ff tcp: fix excessive TLP and RACK timeouts from HZ rounding
JIRA: https://issues.redhat.com/browse/RHEL-83546

commit 1c2709cfff1dedbb9591e989e2f001484208d914
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Oct 15 13:47:00 2023 -0400

    tcp: fix excessive TLP and RACK timeouts from HZ rounding

    We discovered from packet traces of slow loss recovery on kernels with
    the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
    when receiving a SACKed sequence range, the RACK reordering timer was
    firing after about 16ms rather than the desired value of roughly
    min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
    calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
    with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
    exact same issue.

    This commit fixes the TLP transmit timer and RACK reordering timer
    floor calculation to more closely match the intended 2ms floor even on
    kernels with HZ=250. It does this by adding in a new
    TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
    instead of the current approach of converting to jiffies and then
    adding the TCP_TIMEOUT_MIN value of 2 jiffies.

    Our testing has verified that on kernels with HZ=1000, as expected,
    this does not produce significant changes in behavior, but on kernels
    with the default HZ=250 the latency improvement can be large. For
    example, our tests show that for HZ=250 kernels at low RTTs this fix
    roughly halves the latency for the RACK reorder timer: instead of
    mostly firing at 16ms it mostly fires at 8ms.
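
The difference between the two floor calculations can be sketched with plain integer arithmetic. This is a simplified userspace model, not the kernel code: `us_to_jiffies()` is a hypothetical stand-in for the kernel's `usecs_to_jiffies()` that rounds up to whole ticks, and the helper names and the 100 us input delay are assumptions chosen for illustration.

```c
#include <assert.h>

#define TIMEOUT_MIN_JIFFIES 2UL     /* TCP_TIMEOUT_MIN, per the commit message */
#define TIMEOUT_MIN_US      2000UL  /* TCP_TIMEOUT_MIN_US, per the commit message */

/* Hypothetical stand-in for usecs_to_jiffies(): round up to whole ticks. */
static unsigned long us_to_jiffies(unsigned long us, unsigned long hz)
{
    assert(hz > 0 && hz <= 1000000UL);
    unsigned long us_per_jiffy = 1000000UL / hz;

    return (us + us_per_jiffy - 1) / us_per_jiffy;
}

/* Old approach: convert the delay to jiffies, then add a 2-jiffy floor. */
static unsigned long old_timeout_jiffies(unsigned long delay_us, unsigned long hz)
{
    return us_to_jiffies(delay_us, hz) + TIMEOUT_MIN_JIFFIES;
}

/* New approach: add the 2000 us floor first, then convert to jiffies. */
static unsigned long new_timeout_jiffies(unsigned long delay_us, unsigned long hz)
{
    return us_to_jiffies(delay_us + TIMEOUT_MIN_US, hz);
}

/* Convert a tick count back to milliseconds for comparison. */
static unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
    return jiffies * 1000UL / hz;
}
```

In this model, a 100 us delay at HZ=250 yields 3 jiffies (12 ms) with the old calculation and 1 jiffy (4 ms) with the new one, while at HZ=1000 both yield 3 jiffies (3 ms). The firing times measured in the commit message (16 ms and 8 ms) are larger than these figures; the sketch does not attempt to reproduce them.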

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Fixes: bb4d991a28 ("tcp: adjust tail loss probe timeout")
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2025-03-14 13:21:38 +00:00
Paolo Abeni e1733b3630 tcp: fix mptcp DSS corruption due to large pmtu xmit
JIRA: https://issues.redhat.com/browse/RHEL-55470
Tested: vs self-tests
Conflicts: tcp_can_coalesce_send_queue_head() lacks the \
	skb_frags_readable(skb) != skb_frags_readable(next)) \
	check, as RHEL-9 lacks the upstream DEVMEM series. \
	Replacing the current condition with the upstream code is safe:\
	DEVMEM should not be backported to RHEL-9 and if that should\
	happen, the tcp_skb_can_collapse() will be updated accordingly.

Upstream commit:
commit 4dabcdf581217e60690467a37c956a5b8dbc6bd9
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Oct 8 13:04:53 2024 +0200

    tcp: fix mptcp DSS corruption due to large pmtu xmit

    Syzkaller was able to trigger a DSS corruption:

      TCP: request_sock_subflow_v4: Possible SYN flooding on port [::]:20002. Sending cookies.
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 5227 at net/mptcp/protocol.c:695 __mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695
      Modules linked in:
      CPU: 0 UID: 0 PID: 5227 Comm: syz-executor350 Not tainted 6.11.0-syzkaller-08829-gaf9c191ac2a0 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
      RIP: 0010:__mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695
      Code: 0f b6 dc 31 ff 89 de e8 b5 dd ea f5 89 d8 48 81 c4 50 01 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 98 da ea f5 90 <0f> 0b 90 e9 47 ff ff ff e8 8a da ea f5 90 0f 0b 90 e9 99 e0 ff ff
      RSP: 0018:ffffc90000006db8 EFLAGS: 00010246
      RAX: ffffffff8ba9df18 RBX: 00000000000055f0 RCX: ffff888030023c00
      RDX: 0000000000000100 RSI: 00000000000081e5 RDI: 00000000000055f0
      RBP: 1ffff110062bf1ae R08: ffffffff8ba9cf12 R09: 1ffff110062bf1b8
      R10: dffffc0000000000 R11: ffffed10062bf1b9 R12: 0000000000000000
      R13: dffffc0000000000 R14: 00000000700cec61 R15: 00000000000081e5
      FS:  000055556679c380(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020287000 CR3: 0000000077892000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       move_skbs_to_msk net/mptcp/protocol.c:811 [inline]
       mptcp_data_ready+0x29c/0xa90 net/mptcp/protocol.c:854
       subflow_data_ready+0x34a/0x920 net/mptcp/subflow.c:1490
       tcp_data_queue+0x20fd/0x76c0 net/ipv4/tcp_input.c:5283
       tcp_rcv_established+0xfba/0x2020 net/ipv4/tcp_input.c:6237
       tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915
       tcp_v4_rcv+0x2dc0/0x37f0 net/ipv4/tcp_ipv4.c:2350
       ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
       NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
       NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
       __netif_receive_skb_one_core net/core/dev.c:5662 [inline]
       __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
       process_backlog+0x662/0x15b0 net/core/dev.c:6107
       __napi_poll+0xcb/0x490 net/core/dev.c:6771
       napi_poll net/core/dev.c:6840 [inline]
       net_rx_action+0x89b/0x1240 net/core/dev.c:6962
       handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
       do_softirq+0x11b/0x1e0 kernel/softirq.c:455
       </IRQ>
       <TASK>
       __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
       local_bh_enable include/linux/bottom_half.h:33 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline]
       __dev_queue_xmit+0x1764/0x3e80 net/core/dev.c:4451
       dev_queue_xmit include/linux/netdevice.h:3094 [inline]
       neigh_hh_output include/net/neighbour.h:526 [inline]
       neigh_output include/net/neighbour.h:540 [inline]
       ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:236
       ip_local_out net/ipv4/ip_output.c:130 [inline]
       __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:536
       __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
       tcp_transmit_skb net/ipv4/tcp_output.c:1484 [inline]
       tcp_mtu_probe net/ipv4/tcp_output.c:2547 [inline]
       tcp_write_xmit+0x641d/0x6bf0 net/ipv4/tcp_output.c:2752
       __tcp_push_pending_frames+0x9b/0x360 net/ipv4/tcp_output.c:3015
       tcp_push_pending_frames include/net/tcp.h:2107 [inline]
       tcp_data_snd_check net/ipv4/tcp_input.c:5714 [inline]
       tcp_rcv_established+0x1026/0x2020 net/ipv4/tcp_input.c:6239
       tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915
       sk_backlog_rcv include/net/sock.h:1113 [inline]
       __release_sock+0x214/0x350 net/core/sock.c:3072
       release_sock+0x61/0x1f0 net/core/sock.c:3626
       mptcp_push_release net/mptcp/protocol.c:1486 [inline]
       __mptcp_push_pending+0x6b5/0x9f0 net/mptcp/protocol.c:1625
       mptcp_sendmsg+0x10bb/0x1b10 net/mptcp/protocol.c:1903
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg+0x1a6/0x270 net/socket.c:745
       ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2603
       ___sys_sendmsg net/socket.c:2657 [inline]
       __sys_sendmsg+0x2aa/0x390 net/socket.c:2686
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7fb06e9317f9
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffe2cfd4f98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007fb06e97f468 RCX: 00007fb06e9317f9
      RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000005
      RBP: 00007fb06e97f446 R08: 0000555500000000 R09: 0000555500000000
      R10: 0000555500000000 R11: 0000000000000246 R12: 00007fb06e97f406
      R13: 0000000000000001 R14: 00007ffe2cfd4fe0 R15: 0000000000000003
       </TASK>

    Additionally syzkaller provided a nice reproducer. The repro enables
    pmtu on the loopback device, leading to tcp_mtu_probe() generating
    very large probe packets.

    tcp_can_coalesce_send_queue_head() currently does not check for
    mptcp-level invariants, and allowed the creation of cross-DSS probes,
    leading to the mentioned corruption.

    Address the issue by teaching tcp_can_coalesce_send_queue_head() about
    mptcp using the tcp_skb_can_collapse(), also reducing the code
    duplication.

    Fixes: 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions")
    Cc: stable@vger.kernel.org
    Reported-by: syzbot+d1bff73460e33101f0e7@syzkaller.appspotmail.com
    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/513
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-2-c6fb8e93e551@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-11 17:10:12 +02:00
Lucas Zampieri 55f96777fb Merge: net: backport visibility improvements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765

JIRA: https://issues.redhat.com/browse/RHEL-48648  
  
Various visibility improvements; mainly around drop reasons, reset reason and improved tracepoints this time.  
  
Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-12 16:18:50 +00:00
Antoine Tenart 51c78f9a4a rstreason: make it work in trace world
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b533fb9cf4f7c6ca2aa255a5a1fdcde49fff2b24
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:40 2024 +0800

    rstreason: make it work in trace world

    Finally, make the reset reason usable by exposing it in the
    tracing world.

    One of the possible expected outputs is:
    ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
    state=TCP_ESTABLISHED reason=NOT_SPECIFIED

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart aef83a52dd rstreason: prepare for active reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commit e13ec3da05d1 ("tcp:
  annotate lockless access to sk->sk_err") in c9s.

commit 5691276b39daf90294c6a81fb6d62d667f634c92
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:36 2024 +0800

    rstreason: prepare for active reset

    As was done for passive resets, pass the possible reset reason
    in each active reset path.

    No functional changes.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Xin Long a6a99aa0fb net: annotate data-races around sk->sk_dst_pending_confirm
JIRA: https://issues.redhat.com/browse/RHEL-41185
Tested: compile only

commit eb44ad4e635132754bfbcb18103f1dcb7058aedd
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Sep 21 20:28:18 2023 +0000

    net: annotate data-races around sk->sk_dst_pending_confirm

    This field can be read or written without socket lock being held.

    Add annotations to avoid load-store tearing.
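
The annotation pattern can be sketched in userspace C as below. The `READ_ONCE()`/`WRITE_ONCE()` macros here are simplified stand-ins built on volatile accesses (the kernel's versions handle more cases), they rely on the GCC/Clang `__typeof__` extension, and `struct toy_sock` with its helpers is hypothetical and only loosely modelled on the field named in the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: force exactly one access through a volatile lvalue. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical socket with a flag that is accessed without a lock. */
struct toy_sock {
    unsigned char dst_pending_confirm;
};

/* Mark a confirmation as pending; skip the store if it is already set. */
static void toy_dst_confirm(struct toy_sock *sk)
{
    assert(sk != NULL);
    if (!READ_ONCE(sk->dst_pending_confirm))
        WRITE_ONCE(sk->dst_pending_confirm, 1);
}

/* Consume a pending confirmation; returns 1 if one was pending. */
static int toy_take_confirm(struct toy_sock *sk)
{
    assert(sk != NULL);
    if (READ_ONCE(sk->dst_pending_confirm)) {
        WRITE_ONCE(sk->dst_pending_confirm, 0);
        return 1;
    }
    return 0;
}
```

The annotations do not make the read-then-write sequence atomic. They only ask the compiler to perform each marked access as a single load or store, which is the tearing concern the commit message names.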

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2024-07-10 15:11:11 -04:00
Florian Westphal bd2a0fb2c5 tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
JIRA: https://issues.redhat.com/browse/RHEL-39833
Upstream Status: commit 94062790aedb
CVE: CVE-2024-36905

commit 94062790aedb505bdda209b10bea47b294d6394f
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 1 12:54:48 2024 +0000

    tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets

    TCP_SYN_RECV state is really special, it is only used by
    cross-syn connections, mostly used by fuzzers.

    In the following crash [1], syzbot managed to trigger a divide
    by zero in tcp_rcv_space_adjust()

    A socket makes the following state transitions,
    without ever calling tcp_init_transfer(),
    meaning tcp_init_buffer_space() is also not called.

             TCP_CLOSE
    connect()
             TCP_SYN_SENT
             TCP_SYN_RECV
    shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
             TCP_FIN_WAIT1

    To fix this issue, change tcp_shutdown() to not
    perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
    which makes no sense anyway.

    When tcp_rcv_state_process() later changes socket state
    from TCP_SYN_RECV to TCP_ESTABLISHED, then look at
    sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
    and send a FIN packet from a sane socket state.

    This means tcp_send_fin() can now be called from BH
    context, and must use GFP_ATOMIC allocations.
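
The deferral described above can be illustrated with a toy state machine. This is a sketch under stated assumptions, not the kernel implementation: all `toy_` names are hypothetical, only a subset of TCP states is modelled, and buffer initialisation is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Only the states needed for the illustration. */
enum toy_state {
    TOY_SYN_RECV,
    TOY_ESTABLISHED,
    TOY_CLOSE_WAIT,
    TOY_FIN_WAIT1,
    TOY_LAST_ACK
};

#define TOY_SEND_SHUTDOWN 2u  /* hypothetical flag recorded by shutdown() */

struct toy_sock {
    enum toy_state state;
    unsigned int shutdown;
    int fin_sent;
};

/* Send a FIN only from states where the connection is fully set up. */
static void toy_send_fin_if_allowed(struct toy_sock *sk)
{
    switch (sk->state) {
    case TOY_ESTABLISHED:
        sk->state = TOY_FIN_WAIT1;
        sk->fin_sent = 1;
        break;
    case TOY_CLOSE_WAIT:
        sk->state = TOY_LAST_ACK;
        sk->fin_sent = 1;
        break;
    default:
        /* Includes TOY_SYN_RECV: remember the request, act later. */
        break;
    }
}

/* shutdown(): always record the request, transition only if allowed. */
static void toy_shutdown(struct toy_sock *sk)
{
    assert(sk != NULL);
    sk->shutdown |= TOY_SEND_SHUTDOWN;
    toy_send_fin_if_allowed(sk);
}

/* Handshake completion: SYN_RECV -> ESTABLISHED, then honour a deferred shutdown. */
static void toy_handshake_done(struct toy_sock *sk)
{
    assert(sk != NULL);
    if (sk->state != TOY_SYN_RECV)
        return;
    /* Buffer space would be initialised here before any data path runs. */
    sk->state = TOY_ESTABLISHED;
    if (sk->shutdown & TOY_SEND_SHUTDOWN)
        toy_send_fin_if_allowed(sk);
}
```

In this model a shutdown issued in `TOY_SYN_RECV` leaves the state untouched, and the FIN goes out only once the handshake has completed.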

    [1]
    divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
    CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
     RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
    Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
    RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
    RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
    R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
    R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
    FS:  00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
    Call Trace:
     <TASK>
      tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
      tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
      inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
      sock_recvmsg_nosec net/socket.c:1046 [inline]
      sock_recvmsg+0x109/0x280 net/socket.c:1068
      ____sys_recvmsg+0x1db/0x470 net/socket.c:2803
      ___sys_recvmsg net/socket.c:2845 [inline]
      do_recvmmsg+0x474/0xae0 net/socket.c:2939
      __sys_recvmmsg net/socket.c:3018 [inline]
      __do_sys_recvmmsg net/socket.c:3041 [inline]
      __se_sys_recvmmsg net/socket.c:3034 [inline]
      __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7faeb6363db9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
    RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
    RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
    R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-06-07 15:21:41 +02:00
Scott Weaver 9ec057b2d1 Merge: tcp: Add memory barrier to tcp_push()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3921

JIRA: https://issues.redhat.com/browse/RHEL-22708
Upstream Status: linux.git

commit 7267e8dcad6b2f9fce05a6a06335d7040acbc2b6
Author: Salvatore Dipietro <dipiets@amazon.com>
Date:   Fri Jan 19 11:01:33 2024 -0800

    tcp: Add memory barrier to tcp_push()

    On CPUs with weak memory models, reads and updates performed by tcp_push
    to the sk variables can get reordered leaving the socket throttled when
    it should not. The tasklet running tcp_wfree() may also not observe the
    memory updates in time and will skip flushing any packets throttled by
    tcp_push(), delaying the sending. This can pathologically cause 40ms
    extra latency due to bad interactions with delayed acks.

    Adding a memory barrier in tcp_push removes the bug, similarly to the
    previous commit bf06200e73 ("tcp: tsq: fix nonagle handling").
    smp_mb__after_atomic() is used to avoid unnecessary overhead
    on x86, which is not affected.

    Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
    22.04 and Apache Tomcat 9.0.83 running the basic servlet below:

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class HelloWorldServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
          throws ServletException, IOException {
            response.setContentType("text/html;charset=utf-8");
            OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
            String s = "a".repeat(3096);
            osw.write(s,0,s.length());
            osw.flush();
        }
    }

    Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
    c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
    values is observed while, with the patch, the extra latency disappears.

    No patch and tcp_autocorking=1
    ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
      ...
     50.000%    0.91ms
     75.000%    1.13ms
     90.000%    1.46ms
     99.000%    1.74ms
     99.900%    1.89ms
     99.990%   41.95ms  <<< 40+ ms extra latency
     99.999%   48.32ms
    100.000%   48.96ms

    With patch and tcp_autocorking=1
    ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
      ...
     50.000%    0.90ms
     75.000%    1.13ms
     90.000%    1.45ms
     99.000%    1.72ms
     99.900%    1.83ms
     99.990%    2.11ms  <<< no 40+ ms extra latency
     99.999%    2.53ms
    100.000%    2.62ms

    The patch has also been tested on x86 (m7i.2xlarge instance), which is
    not affected by this issue, and the patch does not introduce any
    additional delay there.

    Fixes: 7aa5470c2c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
    Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Scott Weaver <scweaver@redhat.com>
2024-05-30 09:32:41 -04:00
Antoine Tenart 4aa736bc93 tcp: tcp_wfree() refactoring
JIRA: https://issues.redhat.com/browse/RHEL-22708
Upstream Status: linux.git

commit b548b17a93fd18357a5a6f535c10c1e68719ad32
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Nov 10 19:02:39 2022 +0000

    tcp: tcp_wfree() refactoring

    Use try_cmpxchg() (instead of cmpxchg()) in a more readable way.

    oval = smp_load_acquire(&sk->sk_tsq_flags);
    do {
            ...
    } while (!try_cmpxchg(&sk->sk_tsq_flags, &oval, nval));

    Reduce indentation level.
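
The same loop shape can be written in userspace with C11 atomics, where `atomic_compare_exchange_weak()` also refreshes the expected value on failure. This is a sketch only: the flag names and the `toy_claim_queue()` helper are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_THROTTLED (1UL << 0)  /* hypothetical flag bits */
#define TOY_QUEUED    (1UL << 1)

/*
 * If THROTTLED is set and QUEUED is clear, atomically clear THROTTLED and
 * set QUEUED. Returns true when this caller performed the transition.
 */
static bool toy_claim_queue(atomic_ulong *flags)
{
    assert(flags != NULL);

    unsigned long oval = atomic_load_explicit(flags, memory_order_acquire);
    unsigned long nval;

    do {
        if (!(oval & TOY_THROTTLED) || (oval & TOY_QUEUED))
            return false;
        nval = (oval & ~TOY_THROTTLED) | TOY_QUEUED;
        /* On failure oval is reloaded, so the checks above are redone. */
    } while (!atomic_compare_exchange_weak(flags, &oval, nval));

    return true;
}
```

Because the failed exchange updates `oval` in place, the loop body needs no explicit reload, which is what keeps the indentation shallow.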

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20221110190239.3531280-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-30 10:30:10 +02:00
Paolo Abeni b8b1f3ff10 tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1

Upstream commit:
commit f921a4a5bffa8a0005b190fb9421a7fc1fd716b6
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Oct 17 12:45:26 2023 +0000

    tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb

    In commit 75eefc6c59 ("tcp: tsq: add a shortcut in tcp_small_queue_check()")
    we allowed to send an skb regardless of TSQ limits being hit if rtx queue
    was empty or had a single skb, in order to better fill the pipe
    when/if TX completions were slow.

    Then later, commit 75c119afe1 ("tcp: implement rb-tree based
    retransmit queue") accidentally removed the special case for
    one skb in rtx queue.

    Stefan Wahren reported a regression in single TCP flow throughput
    using a 100Mbit fec link, starting from commit 65466904b015 ("tcp: adjust
    TSO packet sizes based on min_rtt"). This last commit only made the
    regression more visible, because it locked the TCP flow on a particular
    behavior where TSQ prevented two skbs being pushed downstream,
    adding silences on the wire between each TSO packet.

    Many thanks to Stefan for his invaluable help !

    Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
    Link: https://lore.kernel.org/netdev/7f31ddc8-9971-495e-a1f6-819df542e0af@gmx.net/
    Reported-by: Stefan Wahren <wahrenst@gmx.net>
    Tested-by: Stefan Wahren <wahrenst@gmx.net>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20231017124526.4060202-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-08 18:28:15 +02:00
Paolo Abeni f1dc7fefff net: Remove acked SYN flag from packet in the transmit queue correctly
JIRA: https://issues.redhat.com/browse/RHEL-21432
Tested: LNST, Tier1

Upstream commit:
commit f99cd56230f56c8b6b33713c5be4da5d6766be1f
Author: Dong Chenchen <dongchenchen2@huawei.com>
Date:   Sun Dec 10 10:02:00 2023 +0800

    net: Remove acked SYN flag from packet in the transmit queue correctly

    syzkaller report:

     kernel BUG at net/core/skbuff.c:3452!
     invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
     CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc4-00009-gbee0e7762ad2-dirty #135
     RIP: 0010:skb_copy_and_csum_bits (net/core/skbuff.c:3452)
     Call Trace:
     icmp_glue_bits (net/ipv4/icmp.c:357)
     __ip_append_data.isra.0 (net/ipv4/ip_output.c:1165)
     ip_append_data (net/ipv4/ip_output.c:1362 net/ipv4/ip_output.c:1341)
     icmp_push_reply (net/ipv4/icmp.c:370)
     __icmp_send (./include/net/route.h:252 net/ipv4/icmp.c:772)
     ip_fragment.constprop.0 (./include/linux/skbuff.h:1234 net/ipv4/ip_output.c:592 net/ipv4/ip_output.c:577)
     __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:295)
     ip_output (net/ipv4/ip_output.c:427)
     __ip_queue_xmit (net/ipv4/ip_output.c:535)
     __tcp_transmit_skb (net/ipv4/tcp_output.c:1462)
     __tcp_retransmit_skb (net/ipv4/tcp_output.c:3387)
     tcp_retransmit_skb (net/ipv4/tcp_output.c:3404)
     tcp_retransmit_timer (net/ipv4/tcp_timer.c:604)
     tcp_write_timer (./include/linux/spinlock.h:391 net/ipv4/tcp_timer.c:716)

    The panic was triggered by TCP simultaneous initiation.
    The initiation process is as follows:

          TCP A                                            TCP B

      1.  CLOSED                                           CLOSED

      2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...

      3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT

      4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED

      5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...

      // TCP B: does not send a challenge ack (ack rate limit or packet loss)
      // TCP A: close
            tcp_close
               tcp_send_fin
                  if (!tskb && tcp_under_memory_pressure(sk))
                      tskb = skb_rb_last(&sk->tcp_rtx_queue); //pick SYN_ACK packet
               TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;  // set FIN flag

      6.  FIN_WAIT_1  --> <SEQ=100><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ...

      // TCP B: send challenge ack to SYN_FIN_ACK

      7.               ... <SEQ=301><ACK=101><CTL=ACK>   <-- SYN-RECEIVED //challenge ack

      // TCP A:  <SND.UNA=101>

      8.  FIN_WAIT_1 --> <SEQ=101><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ... // retransmit panic

            __tcp_retransmit_skb  //skb->len=0
                tcp_trim_head
                    len = tp->snd_una - TCP_SKB_CB(skb)->seq // len=101-100
                        __pskb_trim_head
                            skb->data_len -= len // skb->len=-1, wrap around
                ... ...
                ip_fragment
                    icmp_glue_bits //BUG_ON

    If we use tcp_trim_head() to remove acked SYN from packet that contains data
    or other flags, skb->len will be incorrectly decremented. We can remove SYN
    flag that has been acked from rtx_queue earlier than tcp_trim_head(), which
    can fix the problem mentioned above.
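
The length wrap-around in step 8 can be reproduced with unsigned arithmetic. This is a toy model and not the kernel code: `struct toy_skb`, the flag values and both helpers are hypothetical, and `len` stands for payload bytes only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FLAG_FIN 0x01u  /* hypothetical flag values */
#define TOY_FLAG_SYN 0x02u

struct toy_skb {
    uint32_t seq;    /* first sequence number covered by the segment */
    uint32_t len;    /* payload bytes; SYN and FIN carry none */
    uint8_t  flags;
};

/* Naive trim: treat every acked sequence number as a payload byte. */
static uint32_t toy_len_after_naive_trim(const struct toy_skb *skb, uint32_t snd_una)
{
    assert(skb != NULL);
    return skb->len - (snd_una - skb->seq);
}

/*
 * Fixed idea: an acked SYN consumes one sequence number but no payload,
 * so drop the flag and advance seq before computing the trim length.
 */
static uint32_t toy_len_after_fixed_trim(struct toy_skb *skb, uint32_t snd_una)
{
    assert(skb != NULL);
    if ((int32_t)(skb->seq - snd_una) < 0 && (skb->flags & TOY_FLAG_SYN)) {
        skb->flags &= (uint8_t)~TOY_FLAG_SYN;
        skb->seq++;
    }
    return skb->len - (snd_una - skb->seq);
}
```

With the values from the trace (seq 100, no payload, snd_una 101), the naive helper wraps to `UINT32_MAX`, while the fixed helper leaves the length at 0 and the segment starting at 101 with only FIN set.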

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Co-developed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
    Link: https://lore.kernel.org/r/20231210020200.1539875-1-dongchenchen2@huawei.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 16:31:32 +01:00
Jan Stancek 3c8d3e2d4a Merge: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3301

JIRA: https://issues.redhat.com/browse/RHEL-11592

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date:   Sun Jun 11 22:05:24 2023 -0500

    tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

    Under certain circumstances, the tcp receive buffer memory limit
    set by autotuning (sk_rcvbuf) is increased due to incoming data
    packets as a result of the window not closing when it should be.
    This can result in the receive buffer growing all the way up to
    tcp_rmem[2], even for tcp sessions with a low BDP.

    To reproduce:  Connect a TCP session with the receiver doing
    nothing and the sender sending small packets (an infinite loop
    of socket send() with 4 bytes of payload with a sleep of 1 ms
    in between each send()).  This will cause the tcp receive buffer
    to grow all the way up to tcp_rmem[2].

    As a result, a host can have individual tcp sessions with receive
    buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
    limits, causing the host to go into tcp memory pressure mode.

    The fundamental issue is the relationship between the granularity
    of the window scaling factor and the number of bytes ACKed back
    to the sender.  This problem has previously been identified in
    RFC 7323, appendix F [1].
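
The granularity problem can be shown with a small arithmetic sketch. The helper names and the example numbers (window scale 7, a 1280-byte window, 4-byte ACKs) are assumptions for illustration and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Right edge seen by the sender: ack + (window field << wscale). */
static uint32_t toy_right_edge(uint32_t ack, uint16_t field, unsigned int wscale)
{
    assert(wscale <= 14);  /* maximum shift allowed by RFC 7323 */
    return ack + ((uint32_t)field << wscale);
}

/*
 * Smallest window field that does not move the right edge to the left.
 * The field can only express multiples of 2^wscale, so it rounds up.
 */
static uint16_t toy_field_never_shrink(uint32_t ack, uint32_t old_edge,
                                       unsigned int wscale)
{
    assert(wscale <= 14);
    uint32_t need = old_edge - ack;
    uint32_t unit = 1u << wscale;

    return (uint16_t)((need + unit - 1) >> wscale);
}
```

Starting from ack 0 with field 10 and scale 7, the right edge is 1280. After 4 bytes are ACKed while the application reads nothing, the never-shrink rule still requires field 10, which puts the right edge at 1284. In this model the edge advances by the ACKed amount on every such ACK, even though no buffer space was freed.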

    The Linux kernel currently adheres to never shrinking the window.

    In addition to the overallocation of memory mentioned above, the
    current behavior is functionally incorrect, because once tcp_rmem[2]
    is reached when no remediations remain (i.e. tcp collapse fails to
    free up any more memory and there are no packets to prune from the
    out-of-order queue), the receiver will drop in-window packets
    resulting in retransmissions and an eventual timeout of the tcp
    session.  A receive buffer full condition should instead result
    in a zero window and an indefinite wait.

    In practice, this problem is largely hidden for most flows.  It
    is not applicable to mice flows.  Elephant flows can send data
    fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
    triggering a zero window.

    But this problem does show up for other types of flows.  Examples
    are websockets and other types of flows that send small amounts of
    data spaced apart slightly in time.  In these cases, we directly
    encounter the problem described in [1].

    RFC 7323, section 2.4 [2], says there are instances when a retracted
    window can be offered, and that TCP implementations MUST ensure
    that they handle a shrinking window, as specified in RFC 1122,
    section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
    management have made clear that the sender must accept a shrunk window
    from the receiver, including RFC 793 [4] and RFC 1323 [5].

    This patch implements the functionality to shrink the tcp window
    when necessary to keep the right edge within the memory limit set by
    autotuning (sk_rcvbuf).  This new functionality is enabled with
    the new sysctl: net.ipv4.tcp_shrink_window

    Additional information can be found at:
    https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

    [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
    [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
    [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
    [4] https://www.rfc-editor.org/rfc/rfc793
    [5] https://www.rfc-editor.org/rfc/rfc1323

    Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:41 +01:00
Felix Maurer 130ad87ddc tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
JIRA: https://issues.redhat.com/browse/RHEL-11592
Conflicts:
- net/ipv4/sysctl_net_ipv4.c: context difference due to missing new sysctls
- net/ipv4/tcp_ipv4.c: context difference due to missing ccce324dabfe
  ("tcp: make the first N SYN RTO backoffs linear") and 37ba017dcc3b
  ("ipv4/tcp: do not use per netns ctl sockets")

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date:   Sun Jun 11 22:05:24 2023 -0500

    tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

    Under certain circumstances, the tcp receive buffer memory limit
    set by autotuning (sk_rcvbuf) is increased due to incoming data
    packets as a result of the window not closing when it should be.
    This can result in the receive buffer growing all the way up to
    tcp_rmem[2], even for tcp sessions with a low BDP.

    To reproduce:  Connect a TCP session with the receiver doing
    nothing and the sender sending small packets (an infinite loop
    of socket send() with 4 bytes of payload with a sleep of 1 ms
    in between each send()).  This will cause the tcp receive buffer
    to grow all the way up to tcp_rmem[2].
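
A minimal sketch of the sender side of this reproducer, in Python. The function name `send_small_packets`, its parameters, and the host and port in the usage comment are hypothetical; the original reproducer's code is not part of this commit message. The receiver is assumed to accept the connection and never read from it.

```python
import socket
import time


def send_small_packets(sock, count=None, payload=b"abcd", interval_s=0.001):
    """Send `payload` repeatedly, sleeping `interval_s` seconds between sends.

    count=None loops forever, as in the description above; a finite count
    is only a convenience for trying the function out.
    """
    sent = 0
    while count is None or sent < count:
        sock.sendall(payload)   # 4 bytes of payload per send by default
        sent += 1
        time.sleep(interval_s)  # roughly 1 ms between sends
    return sent


# Hypothetical usage against a receiver that accepts and then never reads:
#   with socket.create_connection(("receiver.example", 5001)) as s:
#       send_small_packets(s)
```

On the receiving host, the socket's receive buffer can then be watched (for example with `ss -tm`) to see whether it grows toward tcp_rmem[2].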

    As a result, a host can have individual tcp sessions with receive
    buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
    limits, causing the host to go into tcp memory pressure mode.

    The fundamental issue is the relationship between the granularity
    of the window scaling factor and the number of bytes ACKed back
    to the sender.  This problem has previously been identified in
    RFC 7323, appendix F [1].
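
The granularity problem can be illustrated with a small arithmetic model. This is a simplified sketch, not the kernel's window selection code: the function names are hypothetical, the 16-bit limit of the window field is ignored, and the numbers (a scale factor of 7, an 8192-byte window, a 4-byte segment) are chosen only for illustration.

```python
def representable_windows(window, wscale):
    """Return the nearest windows (rounded down, rounded up) that can be
    advertised when the window field counts units of 2**wscale bytes."""
    unit = 1 << wscale
    down = (window // unit) * unit
    up = down if down == window else down + unit
    return down, up


def right_edges_after_receive(rcv_nxt, right_edge, nbytes, wscale):
    """Receive nbytes that the application does not read, then return the
    right edge (rcv_nxt + window) for each rounding choice."""
    rcv_nxt += nbytes
    ideal_window = right_edge - rcv_nxt  # would keep the right edge in place
    down, up = representable_windows(ideal_window, wscale)
    return rcv_nxt + down, rcv_nxt + up


# wscale=7 means the window is advertised in units of 128 bytes.
shrunk, advanced = right_edges_after_receive(
    rcv_nxt=0, right_edge=8192, nbytes=4, wscale=7
)
print(shrunk, advanced)  # 8068 8196
```

In this model the window that would keep the right edge fixed (8188 bytes) is not representable. Rounding down retracts the right edge by 124 bytes; rounding up advances it by 4 bytes even though no buffer space was freed. Repeated over many small segments, always rounding up lets the right edge creep forward.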

    The Linux kernel currently adheres to never shrinking the window.

    In addition to the overallocation of memory mentioned above, the
    current behavior is functionally incorrect, because once tcp_rmem[2]
    is reached when no remediations remain (i.e. tcp collapse fails to
    free up any more memory and there are no packets to prune from the
    out-of-order queue), the receiver will drop in-window packets
    resulting in retransmissions and an eventual timeout of the tcp
    session.  A receive buffer full condition should instead result
    in a zero window and an indefinite wait.

    In practice, this problem is largely hidden for most flows.  It
    is not applicable to mice flows.  Elephant flows can send data
    fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
    triggering a zero window.

    But this problem does show up for other types of flows.  Examples
    are websockets and other types of flows that send small amounts of
    data spaced apart slightly in time.  In these cases, we directly
    encounter the problem described in [1].

    RFC 7323, section 2.4 [2], says there are instances when a retracted
    window can be offered, and that TCP implementations MUST ensure
    that they handle a shrinking window, as specified in RFC 1122,
    section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
    management have made clear that the sender must accept a shrunk window
    from the receiver, including RFC 793 [4] and RFC 1323 [5].

    This patch implements the functionality to shrink the tcp window
    when necessary to keep the right edge within the memory limit set by
    autotuning (sk_rcvbuf).  This new functionality is enabled with
    the new sysctl: net.ipv4.tcp_shrink_window

    Additional information can be found at:
    https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

    [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
    [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
    [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
    [4] https://www.rfc-editor.org/rfc/rfc793
    [5] https://www.rfc-editor.org/rfc/rfc1323

    Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-31 16:20:06 +01:00
Felix Maurer 2261d33599 tcp: adjust rcv_ssthresh according to sk_reserved_mem
JIRA: https://issues.redhat.com/browse/RHEL-11592
Conflicts:
- net/ipv4/tcp_input.c: context difference due to missing 240bfd134c59
  ("tcp: tweak len/truesize ratio for coalesce candidates")

commit 053f368412c9a7bfce2befec8c795113c8cfb0b1
Author: Wei Wang <weiwan@google.com>
Date:   Wed Sep 29 10:25:13 2021 -0700

    tcp: adjust rcv_ssthresh according to sk_reserved_mem

    When user sets SO_RESERVE_MEM socket option, in order to utilize the
    reserved memory when in memory pressure state, we adjust rcv_ssthresh
    according to the available reserved memory for the socket, instead of
    using 4 * advmss always.

    Signed-off-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-31 16:20:06 +01:00
Paolo Abeni bdc0298e9a tcp: fix quick-ack counting to count actual ACKs of new data
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1
Conflicts: different context, as rhel lacks the upstream commit \
  03b123debcbc ("tcp: tcp_enter_quickack_mode() should be static")

Upstream commit:
commit 059217c18be6757b95bfd77ba53fb50b48b8a816
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Oct 1 11:12:38 2023 -0400

    tcp: fix quick-ack counting to count actual ACKs of new data

    This commit fixes quick-ack counting so that it only considers that a
    quick-ack has been provided if we are sending an ACK that newly
    acknowledges data.

    The code was erroneously using the number of data segments in outgoing
    skbs when deciding how many quick-ack credits to remove. This logic
    does not make sense, and could cause poor performance in
    request-response workloads, like RPC traffic, where requests or
    responses can be multi-segment skbs.

    When a TCP connection decides to send N quick-acks, that is to
    accelerate the cwnd growth of the congestion control module
    controlling the remote endpoint of the TCP connection. That quick-ack
    decision is purely about the incoming data and outgoing ACKs. It has
    nothing to do with the outgoing data or the size of outgoing data.

    And in particular, an ACK only serves the intended purpose of allowing
    the remote congestion control to grow the congestion window quickly if
    the ACK is ACKing or SACKing new data.

    The fix is simple: only count packets as serving the goal of the
    quickack mechanism if they are ACKing/SACKing new data. We can tell
    whether this is the case by checking inet_csk_ack_scheduled(), since
    we schedule an ACK exactly when we are ACKing/SACKing new data.
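
The accounting rule described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the kernel code: `QuickAckState` and `on_packet_sent` are hypothetical names, `ack_scheduled` stands in for the result of inet_csk_ack_scheduled(), and clearing it after a send is a simplification of how the stack tracks pending ACKs.

```python
from dataclasses import dataclass


@dataclass
class QuickAckState:
    quick: int           # remaining quick-ack credits
    ack_scheduled: bool  # stand-in for inet_csk_ack_scheduled()


def on_packet_sent(state, data_segments):
    """Use one credit per sent packet, and only if an ACK was scheduled,
    i.e. the packet ACKs/SACKs new data.

    data_segments is deliberately ignored; per the description above, the
    old logic used it to decide how many credits to remove.
    """
    if state.ack_scheduled:
        state.quick = max(state.quick - 1, 0)
        state.ack_scheduled = False  # model: the outgoing packet carried the ACK
    return state.quick


state = QuickAckState(quick=16, ack_scheduled=False)
# A 10-segment response that does not acknowledge new data costs nothing.
on_packet_sent(state, data_segments=10)
# A packet sent while an ACK is scheduled costs exactly one credit.
state.ack_scheduled = True
on_packet_sent(state, data_segments=10)
print(state.quick)  # 15
```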

    Fixes: fc6415bcb0 ("[TCP]: Fix quick-ack decrementing with TSO.")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:53:04 +02:00
Paolo Abeni 8d38177438 tcp: tcp_make_synack() can be called from process context
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217511
Tested: LNST, Tier1

Upstream commit:
commit bced3f7db95ff2e6ca29dc4d1c9751ab5e736a09
Author: Breno Leitao <leitao@debian.org>
Date:   Wed Mar 8 11:07:45 2023 -0800

    tcp: tcp_make_synack() can be called from process context

    tcp_rtx_synack() can now be called in process context as explained in
    0a375c822497 ("tcp: tcp_rtx_synack() can be called from process
    context").

    tcp_rtx_synack() might call tcp_make_synack(), which will touch per-CPU
    variables with preemption enabled. This causes the following BUG:

        BUG: using __this_cpu_add() in preemptible [00000000] code: ThriftIO1/5464
        caller is tcp_make_synack+0x841/0xac0
        Call Trace:
         <TASK>
         dump_stack_lvl+0x10d/0x1a0
         check_preemption_disabled+0x104/0x110
         tcp_make_synack+0x841/0xac0
         tcp_v6_send_synack+0x5c/0x450
         tcp_rtx_synack+0xeb/0x1f0
         inet_rtx_syn_ack+0x34/0x60
         tcp_check_req+0x3af/0x9e0
         tcp_rcv_state_process+0x59b/0x2030
         tcp_v6_do_rcv+0x5f5/0x700
         release_sock+0x3a/0xf0
         tcp_sendmsg+0x33/0x40
         ____sys_sendmsg+0x2f2/0x490
         __sys_sendmsg+0x184/0x230
         do_syscall_64+0x3d/0x90

    Avoid calling __TCP_INC_STATS(), which will touch per-cpu variables. Use
    TCP_INC_STATS(), which is safe to call from process context.

    Fixes: 8336886f78 ("tcp: TCP Fast Open Server - support TFO listeners")
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230308190745.780221-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 15:43:41 +02:00
Jeff Moyer d19688b83d net: avoid double accounting for pure zerocopy skbs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 9b65b17db72313b7a4fe9bc9502928c88be57986
Author: Talal Ahmad <talalahmad@google.com>
Date:   Tue Nov 2 22:58:44 2021 -0400

    net: avoid double accounting for pure zerocopy skbs
    
    Track skbs containing only zerocopy data and avoid charging them to
    kernel memory to correctly account the memory utilization for
    msg_zerocopy. All of the data in such skbs is held in user pages which
    are already accounted to user. Before this change, they are charged
    again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
    excessive because data is not being copied into skb frags. This
    excessive charging can lead to kernel going into memory pressure
    state which impacts all sockets in the system adversely. Mark pure
    zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
    charge/uncharge for data in such skbs.
    
    Initially, an skb is marked pure zerocopy when it is empty and in
    zerocopy path. skb can then change from a pure zerocopy skb to mixed
    data skb (zerocopy and copy data) if it is at tail of write queue and
    there is room available in it and non-zerocopy data is being sent in
    the next sendmsg call. At this time sk_mem_charge is done for the pure
    zerocopied data and the pure zerocopy flag is unmarked. We found that
    this happens very rarely on workloads that pass MSG_ZEROCOPY.
    
    A pure zerocopy skb can later be coalesced into normal skb if they are
    next to each other in queue but this patch prevents coalescing from
    happening. This avoids complexity of charging when skb downgrades from
    pure zerocopy to mixed. This is also rare.
    
    In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
    for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
    tcp_skb_entail for an skb without data.
    
    Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
    with zerocopy showed that before this patch the 'sock' variable in
    memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
    sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
    change it is 0. This is due to no charge to sk_forward_alloc for
    zerocopy data and shows memory utilization for kernel is lowered.
    
    With this commit we don't see the warning we saw in previous commit
    which resulted in commit 84882cf72cd774cf16fd338bdbf00f69ac9f9194.
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:03:02 -04:00
Jeff Moyer 474b0b4e6c tcp: rename sk_wmem_free_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 03271f3a3594c0e88f68d8cfbec0ba250b2c538a
Author: Talal Ahmad <talalahmad@google.com>
Date:   Fri Oct 29 22:05:41 2021 -0400

    tcp: rename sk_wmem_free_skb
    
    sk_wmem_free_skb() is only used by TCP.
    
    Rename it to make this clear, and move its declaration to
    include/net/tcp.h
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:02:02 -04:00
Guillaume Nault 7def07a992 tcp: Fix a data-race around sysctl_tcp_min_tso_segs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit e0bb4ab9dfddd872622239f49fb2bd403b70853b
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:22 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_min_tso_segs.

    While reading sysctl_tcp_min_tso_segs, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 95bd09eb27 ("tcp: TSO packets automatic sizing")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault a87822cdb7 tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 9fb90193fbd66b4c5409ef729fd081861f8b6351
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:20 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.

    While reading sysctl_tcp_limit_output_bytes, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 46d3ceabd8 ("tcp: TCP Small Queues")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault b453a54187 tcp: Fix a data-race around sysctl_tcp_retrans_collapse.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 1a63cb91f0c2fcdeced6d6edee8d1d886583d139
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:49 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_retrans_collapse.

    While reading sysctl_tcp_retrans_collapse, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault d895c95ec3 tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 4845b5713ab18a1bb6e31d1fbb4d600240b8b691
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:48 2022 -0700

    tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.

    While reading sysctl_tcp_slow_start_after_idle, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: 35089bb203 ("[TCP]: Add tcp_slow_start_after_idle sysctl.")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 346b7c48cc tcp: Fix a data-race around sysctl_tcp_early_retrans.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 52e65865deb6a36718a463030500f16530eaab74
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:45 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_early_retrans.

    While reading sysctl_tcp_early_retrans, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: eed530b6c6 ("tcp: early retransmit")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:13 +01:00
Guillaume Nault afe98b9c8e tcp: Fix data-races around sysctl knobs related to SYN option.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 3666f666e99600518ab20982af04a078bbdad277
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:44 2022 -0700

    tcp: Fix data-races around sysctl knobs related to SYN option.

    While reading these knobs, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - tcp_sack
      - tcp_window_scaling
      - tcp_timestamps

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:13 +01:00
Guillaume Nault 3b5433cc57 tcp: Fix data-races around some timeout sysctl knobs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 39e24435a776e9de5c6dd188836cf2523547804b
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:50 2022 -0700

    tcp: Fix data-races around some timeout sysctl knobs.

    While reading these sysctl knobs, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - tcp_retries1
      - tcp_retries2
      - tcp_orphan_retries
      - tcp_fin_timeout

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:58 +01:00
Guillaume Nault 6f5b891c69 tcp: Fix a data-race around sysctl_tcp_probe_interval.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 2a85388f1d94a9f8b5a529118a2c5eaa0520d85c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 13:52:05 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_probe_interval.

    While reading sysctl_tcp_probe_interval, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 05cbc0db03 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:56 +01:00
Guillaume Nault 79ef98893f tcp: Fix a data-race around sysctl_tcp_probe_threshold.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 92c0aa4175474483d6cf373314343d4e624e882a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 13:52:04 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_probe_threshold.

    While reading sysctl_tcp_probe_threshold, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 6b58e0a5f3 ("ipv4: Use binary search to choose tcp PMTU probe_size")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:56 +01:00
Guillaume Nault 335253285f tcp: Fix data-races around sysctl_tcp_min_snd_mss.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 78eb166cdefcc3221c8c7c1e2d514e91a2eb5014
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 13:52:02 2022 -0700

    tcp: Fix data-races around sysctl_tcp_min_snd_mss.

    While reading sysctl_tcp_min_snd_mss, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 5f3e2bf008 ("tcp: add tcp_min_snd_mss sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:55 +01:00
Guillaume Nault f2b719c89b tcp: Fix data-races around sysctl_tcp_base_mss.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 88d78bc097cd8ebc6541e93316c9d9bf651b13e8
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 13:52:01 2022 -0700

    tcp: Fix data-races around sysctl_tcp_base_mss.

    While reading sysctl_tcp_base_mss, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 5d424d5a67 ("[TCP]: MTU probing")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:55 +01:00
Guillaume Nault 98bf3bfc65 tcp: Fix data-races around sysctl_tcp_mtu_probing.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit f47d00e077e7d61baf69e46dde3210c886360207
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 13:52:00 2022 -0700

    tcp: Fix data-races around sysctl_tcp_mtu_probing.

    While reading sysctl_tcp_mtu_probing, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 5d424d5a67 ("[TCP]: MTU probing")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:55 +01:00
Guillaume Nault 2d93db2fd4 tcp: Fix a data-race around sysctl_tcp_ecn_fallback.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 12b8d9ca7e678abc48195294494f1815b555d658
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 11 17:15:31 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_ecn_fallback.

    While reading sysctl_tcp_ecn_fallback, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 492135557d ("tcp: add rfc3168, section 6.1.1.1. fallback")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:52 +01:00
Guillaume Nault c8af159735 tcp: Fix data-races around sysctl_tcp_ecn.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 4785a66702f086cf2ea84bdbe6ec921f274bd9f2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 11 17:15:30 2022 -0700

    tcp: Fix data-races around sysctl_tcp_ecn.

    While reading sysctl_tcp_ecn, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:52 +01:00
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Frantisek Hrbata 27a89b8946 Merge: tcp: BIG TCP implementation
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1560

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using netperf and veth driver. Results meet the assumptions. See https://bugzilla.redhat.com/show_bug.cgi?id=2139501#c1

The series introduces support for BIG TCP.

- Patch 1-2: Preliminary dependencies
- Patch 3-14: Commits from upstream series 7fa2e481ff2f ("Merge branch 'big-tcp'", 2022-05-16)
- Patch 15-19: Follow-ups

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-15 07:30:55 -05:00
Frantisek Hrbata e265d68e77 Merge: tcp: phase-1 backports for RHEL-9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1504

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: All mainline in net-next.git.
Tested: boot-tested only
Conflicts: see individual patches

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 02:40:21 -05:00
Davide Caratti 3b7dfba048 tcp: fix over estimation in sk_forced_mem_schedule()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit c4ee118561a0

commit c4ee118561a0f74442439b7b5b486db1ac1ddfeb
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 14 10:17:33 2022 -0700

    tcp: fix over estimation in sk_forced_mem_schedule()

    sk_forced_mem_schedule() has a bug similar to ones fixed
    in commit 7c80b038d23e ("net: fix sk_wmem_schedule() and
    sk_rmem_schedule() errors")

    While this bug has little chance to trigger in old kernels,
    we need to fix it before the following patch.

    Fixes: d83769a580 ("tcp: fix possible deadlock in tcp_send_fin()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Reviewed-by: Wei Wang <weiwan@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:11:00 +01:00
Davide Caratti a15f6e1ab8 tcp: Fix data-races around sysctl_tcp_workaround_signed_windows.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 0f1e4d06591d

commit 0f1e4d06591d0a7907c71f7b6d1c79f8a4de8098
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:19 2022 -0700

    tcp: Fix data-races around sysctl_tcp_workaround_signed_windows.

    While reading sysctl_tcp_workaround_signed_windows, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: 15d99e02ba ("[TCP]: sysctl to allow TCP window > 32767 sans wscale")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:58 +01:00
Davide Caratti 543f426b27 net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 100fdd1faf50

commit 100fdd1faf50557558e2911af4be32e515cb8036
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:07 2022 -0700

    net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT

    Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.

    This might change in the future, but it seems better to avoid the
    confusion.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti e451125385 Revert "tcp: change pingpong threshold to 3"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit 4d8f24eeedc5

commit 4d8f24eeedc58d5f87b650ddda73c16e8ba56559
Author: Wei Wang <weiwan@google.com>
Date:   Thu Jul 21 20:44:04 2022 +0000

    Revert "tcp: change pingpong threshold to 3"

    This reverts commit 4a41f453be.

    This to-be-reverted commit was meant to apply a stricter rule for the
    stack to enter pingpong mode. However, the condition used to check for
    interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
    jiffy based and might be too coarse, which delays the stack entering
    pingpong mode.
    We revert this patch so that we no longer use the above condition to
    determine interactive session, and also reduce pingpong threshold to 1.

    Fixes: 4a41f453be ("tcp: change pingpong threshold to 3")
    Reported-by: LemmyHuang <hlm3280@163.com>
    Suggested-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Wei Wang <weiwan@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220721204404.388396-1-weiwan@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:19:56 +01:00
Davide Caratti 7faf5c7e58 tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit f4ce91ce12a7

commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Sep 28 16:03:31 2022 -0400

    tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited

    This commit fixes a bug in the tracking of max_packets_out and
    is_cwnd_limited. This bug can cause the connection to fail to remember
    that is_cwnd_limited is true, causing the connection to fail to grow
    cwnd when it should, causing throughput to be lower than it should be.

    The following event sequence is an example that triggers the bug:

    (a) The connection is cwnd_limited, but packets_out is not at its
         peak due to TSO deferral deciding not to send another skb yet.
         In such cases the connection can advance max_packets_seq and set
         tp->is_cwnd_limited to true and max_packets_out to a small
         number.

    (b) Then later in the round trip the connection is pacing-limited (not
         cwnd-limited), and packets_out is larger. In such cases the
         connection would raise max_packets_out to a bigger number but
         (unexpectedly) flip tp->is_cwnd_limited from true to false.

    This commit fixes that bug.

    One straightforward fix would be to separately track (a) the next
    window after max_packets_out reaches a maximum, and (b) the next
    window after tp->is_cwnd_limited is set to true. But this would
    require consuming an extra u32 sequence number.

    Instead, to save space we track only the most important
    information. Specifically, we track the strongest available signal of
    the degree to which the cwnd is fully utilized:

    (1) If the connection is cwnd-limited then we remember that fact for
    the current window.

    (2) If the connection is not cwnd-limited then we track the maximum
    number of outstanding packets in the current window.

    In particular, note that the new logic cannot trigger the buggy
    (a)/(b) sequence above because with the new logic a condition where
    tp->packets_out > tp->max_packets_out can only trigger an update of
    tp->is_cwnd_limited if tp->is_cwnd_limited is false.
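A minimal userspace sketch of the update rule described above, not the kernel code itself. The struct, the function names, and the `usage_seq` field are hypothetical stand-ins for the corresponding tcp_sock state; only the three-way condition is taken from the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the few tcp_sock fields involved. */
struct cwnd_usage {
    uint32_t snd_una;          /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;          /* next sequence number to send */
    uint32_t packets_out;      /* packets currently in flight */
    uint32_t max_packets_out;  /* max packets_out seen in this window */
    uint32_t usage_seq;        /* right edge of the tracking window (assumed name) */
    bool is_cwnd_limited;      /* remembered for the current window */
};

/* Wraparound-safe "a is before b" on 32-bit sequence numbers. */
bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Record the strongest available signal of cwnd utilization. */
void cwnd_usage_update(struct cwnd_usage *tp, bool is_cwnd_limited)
{
    if (!seq_before(tp->snd_una, tp->usage_seq) ||  /* a new window has started */
        is_cwnd_limited ||                          /* rule (1) */
        (!tp->is_cwnd_limited &&                    /* rule (2) */
         tp->packets_out > tp->max_packets_out)) {
        tp->is_cwnd_limited = is_cwnd_limited;
        tp->max_packets_out = tp->packets_out;
        tp->usage_seq = tp->snd_nxt;
    }
}
```

In this model a larger packets_out cannot clear is_cwnd_limited within the same window, because the third clause only fires when the flag is already false.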

    This first showed up in testing of a BBRv2 dev branch, but this
    buggy behavior highlighted a general issue with the
    tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
    the proper rate for any TCP congestion control, including Reno or
    CUBIC.

    Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:19:56 +01:00
Davide Caratti 36fb5aea7d tcp: make retransmitted SKB fit into the send window
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit 536a6c8e05f9

commit 536a6c8e05f95e3d1118c40ae8b3022ee2d05d52
Author: Yonglong Li <liyonglong@chinatelecom.cn>
Date:   Mon Jul 11 17:47:18 2022 +0800

    tcp: make retransmitted SKB fit into the send window

    The current code of __tcp_retransmit_skb() only checks that
    TCP_SKB_CB(skb)->seq is inside the send window, so
    TCP_SKB_CB(skb)->end_seq may be outside it. If the receiver has shrunk
    its window and part of the skb is outside the new window, the sender
    should retransmit a smaller portion of the payload.

    test packetdrill script:
        0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
       +0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
       +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0

       +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
       +0 > S 0:0(0)  win 65535 <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8>
     +.05 < S. 0:0(0) ack 1 win 6000 <mss 1000,nop,nop,sackOK>
       +0 > . 1:1(0) ack 1

       +0 write(3, ..., 10000) = 10000

       +0 > . 1:2001(2000) ack 1 win 65535
       +0 > . 2001:4001(2000) ack 1 win 65535
       +0 > . 4001:6001(2000) ack 1 win 65535

     +.05 < . 1:1(0) ack 4001 win 1001

    and tcpdump shows:
    192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 1:2001, ack 1, win 65535, length 2000
    192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 2001:4001, ack 1, win 65535, length 2000
    192.168.226.67.55 > 192.0.2.1.8080: Flags [P.], seq 4001:5001, ack 1, win 65535, length 1000
    192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 5001:6001, ack 1, win 65535, length 1000
    192.0.2.1.8080 > 192.168.226.67.55: Flags [.], ack 4001, win 1001, length 0
    192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 5001:6001, ack 1, win 65535, length 1000
    192.168.226.67.55 > 192.0.2.1.8080: Flags [P.], seq 4001:5001, ack 1, win 65535, length 1000

    When the client retracts its window to 1001, the send window is
    [4001,5002], but TLP sends the 5001-6001 packet, which is outside the
    send window.
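The trimming idea can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch: the function name and parameters are hypothetical, mss is assumed to be nonzero, and the zero-window-probe special case is deliberately left out.

```c
#include <stdint.h>

/* Return how many payload bytes of the segment [seq, end_seq) may be
 * retransmitted without crossing the right edge of the send window.
 * Returns 0 when the segment starts at or beyond that edge.
 */
uint32_t retrans_len_in_window(uint32_t seq, uint32_t end_seq,
                               uint32_t snd_una, uint32_t snd_wnd,
                               uint32_t mss)
{
    uint32_t wnd_end = snd_una + snd_wnd;  /* right edge of the send window */
    uint32_t len = end_seq - seq;          /* full payload length */
    uint32_t avail;

    if ((int32_t)(wnd_end - seq) <= 0)
        return 0;                          /* nothing fits */

    avail = wnd_end - seq;
    if (len > avail) {
        /* Prefer whole segments; fall back to the raw remainder. */
        len = avail - (avail % mss);
        if (len == 0)
            len = avail;
    }
    return len;
}
```

With the trace above (ack 4001, win 1001, mss 1000), the 4001:5001 segment fits whole, while only 1 byte of 5001:6001 lies inside the window.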

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/1657532838-20200-1-git-send-email-liyonglong@chinatelecom.cn
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:10:25 +01:00
Davide Caratti accc31efc1 tcp: tcp_rtx_synack() can be called from process context
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit 0a375c822497

commit 0a375c822497ed6ad6b5da0792a12a6f1af10c0b
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon May 30 14:37:13 2022 -0700

    tcp: tcp_rtx_synack() can be called from process context

    Laurent reported the enclosed report [1]

    This bug triggers with the following conditions:

    0) Kernel built with CONFIG_DEBUG_PREEMPT=y

    1) A new passive FastOpen TCP socket is created.
       This FO socket waits for an ACK coming from client to be a complete
       ESTABLISHED one.
    2) A socket operation on this socket goes through lock_sock()
       release_sock() dance.
    3) While the socket is owned by the user in step 2),
       a retransmit of the SYN is received and stored in socket backlog.
    4) At release_sock() time, the socket backlog is processed while
       in process context.
    5) A SYNACK packet is cooked in response to the SYN retransmit.
    6) -> tcp_rtx_synack() is called in process context.

    Before blamed commit, tcp_rtx_synack() was always called from BH handler,
    from a timer handler.

    Fix this by using TCP_INC_STATS() & NET_INC_STATS()
    which do not assume caller is in non preemptible context.

    [1]
    BUG: using __this_cpu_add() in preemptible [00000000] code: epollpep/2180
    caller is tcp_rtx_synack.part.0+0x36/0xc0
    CPU: 10 PID: 2180 Comm: epollpep Tainted: G           OE     5.16.0-0.bpo.4-amd64 #1  Debian 5.16.12-1~bpo11+1
    Hardware name: Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.7 11/23/2021
    Call Trace:
     <TASK>
     dump_stack_lvl+0x48/0x5e
     check_preemption_disabled+0xde/0xe0
     tcp_rtx_synack.part.0+0x36/0xc0
     tcp_rtx_synack+0x8d/0xa0
     ? kmem_cache_alloc+0x2e0/0x3e0
     ? apparmor_file_alloc_security+0x3b/0x1f0
     inet_rtx_syn_ack+0x16/0x30
     tcp_check_req+0x367/0x610
     tcp_rcv_state_process+0x91/0xf60
     ? get_nohz_timer_target+0x18/0x1a0
     ? lock_timer_base+0x61/0x80
     ? preempt_count_add+0x68/0xa0
     tcp_v4_do_rcv+0xbd/0x270
     __release_sock+0x6d/0xb0
     release_sock+0x2b/0x90
     sock_setsockopt+0x138/0x1140
     ? __sys_getsockname+0x7e/0xc0
     ? aa_sk_perm+0x3e/0x1a0
     __sys_setsockopt+0x198/0x1e0
     __x64_sys_setsockopt+0x21/0x30
     do_syscall_64+0x38/0xc0
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20220530213713.601888-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:10:24 +01:00
Ivan Vecera d513603ec1 net: allow gso_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 7c4e983c4f3cf94fcd879730c6caa877e0768a4d
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:33:57 2022 -0700

    net: allow gso_max_size to exceed 65536

    The code for gso_max_size was added originally to allow for debugging and
    workaround of buggy devices that couldn't support TSO with blocks 64K in
    size. The original reason for limiting it to 64K was because that was the
    existing limits of IPv4 and non-jumbogram IPv6 length fields.

    With the addition of Big TCP we can remove this limit and allow the value
    to potentially go up to UINT_MAX and instead be limited by the tso_max_size
    value.

    So in order to support this we need to go through and clean up the
    remaining users of the gso_max_size value so that the values will cap at
    64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
    so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
    limit for GSO_MAX_SIZE.

    v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                   in a new sk_trim_gso_size() helper.
                   netif_set_tso_max_size() caps the requested TSO size
                   with GSO_MAX_SIZE.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:52 +01:00
Ivan Vecera 9a3d61a7ce net: Adjust sk_gso_max_size once when set
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit ab14f1802cfb2d7ca120bbf48e3ba6712314ffc3
Author: David Ahern <dsahern@kernel.org>
Date:   Mon Jan 24 19:45:11 2022 -0700

    net: Adjust sk_gso_max_size once when set

    sk_gso_max_size is set based on the dst dev. Both users of it
    adjust the value by the same offset - (MAX_TCP_HEADER + 1). Rather
    than compute the same adjusted value on each call do the adjustment
    once when set.
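A small sketch of the "adjust once when set" pattern, in userspace. The struct, the function names, and the header constant are hypothetical; the real MAX_TCP_HEADER value depends on kernel configuration.

```c
#include <stdint.h>

/* Hypothetical stand-in; the kernel's MAX_TCP_HEADER is config-dependent. */
#define MODEL_MAX_TCP_HEADER 320u

struct sock_model {
    uint32_t gso_max_size;  /* stored already adjusted */
};

/* Apply the header offset once, at the point where the value is set. */
void sock_model_set_gso_max_size(struct sock_model *sk, uint32_t dev_gso_max_size)
{
    sk->gso_max_size = dev_gso_max_size - (MODEL_MAX_TCP_HEADER + 1);
}

/* Readers use the stored value as-is instead of subtracting on every call. */
uint32_t sock_model_gso_budget(const struct sock_model *sk)
{
    return sk->gso_max_size;
}
```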

    Signed-off-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220125024511.27480-1-dsahern@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:50 +01:00
Jiri Benc 6619cf0a37 net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] different context in tcp_fragment() due to missing
  a52fe46ef160 ("tcp: factorize ip_summed setting")

commit a1ac9c8acec1605c6b43af418f79facafdced680
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:25 2022 -0800

    net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp

    skb->tstamp was first used as the (rcv) timestamp.
    The major usage is to report it to the user (e.g. SO_TIMESTAMP).

    Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
    during egress and used by the qdisc (e.g. sch_fq) to make decision on when
    the skb can be passed to the dev.

    Currently, there is no way to tell whether skb->tstamp holds the (rcv) timestamp
    or the delivery_time, so it is always reset to 0 whenever forwarded
    between egress and ingress.

    While it makes sense to always clear the (rcv) timestamp in skb->tstamp
    to avoid confusing sch_fq that expects the delivery_time, it is a
    performance issue [0] to clear the delivery_time if the skb finally
    egresses to a fq@phy-dev.  For example, when forwarding from egress to
    ingress and then finally back to egress:

                tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                         ^              ^
                                         reset          reset

    This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
    is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

    The current use case is to keep the TCP mono delivery_time (EDT) and
    to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
    to read and change the mono delivery_time.

    In the future, another bit (e.g. skb->user_delivery_time) can be added
    for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

    [ This patch is a prep work.  The following patches will
      get the other parts of the stack ready first.  Then another patch
      after that will finally set the skb->mono_delivery_time. ]

    skb_set_delivery_time() function is added.  It is used by the tcp_output.c
    and during ip[6] fragmentation to assign the delivery_time to
    the skb->tstamp and also set the skb->mono_delivery_time.

    A note on the change in ip_send_unicast_reply() in ip_output.c.
    It is only used by TCP to send reset/ack out of a ctl_sk.
    Like the new skb_set_delivery_time(), this patch sets
    the skb->mono_delivery_time to 0 for now as a
    placeholder.  It will be enabled in a later patch.
    A similar case in tcp_ipv6 can be done with
    skb_set_delivery_time() in tcp_v6_send_response().

    [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:59 +02:00
Jiri Benc f8ca6f0764 tcp: Change SYN ACK retransmit behaviour to account for rehash
Bugzilla: https://bugzilla.redhat.com/2120966

commit cb6cd2cec799356e5e2f75a8591894599a6ad49d
Author: Akhmat Karakotov <hmukos@yandex-team.ru>
Date:   Mon Jan 31 16:31:25 2022 +0300

    tcp: Change SYN ACK retransmit behaviour to account for rehash

    Disabling rehash behavior did not affect SYN ACK retransmits because hash
    was forcefully changed bypassing the sk_rethink_hash function. This patch
    adds a condition which checks for rehash mode before resetting hash.

    Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:54 +02:00
Paolo Abeni 38e724189c net: Fix data-races around sysctl_[rw]mem_(max|default).
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:44 2022 -0700

    net: Fix data-races around sysctl_[rw]mem_(max|default).

    While reading sysctl_[rw]mem_(max|default), they can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.
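The pattern can be approximated in userspace as below. This is only a sketch: READ_ONCE_SKETCH is a hypothetical macro that mimics the volatile-cast idea behind the kernel's READ_ONCE() using the GCC/Clang `__typeof__` extension, and the variable and function names are invented for illustration.

```c
#include <stdint.h>

/* Force a single load of x so the compiler cannot re-read it.
 * Userspace approximation only; relies on the __typeof__ extension.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Stand-in for a sysctl that another thread may rewrite at any time. */
uint32_t sysctl_rmem_max_model = 212992;

/* Clamp a requested buffer size against one snapshot of the limit, so the
 * comparison and the returned value cannot disagree.
 */
uint32_t clamp_rcvbuf(uint32_t requested)
{
    uint32_t limit = READ_ONCE_SKETCH(sysctl_rmem_max_model);

    return requested < limit ? requested : limit;
}
```

Taking the snapshot into a local is the point: without it, the compiler would be free to load the sysctl twice and observe two different values.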

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 519b3282c5 net: Fix data-races around sysctl_[rw]mem(_offset)?.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: different context in __tcp_grow_window() as rhel-9 \
 lacks upstream commit 240bfd134c592 ("tcp: tweak len/truesize \
 ratio for coalesce candidates")

Upstream commit:
commit 02739545951ad4c1215160db7fbf9b7a918d3c0b
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:00 2022 -0700

    net: Fix data-races around sysctl_[rw]mem(_offset)?.

    While reading these sysctl variables, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - .sysctl_rmem
      - .sysctl_wmem
      - .sysctl_rmem_offset
      - .sysctl_wmem_offset
      - sysctl_tcp_rmem[1, 2]
      - sysctl_tcp_wmem[1, 2]
      - sysctl_decnet_rmem[1]
      - sysctl_decnet_wmem[1]
      - sysctl_tipc_rmem[1]

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 12:59:36 +02:00
Paolo Abeni 036c0e121e tcp: add accessors to read/set tp->snd_cwnd
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465
Tested: LNST, Tier1

Upstream commit:
commit 40570375356c874b1578e05c1dcc3ff7c1322dbe
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 5 16:35:38 2022 -0700

    tcp: add accessors to read/set tp->snd_cwnd

    We had various bugs over the years with code
    breaking the assumption that tp->snd_cwnd is greater
    than zero.

    Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
    in commit 8b8a321ff7 ("tcp: fix zero cwnd in tcp_cwnd_reduction")
    can trigger, and without a repro we would have to spend
    considerable time finding the bug.

    Instead of complaining too late, we want to catch where
    and when tp->snd_cwnd is set to an illegal value.
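A userspace sketch of what such accessors might look like. The names are hypothetical, and a counter stands in for a kernel warning so the example can run outside the kernel.

```c
#include <stdint.h>

/* Hypothetical model holding only the congestion window. */
struct tcp_sock_model {
    uint32_t snd_cwnd;
};

/* Counts writes of an illegal (zero or "negative") congestion window. */
unsigned int bad_cwnd_writes;

uint32_t snd_cwnd_get(const struct tcp_sock_model *tp)
{
    return tp->snd_cwnd;
}

/* Funnel every write through one place so a bad value is flagged at the
 * moment it is stored, not when a later reader trips over it.
 */
void snd_cwnd_set(struct tcp_sock_model *tp, uint32_t val)
{
    if ((int32_t)val <= 0)
        bad_cwnd_writes++;
    tp->snd_cwnd = val;
}
```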

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Suggested-by: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-27 16:43:55 +02:00
Patrick Talbert 55b0bd82ad Merge: mptcp: better window sharing
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/888

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089885
Tested: LNST, Tier1 and vs bz reproducer

When the MPTCP throughput is CPU bound and the links in use are much faster than the CPU limits, the MPTCP throughput is unstable because the MPTCP-level congestion window sharing currently has some glitches. Patch 1/5 ensures the sharing also affects the announced window; patches 3/5 and 4/5 take care of concurrent announced window updates. The remaining patches add more MIB counters for introspection's sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-10 09:44:37 +02:00