1175 Commits

Author SHA1 Message Date
Guillaume Nault 1e5a2e4daa tcp: Fix data-races around sysctl_tcp_max_reordering.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit a11e5b3e7a59fde1a90b0eaeaa82320495cf8cae
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:53 2022 -0700

    tcp: Fix data-races around sysctl_tcp_max_reordering.

    While reading sysctl_tcp_max_reordering, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: dca145ffaa ("tcp: allow for bigger reordering level")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
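
The READ_ONCE() fixes in this series all address the same hazard: a sysctl value is read more than once while a writer may change it in between. Python has no READ_ONCE(), so the sketch below only illustrates the "read once into a local, then use the local" discipline. `ChangingSysctl`, `clamp_two_reads` and `clamp_one_read` are hypothetical names, not kernel code, and the changing property stands in for a concurrent writer.

```python
class ChangingSysctl:
    """Hypothetical knob that yields a new value on every read,
    standing in for a concurrent writer."""

    def __init__(self, values):
        self._values = iter(values)

    @property
    def tcp_max_reordering(self):
        return next(self._values)


def clamp_two_reads(reordering, sysctl):
    # Racy shape: the comparison and the result come from separate reads.
    if reordering > sysctl.tcp_max_reordering:
        return sysctl.tcp_max_reordering
    return reordering


def clamp_one_read(reordering, sysctl):
    # Snapshot the knob once, then work only with the local copy.
    limit = sysctl.tcp_max_reordering
    return min(reordering, limit)


# The knob reads 300 first and 500 on the next read.
racy = clamp_two_reads(400, ChangingSysctl([300, 500]))
safe = clamp_one_read(400, ChangingSysctl([300, 500]))
print(racy, safe)
```

With two reads the result (500) is larger than both the input and the limit that triggered the clamp; with a single read the result is consistent with the value that was checked.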
Guillaume Nault 63dfd43837 tcp: Fix a data-race around sysctl_tcp_stdurg.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 4e08ed41cb1194009fc1a916a59ce3ed4afd77cd
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:50 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_stdurg.

    While reading sysctl_tcp_stdurg, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault b39dc1a4f7 tcp: Fix data-races around sysctl_tcp_recovery.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit e7d2ef837e14a971a05f60ea08c47f3fed1a36e4
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:46 2022 -0700

    tcp: Fix data-races around sysctl_tcp_recovery.

    While reading sysctl_tcp_recovery, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 4f41b1c58a ("tcp: use RACK to detect losses")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault afe98b9c8e tcp: Fix data-races around sysctl knobs related to SYN option.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 3666f666e99600518ab20982af04a078bbdad277
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:44 2022 -0700

    tcp: Fix data-races around sysctl knobs related to SYN option.

    While reading these knobs, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - tcp_sack
      - tcp_window_scaling
      - tcp_timestamps

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:13 +01:00
Herton R. Krzesinski ee17c5d305 Merge: bpf, xdp: update to 6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1742

bpf, xdp: update to 6.0

Bugzilla: https://bugzilla.redhat.com/2137876

Signed-off-by: Artem Savkov <asavkov@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Michael Petlan <mpetlan@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-12 16:01:19 +00:00
Artem Savkov 56229aa3cb bpf: Add helpers to issue and check SYN cookies in XDP
Bugzilla: https://bugzilla.redhat.com/2137876

commit 33bf9885040c399cf6a95bd33216644126728e14
Author: Maxim Mikityanskiy <maximmi@nvidia.com>
Date:   Wed Jun 15 16:48:44 2022 +0300

    bpf: Add helpers to issue and check SYN cookies in XDP
    
    The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP
    program to generate SYN cookies in response to TCP SYN packets and to
    check those cookies upon receiving the first ACK packet (the final
    packet of the TCP handshake).
    
    Unlike bpf_tcp_{gen,check}_syncookie these new helpers don't need a
    listening socket on the local machine, which allows them to be used
    together with synproxy to accelerate SYN cookie generation.
    
    Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:36 +01:00
Guillaume Nault 0bdff2f331 tcp: Fix data-races around sysctl_max_syn_backlog.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 79539f34743d3e14cc1fa6577d326a82cc64d62f
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:53 2022 -0700

    tcp: Fix data-races around sysctl_max_syn_backlog.

    While reading sysctl_max_syn_backlog, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:58 +01:00
Guillaume Nault 8e09537936 tcp: Fix data-races around sysctl_tcp_reordering.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 46778cd16e6a5ad1b2e3a91f6c057c907379418e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:49 2022 -0700

    tcp: Fix data-races around sysctl_tcp_reordering.

    While reading sysctl_tcp_reordering, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:57 +01:00
Guillaume Nault c8af159735 tcp: Fix data-races around sysctl_tcp_ecn.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit 4785a66702f086cf2ea84bdbe6ec921f274bd9f2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 11 17:15:30 2022 -0700

    tcp: Fix data-races around sysctl_tcp_ecn.

    While reading sysctl_tcp_ecn, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:52 +01:00
Jamie Bainbridge ff967a9a4c tcp: Fix build break when CONFIG_IPV6=n
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143850
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git

commit c90b6b1005ec7423c7d0063eb27a9728498f6ec8
Author: Saeed Mahameed <saeedm@nvidia.com>
Date:   Tue Nov 22 10:41:58 2022 -0800

    tcp: Fix build break when CONFIG_IPV6=n

    The cited commit caused the following build break when CONFIG_IPV6 was
    disabled:

    net/ipv4/tcp_input.c: In function ‘tcp_syn_flood_action’:
    include/net/sock.h:387:37: error: ‘const struct sock_common’ has no member named ‘skc_v6_rcv_saddr’; did you mean ‘skc_rcv_saddr’?

    Fix by using inet6_rcv_saddr() macro which handles this situation
    nicely.

    Fixes: d9282e48c608 ("tcp: Add listening address to SYN flood message")
    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
    CC: Matthieu Baerts <matthieu.baerts@tessares.net>
    CC: Jamie Bainbridge <jamie.bainbridge@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20221122184158.170798-1-saeed@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
2022-11-25 15:18:41 +10:00
Jamie Bainbridge 2c9cf28cda tcp: annotate data-race around queue->synflood_warned
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143850
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git

commit bf36267e3ad3df80a3a18eb0422723069a434934
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Nov 15 09:18:51 2022 +0000

    tcp: annotate data-race around queue->synflood_warned

    Annotate the lockless read of queue->synflood_warned.

    The following xchg() has the needed data-race resolution.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
2022-11-25 15:18:41 +10:00
Jamie Bainbridge b455dcf2c6 tcp: Add listening address to SYN flood message
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143850
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git

commit d9282e48c6088105a98b98153a707fdbcdbf75b1
Author: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Date:   Mon Nov 14 12:00:08 2022 +1100

    tcp: Add listening address to SYN flood message

    The SYN flood message prints the listening port number, but with many
    processes bound to the same port on different IPs, it's impossible to
    tell which socket is the problem.

    Add the listen IP address to the SYN flood message.

    For IPv6 use "[IP]:port" as per RFC-5952 and to provide ease of
    copy-paste to "ss" filters. For IPv4 use "IP:port" to match.

    Each protocol's "any" address and a host address now look like:

     Possible SYN flooding on port 0.0.0.0:9001.
     Possible SYN flooding on port 127.0.0.1:9001.
     Possible SYN flooding on port [::]:9001.
     Possible SYN flooding on port [fc00::1]:9001.

    Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
    Link: https://lore.kernel.org/r/4fedab7ce54a389aeadbdc639f6b4f4988e9d2d7.1668386107.git.jamie.bainbridge@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
2022-11-25 15:18:41 +10:00
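
The address format described in the commit above ("[IP]:port" for IPv6, "IP:port" for IPv4) can be sketched in user space as follows. This is only an illustration of the output format using Python's standard `ipaddress` module; `format_listener` and `syn_flood_message` are hypothetical helpers, not the kernel's implementation.

```python
import ipaddress


def format_listener(addr, port):
    """Render an address/port pair the way the commit message describes."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        # IPv6 addresses are bracketed so the port separator is unambiguous.
        return f"[{ip.compressed}]:{port}"
    return f"{ip}:{port}"


def syn_flood_message(addr, port):
    return f"Possible SYN flooding on port {format_listener(addr, port)}."


for listen_addr in ("0.0.0.0", "127.0.0.1", "::", "fc00::1"):
    print(syn_flood_message(listen_addr, 9001))
```

The loop prints the four example lines quoted in the commit message.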
Jamie Bainbridge cedd7bec50 tcp: Fix data-races around sysctl_tcp_syncookies.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143850

commit f2e383b5bb6bbc60a0b94b87b3e49a2b1aefd11e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:47 2022 -0700

    tcp: Fix data-races around sysctl_tcp_syncookies.

    While reading sysctl_tcp_syncookies, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
2022-11-25 15:18:34 +10:00
Frantisek Hrbata e265d68e77 Merge: tcp: phase-1 backports for RHEL-9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1504

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: All mainline in net-next.git.
Tested: boot-tested only
Conflicts: see individual patches

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 02:40:21 -05:00
Davide Caratti e37ba8af61 tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 780476488844

commit 780476488844e070580bfc9e3bc7832ec1cea883
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:18 2022 -0700

    tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf.

    While reading sysctl_tcp_moderate_rcvbuf, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:58 +01:00
Davide Caratti bfb8959ee3 net: keep sk->sk_forward_alloc as small as possible
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 4890b686f408
Conflicts:
 - net/ipv4/tcp.c: context mismatch because we don't have upstream
   commit 8a794df69300 ("tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skb")
   and commit c4322884ed21 ("tcp: remove unneeded code from tcp_stream_alloc_skb()")

commit 4890b686f4088c90432149bd6de567e621266fa2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:11 2022 -0700

    net: keep sk->sk_forward_alloc as small as possible

    Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast.

    Each TCP socket can forward allocate up to 2 MB of memory, even after
    the flow has become less active.

    10,000 sockets can have reserved 20 GB of memory,
    and we have no shrinker in place to reclaim that.

    Instead of trying to reclaim the extra allocations in some places,
    just keep sk->sk_forward_alloc values as small as possible.

    This should not impact performance too much now we have per-cpu
    reserves: Changes to tcp_memory_allocated should not be too frequent.

    For sockets not using SO_RESERVE_MEM:
     - idle sockets (no packets in tx/rx queues) have zero forward alloc.
     - non idle sockets have a forward alloc smaller than one page.

    Note:

     - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
       is left to MPTCP maintainers as a follow up.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
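
The figures in the commit above can be checked with simple arithmetic. The sketch below assumes binary units (1 MB = 2^20 bytes) and a 4096-byte page; both are assumptions made for illustration, since the commit message states neither.

```python
MIB = 1 << 20
GIB = 1 << 30
PAGE_SIZE = 4096  # assumed page size; varies by architecture

sockets = 10_000

# Before: each socket may hold up to 2 MB of forward allocation.
before_bytes = sockets * 2 * MIB

# After: a non-idle socket holds less than one page, so this is an upper bound.
after_bytes = sockets * PAGE_SIZE

print(f"before: {before_bytes / GIB:.2f} GiB")
print(f"after:  {after_bytes / MIB:.2f} MiB (upper bound)")
```

Under these assumptions the reservation drops from roughly 20 GB to under 40 MiB for the same 10,000 sockets.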
Davide Caratti 543f426b27 net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 100fdd1faf50

commit 100fdd1faf50557558e2911af4be32e515cb8036
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:07 2022 -0700

    net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT

    Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.

    This might change in the future, but it seems better to avoid the
    confusion.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti ca76233e5b tcp: fix early ETIMEDOUT after spurious non-SACK RTO
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit 686dc2db2a0f

commit 686dc2db2a0fdc1d34b424ec2c0a735becd8d62b
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sat Sep 3 08:10:23 2022 -0400

    tcp: fix early ETIMEDOUT after spurious non-SACK RTO

    Fix a bug reported and analyzed by Nagaraj Arankal, where the handling
    of a spurious non-SACK RTO could cause a connection to fail to clear
    retrans_stamp, causing a later RTO to very prematurely time out the
    connection with ETIMEDOUT.

    Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent
    report:

    (*1) Send one data packet on a non-SACK connection

    (*2) Because no ACK packet is received, the packet is retransmitted
         and we enter CA_Loss; but this retransmission is spurious.

    (*3) The ACK for the original data is received. The transmitted packet
         is acknowledged.  The TCP timestamp is before the retrans_stamp,
         so tcp_may_undo() returns true, and tcp_try_undo_loss() returns
         true without changing state to Open (because tcp_is_sack() is
         false), and tcp_process_loss() returns without calling
         tcp_try_undo_recovery().  Normally after undoing a CA_Loss
         episode, tcp_fastretrans_alert() would see that the connection
         has returned to CA_Open and fall through and call
         tcp_try_to_open(), which would set retrans_stamp to 0.  However,
         for non-SACK connections we hold the connection in CA_Loss, so do
         not fall through to call tcp_try_to_open() and do not set
         retrans_stamp to 0. So retrans_stamp is (erroneously) still
         non-zero.

         At this point the first "retransmission event" has passed and
         been recovered from. Any future retransmission is a completely
         new "event". However, retrans_stamp is erroneously still
         set. (And we are still in CA_Loss, which is correct.)

    (*4) After 16 minutes (to correspond with tcp_retries2=15), a new data
         packet is sent. Note: No data is transmitted between (*3) and
         (*4) and we disabled keep alives.

         The socket's timeout SHOULD be calculated from this point in
         time, but instead it's calculated from the prior "event" 16
         minutes ago (step (*2)).

    (*5) Because no ACK packet is received, the packet is retransmitted.

    (*6) At the time of the 2nd retransmission, the socket returns
         ETIMEDOUT, prematurely, because retrans_stamp is (erroneously)
         too far in the past (set at the time of (*2)).

    This commit fixes this bug by ensuring that we reuse in
    tcp_try_undo_loss() the same careful logic for non-SACK connections
    that we have in tcp_try_undo_recovery(). To avoid duplicating logic,
    we factor out that logic into a new
    tcp_is_non_sack_preventing_reopen() helper and call that helper from
    both undo functions.

    Fixes: da34ac7626 ("tcp: only undo on partial ACKs in CA_Loss")
    Reported-by: Nagaraj Arankal <nagaraj.p.arankal@hpe.com>
    Link: https://lore.kernel.org/all/SJ0PR84MB1847BE6C24D274C46A1B9B0EB27A9@SJ0PR84MB1847.NAMPRD84.PROD.OUTLOOK.COM/
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220903121023.866900-1-ncardwell.kernel@gmail.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:19:56 +01:00
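
The timeline in steps (*1) to (*6) above can be replayed with a toy model. This is a deliberately simplified sketch, not the kernel's logic: `Conn`, its methods, the timestamps and the 960-second timeout (16 minutes, chosen to match the narrative) are all hypothetical, and the fix is reduced to "the undo clears retrans_stamp".

```python
class Conn:
    """Toy connection that tracks only retrans_stamp (0 means unset)."""

    def __init__(self, undo_clears_stamp):
        self.retrans_stamp = 0
        self.undo_clears_stamp = undo_clears_stamp

    def on_rto(self, now):
        # The first retransmission of an episode records its start time.
        if self.retrans_stamp == 0:
            self.retrans_stamp = now

    def on_spurious_undo(self):
        # Buggy behavior leaves the stamp set; fixed behavior clears it.
        if self.undo_clears_stamp:
            self.retrans_stamp = 0

    def timed_out(self, now, timeout):
        return self.retrans_stamp != 0 and now - self.retrans_stamp >= timeout


def replay(undo_clears_stamp, timeout=960):
    conn = Conn(undo_clears_stamp)
    conn.on_rto(now=1000)        # (*2) spurious retransmission
    conn.on_spurious_undo()      # (*3) ACK for the original data arrives
    conn.on_rto(now=1961)        # (*5) retransmission of the new packet
    return conn.timed_out(now=1963, timeout=timeout)  # (*6)


print("buggy:", replay(undo_clears_stamp=False))
print("fixed:", replay(undo_clears_stamp=True))
```

In the buggy replay the elapsed time is measured from step (*2), so the connection times out almost immediately after the new packet is sent; in the fixed replay it is measured from step (*5).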
Davide Caratti 8c1f553c5d tcp: fix F-RTO may not work correctly when receiving DSACK
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit d9157f6806d1

commit d9157f6806d1499e173770df1f1b234763de5c79
Author: Pengcheng Yang <yangpc@wangsu.com>
Date:   Tue Apr 26 18:03:39 2022 +0800

    tcp: fix F-RTO may not work correctly when receiving DSACK

    Currently DSACK is regarded as a dupack, which may cause
    F-RTO to incorrectly enter "loss was real" when receiving
    DSACK.

    Packetdrill to demonstrate:

    // Enable F-RTO and TLP
        0 `sysctl -q net.ipv4.tcp_frto=2`
        0 `sysctl -q net.ipv4.tcp_early_retrans=3`
        0 `sysctl -q net.ipv4.tcp_congestion_control=cubic`

    // Establish a connection
       +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
       +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
       +0 bind(3, ..., ...) = 0
       +0 listen(3, 1) = 0

    // RTT 10ms, RTO 210ms
      +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
       +0 > S. 0:0(0) ack 1 <...>
     +.01 < . 1:1(0) ack 1 win 257
       +0 accept(3, ..., ...) = 4

    // Send 2 data segments
       +0 write(4, ..., 2000) = 2000
       +0 > P. 1:2001(2000) ack 1

    // TLP
    +.022 > P. 1001:2001(1000) ack 1

    // Continue to send 8 data segments
       +0 write(4, ..., 10000) = 10000
       +0 > P. 2001:10001(8000) ack 1

    // RTO
    +.188 > . 1:1001(1000) ack 1

    // The original data is acked and new data is sent(F-RTO step 2.b)
       +0 < . 1:1(0) ack 2001 win 257
       +0 > P. 10001:12001(2000) ack 1

    // D-SACK caused by TLP is regarded as a dupack, this results in
    // the incorrect judgment of "loss was real"(F-RTO step 3.a)
    +.022 < . 1:1(0) ack 2001 win 257 <sack 1001:2001,nop,nop>

    // Never-retransmitted data(3001:4001) are acked and
    // expect to switch to open state(F-RTO step 3.b)
       +0 < . 1:1(0) ack 4001 win 257
    +0 %{ assert tcpi_ca_state == 0, tcpi_ca_state }%

    Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
    Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Tested-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/1650967419-2150-1-git-send-email-yangpc@wangsu.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:19:56 +01:00
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
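
The `perf script` output shown in the merge description above can be summarized per drop reason with a few lines of Python. This is a minimal sketch that assumes the line layout of the two sample lines quoted there; `count_drop_reasons` is a hypothetical helper, and other perf versions may print the fields differently.

```python
import re
from collections import Counter

# Matches the trailing "reason: NAME" field of a kfree_skb trace line.
REASON_RE = re.compile(r"skb:kfree_skb:.*\breason: (\w+)")


def count_drop_reasons(lines):
    """Count how often each drop reason appears in perf script output."""
    counts = Counter()
    for line in lines:
        match = REASON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "swapper     0 [000] 45483.977088: skb:kfree_skb: "
    "skbaddr=0xffffa04859090f00 protocol=34525 "
    "location=0xffffffff9bc92940 reason: NOT_SPECIFIED",
    "swapper     0 [000] 45485.792919: skb:kfree_skb: "
    "skbaddr=0xffffa04143757900 protocol=34525 "
    "location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE",
]

counts = count_drop_reasons(sample)
for reason, n in counts.most_common():
    print(n, reason)
```

In practice the lines would come from the saved output of `perf script` rather than an inline list.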
Antoine Tenart 06179dc98c tcp: fix signed/unsigned comparison
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 843f77407eebee07c2a3300df0c4b33f64322e29
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 17 11:34:32 2022 -0700

    tcp: fix signed/unsigned comparison

    Kernel test robot reported:

    smatch warnings:
    net/ipv4/tcp_input.c:5966 tcp_rcv_established() warn: unsigned 'reason' is never less than zero.

    I actually had one packetdrill failing because of this bug,
    and was about to send the fix :)

    v2: Andreas Schwab also pointed out that @reason needs to be negated
        before we reach tcp_drop_reason()

    Fixes: 4b506af9c5b8 ("tcp: add two drop reasons for tcp_ack()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reported-by: Andreas Schwab <schwab@linux-m68k.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:26 +02:00
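
The smatch warning in the commit above ("unsigned 'reason' is never less than zero") can be reproduced in miniature. Python integers are unbounded, so the sketch emulates a 32-bit unsigned variable with a mask. It assumes, for illustration only, a callee that signals a drop by returning a negated reason code; `as_u32`, `check_unsigned` and `check_signed` are hypothetical names.

```python
U32_MASK = 0xFFFFFFFF


def as_u32(value):
    """Emulate storing a C int into a 32-bit unsigned variable."""
    return value & U32_MASK


def check_unsigned(ret):
    # Buggy shape: the negative return value is stored as unsigned,
    # so the "< 0" test can never be true.
    reason = as_u32(ret)
    return reason < 0


def check_signed(ret):
    # Fixed shape: keep the value signed, test it, then negate it
    # before passing it on as a drop reason.
    reason = ret
    if reason < 0:
        return -reason
    return None


print(as_u32(-5), check_unsigned(-5), check_signed(-5))
```

The unsigned variant silently turns -5 into 4294967291 and misses the error path, while the signed variant detects it and recovers the positive reason code 5.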
Antoine Tenart 1fc586215d tcp: add drop reason support to tcp_ofo_queue()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 8fbf195798b56e1e87f62d01be636a6425c304c2
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:48 2022 -0700

    tcp: add drop reason support to tcp_ofo_queue()

    Packets in the OFO queue might be redundant, and thus dropped.

    tcp_drop() is no longer needed.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart fbaadfecee tcp: add drop reasons to tcp_rcv_synsent_state_process()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 659affdb5140599f25418807c3354b060d4b1b88
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:47 2022 -0700

    tcp: add drop reasons to tcp_rcv_synsent_state_process()

    Re-use existing reasons.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart 39986af1ef tcp: make tcp_rcv_synsent_state_process() drop monitor friend
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit c337578a6592d671c5e78accc55f00cc594fe2da
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:46 2022 -0700

    tcp: make tcp_rcv_synsent_state_process() drop monitor friend

    1) A valid RST packet should be consumed, to not confuse drop monitor.

    2) Same remark for packet validating cross syn setup,
       even if we might ignore part of it.

    3) When third packet of 3WHS is delayed, do not pretend
       the SYNACK was dropped.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart cfe125d26c tcp: add drop reason support to tcp_prune_ofo_queue()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit e7c89ae4078eab24af71ba26b91642e819a4bd7f
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:45 2022 -0700

    tcp: add drop reason support to tcp_prune_ofo_queue()

    Add one reason for packets dropped from OFO queue because
    of memory pressure.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart 7a7559570c tcp: add two drop reasons for tcp_ack()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 4b506af9c5b8de0da34097d50d9448dfb33d70c3
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:44 2022 -0700

    tcp: add two drop reasons for tcp_ack()

    Add TCP_TOO_OLD_ACK and TCP_ACK_UNSENT_DATA drop
    reasons so that tcp_rcv_established() can report
    them.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart bfa77d2590 tcp: add drop reasons to tcp_rcv_state_process()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 669da7a71890b2b2a31a7e9571c0fdf1123e26ef
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:43 2022 -0700

    tcp: add drop reasons to tcp_rcv_state_process()

    Add basic support for drop reasons in tcp_rcv_state_process()

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart 1b85a235f3 tcp: make tcp_rcv_state_process() drop monitor friendly
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 37fd4e842391a1b947789969ae8454f1596735c8
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:42 2022 -0700

    tcp: make tcp_rcv_state_process() drop monitor friendly

    tcp_rcv_state_process() incorrectly drops packets
    instead of consuming them, making drop monitor very noisy,
    if not unusable.

    Calling tcp_time_wait() or tcp_done() is part
    of standard behavior; packets triggering these actions
    were not dropped.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart 0d71a5d640 tcp: add drop reason support to tcp_validate_incoming()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit da40b613f89c43c58986e6f30560ad6573a4d569
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:41 2022 -0700

    tcp: add drop reason support to tcp_validate_incoming()

    Creates four new drop reasons for the following cases:

    1) packet being rejected by RFC 7323 PAWS check
    2) packet being rejected by SEQUENCE check
    3) Invalid RST packet
    4) Invalid SYN packet

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart f53e3938c1 tcp: get rid of rst_seq_match
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit b5ec1e6205a1cb719ab188472f00ae81b0800f2e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:40 2022 -0700

    tcp: get rid of rst_seq_match

    Small cleanup in tcp_validate_incoming(), no need for rst_seq_match
    setting and testing.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart 7732a85c9c tcp: consume incoming skb leading to a reset
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit d9d024f96609016628d750ebc8ee4a6f0d80e6e1
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 15 17:10:39 2022 -0700

    tcp: consume incoming skb leading to a reset

    Whenever tcp_validate_incoming() handles a valid RST packet,
    we should not pretend the packet was dropped.

    Create a special section at the end of tcp_validate_incoming()
    to handle this case.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart bf43facef5 tcp: tcp_send_challenge_ack delete useless param `skb`
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 208dd45d8d050360b46ded439a057bcc7cbf3b09
Author: Benjamin Yim <yan2228598786@gmail.com>
Date:   Sun Jan 9 21:08:24 2022 +0800

    tcp: tcp_send_challenge_ack delete useless param `skb`

    This parameter is never used after being passed in, so deleting it
    has no impact.

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Benjamin Yim <yan2228598786@gmail.com>
    Link: https://lore.kernel.org/r/20220109130824.2776-1-yan2228598786@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:25 +02:00
Antoine Tenart 57be9c0cbb net: tcp: use tcp_drop_reason() for tcp_data_queue_ofo()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit d25e481be0c519d1a458b14191dc8c2a8bb3e24a
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:37 2022 +0800

    net: tcp: use tcp_drop_reason() for tcp_data_queue_ofo()

    Replace tcp_drop() used in tcp_data_queue_ofo with tcp_drop_reason().
    The following drop reason is introduced:

    SKB_DROP_REASON_TCP_OFOMERGE

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 6c862ec2a7 net: tcp: use tcp_drop_reason() for tcp_data_queue()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit a7ec381049c0d1f03e342063d75f5c3b314d0ec2
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:36 2022 +0800

    net: tcp: use tcp_drop_reason() for tcp_data_queue()

    Replace tcp_drop() used in tcp_data_queue() with tcp_drop_reason().
    The following drop reasons are introduced:

    SKB_DROP_REASON_TCP_ZEROWINDOW
    SKB_DROP_REASON_TCP_OLD_DATA
    SKB_DROP_REASON_TCP_OVERWINDOW

    SKB_DROP_REASON_TCP_OLD_DATA is used when the end_seq of the skb is
    before the left edge of the receive window. (Maybe there is a better
    name?)
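
The three reasons above can be illustrated with a minimal userspace sketch. It is loosely modelled on the checks described in this message and is not the kernel code: the enum values, `seq_before()`, `seq_after()`, and `classify_segment()` are hypothetical names, and the receive window is passed in as a plain byte count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SKB_DROP_REASON_TCP_* values. */
enum seg_verdict {
    SEG_ACCEPT,
    SEG_DROP_ZEROWINDOW,
    SEG_DROP_OLD_DATA,
    SEG_DROP_OVERWINDOW,
};

/* Wrap-safe sequence comparisons on 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static int seq_after(uint32_t a, uint32_t b)  { return seq_before(b, a); }

/* Classify a segment [seq, end_seq) against the receive window
 * [rcv_nxt, rcv_nxt + rcv_wnd). */
static enum seg_verdict classify_segment(uint32_t seq, uint32_t end_seq,
                                         uint32_t rcv_nxt, uint32_t rcv_wnd)
{
    assert(!seq_before(end_seq, seq));          /* end must not precede start */

    if (seq == rcv_nxt)                         /* in-order segment */
        return rcv_wnd == 0 ? SEG_DROP_ZEROWINDOW : SEG_ACCEPT;
    if (!seq_after(end_seq, rcv_nxt))           /* entirely left of the window */
        return SEG_DROP_OLD_DATA;
    if (!seq_before(seq, rcv_nxt + rcv_wnd))    /* starts right of the window */
        return SEG_DROP_OVERWINDOW;
    return SEG_ACCEPT;                          /* out of order, but in window */
}
```

The comparisons use signed differences so that the classification stays correct when sequence numbers wrap around 2^32.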

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 3a5a0ad32c net: tcp: use tcp_drop_reason() for tcp_rcv_established()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 2a968ef60e1fac4e694d9f60ce19a3b66b40e8c3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:35 2022 +0800

    net: tcp: use tcp_drop_reason() for tcp_rcv_established()

    Replace tcp_drop() used in tcp_rcv_established() with tcp_drop_reason().
    The following drop reason is added:

    SKB_DROP_REASON_TCP_FLAGS

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart 849a53b816 net: tcp: introduce tcp_drop_reason()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 082116ffcb7457ae50e3ddb0213d66ab29408f30
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:29 2022 +0800

    net: tcp: introduce tcp_drop_reason()

    For TCP protocol, tcp_drop() is used to free the skb when it needs
    to be dropped. To make use of kfree_skb_reason() and pass the drop
    reason to it, introduce the function tcp_drop_reason(). Meanwhile,
    make tcp_drop() an inline call to tcp_drop_reason().
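
The wrapper pattern described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel implementation: `struct packet`, `free_packet_reason()`, `pkt_drop_reason()`, and `pkt_drop()` are hypothetical stand-ins for the skb, kfree_skb_reason(), tcp_drop_reason(), and tcp_drop().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's drop reasons. */
enum drop_reason {
    DROP_REASON_NOT_SPECIFIED,
    DROP_REASON_OFOMERGE,
};

struct packet {
    int freed;                  /* set once the packet has been released */
    enum drop_reason reason;    /* reason recorded at release time */
};

/* Models kfree_skb_reason(): release the packet and record why. */
static void free_packet_reason(struct packet *pkt, enum drop_reason reason)
{
    assert(pkt != NULL);
    pkt->freed = 1;
    pkt->reason = reason;
}

/* Models tcp_drop_reason(): the reason-aware drop helper. */
static void pkt_drop_reason(struct packet *pkt, enum drop_reason reason)
{
    free_packet_reason(pkt, reason);
}

/* Models tcp_drop(): existing callers keep working and report an
 * unspecified reason until they are converted one by one. */
static inline void pkt_drop(struct packet *pkt)
{
    pkt_drop_reason(pkt, DROP_REASON_NOT_SPECIFIED);
}
```

Keeping the old entry point as a thin inline wrapper lets call sites be converted incrementally, which matches how the follow-up commits in this series proceed.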

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Paolo Abeni 519b3282c5 net: Fix data-races around sysctl_[rw]mem(_offset)?.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: different context in __tcp_grow_window() as rhel-9 \
 lacks upstream commit 240bfd134c592 ("tcp: tweak len/truesize \
 ratio for coalesce candidates")

Upstream commit:
commit 02739545951ad4c1215160db7fbf9b7a918d3c0b
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:00 2022 -0700

    net: Fix data-races around sysctl_[rw]mem(_offset)?.

    While reading these sysctl variables, they can be changed concurrently.
    Thus, we need to add READ_ONCE() to their readers.

      - .sysctl_rmem
      - .sysctl_wmem
      - .sysctl_rmem_offset
      - .sysctl_wmem_offset
      - sysctl_tcp_rmem[1, 2]
      - sysctl_tcp_wmem[1, 2]
      - sysctl_decnet_rmem[1]
      - sysctl_decnet_wmem[1]
      - sysctl_tipc_rmem[1]
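
The idea behind adding READ_ONCE() to these readers can be sketched in userspace. The macro below is only an approximation for `int` objects (the kernel macro is more general), and `demo_rmem` and `clamp_rcvbuf()` are hypothetical names used for illustration.

```c
#include <assert.h>

/* Approximation of READ_ONCE() for an int: the volatile access makes the
 * compiler emit exactly one load instead of re-reading the variable. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

/* Hypothetical tunable in the {min, default, max} layout of tcp_rmem. */
static int demo_rmem[3] = { 4096, 131072, 6291456 };

/* Clamp a requested buffer size to the current maximum. The limit is
 * snapshotted once, so the comparison and the returned value agree even
 * if another thread rewrites demo_rmem[2] in between. */
static int clamp_rcvbuf(int wanted)
{
    int max = READ_ONCE_INT(demo_rmem[2]);

    assert(wanted >= 0);
    return wanted < max ? wanted : max;
}
```

Without the single snapshot, a compiler would be free to load the limit twice, once for the comparison and once for the result, and a concurrent writer could make the two loads disagree.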

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 12:59:36 +02:00
Paolo Abeni 533df7034b tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465
Tested: LNST, Tier1
Conflicts: use WARN_ON_ONCE instead of NET_DEBUG_WARN_ON_ONCE, as \
  rhel-9 lacks the network debug infra.

Upstream commit:
commit 11825765291a93d8e7f44230da67b9f607c777bf
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 27 14:28:29 2022 -0700

    tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd

    syzbot got a new report [1] finally pointing to a very old bug,
    added in initial support for MTU probing.

    tcp_mtu_probe() has checks about starting an MTU probe if
    tcp_snd_cwnd(tp) >= 11.

    But nothing prevents tcp_snd_cwnd(tp) from being reduced later,
    before the MTU probe succeeds.

    This bug would lead to potential zero-divides.

    Debugging added in commit 40570375356c ("tcp: add accessors
    to read/set tp->snd_cwnd") has paid off :)

    While we are at it, address potential overflows in this code.
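
A minimal sketch of the kind of arithmetic involved, assuming the congestion window is rescaled by the ratio of the old to the new packet size when a probe succeeds. `rescale_cwnd()` is a hypothetical helper, and saturating at UINT32_MAX is an assumption of this sketch rather than something stated in the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Rescale cwnd (in packets) when the packet size grows from old_size to
 * new_size. The product is computed in 64 bits so it cannot overflow, and
 * the result is clamped to at least 1 so later divisions by cwnd are safe. */
static uint32_t rescale_cwnd(uint32_t cwnd, uint32_t old_size, uint32_t new_size)
{
    uint64_t val;

    assert(new_size != 0);
    val = (uint64_t)cwnd * old_size;
    val /= new_size;
    if (val > UINT32_MAX)       /* assumption: saturate instead of truncating */
        val = UINT32_MAX;
    return val < 1 ? 1 : (uint32_t)val;
}
```

For example, a window of 2 packets rescaled from 1000-byte to 9000-byte packets would truncate to 0 with plain integer division; the clamp keeps it at 1.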

    [1]
    WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
    Modules linked in:
    CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb978e3 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
    RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
    Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
    RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
    RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
    RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
    RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
    R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
    R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
    FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
     tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
     tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
     tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
     sk_backlog_rcv include/net/sock.h:1061 [inline]
     __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
     release_sock+0x5d/0x1c0 net/core/sock.c:3404
     sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
     tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
     tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
     sock_sendmsg_nosec net/socket.c:714 [inline]
     sock_sendmsg net/socket.c:734 [inline]
     __sys_sendto+0x439/0x5c0 net/socket.c:2119
     __do_sys_sendto net/socket.c:2131 [inline]
     __se_sys_sendto net/socket.c:2127 [inline]
     __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x46/0xb0
    RIP: 0033:0x7f6431289109
    Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
    RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
    RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000

    Fixes: 5d424d5a67 ("[TCP]: MTU probing")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-07-01 09:53:41 +02:00
Paolo Abeni 036c0e121e tcp: add accessors to read/set tp->snd_cwnd
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465
Tested: LNST, Tier1

Upstream commit:
commit 40570375356c874b1578e05c1dcc3ff7c1322dbe
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 5 16:35:38 2022 -0700

    tcp: add accessors to read/set tp->snd_cwnd

    We had various bugs over the years with code
    breaking the assumption that tp->snd_cwnd is greater
    than zero.

    Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
    in commit 8b8a321ff7 ("tcp: fix zero cwnd in tcp_cwnd_reduction")
    can trigger, and without a repro we would have to spend
    considerable time finding the bug.

    Instead of complaining too late, we want to catch where
    and when tp->snd_cwnd is set to an illegal value.
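
The accessor idea can be sketched as below. This is a userspace illustration, not the kernel helpers: `struct conn`, `conn_snd_cwnd()`, and `conn_snd_cwnd_set()` are hypothetical, and the sketch counts illegal writes where kernel code would emit a warning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct conn {
    uint32_t snd_cwnd;      /* congestion window, expected to stay > 0 */
    unsigned int bad_sets;  /* number of attempts to store an illegal value */
};

/* Single read path for the congestion window. */
static uint32_t conn_snd_cwnd(const struct conn *c)
{
    assert(c != NULL);
    return c->snd_cwnd;
}

/* Single write path: every store goes through here, so an illegal value
 * is flagged at the point where it is written rather than later, when
 * some unrelated code trips over it. */
static void conn_snd_cwnd_set(struct conn *c, uint32_t val)
{
    assert(c != NULL);
    if (val == 0)
        c->bad_sets++;
    c->snd_cwnd = val;
}
```

Funnelling all writes through one setter is what makes the check cheap to add and impossible to bypass by accident.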

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Suggested-by: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-27 16:43:55 +02:00
Tobias Huschle b1e18c6885 [s390] net/smc: Limit SMC visits when handshake workqueue congested
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None

commit 48b6190a00425a1bebac9f7ae4b338a1e20f50f3
Author: D. Wythe <alibuda@linux.alibaba.com>
Date:   Thu Feb 10 17:11:36 2022 +0800

    net/smc: Limit SMC visits when handshake workqueue congested

    This patch provides a mechanism to constrain incoming SMC
    connections according to the pressure on the SMC handshake process.
    At present, frequent connection attempts cause incoming connections
    to be backlogged in the SMC handshake queue, which raises connection
    establishment time. That is unacceptable for applications that rely
    on short-lived connections.

    There are two ways to implement this mechanism:

    1. Put limitation after TCP established.
    2. Put limitation before TCP established.

    In the first way, we need to wait for and receive the CLC messages
    that the client will potentially send, and then actively reply with a
    decline message. In a sense this is also a sort of SMC handshake, and
    it still adds to connection establishment time.

    In the second way, the only problem is that we need to inject SMC logic
    into TCP when it is about to reply to the incoming SYN. Since we already
    do that, it is no longer a problem. The advantage is obvious: few
    additional steps are required to enforce the constraint.

    This patch uses the second way. After this patch, connections beyond
    the constraint will not be given any SMC indication, and SMC will not
    be involved in any of their subsequent processing.

    Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
    Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Tobias Huschle <thuschle@redhat.com>
2022-06-15 06:47:39 +02:00
Paolo Abeni 7f5af56f76 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2079411
Tested: LNST, Tier1

Upstream commit:
commit 4bfe744ff1644fbc0a991a2677dc874475dd6776
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 24 17:34:07 2022 -0700

    tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT

    I had this bug sitting for too long in my pile, it is time to fix it.

    Thanks to Doug Porter for reminding me of it!

    We had various attempts in the past, including commit
    0cbe6a8f08 ("tcp: remove SOCK_QUEUE_SHRUNK"),
    but the issue is that TCP stack currently only generates
    EPOLLOUT from input path, when tp->snd_una has advanced
    and skb(s) cleaned from rtx queue.

    If a flow has a big RTT, and/or receives SACKs, it is possible
    that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
    and no more data can be sent until tp->snd_una finally advances.

    What is needed is to also check if POLLOUT needs to be generated
    whenever tp->snd_nxt is advanced, from output path.
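
The check described above can be sketched in userspace as follows. `struct tx_state` and the helper names are hypothetical; the sketch only models the not-sent byte count and the low-water-mark comparison, not the actual wakeup machinery.

```c
#include <assert.h>
#include <stdint.h>

struct tx_state {
    uint32_t write_seq;      /* next sequence the application will write */
    uint32_t snd_nxt;        /* next sequence the stack will send */
    uint32_t notsent_lowat;  /* wake the writer below this many unsent bytes */
};

/* Bytes queued by the application but not yet sent. */
static uint32_t notsent_bytes(const struct tx_state *t)
{
    return t->write_seq - t->snd_nxt;
}

/* Returns 1 if a writability event should be raised. */
static int should_wake_writer(const struct tx_state *t)
{
    return notsent_bytes(t) < t->notsent_lowat;
}

/* Output path: after sending, re-check writability immediately instead of
 * waiting for an incoming ACK to advance snd_una. */
static int advance_snd_nxt(struct tx_state *t, uint32_t sent)
{
    assert(sent <= notsent_bytes(t));
    t->snd_nxt += sent;
    return should_wake_writer(t);
}
```

With a long RTT, the unsent backlog can drain well before any ACK arrives, which is why checking only on the input path can leave the writer asleep.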

    This bug triggers more often after an idle period, as
    we do not receive ACK for at least one RTT. tcp_notsent_lowat
    could be a fraction of what CWND and pacing rate would allow to
    send during this RTT.

    In a followup patch, I will remove the bogus call
    to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
    from tcp_check_space(). The fact that we have decided to generate
    an EPOLLOUT does not mean the application has immediately
    refilled the transmit queue. This optimistic call
    might have been the reason the bug seemed not too serious.

    Tested:

    200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

    $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
    $ cat bench_rr.sh
    SUM=0
    for i in {1..10}
    do
     V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
     echo $V
     SUM=$(($SUM + $V))
    done
    echo SUM=$SUM

    Before patch:
    $ bench_rr.sh
    130000000
    80000000
    140000000
    140000000
    140000000
    140000000
    130000000
    40000000
    90000000
    110000000
    SUM=1140000000

    After patch:
    $ bench_rr.sh
    430000000
    590000000
    530000000
    450000000
    450000000
    350000000
    450000000
    490000000
    480000000
    460000000
    SUM=4680000000  # This is 410 % of the value before patch.

    Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Doug Porter <dsp@fb.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-12 16:55:44 +02:00
Paolo Abeni bae902a610 inet: fully convert sk->sk_rx_dst to RCU rules
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2079411
Tested: LNST, Tier1
Conflicts: \
  - sk_rx_dst location inside struct sock is slightly different
  from upstream as rhel-9 already has commit 43f51df41729 ("net:
   move early demux fields close to sk_refcnt")

Upstream commit:
commit 8f905c0e7354ef261360fb7535ea079b1082c105
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Dec 20 06:33:30 2021 -0800

    inet: fully convert sk->sk_rx_dst to RCU rules

    syzbot reported various issues around early demux,
    one being included in this changelog [1]

    sk->sk_rx_dst is using RCU protection without clearly
    documenting it.

    And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
    are not following standard RCU rules.

    [a]    dst_release(dst);
    [b]    sk->sk_rx_dst = NULL;

    They look wrong because a delete operation of RCU protected
    pointer is supposed to clear the pointer before
    the call_rcu()/synchronize_rcu() guarding actual memory freeing.

    In some cases indeed, dst could be freed before [b] is done.

    We could cheat by clearing sk_rx_dst before calling
    dst_release(), but this seems the right time to stick
    to standard RCU annotations and debugging facilities.
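
The ordering issue can be illustrated with a userspace sketch using C11 atomics. It only models "unpublish the pointer, then drop the reference"; it does not implement RCU or a grace period, and `struct dst_model`, `struct sock_model`, `dst_put()`, and `clear_rx_dst()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct dst_model {
    atomic_int refcnt;                    /* references held on this entry */
};

struct sock_model {
    _Atomic(struct dst_model *) rx_dst;   /* cached route, visible to readers */
};

/* Drop one reference; real code would schedule the free once it hits 0. */
static void dst_put(struct dst_model *d)
{
    if (d != NULL)
        atomic_fetch_sub(&d->refcnt, 1);
}

/* Clear the cached pointer first so that no new reader can pick it up,
 * and only then drop the reference. Doing these steps in the opposite
 * order, as in [a]/[b] above, leaves a window in which a reader can
 * still load a pointer whose release has already been started. */
static void clear_rx_dst(struct sock_model *sk)
{
    struct dst_model *old;

    assert(sk != NULL);
    old = atomic_exchange(&sk->rx_dst, (struct dst_model *)NULL);
    dst_put(old);
}
```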

    [1]
    BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
    BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
    Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204

    CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
     print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
     __kasan_report mm/kasan/report.c:433 [inline]
     kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
     dst_check include/net/dst.h:470 [inline]
     tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
     ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
     ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
     ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
     ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
     __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
     __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
     __netif_receive_skb_list net/core/dev.c:5608 [inline]
     netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
     gro_normal_list net/core/dev.c:5853 [inline]
     gro_normal_list net/core/dev.c:5849 [inline]
     napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
     virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
     virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
     __napi_poll+0xaf/0x440 net/core/dev.c:7023
     napi_poll net/core/dev.c:7090 [inline]
     net_rx_action+0x801/0xb40 net/core/dev.c:7177
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
     invoke_softirq kernel/softirq.c:432 [inline]
     __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
     irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
     common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
     asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
    RIP: 0033:0x7f5e972bfd57
    Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
    RSP: 002b:00007fff8a413210 EFLAGS: 00000283
    RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
    RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
    RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
    R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
    R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
     </TASK>

    Allocated by task 13:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:46 [inline]
     set_alloc_info mm/kasan/common.c:434 [inline]
     __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
     kasan_slab_alloc include/linux/kasan.h:259 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3234 [inline]
     slab_alloc mm/slub.c:3242 [inline]
     kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
     dst_alloc+0x146/0x1f0 net/core/dst.c:92
     rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
     ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
     ip_route_input_rcu net/ipv4/route.c:2470 [inline]
     ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
     ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
     ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
     ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
     ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
     __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
     __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
     __netif_receive_skb_list net/core/dev.c:5608 [inline]
     netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
     gro_normal_list net/core/dev.c:5853 [inline]
     gro_normal_list net/core/dev.c:5849 [inline]
     napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
     virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
     virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
     __napi_poll+0xaf/0x440 net/core/dev.c:7023
     napi_poll net/core/dev.c:7090 [inline]
     net_rx_action+0x801/0xb40 net/core/dev.c:7177
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Freed by task 13:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     kasan_set_track+0x21/0x30 mm/kasan/common.c:46
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free mm/kasan/common.c:328 [inline]
     __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
     kasan_slab_free include/linux/kasan.h:235 [inline]
     slab_free_hook mm/slub.c:1723 [inline]
     slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
     slab_free mm/slub.c:3513 [inline]
     kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
     dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
     rcu_do_batch kernel/rcu/tree.c:2506 [inline]
     rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Last potentially related work creation:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
     __call_rcu kernel/rcu/tree.c:2985 [inline]
     call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
     dst_release net/core/dst.c:177 [inline]
     dst_release+0x79/0xe0 net/core/dst.c:167
     tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
     sk_backlog_rcv include/net/sock.h:1030 [inline]
     __release_sock+0x134/0x3b0 net/core/sock.c:2768
     release_sock+0x54/0x1b0 net/core/sock.c:3300
     tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
     inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:704 [inline]
     sock_sendmsg+0xcf/0x120 net/socket.c:724
     sock_write_iter+0x289/0x3c0 net/socket.c:1057
     call_write_iter include/linux/fs.h:2162 [inline]
     new_sync_write+0x429/0x660 fs/read_write.c:503
     vfs_write+0x7cd/0xae0 fs/read_write.c:590
     ksys_write+0x1ee/0x250 fs/read_write.c:643
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff88807f1cb700
     which belongs to the cache ip_dst_cache of size 176
    The buggy address is located 58 bytes inside of
     176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
    The buggy address belongs to the page:
    page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
    flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
    raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
    raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
     prep_new_page mm/page_alloc.c:2418 [inline]
     get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
     __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
     alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
     alloc_slab_page mm/slub.c:1793 [inline]
     allocate_slab mm/slub.c:1930 [inline]
     new_slab+0x32d/0x4a0 mm/slub.c:1993
     ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
     __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
     slab_alloc_node mm/slub.c:3200 [inline]
     slab_alloc mm/slub.c:3242 [inline]
     kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
     dst_alloc+0x146/0x1f0 net/core/dst.c:92
     rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
     __mkroute_output net/ipv4/route.c:2564 [inline]
     ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
     ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
     __ip_route_output_key include/net/route.h:126 [inline]
     ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
     ip_route_output_key include/net/route.h:142 [inline]
     geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
     geneve_xmit_skb drivers/net/geneve.c:899 [inline]
     geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
     __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
     netdev_start_xmit include/linux/netdevice.h:5008 [inline]
     xmit_one net/core/dev.c:3590 [inline]
     dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
     __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
    page last free stack trace:
     reset_page_owner include/linux/page_owner.h:24 [inline]
     free_pages_prepare mm/page_alloc.c:1338 [inline]
     free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
     free_unref_page_prepare mm/page_alloc.c:3309 [inline]
     free_unref_page+0x19/0x690 mm/page_alloc.c:3388
     qlink_free mm/kasan/quarantine.c:146 [inline]
     qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
     kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
     __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
     kasan_slab_alloc include/linux/kasan.h:259 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3234 [inline]
     kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
     __alloc_skb+0x215/0x340 net/core/skbuff.c:414
     alloc_skb include/linux/skbuff.h:1126 [inline]
     alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
     sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
     mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
     add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
     add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
     mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
     mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
     mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
     process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
     worker_thread+0x658/0x11f0 kernel/workqueue.c:2445

    Memory state around the buggy address:
     ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
    >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                            ^
     ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
     ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-12 16:55:33 +02:00
Paolo Abeni fa461af9aa tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028279
Tested: LNST, Tier1

Upstream commit:
commit 4f884f3962767877d7aabbc1ec124d2c307a4257
Author: zhenggy <zhenggy@chinatelecom.cn>
Date:   Tue Sep 14 09:51:15 2021 +0800

    tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()

    Commit 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit
    time") may directly retransmit a multi-segment TSO/GSO packet without
    splitting it. Since that commit, we can no longer assume that a
    retransmitted packet is a single segment.

    This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
    to use the actual segment count (pcount) of the retransmitted packet.

    Before that commit (10d3be5692), the assumption underlying
    tp->undo_retrans-- seems to have been correct.
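
A minimal sketch of the accounting change, assuming the counter should drop by the packet's segment count and never go below zero. `undo_retrans_after_sack()` is a hypothetical helper, and the clamp at zero is an assumption of this sketch.

```c
#include <assert.h>

/* Return the new undo_retrans value after a retransmitted packet made of
 * pcount segments has been SACKed. Decrementing by 1 would undercount
 * whenever a multi-segment TSO/GSO packet was retransmitted whole. */
static int undo_retrans_after_sack(int undo_retrans, int pcount)
{
    int left;

    assert(pcount >= 0);
    left = undo_retrans - pcount;
    return left > 0 ? left : 0;
}
```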

    Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
    Signed-off-by: zhenggy <zhenggy@chinatelecom.cn>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-02 12:04:16 +01:00
Jianguo Wu 6787b7e350 mptcp: avoid processing packet if a subflow reset
If check_fully_established() causes a subflow reset, it should not
continue to process the packet in tcp_data_queue().
Add a return value to mptcp_incoming_options(), and return false if a
subflow has been reset, else return true. Then drop the packet in
tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options()
returns false.

Fixes: d582484726 ("mptcp: fix fallback for MP_JOIN subflows")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 18:38:53 -07:00
Nguyen Dinh Phi be5d1b61a2 tcp: fix tcp_init_transfer() to not reset icsk_ca_initialized
This commit fixes a bug (found by syzkaller) that could cause spurious
double-initializations for congestion control modules, which could cause
memory leaks or other problems for congestion control modules (like CDG)
that allocate memory in their init functions.

The buggy scenario constructed by syzkaller was something like:

(1) create a TCP socket
(2) initiate a TFO connect via sendto()
(3) while socket is in TCP_SYN_SENT, call setsockopt(TCP_CONGESTION),
    which calls:
       tcp_set_congestion_control() ->
         tcp_reinit_congestion_control() ->
           tcp_init_congestion_control()
(4) receive ACK, connection is established, call tcp_init_transfer(),
    set icsk_ca_initialized=0 (without first calling cc->release()),
    call tcp_init_congestion_control() again.

Note that in this sequence tcp_init_congestion_control() is called
twice without a cc->release() call in between. Thus, for CC modules
that allocate memory in their init() function, e.g., CDG, a memory leak
may occur. The syzkaller tool managed to find a reproducer that
triggered such a leak in CDG.

The bug was introduced when commit 8919a9b31e ("tcp: Only init
congestion control if not initialized already")
introduced icsk_ca_initialized and set icsk_ca_initialized to 0 in
tcp_init_transfer(), missing the possibility for a sequence like the
one above, where a process could call setsockopt(TCP_CONGESTION) in
state TCP_SYN_SENT (i.e. after the connect() or TFO open sendmsg()),
which would call tcp_init_congestion_control(). It did not intend to
reset any initialization that the user had already explicitly made;
it just missed the possibility of that particular sequence (which
syzkaller managed to find).
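
The double-initialization scenario can be modelled in a few lines of userspace C. Everything here is hypothetical (`struct cc_state`, `cc_init()`, and the two `init_transfer_*()` variants); an allocation counter stands in for the memory a module such as CDG allocates in its init function.

```c
#include <assert.h>
#include <stddef.h>

struct cc_state {
    int initialized;   /* models icsk_ca_initialized */
    int allocations;   /* outstanding allocations made by init */
};

/* Models the guarded init: allocate only if not already initialized. */
static void cc_init(struct cc_state *cc)
{
    assert(cc != NULL);
    if (cc->initialized)
        return;
    cc->allocations++;
    cc->initialized = 1;
}

/* Buggy variant: resets the flag without a release, so the guard in
 * cc_init() no longer protects against a second allocation. */
static void init_transfer_buggy(struct cc_state *cc)
{
    cc->initialized = 0;
    cc_init(cc);
}

/* Fixed variant: leaves the flag alone and relies on the guard. */
static void init_transfer_fixed(struct cc_state *cc)
{
    cc_init(cc);
}
```

If the user has already triggered an init (step 3 above), the buggy variant ends with two outstanding allocations and the fixed variant with one.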

Fixes: 8919a9b31e ("tcp: Only init congestion control if not initialized already")
Reported-by: syzbot+f1e24a0594d4e3a895d3@syzkaller.appspotmail.com
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06 10:32:37 -07:00
Alexander Aring e3ae2365ef net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 11:28:21 -07:00
Yuchung Cheng a29cb69146 net: tcp better handling of reordering then loss cases
This patch aims to improve the situation when reordering and loss are
occurring in the same flight of packets.

Previously the reordering would first induce a spurious recovery, then
the subsequent ACK may undo the cwnd (e.g. based on the timestamps).
However the current loss recovery does not proceed to invoke
RACK to install a reordering timer. If some packets are also lost, this
may lead to a long RTO-based recovery. An example is
https://groups.google.com/g/bbr-dev/c/OFHADvJbTEI

The solution is to always invoke RACK after reverting the recovery,
to either arm the RACK timer to fast retransmit after the reordering
window, or restart the recovery if new loss is identified. Hence
it is possible the sender may go from Recovery to Disorder/Open to
Recovery again in one ACK.

Reported-by: mingkun bian <bianmingkun@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:20:44 -07:00
Jakub Kicinski 709c031423 tcp: add tracepoint for checksum errors
Add a tracepoint for capturing TCP segments with
a bad checksum. This makes it easy to identify
sources of bad frames in the fleet (e.g. machines
with faulty NICs).

It should also help tools like IOvisor's tcpdrop.py
which are used today to get detailed information
about such packets.

We don't have a socket in many cases so we must
open code the address extraction based just on
the skb.

v2: add missing export for ipv6=m

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14 15:26:03 -07:00
Eric Dumazet a7abf3cd76 tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()
Jakub reported that data included in a Fastopen SYN that had to be
retransmitted would have to wait for an RTO if TX completions are slow,
even with the prior fix.

This is because tcp_rcv_fastopen_synack() does not use standard
rtx logic, meaning TSQ handler exits early in tcp_tsq_write()
because tp->lost_out == tp->retrans_out

Let's make tcp_rcv_fastopen_synack() use standard rtx logic,
by using tcp_mark_skb_lost() on the skb that needs to be
sent again.

Note that this raised a warning in tcp_fastretrans_alert() during my tests,
since we consider that data not being acknowledged
by the receiver does not mean the packet was lost on the network.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 18:35:31 -08:00
Eric Dumazet 39354eb29f tcp: tcp_data_ready() must look at SOCK_DONE
My prior cleanup missed that tcp_data_ready() has to look at SOCK_DONE.
Otherwise, an application using SO_RCVLOWAT will not get EPOLLIN event
if a FIN is received in the middle of expected payload.

The reason SOCK_DONE is not examined in tcp_epollin_ready()
is that tcp_poll() catches the FIN because tcp_fin()
is also setting RCV_SHUTDOWN into sk->sk_shutdown

Fixes: 05dc72aba3 ("tcp: factorize logic into tcp_epollin_ready()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-15 13:20:36 -08:00
Eric Dumazet 05dc72aba3 tcp: factorize logic into tcp_epollin_ready()
Both tcp_data_ready() and tcp_stream_is_readable() share the same logic.

Add tcp_epollin_ready() helper to avoid duplication.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12 17:28:26 -08:00
Jakub Kicinski c358f95205 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/dev.c
  b552766c87 ("can: dev: prevent potential information leak in can_fill_info()")
  3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
  0a042c6ec9 ("can: dev: move netlink related code into seperate file")

  Code move.

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  57ac4a31c4 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
  214baf2287 ("net/mlx5e: Support HTB offload")

  Adjacent code changes

net/switchdev/switchdev.c
  20776b465c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
  ffb68fc58e ("net: switchdev: remove the transaction structure from port object notifiers")
  bae33f2b5a ("net: switchdev: remove the transaction structure from port attributes")

  Transaction parameter gets dropped otherwise keep the fix.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 17:09:31 -08:00
Pengcheng Yang 62d9f1a694 tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.

The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.

This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by the RTO timer expiring early.
2) Save one xmit timer reschedule per ACK when the RACK reorder
timer is set.

Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 21:33:01 -08:00
Enke Chen 344db93ae3 tcp: make TCP_USER_TIMEOUT accurate for zero window probes
The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.

In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.
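
The idea can be sketched as clamping the backed-off probe interval to
the time left before the user timeout. This is a minimal model and not
the kernel helper: the function name, the millisecond units and the
"return 1 when overdue" convention are assumptions made for the
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the interval to arm the 0-window probe timer with, in ms.
 * when_ms is the interval chosen by the backoff logic; a
 * user_timeout_ms of 0 means no TCP_USER_TIMEOUT is configured. */
static uint32_t clamp_probe_interval_ms(uint32_t when_ms,
					uint32_t user_timeout_ms,
					uint32_t elapsed_ms)
{
	uint32_t remaining_ms;

	if (user_timeout_ms == 0)
		return when_ms;
	if (elapsed_ms >= user_timeout_ms)
		return 1;	/* already overdue: fire as soon as possible */
	remaining_ms = user_timeout_ms - elapsed_ms;
	return when_ms < remaining_ms ? when_ms : remaining_ms;
}

/* Usage: a 120 s backoff is cut to the 5 s left of a 30 s timeout. */
static void clamp_selfcheck(void)
{
	assert(clamp_probe_interval_ms(120000, 0, 5000) == 120000);
	assert(clamp_probe_interval_ms(120000, 30000, 25000) == 5000);
	assert(clamp_probe_interval_ms(1000, 30000, 25000) == 1000);
	assert(clamp_probe_interval_ms(120000, 30000, 30000) == 1);
}
```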

This patch is similar to and builds on top of the one that made
TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e43 ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy").

Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 19:32:51 -08:00
Yousuk Seung e7ed11ee94 tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 18:20:52 -08:00
Yuchung Cheng 9c30ae8398 tcp: fix TCP socket rehash stats mis-accounting
The previous commit 32efcc06d2 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:

  a. During handshake of an active open, only counts the first
     SYN timeout

  b. After handshake of passive and active open, stop updating
     after (roughly) TCP_RETRIES1 recurring RTOs

  c. After the socket aborts, over count timeout_rehash by 1

This patch fixes this by checking the rehash result from sk_rethink_txhash.

Fixes: 32efcc06d2 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 19:47:20 -08:00
Enke Chen 9d9b1ee0b2 tcp: fix TCP_USER_TIMEOUT with zero window
The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.

The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

    RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
    as long as the receiver continues to respond probes. We support
    this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.

In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.

Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 19:59:17 -08:00
Alexander Duyck c31b70c996 tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
There are cases where a fastopen SYN may trigger either an ICMP_TOOBIG
message in the case of IPv6 or a fragmentation request in the case of
IPv4. This results in the socket stalling for a second or more as it does
not respond to the message by retransmitting the SYN frame.

Normally a SYN frame should not be able to trigger an ICMP_TOOBIG or
ICMP_FRAG_NEEDED; however, in the case of fastopen we can have a frame that
makes use of the entire MSS. In the case of fastopen it does, and an
additional complication is that the retransmit queue doesn't contain the
original frames. As a result when tcp_simple_retransmit is called and
walks the list of frames in the queue it may not mark the frames as lost
because both the SYN and the data packet each individually are smaller than
the MSS size after the adjustment. This results in the socket being stalled
until the retransmit timer kicks in and forces the SYN frame out again
without the data attached.

In order to resolve this we can reduce the MSS the packets are compared
to in tcp_simple_retransmit to -1 for cases where we are still in the
TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of
the packets related to the fastopen SYN as lost.
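
A small model of the comparison shows why -1 is used rather than the
adjusted MSS. Everything here is hypothetical (the array of segment
lengths stands in for the retransmit queue, and the SYN entry is
assumed to carry zero payload bytes); it is a sketch of the rule, not
the code in tcp_simple_retransmit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Chooses the threshold that segment lengths are compared against. */
static int pick_compare_mss(bool fastopen_in_syn_sent, int current_mss)
{
	return fastopen_in_syn_sent ? -1 : current_mss;
}

/* Marks every segment longer than mss as lost; returns how many. */
static size_t mark_lost_over_mss(const int *seg_len, bool *lost,
				 size_t n, int mss)
{
	size_t i, marked = 0;

	for (i = 0; i < n; i++) {
		if (seg_len[i] > mss) {
			lost[i] = true;
			marked++;
		}
	}
	return marked;
}

/* Usage: with the normal MSS nothing is marked; with -1 both are. */
static void simple_retransmit_selfcheck(void)
{
	const int seg_len[2] = { 0, 1400 };	/* SYN, then the data */
	bool lost[2] = { false, false };

	assert(mark_lost_over_mss(seg_len, lost, 2,
				  pick_compare_mss(false, 1400)) == 0);
	assert(mark_lost_over_mss(seg_len, lost, 2,
				  pick_compare_mss(true, 1400)) == 2);
	assert(lost[0] && lost[1]);
}
```

In this model a threshold of 0 would still skip the zero-length SYN
entry, whereas -1 is below every possible length.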

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:29:55 -08:00
Florian Westphal 049fe386d3 tcp: parse mptcp options contained in reset packets
Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.

This is the 'MPTCP fast close option'.  It can be carried on ack
segments or TCP resets.  In the latter case, mptcp options also need to be
parsed for reset packets so that MPTCP can act accordingly.

Next patch will add receive side fastclose support in MPTCP.

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Jakub Kicinski 46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Eric Dumazet 72d05c00d7 tcp: select sane initial rcvq_space.space for big MSS
Before commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.

This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.

Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp->rcv_ssthresh and tp->window_clamp.

ss -temoi on receiver would show :

skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596

This means that TCP can not increase its window in tcp_grow_window(),
and that DRS can never kick in.

Fix this by making sure that rcvq_space.space is not bigger than number of bytes
that can be held in TCP receive queue.
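
One way to picture the bound is as a three-way minimum. The sketch
below is an assumption-laden model, not the patch itself: it assumes
the receive queue capacity is represented by rcv_ssthresh, that the
previous limit was the smaller of rcv_wnd and ten segments of advmss,
and that advmss is 8948 for the MTU 9000 case. The other two numbers
are taken from the ss output above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INIT_CWND 10u	/* assumption: initial window of 10 segments */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Initial rcvq_space.space, additionally capped by rcv_ssthresh. */
static uint32_t initial_rcvq_space(uint32_t rcv_wnd, uint32_t rcv_ssthresh,
				   uint32_t advmss)
{
	uint32_t space = min_u32(rcv_wnd, TOY_INIT_CWND * advmss);

	return min_u32(space, rcv_ssthresh);
}

/* Usage: the 62496 estimate is pulled down to the 56596 ssthresh. */
static void rcvq_space_selfcheck(void)
{
	assert(initial_rcvq_space(62496, 56596, 8948) == 56596);
	assert(initial_rcvq_space(62496, 131072, 8948) == 62496);
	assert(initial_rcvq_space(200000, 131072, 1460) == 14600);
}
```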

People unable/unwilling to change their kernel can work around this issue by
selecting a bigger tcp_rmem[1] value as in :

echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem

Based on an initial report and patch from Hazem Mohamed Abuelfotoh
 https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/

Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d267 ("tcp: start receiver buffer autotuning sooner")
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:27:48 -08:00
Florian Westphal 7ea851d19b tcp: merge 'init_req' and 'route_req' functions
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.

At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards.  There are two ways to allow MPTCP to send the
reset.

1. override 'send_synack' callback and emit the rst from there.
   The drawback is that the request socket gets inserted into the
   listeners queue just to get removed again right away.

2. Send the reset from the 'route_req' function instead.
   This avoids the 'add&remove request socket', but route_req lacks the
   skb that is required to send the TCP reset.

Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.

This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.

'send reset on unknown mptcp join token' is added in next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 12:56:03 -08:00
Yuchung Cheng 7e901ee7b6 tcp: avoid slow start during fast recovery on new losses
During TCP fast recovery, the congestion control in charge is by
default the Proportional Rate Reduction (PRR) unless the congestion
control module specified otherwise (e.g. BBR).

Previously when tcp_packets_in_flight() is below snd_ssthresh PRR
would slow start upon receiving an ACK that
   1) cumulatively acknowledges retransmitted data
   and
   2) does not detect further lost retransmission

Such conditions indicate the repair is in good steady progress
after the first round trip of recovery. Otherwise PRR adopts the
packet conservation principle to send only the amount that was
newly delivered (indicated by this ACK).

This patch generalizes the previous design principle to include
also the newly sent data beside retransmission: as long as
the delivery is making good progress, both retransmission and
new data should be accounted to make PRR more cautious in slow
starting.

Suggested-by: Matt Mathis <mattmathis@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201031013412.1973112-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-02 17:17:40 -08:00
Arjun Roy 435ccfa894 tcp: Prevent low rmem stalls with SO_RCVLOWAT.
With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:

1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.

In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.

Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd<=rcv_mss, force a wakeup to prevent a stall.
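
The modified heuristic can be sketched as a predicate over the three
conditions listed above. The struct and function names are
hypothetical and the fields are plain numbers rather than socket
state, so this is a model of the decision only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the receive side. */
struct rx_model {
	uint32_t avail;		/* bytes queued for the application */
	uint32_t rcvlowat;	/* SO_RCVLOWAT threshold */
	bool rmem_pressure;	/* buffer pressure already detected */
	uint32_t rwnd;		/* window we are about to advertise */
	uint32_t rcv_mss;	/* estimated peer segment size */
};

/* True when the reader should be woken up. */
static bool should_wake_reader(const struct rx_model *rx)
{
	if (rx->avail >= rx->rcvlowat)
		return true;	/* enough data for SO_RCVLOWAT */
	if (rx->rmem_pressure)
		return true;	/* buffer pressure */
	return rx->rwnd <= rx->rcv_mss;	/* added rule: window nearly closed */
}

/* Usage: the stalled case wakes the reader, the healthy one does not. */
static void rcvlowat_selfcheck(void)
{
	struct rx_model stalled = { 1000, 65536, false, 0, 1460 };
	struct rx_model healthy = { 1000, 65536, false, 30000, 1460 };
	struct rx_model satisfied = { 70000, 65536, false, 30000, 1460 };

	assert(should_wake_reader(&stalled));
	assert(!should_wake_reader(&healthy));
	assert(should_wake_reader(&satisfied));
}
```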

Without this patch, setting tcp_rmem to 6143 and disabling TCP
autotune causes a stalled flow. With this patch, no stall occurs. This
is with RPC-style traffic with large messages.

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-23 19:11:20 -07:00
Neal Cardwell 18ded910b5 tcp: fix to update snd_wl1 in bulk receiver fast path
In the header prediction fast path for a bulk data receiver, if no
data is newly acknowledged then we do not call tcp_ack() and do not
call tcp_ack_update_window(). This means that a bulk receiver that
receives large amounts of data can have the incoming sequence numbers
wrap, so that the check in tcp_may_update_window fails:
   after(ack_seq, tp->snd_wl1)

If the incoming receive windows are zero in this state, and then the
connection that was a bulk data receiver later wants to send data,
that connection can find itself persistently rejecting the window
updates in incoming ACKs. This means the connection can persistently
fail to discover that the receive window has opened, which in turn
means that the connection is unable to send anything, and the
connection's sending process can get permanently "stuck".

The fix is to update snd_wl1 in the header prediction fast path for a
bulk data receiver, so that it keeps up and does not see wrapping
problems.
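
The wrapping problem can be reproduced with a user-space copy of the
signed-difference comparison. The helper names below are local to this
sketch, and only the after(ack_seq, tp->snd_wl1) clause quoted above is
modelled, not the full window-update check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe ordering: the sign of the 32-bit difference decides, so
 * the answer is only meaningful while the two values are less than
 * 2^31 apart. */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return seq_before(seq2, seq1);
}

/* Usage: a stale snd_wl1 makes newer sequence numbers look older. */
static void seq_wrap_selfcheck(void)
{
	uint32_t snd_wl1 = 1000;

	/* Less than 2^31 ahead of snd_wl1: the update is accepted. */
	assert(seq_after(1000u + 0x7fffffffu, snd_wl1));

	/* More than 2^31 ahead of a snd_wl1 that never moved: the
	 * newer segment now compares as older and is rejected. */
	assert(!seq_after(1000u + 0x80000001u, snd_wl1));

	/* If snd_wl1 keeps up with incoming data, the check passes. */
	snd_wl1 = 1000u + 0x80000000u;
	assert(seq_after(1000u + 0x80000001u, snd_wl1));
}
```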

This fix is based on a very nice and thorough analysis and diagnosis
by Apollon Oikonomopoulos (see link below).

This is a stable candidate but there is no Fixes tag here since the
bug predates current git history. Just for fun: looks like the bug
dates back to when header prediction was added in Linux v2.1.8 in Nov
1996. In that version tcp_rcv_established() was added, and the code
only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
receiver" code path it does not call tcp_ack(). This fix seems to
apply cleanly at least as far back as v3.2.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Apollon Oikonomopoulos <apoikos@dmesg.gr>
Tested-by: Apollon Oikonomopoulos <apoikos@dmesg.gr>
Link: https://www.spinics.net/lists/netdev/msg692430.html
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22 12:26:57 -07:00
Julia Lawall 44797589c2 tcp: use semicolons rather than commas to separate statements
Replace commas with semicolons.  Commas introduce unnecessary
variability in the code structure and are hard to see.  What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):

// <smpl>
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/1602412498-32025-4-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:11:52 -07:00
David S. Miller 8b0308fe31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Rejecting non-native endian BTF overlapped with the addition
of support for it.

The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05 18:40:01 -07:00
Yuchung Cheng 9cd8b6c905 tcp: account total lost packets properly
The retransmission refactoring patch
686989700c ("tcp: simplify tcp_mark_skb_lost")
does not properly update the total lost packet counter which may
break the policer mode in BBR. This patch fixes it.

Fixes: 686989700c ("tcp: simplify tcp_mark_skb_lost")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-03 16:57:18 -07:00
Yuchung Cheng 534a2109fb tcp: consolidate tcp_mark_skb_lost and tcp_skb_mark_lost
tcp_skb_mark_lost is used by RFC6675-SACK and can easily be replaced
with the new tcp_mark_skb_lost handler.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-25 17:17:14 -07:00
Yuchung Cheng 686989700c tcp: simplify tcp_mark_skb_lost
This patch consolidates and simplifies the loss marking logic used
by a few loss detections (RACK, RFC6675, NewReno). Previously
each detection used a subset of several intertwined subroutines.
This unnecessary complexity has led to bugs (and fixes of bug fixes).

tcp_mark_skb_lost is now the single routine that marks a packet lost
when a loss detection caller deems an skb is lost:

   1. rewind tp->retransmit_hint_skb if skb has lower sequence or
      all lost ones have been retransmitted.

   2. book-keeping: adjust flags and counts depending on if skb was
      retransmitted or not.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-25 17:17:14 -07:00
Yuchung Cheng fd2146741c tcp: move tcp_mark_skb_lost
A pure refactor to move tcp_mark_skb_lost to tcp_input.c to prepare
for the later loss marking consolidation.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-25 17:17:14 -07:00
Yuchung Cheng 179ac35f2f tcp: consistently check retransmit hint
tcp_simple_retransmit() used for path MTU discovery may not adjust
the retransmit hint properly by deducting retrans_out before checking
it to adjust the hint. This patch fixes this by using the correct routine,
tcp_mark_skb_lost(), already used by the RACK loss detection.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-25 17:17:14 -07:00
Florian Westphal 77d0cab939 net: tcp: drop unused function argument from mptcp_incoming_options
Since commit cfde141ea3 ("mptcp: move option parsing into
mptcp_incoming_options()"), the 3rd function argument is no longer used.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 20:17:01 -07:00
Priyaranjan Jha ad2b9b0f8d tcp: skip DSACKs with dubious sequence ranges
Currently, we use length of DSACKed range to compute number of
delivered packets. And if sequence range in DSACK is corrupted,
we can get bogus dsacked/acked count, and bogus cwnd.

This patch puts bounds on the DSACKed range to skip the update of data
delivery and spurious retransmission information, if the DSACK
is unlikely to have been caused by the sender's action:
- DSACKed range shouldn't be greater than maximum advertised rwnd.
- Total no. of DSACKed segments shouldn't be greater than total
  no. of retransmitted segs. Unlike spurious retransmits, network
  duplicates or corrupted DSACKs shouldn't be counted as delivery.
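
The two bounds can be sketched as a small filter. The struct, its
field names and the choice to leave the running total untouched when a
DSACK is skipped are assumptions of this model, not a description of
the patched function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-connection state used by the filter. */
struct dsack_model {
	uint32_t max_window;	/* largest rwnd the peer has advertised */
	uint32_t mss;		/* segment size used to count segments */
	uint32_t dsack_dups;	/* DSACKed segments accepted so far */
	uint32_t total_retrans;	/* segments retransmitted so far */
};

/* Returns the number of duplicate segments to credit for the range
 * [start_seq, end_seq), or 0 when the DSACK looks dubious. */
static uint32_t dsack_dup_segs(struct dsack_model *m,
			       uint32_t start_seq, uint32_t end_seq)
{
	uint32_t seq_len, dup_segs = 1;

	assert(m->mss > 0);
	if ((int32_t)(start_seq - end_seq) >= 0)
		return 0;	/* empty or reversed range */
	seq_len = end_seq - start_seq;
	if (seq_len > m->max_window)
		return 0;	/* larger than any window we were offered */
	if (seq_len > m->mss)
		dup_segs = (seq_len + m->mss - 1) / m->mss;
	if (m->dsack_dups + dup_segs > m->total_retrans)
		return 0;	/* more duplicates than retransmissions */
	m->dsack_dups += dup_segs;
	return dup_segs;
}

/* Usage: three retransmissions allow three DSACKed segments, no more. */
static void dsack_selfcheck(void)
{
	struct dsack_model m = { 65535, 1000, 0, 3 };

	assert(dsack_dup_segs(&m, 100, 2600) == 3);
	assert(dsack_dup_segs(&m, 100, 1100) == 0);
	assert(dsack_dup_segs(&m, 0, 70000) == 0);
	assert(dsack_dup_segs(&m, 500, 500) == 0);
	assert(m.dsack_dups == 3);
}
```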

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 20:15:45 -07:00
David S. Miller 6d772f328d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-09-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 95 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

The main changes are:

1) Full multi function support in libbpf, from Andrii.

2) Refactoring of function argument checks, from Lorenz.

3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

4) Program metadata support, from YiFei.

5) bpf iterator optimizations, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-23 13:11:11 -07:00
Eric Dumazet 0cbe6a8f08 tcp: remove SOCK_QUEUE_SHRUNK
SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
that remembers if some room has been made in the rtx queue
by an incoming ACK packet.

This is later used from tcp_check_space() before
considering to send EPOLLOUT.

Problem is: If we receive SACK packets, and no packet
is removed from RTX queue, we can send fresh packets, thus
moving them from write queue to rtx queue and eventually
empty the write queue.

This stall can happen if TCP_NOTSENT_LOWAT is used.

With this fix, we no longer risk stalling sends while holes
are repaired, and we can fully use socket sndbuf.

This also removes a cache line dirtying for typical RPC
workloads.

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:36:00 -07:00
Neal Cardwell 8919a9b31e tcp: Only init congestion control if not initialized already
Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.

With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.

This is an approach that has the following beneficial properties:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
    for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new CC
    algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Wei Wang e9b12edc13 tcp: record received TOS value in the request socket
A new field is added to the request sock to record the TOS value
received on the listening socket during 3WHS:
When not under syn flood, it is recording the TOS value sent in SYN.
When under syn flood, it is recording the TOS value sent in the ACK.
This is a preparation patch in order to do TOS reflection in the later
commit.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Martin KaFai Lau 267cf9fa43 tcp: bpf: Optionally store mac header in TCP_SAVE_SYN
This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header.  This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option.  Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau 0813a84156 bpf: tcp: Allow bpf prog to write and parse TCP header option
[ Note: The TCP changes here are mainly to implement the bpf
  pieces into the bpf_skops_*() functions introduced
  in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test new TCP header options
to improve the TCP performance.  Another use case is for data-centers
that have a more controlled environment and more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most
TCP packets except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can each write their own option by calling the new
helper bpf_store_hdr_opt().  The helper will ensure there is no
duplicated option in the header.
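
The duplicate check can be pictured with a small user-space sketch.
It is not the kernel helper: opt_buf, opt_exists() and store_opt() are
hypothetical names, the 40-byte buffer is just the usual TCP option
limit, and the error codes are chosen for this illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_EOL 0 /* end of option list, single byte */
#define OPT_NOP 1 /* padding, single byte */

/* Hypothetical user-space stand-in for the reserved option area. */
struct opt_buf {
	uint8_t data[40]; /* TCP options occupy at most 40 bytes */
	size_t used;      /* bytes already written */
};

/* Return 1 if an option of this kind is already present, else 0. */
static int opt_exists(const struct opt_buf *b, uint8_t kind)
{
	size_t i = 0;

	while (i < b->used) {
		uint8_t k = b->data[i];

		if (k == OPT_EOL)
			break;
		if (k == OPT_NOP) {
			i++;
			continue;
		}
		if (i + 1 >= b->used || b->data[i + 1] < 2)
			break; /* malformed, stop scanning */
		if (k == kind)
			return 1;
		i += b->data[i + 1];
	}
	return 0;
}

/* opt points at a full option: kind, length, then (length - 2) bytes. */
static int store_opt(struct opt_buf *b, const uint8_t *opt, size_t len)
{
	assert(b->used <= sizeof(b->data));

	if (len < 2 || opt[1] != len)
		return -EINVAL;
	if (opt[0] == OPT_EOL || opt[0] == OPT_NOP)
		return -EINVAL;
	if (opt_exists(b, opt[0]))
		return -EEXIST; /* refuse to write the same kind twice */
	if (len > sizeof(b->data) - b->used)
		return -ENOSPC;

	memcpy(b->data + b->used, opt, len);
	b->used += len;
	return 0;
}
```

In this sketch a second store_opt() call with the same kind fails,
while a call with a different kind is appended after the first option.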

By allowing the bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog.  Each bpf-prog can write its
own option kind.  It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header options
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

A few words on the PARSE CB flags.  When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.

The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option.  There are
detailed comments on these new cb flags in the UAPI bpf.h.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end cover the whole
TCP header and its options.  They are read only.

The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.

Please refer to the comment in UAPI bpf.h.  It has details
on what skb_data contains under different sock_ops->op.

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog.  The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later.  Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock will add weight
  to the minisock and it is not necessarily better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.

The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario.  More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt have been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
  - (a) the just received syn (available when the bpf prog is writing SYNACK)
        and it is the only way to get SYN during syncookie mode.
  or
  - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
        existing CB).

The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide these details.

Similarly, a flag "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is tested in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK.  The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can choose to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is an option that
the kernel cannot handle.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau 331fca4315 bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog.  This will be the only SYN packet available to the bpf
prog during syncookie.  For other regular cases, the bpf prog can
also use the saved_syn.

When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes.  It is done by the new
bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly.  4 byte alignment will
be included in the "*remaining" before this function returns.  The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.

Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options.  The bpf prog is only called if it has reserved space
before (opts->bpf_opt_len > 0).

The bpf prog is the last one getting a chance to reserve header space
and to write the header option.
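
The two-step reserve-then-write flow can be sketched as below.  This is
a simplified user-space model and not the kernel code: struct out_opts
stands in for struct tcp_out_options, reserve_bpf_opt() and
should_write_bpf_opt() are hypothetical names, and refusing a request
that does not fit is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct tcp_out_options. */
struct out_opts {
	uint8_t bpf_opt_len; /* 4-byte aligned space reserved for the bpf prog */
};

/*
 * requested: number of bytes the bpf prog asked for.
 * remaining: option space still free in the header; reduced on success.
 */
static void reserve_bpf_opt(struct out_opts *opts, unsigned int requested,
			    unsigned int *remaining)
{
	unsigned int aligned;

	assert(remaining != NULL);
	opts->bpf_opt_len = 0;

	if (requested == 0 || requested > *remaining)
		return; /* nothing reserved (assumption of this sketch) */

	aligned = (requested + 3) & ~3U; /* round up to a multiple of 4 */
	if (aligned > *remaining || aligned > UINT8_MAX)
		return;

	opts->bpf_opt_len = (uint8_t)aligned;
	*remaining -= aligned; /* the alignment padding is charged as well */
}

/* The write step only runs when space was reserved earlier. */
static int should_write_bpf_opt(const struct out_opts *opts)
{
	return opts->bpf_opt_len > 0;
}
```

With 20 bytes remaining, a request for 5 bytes reserves 8 bytes in this
model and leaves 12 bytes for anything that follows.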

These two functions are half implemented to highlight the changes in
TCP stack.  The actual code preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau 00d211a4ea bpf: tcp: Add bpf_skops_parse_hdr()
The patch adds a function bpf_skops_parse_hdr().
It will call the bpf prog to parse the TCP header received at
a tcp_sock that has at least reached the ESTABLISHED state.

For the packets received during the 3WHS (SYN, SYNACK and ACK),
the received skb will be available to the bpf prog during the callback
in bpf_skops_established() introduced in the previous patch and
in the bpf_skops_write_hdr_opt() that will be added in the
next patch.

Calling bpf prog to parse header is controlled by two new flags in
tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.

When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set,
the bpf prog will only be called when there is an unknown
option in the TCP header.

When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set,
the bpf prog will be called on all received TCP headers.
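
The decision described by the two flags can be sketched as follows.
This is only an illustration: should_call_parse_prog() is a
hypothetical name, the shortened flag macros and their bit values are
assumptions (the real definitions live in the UAPI bpf.h), and the
"saw_unknown" input is the bit added to struct tcp_options_received by
an earlier patch in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; the real ones are defined in the UAPI bpf.h. */
#define PARSE_ALL_HDR_OPT_CB_FLAG     (1U << 4)
#define PARSE_UNKNOWN_HDR_OPT_CB_FLAG (1U << 5)

/*
 * cb_flags:    models tp->bpf_sock_ops_cb_flags.
 * saw_unknown: true if option parsing met an option it could not handle.
 */
static bool should_call_parse_prog(uint32_t cb_flags, bool saw_unknown)
{
	if (cb_flags & PARSE_ALL_HDR_OPT_CB_FLAG)
		return true; /* called for every received header */
	if (cb_flags & PARSE_UNKNOWN_HDR_OPT_CB_FLAG)
		return saw_unknown; /* called only for unknown options */
	return false; /* both flags default to off */
}
```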

This function is half implemented to highlight the changes in
TCP stack.  The actual code preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau 72be0fe6ba bpf: tcp: Add bpf_skops_established()
In tcp_init_transfer(), it currently calls the bpf prog to give it a
chance to handle the just "ESTABLISHED" event (e.g. do setsockopt
on the newly established sk).  Right now, it is done by calling the
general purpose tcp_call_bpf().

In the later patch, it also needs to pass the just-received skb which
concludes the 3 way handshake. E.g. the SYNACK received at the active side.
The bpf prog can then learn some specific header options written by the
peer's bpf-prog and potentially do setsockopt on the newly established sk.
Thus, instead of reusing the general purpose tcp_call_bpf(), a new function
bpf_skops_established() is added to allow passing the "skb" to the bpf
prog.  The actual skb passing from bpf_skops_established() to the bpf prog
will happen together in a later patch which has the necessary bpf pieces.

A "skb" arg is also added to tcp_init_transfer() such that
it can then be passed to bpf_skops_established().

Calling the new bpf_skops_established() instead of tcp_call_bpf()
should be a noop in this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau 7656d68455 tcp: Add saw_unknown to struct tcp_options_received
In a later patch, the bpf prog only wants to be called to handle
a header option if that particular header option cannot be handled by
the kernel.  This unknown option could be written by the peer's bpf-prog.
It could also be a new standard option that the running kernel does not
support while a bpf-prog can handle it.

This patch adds a "saw_unknown" bit to "struct tcp_options_received"
and it uses an existing one byte hole to do that.  "saw_unknown" will
be set in tcp_parse_options() if it sees an option that the kernel
cannot handle.
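
A reduced user-space sketch of the idea, assuming a parser that only
knows the MSS and timestamp options: struct opts_received,
parse_options() and the KIND_* names are hypothetical, and the real
tcp_parse_options() handles many more option kinds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct tcp_options_received. */
struct opts_received {
	unsigned int saw_tstamp : 1;
	unsigned int saw_unknown : 1; /* set when an option is not handled */
};

/* The small subset of option kinds this sketch knows about. */
enum { KIND_EOL = 0, KIND_NOP = 1, KIND_MSS = 2, KIND_TIMESTAMP = 8 };

static void parse_options(const uint8_t *p, size_t len,
			  struct opts_received *rx)
{
	size_t i = 0;

	assert(rx != NULL);
	rx->saw_tstamp = 0;
	rx->saw_unknown = 0;

	while (i < len) {
		uint8_t kind = p[i];
		uint8_t optlen;

		if (kind == KIND_EOL)
			return;
		if (kind == KIND_NOP) {
			i++;
			continue;
		}
		if (i + 1 >= len)
			return; /* no room for the length byte */
		optlen = p[i + 1];
		if (optlen < 2 || optlen > len - i)
			return; /* malformed option, stop parsing */

		switch (kind) {
		case KIND_MSS:
			break;
		case KIND_TIMESTAMP:
			rx->saw_tstamp = 1;
			break;
		default:
			rx->saw_unknown = 1; /* leave it for a bpf prog */
			break;
		}
		i += optlen;
	}
}
```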

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190033.2884430-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau 70a217f197 tcp: Use a struct to represent a saved_syn
The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.

A later patch will add a bpf helper that allows getting the tcp header
alone from the saved syn without the network header.  It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it.  This requires storing the network hdrlen separately.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt.  Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn".  By using a new struct, it can
give a readable name to each individual header length.
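
A user-space sketch of such a layout, under the assumption that the
two lengths sit in front of the packet bytes in the order described
above.  The struct and helper names (saved_syn_sketch, save_syn(),
saved_syn_total_len(), saved_syn_tcp_hdr()) are made up for this
illustration and are not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the described layout; field names are assumptions. */
struct saved_syn_sketch {
	uint32_t network_hdrlen; /* IPv4/IPv6 header length in bytes */
	uint32_t tcp_hdrlen;     /* TCP header length, options included */
	uint8_t data[];          /* network header followed by TCP header */
};

/* Copy the headers of a SYN; the caller frees the result with free(). */
static struct saved_syn_sketch *save_syn(const uint8_t *pkt,
					 uint32_t net_len, uint32_t tcp_len)
{
	struct saved_syn_sketch *s;

	assert(pkt != NULL);
	s = malloc(sizeof(*s) + net_len + tcp_len);
	if (!s)
		return NULL;
	s->network_hdrlen = net_len;
	s->tcp_hdrlen = tcp_len;
	memcpy(s->data, pkt, (size_t)net_len + tcp_len);
	return s;
}

/* Total length, as still needed by the existing getsockopt usage. */
static uint32_t saved_syn_total_len(const struct saved_syn_sketch *s)
{
	return s->network_hdrlen + s->tcp_hdrlen;
}

/* Direct offset to the TCP header, no re-parsing of the network header. */
static const uint8_t *saved_syn_tcp_hdr(const struct saved_syn_sketch *s)
{
	return s->data + s->network_hdrlen;
}
```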

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
2020-08-24 14:34:59 -07:00
Jianfeng Wang 730e700e2c tcp: apply a floor of 1 for RTT samples from TCP timestamps
For retransmitted packets, TCP needs to resort to using TCP timestamps
for computing RTT samples. In the common case where the data and ACK
fall in the same 1-millisecond interval, TCP senders with millisecond-
granularity TCP timestamps compute a ca_rtt_us of 0. This ca_rtt_us
of 0 propagates to rs->rtt_us.

This value of 0 can cause performance problems for congestion control
modules. For example, in BBR, the zero min_rtt sample can bring the
min_rtt and BDP estimate down to 0, reduce snd_cwnd and result in a
low throughput. It would be hard to mitigate this with filtering in
the congestion control module, because the proper floor to apply would
depend on the method of RTT sampling (using timestamp options or
internally-saved transmission timestamps).

This fix applies a floor of 1 for the RTT sample delta from TCP
timestamps, so that seq_rtt_us, ca_rtt_us, and rs->rtt_us will be at
least 1 * (USEC_PER_SEC / TCP_TS_HZ).
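
A minimal sketch of the floor, assuming millisecond timestamp ticks.
rtt_us_from_tstamps() is a hypothetical helper written for this
illustration; the kernel applies the floor inside its existing RTT
sampling code rather than in a function of this name.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000
#define TCP_TS_HZ    1000 /* millisecond-granularity TCP timestamps */

/*
 * now_ts:    current timestamp clock value, in ticks.
 * echoed_ts: timestamp echoed back by the peer, in ticks.
 * Returns the RTT sample in microseconds, never less than one tick.
 */
static int64_t rtt_us_from_tstamps(uint32_t now_ts, uint32_t echoed_ts)
{
	uint32_t delta = now_ts - echoed_ts; /* unsigned math handles wrap */

	if (delta == 0)
		delta = 1; /* data and ACK fell into the same tick */

	return (int64_t)delta * (USEC_PER_SEC / TCP_TS_HZ);
}
```

With these constants a data/ACK pair in the same millisecond yields a
sample of 1000 usec instead of 0.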

Note that the receiver RTT computation in tcp_rcv_rtt_measure() and
min_rtt computation in tcp_update_rtt_min() both already apply a floor
of 1 timestamp tick, so this commit makes the code more consistent in
avoiding this edge case of a value of 0.

Signed-off-by: Jianfeng Wang <jfwang@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 17:54:03 -07:00
Florian Westphal 6fc8c827dd tcp: syncookies: create mptcp request socket for ACK cookies with MPTCP option
If the SYN packet contains the MP_CAPABLE option, keep it enabled.
Syncookie validation and cookie-based socket creation are changed to
instantiate an mptcp request socket if the ACK contains an MPTCP
connection request.

Rather than extend both cookie_v4/6_check, add a common helper to create
the (mp)tcp request socket.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
Florian Westphal f8ace8d915 tcp: rename request_sock cookie_ts bit to syncookie
Nowadays output function has a 'synack_type' argument that tells us when
the syn/ack is emitted via syncookies.

The request already tells us when timestamps are supported, so check
both to detect whether the special timestamp for tcp option encoding
is needed.

We could remove cookie_ts altogether, but a followup patch would
otherwise need to adjust function signatures to pass 'want_cookie' to
mptcp core.

This way, the 'existing' bit can be used.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
David S. Miller a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Yuchung Cheng 76be93fc07 tcp: allow at most one TLP probe per flight
Previously TLP may send multiple probes of new data in one
flight. This happens when the sender is cwnd limited. After the
initial TLP containing new data is sent, the sender receives another
ACK that acks partial inflight.  It may re-arm another TLP timer
to send more, if no further ACK returns before the next TLP timeout
(PTO) expires. In theory the sender may send a large number of TLP
probes until the send queue is depleted. This only happens if the
sender sees such an irregular, uncommon ACK pattern, but it is
generally undesirable behavior, especially during congestion.

The original TLP design restricts the sender to one TLP probe per
inflight, as published in "Reducing Web Latency: the Virtue of Gentle
Aggression", SIGCOMM 2013. This patch changes TLP to send at most one
probe per inflight.

Note that if the sender is app-limited, TLP retransmits old data
and does not have this issue.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 12:23:32 -07:00
Priyaranjan Jha e3a5a1e8b6 tcp: add SNMP counter for no. of duplicate segments reported by DSACK
There are two existing SNMP counters, TCPDSACKRecv and TCPDSACKOfoRecv,
which are incremented depending on whether the DSACKed range is below
the cumulative ACK sequence number or not. Unfortunately, these both
implicitly assume each DSACK covers only one segment. This makes these
counters unusable for estimating spurious retransmit rates,
or real/non-spurious loss rate.

This patch introduces a new SNMP counter, TCPDSACKRecvSegs, which tracks
the estimated number of duplicate segments based on:
(DSACKed sequence range) / MSS. This counter is usable for estimating
spurious retransmit rates, or real/non-spurious loss rate.
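
The estimate can be sketched as below.  dsack_dup_segs() is a
hypothetical helper, and both rounding up and counting any range of at
most one MSS as a single segment are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * start_seq, end_seq: the DSACKed sequence range [start_seq, end_seq).
 * mss:                the sender's current MSS in bytes, must be nonzero.
 * Returns the estimated number of duplicate segments in the range.
 */
static uint32_t dsack_dup_segs(uint32_t start_seq, uint32_t end_seq,
			       uint32_t mss)
{
	uint32_t seq_len = end_seq - start_seq; /* unsigned math handles wrap */

	assert(mss > 0);
	if (seq_len <= mss)
		return 1; /* the old one-segment assumption still holds */

	/* 64-bit intermediate avoids overflow when rounding up. */
	return (uint32_t)(((uint64_t)seq_len + mss - 1) / mss);
}
```

For example, a DSACK range of 4380 bytes with an MSS of 1460 counts as
three duplicate segments rather than one.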

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-17 12:54:30 -07:00
Priyaranjan Jha a71d77e6be tcp: fix segment accounting when DSACK range covers multiple segments
Currently, while processing DSACK, we assume DSACK covers only one
segment. This leads to significant underestimation of DSACKs with
LRO/GRO. This patch fixes segment accounting with DSACK by estimating
segment count from DSACK sequence range / MSS.

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-17 12:54:30 -07:00
Andrew Lunn 3628e3cbf9 net: ipv4: kerneldoc fixes
Simple fixes which require no deep knowledge of the code.

Cc: Paul Moore <paul@paul-moore.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13 17:20:39 -07:00
David S. Miller 71930d6102 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
All conflicts seemed rather trivial, with some guidance from
Saeed Mahameed on the tc_ct.c one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11 00:46:00 -07:00
Alexander A. Klimov 7a6498ebcd Replace HTTP links with HTTPS ones: IPv*
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
          If both the HTTP and HTTPS versions
          return 200 OK and serve the same content:
            Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06 13:23:03 -07:00
Eric Dumazet ba3bb0e76c tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
Whenever tcp_try_rmem_schedule() returns an error, we are in
trouble and should make sure to wake up readers so that they
can drain socket queues and eventually make room.

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:46:04 -07:00
Yousuk Seung ff91e9292f tcp: call tcp_ack_tstamp() when not fully acked
When skb is coalesced tcp_ack_tstamp() still needs to be called when not
fully acked in tcp_clean_rtx_queue(), otherwise SCM_TSTAMP_ACK
timestamps may never be fired. Since the original patch series had
dependent commits, this patch fixes the issue instead of reverting by
restoring calls to tcp_ack_tstamp() when skb is not fully acked.

Fixes: fdb7eb21dd ("tcp: stamp SCM_TSTAMP_ACK later in tcp_clean_rtx_queue()")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 13:40:33 -07:00
Yousuk Seung 082d4fa980 tcp: update delivered_ce with delivered
Currently tp->delivered is updated in various places in tcp_ack() but
tp->delivered_ce is updated once at the end. As a result two counts in
OPT_STATS of SCM_TSTAMP_ACK timestamps generated in tcp_ack() may not be
in sync. This patch updates both counts at the same time in tcp_ack().

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00
Yousuk Seung f00394ce60 tcp: count sacked packets in tcp_sacktag_state
Add sack_delivered to tcp_sacktag_state and count the number of sacked
and dsacked packets. This is a pure refactor for future patches to
improve tracking delivered counts.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00
Yousuk Seung c634e34f6e tcp: add ece_ack flag to reno sack functions
Pass a boolean flag that tells the ECE state of the current ack to reno
sack functions. This is a pure refactor for future patches to improve
tracking delivered counts.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00