265 Commits

Author SHA1 Message Date
Rado Vrbovsky 81ce48e690 Merge: mptcp: phase-1 backports for RHEL-9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449

JIRA: https://issues.redhat.com/browse/RHEL-62871  
JIRA: https://issues.redhat.com/browse/RHEL-58839  
JIRA: https://issues.redhat.com/browse/RHEL-66083  
JIRA: https://issues.redhat.com/browse/RHEL-66074  
CVE: CVE-2024-46711  
CVE: CVE-2024-45009  
CVE: CVE-2024-45010  
Upstream Status: All mainline in net.git  
Tested: kselftest  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:18:31 +00:00
Davide Caratti 6758e2bf77 tcp: set TCP_DEFER_ACCEPT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e97ba552b8d3dd074a28b8600740b8bed42267b

commit 6e97ba552b8d3dd074a28b8600740b8bed42267b
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:16 2023 +0000

    tcp: set TCP_DEFER_ACCEPT locklessly

    rskq_defer_accept field can be read/written without
    the need of holding the socket lock.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
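Note on fdad6e7a51: the TCP_TW_SYN hand-off described in that commit can be modelled in user space roughly as follows. This is a minimal sketch, not the kernel code. A C11 thread-local variable stands in for the per-cpu field (an assumption), and the names tw_isn_scratch, tw_compute_isn(), tw_syn_begin(), tw_isn_consume() and needs_flood_action() are hypothetical. Only the ISN arithmetic and the queue-full condition are taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thread-local stand-in for the kernel per-cpu variable (assumption). */
static _Thread_local uint32_t tw_isn_scratch;

/* ISN derived from the TIMEWAIT tw_snd_nxt; uint32_t arithmetic wraps. */
uint32_t tw_compute_isn(uint32_t tw_snd_nxt)
{
    uint32_t isn = tw_snd_nxt + 65535 + 2;

    if (isn == 0)
        isn++; /* 0 is reserved to mean "no TIMEWAIT ISN" */
    return isn;
}

/* TIMEWAIT side: stash the ISN for the SYN currently being processed. */
void tw_syn_begin(uint32_t tw_snd_nxt)
{
    tw_isn_scratch = tw_compute_isn(tw_snd_nxt);
}

/* Listener side: read the stashed ISN once, then clear it. */
uint32_t tw_isn_consume(void)
{
    uint32_t isn = tw_isn_scratch;

    tw_isn_scratch = 0;
    return isn;
}

/* A SYN that carries a TIMEWAIT ISN bypasses the queue-full check. */
bool needs_flood_action(int syncookies, bool queue_full, uint32_t isn)
{
    return (syncookies == 2 || queue_full) && !isn;
}
```

The value is set and cleared around a single SYN, so nothing has to be stored in, or cleared from, every received skb.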
Florian Westphal d3c5054dd7 net: tcp: un-pin the tw_timer
JIRA: https://issues.redhat.com/browse/RHEL-9279
Upstream Status: commit c75ad7c759cc

Context: net/ipv4/inet_timewait_sock.c net/ipv4/tcp_minisocks.c
We lack the "bhash2" table.

commit c75ad7c759ccead1be462d4cec42bd4ad3578989
Author: Florian Westphal <fw@strlen.de>
Date:   Thu Jun 6 17:11:38 2024 +0200

    net: tcp: un-pin the tw_timer

    After previous patch, even if timer fires immediately on another CPU,
    context that schedules the timer now holds the ehash spinlock, so timer
    cannot reap tw socket until ehash lock is released.

    BH disable is moved into hashdance_schedule.

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-08-21 16:57:38 +02:00
Florian Westphal e68e0d9b40 net: tcp/dccp: prepare for tw_timer un-pinning
JIRA: https://issues.redhat.com/browse/RHEL-9279
Upstream Status: commit b334b924c9b7

Conflicts: net/ipv4/tcp_minisocks.c

We lack a "struct net *net" in this function, earlier
conflict fixup used sock_net().  Resolve this by keeping
sock_net() usage as-is.

commit b334b924c9b709bc969644fb5c406f5c9d01dceb
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Jun 6 17:11:37 2024 +0200

    net: tcp/dccp: prepare for tw_timer un-pinning

    The TCP timewait timer is proving to be problematic for setups where
    scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
    statically via isolcpus=domains).

    What happens there is a CPU goes through tcp_time_wait(), arming the
    time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
    fires, causing interference for the now-isolated CPU. This is conceptually
    similar to the issue described in commit e02b93124855 ("workqueue: Unbind
    kworkers before sending them to exit()")

    Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
    lock held. Expand the lock's critical section from inet_twsk_kill() to
    inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
    the timer. IOW, this prevents the following race:

                                 tcp_time_wait()
                                   inet_twsk_hashdance()
      inet_twsk_deschedule_put()
        del_timer_sync()
                                   inet_twsk_schedule()

    Thanks to Paolo Abeni for suggesting to leverage the ehash lock.

    This also restores a comment from commit ec94c2696f ("tcp/dccp: avoid
    one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
    a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.

    inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
    with inet_twsk_hashdance_schedule().

    To ease possible regression search, actual un-pin is done in next patch.

    Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Co-developed-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-08-21 16:56:29 +02:00
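Note on e68e0d9b40: the serialization idea in that commit (hash the socket and arm the timer under one lock, and have the deschedule path take the same lock) can be illustrated with a toy user-space model. Everything here is an assumption made for illustration: the toy_* names, the C11 atomic_flag spinlock standing in for the ehash lock, and the two booleans standing in for hash membership and timer state. The real deschedule path relies on the kernel timer API, and its exact ordering relative to the lock is not shown in this log.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy timewait socket; the flag stands in for the ehash bucket lock. */
struct toy_tw {
    atomic_flag ehash_lock;
    bool hashed;      /* visible to lookups */
    bool timer_armed; /* timewait timer pending */
};

static void toy_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the current holder clears the flag */
}

static void toy_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

void toy_tw_init(struct toy_tw *tw)
{
    atomic_flag_clear(&tw->ehash_lock);
    tw->hashed = false;
    tw->timer_armed = false;
}

/* Hash the socket and arm its timer inside one critical section. */
void toy_hashdance_schedule(struct toy_tw *tw)
{
    toy_lock(&tw->ehash_lock);
    tw->hashed = true;
    tw->timer_armed = true;
    toy_unlock(&tw->ehash_lock);
}

/* Returns true if a pending timer was cancelled. */
bool toy_deschedule_put(struct toy_tw *tw)
{
    bool was_armed;

    toy_lock(&tw->ehash_lock);
    was_armed = tw->timer_armed;
    tw->timer_armed = false;
    tw->hashed = false;
    toy_unlock(&tw->ehash_lock);
    return was_armed;
}
```

Because both paths take the same lock, a descheduler can never observe a socket that is hashed but whose timer has not been armed yet, which is the window shown in the race diagram of the commit message.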
Florian Westphal 23f780623d tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp
JIRA: https://issues.redhat.com/browse/RHEL-9279
Upstream Status: commit 69e0b33a7fce

CS9 lacks both support for TCP Authentication option and usec
resolution for TCP timestamps.
Both features are out of scope, so do needed context fixups.
This change was added to reduce conflicts in the followup patch.

commit 69e0b33a7fce4d96649b9fa32e56b696921aa48e
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jun 3 15:51:06 2024 +0000

    tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp

    These fields can be read and written locklessly, add annotations
    around these minor races.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts: net/ipv4/tcp_ipv4.c net/ipv6/tcp_ipv6.c

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-08-21 16:55:25 +02:00
Antoine Tenart f676841039 tcp: rstreason: fully support in tcp_check_req()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit 11f46ea9814d2f0a3c7a5bc749d7619e47251f75
Author: Jason Xing <kernelxing@tencent.com>
Date:   Fri May 10 20:25:02 2024 +0800

    tcp: rstreason: fully support in tcp_check_req()

    We're going to send an RST due to invalid syn packet which is already
    checked whether 1) it is in sequence, 2) it is a retransmitted skb.

    As RFC 793 says, if the state of socket is not CLOSED/LISTEN/SYN-SENT,
    then we should send an RST when receiving bad syn packet:
    "fourth, check the SYN bit,...If the SYN is in the window it is an
    error, send a reset"

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Link: https://lore.kernel.org/r/20240510122502.27850-6-kerneljasonxing@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 25344d90dd rstreason: prepare for passive reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context differences due to missing upstream commits ba7783ad45c8
  ("net/tcp: Add AO sign to RST packets") and d5dfbfa2f88e ("mptcp: drop
  duplicate header inclusions") in c9s.

commit 6be49deaa09576c141002a2e6f816a1709bc2c86
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:35 2024 +0800

    rstreason: prepare for passive reset

    Adjust the parameter and support passing reason of reset which
    is for now NOT_SPECIFIED. No functional changes.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 8f346a11e7 tcp: make the dropreason really work when calling tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b9825695930546af725b1e686b8eaf4c71201728
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:26 2024 +0800

    tcp: make the dropreason really work when calling tcp_rcv_state_process()

    Update three callers including both ipv4 and ipv6 and let the dropreason
    mechanism work in reality.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart af4ae10762 net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: net-next.git

commit 4fbfde4e272065943cbcf2d016f0679456cb4f75
Author: Antoine Tenart <atenart@kernel.org>
Date:   Tue May 23 18:14:51 2023 +0200

    net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too

    Commit c67b85558f ("ipv6: tcp: send consistent autoflowlabel in
    TIME_WAIT state") made the socket txhash also available in TIME_WAIT
    sockets but for IPv6 only. Make it available for IPv4 too as we'll use
    it in later commits.

    Signed-off-by: Antoine Tenart <atenart@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:25 +02:00
Paolo Abeni d5bf598d62 tcp: tcp_check_req() can be called from process context
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 580f98cc33a260bb8c6a39ae2921b29586b84fdf
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 27 08:33:36 2023 +0000

    tcp: tcp_check_req() can be called from process context

    This is a follow up of commit 0a375c822497 ("tcp: tcp_rtx_synack()
    can be called from process context").

    Frederick Lawler reported another "__this_cpu_add() in preemptible"
    warning caused by the same reason.

    In my former patch I took care of tcp_rtx_synack()
    but forgot that tcp_check_req() also contained some SNMP updates.

    Note that some parts of tcp_check_req() always run in BH context,
    I added a comment to clarify this.

    Fixes: 8336886f78 ("tcp: TCP Fast Open Server - support TFO listeners")
    Link: https://lore.kernel.org/netdev/8cd33923-a21d-397c-e46b-2a068c287b03@cloudflare.com/T/
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Frederick Lawler <fred@cloudflare.com>
    Tested-by: Frederick Lawler <fred@cloudflare.com>
    Link: https://lore.kernel.org/r/20230227083336.4153089-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 10:01:06 +02:00
Felix Maurer 71bdaca768 bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
Bugzilla: https://bugzilla.redhat.com/2166911

commit 061ff040710e9f6f043d1fa80b1b362d2845b17a
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 29 00:04:06 2022 -0700

    bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
    
    When a bad bpf prog '.init' calls
    bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
    
    .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
    ... => .init => bpf_setsockopt(tcp_cc).
    
    It was prevented by the prog->active counter before but the prog->active
    detection cannot be used in struct_ops as explained in the earlier
    patch of the set.
    
    In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
    in order to break the loop.  This is done by using a bit of
    an existing 1 byte hole in tcp_sock to check if there is
    on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
    
    Note that this essentially means only the first '.init' can
    call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (e.g. peer
    does not support ECN) and the second '.init' cannot fall back to
    another cc.  This applies even if the second
    bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220929070407.965581-5-martin.lau@linux.dev
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:38 +01:00
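Note on 71bdaca768: the loop-breaking scheme in that commit (one bit in the socket marks an ongoing congestion-control change, and a nested attempt is refused) can be sketched in plain C. This is a simplified model under stated assumptions: struct toy_sock, the cc_change_in_progress bit, toy_set_congestion() and toy_bad_init() are hypothetical names, and returning -EBUSY for the nested call is a choice made for this sketch.

```c
#include <errno.h>
#include <stddef.h>

struct toy_sock;
typedef void (*cc_init_fn)(struct toy_sock *sk);

struct toy_sock {
    unsigned int cc_change_in_progress : 1; /* one bit, as in the commit */
    cc_init_fn init;                        /* current cc '.init' hook */
    int init_calls;                         /* bookkeeping for the demo */
    int nested_result;                      /* result of the nested call */
};

/* Switch congestion control; refuse a nested switch on the same socket. */
int toy_set_congestion(struct toy_sock *sk, cc_init_fn init)
{
    if (sk->cc_change_in_progress)
        return -EBUSY; /* second call: break the loop */

    sk->cc_change_in_progress = 1;
    sk->init = init;
    if (init)
        init(sk); /* may try to switch cc again */
    sk->cc_change_in_progress = 0;
    return 0;
}

/* A "bad" init hook that tries to select itself again. */
void toy_bad_init(struct toy_sock *sk)
{
    sk->init_calls++;
    sk->nested_result = toy_set_congestion(sk, toy_bad_init);
}
```

With the guard in place the bad hook runs exactly once; without it the call chain would recurse until the stack is exhausted.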
Guillaume Nault 01448578fa tcp: Fix a data-race around sysctl_tcp_abort_on_overflow.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 2d17d9c7382327d00aeaea35af44e9b26d53206e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:52 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_abort_on_overflow.

    While reading sysctl_tcp_abort_on_overflow, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 4c7a8d36a8 tcp: Fix a data-race around sysctl_tcp_rfc1337.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 0b484c91911e758e53656d570de58c2ed81ec6f2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:51 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_rfc1337.

    While reading sysctl_tcp_rfc1337, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Davide Caratti 728983215c tcp: Access &tcp_hashinfo via net.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 4461568aa4e5
Conflicts:
 - net/ipv4/tcp_ipv4.c: context mismatch as we don't have upstream
   commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and
   address") and 08eaef904031 ("tcp: Clean up some functions.")
 - net/ipv6/tcp_ipv6.c: context mismatch as we don't have upstream
   commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and
   address")
 - net/ipv4/tcp_minisocks.c: hunk applied manually to fix a build issue
   caused by missing upstream commit 08eaef904031 ("tcp: Clean up some
   functions.")

commit 4461568aa4e565de2c336f4875ddf912f26da8a5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Sep 7 18:10:20 2022 -0700

    tcp: Access &tcp_hashinfo via net.

    We will soon introduce an optional per-netns ehash.

    This means we cannot use tcp_hashinfo directly in most places.

    Instead, access it via net->ipv4.tcp_death_row.hashinfo.

    The access will be valid only while initialising tcp_hashinfo
    itself and creating/destroying each netns.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:59 +01:00
Paolo Abeni d5d15d9336 tcp: md5: incorrect tcp_header_len for incoming connections
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2079411
Tested: LNST, Tier1

Upstream commit:
commit 5b0b9e4c2c895227c8852488b3f09839233bba54
Author: Francesco Ruggeri <fruggeri@arista.com>
Date:   Wed Apr 20 17:50:26 2022 -0700

    tcp: md5: incorrect tcp_header_len for incoming connections

    In tcp_create_openreq_child we adjust tcp_header_len for md5 using the
    remote address in newsk. But that address is still 0 in newsk at this
    point, and it is only set later by the callers (tcp_v[46]_syn_recv_sock).
    Use the address from the request socket instead.

    Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
    Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220421005026.686A45EC01F2@us226.sjc.aristanetworks.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-12 16:55:44 +02:00
Kuniyuki Iwashima 55d444b310 tcp: Add stats for socket migration.
This commit adds two stats for the socket migration feature to evaluate the
effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).

If the migration fails because of the own_req race in receiving ACK and
sending SYN+ACK paths, we do not increment the failure stat. Then another
CPU is responsible for the req.

Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:56:08 -07:00
Kuniyuki Iwashima d4f2c86b2b tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
This patch also changes the code to call reuseport_migrate_sock() and
inet_reqsk_clone(), but unlike the other cases, we do not call
inet_reqsk_clone() right after reuseport_migrate_sock().

Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
has three kinds of refcnt:

  (A) for listener itself
  (B) carried by request_sock
  (C) sock_hold() in tcp_v[46]_rcv()

While processing the req, (A) may disappear by close(listener). Also, (B)
can disappear by accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.

For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.

In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating req after
winning the "own_req" race, we can avoid the following worst-case situation:

  CPU 1 looks up req1
  CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
  CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
  ...

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
2021-06-15 18:01:06 +02:00
Alexander Ovechkin 7233da8669 tcp: relookup sock for RST+ACK packets handled by obsolete req sock
Currently tcp_check_req can be called with an obsolete req socket for which
the big socket has already been created (because of a CPU race or early demux
assigning the req socket to multiple packets in a GRO batch).

Commit e0f9759f53 ("tcp: try to keep packet if SYN_RCV race
is lost") added a retry for the case where tcp_check_req is called for a
PSH|ACK packet. But if the client sends RST+ACK immediately after the
connection is established (it is performing a healthcheck, for example), the
retry does not occur. In that case tcp_check_req tries to close the req socket,
leaving the big socket active.

Fixes: e0f9759f53 ("tcp: try to keep packet if SYN_RCV race is lost")
Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
Reported-by: Oleg Senin <olegsenin@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15 14:34:29 -07:00
Florian Westphal 049fe386d3 tcp: parse mptcp options contained in reset packets
Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.

This is the 'MPTCP fast close option'.  It can be carried on ack
segments or TCP resets.  In the latter case, it is necessary to parse mptcp
options also for reset packets so that MPTCP can act accordingly.

Next patch will add receive side fastclose support in MPTCP.

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Paolo Abeni 90bf45134d mptcp: add new sock flag to deal with join subflows
MP_JOIN subflows must not land into the accept queue.
Currently tcp_check_req() calls an mptcp specific helper
to detect such scenario.

Such helper leverages the subflow context to check for
MP_JOIN subflows. We need to deal also with MP JOIN
failures, even when the subflow context is not available
due to allocation failure.

A possible solution would be changing the syn_recv_sock()
signature to allow returning a more descriptive action/
error code and deal with that in tcp_check_req().

Since the above need is MPTCP specific, this patch instead
uses a TCP request socket hole to add a MPTCP specific flag.
Such flag is used by the MPTCP syn_recv_sock() to tell
tcp_check_req() how to deal with the request socket.

This change is a no-op for !MPTCP build, and makes the
MPTCP code simpler. It allows also the next patch to deal
correctly with MP JOIN failure.

v1 -> v2:
 - be more conservative on drop_req initialization (Mat)

RFC -> v1:
 - move the drop_req bit inside tcp_request_sock (Eric)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
Peter Krystad f296234c98 mptcp: Add handling of incoming MP_JOIN requests
Process the MP_JOIN option in a SYN packet with the same flow
as MP_CAPABLE but when the third ACK is received add the
subflow to the MPTCP socket subflow list instead of adding it to
the TCP socket accept queue.

The subflow is added at the end of the subflow list so it will not
interfere with the existing subflows operation and no data is
expected to be transmitted on it.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Jules Irenge 734c8f7574 tcp: Add missing annotation for tcp_child_process()
Sparse reports a warning at tcp_child_process():
warning: context imbalance in tcp_child_process() - unexpected unlock
The root cause is the missing annotation at tcp_child_process().

Add the missing __releases(&((child)->sk_lock.slock)) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:19:41 -07:00
Jakub Sitnicki e80251555f tcp_bpf: Don't let child socket inherit parent protocol ops on copy
Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket protocol callbacks before it gets hashed and any
of the callbacks can get invoked.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Martin KaFai Lau 0baf26b0fc bpf: tcp: Support tcp_congestion_ops in bpf
This patch makes "struct tcp_congestion_ops" the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

The BPF implemented tcp_congestion_ops can be used like
regular kernel tcp-cc through sysctl and setsockopt.  e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic

There have been attempts to move the TCP CC to user space
(e.g. CCP in TCP).   The common arguments are faster turnaround,
getting away from long-tail kernel versions in production, etc.,
which are legit points.

BPF has been a continuous effort to join the upsides of both kernel and
userspace (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing algorithms
in production while leveraging the existing (and continually growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Eric Dumazet e0d694d638 tcp: annotate tp->snd_nxt lockless reads
There are a few places where we fetch tp->snd_nxt while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
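Note on e0d694d638 and the sibling "annotate ... lockless reads" commits in this list: the annotation pattern can be sketched in user space with simplified stand-ins for the two macros. The macro bodies below are an approximation based on volatile accesses and the GCC/Clang __typeof__ extension, not the kernel definitions, and struct toy_tcp_sock and the toy_* functions are hypothetical.

```c
#include <stdint.h>

/* Simplified user-space stand-ins (assumption): force exactly one access. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct toy_tcp_sock {
    uint32_t snd_nxt; /* written by the owner, read locklessly elsewhere */
};

/* Writer side: a single annotated store, to avoid store-tearing. */
void toy_advance_snd_nxt(struct toy_tcp_sock *tp, uint32_t len)
{
    WRITE_ONCE(tp->snd_nxt, tp->snd_nxt + len);
}

/* Lockless reader side: a single annotated load. */
uint32_t toy_peek_snd_nxt(struct toy_tcp_sock *tp)
{
    return READ_ONCE(tp->snd_nxt);
}
```

The annotations do not add ordering or locking. They only tell the compiler not to split, merge or repeat the access, and they document that the race is intentional.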
Eric Dumazet 0f31746452 tcp: annotate tp->write_seq lockless reads
There are a few places where we fetch tp->write_seq while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet 7db48e9839 tcp: annotate tp->copied_seq lockless reads
There are a few places where we fetch tp->copied_seq while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet dba7d9b8c7 tcp: annotate tp->rcv_nxt lockless reads
There are a few places where we fetch tp->rcv_nxt while
this field can change from IRQ or another cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)

syzbot reported :

BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv

write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
 tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
 tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
 tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
 tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
 ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
 napi_skb_finish net/core/dev.c:5671 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061

read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
 tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
 tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
 sock_poll+0xed/0x250 net/socket.c:1256
 vfs_poll include/linux/poll.h:90 [inline]
 ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
 ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
 ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
 ep_send_events fs/eventpoll.c:1793 [inline]
 ep_poll+0xe3/0x900 fs/eventpoll.c:1930
 do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
 __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
 __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
 __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
 do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet d983ea6f16 tcp: add rcu protection around tp->fastopen_rsk
Both tcp_v4_err() and tcp_v6_err() do the following operations
while they do not own the socket lock :

	fastopen = tp->fastopen_rsk;
 	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

The problem is that without an appropriate barrier, the compiler
might reload tp->fastopen_rsk and trigger a NULL deref.

Request sockets are protected by RCU, so we can simply add
the missing annotations and barriers to solve the issue.
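
A minimal userspace sketch of the single-load idea, for illustration only.
The struct names and the READ_ONCE_SKETCH macro are hypothetical stand-ins;
the kernel's RCU accessors and barriers are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for a "read exactly once" accessor: the volatile
 * access stops the compiler from reloading the pointer after the test.
 * Uses the GNU C __typeof__ extension (gcc, clang). */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct req_sketch {
	uint32_t snt_isn;
};

struct tp_sketch {
	struct req_sketch *fastopen_rsk; /* may be cleared concurrently */
	uint32_t snd_una;
};

/* Load the pointer once into a local, then test and dereference only
 * that local copy, so a concurrent NULL store cannot slip in between. */
static uint32_t err_snd_una(struct tp_sketch *tp)
{
	struct req_sketch *fastopen = READ_ONCE_SKETCH(tp->fastopen_rsk);

	return fastopen ? fastopen->snt_isn : tp->snd_una;
}
```

The sketch covers only the reload hazard; keeping the pointed-to request
socket alive while it is dereferenced is what RCU provides in the kernel.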

Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet f6c0f5d209 tcp: honor SO_PRIORITY in TIME_WAIT state
ctl packets sent on behalf of TIME_WAIT sockets currently
have a zero skb->priority, which can cause various problems.

In this patch we :

- add a tw_priority field in struct inet_timewait_sock.

- populate it from sk->sk_priority when a TIME_WAIT is created.

- For IPv4, change ip_send_unicast_reply() and its two
  callers to propagate tw_priority correctly.
  ip_send_unicast_reply() no longer changes sk->sk_priority.

- For IPv6, make sure TIME_WAIT sockets pass their tw_priority
  field to tcp_v6_send_response() and tcp_v6_send_ack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
Eric Dumazet a842fe1425 tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.

Linux offers the netem module, but it has impractical constraints :
- Needs root access to change the qdisc
- Hard to set up on egress if combined with a non-trivial qdisc like FQ
- Single delay for all flows.

EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.

Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.

This requires the FQ packet scheduler or an EDT-enabled NIC.

This patch adds the TCP_TX_DELAY socket option, to set a delay in
usec units.

  unsigned int tx_delay = 10000; /* 10 msec */

  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
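
A slightly fuller sketch of the same call with error reporting, assuming
a Linux userspace build. The helper name set_tx_delay_usec() is
hypothetical; the fallback define covers libc headers that predate the
option, and IPPROTO_TCP is used in place of SOL_TCP (same value).

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Older libc headers may lack the constant; 37 is the value in the
 * Linux UAPI header <linux/tcp.h>. */
#ifndef TCP_TX_DELAY
#define TCP_TX_DELAY 37
#endif

/* Hypothetical helper: returns 0 on success, or -1 with errno set
 * (for example ENOPROTOOPT on kernels without the option). */
static int set_tx_delay_usec(int fd, unsigned int usec)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_TX_DELAY, &usec, sizeof(usec));
}
```

A caller would pass a connected or listening TCP socket, for example
set_tx_delay_usec(fd, 10000) for 10 msec, and check the return value.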

Note that FQ packet scheduler limits might need some tweaking :

man tc-fq

PARAMETERS
   limit
       Hard  limit  on  the  real  queue  size. When this limit is
       reached, new packets are dropped. If the value is  lowered,
       packets  are  dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum  number  of  packets  queued  per
       flow.  Default value is 100.

Use of the TCP_TX_DELAY option will increase the number of skbs in the FQ
qdisc, so packets would be dropped if either of the previous limits is hit.

Use of a jump label makes this support runtime-free, for hosts
never using the option.

Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account for the fact that artificially
delayed skbs won't stop us providing more skbs to feed the pipe (netem
uses skb_orphan_partial() for this purpose, but FQ cannot use this trick)

Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 13:05:43 -07:00
Eric Dumazet c67b85558f ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
When autoflowlabel is in action, the flowlabel is derived from the
non-zero skb->hash returned by skb_get_hash_flowi6().

If skb->hash is zero, a flow dissection is performed.

Since all TCP skbs sent from ESTABLISHED state inherit their
skb->hash from sk->sk_txhash, we had better keep a copy
of sk->sk_txhash in the TIME_WAIT socket.

After this patch, ACK or RST packets sent on behalf of
a TIME_WAIT socket have the flowlabel that was previously
used by the flow.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:10:19 -07:00
Thomas Gleixner 457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Yuchung Cheng 336c39a031 tcp: undo init congestion window on false SYNACK timeout
Linux implements RFC6298 and uses an initial congestion window
of 1 upon establishing the connection if the SYNACK packet is
retransmitted 2 or more times. In cellular networks SYNACK timeouts
are often spurious if the wireless radio was dormant or idle. Also,
some network paths are longer than the default SYNACK timeout. In
both cases, falsely starting with a minimal cwnd is detrimental
to performance.

This patch avoids doing so when the final ACK's TCP timestamp
indicates the original SYNACK was delivered. It remembers the
original SYNACK timestamp when SYNACK timeout has occurred and
re-uses the function to detect spurious SYN timeout conveniently.
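
A rough sketch of such a check, assuming that "delivered" means the
timestamp echoed by the final ACK equals the timestamp value carried by
the original SYNACK. All names below are hypothetical and do not mirror
the kernel's fields or functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inputs for the check:
 *   orig_synack_ts - timestamp value sent in the original SYNACK,
 *                    remembered only if a SYNACK timeout occurred
 *                    (0 means "no timeout recorded")
 *   saw_tstamp     - the final ACK carried a TCP timestamp option
 *   ack_tsecr      - timestamp echo reply found in that option
 */
static bool synack_timeout_was_spurious(uint32_t orig_synack_ts,
					bool saw_tstamp,
					uint32_t ack_tsecr)
{
	/* An echo of the original SYNACK's timestamp means the peer
	 * answered the first SYNACK, so it was delayed rather than lost. */
	return orig_synack_ts != 0 && saw_tstamp &&
	       ack_tsecr == orig_synack_ts;
}
```

When the check succeeds, the connection would keep the default initial
congestion window instead of starting from 1.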

Note that a server may receive multiple SYNs and immediately
retransmit SYNACKs without any SYNACK timeout. This often happens
when the client SYNs have timed out due to the wireless delay
described above. In this case the server will still use the default
initial congestion window (e.g. 10) because tp->undo_marker is reset in
tcp_init_metrics(). This is an intentional design because packets
are not lost but delayed.

This patch only covers regular TCP passive open. Fast Open is
supported in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Eric Dumazet 6aedbf986f tcp: use tcp_md5_needed for timewait sockets
This might speed up tcp_twsk_destructor() a bit,
avoiding a cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 13:16:03 -08:00
Eric Dumazet 6bcdc40ddd tcp: move rx_opt & syn_data_acked init to tcp_disconnect()
If we make sure all listeners have these fields cleared, then a clone
will also inherit zero values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet 792c4354a5 tcp: move tp->rack init to tcp_disconnect()
If we make sure all listeners have proper tp->rack value,
then a clone will also inherit proper initial value.

Note that fresh sockets init tp->rack from tcp_init_sock()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet 6cda8b7493 tcp: move app_limited init to tcp_disconnect()
If we make sure all listeners have app_limited set to ~0U,
then a clone will also inherit proper initial value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet 5c701549c9 tcp: move retrans_out, sacked_out, tlp_high_seq, last_oow_ack_time init to tcp_disconnect()
If we make sure all listeners have these fields cleared, then a clone
will also inherit zero values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet 5d83676462 tcp: do not clear urg_data in tcp_create_openreq_child
All listeners have this field cleared already, since tcp_disconnect()
clears it and newly created sockets also have a zero value here.

So a clone will inherit a zero value here.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet 3a9a57f637 tcp: move snd_cwnd & snd_cwnd_cnt init to tcp_disconnect()
Passive connections can inherit proper value by cloning,
if we make sure all listeners have the proper values there.

tcp_disconnect() was setting snd_cwnd to 2, which seems
quite obsolete since IW10 adoption.

Also remove an obsolete comment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet b9e2e689aa tcp: move mdev_us init to tcp_disconnect()
If we make sure a listener always has its mdev_us
field set to TCP_TIMEOUT_INIT, we do not need to rewrite
this field after a new clone is created.

tcp_disconnect() is very seldom used in real applications.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet a0070e463f tcp: do not clear srtt_us in tcp_create_openreq_child
All listeners have this field cleared already, since tcp_disconnect()
clears it and newly created sockets also have a zero value here.

So a clone will inherit a zero value here.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:05 -08:00
Eric Dumazet eb2c80ca87 tcp: do not clear packets_out in tcp_create_openreq_child()
New sockets have this field cleared, and tcp_disconnect()
calls tcp_write_queue_purge() which, among other things,
also clears tp->packets_out.

So a listener is guaranteed to have this field cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:04 -08:00
Eric Dumazet 6a408147ea tcp: move icsk_rto init to tcp_disconnect()
If we make sure a listener always has its icsk_rto
field set to TCP_TIMEOUT_INIT, we do not need to rewrite
this field after a new clone is created.

tcp_disconnect() is very seldom used in real applications.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:04 -08:00
Eric Dumazet b84235e291 tcp: do not set snd_ssthresh in tcp_create_openreq_child()
New sockets get the field set to TCP_INFINITE_SSTHRESH in tcp_init_sock().
In case a socket had this field changed and transitions to TCP_LISTEN
state, tcp_disconnect() also makes sure snd_ssthresh is set to
TCP_INFINITE_SSTHRESH.

So a listener has this field set to TCP_INFINITE_SSTHRESH already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 22:19:04 -08:00
Florian Westphal 63cc357f7b tcp: do not restart timewait timer on rst reception
RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when trying to re-use ports, or very long
delays (until the SYN retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// Read 1000 bytes
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.
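
The intended decision for a segment that reaches this point in TIME-WAIT
processing can be modelled roughly as below. This is a simplified sketch
with hypothetical names, not the kernel code.

```c
#include <stdbool.h>

/* Hypothetical outcome labels, not kernel identifiers. */
enum tw_action {
	TW_KILL,          /* remove the TIME-WAIT socket immediately */
	TW_KEEP_TIMER,    /* leave the socket and its running timer alone */
	TW_RESTART_TIMER  /* re-arm the TIME-WAIT timer */
};

/* is_rst:  the received segment has the RST flag set
 * rfc1337: value of net.ipv4.tcp_rfc1337 (true when set to 1) */
static enum tw_action tw_segment_action(bool is_rst, bool rfc1337)
{
	if (is_rst)
		return rfc1337 ? TW_KEEP_TIMER : TW_KILL;

	/* The "else" branch: only non-RST segments restart the timer. */
	return TW_RESTART_TIMER;
}
```

With the else branch lost, the RST case with rfc1337=1 would also fall
through to the timer restart, which is the behaviour the test above
demonstrates.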

Reported-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:10:35 -07:00
Arnd Bergmann cca9bab1b7 tcp: use monotonic timestamps for PAWS
Using get_seconds() for timestamps is deprecated since it can lead
to overflows on 32-bit systems. While the interface generally doesn't
overflow until year 2106, the specific implementation of the TCP PAWS
algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
overflow.

A related problem is that the local timestamps in CLOCK_REALTIME form
lead to unexpected behavior when settimeofday is called to set the system
clock backwards or forwards by more than 24 days.

While the first problem could be solved by using an overflow-safe method
of comparing the timestamps, a nicer solution is to use a monotonic
clocksource with ktime_get_seconds() that simply doesn't overflow (at
least not until 136 years after boot) and that doesn't change during
settimeofday().
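
For reference, the overflow-safe comparison mentioned as the alternative
(not the approach this patch takes) could look roughly like the sketch
below. The helper name is hypothetical, and the cast assumes the usual
two's-complement conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if at least 'span' seconds separate 'stamp' and 'now',
 * where both are 32-bit second counters that may wrap around.
 * The unsigned subtraction is well defined across the wrap; the signed
 * cast keeps the result meaningful while the true distance stays below
 * 2^31 seconds. */
static bool elapsed_at_least(uint32_t now, uint32_t stamp, int32_t span)
{
	return (int32_t)(now - stamp) >= span;
}
```

A naive test such as now >= stamp + span gives the wrong answer once
stamp + span wraps past 2^32, whereas the difference-based form does not.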

To make 32-bit and 64-bit architectures behave the same way here, and
also save a few bytes in the tcp_options_received structure, I'm changing
the type to a 32-bit integer, which is now safe on all architectures.

Finally, the ts_recent_stamp field also (confusingly) gets used to store
a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
This is currently safe, but changing the type to 32-bit requires
some small changes there to keep it working.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 14:50:40 -07:00
Eric Dumazet 242b1bbe51 tcp: remove one indentation level in tcp_create_openreq_child
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 16:02:31 +09:00