Commit Graph

1175 Commits

Sabrina Dubroca 1eacc871f3 tcp: drop secpath at the same time as we currently drop dst
JIRA: https://issues.redhat.com/browse/RHEL-69649
JIRA: https://issues.redhat.com/browse/RHEL-83224
CVE: CVE-2025-21864

commit 9b6412e6979f6f9e0632075f8f008937b5cd4efd
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Mon Feb 17 11:23:35 2025 +0100

    tcp: drop secpath at the same time as we currently drop dst

    Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
    running tests that boil down to:
     - create a pair of netns
     - run a basic TCP test over ipcomp6
     - delete the pair of netns

    The xfrm_state found on spi_byaddr was not deleted at the time we
    delete the netns, because we still have a reference on it. This
    lingering reference comes from a secpath (which holds a ref on the
    xfrm_state), which is still attached to an skb. This skb is not
    leaked, it ends up on sk_receive_queue and then gets defer-free'd by
    skb_attempt_defer_free.

    The problem happens when we defer freeing an skb (push it on one CPU's
    defer_list), and don't flush that list before the netns is deleted. In
    that case, we still have a reference on the xfrm_state that we don't
    expect at this point.

    We already drop the skb's dst in the TCP receive path when it's no
    longer needed, so let's also drop the secpath. At this point,
    tcp_filter has already called into the LSM hooks that may require the
    secpath, so it should not be needed anymore. However, in some of those
    places, the MPTCP extension has just been attached to the skb, so we
    cannot simply drop all extensions.
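
    A minimal sketch of the idea, assuming a small helper is used at the
    points where the dst is already dropped (helper name illustrative, not
    necessarily the upstream one):

        /* Drop the per-skb state that pins netns objects before the skb is
         * queued and possibly defer-freed: the dst and the secpath. The
         * MPTCP extension must survive, so we cannot clear all extensions.
         */
        static inline void tcp_cleanup_skb(struct sk_buff *skb)
        {
                skb_dst_drop(skb);
                secpath_reset(skb);
        }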

    Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
    Reported-by: Xiumei Mu <xmu@redhat.com>
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2025-03-20 10:11:45 +01:00
Rado Vrbovsky 65ee7b65eb Merge: net: visibility patches for 9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5833

JIRA: https://issues.redhat.com/browse/RHEL-68063

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2025-01-06 08:26:06 +00:00
Rado Vrbovsky 81ce48e690 Merge: mptcp: phase-1 backports for RHEL-9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449

JIRA: https://issues.redhat.com/browse/RHEL-62871  
JIRA: https://issues.redhat.com/browse/RHEL-58839  
JIRA: https://issues.redhat.com/browse/RHEL-66083  
JIRA: https://issues.redhat.com/browse/RHEL-66074  
CVE: CVE-2024-46711  
CVE: CVE-2024-45009  
CVE: CVE-2024-45010  
Upstream Status: All mainline in net.git  
Tested: kselftest  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:18:31 +00:00
Antoine Tenart f1eac6da54 net: tcp: Add noinline_for_tracing annotation for tcp_drop_reason()
JIRA: https://issues.redhat.com/browse/RHEL-68063
Upstream Status: net-next.git

commit dbd5e2e79ed8653ac2ae255e42d1189283343a0c
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Thu Oct 24 17:37:42 2024 +0800

    net: tcp: Add noinline_for_tracing annotation for tcp_drop_reason()

    We previously hooked the tcp_drop_reason() function using BPF to monitor
    TCP drop reasons. However, after upgrading our compiler from GCC 9 to GCC
    11, tcp_drop_reason() is now inlined, preventing us from hooking into it.
    To address this, it is beneficial to mark the function explicitly as
    noinline for tracing.
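
    A rough sketch of the annotation and its use (the exact body of
    tcp_drop_reason() may differ):

        /* Self-documenting alias: keep the function out of line so that
         * BPF/tracing can reliably attach to it.
         */
        #define noinline_for_tracing noinline

        static noinline_for_tracing void tcp_drop_reason(struct sock *sk,
                                                         struct sk_buff *skb,
                                                         enum skb_drop_reason reason)
        {
                sk_drops_add(sk, skb);
                sk_skb_reason_drop(sk, skb, reason);
        }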

    Link: https://lore.kernel.org/netdev/CANn89iJuShCmidCi_ZkYABtmscwbVjhuDta1MS5LxV_4H9tKOA@mail.gmail.com/
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Cc: Menglong Dong <menglong8.dong@gmail.com>
    Link: https://patch.msgid.link/20241024093742.87681-3-laoar.shao@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-11-19 15:34:58 +01:00
Davide Caratti 6758e2bf77 tcp: set TCP_DEFER_ACCEPT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e97ba552b8d3dd074a28b8600740b8bed42267b

commit 6e97ba552b8d3dd074a28b8600740b8bed42267b
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:16 2023 +0000

    tcp: set TCP_DEFER_ACCEPT locklessly

    rskq_defer_accept field can be read/written without
    the need of holding the socket lock.
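
    A rough sketch of the lockless pattern (not the literal upstream hunk):
    the setsockopt path publishes the value with WRITE_ONCE() and readers
    pair it with READ_ONCE(), so lock_sock() is no longer required:

        case TCP_DEFER_ACCEPT:
                /* Translate value in seconds to number of retransmits */
                WRITE_ONCE(inet_csk(sk)->icsk_accept_queue.rskq_defer_accept,
                           secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
                                           TCP_RTO_MAX / HZ));
                return 0;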

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti a89122fa2a tcp: set TCP_LINGER2 locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71

commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:15 2023 +0000

    tcp: set TCP_LINGER2 locklessly

    tp->linger2 can be set locklessly as long as readers
    use READ_ONCE().
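
    For example, a reader then looks like this (sketch):

        /* Pair the lockless write with READ_ONCE() so the compiler cannot
         * tear or re-load the value.
         */
        int linger2 = READ_ONCE(tp->linger2);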

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Rado Vrbovsky 392bdee116 Merge: net: tcp: accept old ack during closing
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5359

JIRA: https://issues.redhat.com/browse/RHEL-60572

    795a7dfbc3d9 ("net: tcp: accept old ack during closing")

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Sterling Alexander <stalexan@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-06 08:21:42 +00:00
Rado Vrbovsky 384fd7eadc Merge: tcp: stable backports for 9.6 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5444

JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Several stable backports for the TCP protocol, addressing sparse corner-case issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-01 08:13:57 +00:00
Paolo Abeni 4111dedcfe tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context as rhel-9 lacks the upstream commit \
  3868ab0f1925 ("tcp: new TCP_INFO stats for RTO events")

Upstream commit:
commit 27c80efcc20486c82698f05f00e288b44513c86b
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:17 2024 +0000

    tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out

    Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp
    if retransmits are outstanding.

    tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll
    need to zero retrans_stamp here to prevent spurious
    retransmits_timed_out(). The logic to zero retrans_stamp is from this
    2019 commit:

    commit cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")

    However, in the corner case where the ACK of our TFO SYNACK carried
    some SACK blocks that caused us to enter TCP_CA_Recovery then that
    non-zero retrans_stamp corresponds to the active fast recovery, and we
    need to leave retrans_stamp with its current non-zero value, for
    correct ETIMEDOUT and undo behavior.
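
    A sketch of the resulting guard in tcp_rcv_synrecv_state_fastopen()
    (hedged; the exact predicate may differ):

        /* Only reset the stamp left by tcp_fastopen_synack_timer() when no
         * retransmitted data is still in flight; otherwise the stamp belongs
         * to the ongoing fast recovery and is needed for undo/ETIMEDOUT.
         */
        if (!tcp_any_retrans_done(sk))
                tp->retrans_stamp = 0;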

    Fixes: cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:17 +02:00
Paolo Abeni dcea8d7793 tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b41b4cbd9655bcebcce941bef3601db8110335be
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:16 2024 +0000

    tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe

    Fix tcp_enter_recovery() so that if there are no retransmits out then
    we zero retrans_stamp when entering fast recovery. This is necessary
    to fix two buggy behaviors.

    Currently a non-zero retrans_stamp value can persist across multiple
    back-to-back loss recovery episodes. This is because we generally only
    clear retrans_stamp if we are completely done with loss recoveries,
    and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This
    behavior causes two bugs:

    (1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed
    immediately by a new CA_Recovery, the retrans_stamp value can persist
    and can be a time before this new CA_Recovery episode starts. That
    means that timestamp-based undo will be using the wrong retrans_stamp
    (a value that is too old) when comparing incoming TS ecr values to
    retrans_stamp to see if the current fast recovery episode can be
    undone.

    (2) If there is a roughly minutes-long sequence of back-to-back fast
    recovery episodes, one after another (e.g. in a shallow-buffered or
    policed bottleneck), where each fast recovery successfully makes
    forward progress and recovers one window of sequence space (but leaves
    at least one retransmit in flight at the end of the recovery),
    followed by several RTOs, then the ETIMEDOUT check may be using the
    wrong retrans_stamp (a value set at the start of the first fast
    recovery in the sequence). This can cause a very premature ETIMEDOUT,
    killing the connection prematurely.

    This commit changes the code to zero retrans_stamp when entering fast
    recovery, when this is known to be safe (no retransmits are out in the
    network). That ensures that when starting a fast recovery episode, and
    it is safe to do so, retrans_stamp is set when we send the fast
    retransmit packet. That addresses both bug (1) and bug (2) by ensuring
    that (if no retransmits are out when we start a fast recovery) we use
    the initial fast retransmit of this fast recovery as the time value
    for undo and ETIMEDOUT calculations.

    This makes intuitive sense, since the start of a new fast recovery
    episode (in a scenario where no lost packets are out in the network)
    means that the connection has made forward progress since the last RTO
    or fast recovery, and we should thus "restart the clock" used for both
    undo and ETIMEDOUT logic.

    Note that if when we start fast recovery there *are* retransmits out
    in the network, there can still be undesirable (1)/(2) issues. For
    example, after this patch we can still have the (1) and (2) problems
    in cases like this:

    + round 1: sender sends flight 1

    + round 2: sender receives SACKs and enters fast recovery 1,
      retransmits some packets in flight 1 and then sends some new data as
      flight 2

    + round 3: sender receives some SACKs for flight 2, notes losses, and
      retransmits some packets to fill the holes in flight 2

    + fast recovery has some lost retransmits in flight 1 and continues
      for one or more rounds sending retransmits for flight 1 and flight 2

    + fast recovery 1 completes when snd_una reaches high_seq at end of
      flight 1

    + there are still holes in the SACK scoreboard in flight 2, so we
      enter fast recovery 2, but some retransmits in the flight 2 sequence
      range are still in flight (retrans_out > 0), so we can't execute the
      new retrans_stamp=0 added here to clear retrans_stamp

    It's not yet clear how to fix these remaining (1)/(2) issues in an
    efficient way without breaking undo behavior, given that retrans_stamp
    is currently used for undo and ETIMEDOUT. Perhaps the optimal (but
    expensive) strategy would be to set retrans_stamp to the timestamp of
    the earliest outstanding retransmit when entering fast recovery. But
    at least this commit makes things better.

    Note that this does not change the semantics of retrans_stamp; it
    simply makes retrans_stamp accurate in some cases where it was not
    before:

    (1) Some loss recovery, followed by an immediate entry into a fast
    recovery, where there are no retransmits out when entering the fast
    recovery.

    (2) When a TFO server has a SYNACK retransmit that sets retrans_stamp,
    and then the ACK that completes the 3-way handshake has SACK blocks
    that trigger a fast recovery. In this case when entering fast recovery
    we want to zero out the retrans_stamp from the TFO SYNACK retransmit,
    and set the retrans_stamp based on the timestamp of the fast recovery.

    We introduce a tcp_retrans_stamp_cleanup() helper, because this
    two-line sequence already appears in 3 places and is about to appear
    in 2 more as a result of this bug fix patch series. Once this bug fix
    patches series in the net branch makes it into the net-next branch
    we'll update the 3 other call sites to use the new helper.
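
    Roughly, the helper boils down to (sketch):

        /* Clear the retransmission timestamp if no retransmitted data is
         * outstanding, so undo and ETIMEDOUT logic start a fresh clock.
         */
        static void tcp_retrans_stamp_cleanup(struct sock *sk)
        {
                if (!tcp_any_retrans_done(sk))
                        tcp_sk(sk)->retrans_stamp = 0;
        }

    tcp_enter_recovery() then calls it before initializing the new recovery
    episode.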

    This is a long-standing issue. The Fixes tag below is chosen to be the
    oldest commit at which the patch will apply cleanly, which is from
    Linux v3.5 in 2012.

    Fixes: 1fbc340514 ("tcp: early retransmit: tcp_enter_recovery()")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:17 +02:00
Paolo Abeni b32e835fe5 tcp: fix to allow timestamp undo if no retransmits were sent
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit e37ab7373696e650d3b6262a5b882aadad69bb9e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:15 2024 +0000

    tcp: fix to allow timestamp undo if no retransmits were sent

    Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
    it can trigger undo even if TSQ prevents a fast recovery episode from
    reaching tcp_retransmit_skb().

    Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
    this commit from 2019:

    commit bc9f38c832 ("tcp: avoid unconditional congestion window undo
    on SYN retransmit")

    ...and before this fix we could have buggy scenarios like the
    following:

    + Due to reordering, a TCP connection receives some SACKs and enters a
      spurious fast recovery.

    + TSQ prevents all invocations of tcp_retransmit_skb(), because many
      skbs are queued in lower layers of the sending machine's network
      stack; thus tp->retrans_stamp remains 0.

    + The connection receives a TCP timestamp ECR value echoing a
      timestamp before the fast recovery, indicating that the fast
      recovery was spurious.

    + The connection fails to undo the spurious fast recovery because
      tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
      due to the new logic in the 2019 commit: commit bc9f38c832 ("tcp:
      avoid unconditional congestion window undo on SYN retransmit")

    This fix tweaks the logic to be more similar to the
    tcp_packet_delayed() logic before bc9f38c832, except that we take
    care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
    tp->retrans_stamp (the bug noted and fixed by Yuchung in
    bc9f38c832).

    Note that this returns the high-level behavior of tcp_packet_delayed()
    to again match the comment for the function, which says: "Nothing was
    retransmitted or returned timestamp is less than timestamp of the
    first retransmission." Note that this comment is in the original
    2005-04-16 Linux git commit, so this is evidently long-standing
    behavior.

    Fixes: bc9f38c832 ("tcp: avoid unconditional congestion window undo on SYN retransmit")
    Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:17 +02:00
Paolo Abeni 7a18dd824a tcp: Update window clamping condition
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit a2cbb1603943281a604f5adc48079a148db5cb0d
Author: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Date:   Thu Aug 8 16:06:40 2024 -0700

    tcp: Update window clamping condition

    This patch is based on the discussions between Neal Cardwell and
    Eric Dumazet in the link
    https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/

    It was correctly pointed out that tp->window_clamp would not be
    updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if
    (copied <= tp->rcvq_space.space). While it is expected for most
    setups to leave the sysctl enabled, the latter condition may
    not end up hitting depending on the TCP receive queue size and
    the pattern of arriving data.

    The updated check should be hit only on initial MSS update from
    TCP_MIN_MSS to measured MSS value and subsequently if there was
    an update to a larger value.

    Fixes: 05f76b2d634e ("tcp: Adjust clamping window for applications specifying SO_RCVBUF")
    Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
    Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:16 +02:00
Paolo Abeni 9af45f55fd tcp: Adjust clamping window for applications specifying SO_RCVBUF
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: tcp_rcv_space_adjust() lacks the ONCE annotation while \
  writing 'window_clamp' since rhel-9 lacks the upstream commit \
  f410cbea9f3d ("tcp: annotate data-races around tp->window_clamp")

Upstream commit:
commit 05f76b2d634e65ab34472802d9b142ea9e03f74e
Author: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Date:   Fri Jul 26 13:41:05 2024 -0700

    tcp: Adjust clamping window for applications specifying SO_RCVBUF

    tp->scaling_ratio is not updated based on skb->len/skb->truesize once
    SO_RCVBUF is set leading to the maximum window scaling to be 25% of
    rcvbuf after
    commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    and 50% of rcvbuf after
    commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio").
    50% tries to emulate the behavior of older kernels using
    sysctl_tcp_adv_win_scale with default value.

    Systems which were using a different values of sysctl_tcp_adv_win_scale
    in older kernels ended up seeing reduced download speeds in certain
    cases as covered in https://lists.openwall.net/netdev/2024/05/15/13
    While the sysctl scheme is no longer acceptable, the value of 50% is
    a bit conservative when the skb->len/skb->truesize ratio is later
    determined to be ~0.66.

    Applications not specifying SO_RCVBUF update the window scaling and
    the receiver buffer every time data is copied to userspace. This
    computation is now used for applications setting SO_RCVBUF to update
    the maximum window scaling while ensuring that the receive buffer
    is within the application specified limit.

    Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
    Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:16 +02:00
Paolo Abeni c35b38b5ea tcp: fix incorrect undo caused by DSACK of TLP retransmit
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 0ec986ed7bab6801faed1440e8839dcc710331ff
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Jul 3 13:12:46 2024 -0400

    tcp: fix incorrect undo caused by DSACK of TLP retransmit

    Loss recovery undo_retrans bookkeeping had a long-standing bug where a
    DSACK from a spurious TLP retransmit packet could cause an erroneous
    undo of a fast recovery or RTO recovery that repaired a single
    really-lost packet (in a sequence range outside that of the TLP
    retransmit). Basically, because the loss recovery state machine didn't
    account for the fact that it sent a TLP retransmit, the DSACK for the
    TLP retransmit could erroneously be implicitly interpreted as
    corresponding to the normal fast recovery or RTO recovery retransmit
    that plugged a real hole, thus resulting in an improper undo.

    For example, consider the following buggy scenario where there is a
    real packet loss but the congestion control response is improperly
    undone because of this bug:

    + send packets P1, P2, P3, P4
    + P1 is really lost
    + send TLP retransmit of P4
    + receive SACK for original P2, P3, P4
    + enter fast recovery, fast-retransmit P1, increment undo_retrans to 1
    + receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!)
    + receive cumulative ACK for P1-P4 (fast retransmit plugged real hole)

    The fix: when we initialize undo machinery in tcp_init_undo(), if
    there is a TLP retransmit in flight, then increment tp->undo_retrans
    so that we make sure that we receive a DSACK corresponding to the TLP
    retransmit, as well as DSACKs for all later normal retransmits, before
    triggering a loss recovery undo. Note that we also have to move the
    line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO
    we remember the tp->tlp_high_seq value until tcp_init_undo() and clear
    it only afterward.
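
    A sketch of the adjusted tcp_init_undo() bookkeeping described above
    (hedged, not the literal upstream hunk):

        static void tcp_init_undo(struct tcp_sock *tp)
        {
                tp->undo_marker = tp->snd_una;

                /* Retransmissions still in flight will produce DSACKs that
                 * must not be mistaken for "all retransmits were spurious".
                 */
                tp->undo_retrans = tp->retrans_out;
                if (tp->tlp_high_seq && tp->tlp_retrans)
                        tp->undo_retrans++;     /* count the TLP rtx too */
                if (!tp->undo_retrans)
                        tp->undo_retrans = -1;
        }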

    Also note that the bug dates back to the original 2013 TLP
    implementation, commit 6ba8a3b19e ("tcp: Tail loss probe (TLP)").

    However, this patch will only compile and work correctly with kernels
    that have tp->tlp_retrans, which was added only in v5.8 in 2020 in
    commit 76be93fc07 ("tcp: allow at most one TLP probe per flight").
    So we associate this fix with that later commit.

    Fixes: 76be93fc07 ("tcp: allow at most one TLP probe per flight")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Cc: Kevin Yang <yyd@google.com>
    Link: https://patch.msgid.link/20240703171246.1739561-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni 212db5de55 UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit a6458ab7fd4f427d4f6f54380453ad255b7fde83
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Jun 26 22:42:27 2024 -0400

    UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()

    In some production workloads we noticed that connections could
    sometimes close extremely prematurely with ETIMEDOUT after
    transmitting only 1 TLP and RTO retransmission (when we would normally
    expect roughly tcp_retries2 = TCP_RETR2 = 15 RTOs before a connection
    closes with ETIMEDOUT).

    From tracing we determined that these workloads can suffer from a
    scenario where in fast recovery, after some retransmits, a DSACK undo
    can happen at a point where the scoreboard is totally clear (we have
    retrans_out == sacked_out == lost_out == 0). In such cases, calling
    tcp_try_keep_open() means that we do not execute any code path that
    clears tp->retrans_stamp to 0. That means that tp->retrans_stamp can
    remain erroneously set to the start time of the undone fast recovery,
    even after the fast recovery is undone. If minutes or hours elapse,
    and then a TLP/RTO/RTO sequence occurs, then the start_ts value in
    retransmits_timed_out() (which is from tp->retrans_stamp) will be
    erroneously ancient (left over from the fast recovery undone via
    DSACKs). Thus this ancient tp->retrans_stamp value can cause the
    connection to die very prematurely with ETIMEDOUT via
    tcp_write_err().

    The fix: we change DSACK undo in fast recovery (TCP_CA_Recovery) to
    call tcp_try_to_open() instead of tcp_try_keep_open(). This ensures
    that if no retransmits are in flight at the time of DSACK undo in fast
    recovery then we properly zero retrans_stamp. Note that calling
    tcp_try_to_open() is more consistent with other loss recovery
    behavior, since normal fast recovery (CA_Recovery) and RTO recovery
    (CA_Loss) both normally end when tp->snd_una meets or exceeds
    tp->high_seq and then in tcp_fastretrans_alert() the "default" switch
    case executes tcp_try_to_open(). Also note that by inspection this
    change to call tcp_try_to_open() implies at least one other nice bug
    fix, where now an ECE-marked DSACK that causes an undo will properly
    invoke tcp_enter_cwr() rather than ignoring the ECE mark.
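
    In code, the change amounts to something like this at the DSACK-undo
    call site in tcp_fastretrans_alert() (sketch, not the literal hunk):

        if (tcp_try_undo_dsack(sk))
                tcp_try_to_open(sk, flag);      /* was: tcp_try_keep_open(sk) */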

    Fixes: c7d9d6a185 ("tcp: undo on DSACK during recovery")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni 37a7b087d1 tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 5dfe9d273932c647bdc9d664f939af9a5a398cbc
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Jun 24 14:43:23 2024 +0000

    tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO

    Testing determined that the recent commit 9e046bb111f1 ("tcp: clear
    tp->retrans_stamp in tcp_rcv_fastopen_synack()") has a race, and does
    not always ensure retrans_stamp is 0 after a TFO payload retransmit.

    If transmit completion for the SYN+data skb happens after the client
    TCP stack receives the SYNACK (which sometimes happens), then
    retrans_stamp can erroneously remain non-zero for the lifetime of the
    connection, causing a premature ETIMEDOUT later.

    Testing and tracing showed that the buggy scenario is the following
    somewhat tricky sequence:

    + Client attempts a TFO handshake. tcp_send_syn_data() sends SYN + TFO
      cookie + data in a single packet in the syn_data skb. It hands the
      syn_data skb to tcp_transmit_skb(), which makes a clone. Crucially,
      it then reuses the same original (non-clone) syn_data skb,
      transforming it by advancing the seq by one byte and removing the
      FIN bit, and enques the resulting payload-only skb in the
      sk->tcp_rtx_queue.

    + Client sets retrans_stamp to the start time of the three-way
      handshake.

    + Cookie mismatches or server has TFO disabled, and server only ACKs
      SYN.

    + tcp_ack() sees SYN is acked, tcp_clean_rtx_queue() clears
      retrans_stamp.

    + Since the client SYN was acked but not the payload, the TFO failure
      code path in tcp_rcv_fastopen_synack() tries to retransmit the
      payload skb.  However, in some cases the transmit completion for the
      clone of the syn_data (which had SYN + TFO cookie + data) hasn't
      happened.  In those cases, skb_still_in_host_queue() returns true
      for the retransmitted TFO payload, because the clone of the syn_data
      skb has not had its tx completion.

    + Because skb_still_in_host_queue() finds skb_fclone_busy() is true,
      it sets the TSQ_THROTTLED bit and the retransmit does not happen in
      the tcp_rcv_fastopen_synack() call chain.

    + The tcp_rcv_fastopen_synack() code next implicitly assumes the
      retransmit process is finished, and sets retrans_stamp to 0 to clear
      it, but this is later overwritten (see below).

    + Later, upon tx completion, tcp_tsq_write() calls
      tcp_xmit_retransmit_queue(), which puts the retransmit in flight and
      sets retrans_stamp to a non-zero value.

    + The client receives an ACK for the retransmitted TFO payload data.

    + Since we're in CA_Open and there are no dupacks/SACKs/DSACKs/ECN to
      make tcp_ack_is_dubious() true and make us call
      tcp_fastretrans_alert() and reach a code path that clears
      retrans_stamp, retrans_stamp stays nonzero.

    + Later, if there is a TLP, RTO, RTO sequence, then the connection
      will suffer an early ETIMEDOUT due to the erroneously ancient
      retrans_stamp.

    The fix: this commit refactors the code to have
    tcp_rcv_fastopen_synack() retransmit by reusing the relevant parts of
    tcp_simple_retransmit() that enter CA_Loss (without changing cwnd) and
    call tcp_xmit_retransmit_queue(). We have tcp_simple_retransmit() and
    tcp_rcv_fastopen_synack() share code in this way because in both cases
    we get a packet indicating non-congestion loss (MTU reduction or TFO
    failure) and thus in both cases we want to retransmit as many packets
    as cwnd allows, without reducing cwnd. And given that retransmits will
    set retrans_stamp to a non-zero value (and may do so in a later
    calling context due to TSQ), we also want to enter CA_Loss so that we
    track when all retransmitted packets are ACked and clear retrans_stamp
    when that happens (to ensure later recurring RTOs are using the
    correct retrans_stamp and don't declare ETIMEDOUT prematurely).

    Fixes: 9e046bb111f1 ("tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()")
    Fixes: a7abf3cd76 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Link: https://patch.msgid.link/20240624144323.2371403-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni ddc843f31c tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 9e046bb111f13461d3f9331e24e974324245140e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Jun 14 13:06:15 2024 +0000

    tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()

    Some applications were reporting ETIMEDOUT errors on apparently
    good looking flows, according to packet dumps.

    We were able to root cause the issue to an accidental setting
    of tp->retrans_stamp in the following scenario:

    - client sends TFO SYN with data.
    - server has TFO disabled, ACKs only SYN but not payload.
    - client receives SYNACK covering only SYN.
    - tcp_ack() eats SYN and sets tp->retrans_stamp to 0.
    - tcp_rcv_fastopen_synack() calls tcp_xmit_retransmit_queue()
      to retransmit TFO payload w/o SYN, sets tp->retrans_stamp to "now",
      but we are not in any loss recovery state.
    - TFO payload is ACKed.
    - we are not in any loss recovery state, and don't see any dupacks,
      so we don't get to any code path that clears tp->retrans_stamp.
    - tp->retrans_stamp stays non-zero for the lifetime of the connection.
    - after first RTO, tcp_clamp_rto_to_user_timeout() clamps second RTO
      to 1 jiffy due to bogus tp->retrans_stamp.
    - on clamped RTO with non-zero icsk_retransmits, retransmits_timed_out()
      sets start_ts from tp->retrans_stamp from TFO payload retransmit
      hours/days ago, and computes bogus long elapsed time for loss recovery,
      and suffers ETIMEDOUT early.
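
    The fix boils down to clearing the stamp in tcp_rcv_fastopen_synack()
    right after the TFO payload retransmit (sketch):

                tcp_xmit_retransmit_queue(sk);
                /* We are not in any loss-recovery state: clear the stamp the
                 * retransmit just set, so later RTO math starts from scratch.
                 */
                tp->retrans_stamp = 0;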

    Fixes: a7abf3cd76 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()")
    CC: stable@vger.kernel.org
    Co-developed-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Co-developed-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240614130615.396837-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.
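
    Roughly, the per-cpu dance looks like this (variable and accessor names
    assumed for illustration):

        DEFINE_PER_CPU(u32, tcp_tw_isn);

        /* set when tcp_timewait_state_process() returns TCP_TW_SYN: */
        __this_cpu_write(tcp_tw_isn, isn);

        /* consumed in tcp_conn_request(), same CPU, same softirq: */
        u32 isn = __this_cpu_read(tcp_tw_isn);
        __this_cpu_write(tcp_tw_isn, 0);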

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
Paolo Abeni 4cd846284a tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b9e810405880c99baafd550ada7043e86465396e
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:21 2024 +0000

    tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()

    tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
    out if the request socket is created by a SYN hitting a TIMEWAIT socket.

    This has been buggy for a decade, lets directly pass the information
    from tcp_conn_request().

    This is a preparatory patch to make the following one easier to review.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:07:53 +02:00
Paolo Abeni 516cdba7bf tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit e326578a21414738de45f77badd332fb00bd0f58
Author: Aananth V <aananthv@google.com>
Date:   Thu Sep 14 14:36:20 2023 +0000

    tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed

    For passive TCP Fast Open sockets that had SYN/ACK timeout and did not
    send more data in SYN_RECV, upon receiving the final ACK in 3WHS, the
    congestion state may awkwardly stay in CA_Loss mode unless the CA state
    was undone due to TCP timestamp checks. However, if
    tcp_rcv_synrecv_state_fastopen() decides not to undo, then we should
    enter CA_Open, because at that point we have received an ACK covering
    the retransmitted SYNACKs. Currently, the icsk_ca_state is only set to
    CA_Open after we receive an ACK for a data-packet. This is because
    tcp_ack does not call tcp_fastretrans_alert (and tcp_process_loss) if
    !prior_packets.

    Note that tcp_process_loss() calls tcp_try_undo_recovery(), so having
    tcp_rcv_synrecv_state_fastopen() decide that if we're in CA_Loss we
    should call tcp_try_undo_recovery() is consistent with that, and
    low risk.
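
    A sketch of the resulting tcp_rcv_synrecv_state_fastopen() logic (the
    exact condition is hedged):

        struct tcp_sock *tp = tcp_sk(sk);

        /* SYNACK was RTO-retransmitted and no extra data was sent while in
         * SYN_RECV: undo explicitly so the socket enters CA_Open.
         */
        if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss && !tp->packets_out)
                tcp_try_undo_recovery(sk);
        else
                tcp_try_undo_loss(sk, false);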

    Fixes: dad8cea7ad ("tcp: fix TFO SYNACK undo to avoid double-timestamp-undo")
    Signed-off-by: Aananth V <aananthv@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 15:29:39 +02:00
Jamie Bainbridge 9bf961c675 net: tcp: accept old ack during closing
JIRA: https://issues.redhat.com/browse/RHEL-60572

Conflicts:
- net/ipv4/tcp_input.c
  Code difference because EL9 already has upstream commit
  7d6ed9afde85 ("tcp: add dropreasons in tcp_rcv_state_process()").
  After this patch, EL9's "reason = " and "if reason <= 0" block now
  match upstream, the same result as if 7d6ed9afde85 and 795a7dfbc3d9
  were applied in order, so future patches here will not conflict.

commit 795a7dfbc3d95e4c7c09569f319f026f8c7f5a9c
Author: Menglong Dong <menglong8.dong@gmail.com>
Date:   Fri Jan 26 12:05:19 2024 +0800

    net: tcp: accept old ack during closing

    For now, the packet with an old ack is not accepted if we are in
    FIN_WAIT1 state, which can cause retransmission. Taking the following
    case as an example:

        Client                               Server
          |                                    |
      FIN_WAIT1(Send FIN, seq=10)          FIN_WAIT1(Send FIN, seq=20, ack=10)
          |                                    |
          |                                Send ACK(seq=21, ack=11)
       Recv ACK(seq=21, ack=11)
          |
       Recv FIN(seq=20, ack=10)

    In the case above, simultaneous close is happening, and the FIN and ACK
    packet that send from the server is out of order. Then, the FIN will be
    dropped by the client, as it has an old ack. Then, the server has to
    retransmit the FIN, which can cause delay if the server has set the
    SO_LINGER on the socket.

    Old ack is accepted in the ESTABLISHED and TIME_WAIT state, and I think
    it would be better to keep the same logic.

    In this commit, we accept old ack in FIN_WAIT1/FIN_WAIT2/CLOSING/LAST_ACK
    states. Maybe we should limit it to FIN_WAIT1 for now?
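
    A condensed sketch of the resulting check in tcp_rcv_state_process()
    (matching the backport conflict note above; the SYN_RECV special case is
    omitted, and it assumes tcp_ack() returns a positive value when the ACK
    made progress, zero for an old ACK, and a negative drop reason otherwise):

        reason = tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT |
                                  FLAG_NO_CHALLENGE_ACK);
        if ((int)reason <= 0) {
                /* accept old ack during closing: only a bogus ACK (negative
                 * reason) draws a challenge ACK, while an old but harmless
                 * ACK (zero) falls through so an out-of-order FIN in
                 * FIN_WAIT1/FIN_WAIT2/CLOSING/LAST_ACK is still processed.
                 */
                if ((int)reason < 0) {
                        tcp_send_challenge_ack(sk);
                        goto discard;
                }
        }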

    Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240126040519.1846345-1-menglong8.dong@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
2024-10-03 13:54:07 +10:00
Wander Lairson Costa 26717e7509 tcp: drop skb dst in tcp_rcv_established()
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 783d108dd71d97e4cac5fe8ce70ca43ed7dc7bb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 29 18:15:23 2022 -0700

    tcp: drop skb dst in tcp_rcv_established()

    In commit f84af32cbc ("net: ip_queue_rcv_skb() helper")
    I dropped the skb dst in tcp_data_queue().

    This only dealt with so-called TCP input slow path.

    When fast path is taken, tcp_rcv_established() calls
    tcp_queue_rcv() while skb still has a dst.

    This was mostly fine, because most dsts at this point
    are not refcounted (thanks to early demux)

    However, TCP packets sent over loopback have refcounted dst.

    Then commit 68822bdf76f1 ("net: generalize skb freeing
    deferral to per-cpu lists") came and had the effect
    of delaying skb freeing for an arbitrary time.

    If during this time the involved netns is dismantled, cleanup_net()
    frees the struct net with embedded net->ipv6.ip6_dst_ops.

    Then when eventually dst_destroy_rcu() is called,
    if (dst->ops->destroy) ... triggers an use-after-free.

    It is not clear if ip6_route_net_exit() lacks a rcu_barrier()
    as syzbot reported similar issues before the blamed commit.

    ( https://groups.google.com/g/syzkaller-bugs/c/CofzW4eeA9A/m/009WjumTAAAJ )
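
    The fix itself is essentially a single skb_dst_drop() in the fast path
    before the skb is queued (sketch):

                        /* Bulk data transfer: receiver.
                         * The dst is not needed past this point; drop it now
                         * so a defer-freed skb cannot pin the netns.
                         */
                        skb_dst_drop(skb);
                        __skb_pull(skb, tcp_header_len);
                        eaten = tcp_queue_rcv(sk, skb, &fragstolen);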

    Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:29 -03:00
Antoine Tenart 3a0f9f0ce0 tcp: use sk_skb_reason_drop to free rx packets
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit 46a02aa357529d7b038096955976b14f7c44aa23
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:20 2024 -0700

    tcp: use sk_skb_reason_drop to free rx packets

    Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
    socket to the tracepoint.

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 206757f0ed tcp: add dropreasons in tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Code difference in tcp_rcv_state_process due to missing upstream
  commit 795a7dfbc3d9 ("net: tcp: accept old ack during closing").

commit 7d6ed9afde8547723f6f96f81f984cc6c48eef52
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:25 2024 +0800

    tcp: add dropreasons in tcp_rcv_state_process()

    In this patch, I equipped this function with more drop reasons, but
    they are not reported yet; that will be done in a later patch.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 6c9f108418 tcp: add more specific possible drop reasons in tcp_rcv_synsent_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit e615e3a24ed6f1a501f9b5426ec0b476fded4d22
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:24 2024 +0800

    tcp: add more specific possible drop reasons in tcp_rcv_synsent_state_process()

    This patch does two things:
    1) add two more new reasons
    2) only change the return value (1) to various drop reason values
    for future use

    For now, we still cannot trace those two reasons. We'll implement the full
    function in the subsequent patch in this series.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Florian Westphal bd2a0fb2c5 tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
JIRA: https://issues.redhat.com/browse/RHEL-39833
Upstream Status: commit 94062790aedb
CVE: CVE-2024-36905

commit 94062790aedb505bdda209b10bea47b294d6394f
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 1 12:54:48 2024 +0000

    tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets

    TCP_SYN_RECV state is really special, it is only used by
    cross-syn connections, mostly used by fuzzers.

    In the following crash [1], syzbot managed to trigger a divide
    by zero in tcp_rcv_space_adjust()

    A socket makes the following state transitions,
    without ever calling tcp_init_transfer(),
    meaning tcp_init_buffer_space() is also not called.

             TCP_CLOSE
    connect()
             TCP_SYN_SENT
             TCP_SYN_RECV
    shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
             TCP_FIN_WAIT1

    To fix this issue, change tcp_shutdown() to not
    perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
    which makes no sense anyway.

    When tcp_rcv_state_process() later changes socket state
    from TCP_SYN_RECV to TCP_ESTABLISH, then look at
    sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
    and send a FIN packet from a sane socket state.

    This means tcp_send_fin() can now be called from BH
    context, and must use GFP_ATOMIC allocations.
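
    In practice tcp_shutdown() simply stops treating TCP_SYN_RECV as a state
    from which it may send a FIN (sketch, not the full patch):

        void tcp_shutdown(struct sock *sk, int how)
        {
                if (!(how & SEND_SHUTDOWN))
                        return;

                /* TCPF_SYN_RECV is no longer in this mask: the FIN is
                 * deferred until tcp_rcv_state_process() moves the socket to
                 * TCP_ESTABLISHED and then checks sk->sk_shutdown, sending
                 * the FIN from a sane state (with GFP_ATOMIC, since it may
                 * now run in BH context).
                 */
                if ((1 << sk->sk_state) &
                    (TCPF_ESTABLISHED | TCPF_SYN_SENT | TCPF_CLOSE_WAIT)) {
                        if (tcp_close_state(sk))
                                tcp_send_fin(sk);
                }
        }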

    [1]
    divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
    CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
     RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
    Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
    RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
    RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
    R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
    R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
    FS:  00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
    Call Trace:
     <TASK>
      tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
      tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
      inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
      sock_recvmsg_nosec net/socket.c:1046 [inline]
      sock_recvmsg+0x109/0x280 net/socket.c:1068
      ____sys_recvmsg+0x1db/0x470 net/socket.c:2803
      ___sys_recvmsg net/socket.c:2845 [inline]
      do_recvmmsg+0x474/0xae0 net/socket.c:2939
      __sys_recvmmsg net/socket.c:3018 [inline]
      __do_sys_recvmmsg net/socket.c:3041 [inline]
      __se_sys_recvmmsg net/socket.c:3034 [inline]
      __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7faeb6363db9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
    RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
    RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
    R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-06-07 15:21:41 +02:00
Sabrina Dubroca 7bc5eeb384 net: skbuff: generalize the skb->decrypted bit
JIRA: https://issues.redhat.com/browse/RHEL-29306

commit 9f06f87fef689d28588cde8c7ebb00a67da34026
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 3 13:21:39 2024 -0700

    net: skbuff: generalize the skb->decrypted bit

    The ->decrypted bit can be reused for other crypto protocols.
    Remove the direct dependency on TLS, add helpers to clean up
    the ifdefs leaking out everywhere.
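
    For instance, a helper of this shape is what replaces the open-coded
    ifdefs (names as recalled, hedged):

        static inline bool skb_is_decrypted(const struct sk_buff *skb)
        {
        #ifdef CONFIG_SKB_DECRYPTED
                return skb->decrypted;
        #else
                return false;
        #endif
        }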

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-05-01 17:48:16 +02:00
Paolo Abeni 05001ba2cb tcp: do not accept ACK of bytes we never sent
JIRA: https://issues.redhat.com/browse/RHEL-21432
Tested: LNST, Tier1

Upstream commit:
commit 3d501dd326fb1c73f1b8206d4c6e1d7b15c07e27
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Dec 5 16:18:41 2023 +0000

    tcp: do not accept ACK of bytes we never sent

    This patch is based on a detailed report and ideas from Yepeng Pan
    and Christian Rossow.

    ACK seq validation is currently following RFC 5961 5.2 guidelines:

       The ACK value is considered acceptable only if
       it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <=
       SND.NXT).  All incoming segments whose ACK value doesn't satisfy the
       above condition MUST be discarded and an ACK sent back.  It needs to
       be noted that RFC 793 on page 72 (fifth check) says: "If the ACK is a
       duplicate (SEG.ACK < SND.UNA), it can be ignored.  If the ACK
       acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an
       ACK, drop the segment, and return".  The "ignored" above implies that
       the processing of the incoming data segment continues, which means
       the ACK value is treated as acceptable.  This mitigation makes the
       ACK check more stringent since any ACK < SND.UNA wouldn't be
       accepted, instead only ACKs that are in the range ((SND.UNA -
       MAX.SND.WND) <= SEG.ACK <= SND.NXT) get through.

    This can be refined for new (and possibly spoofed) flows,
    by not accepting ACK for bytes that were never sent.

    This greatly improves TCP security at a little cost.

    I added a Fixes: tag to make sure this patch will reach stable trees,
    even if the 'blamed' patch was adhering to the RFC.

    tp->bytes_acked was added in linux-4.2
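
    The refined check in tcp_ack(), roughly (sketch):

        if (before(ack, prior_snd_una)) {
                u32 max_window;

                /* Do not accept an ACK for bytes we never sent: for new (or
                 * spoofed) flows bytes_acked is still small, so the tolerated
                 * "old ACK" range shrinks accordingly.
                 */
                max_window = min_t(u64, tp->max_window, tp->bytes_acked);
                /* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
                if (before(ack, prior_snd_una - max_window)) {
                        /* send a challenge ACK and drop, as before */
                }
        }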

    Following packetdrill test (courtesy of Yepeng Pan) shows
    the issue at hand:

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1024) = 0

    // ---------------- Handshake ------------------- //

    // when window scale is set to 14 the window size can be extended to
    // 65535 * (2^14) = 1073725440. Linux would accept an ACK packet
    // with ack number in (Server_ISN+1-1073725440. Server_ISN+1)
    // ,though this ack number acknowledges some data never
    // sent by the server.

    +0 < S 0:0(0) win 65535 <mss 1400,nop,wscale 14>
    +0 > S. 0:0(0) ack 1 <...>
    +0 < . 1:1(0) ack 1 win 65535
    +0 accept(3, ..., ...) = 4

    // For the established connection, we send an ACK packet,
    // the ack packet uses ack number 1 - 1073725300 + 2^32,
    // where 2^32 is used to wrap around.
    // Note: we used 1073725300 instead of 1073725440 to avoid possible
    // edge cases.
    // 1 - 1073725300 + 2^32 = 3221241997

    // Oops, old kernels happily accept this packet.
    +0 < . 1:1001(1000) ack 3221241997 win 65535

    // After the kernel fix the following will be replaced by a challenge ACK,
    // and prior malicious frame would be dropped.
    +0 > . 1:1(0) ack 1001

    Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Yepeng Pan <yepeng.pan@cispa.de>
    Reported-by: Christian Rossow <rossow@cispa.de>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20231205161841.2702925-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 16:31:32 +01:00
Jan Stancek 063f72e7e5 Merge: mptcp: rebase to Linux 6.7
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3305

JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1, selftests, pktdrill

Rebase to the current upstream to bring in new features and
a lot of fixes. A good half of the long commit list touches
the self-tests only, and the remaining is self-contained in mptcp.

The only notable exception is:

tcp: get rid of sysctl_tcp_adv_win_scale

that is a pre requisite to a bunch of mptcp changes included here
and also uncontroversially a good thing (TM) for TCP.

Wider-scope data-races related changeset are included (possibly as
partial backport) only if they help to reduce conflict on later
changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:53 +01:00
Jan Stancek 3c8d3e2d4a Merge: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3301

JIRA: https://issues.redhat.com/browse/RHEL-11592

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date:   Sun Jun 11 22:05:24 2023 -0500

    tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

    Under certain circumstances, the tcp receive buffer memory limit
    set by autotuning (sk_rcvbuf) is increased due to incoming data
    packets as a result of the window not closing when it should be.
    This can result in the receive buffer growing all the way up to
    tcp_rmem[2], even for tcp sessions with a low BDP.

    To reproduce:  Connect a TCP session with the receiver doing
    nothing and the sender sending small packets (an infinite loop
    of socket send() with 4 bytes of payload with a sleep of 1 ms
    in between each send()).  This will cause the tcp receive buffer
    to grow all the way up to tcp_rmem[2].

    As a result, a host can have individual tcp sessions with receive
    buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
    limits, causing the host to go into tcp memory pressure mode.

    The fundamental issue is the relationship between the granularity
    of the window scaling factor and the number of byte ACKed back
    to the sender.  This problem has previously been identified in
    RFC 7323, appendix F [1].

    The Linux kernel currently adheres to never shrinking the window.

    In addition to the overallocation of memory mentioned above, the
    current behavior is functionally incorrect, because once tcp_rmem[2]
    is reached and no remediations remain (i.e. tcp collapse fails to
    free up any more memory and there are no packets to prune from the
    out-of-order queue), the receiver will drop in-window packets
    resulting in retransmissions and an eventual timeout of the tcp
    session.  A receive buffer full condition should instead result
    in a zero window and an indefinite wait.

    In practice, this problem is largely hidden for most flows.  It
    is not applicable to mice flows.  Elephant flows can send data
    fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
    triggering a zero window.

    But this problem does show up for other types of flows.  Examples
    are websockets and other types of flows that send small amounts of
    data spaced apart slightly in time.  In these cases, we directly
    encounter the problem described in [1].

    RFC 7323, section 2.4 [2], says there are instances when a retracted
    window can be offered, and that TCP implementations MUST ensure
    that they handle a shrinking window, as specified in RFC 1122,
    section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
    management have made clear that the sender must accept a shrunk window
    from the receiver, including RFC 793 [4] and RFC 1323 [5].

    This patch implements the functionality to shrink the tcp window
    when necessary to keep the right edge within the memory limit set
    by autotuning (sk_rcvbuf).  This new functionality is enabled with
    the new sysctl: net.ipv4.tcp_shrink_window
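
    For reference, a minimal userspace sketch that turns the feature on,
    assuming the usual sysctl-to-procfs mapping, i.e. the equivalent of
    "sysctl -w net.ipv4.tcp_shrink_window=1":

    #include <stdio.h>

    int main(void)
    {
            /* path assumed from the sysctl name; adjust if your setup differs */
            FILE *f = fopen("/proc/sys/net/ipv4/tcp_shrink_window", "w");

            if (!f) {
                    perror("tcp_shrink_window");
                    return 1;
            }
            fputs("1\n", f);
            fclose(f);
            return 0;
    }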

    Additional information can be found at:
    https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

    [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
    [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
    [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
    [4] https://www.rfc-editor.org/rfc/rfc793
    [5] https://www.rfc-editor.org/rfc/rfc1323

    Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:41 +01:00
Jan Stancek 9eea5b8c8f Merge: net: backport drop reason related patches
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3265

JIRA: https://issues.redhat.com/browse/RHEL-14554
Depends: !3196

Skb drop reason related patches, and a few extra ones for easier backports.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:49:35 +01:00
Antoine Tenart 6aaf1a5b76 tcp: add TCP_OLD_SEQUENCE drop reason
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git

commit b44693495af8f309b8ddec4b30833085d1c2d0c4
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 06:47:54 2023 +0000

    tcp: add TCP_OLD_SEQUENCE drop reason

    tcp_sequence() uses two conditions to decide to drop a packet,
    and we currently report generic TCP_INVALID_SEQUENCE drop reason.

    Duplicates are common, we need to distinguish them from
    the other case.

    I chose to not reuse TCP_OLD_DATA, and instead added
    TCP_OLD_SEQUENCE drop reason.
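
    A rough userspace sketch of the distinction (hypothetical names, not the
    kernel diff): a segment that ends before data we already acknowledged is
    reported as old, while one that starts beyond the advertised window is
    reported as invalid.

    #include <stdbool.h>
    #include <stdint.h>

    static bool seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
    static bool seq_after(uint32_t a, uint32_t b)  { return seq_before(b, a); }

    enum demo_reason { DEMO_ACCEPT, DEMO_OLD_SEQUENCE, DEMO_INVALID_SEQUENCE };

    static enum demo_reason demo_check_seq(uint32_t seq, uint32_t end_seq,
                                           uint32_t rcv_wup, uint32_t rcv_nxt,
                                           uint32_t rcv_wnd)
    {
            if (seq_before(end_seq, rcv_wup))       /* already fully acknowledged */
                    return DEMO_OLD_SEQUENCE;
            if (seq_after(seq, rcv_nxt + rcv_wnd))  /* starts past the right edge */
                    return DEMO_INVALID_SEQUENCE;
            return DEMO_ACCEPT;
    }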

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719064754.2794106-1-edumazet@google.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:30 +01:00
Paolo Abeni 8ce7b9e432 tcp: get rid of sysctl_tcp_adv_win_scale
JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1

Upstream commit:
commit dfa2f0483360d4d6f2324405464c9f281156bd87
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jul 17 15:29:17 2023 +0000

    tcp: get rid of sysctl_tcp_adv_win_scale

    With modern NIC drivers shifting to full page allocations per
    received frame, we face the following issue:

    TCP has one per-netns sysctl used to tweak how to translate
    a memory use into an expected payload (RWIN), in RX path.

    The tcp_win_from_space() implementation is limited to a few cases.

    For hosts dealing with various MSS, we either underestimate
    or overestimate the RWIN we send to the remote peers.

    For instance with the default sysctl_tcp_adv_win_scale value,
    we expect to store 50% of payload per allocated chunk of memory.

    For the typical case of MTU=1500 traffic and order-0 page allocations
    by NIC drivers, we send an RWIN that is too big, leading to potential
    tcp collapse operations, which are extremely expensive and a source
    of latency spikes.

    This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
    uses a per socket scaling factor, so that we can precisely
    adjust the RWIN based on effective skb->len/skb->truesize ratio.

    This patch alone can double TCP receive performance when receivers
    are too slow to drain their receive queue, or by allowing
    a bigger RWIN when MSS is close to PAGE_SIZE.
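
    A rough userspace illustration of the difference (all names hypothetical,
    not the kernel code): the old model applies one fixed divisor to translate
    memory into advertised window, while the new one scales by the
    payload/truesize ratio actually observed on received skbs.

    #include <stdint.h>

    /* old model: with the default tcp_adv_win_scale of 1, assume 50% of the
     * allocated memory is payload
     */
    static uint32_t win_from_space_old(uint32_t space, int adv_win_scale)
    {
            return adv_win_scale <= 0 ? space >> -adv_win_scale
                                      : space - (space >> adv_win_scale);
    }

    /* new idea: derive the ratio from what the driver actually delivered,
     * e.g. ~1448 payload bytes sitting in a 4096-byte order-0 page
     */
    static uint32_t win_from_space_new(uint32_t space, uint32_t skb_len,
                                       uint32_t skb_truesize)
    {
            return (uint64_t)space * skb_len / skb_truesize;
    }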

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-31 21:50:01 +01:00
Felix Maurer 2261d33599 tcp: adjust rcv_ssthresh according to sk_reserved_mem
JIRA: https://issues.redhat.com/browse/RHEL-11592
Conflicts:
- net/ipv4/tcp_input.c: context difference due to missing 240bfd134c59
  ("tcp: tweak len/truesize ratio for coalesce candidates")

commit 053f368412c9a7bfce2befec8c795113c8cfb0b1
Author: Wei Wang <weiwan@google.com>
Date:   Wed Sep 29 10:25:13 2021 -0700

    tcp: adjust rcv_ssthresh according to sk_reserved_mem

    When user sets SO_RESERVE_MEM socket option, in order to utilize the
    reserved memory when in memory pressure state, we adjust rcv_ssthresh
    according to the available reserved memory for the socket, instead of
    using 4 * advmss always.
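
    A rough sketch of the idea (illustrative names only, simplified from the
    actual change): clamp rcv_ssthresh to the classic 4 * advmss, but let it
    stay higher when unused reserved memory is still available to this socket.

    #include <stdint.h>

    static uint32_t demo_adjust_rcv_ssthresh(uint32_t rcv_ssthresh, uint32_t advmss,
                                             uint32_t unused_reserved_mem)
    {
            uint32_t clamped = rcv_ssthresh < 4u * advmss ? rcv_ssthresh
                                                          : 4u * advmss;

            /* memory the application explicitly reserved can still back a
             * larger window
             */
            return unused_reserved_mem > clamped ? unused_reserved_mem : clamped;
    }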

    Signed-off-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-31 16:20:06 +01:00
Paolo Abeni 1259c573ac tcp: fix delayed ACKs for MSS boundary condition
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1

Upstream commit:
commit 4720852ed9afb1c5ab84e96135cb5b73d5afde6f
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Oct 1 11:12:39 2023 -0400

    tcp: fix delayed ACKs for MSS boundary condition

    This commit fixes poor delayed ACK behavior that can cause poor TCP
    latency in a particular boundary condition: when an application makes
    a TCP socket write that is an exact multiple of the MSS size.

    The problem is that there is painful boundary discontinuity in the
    current delayed ACK behavior. With the current delayed ACK behavior,
    we have:

    (1) If an app reads data when > 1*MSS is unacknowledged, then
        tcp_cleanup_rbuf() ACKs immediately because of:

         tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

    (2) If an app reads all received data, and the packets were < 1*MSS,
        and either (a) the app is not ping-pong or (b) we received two
        packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because
        of:

         ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
          ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
           !inet_csk_in_pingpong_mode(sk))) &&

    (3) *However*: if an app reads exactly 1*MSS of data,
        tcp_cleanup_rbuf() does not send an immediate ACK. This is true
        even if the app is not ping-pong and the 1*MSS of data had the PSH
        bit set, suggesting the sending application completed an
        application write.

    Thus if the app is not ping-pong, we have this painful case where
    >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
    write whose last skb is an exact multiple of 1*MSS can get a 40ms
    delayed ACK. This means that any app that transfers data in one
    direction and takes care to align write size or packet size with MSS
    can suffer this problem. With receive zero copy making 4KB MSS values
    more common, it is becoming more common to have application writes
    naturally align with MSS, and more applications are likely to
    encounter this delayed ACK problem.

    The fix in this commit is to refine the delayed ACK heuristics with a
    simple check: immediately ACK a received 1*MSS skb with PSH bit set if
    the app reads all data. Why? If an skb has a len of exactly 1*MSS and
    has the PSH bit set then it is likely the end of an application
    write. So more data may not be arriving soon, and yet the data sender
    may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
    set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
    an ACK immediately if the app reads all of the data and is not
    ping-pong. Note that this logic is also executed for the case where
    len > MSS, but in that case this logic does not matter (and does not
    hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
    app reads data and there is more than an MSS of unACKed data.
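
    A rough sketch of that heuristic (hypothetical names, not the actual
    diff): a received skb carrying at least one MSS of payload with PSH set
    marks the ACK state as pushed, so a subsequent full read by the
    application triggers an immediate ACK.

    #include <stdbool.h>
    #include <stdint.h>

    struct demo_ack_state { bool pushed; };

    static void demo_event_data_recv(struct demo_ack_state *ack, uint32_t skb_len,
                                     uint32_t rcv_mss, bool psh_bit)
    {
            if (psh_bit && skb_len >= rcv_mss)
                    ack->pushed = true;   /* cleanup_rbuf-style code then ACKs
                                           * as soon as the app reads all data
                                           */
    }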

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Cc: Xin Guo <guoxin0309@gmail.com>
    Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:53:59 +02:00
Paolo Abeni 318329e4ce tcp: fix mishandling when the sack compression is deferred.
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1

Upstream commit:
commit 30c6f0bf9579debce27e45fac34fdc97e46acacc
Author: fuyuanli <fuyuanli@didiglobal.com>
Date:   Wed May 31 16:01:50 2023 +0800

    tcp: fix mishandling when the sack compression is deferred.

    In this patch, we mainly try to handle sending a compressed ack
    correctly if it's deferred.

    Here are more details in the old logic:
    When sack compression is triggered in the tcp_compressed_ack_kick(),
    if the sock is owned by the user, it will set TCP_DELACK_TIMER_DEFERRED
    and then defer to the release cb phase. Later, once the user releases
    the sock, tcp_delack_timer_handler() should send an ack as expected,
    which, however, cannot happen due to the lack of the ICSK_ACK_TIMER flag.
    Therefore, the receiver would not send an ack until the sender's
    retransmission timeout, which adds unnecessary latency.
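
    A rough sketch of the failure mode described above (hypothetical names,
    not the kernel code): the deferred handler gates on a pending-timer bit
    that the compression path never set, so the ack is silently skipped.

    #include <stdbool.h>
    #include <stdio.h>

    struct demo_sock {
            bool delack_deferred;    /* analogue of TCP_DELACK_TIMER_DEFERRED */
            bool ack_timer_pending;  /* analogue of ICSK_ACK_TIMER */
    };

    static void demo_release_cb(struct demo_sock *sk)
    {
            if (!sk->delack_deferred)
                    return;
            if (!sk->ack_timer_pending)
                    return;                 /* old logic bails out here */
            printf("send deferred ack\n");
    }

    int main(void)
    {
            /* compression path deferred the work but never set the timer bit */
            struct demo_sock sk = { .delack_deferred = true,
                                    .ack_timer_pending = false };

            demo_release_cb(&sk);           /* prints nothing: ack never sent */
            return 0;
    }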

    Fixes: 5d9f4262b7 ("tcp: add SACK compression")
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: fuyuanli <fuyuanli@didiglobal.com>
    Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
    Link: https://lore.kernel.org/netdev/20230529113804.GA20300@didi-ThinkCentre-M920t-N000/
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230531080150.GA20424@didi-ThinkCentre-M920t-N000
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:46:08 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer fc8e2acd3e tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit b67985be400969578d4d4b17299714c0e5d2c07b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 1 10:46:40 2022 -0800

    tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
    
    tcp_shift_skb_data() might collapse three packets into a larger one.
    
    P_A, P_B, P_C  -> P_ABC
    
    Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
    because it was enough.
    
    In commit 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions"),
    this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
    
    But the now-needed test over P_C has been missed.
    
    This probably broke MPTCP.
    
    Then later, commit 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
    added an extra condition to tcp_skb_can_collapse(), but the missing call
    from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
    might have different skb_zcopy_pure() status.
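
    An illustrative sketch of the point (hypothetical helpers, not the kernel
    API): when three segments are merged into one, the compatibility test has
    to cover the third segment as well, not only the first pair.

    #include <stdbool.h>

    struct demo_skb { bool has_mptcp_ext; bool zcopy_pure; };

    /* stand-in for the MPTCP-extension / zerocopy-purity constraints */
    static bool demo_can_collapse(const struct demo_skb *to,
                                  const struct demo_skb *from)
    {
            return to->has_mptcp_ext == from->has_mptcp_ext &&
                   to->zcopy_pure == from->zcopy_pure;
    }

    static bool demo_can_merge_three(const struct demo_skb *a,
                                     const struct demo_skb *b,
                                     const struct demo_skb *c)
    {
            /* the check against c is the one that was missing */
            return demo_can_collapse(a, b) && demo_can_collapse(a, c);
    }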
    
    Fixes: 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions")
    Fixes: 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:23:02 -04:00
Paolo Abeni e0b7624646 tcp: fix indefinite deferral of RTO with SACK reneging
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 3d2af9cce3133b3bc596a9d065c6f9d93419ccfb
Author: Neal Cardwell <ncardwell@google.com>
Date:   Fri Oct 21 17:08:21 2022 +0000

    tcp: fix indefinite deferral of RTO with SACK reneging

    This commit fixes a bug that can cause a TCP data sender to repeatedly
    defer RTOs when encountering SACK reneging.

    The bug is that when we're in fast recovery in a scenario with SACK
    reneging, every time we get an ACK we call tcp_check_sack_reneging()
    and it can note the apparent SACK reneging and rearm the RTO timer for
    srtt/2 into the future. In some SACK reneging scenarios that can
    happen repeatedly until the receive window fills up, at which point
    the sender can't send any more, the ACKs stop arriving, and the RTO
    fires at srtt/2 after the last ACK. But that can take far too long
    (O(10 secs)), since the connection is stuck in fast recovery with a
    low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
    available.

    This fix changes the logic in tcp_check_sack_reneging() to only rearm
    the RTO timer if data is cumulatively ACKed, indicating forward
    progress. This avoids this kind of nearly infinite loop of RTO timer
    re-arming. In addition, this meets the goals of
    tcp_check_sack_reneging() in handling Windows TCP behavior that looks
    temporarily like SACK reneging but is not really.
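
    A rough sketch of the changed condition (illustrative names): the RTO
    timer is pushed out on apparent SACK reneging only when the ACK also made
    forward progress, i.e. cumulatively acknowledged new data.

    #include <stdbool.h>
    #include <stdint.h>

    static bool demo_should_rearm_rto(bool looks_like_reneging,
                                      uint32_t prior_snd_una, uint32_t snd_una)
    {
            /* snd_una advanced => data was cumulatively ACKed */
            return looks_like_reneging && snd_una != prior_snd_una;
    }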

    Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
    and provided critical packet traces that enabled root-causing this
    issue. Also, many thanks to Jakub Kicinski for testing this fix.

    Fixes: 5ae344c949 ("tcp: reduce spurious retransmits due to transient SACK reneging")
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Reported-by: Neil Spring <ntspring@fb.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Tested-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:56:52 +02:00
Paolo Abeni a2ec85333c tcp: annotate data-race around challenge_timestamp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 8c70521238b7863c2af607e20bcba20f974c969b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Aug 30 11:56:55 2022 -0700

    tcp: annotate data-race around challenge_timestamp

    challenge_timestamp can be read and written by concurrent threads.

    This was expected, but we need to annotate the race to avoid potential issues.

    The following patch moves challenge_timestamp and challenge_count
    to per-netns storage to provide better isolation.

    Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:53:57 +02:00
Guillaume Nault e1b28db515 tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 79f55473bfc8ac51bd6572929a679eeb4da22251
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:03 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.

    While reading sysctl_tcp_comp_sack_nr, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
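
    A userspace analogue of the pattern (not the kernel diff): a tunable that
    another thread may rewrite at any time is loaded once into a local
    snapshot, which is what READ_ONCE() guarantees on the kernel side.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic int demo_comp_sack_nr = 44;  /* arbitrary example value,
                                                 * concurrently writable
                                                 */

    static bool demo_should_send_now(int compressed_ack)
    {
            int limit = atomic_load_explicit(&demo_comp_sack_nr,
                                             memory_order_relaxed);

            return compressed_ack >= limit;     /* decide on the snapshot */
    }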

    Fixes: 9c21d2fc41 ("tcp: add tcp_comp_sack_nr sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault ba3a173d9b tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 22396941a7f343d704738360f9ef0e6576489d43
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:02 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.

    While reading sysctl_tcp_comp_sack_slack_ns, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: a70437cc09 ("tcp: add hrtimer slack to sack compression")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 864f6cb56c tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 4866b2b0f7672b6d760c4b8ece6fb56f965dcc8a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:01 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.

    While reading sysctl_tcp_comp_sack_delay_ns, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 6d82aa2420 ("tcp: add tcp_comp_sack_delay_ns sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 744fd61abd tcp: Fix data-races around sk_pacing_rate.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 59bf6c65a09fff74215517aecffbbdcd67df76e3
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:21:59 2022 -0700

    tcp: Fix data-races around sk_pacing_rate.

    While reading sysctl_tcp_pacing_(ss|ca)_ratio, they can be changed
    concurrently.  Thus, we need to add READ_ONCE() to their readers.

    Fixes: 43e122b014 ("tcp: refine pacing rate determination")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 7ebd35903c tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 2afdbe7b8de84c28e219073a6661080e1b3ded48
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:26 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.

    While reading sysctl_tcp_invalid_ratelimit, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 032ee42369 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault df5278bbb0 tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 1330ffacd05fc9ac4159d19286ce119e22450ed2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:24 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.

    While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 1083d63446 tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit db3815a2fa691da145cfbe834584f31ad75df9ff
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:21 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.

    While reading sysctl_tcp_challenge_ack_limit, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 672780951d tcp: Fix a data-race around sysctl_tcp_frto.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 706c6202a3589f290e1ef9be0584a8f4a3cc0507
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:15 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_frto.

    While reading sysctl_tcp_frto, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 7eb9778613 tcp: Fix a data-race around sysctl_tcp_app_win.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 02ca527ac5581cf56749db9fd03d854e842253dd
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:13 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_app_win.

    While reading sysctl_tcp_app_win, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault cbc7d0bf3d tcp: Fix data-races around sysctl_tcp_dsack.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 58ebb1c8b35a8ef38cd6927431e0fa7b173a632d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:12 2022 -0700

    tcp: Fix data-races around sysctl_tcp_dsack.

    While reading sysctl_tcp_dsack, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00