Commit Graph

1175 Commits

Sabrina Dubroca 1eacc871f3 tcp: drop secpath at the same time as we currently drop dst
JIRA: https://issues.redhat.com/browse/RHEL-69649
JIRA: https://issues.redhat.com/browse/RHEL-83224
CVE: CVE-2025-21864

commit 9b6412e6979f6f9e0632075f8f008937b5cd4efd
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Mon Feb 17 11:23:35 2025 +0100

    tcp: drop secpath at the same time as we currently drop dst

    Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
    running tests that boil down to:
     - create a pair of netns
     - run a basic TCP test over ipcomp6
     - delete the pair of netns

    The xfrm_state found on spi_byaddr was not deleted at the time we
    delete the netns, because we still have a reference on it. This
    lingering reference comes from a secpath (which holds a ref on the
    xfrm_state), which is still attached to an skb. This skb is not
    leaked, it ends up on sk_receive_queue and then gets defer-free'd by
    skb_attempt_defer_free.

    The problem happens when we defer freeing an skb (push it on one CPU's
    defer_list), and don't flush that list before the netns is deleted. In
    that case, we still have a reference on the xfrm_state that we don't
    expect at this point.

    We already drop the skb's dst in the TCP receive path when it's no
    longer needed, so let's also drop the secpath. At this point,
    tcp_filter has already called into the LSM hooks that may require the
    secpath, so it should not be needed anymore. However, in some of those
    places, the MPTCP extension has just been attached to the skb, so we
    cannot simply drop all extensions.
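
    A minimal sketch of the idea, assuming a small helper is used at the
    points where the dst is already dropped (helper name illustrative, not
    necessarily the upstream one):

        /* Drop the per-skb state that pins netns objects before the skb is
         * queued and possibly defer-freed: the dst and the secpath. The
         * MPTCP extension must survive, so we cannot clear all extensions.
         */
        static inline void tcp_cleanup_skb(struct sk_buff *skb)
        {
                skb_dst_drop(skb);
                secpath_reset(skb);
        }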

    Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
    Reported-by: Xiumei Mu <xmu@redhat.com>
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2025-03-20 10:11:45 +01:00
Rado Vrbovsky 65ee7b65eb Merge: net: visibility patches for 9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5833

JIRA: https://issues.redhat.com/browse/RHEL-68063

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2025-01-06 08:26:06 +00:00
Rado Vrbovsky 81ce48e690 Merge: mptcp: phase-1 backports for RHEL-9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449

JIRA: https://issues.redhat.com/browse/RHEL-62871  
JIRA: https://issues.redhat.com/browse/RHEL-58839  
JIRA: https://issues.redhat.com/browse/RHEL-66083  
JIRA: https://issues.redhat.com/browse/RHEL-66074  
CVE: CVE-2024-46711  
CVE: CVE-2024-45009  
CVE: CVE-2024-45010  
Upstream Status: All mainline in net.git  
Tested: kselftest  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:18:31 +00:00
Antoine Tenart f1eac6da54 net: tcp: Add noinline_for_tracing annotation for tcp_drop_reason()
JIRA: https://issues.redhat.com/browse/RHEL-68063
Upstream Status: net-next.git

commit dbd5e2e79ed8653ac2ae255e42d1189283343a0c
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Thu Oct 24 17:37:42 2024 +0800

    net: tcp: Add noinline_for_tracing annotation for tcp_drop_reason()

    We previously hooked the tcp_drop_reason() function using BPF to monitor
    TCP drop reasons. However, after upgrading our compiler from GCC 9 to GCC
    11, tcp_drop_reason() is now inlined, preventing us from hooking into it.
    To address this, it is beneficial to mark the function explicitly as
    noinline for tracing.
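
    A rough sketch of the annotation and its use (the exact body of
    tcp_drop_reason() may differ):

        /* Self-documenting alias: keep the function out of line so that
         * BPF/tracing can reliably attach to it.
         */
        #define noinline_for_tracing noinline

        static noinline_for_tracing void tcp_drop_reason(struct sock *sk,
                                                         struct sk_buff *skb,
                                                         enum skb_drop_reason reason)
        {
                sk_drops_add(sk, skb);
                sk_skb_reason_drop(sk, skb, reason);
        }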

    Link: https://lore.kernel.org/netdev/CANn89iJuShCmidCi_ZkYABtmscwbVjhuDta1MS5LxV_4H9tKOA@mail.gmail.com/
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Cc: Menglong Dong <menglong8.dong@gmail.com>
    Link: https://patch.msgid.link/20241024093742.87681-3-laoar.shao@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-11-19 15:34:58 +01:00
Davide Caratti 6758e2bf77 tcp: set TCP_DEFER_ACCEPT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e97ba552b8d3dd074a28b8600740b8bed42267b

commit 6e97ba552b8d3dd074a28b8600740b8bed42267b
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:16 2023 +0000

    tcp: set TCP_DEFER_ACCEPT locklessly

    rskq_defer_accept field can be read/written without
    the need of holding the socket lock.
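
    A rough sketch of the lockless pattern (not the literal upstream hunk):
    the setsockopt path publishes the value with WRITE_ONCE() and readers
    pair it with READ_ONCE(), so lock_sock() is no longer required:

        case TCP_DEFER_ACCEPT:
                /* Translate value in seconds to number of retransmits */
                WRITE_ONCE(inet_csk(sk)->icsk_accept_queue.rskq_defer_accept,
                           secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
                                           TCP_RTO_MAX / HZ));
                return 0;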

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti a89122fa2a tcp: set TCP_LINGER2 locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71

commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:15 2023 +0000

    tcp: set TCP_LINGER2 locklessly

    tp->linger2 can be set locklessly as long as readers
    use READ_ONCE().
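
    For example, a reader then looks like this (sketch):

        /* Pair the lockless write with READ_ONCE() so the compiler cannot
         * tear or re-load the value.
         */
        int linger2 = READ_ONCE(tp->linger2);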

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Rado Vrbovsky 392bdee116 Merge: net: tcp: accept old ack during closing
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5359

JIRA: https://issues.redhat.com/browse/RHEL-60572

    795a7dfbc3d9 ("net: tcp: accept old ack during closing")

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Sterling Alexander <stalexan@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-06 08:21:42 +00:00
Rado Vrbovsky 384fd7eadc Merge: tcp: stable backports for 9.6 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5444

JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Several stable backports for the TCP protocol, addressing sparse corner-case issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-01 08:13:57 +00:00
Paolo Abeni 4111dedcfe tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context as rhel-9 lacks the upstream commit \
  3868ab0f1925 ("tcp: new TCP_INFO stats for RTO events")

Upstream commit:
commit 27c80efcc20486c82698f05f00e288b44513c86b
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:17 2024 +0000

    tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out

    Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp
    if retransmits are outstanding.

    tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll
    need to zero retrans_stamp here to prevent spurious
    retransmits_timed_out(). The logic to zero retrans_stamp is from this
    2019 commit:

    commit cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")

    However, in the corner case where the ACK of our TFO SYNACK carried
    some SACK blocks that caused us to enter TCP_CA_Recovery then that
    non-zero retrans_stamp corresponds to the active fast recovery, and we
    need to leave retrans_stamp with its current non-zero value, for
    correct ETIMEDOUT and undo behavior.
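
    A sketch of the resulting guard in tcp_rcv_synrecv_state_fastopen()
    (hedged; the exact predicate may differ):

        /* Only reset the stamp left by tcp_fastopen_synack_timer() when no
         * retransmitted data is still in flight; otherwise the stamp belongs
         * to the ongoing fast recovery and is needed for undo/ETIMEDOUT.
         */
        if (!tcp_any_retrans_done(sk))
                tp->retrans_stamp = 0;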

    Fixes: cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:17 +02:00
Paolo Abeni dcea8d7793 tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b41b4cbd9655bcebcce941bef3601db8110335be
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:16 2024 +0000

    tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe

    Fix tcp_enter_recovery() so that if there are no retransmits out then
    we zero retrans_stamp when entering fast recovery. This is necessary
    to fix two buggy behaviors.

    Currently a non-zero retrans_stamp value can persist across multiple
    back-to-back loss recovery episodes. This is because we generally only
    clear retrans_stamp if we are completely done with loss recoveries,
    and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This
    behavior causes two bugs:

    (1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed
    immediately by a new CA_Recovery, the retrans_stamp value can persist
    and can be a time before this new CA_Recovery episode starts. That
    means that timestamp-based undo will be using the wrong retrans_stamp
    (a value that is too old) when comparing incoming TS ecr values to
    retrans_stamp to see if the current fast recovery episode can be
    undone.

    (2) If there is a roughly minutes-long sequence of back-to-back fast
    recovery episodes, one after another (e.g. in a shallow-buffered or
    policed bottleneck), where each fast recovery successfully makes
    forward progress and recovers one window of sequence space (but leaves
    at least one retransmit in flight at the end of the recovery),
    followed by several RTOs, then the ETIMEDOUT check may be using the
    wrong retrans_stamp (a value set at the start of the first fast
    recovery in the sequence). This can cause a very premature ETIMEDOUT,
    killing the connection prematurely.

    This commit changes the code to zero retrans_stamp when entering fast
    recovery, when this is known to be safe (no retransmits are out in the
    network). That ensures that when starting a fast recovery episode, and
    it is safe to do so, retrans_stamp is set when we send the fast
    retransmit packet. That addresses both bug (1) and bug (2) by ensuring
    that (if no retransmits are out when we start a fast recovery) we use
    the initial fast retransmit of this fast recovery as the time value
    for undo and ETIMEDOUT calculations.

    This makes intuitive sense, since the start of a new fast recovery
    episode (in a scenario where no lost packets are out in the network)
    means that the connection has made forward progress since the last RTO
    or fast recovery, and we should thus "restart the clock" used for both
    undo and ETIMEDOUT logic.

    Note that if when we start fast recovery there *are* retransmits out
    in the network, there can still be undesirable (1)/(2) issues. For
    example, after this patch we can still have the (1) and (2) problems
    in cases like this:

    + round 1: sender sends flight 1

    + round 2: sender receives SACKs and enters fast recovery 1,
      retransmits some packets in flight 1 and then sends some new data as
      flight 2

    + round 3: sender receives some SACKs for flight 2, notes losses, and
      retransmits some packets to fill the holes in flight 2

    + fast recovery has some lost retransmits in flight 1 and continues
      for one or more rounds sending retransmits for flight 1 and flight 2

    + fast recovery 1 completes when snd_una reaches high_seq at end of
      flight 1

    + there are still holes in the SACK scoreboard in flight 2, so we
      enter fast recovery 2, but some retransmits in the flight 2 sequence
      range are still in flight (retrans_out > 0), so we can't execute the
      new retrans_stamp=0 added here to clear retrans_stamp

    It's not yet clear how to fix these remaining (1)/(2) issues in an
    efficient way without breaking undo behavior, given that retrans_stamp
    is currently used for undo and ETIMEDOUT. Perhaps the optimal (but
    expensive) strategy would be to set retrans_stamp to the timestamp of
    the earliest outstanding retransmit when entering fast recovery. But
    at least this commit makes things better.

    Note that this does not change the semantics of retrans_stamp; it
    simply makes retrans_stamp accurate in some cases where it was not
    before:

    (1) Some loss recovery, followed by an immediate entry into a fast
    recovery, where there are no retransmits out when entering the fast
    recovery.

    (2) When a TFO server has a SYNACK retransmit that sets retrans_stamp,
    and then the ACK that completes the 3-way handshake has SACK blocks
    that trigger a fast recovery. In this case when entering fast recovery
    we want to zero out the retrans_stamp from the TFO SYNACK retransmit,
    and set the retrans_stamp based on the timestamp of the fast recovery.

    We introduce a tcp_retrans_stamp_cleanup() helper, because this
    two-line sequence already appears in 3 places and is about to appear
    in 2 more as a result of this bug fix patch series. Once this bug fix
    patches series in the net branch makes it into the net-next branch
    we'll update the 3 other call sites to use the new helper.
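
    Roughly, the helper boils down to (sketch):

        /* Clear the retransmission timestamp if no retransmitted data is
         * outstanding, so undo and ETIMEDOUT logic start a fresh clock.
         */
        static void tcp_retrans_stamp_cleanup(struct sock *sk)
        {
                if (!tcp_any_retrans_done(sk))
                        tcp_sk(sk)->retrans_stamp = 0;
        }

    tcp_enter_recovery() then calls it before initializing the new recovery
    episode.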

    This is a long-standing issue. The Fixes tag below is chosen to be the
    oldest commit at which the patch will apply cleanly, which is from
    Linux v3.5 in 2012.

    Fixes: 1fbc340514 ("tcp: early retransmit: tcp_enter_recovery()")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:17 +02:00
Paolo Abeni b32e835fe5 tcp: fix to allow timestamp undo if no retransmits were sent
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit e37ab7373696e650d3b6262a5b882aadad69bb9e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:15 2024 +0000

    tcp: fix to allow timestamp undo if no retransmits were sent

    Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
    it can trigger undo even if TSQ prevents a fast recovery episode from
    reaching tcp_retransmit_skb().

    Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
    this commit from 2019:

    commit bc9f38c832 ("tcp: avoid unconditional congestion window undo
    on SYN retransmit")

    ...and before this fix we could have buggy scenarios like the
    following:

    + Due to reordering, a TCP connection receives some SACKs and enters a
      spurious fast recovery.

    + TSQ prevents all invocations of tcp_retransmit_skb(), because many
      skbs are queued in lower layers of the sending machine's network
      stack; thus tp->retrans_stamp remains 0.

    + The connection receives a TCP timestamp ECR value echoing a
      timestamp before the fast recovery, indicating that the fast
      recovery was spurious.

    + The connection fails to undo the spurious fast recovery because
      tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
      due to the new logic in the 2019 commit: commit bc9f38c832 ("tcp:
      avoid unconditional congestion window undo on SYN retransmit")

    This fix tweaks the logic to be more similar to the
    tcp_packet_delayed() logic before bc9f38c832, except that we take
    care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
    tp->retrans_stamp (the bug noted and fixed by Yuchung in
    bc9f38c832).

    Note that this returns the high-level behavior of tcp_packet_delayed()
    to again match the comment for the function, which says: "Nothing was
    retransmitted or returned timestamp is less than timestamp of the
    first retransmission." Note that this comment is in the original
    2005-04-16 Linux git commit, so this is evidently long-standing
    behavior.

    Fixes: bc9f38c832 ("tcp: avoid unconditional congestion window undo on SYN retransmit")
    Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:17 +02:00
Paolo Abeni 7a18dd824a tcp: Update window clamping condition
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit a2cbb1603943281a604f5adc48079a148db5cb0d
Author: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Date:   Thu Aug 8 16:06:40 2024 -0700

    tcp: Update window clamping condition

    This patch is based on the discussions between Neal Cardwell and
    Eric Dumazet in the link
    https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/

    It was correctly pointed out that tp->window_clamp would not be
    updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if
    (copied <= tp->rcvq_space.space). While it is expected for most
    setups to leave the sysctl enabled, the latter condition may
    not end up hitting depending on the TCP receive queue size and
    the pattern of arriving data.

    The updated check should be hit only on initial MSS update from
    TCP_MIN_MSS to measured MSS value and subsequently if there was
    an update to a larger value.

    Fixes: 05f76b2d634e ("tcp: Adjust clamping window for applications specifying SO_RCVBUF")
    Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
    Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:16 +02:00
Paolo Abeni 9af45f55fd tcp: Adjust clamping window for applications specifying SO_RCVBUF
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: tcp_rcv_space_adjust() lacks the ONCE annotation while \
  writing 'window_clamp' since rhel-9 lacks the upstream commit \
  f410cbea9f3d ("tcp: annotate data-races around tp->window_clamp")

Upstream commit:
commit 05f76b2d634e65ab34472802d9b142ea9e03f74e
Author: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Date:   Fri Jul 26 13:41:05 2024 -0700

    tcp: Adjust clamping window for applications specifying SO_RCVBUF

    tp->scaling_ratio is not updated based on skb->len/skb->truesize once
    SO_RCVBUF is set leading to the maximum window scaling to be 25% of
    rcvbuf after
    commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    and 50% of rcvbuf after
    commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio").
    50% tries to emulate the behavior of older kernels using
    sysctl_tcp_adv_win_scale with default value.

    Systems which were using a different values of sysctl_tcp_adv_win_scale
    in older kernels ended up seeing reduced download speeds in certain
    cases as covered in https://lists.openwall.net/netdev/2024/05/15/13
    While the sysctl scheme is no longer acceptable, the value of 50% is
    a bit conservative when the skb->len/skb->truesize ratio is later
    determined to be ~0.66.

    Applications not specifying SO_RCVBUF update the window scaling and
    the receiver buffer every time data is copied to userspace. This
    computation is now used for applications setting SO_RCVBUF to update
    the maximum window scaling while ensuring that the receive buffer
    is within the application specified limit.

    Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
    Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:16 +02:00
Paolo Abeni c35b38b5ea tcp: fix incorrect undo caused by DSACK of TLP retransmit
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 0ec986ed7bab6801faed1440e8839dcc710331ff
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Jul 3 13:12:46 2024 -0400

    tcp: fix incorrect undo caused by DSACK of TLP retransmit

    Loss recovery undo_retrans bookkeeping had a long-standing bug where a
    DSACK from a spurious TLP retransmit packet could cause an erroneous
    undo of a fast recovery or RTO recovery that repaired a single
    really-lost packet (in a sequence range outside that of the TLP
    retransmit). Basically, because the loss recovery state machine didn't
    account for the fact that it sent a TLP retransmit, the DSACK for the
    TLP retransmit could erroneously be implicitly interpreted as
    corresponding to the normal fast recovery or RTO recovery retransmit
    that plugged a real hole, thus resulting in an improper undo.

    For example, consider the following buggy scenario where there is a
    real packet loss but the congestion control response is improperly
    undone because of this bug:

    + send packets P1, P2, P3, P4
    + P1 is really lost
    + send TLP retransmit of P4
    + receive SACK for original P2, P3, P4
    + enter fast recovery, fast-retransmit P1, increment undo_retrans to 1
    + receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!)
    + receive cumulative ACK for P1-P4 (fast retransmit plugged real hole)

    The fix: when we initialize undo machinery in tcp_init_undo(), if
    there is a TLP retransmit in flight, then increment tp->undo_retrans
    so that we make sure that we receive a DSACK corresponding to the TLP
    retransmit, as well as DSACKs for all later normal retransmits, before
    triggering a loss recovery undo. Note that we also have to move the
    line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO
    we remember the tp->tlp_high_seq value until tcp_init_undo() and clear
    it only afterward.
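
    A sketch of the adjusted tcp_init_undo() bookkeeping described above
    (hedged, not the literal upstream hunk):

        static void tcp_init_undo(struct tcp_sock *tp)
        {
                tp->undo_marker = tp->snd_una;

                /* Retransmissions still in flight will produce DSACKs that
                 * must not be mistaken for "all retransmits were spurious".
                 */
                tp->undo_retrans = tp->retrans_out;
                if (tp->tlp_high_seq && tp->tlp_retrans)
                        tp->undo_retrans++;     /* count the TLP rtx too */
                if (!tp->undo_retrans)
                        tp->undo_retrans = -1;
        }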

    Also note that the bug dates back to the original 2013 TLP
    implementation, commit 6ba8a3b19e ("tcp: Tail loss probe (TLP)").

    However, this patch will only compile and work correctly with kernels
    that have tp->tlp_retrans, which was added only in v5.8 in 2020 in
    commit 76be93fc07 ("tcp: allow at most one TLP probe per flight").
    So we associate this fix with that later commit.

    Fixes: 76be93fc07 ("tcp: allow at most one TLP probe per flight")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Cc: Kevin Yang <yyd@google.com>
    Link: https://patch.msgid.link/20240703171246.1739561-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni 212db5de55 UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit a6458ab7fd4f427d4f6f54380453ad255b7fde83
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Jun 26 22:42:27 2024 -0400

    UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()

    In some production workloads we noticed that connections could
    sometimes close extremely prematurely with ETIMEDOUT after
    transmitting only 1 TLP and RTO retransmission (when we would normally
    expect roughly tcp_retries2 = TCP_RETR2 = 15 RTOs before a connection
    closes with ETIMEDOUT).

    From tracing we determined that these workloads can suffer from a
    scenario where in fast recovery, after some retransmits, a DSACK undo
    can happen at a point where the scoreboard is totally clear (we have
    retrans_out == sacked_out == lost_out == 0). In such cases, calling
    tcp_try_keep_open() means that we do not execute any code path that
    clears tp->retrans_stamp to 0. That means that tp->retrans_stamp can
    remain erroneously set to the start time of the undone fast recovery,
    even after the fast recovery is undone. If minutes or hours elapse,
    and then a TLP/RTO/RTO sequence occurs, then the start_ts value in
    retransmits_timed_out() (which is from tp->retrans_stamp) will be
    erroneously ancient (left over from the fast recovery undone via
    DSACKs). Thus this ancient tp->retrans_stamp value can cause the
    connection to die very prematurely with ETIMEDOUT via
    tcp_write_err().

    The fix: we change DSACK undo in fast recovery (TCP_CA_Recovery) to
    call tcp_try_to_open() instead of tcp_try_keep_open(). This ensures
    that if no retransmits are in flight at the time of DSACK undo in fast
    recovery then we properly zero retrans_stamp. Note that calling
    tcp_try_to_open() is more consistent with other loss recovery
    behavior, since normal fast recovery (CA_Recovery) and RTO recovery
    (CA_Loss) both normally end when tp->snd_una meets or exceeds
    tp->high_seq and then in tcp_fastretrans_alert() the "default" switch
    case executes tcp_try_to_open(). Also note that by inspection this
    change to call tcp_try_to_open() implies at least one other nice bug
    fix, where now an ECE-marked DSACK that causes an undo will properly
    invoke tcp_enter_cwr() rather than ignoring the ECE mark.
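
    In code, the change amounts to something like this at the DSACK-undo
    call site in tcp_fastretrans_alert() (sketch, not the literal hunk):

        if (tcp_try_undo_dsack(sk))
                tcp_try_to_open(sk, flag);      /* was: tcp_try_keep_open(sk) */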

    Fixes: c7d9d6a185 ("tcp: undo on DSACK during recovery")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni 37a7b087d1 tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 5dfe9d273932c647bdc9d664f939af9a5a398cbc
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Jun 24 14:43:23 2024 +0000

    tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO

    Testing determined that the recent commit 9e046bb111f1 ("tcp: clear
    tp->retrans_stamp in tcp_rcv_fastopen_synack()") has a race, and does
    not always ensure retrans_stamp is 0 after a TFO payload retransmit.

    If transmit completion for the SYN+data skb happens after the client
    TCP stack receives the SYNACK (which sometimes happens), then
    retrans_stamp can erroneously remain non-zero for the lifetime of the
    connection, causing a premature ETIMEDOUT later.

    Testing and tracing showed that the buggy scenario is the following
    somewhat tricky sequence:

    + Client attempts a TFO handshake. tcp_send_syn_data() sends SYN + TFO
      cookie + data in a single packet in the syn_data skb. It hands the
      syn_data skb to tcp_transmit_skb(), which makes a clone. Crucially,
      it then reuses the same original (non-clone) syn_data skb,
      transforming it by advancing the seq by one byte and removing the
      FIN bit, and enques the resulting payload-only skb in the
      sk->tcp_rtx_queue.

    + Client sets retrans_stamp to the start time of the three-way
      handshake.

    + Cookie mismatches or server has TFO disabled, and server only ACKs
      SYN.

    + tcp_ack() sees SYN is acked, tcp_clean_rtx_queue() clears
      retrans_stamp.

    + Since the client SYN was acked but not the payload, the TFO failure
      code path in tcp_rcv_fastopen_synack() tries to retransmit the
      payload skb.  However, in some cases the transmit completion for the
      clone of the syn_data (which had SYN + TFO cookie + data) hasn't
      happened.  In those cases, skb_still_in_host_queue() returns true
      for the retransmitted TFO payload, because the clone of the syn_data
      skb has not had its tx completion.

    + Because skb_still_in_host_queue() finds skb_fclone_busy() is true,
      it sets the TSQ_THROTTLED bit and the retransmit does not happen in
      the tcp_rcv_fastopen_synack() call chain.

    + The tcp_rcv_fastopen_synack() code next implicitly assumes the
      retransmit process is finished, and sets retrans_stamp to 0 to clear
      it, but this is later overwritten (see below).

    + Later, upon tx completion, tcp_tsq_write() calls
      tcp_xmit_retransmit_queue(), which puts the retransmit in flight and
      sets retrans_stamp to a non-zero value.

    + The client receives an ACK for the retransmitted TFO payload data.

    + Since we're in CA_Open and there are no dupacks/SACKs/DSACKs/ECN to
      make tcp_ack_is_dubious() true and make us call
      tcp_fastretrans_alert() and reach a code path that clears
      retrans_stamp, retrans_stamp stays nonzero.

    + Later, if there is a TLP, RTO, RTO sequence, then the connection
      will suffer an early ETIMEDOUT due to the erroneously ancient
      retrans_stamp.

    The fix: this commit refactors the code to have
    tcp_rcv_fastopen_synack() retransmit by reusing the relevant parts of
    tcp_simple_retransmit() that enter CA_Loss (without changing cwnd) and
    call tcp_xmit_retransmit_queue(). We have tcp_simple_retransmit() and
    tcp_rcv_fastopen_synack() share code in this way because in both cases
    we get a packet indicating non-congestion loss (MTU reduction or TFO
    failure) and thus in both cases we want to retransmit as many packets
    as cwnd allows, without reducing cwnd. And given that retransmits will
    set retrans_stamp to a non-zero value (and may do so in a later
    calling context due to TSQ), we also want to enter CA_Loss so that we
    track when all retransmitted packets are ACked and clear retrans_stamp
    when that happens (to ensure later recurring RTOs are using the
    correct retrans_stamp and don't declare ETIMEDOUT prematurely).

    Fixes: 9e046bb111f1 ("tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()")
    Fixes: a7abf3cd76 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Link: https://patch.msgid.link/20240624144323.2371403-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni ddc843f31c tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 9e046bb111f13461d3f9331e24e974324245140e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Jun 14 13:06:15 2024 +0000

    tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()

    Some applications were reporting ETIMEDOUT errors on apparently
    good looking flows, according to packet dumps.

    We were able to root cause the issue to an accidental setting
    of tp->retrans_stamp in the following scenario:

    - client sends TFO SYN with data.
    - server has TFO disabled, ACKs only SYN but not payload.
    - client receives SYNACK covering only SYN.
    - tcp_ack() eats SYN and sets tp->retrans_stamp to 0.
    - tcp_rcv_fastopen_synack() calls tcp_xmit_retransmit_queue()
      to retransmit TFO payload w/o SYN, sets tp->retrans_stamp to "now",
      but we are not in any loss recovery state.
    - TFO payload is ACKed.
    - we are not in any loss recovery state, and don't see any dupacks,
      so we don't get to any code path that clears tp->retrans_stamp.
    - tp->retrans_stamp stays non-zero for the lifetime of the connection.
    - after first RTO, tcp_clamp_rto_to_user_timeout() clamps second RTO
      to 1 jiffy due to bogus tp->retrans_stamp.
    - on clamped RTO with non-zero icsk_retransmits, retransmits_timed_out()
      sets start_ts from tp->retrans_stamp from TFO payload retransmit
      hours/days ago, and computes bogus long elapsed time for loss recovery,
      and suffers ETIMEDOUT early.
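
    The fix boils down to clearing the stamp in tcp_rcv_fastopen_synack()
    right after the TFO payload retransmit (sketch):

                tcp_xmit_retransmit_queue(sk);
                /* We are not in any loss-recovery state: clear the stamp the
                 * retransmit just set, so later RTO math starts from scratch.
                 */
                tp->retrans_stamp = 0;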

    Fixes: a7abf3cd76 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()")
    CC: stable@vger.kernel.org
    Co-developed-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Co-developed-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240614130615.396837-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:15 +02:00
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.
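
    Roughly, the per-cpu dance looks like this (variable and accessor names
    assumed for illustration):

        DEFINE_PER_CPU(u32, tcp_tw_isn);

        /* set when tcp_timewait_state_process() returns TCP_TW_SYN: */
        __this_cpu_write(tcp_tw_isn, isn);

        /* consumed in tcp_conn_request(), same CPU, same softirq: */
        u32 isn = __this_cpu_read(tcp_tw_isn);
        __this_cpu_write(tcp_tw_isn, 0);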

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
Paolo Abeni 4cd846284a tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b9e810405880c99baafd550ada7043e86465396e
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:21 2024 +0000

    tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()

    tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
    out if the request socket is created by a SYN hitting a TIMEWAIT socket.

    This has been buggy for a decade, lets directly pass the information
    from tcp_conn_request().

    This is a preparatory patch to make the following one easier to review.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:07:53 +02:00
Paolo Abeni 516cdba7bf tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit e326578a21414738de45f77badd332fb00bd0f58
Author: Aananth V <aananthv@google.com>
Date:   Thu Sep 14 14:36:20 2023 +0000

    tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed

    For passive TCP Fast Open sockets that had SYN/ACK timeout and did not
    send more data in SYN_RECV, upon receiving the final ACK in 3WHS, the
    congestion state may awkwardly stay in CA_Loss mode unless the CA state
    was undone due to TCP timestamp checks. However, if
    tcp_rcv_synrecv_state_fastopen() decides not to undo, then we should
    enter CA_Open, because at that point we have received an ACK covering
    the retransmitted SYNACKs. Currently, the icsk_ca_state is only set to
    CA_Open after we receive an ACK for a data-packet. This is because
    tcp_ack does not call tcp_fastretrans_alert (and tcp_process_loss) if
    !prior_packets.

    Note that tcp_process_loss() calls tcp_try_undo_recovery(), so having
    tcp_rcv_synrecv_state_fastopen() decide that if we're in CA_Loss we
    should call tcp_try_undo_recovery() is consistent with that, and
    low risk.
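
    A sketch of the resulting tcp_rcv_synrecv_state_fastopen() logic (the
    exact condition is hedged):

        struct tcp_sock *tp = tcp_sk(sk);

        /* SYNACK was RTO-retransmitted and no extra data was sent while in
         * SYN_RECV: undo explicitly so the socket enters CA_Open.
         */
        if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss && !tp->packets_out)
                tcp_try_undo_recovery(sk);
        else
                tcp_try_undo_loss(sk, false);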

    Fixes: dad8cea7ad ("tcp: fix TFO SYNACK undo to avoid double-timestamp-undo")
    Signed-off-by: Aananth V <aananthv@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 15:29:39 +02:00
Jamie Bainbridge 9bf961c675 net: tcp: accept old ack during closing
JIRA: https://issues.redhat.com/browse/RHEL-60572

Conflicts:
- net/ipv4/tcp_input.c
  Code difference because EL9 already has upstream commit
  7d6ed9afde85 ("tcp: add dropreasons in tcp_rcv_state_process()").
  After this patch, EL9's "reason = " and "if reason <= 0" block now
  match upstream, the same result as if 7d6ed9afde85 and 795a7dfbc3d9
  were applied in order, so future patches here will not conflict.

commit 795a7dfbc3d95e4c7c09569f319f026f8c7f5a9c
Author: Menglong Dong <menglong8.dong@gmail.com>
Date:   Fri Jan 26 12:05:19 2024 +0800

    net: tcp: accept old ack during closing

    For now, the packet with an old ack is not accepted if we are in
    FIN_WAIT1 state, which can cause retransmission. Taking the following
    case as an example:

        Client                               Server
          |                                    |
      FIN_WAIT1(Send FIN, seq=10)          FIN_WAIT1(Send FIN, seq=20, ack=10)
          |                                    |
          |                                Send ACK(seq=21, ack=11)
       Recv ACK(seq=21, ack=11)
          |
       Recv FIN(seq=20, ack=10)

    In the case above, simultaneous close is happening, and the FIN and ACK
    packet that send from the server is out of order. Then, the FIN will be
    dropped by the client, as it has an old ack. Then, the server has to
    retransmit the FIN, which can cause delay if the server has set the
    SO_LINGER on the socket.

    Old ack is accepted in the ESTABLISHED and TIME_WAIT state, and I think
    it would be better to keep the same logic.

    In this commit, we accept old ack in FIN_WAIT1/FIN_WAIT2/CLOSING/LAST_ACK
    states. Maybe we should limit it to FIN_WAIT1 for now?
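
    A condensed sketch of the resulting check in tcp_rcv_state_process()
    (matching the backport conflict note above; the SYN_RECV special case is
    omitted, and it assumes tcp_ack() returns a positive value when the ACK
    made progress, zero for an old ACK, and a negative drop reason otherwise):

        reason = tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT |
                                  FLAG_NO_CHALLENGE_ACK);
        if ((int)reason <= 0) {
                /* accept old ack during closing: only a bogus ACK (negative
                 * reason) draws a challenge ACK, while an old but harmless
                 * ACK (zero) falls through so an out-of-order FIN in
                 * FIN_WAIT1/FIN_WAIT2/CLOSING/LAST_ACK is still processed.
                 */
                if ((int)reason < 0) {
                        tcp_send_challenge_ack(sk);
                        goto discard;
                }
        }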

    Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240126040519.1846345-1-menglong8.dong@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
2024-10-03 13:54:07 +10:00
Wander Lairson Costa 26717e7509 tcp: drop skb dst in tcp_rcv_established()
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 783d108dd71d97e4cac5fe8ce70ca43ed7dc7bb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 29 18:15:23 2022 -0700

    tcp: drop skb dst in tcp_rcv_established()

    In commit f84af32cbc ("net: ip_queue_rcv_skb() helper")
    I dropped the skb dst in tcp_data_queue().

    This only dealt with so-called TCP input slow path.

    When fast path is taken, tcp_rcv_established() calls
    tcp_queue_rcv() while skb still has a dst.

    This was mostly fine, because most dsts at this point
    are not refcounted (thanks to early demux)

    However, TCP packets sent over loopback have refcounted dst.

    Then commit 68822bdf76f1 ("net: generalize skb freeing
    deferral to per-cpu lists") came and had the effect
    of delaying skb freeing for an arbitrary time.

    If during this time the involved netns is dismantled, cleanup_net()
    frees the struct net with embedded net->ipv6.ip6_dst_ops.

    Then when eventually dst_destroy_rcu() is called,
    if (dst->ops->destroy) ... triggers an use-after-free.

    It is not clear if ip6_route_net_exit() lacks a rcu_barrier()
    as syzbot reported similar issues before the blamed commit.

    ( https://groups.google.com/g/syzkaller-bugs/c/CofzW4eeA9A/m/009WjumTAAAJ )
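
    The fix itself is essentially a single skb_dst_drop() in the fast path
    before the skb is queued (sketch):

                        /* Bulk data transfer: receiver.
                         * The dst is not needed past this point; drop it now
                         * so a defer-freed skb cannot pin the netns.
                         */
                        skb_dst_drop(skb);
                        __skb_pull(skb, tcp_header_len);
                        eaten = tcp_queue_rcv(sk, skb, &fragstolen);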

    Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:29 -03:00
Antoine Tenart 3a0f9f0ce0 tcp: use sk_skb_reason_drop to free rx packets
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit 46a02aa357529d7b038096955976b14f7c44aa23
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:20 2024 -0700

    tcp: use sk_skb_reason_drop to free rx packets

    Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
    socket to the tracepoint.

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 206757f0ed tcp: add dropreasons in tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Code difference in tcp_rcv_state_process due to missing upstream
  commit 795a7dfbc3d9 ("net: tcp: accept old ack during closing").

commit 7d6ed9afde8547723f6f96f81f984cc6c48eef52
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:25 2024 +0800

    tcp: add dropreasons in tcp_rcv_state_process()

    In this patch, I equipped this function with more drop reasons, but
    they are not reported yet; that will be done in a later patch.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 6c9f108418 tcp: add more specific possible drop reasons in tcp_rcv_synsent_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit e615e3a24ed6f1a501f9b5426ec0b476fded4d22
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:24 2024 +0800

    tcp: add more specific possible drop reasons in tcp_rcv_synsent_state_process()

    This patch does two things:
    1) add two more new reasons
    2) only change the return value (1) to various drop reason values
    for future use

    For now, we still cannot trace those two reasons. We'll implement the full
    function in the subsequent patch in this series.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Florian Westphal bd2a0fb2c5 tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
JIRA: https://issues.redhat.com/browse/RHEL-39833
Upstream Status: commit 94062790aedb
CVE: CVE-2024-36905

commit 94062790aedb505bdda209b10bea47b294d6394f
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 1 12:54:48 2024 +0000

    tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets

    TCP_SYN_RECV state is really special, it is only used by
    cross-syn connections, mostly used by fuzzers.

    In the following crash [1], syzbot managed to trigger a divide
    by zero in tcp_rcv_space_adjust()

    A socket makes the following state transitions,
    without ever calling tcp_init_transfer(),
    meaning tcp_init_buffer_space() is also not called.

             TCP_CLOSE
    connect()
             TCP_SYN_SENT
             TCP_SYN_RECV
    shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
             TCP_FIN_WAIT1

    To fix this issue, change tcp_shutdown() to not
    perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
    which makes no sense anyway.

    When tcp_rcv_state_process() later changes socket state
    from TCP_SYN_RECV to TCP_ESTABLISH, then look at
    sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
    and send a FIN packet from a sane socket state.

    This means tcp_send_fin() can now be called from BH
    context, and must use GFP_ATOMIC allocations.
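
    In practice tcp_shutdown() simply stops treating TCP_SYN_RECV as a state
    from which it may send a FIN (sketch, not the full patch):

        void tcp_shutdown(struct sock *sk, int how)
        {
                if (!(how & SEND_SHUTDOWN))
                        return;

                /* TCPF_SYN_RECV is no longer in this mask: the FIN is
                 * deferred until tcp_rcv_state_process() moves the socket to
                 * TCP_ESTABLISHED and then checks sk->sk_shutdown, sending
                 * the FIN from a sane state (with GFP_ATOMIC, since it may
                 * now run in BH context).
                 */
                if ((1 << sk->sk_state) &
                    (TCPF_ESTABLISHED | TCPF_SYN_SENT | TCPF_CLOSE_WAIT)) {
                        if (tcp_close_state(sk))
                                tcp_send_fin(sk);
                }
        }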

    [1]
    divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
    CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
     RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
    Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
    RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
    RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
    R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
    R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
    FS:  00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
    Call Trace:
     <TASK>
      tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
      tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
      inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
      sock_recvmsg_nosec net/socket.c:1046 [inline]
      sock_recvmsg+0x109/0x280 net/socket.c:1068
      ____sys_recvmsg+0x1db/0x470 net/socket.c:2803
      ___sys_recvmsg net/socket.c:2845 [inline]
      do_recvmmsg+0x474/0xae0 net/socket.c:2939
      __sys_recvmmsg net/socket.c:3018 [inline]
      __do_sys_recvmmsg net/socket.c:3041 [inline]
      __se_sys_recvmmsg net/socket.c:3034 [inline]
      __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7faeb6363db9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
    RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
    RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
    R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-06-07 15:21:41 +02:00
Sabrina Dubroca 7bc5eeb384 net: skbuff: generalize the skb->decrypted bit
JIRA: https://issues.redhat.com/browse/RHEL-29306

commit 9f06f87fef689d28588cde8c7ebb00a67da34026
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 3 13:21:39 2024 -0700

    net: skbuff: generalize the skb->decrypted bit

    The ->decrypted bit can be reused for other crypto protocols.
    Remove the direct dependency on TLS, add helpers to clean up
    the ifdefs leaking out everywhere.
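
    For instance, a helper of this shape is what replaces the open-coded
    ifdefs (names as recalled, hedged):

        static inline bool skb_is_decrypted(const struct sk_buff *skb)
        {
        #ifdef CONFIG_SKB_DECRYPTED
                return skb->decrypted;
        #else
                return false;
        #endif
        }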

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-05-01 17:48:16 +02:00
Paolo Abeni 05001ba2cb tcp: do not accept ACK of bytes we never sent
JIRA: https://issues.redhat.com/browse/RHEL-21432
Tested: LNST, Tier1

Upstream commit:
commit 3d501dd326fb1c73f1b8206d4c6e1d7b15c07e27
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Dec 5 16:18:41 2023 +0000

    tcp: do not accept ACK of bytes we never sent

    This patch is based on a detailed report and ideas from Yepeng Pan
    and Christian Rossow.

    ACK seq validation is currently following RFC 5961 5.2 guidelines:

       The ACK value is considered acceptable only if
       it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <=
       SND.NXT).  All incoming segments whose ACK value doesn't satisfy the
       above condition MUST be discarded and an ACK sent back.  It needs to
       be noted that RFC 793 on page 72 (fifth check) says: "If the ACK is a
       duplicate (SEG.ACK < SND.UNA), it can be ignored.  If the ACK
       acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an
       ACK, drop the segment, and return".  The "ignored" above implies that
       the processing of the incoming data segment continues, which means
       the ACK value is treated as acceptable.  This mitigation makes the
       ACK check more stringent since any ACK < SND.UNA wouldn't be
       accepted, instead only ACKs that are in the range ((SND.UNA -
       MAX.SND.WND) <= SEG.ACK <= SND.NXT) get through.

    This can be refined for new (and possibly spoofed) flows,
    by not accepting ACK for bytes that were never sent.

    This greatly improves TCP security at a little cost.

    I added a Fixes: tag to make sure this patch will reach stable trees,
    even if the 'blamed' patch was adhering to the RFC.

    tp->bytes_acked was added in linux-4.2
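
    The refined check in tcp_ack(), roughly (sketch):

        if (before(ack, prior_snd_una)) {
                u32 max_window;

                /* Do not accept an ACK for bytes we never sent: for new (or
                 * spoofed) flows bytes_acked is still small, so the tolerated
                 * "old ACK" range shrinks accordingly.
                 */
                max_window = min_t(u64, tp->max_window, tp->bytes_acked);
                /* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
                if (before(ack, prior_snd_una - max_window)) {
                        /* send a challenge ACK and drop, as before */
                }
        }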

    Following packetdrill test (courtesy of Yepeng Pan) shows
    the issue at hand:

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1024) = 0

    // ---------------- Handshake ------------------- //

    // when window scale is set to 14 the window size can be extended to
    // 65535 * (2^14) = 1073725440. Linux would accept an ACK packet
    // with ack number in (Server_ISN+1-1073725440. Server_ISN+1)
    // ,though this ack number acknowledges some data never
    // sent by the server.

    +0 < S 0:0(0) win 65535 <mss 1400,nop,wscale 14>
    +0 > S. 0:0(0) ack 1 <...>
    +0 < . 1:1(0) ack 1 win 65535
    +0 accept(3, ..., ...) = 4

    // For the established connection, we send an ACK packet,
    // the ack packet uses ack number 1 - 1073725300 + 2^32,
    // where 2^32 is used to wrap around.
    // Note: we used 1073725300 instead of 1073725440 to avoid possible
    // edge cases.
    // 1 - 1073725300 + 2^32 = 3221241997

    // Oops, old kernels happily accept this packet.
    +0 < . 1:1001(1000) ack 3221241997 win 65535

    // After the kernel fix the following will be replaced by a challenge ACK,
    // and prior malicious frame would be dropped.
    +0 > . 1:1(0) ack 1001

    Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Yepeng Pan <yepeng.pan@cispa.de>
    Reported-by: Christian Rossow <rossow@cispa.de>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20231205161841.2702925-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 16:31:32 +01:00
Jan Stancek 063f72e7e5 Merge: mptcp: rebase to Linux 6.7
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3305

JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1, selftests, pktdrill

Rebase to the current upstream to bring in new features and
a lot of fixes. A good half of the long commit list touches
the self-tests only, and the remaining is self-contained in mptcp.

The only notable exception is:

tcp: get rid of sysctl_tcp_adv_win_scale

that is a pre requisite to a bunch of mptcp changes included here
and also uncontroversially a good thing (TM) for TCP.

Wider-scope data-races related changeset are included (possibly as
partial backport) only if they help to reduce conflict on later
changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:53 +01:00
Jan Stancek 3c8d3e2d4a Merge: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3301

JIRA: https://issues.redhat.com/browse/RHEL-11592

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date:   Sun Jun 11 22:05:24 2023 -0500

    tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

    Under certain circumstances, the tcp receive buffer memory limit
    set by autotuning (sk_rcvbuf) is increased due to incoming data
    packets as a result of the window not closing when it should be.
    This can result in the receive buffer growing all the way up to
    tcp_rmem[2], even for tcp sessions with a low BDP.

    To reproduce:  Connect a TCP session with the receiver doing
    nothing and the sender sending small packets (an infinite loop
    of socket send() with 4 bytes of payload with a sleep of 1 ms
    in between each send()).  This will cause the tcp receive buffer
    to grow all the way up to tcp_rmem[2].

    As a result, a host can have individual tcp sessions with receive
    buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
    limits, causing the host to go into tcp memory pressure mode.

    The fundamental issue is the relationship between the granularity
    of the window scaling factor and the number of byte ACKed back
    to the sender.  This problem has previously been identified in
    RFC 7323, appendix F [1].

    The Linux kernel currently adheres to never shrinking the window.

    In addition to the overallocation of memory mentioned above, the
    current behavior is functionally incorrect, because once tcp_rmem[2]
    is reached and no remediations remain (i.e. tcp collapse fails to
    free up any more memory and there are no packets to prune from the
    out-of-order queue), the receiver will drop in-window packets
    resulting in retransmissions and an eventual timeout of the tcp
    session.  A receive buffer full condition should instead result
    in a zero window and an indefinite wait.

    In practice, this problem is largely hidden for most flows.  It
    is not applicable to mice flows.  Elephant flows can send data
    fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
    triggering a zero window.

    But this problem does show up for other types of flows.  Examples
    are websockets and other types of flows that send small amounts of
    data spaced apart slightly in time.  In these cases, we directly
    encounter the problem described in [1].

    RFC 7323, section 2.4 [2], says there are instances when a retracted
    window can be offered, and that TCP implementations MUST ensure
    that they handle a shrinking window, as specified in RFC 1122,
    section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
    management have made clear that the sender must accept a shrunk window
    from the receiver, including RFC 793 [4] and RFC 1323 [5].

    This patch implements the functionality to shrink the tcp window
    when necessary to keep the right edge within the memory limit set
    by autotuning (sk_rcvbuf).  This new functionality is enabled with
    the new sysctl: net.ipv4.tcp_shrink_window
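
    For reference, a minimal userspace sketch that turns the feature on,
    assuming the usual sysctl-to-procfs mapping, i.e. the equivalent of
    "sysctl -w net.ipv4.tcp_shrink_window=1":

    #include <stdio.h>

    int main(void)
    {
            /* path assumed from the sysctl name; adjust if your setup differs */
            FILE *f = fopen("/proc/sys/net/ipv4/tcp_shrink_window", "w");

            if (!f) {
                    perror("tcp_shrink_window");
                    return 1;
            }
            fputs("1\n", f);
            fclose(f);
            return 0;
    }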

    Additional information can be found at:
    https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

    [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
    [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
    [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
    [4] https://www.rfc-editor.org/rfc/rfc793
    [5] https://www.rfc-editor.org/rfc/rfc1323

    Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:41 +01:00
Jan Stancek 9eea5b8c8f Merge: net: backport drop reason related patches
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3265

JIRA: https://issues.redhat.com/browse/RHEL-14554
Depends: !3196

Skb drop reason related patches, and a few extra ones for easier backports.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:49:35 +01:00
Antoine Tenart 6aaf1a5b76 tcp: add TCP_OLD_SEQUENCE drop reason
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git

commit b44693495af8f309b8ddec4b30833085d1c2d0c4
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 06:47:54 2023 +0000

    tcp: add TCP_OLD_SEQUENCE drop reason

    tcp_sequence() uses two conditions to decide to drop a packet,
    and we currently report generic TCP_INVALID_SEQUENCE drop reason.

    Duplicates are common, we need to distinguish them from
    the other case.

    I chose to not reuse TCP_OLD_DATA, and instead added
    TCP_OLD_SEQUENCE drop reason.
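
    A rough userspace sketch of the distinction (hypothetical names, not the
    kernel diff): a segment that ends before data we already acknowledged is
    reported as old, while one that starts beyond the advertised window is
    reported as invalid.

    #include <stdbool.h>
    #include <stdint.h>

    static bool seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
    static bool seq_after(uint32_t a, uint32_t b)  { return seq_before(b, a); }

    enum demo_reason { DEMO_ACCEPT, DEMO_OLD_SEQUENCE, DEMO_INVALID_SEQUENCE };

    static enum demo_reason demo_check_seq(uint32_t seq, uint32_t end_seq,
                                           uint32_t rcv_wup, uint32_t rcv_nxt,
                                           uint32_t rcv_wnd)
    {
            if (seq_before(end_seq, rcv_wup))       /* already fully acknowledged */
                    return DEMO_OLD_SEQUENCE;
            if (seq_after(seq, rcv_nxt + rcv_wnd))  /* starts past the right edge */
                    return DEMO_INVALID_SEQUENCE;
            return DEMO_ACCEPT;
    }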

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719064754.2794106-1-edumazet@google.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:30 +01:00
Paolo Abeni 8ce7b9e432 tcp: get rid of sysctl_tcp_adv_win_scale
JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1

Upstream commit:
commit dfa2f0483360d4d6f2324405464c9f281156bd87
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jul 17 15:29:17 2023 +0000

    tcp: get rid of sysctl_tcp_adv_win_scale

    With modern NIC drivers shifting to full page allocations per
    received frame, we face the following issue:

    TCP has one per-netns sysctl used to tweak how to translate
    a memory use into an expected payload (RWIN), in RX path.

    The tcp_win_from_space() implementation is limited to a few cases.

    For hosts dealing with various MSS, we either underestimate
    or overestimate the RWIN we send to the remote peers.

    For instance with the default sysctl_tcp_adv_win_scale value,
    we expect to store 50% of payload per allocated chunk of memory.

    For the typical case of MTU=1500 traffic and order-0 page allocations
    by NIC drivers, we send an RWIN that is too big, leading to potential
    tcp collapse operations, which are extremely expensive and a source
    of latency spikes.

    This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
    uses a per socket scaling factor, so that we can precisely
    adjust the RWIN based on effective skb->len/skb->truesize ratio.

    This patch alone can double TCP receive performance when receivers
    are too slow to drain their receive queue, or by allowing
    a bigger RWIN when MSS is close to PAGE_SIZE.
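
    A rough userspace illustration of the difference (all names hypothetical,
    not the kernel code): the old model applies one fixed divisor to translate
    memory into advertised window, while the new one scales by the
    payload/truesize ratio actually observed on received skbs.

    #include <stdint.h>

    /* old model: with the default tcp_adv_win_scale of 1, assume 50% of the
     * allocated memory is payload
     */
    static uint32_t win_from_space_old(uint32_t space, int adv_win_scale)
    {
            return adv_win_scale <= 0 ? space >> -adv_win_scale
                                      : space - (space >> adv_win_scale);
    }

    /* new idea: derive the ratio from what the driver actually delivered,
     * e.g. ~1448 payload bytes sitting in a 4096-byte order-0 page
     */
    static uint32_t win_from_space_new(uint32_t space, uint32_t skb_len,
                                       uint32_t skb_truesize)
    {
            return (uint64_t)space * skb_len / skb_truesize;
    }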

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-31 21:50:01 +01:00
Felix Maurer 2261d33599 tcp: adjust rcv_ssthresh according to sk_reserved_mem
JIRA: https://issues.redhat.com/browse/RHEL-11592
Conflicts:
- net/ipv4/tcp_input.c: context difference due to missing 240bfd134c59
  ("tcp: tweak len/truesize ratio for coalesce candidates")

commit 053f368412c9a7bfce2befec8c795113c8cfb0b1
Author: Wei Wang <weiwan@google.com>
Date:   Wed Sep 29 10:25:13 2021 -0700

    tcp: adjust rcv_ssthresh according to sk_reserved_mem

    When user sets SO_RESERVE_MEM socket option, in order to utilize the
    reserved memory when in memory pressure state, we adjust rcv_ssthresh
    according to the available reserved memory for the socket, instead of
    using 4 * advmss always.
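
    A rough sketch of the idea (illustrative names only, simplified from the
    actual change): clamp rcv_ssthresh to the classic 4 * advmss, but let it
    stay higher when unused reserved memory is still available to this socket.

    #include <stdint.h>

    static uint32_t demo_adjust_rcv_ssthresh(uint32_t rcv_ssthresh, uint32_t advmss,
                                             uint32_t unused_reserved_mem)
    {
            uint32_t clamped = rcv_ssthresh < 4u * advmss ? rcv_ssthresh
                                                          : 4u * advmss;

            /* memory the application explicitly reserved can still back a
             * larger window
             */
            return unused_reserved_mem > clamped ? unused_reserved_mem : clamped;
    }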

    Signed-off-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-31 16:20:06 +01:00
Paolo Abeni 1259c573ac tcp: fix delayed ACKs for MSS boundary condition
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1

Upstream commit:
commit 4720852ed9afb1c5ab84e96135cb5b73d5afde6f
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Oct 1 11:12:39 2023 -0400

    tcp: fix delayed ACKs for MSS boundary condition

    This commit fixes poor delayed ACK behavior that can cause poor TCP
    latency in a particular boundary condition: when an application makes
    a TCP socket write that is an exact multiple of the MSS size.

    The problem is that there is painful boundary discontinuity in the
    current delayed ACK behavior. With the current delayed ACK behavior,
    we have:

    (1) If an app reads data when > 1*MSS is unacknowledged, then
        tcp_cleanup_rbuf() ACKs immediately because of:

         tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

    (2) If an app reads all received data, and the packets were < 1*MSS,
        and either (a) the app is not ping-pong or (b) we received two
        packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because
        of:

         ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
          ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
           !inet_csk_in_pingpong_mode(sk))) &&

    (3) *However*: if an app reads exactly 1*MSS of data,
        tcp_cleanup_rbuf() does not send an immediate ACK. This is true
        even if the app is not ping-pong and the 1*MSS of data had the PSH
        bit set, suggesting the sending application completed an
        application write.

    Thus if the app is not ping-pong, we have this painful case where
    >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
    write whose last skb is an exact multiple of 1*MSS can get a 40ms
    delayed ACK. This means that any app that transfers data in one
    direction and takes care to align write size or packet size with MSS
    can suffer this problem. With receive zero copy making 4KB MSS values
    more common, it is becoming more common to have application writes
    naturally align with MSS, and more applications are likely to
    encounter this delayed ACK problem.

    The fix in this commit is to refine the delayed ACK heuristics with a
    simple check: immediately ACK a received 1*MSS skb with PSH bit set if
    the app reads all data. Why? If an skb has a len of exactly 1*MSS and
    has the PSH bit set then it is likely the end of an application
    write. So more data may not be arriving soon, and yet the data sender
    may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
    set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
    an ACK immediately if the app reads all of the data and is not
    ping-pong. Note that this logic is also executed for the case where
    len > MSS, but in that case this logic does not matter (and does not
    hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
    app reads data and there is more than an MSS of unACKed data.
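
    A rough sketch of that heuristic (hypothetical names, not the actual
    diff): a received skb carrying at least one MSS of payload with PSH set
    marks the ACK state as pushed, so a subsequent full read by the
    application triggers an immediate ACK.

    #include <stdbool.h>
    #include <stdint.h>

    struct demo_ack_state { bool pushed; };

    static void demo_event_data_recv(struct demo_ack_state *ack, uint32_t skb_len,
                                     uint32_t rcv_mss, bool psh_bit)
    {
            if (psh_bit && skb_len >= rcv_mss)
                    ack->pushed = true;   /* cleanup_rbuf-style code then ACKs
                                           * as soon as the app reads all data
                                           */
    }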

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Cc: Xin Guo <guoxin0309@gmail.com>
    Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:53:59 +02:00
Paolo Abeni 318329e4ce tcp: fix mishandling when the sack compression is deferred.
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1

Upstream commit:
commit 30c6f0bf9579debce27e45fac34fdc97e46acacc
Author: fuyuanli <fuyuanli@didiglobal.com>
Date:   Wed May 31 16:01:50 2023 +0800

    tcp: fix mishandling when the sack compression is deferred.

    In this patch, we mainly try to handle sending a compressed ack
    correctly if it's deferred.

    Here are more details in the old logic:
    When sack compression is triggered in the tcp_compressed_ack_kick(),
    if the sock is owned by the user, it will set TCP_DELACK_TIMER_DEFERRED
    and then defer to the release cb phase. Later, once the user releases
    the sock, tcp_delack_timer_handler() should send an ack as expected,
    which, however, cannot happen due to the lack of the ICSK_ACK_TIMER flag.
    Therefore, the receiver would not send an ack until the sender's
    retransmission timeout, which adds unnecessary latency.
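
    A rough sketch of the failure mode described above (hypothetical names,
    not the kernel code): the deferred handler gates on a pending-timer bit
    that the compression path never set, so the ack is silently skipped.

    #include <stdbool.h>
    #include <stdio.h>

    struct demo_sock {
            bool delack_deferred;    /* analogue of TCP_DELACK_TIMER_DEFERRED */
            bool ack_timer_pending;  /* analogue of ICSK_ACK_TIMER */
    };

    static void demo_release_cb(struct demo_sock *sk)
    {
            if (!sk->delack_deferred)
                    return;
            if (!sk->ack_timer_pending)
                    return;                 /* old logic bails out here */
            printf("send deferred ack\n");
    }

    int main(void)
    {
            /* compression path deferred the work but never set the timer bit */
            struct demo_sock sk = { .delack_deferred = true,
                                    .ack_timer_pending = false };

            demo_release_cb(&sk);           /* prints nothing: ack never sent */
            return 0;
    }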

    Fixes: 5d9f4262b7 ("tcp: add SACK compression")
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: fuyuanli <fuyuanli@didiglobal.com>
    Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
    Link: https://lore.kernel.org/netdev/20230529113804.GA20300@didi-ThinkCentre-M920t-N000/
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230531080150.GA20424@didi-ThinkCentre-M920t-N000
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:46:08 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer fc8e2acd3e tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit b67985be400969578d4d4b17299714c0e5d2c07b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 1 10:46:40 2022 -0800

    tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
    
    tcp_shift_skb_data() might collapse three packets into a larger one.
    
    P_A, P_B, P_C  -> P_ABC
    
    Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
    because it was enough.
    
    In commit 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions"),
    this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
    
    But the now-needed test over P_C has been missed.
    
    This probably broke MPTCP.
    
    Then later, commit 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
    added an extra condition to tcp_skb_can_collapse(), but the missing call
    from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
    might have different skb_zcopy_pure() status.
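
    An illustrative sketch of the point (hypothetical helpers, not the kernel
    API): when three segments are merged into one, the compatibility test has
    to cover the third segment as well, not only the first pair.

    #include <stdbool.h>

    struct demo_skb { bool has_mptcp_ext; bool zcopy_pure; };

    /* stand-in for the MPTCP-extension / zerocopy-purity constraints */
    static bool demo_can_collapse(const struct demo_skb *to,
                                  const struct demo_skb *from)
    {
            return to->has_mptcp_ext == from->has_mptcp_ext &&
                   to->zcopy_pure == from->zcopy_pure;
    }

    static bool demo_can_merge_three(const struct demo_skb *a,
                                     const struct demo_skb *b,
                                     const struct demo_skb *c)
    {
            /* the check against c is the one that was missing */
            return demo_can_collapse(a, b) && demo_can_collapse(a, c);
    }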
    
    Fixes: 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions")
    Fixes: 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:23:02 -04:00
Paolo Abeni e0b7624646 tcp: fix indefinite deferral of RTO with SACK reneging
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 3d2af9cce3133b3bc596a9d065c6f9d93419ccfb
Author: Neal Cardwell <ncardwell@google.com>
Date:   Fri Oct 21 17:08:21 2022 +0000

    tcp: fix indefinite deferral of RTO with SACK reneging

    This commit fixes a bug that can cause a TCP data sender to repeatedly
    defer RTOs when encountering SACK reneging.

    The bug is that when we're in fast recovery in a scenario with SACK
    reneging, every time we get an ACK we call tcp_check_sack_reneging()
    and it can note the apparent SACK reneging and rearm the RTO timer for
    srtt/2 into the future. In some SACK reneging scenarios that can
    happen repeatedly until the receive window fills up, at which point
    the sender can't send any more, the ACKs stop arriving, and the RTO
    fires at srtt/2 after the last ACK. But that can take far too long
    (O(10 secs)), since the connection is stuck in fast recovery with a
    low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
    available.

    This fix changes the logic in tcp_check_sack_reneging() to only rearm
    the RTO timer if data is cumulatively ACKed, indicating forward
    progress. This avoids this kind of nearly infinite loop of RTO timer
    re-arming. In addition, this meets the goals of
    tcp_check_sack_reneging() in handling Windows TCP behavior that looks
    temporarily like SACK reneging but is not really.
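
    A rough sketch of the changed condition (illustrative names): the RTO
    timer is pushed out on apparent SACK reneging only when the ACK also made
    forward progress, i.e. cumulatively acknowledged new data.

    #include <stdbool.h>
    #include <stdint.h>

    static bool demo_should_rearm_rto(bool looks_like_reneging,
                                      uint32_t prior_snd_una, uint32_t snd_una)
    {
            /* snd_una advanced => data was cumulatively ACKed */
            return looks_like_reneging && snd_una != prior_snd_una;
    }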

    Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
    and provided critical packet traces that enabled root-causing this
    issue. Also, many thanks to Jakub Kicinski for testing this fix.

    Fixes: 5ae344c949 ("tcp: reduce spurious retransmits due to transient SACK reneging")
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Reported-by: Neil Spring <ntspring@fb.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Tested-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:56:52 +02:00
Paolo Abeni a2ec85333c tcp: annotate data-race around challenge_timestamp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 8c70521238b7863c2af607e20bcba20f974c969b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Aug 30 11:56:55 2022 -0700

    tcp: annotate data-race around challenge_timestamp

    challenge_timestamp can be read and written by concurrent threads.

    This was expected, but we need to annotate the race to avoid potential issues.

    The following patch moves challenge_timestamp and challenge_count
    to per-netns storage to provide better isolation.

    Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:53:57 +02:00
Guillaume Nault e1b28db515 tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 79f55473bfc8ac51bd6572929a679eeb4da22251
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:03 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.

    While reading sysctl_tcp_comp_sack_nr, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
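
    A userspace analogue of the pattern (not the kernel diff): a tunable that
    another thread may rewrite at any time is loaded once into a local
    snapshot, which is what READ_ONCE() guarantees on the kernel side.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic int demo_comp_sack_nr = 44;  /* arbitrary example value,
                                                 * concurrently writable
                                                 */

    static bool demo_should_send_now(int compressed_ack)
    {
            int limit = atomic_load_explicit(&demo_comp_sack_nr,
                                             memory_order_relaxed);

            return compressed_ack >= limit;     /* decide on the snapshot */
    }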

    Fixes: 9c21d2fc41 ("tcp: add tcp_comp_sack_nr sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault ba3a173d9b tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 22396941a7f343d704738360f9ef0e6576489d43
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:02 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.

    While reading sysctl_tcp_comp_sack_slack_ns, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: a70437cc09 ("tcp: add hrtimer slack to sack compression")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 864f6cb56c tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 4866b2b0f7672b6d760c4b8ece6fb56f965dcc8a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:01 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.

    While reading sysctl_tcp_comp_sack_delay_ns, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 6d82aa2420 ("tcp: add tcp_comp_sack_delay_ns sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 744fd61abd tcp: Fix data-races around sk_pacing_rate.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 59bf6c65a09fff74215517aecffbbdcd67df76e3
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:21:59 2022 -0700

    tcp: Fix data-races around sk_pacing_rate.

    While reading sysctl_tcp_pacing_(ss|ca)_ratio, they can be changed
    concurrently.  Thus, we need to add READ_ONCE() to their readers.

    Fixes: 43e122b014 ("tcp: refine pacing rate determination")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 7ebd35903c tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 2afdbe7b8de84c28e219073a6661080e1b3ded48
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:26 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.

    While reading sysctl_tcp_invalid_ratelimit, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 032ee42369 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault df5278bbb0 tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 1330ffacd05fc9ac4159d19286ce119e22450ed2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:24 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.

    While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 1083d63446 tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit db3815a2fa691da145cfbe834584f31ad75df9ff
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:21 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.

    While reading sysctl_tcp_challenge_ack_limit, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 672780951d tcp: Fix a data-race around sysctl_tcp_frto.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 706c6202a3589f290e1ef9be0584a8f4a3cc0507
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:15 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_frto.

    While reading sysctl_tcp_frto, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault 7eb9778613 tcp: Fix a data-race around sysctl_tcp_app_win.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 02ca527ac5581cf56749db9fd03d854e842253dd
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:13 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_app_win.

    While reading sysctl_tcp_app_win, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault cbc7d0bf3d tcp: Fix data-races around sysctl_tcp_dsack.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 58ebb1c8b35a8ef38cd6927431e0fa7b173a632d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:12 2022 -0700

    tcp: Fix data-races around sysctl_tcp_dsack.

    While reading sysctl_tcp_dsack, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00