Centos-kernel-stream-9

Author	SHA1	Message	Date
Sabrina Dubroca	1eacc871f3	tcp: drop secpath at the same time as we currently drop dst JIRA: https://issues.redhat.com/browse/RHEL-69649 JIRA: https://issues.redhat.com/browse/RHEL-83224 CVE: CVE-2025-21864 commit 9b6412e6979f6f9e0632075f8f008937b5cd4efd Author: Sabrina Dubroca <sd@queasysnail.net> Date: Mon Feb 17 11:23:35 2025 +0100 tcp: drop secpath at the same time as we currently drop dst Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while running tests that boil down to: - create a pair of netns - run a basic TCP test over ipcomp6 - delete the pair of netns The xfrm_state found on spi_byaddr was not deleted at the time we delete the netns, because we still have a reference on it. This lingering reference comes from a secpath (which holds a ref on the xfrm_state), which is still attached to an skb. This skb is not leaked, it ends up on sk_receive_queue and then gets defer-free'd by skb_attempt_defer_free. The problem happens when we defer freeing an skb (push it on one CPU's defer_list), and don't flush that list before the netns is deleted. In that case, we still have a reference on the xfrm_state that we don't expect at this point. We already drop the skb's dst in the TCP receive path when it's no longer needed, so let's also drop the secpath. At this point, tcp_filter has already called into the LSM hooks that may require the secpath, so it should not be needed anymore. However, in some of those places, the MPTCP extension has just been attached to the skb, so we cannot simply drop all extensions. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>	2025-03-20 10:11:45 +01:00
Toke Høiland-Jørgensen	e9b6f5c14c	bpf: Add bpf_sock_destroy kfunc JIRA: https://issues.redhat.com/browse/RHEL-65787 Conflicts: Context difference due to missing af9784d007d8 ("tcp: diag: add support for TIME_WAIT sockets to tcp_abort()") and out-of-order backport of bac76cf89816 ("tcp: fix forever orphan socket caused by tcp_abort") commit 4ddbcb886268af8d12a23e6640b39d1d9c652b1b Author: Aditi Ghag <aditi.ghag@isovalent.com> Date: Fri May 19 22:51:55 2023 +0000 bpf: Add bpf_sock_destroy kfunc The socket destroy kfunc is used to forcefully terminate sockets from certain BPF contexts. We plan to use the capability in Cilium load-balancing to terminate client sockets that continue to connect to deleted backends. The other use case is on-the-fly policy enforcement where existing socket connections prevented by policies need to be forcefully terminated. The kfunc also allows terminating sockets that may or may not be actively sending traffic. The kfunc can currently be called only from BPF TCP and UDP iterators where users can filter, and terminate selected sockets. More specifically, it can only be called from BPF contexts that ensure socket locking in order to allow synchronous execution of protocol specific `diag_destroy` handlers. The previous commit that batches UDP sockets during iteration facilitated a synchronous invocation of the UDP destroy callback from BPF context by skipping socket locks in `udp_abort`. TCP iterator already supported batching of sockets being iterated. To that end, `tracing_iter_filter` callback filter is added so that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER` attach type, and reject other programs. The kfunc takes `sock_common` type argument, even though it expects, and casts them to a `sock` pointer. This enables the verifier to allow the sock_destroy kfunc to be called for TCP with `sock_common` and UDP with `sock` structs. 
Furthermore, as `sock_common` only has a subset of certain fields of `sock`, casting pointer to the latter type might not always be safe for certain sockets like request sockets, but these have a special handling in the diag_destroy handlers. Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer. eg. getting a sk pointer (may be even NULL) by following another sk pointer. The pointer socket argument passed in TCP and UDP iterators is tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes are contributed by Martin KaFai Lau <martin.lau@kernel.org>. Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>	2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen	3dd3f94935	bpf: tcp: Avoid taking fast sock lock in iterator JIRA: https://issues.redhat.com/browse/RHEL-65787 commit 9378096e8a656fb5c4099b26b1370c56f056eab9 Author: Aditi Ghag <aditi.ghag@isovalent.com> Date: Fri May 19 22:51:49 2023 +0000 bpf: tcp: Avoid taking fast sock lock in iterator This is a preparatory commit to replace `lock_sock_fast` with `lock_sock`,and facilitate BPF programs executed from the TCP sockets iterator to be able to destroy TCP sockets using the bpf_sock_destroy kfunc (implemented in follow-up commits). Previously, BPF TCP iterator was acquiring the sock lock with BH disabled. This led to scenarios where the sockets hash table bucket lock can be acquired with BH enabled in some path versus disabled in other. In such situation, kernel issued a warning since it thinks that in the BH enabled path the same bucket lock might be acquired again in the softirq context (BH disabled), which will lead to a potential dead lock. Since bpf_sock_destroy also happens in a process context, the potential deadlock warning is likely a false alarm. 
Here is a snippet of annotated stack trace that motivated this change: ``` Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&h->lhash2[i].lock); local_bh_disable(); lock(&h->lhash2[i].lock); kernel imagined possible scenario: local_bh_disable(); /* Possible softirq */ lock(&h->lhash2[i].lock); *** Potential Deadlock *** process context: lock_acquire+0xcd/0x330 _raw_spin_lock+0x33/0x40 ------> Acquire (bucket) lhash2.lock with BH enabled __inet_hash+0x4b/0x210 inet_csk_listen_start+0xe6/0x100 inet_listen+0x95/0x1d0 __sys_listen+0x69/0xb0 __x64_sys_listen+0x14/0x20 do_syscall_64+0x3c/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc bpf_sock_destroy run from iterator: lock_acquire+0xcd/0x330 _raw_spin_lock+0x33/0x40 ------> Acquire (bucket) lhash2.lock with BH disabled inet_unhash+0x9a/0x110 tcp_set_state+0x6a/0x210 tcp_abort+0x10d/0x200 bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9 bpf_iter_run_prog+0x1ff/0x340 ------> lock_sock_fast that acquires sock lock with BH disabled bpf_iter_tcp_seq_show+0xca/0x190 bpf_seq_read+0x177/0x450 ``` Also, Yonghong reported a deadlock for non-listening TCP sockets that this change resolves. Previously, `lock_sock_fast` held the sock spin lock with BH which was again being acquired in `tcp_abort`: ``` watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331] RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500 Call Trace: <TASK> _raw_spin_lock+0x84/0x90 tcp_abort+0x13c/0x1f0 bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89 bpf_iter_run_prog+0x1aa/0x2c0 ? preempt_count_sub+0x1c/0xd0 ? from_kuid_munged+0x1c8/0x210 bpf_iter_tcp_seq_show+0x14e/0x1b0 bpf_seq_read+0x36c/0x6a0 bpf_iter_tcp_seq_show lock_sock_fast __lock_sock_fast spin_lock_bh(&sk->sk_lock.slock); /* * Fast path return with bottom halves disabled and * sock::sk_lock.slock held.* */ ... 
tcp_abort local_bh_disable(); spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk) ``` With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning: ``` lock_sock lock_sock_nested spin_lock_bh(&sk->sk_lock.slock); : spin_unlock_bh(&sk->sk_lock.slock); ``` Acked-by: Yonghong Song <yhs@meta.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20230519225157.760788-2-aditi.ghag@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>	2025-01-28 12:51:53 +01:00
Rado Vrbovsky	81ce48e690	Merge: mptcp: phase-1 backports for RHEL-9.6 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449 JIRA: https://issues.redhat.com/browse/RHEL-62871 JIRA: https://issues.redhat.com/browse/RHEL-58839 JIRA: https://issues.redhat.com/browse/RHEL-66083 JIRA: https://issues.redhat.com/browse/RHEL-66074 CVE: CVE-2024-46711 CVE: CVE-2024-45009 CVE: CVE-2024-45010 Upstream Status: All mainline in net.git Tested: kselftest Conflicts: see individual patches Signed-off-by: Davide Caratti <dcaratti@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Approved-by: Paolo Abeni <pabeni@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-11-22 09:18:31 +00:00
Davide Caratti	7bb06cd72e	tcp: annotate data-races around tp->tsoffset JIRA: https://issues.redhat.com/browse/RHEL-62871 Upstream Status: net.git commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948 Conflicts: - net/ipv4/tcp_ipv4.c: keep using sock_net(sk) as we don't have upstream commit 08eaef904031 ("tcp: Clean up some functions.") commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948 Author: Eric Dumazet <edumazet@google.com> Date: Wed Jul 19 21:28:48 2023 +0000 tcp: annotate data-races around tp->tsoffset do_tcp_getsockopt() reads tp->tsoffset while another cpu might change its value. Fixes: `93be6ce0e9` ("tcp: set and get per-socket timestamp") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2024-11-12 10:18:58 +01:00
Paolo Abeni	7e492a042f	tcp: avoid premature drops in tcp_add_backlog() JIRA: https://issues.redhat.com/browse/RHEL-62865 Tested: LNST, Tier1 Upstream commit: commit ec00ed472bdb7d0af840da68c8c11bff9f4d9caa Author: Eric Dumazet <edumazet@google.com> Date: Tue Apr 23 12:56:20 2024 +0000 tcp: avoid premature drops in tcp_add_backlog() While testing TCP performance with latest trees, I saw suspect SOCKET_BACKLOG drops. tcp_add_backlog() computes its limit with : limit = (u32)READ_ONCE(sk->sk_rcvbuf) + (u32)(READ_ONCE(sk->sk_sndbuf) >> 1); limit += 64 * 1024; This does not take into account that sk->sk_backlog.len is reset only at the very end of __release_sock(). Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach sk_rcvbuf in normal conditions. We should double sk->sk_rcvbuf contribution in the formula to absorb bubbles in the backlog, which happen more often for very fast flows. This change maintains decent protection against abuses. Fixes: `c377411f24` ("net: sk_add_backlog() take rmem_alloc into account") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-10-16 19:09:13 +02:00
Paolo Abeni	fdad6e7a51	tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field JIRA: https://issues.redhat.com/browse/RHEL-62865 Tested: LNST, Tier1 Conflicts: different context in tcp_conn_request(), as rhel-9 lacks the TCP AO support. Upstream commit: commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028 Author: Eric Dumazet <edumazet@google.com> Date: Sun Apr 7 09:33:22 2024 +0000 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field TCP can transform a TIMEWAIT socket into a SYN_RECV one from a SYN packet, and the ISN of the SYNACK packet is normally generated using TIMEWAIT tw_snd_nxt : tcp_timewait_state_process() ... u32 isn = tcptw->tw_snd_nxt + 65535 + 2; if (isn == 0) isn++; TCP_SKB_CB(skb)->tcp_tw_isn = isn; return TCP_TW_SYN; This SYN packet also bypasses normal checks against listen queue being full or not. tcp_conn_request() ... __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn; ... /* TW buckets are converted to open requests without * limitations, they conserve resources and peer is * evidently real one. */ if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) { want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name); if (!want_cookie) goto drop; } This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb. Unfortunately this field has been accidentally cleared after the call to tcp_timewait_state_process() returning TCP_TW_SYN. Using a field in TCP_SKB_CB(skb) for a temporary state is overkill. Switch instead to a per-cpu variable. As a bonus, we do not have to clear tcp_tw_isn in TCP receive fast path. It is temporarily set then cleared only in the TCP_TW_SYN dance. Fixes: `4ad19de877` ("net: tcp6: fix double call of tcp_v6_fill_cb()") Fixes: `eeea10b83a` ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-10-16 19:08:41 +02:00
Paolo Abeni	4cd846284a	tcp: propagate tcp_tw_isn via an extra parameter to ->route_req() JIRA: https://issues.redhat.com/browse/RHEL-62865 Tested: LNST, Tier1 Upstream commit: commit b9e810405880c99baafd550ada7043e86465396e Author: Eric Dumazet <edumazet@google.com> Date: Sun Apr 7 09:33:21 2024 +0000 tcp: propagate tcp_tw_isn via an extra parameter to ->route_req() tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find out if the request socket is created by a SYN hitting a TIMEWAIT socket. This has been buggy for a decade, lets directly pass the information from tcp_conn_request(). This is a preparatory patch to make the following one easier to review. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-10-16 19:07:53 +02:00
Florian Westphal	e68e0d9b40	net: tcp/dccp: prepare for tw_timer un-pinning JIRA: https://issues.redhat.com/browse/RHEL-9279 Upstream Status: commit b334b924c9b7 Conflicts: net/ipv4/tcp_minisocks.c We lack a "struct net *net" in this function, earlier conflict fixup used sock_net(). Resolve this by keeping sock_net() usage as-is. commit b334b924c9b709bc969644fb5c406f5c9d01dceb Author: Valentin Schneider <vschneid@redhat.com> Date: Thu Jun 6 17:11:37 2024 +0200 net: tcp/dccp: prepare for tw_timer un-pinning The TCP timewait timer is proving to be problematic for setups where scheduler CPU isolation is achieved at runtime via cpusets (as opposed to statically via isolcpus=domains). What happens there is a CPU goes through tcp_time_wait(), arming the time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing interference for the now-isolated CPU. This is conceptually similar to the issue described in commit e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash lock held. Expand the lock's critical section from inet_twsk_kill() to inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of the timer. IOW, this prevents the following race: tcp_time_wait() inet_twsk_hashdance() inet_twsk_deschedule_put() del_timer_sync() inet_twsk_schedule() Thanks to Paolo Abeni for suggesting to leverage the ehash lock. This also restores a comment from commit `ec94c2696f` ("tcp/dccp: avoid one atomic operation for timewait hashdance") as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing. inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize with inet_twsk_hashdance_schedule(). To ease possible regression search, actual un-pin is done in next patch. 
Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/ Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Florian Westphal <fwestpha@redhat.com>	2024-08-21 16:56:29 +02:00
Florian Westphal	23f780623d	tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp JIRA: https://issues.redhat.com/browse/RHEL-9279 Upstream Status: commit 69e0b33a7fce CS9 lacks both support for TCP Authentication option and usec resolution for TCP timestamps. Both features are out of scope, so do needed context fixups. This change was added to reduce conflicts in the followup patch. commit 69e0b33a7fce4d96649b9fa32e56b696921aa48e Author: Eric Dumazet <edumazet@google.com> Date: Mon Jun 3 15:51:06 2024 +0000 tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp These fields can be read and written locklessly, add annotations around these minor races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Conflicts: net/ipv4/tcp_ipv4.c net/ipv6/tcp_ipv6.c Signed-off-by: Florian Westphal <fwestpha@redhat.com>	2024-08-21 16:55:25 +02:00
Antoine Tenart	3a0f9f0ce0	tcp: use sk_skb_reason_drop to free rx packets JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: net-next.git commit 46a02aa357529d7b038096955976b14f7c44aa23 Author: Yan Zhai <yan@cloudflare.com> Date: Mon Jun 17 11:09:20 2024 -0700 tcp: use sk_skb_reason_drop to free rx packets Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:42 +02:00
Antoine Tenart	0bc1f777a4	tcp: rstreason: handle timewait cases in the receive path JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git commit 22a32557758a7100e46dfa8f383a401125e60b16 Author: Jason Xing <kernelxing@tencent.com> Date: Fri May 10 20:25:01 2024 +0800 tcp: rstreason: handle timewait cases in the receive path There are two possible cases where TCP layer can send an RST. Since they happen in the same place, I think using one independent reason is enough to identify this special situation. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:42 +02:00
Antoine Tenart	51c78f9a4a	rstreason: make it work in trace world JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git commit b533fb9cf4f7c6ca2aa255a5a1fdcde49fff2b24 Author: Jason Xing <kernelxing@tencent.com> Date: Thu Apr 25 11:13:40 2024 +0800 rstreason: make it work in trace world At last, we should let it work by introducing this reset reason in trace world. One of the possible expected outputs is: ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx state=TCP_ESTABLISHED reason=NOT_SPECIFIED Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:41 +02:00
Antoine Tenart	8ea5cff87d	tcp: support rstreason for passive reset JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git commit 120391ef9ca8fe8f82ea3f2961ad802043468226 Author: Jason Xing <kernelxing@tencent.com> Date: Thu Apr 25 11:13:37 2024 +0800 tcp: support rstreason for passive reset Reuse the dropreason logic to show the exact reason of tcp reset, so we can finally display the corresponding item in enum sk_reset_reason instead of reinventing new reset reasons. This patch replaces all the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:41 +02:00
Antoine Tenart	25344d90dd	rstreason: prepare for passive reset JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git Conflicts: - Context differences due to missing upstream commits ba7783ad45c8 ("net/tcp: Add AO sign to RST packets") and d5dfbfa2f88e ("mptcp: drop duplicate header inclusions") in c9s. commit 6be49deaa09576c141002a2e6f816a1709bc2c86 Author: Jason Xing <kernelxing@tencent.com> Date: Thu Apr 25 11:13:35 2024 +0800 rstreason: prepare for passive reset Adjust the parameter and support passing reason of reset which is for now NOT_SPECIFIED. No functional changes. Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:41 +02:00
Antoine Tenart	528623fc31	trace: tcp: fully support trace_tcp_send_reset JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git Conflicts: - Context differences due to missing upstream commits ba7783ad45c8 ("net/tcp: Add AO sign to RST packets") and 3cccda8db2cf ("ipv6: move np->repflow to atomic flags") in c9s. commit 19822a980e1956a6572998887a7df5a0607a32f6 Author: Jason Xing <kernelxing@tencent.com> Date: Mon Apr 1 15:36:05 2024 +0800 trace: tcp: fully support trace_tcp_send_reset Prior to this patch, what we can see by enabling trace_tcp_send is only happening under two circumstances: 1) active rst mode 2) non-active rst mode and based on the full socket That means the inconsistency occurs if we use tcpdump and trace simultaneously to see how rst happens. It's necessary that we should take into other cases into considerations, say: 1) time-wait socket 2) no socket ... By parsing the incoming skb and reversing its 4-tuple can we know the exact 'flow' which might not exist. Samples after applied this patch: 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port state=TCP_ESTABLISHED 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port state=UNKNOWN Note: 1) UNKNOWN means we cannot extract the right information from skb. 2) skbaddr/skaddr could be 0 Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240401073605.37335-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:41 +02:00
Antoine Tenart	8e320d89a7	tcp: make dropreason in tcp_child_process() work JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git commit ee01defe25bad09a37b68dd051a7e931d1e4cd91 Author: Jason Xing <kernelxing@tencent.com> Date: Mon Feb 26 11:22:27 2024 +0800 tcp: make dropreason in tcp_child_process() work It's time to let it work right now. We've already prepared for this:) Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:40 +02:00
Antoine Tenart	8f346a11e7	tcp: make the dropreason really work when calling tcp_rcv_state_process() JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git commit b9825695930546af725b1e686b8eaf4c71201728 Author: Jason Xing <kernelxing@tencent.com> Date: Mon Feb 26 11:22:26 2024 +0800 tcp: make the dropreason really work when calling tcp_rcv_state_process() Update three callers including both ipv4 and ipv6 and let the dropreason mechanism work in reality. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:40 +02:00
Antoine Tenart	191443751b	tcp: directly drop skb in cookie check for ipv4 JIRA: https://issues.redhat.com/browse/RHEL-48648 Upstream Status: linux.git Conflicts: - Context difference due to missing upstream commit 8e7bab6b9652 ("tcp: Factorise cookie-dependent fields initialisation in cookie_v[46]_check()") in c9s. commit 65be4393f363c4bd5c388ddf3e3eb4abee2b1f79 Author: Jason Xing <kernelxing@tencent.com> Date: Mon Feb 26 11:22:19 2024 +0800 tcp: directly drop skb in cookie check for ipv4 Only move the skb drop from tcp_v4_do_rcv() to cookie_v4_check() itself, no other changes made. It can help us refine the specific drop reasons later. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2024-07-16 17:29:40 +02:00
Guillaume Nault	b445907e34	tcp: Use refcount_inc_not_zero() in tcp_twsk_unique(). JIRA: https://issues.redhat.com/browse/RHEL-39837 Upstream Status: linux.git CVE: CVE-2024-36904 commit f2db7230f73a80dbb179deab78f88a7947f0ab7e Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Wed May 1 14:31:45 2024 -0700 tcp: Use refcount_inc_not_zero() in tcp_twsk_unique(). Anderson Nascimento reported a use-after-free splat in tcp_twsk_unique() with nice analysis. Since commit `ec94c2696f` ("tcp/dccp: avoid one atomic operation for timewait hashdance"), inet_twsk_hashdance() sets TIME-WAIT socket's sk_refcnt after putting it into ehash and releasing the bucket lock. Thus, there is a small race window where other threads could try to reuse the port during connect() and call sock_hold() in tcp_twsk_unique() for the TIME-WAIT socket with zero refcnt. If that happens, the refcnt taken by tcp_twsk_unique() is overwritten and sock_put() will cause underflow, triggering a real use-after-free somewhere else. To avoid the use-after-free, we need to use refcount_inc_not_zero() in tcp_twsk_unique() and give up on reusing the port if it returns false. [0]: refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1039313 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110 CPU: 0 PID: 1039313 Comm: trigger Not tainted 6.8.6-200.fc39.x86_64 #1 Hardware name: VMware, Inc. 
VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023 RIP: 0010:refcount_warn_saturate+0xe5/0x110 Code: 42 8e ff 0f 0b c3 cc cc cc cc 80 3d aa 13 ea 01 00 0f 85 5e ff ff ff 48 c7 c7 f8 8e b7 82 c6 05 96 13 ea 01 01 e8 7b 42 8e ff <0f> 0b c3 cc cc cc cc 48 c7 c7 50 8f b7 82 c6 05 7a 13 ea 01 01 e8 RSP: 0018:ffffc90006b43b60 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff888009bb3ef0 RCX: 0000000000000027 RDX: ffff88807be218c8 RSI: 0000000000000001 RDI: ffff88807be218c0 RBP: 0000000000069d70 R08: 0000000000000000 R09: ffffc90006b439f0 R10: ffffc90006b439e8 R11: 0000000000000003 R12: ffff8880029ede84 R13: 0000000000004e20 R14: ffffffff84356dc0 R15: ffff888009bb3ef0 FS: 00007f62c10926c0(0000) GS:ffff88807be00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020ccb000 CR3: 000000004628c005 CR4: 0000000000f70ef0 PKRU: 55555554 Call Trace: <TASK> ? refcount_warn_saturate+0xe5/0x110 ? __warn+0x81/0x130 ? refcount_warn_saturate+0xe5/0x110 ? report_bug+0x171/0x1a0 ? refcount_warn_saturate+0xe5/0x110 ? handle_bug+0x3c/0x80 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? refcount_warn_saturate+0xe5/0x110 tcp_twsk_unique+0x186/0x190 __inet_check_established+0x176/0x2d0 __inet_hash_connect+0x74/0x7d0 ? 
__pfx___inet_check_established+0x10/0x10 tcp_v4_connect+0x278/0x530 __inet_stream_connect+0x10f/0x3d0 inet_stream_connect+0x3a/0x60 __sys_connect+0xa8/0xd0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x83/0x170 entry_SYSCALL_64_after_hwframe+0x78/0x80 RIP: 0033:0x7f62c11a885d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a3 45 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007f62c1091e58 EFLAGS: 00000296 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000020ccb004 RCX: 00007f62c11a885d RDX: 0000000000000010 RSI: 0000000020ccb000 RDI: 0000000000000003 RBP: 00007f62c1091e90 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000296 R12: 00007f62c10926c0 R13: ffffffffffffff88 R14: 0000000000000000 R15: 00007ffe237885b0 </TASK> Fixes: `ec94c2696f` ("tcp/dccp: avoid one atomic operation for timewait hashdance") Reported-by: Anderson Nascimento <anderson@allelesecurity.com> Closes: https://lore.kernel.org/netdev/37a477a6-d39e-486b-9577-3463f655a6b7@allelesecurity.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240501213145.62261-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Guillaume Nault <gnault@redhat.com>	2024-06-07 20:27:04 +02:00
Sabrina Dubroca	7bc5eeb384	net: skbuff: generalize the skb->decrypted bit JIRA: https://issues.redhat.com/browse/RHEL-29306 commit 9f06f87fef689d28588cde8c7ebb00a67da34026 Author: Jakub Kicinski <kuba@kernel.org> Date: Wed Apr 3 13:21:39 2024 -0700 net: skbuff: generalize the skb->decrypted bit The ->decrypted bit can be reused for other crypto protocols. Remove the direct dependency on TLS, add helpers to clean up the ifdefs leaking out everywhere. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>	2024-05-01 17:48:16 +02:00
Paolo Abeni	026591e567	tcp: check mptcp-level constraints for backlog coalescing JIRA: https://issues.redhat.com/browse/RHEL-21432 Tested: LNST, Tier1 Upstream commit: commit 6db8a37dfc541e059851652cfd4f0bb13b8ff6af Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Oct 18 11:23:53 2023 -0700 tcp: check mptcp-level constraints for backlog coalescing The MPTCP protocol can acquire the subflow-level socket lock and cause the tcp backlog usage. When inserting new skbs into the backlog, the stack will try to coalesce them. Currently, we have no check in place to ensure that such coalescing will respect the MPTCP-level DSS, and that may cause data stream corruption, as reported by Christoph. Address the issue by adding the relevant admission check for coalescing in tcp_add_backlog(). Note the issue is not easy to reproduce, as the MPTCP protocol tries hard to avoid acquiring the subflow-level socket lock. Fixes: `648ef4b886` ("mptcp: Implement MPTCP receive path") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/420 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-2-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-01-12 13:41:50 +01:00
Felix Maurer	130ad87ddc	tcp: enforce receive buffer memory limits by allowing the tcp window to shrink JIRA: https://issues.redhat.com/browse/RHEL-11592 Conflicts: - net/ipv4/sysctl_net_ipv4.c: context difference due to missing new sysctls - net/ipv4/tcp_ipv4.c: context difference due to missing ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear") and 37ba017dcc3b ("ipv4/tcp: do not use per netns ctl sockets") commit b650d953cd391595e536153ce30b4aab385643ac Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com> Date: Sun Jun 11 22:05:24 2023 -0500 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP. To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2]. As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode. The fundamental issue is the relationship between the granularity of the window scaling factor and the number of byte ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1]. The Linux kernel currently adheres to never shrinking the window. In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached when no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait. In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window. But this problem does show up for other types of flows. Examples are websockets and other type of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1]. RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5]. This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit by autotuning (sk_rcvbuf). This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window Additional information can be found at: https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/ [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 [3] https://www.rfc-editor.org/rfc/rfc1122#page-91 [4] https://www.rfc-editor.org/rfc/rfc793 [5] https://www.rfc-editor.org/rfc/rfc1323 Signed-off-by: Mike Freemon <mfreemon@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2023-10-31 16:20:06 +01:00
Jan Stancek	5a0d19aa9d	Merge: net: improve skb hash stability when net.core.txrehash=0 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2694 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966 As a side effect this also improves stability for IPv6 autoflowlabel. Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Approved-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-07-04 11:15:02 +02:00
Jan Stancek	e341c7e709	Merge: bpf, xdp: update to 6.3 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583 Rebase bpf and xdp to 6.3. Bugzilla: https://bugzilla.redhat.com/2178930 Signed-off-by: Viktor Malik <vmalik@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Jason Wang <jasowang@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Jan Stancek <jstancek@redhat.com> Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-06-28 07:52:45 +02:00
Antoine Tenart	1cfc972fac	net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966 Upstream Status: net-next.git Conflicts:\ - Context difference due to missing upstream commit e22aa1486668 ("net: Find dst with sk's xfrm policy not ctl_sk") in c9s. commit c0a8966e2bc7d31f77a7246947ebc09c1ff06066 Author: Antoine Tenart <atenart@kernel.org> Date: Tue May 23 18:14:52 2023 +0200 net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV When using IPv4/TCP, skb->hash comes from sk->sk_txhash except in TIME_WAIT and SYN_RECV where it's not set in the reply skb from ip_send_unicast_reply. Those packets will have a mismatched hash with others from the same flow as their hashes will be 0. IPv6 does not have the same issue as the hash is set from the socket txhash in those cases. This commits sets the hash in the reply skb from ip_send_unicast_reply, which makes the IPv4 code behaving like IPv6. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-16 10:55:36 +02:00
Antoine Tenart	6dd8976945	tcp: fix possible sk_priority leak in tcp_v4_send_reset() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966 Upstream Status: linux.git Conflicts:\ - Context differences due to missing upstream commits e22aa1486668 ("net: Find dst with sk's xfrm policy not ctl_sk") and 37ba017dcc3b ("ipv4/tcp: do not use per netns ctl sockets") in c9s. commit 1e306ec49a1f206fd2cc89a42fac6e6f592a8cc1 Author: Eric Dumazet <edumazet@google.com> Date: Thu May 11 11:47:49 2023 +0000 tcp: fix possible sk_priority leak in tcp_v4_send_reset() When tcp_v4_send_reset() is called with @sk == NULL, we do not change ctl_sk->sk_priority, which could have been set from a prior invocation. Change tcp_v4_send_reset() to set sk_priority and sk_mark fields before calling ip_send_unicast_reply(). This means tcp_v4_send_reset() and tcp_v4_send_ack() no longer have to clear ctl_sk->sk_mark after their call to ip_send_unicast_reply(). Fixes: `f6c0f5d209` ("tcp: honor SO_PRIORITY in TIME_WAIT state") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-16 10:55:24 +02:00
Felix Maurer	eed4a49571	bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcp Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930 commit 580031ff9952b7dbf48dedba6b56a100ae002bef Author: Martin KaFai Lau <martin.lau@kernel.org> Date: Mon Mar 27 17:42:32 2023 -0700 bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcp While reviewing the udp-iter batching patches, noticed the bpf_iter_tcp calling sock_put() is incorrect. It should call sock_gen_put instead because bpf_iter_tcp is iterating the ehash table which has the req sk and tw sk. This patch replaces all sock_put with sock_gen_put in the bpf_iter_tcp codepath. Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230328004232.2134233-1-martin.lau@linux.dev Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2023-06-14 10:44:30 +02:00
Antoine Tenart	30b200a890	tcp: add TCP_MINTTL drop reason Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073 Upstream Status: linux.git Conflicts:\ - Context difference due to missing upstream commits 020e71a3cf7f ("ipv4: guard IP_MINTTL with a static key") and 14834c4f4eb3 ("ipv4: annotate data races arount inet->min_ttl") in c9s. commit 2798e36dc233a409a5d3f26f73029596dc504020 Author: Eric Dumazet <edumazet@google.com> Date: Wed Feb 1 17:43:45 2023 +0000 tcp: add TCP_MINTTL drop reason In the unlikely case incoming packets are dropped because of IP_MINTTL / IPV6_MINHOPCOUNT constraints... Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230201174345.2708943-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-06 11:23:15 +02:00
Paolo Abeni	220a990332	dccp/tcp: Reset saddr on failure after inet6?_hash_connect(). Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561 Tested: LNST, Tier1 Upstream commit: commit 77934dc6db0d2b111a8f2759e9ad2fb67f5cffa5 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Fri Nov 18 17:49:11 2022 -0800 dccp/tcp: Reset saddr on failure after inet6?_hash_connect(). When connect() is called on a socket bound to the wildcard address, we change the socket's saddr to a local address. If the socket fails to connect() to the destination, we have to reset the saddr. However, when an error occurs after inet_hash6?_connect() in (dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave the socket bound to the address. From the user's point of view, whether saddr is reset or not varies with errno. Let's fix this inconsistent behaviour. Note that after this patch, the repro [0] will trigger the WARN_ON() in inet_csk_get_port() again, but this patch is not buggy and rather fixes a bug papering over the bhash2's bug for which we need another fix. For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect() by this sequence: s1 = socket() s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s1.bind(('127.0.0.1', 10000)) s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000))) # or s1.connect(('127.0.0.1', 10000)) s2 = socket() s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s2.bind(('0.0.0.0', 10000)) s2.connect(('127.0.0.1', 10000)) # -EADDRNOTAVAIL s2.listen(32) # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2); [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09 Fixes: `3df80d9320` ("[DCCP]: Introduce DCCPv6") Fixes: `7c657876b6` ("[DCCP]: Initial implementation") Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-04-21 09:57:10 +02:00
Paolo Abeni	059dc63005	tcp: fix a signed-integer-overflow bug in tcp_add_backlog() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561 Tested: LNST, Tier1 Upstream commit: commit ec791d8149ff60c40ad2074af3b92a39c916a03f Author: Lu Wei <luwei32@huawei.com> Date: Fri Oct 21 12:06:22 2022 +0800 tcp: fix a signed-integer-overflow bug in tcp_add_backlog() The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and in tcp_add_backlog(), the variable limit is calculated by adding sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value of int and overflow. This patch reduces the limit budget by halving the sndbuf to solve this issue since ACK packets are much smaller than the payload. Fixes: `c9c3321257` ("tcp: add tcp_add_backlog()") Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-04-21 09:56:46 +02:00
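The overflow described in the tcp_add_backlog() entry above can be illustrated with a small userspace sketch. It is not the kernel code: `struct demo_sock` and the `demo_*` helpers are hypothetical names, and doing the reduced budget in unsigned 32-bit arithmetic is an assumption about one reasonable way to apply the "halve the sndbuf" idea from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the two int fields of struct sock named in the commit. */
struct demo_sock {
	int sk_rcvbuf;
	int sk_sndbuf;
};

/*
 * The original budget: rcvbuf + sndbuf + 64 KiB. It is computed in 64 bits
 * here only so the true sum can be inspected without signed overflow.
 */
static int64_t demo_limit_full(const struct demo_sock *sk)
{
	return (int64_t)sk->sk_rcvbuf + (int64_t)sk->sk_sndbuf + 64 * 1024;
}

/* Returns 1 when the value can be stored in an int without overflow. */
static int demo_fits_in_int(int64_t value)
{
	return value >= INT_MIN && value <= INT_MAX;
}

/*
 * Reduced budget (assumed form): unsigned arithmetic and only half of sndbuf.
 * With both buffers at INT_MAX the result still fits in 32 unsigned bits.
 */
static uint32_t demo_limit_halved(const struct demo_sock *sk)
{
	uint32_t limit = (uint32_t)sk->sk_rcvbuf + ((uint32_t)sk->sk_sndbuf >> 1);

	limit += 64 * 1024;
	return limit;
}
```

With both buffer sizes set to `INT_MAX`, `demo_limit_full()` exceeds what an `int` can hold, while `demo_limit_halved()` stays representable.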
Paolo Abeni	8dda5cd012	tcp: minor optimization in tcp_add_backlog() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561 Tested: LNST, Tier1 Upstream commit: commit d519f350967a60b85a574ad8aeac43f2b4384746 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 15 11:02:30 2021 -0800 tcp: minor optimization in tcp_add_backlog() If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values are not used. Defer their access to the point we need them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-04-21 09:56:38 +02:00
Guillaume Nault	04b96d8fcf	tcp: Fix data-races around sysctl_tcp_reflect_tos. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073 Upstream Status: linux.git commit 870e3a634b6a6cb1543b359007aca73fe6a03ac5 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Fri Jul 22 11:22:04 2022 -0700 tcp: Fix data-races around sysctl_tcp_reflect_tos. While reading sysctl_tcp_reflect_tos, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: `ac8f1710c1` ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Guillaume Nault <gnault@redhat.com>	2023-01-17 12:25:15 +01:00
Guillaume Nault	8114c29f71	tcp: Fix a data-race around sysctl_tcp_tw_reuse. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949 Upstream Status: linux.git commit cbfc6495586a3f09f6f07d9fb3c7cafe807e3c55 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Fri Jul 15 10:17:52 2022 -0700 tcp: Fix a data-race around sysctl_tcp_tw_reuse. While reading sysctl_tcp_tw_reuse, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Guillaume Nault <gnault@redhat.com>	2022-12-22 11:37:58 +01:00
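The two sysctl data-race entries above apply the same pattern: read the tunable exactly once into a local, then make every decision from that snapshot. The sketch below is a userspace approximation only. `DEMO_READ_ONCE` is a simplified stand-in for the kernel's READ_ONCE() (it relies on the GNU C `__typeof__` extension), and `demo_sysctl_tcp_tw_reuse` and `demo_tw_reuse_allowed()` are hypothetical names; the meaning of the value 2 (loopback only) follows the documented tcp_tw_reuse semantics.

```c
#include <assert.h>

/*
 * Simplified userspace stand-in for READ_ONCE(): the volatile access forces
 * the compiler to emit a single load and not re-read the variable later.
 */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical tunable that another thread may change at any time. */
static int demo_sysctl_tcp_tw_reuse = 2;

/*
 * Decide from one snapshot of the tunable. Reading the variable twice
 * (once per comparison) could observe two different values if it is
 * updated concurrently.
 */
static int demo_tw_reuse_allowed(int is_loopback)
{
	int reuse = DEMO_READ_ONCE(demo_sysctl_tcp_tw_reuse);

	if (reuse == 2)
		return is_loopback;
	return reuse != 0;
}
```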
Frantisek Hrbata	e265d68e77	Merge: tcp: phase-1 backports for RHEL-9.2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1504 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491 Upstream Status: All mainline in net-next.git. Tested: boot-tested only Conflicts: see individual patches Signed-off-by: Davide Caratti <dcaratti@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-14 02:40:21 -05:00
Davide Caratti	728983215c	tcp: Access &tcp_hashinfo via net. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858 Upstream Status: net.git commit 4461568aa4e5 Conflicts: - net/ipv4/tcp_ipv4.c: context mismatch as we don't have upstream commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") and 08eaef904031 ("tcp: Clean up some functions.") - net/ipv6/tcp_ipv6.c: context mismatch as we don't have upstream commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") - net/ipv4/tcp_minisocks.c: hunk applied manually to fix a build issue caused by missing upstream commit 08eaef904031 ("tcp: Clean up some functions.") commit 4461568aa4e565de2c336f4875ddf912f26da8a5 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Wed Sep 7 18:10:20 2022 -0700 tcp: Access &tcp_hashinfo via net. We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2022-11-08 17:10:59 +01:00
Davide Caratti	9aac6c4346	net: add per_cpu_fw_alloc field to struct proto Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858 Upstream Status: net.git commit 0defbb0af775 Conflicts: - net/core/sock.c: context mismatch because of missing backport of upstream commit f20cfd662a62 ("net: add sanity check in proto_register()") commit 0defbb0af775ef037913786048d099bbe8b9a2c2 Author: Eric Dumazet <edumazet@google.com> Date: Wed Jun 8 23:34:08 2022 -0700 net: add per_cpu_fw_alloc field to struct proto Each protocol having a ->memory_allocated pointer gets a corresponding per-cpu reserve, that following patches will use. Instead of having reserved bytes per socket, we want to have per-cpu reserves. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2022-11-08 17:10:55 +01:00
Davide Caratti	a3894ee946	net: inet: Retire port only listening_hash Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858 Upstream Status: net.git commit cae3873c5b3a commit cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120 Author: Martin KaFai Lau <kafai@fb.com> Date: Wed May 11 17:06:05 2022 -0700 net: inet: Retire port only listening_hash The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address). After commit `0ee58dad5b` ("net: tcp6: prefer listeners bound to an address") and commit `d9fbc7f643` ("net: tcp: prefer listeners bound to an address"), the TCP-SYN lookup fast path does not use listening_hash. The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2") also moved the seq_file (/proc/net/tcp) iteration usage from listening_hash to lhash2. There are still a few listening_hash usages left. One of them is inet_reuseport_add_sock() which uses the listening_hash to search a listen sk during the listen() system call. This turns out to be very slow on use cases that listen on many different VIPs at a popular port (e.g. 443). [ On top of the slowness in adding to the tail in the IPv6 case ]. The latter patch has a selftest to demonstrate this case. This patch takes this chance to move all remaining listening_hash usages to lhash2 and then retire listening_hash. Since most changes need to be done together, it is hard to cut the listening_hash to lhash2 switch into small patches. The changes in this patch is highlighted here for the review purpose. 1. Because of the listening_hash removal, lhash2 can use the sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node. This will also keep the sk_unhashed() check to work as is after stop adding sk to listening_hash. The union is removed from inet_listen_hashbucket because only nulls_head is needed. 2. icsk->icsk_listen_portaddr_node and its helpers are removed. 3. The current lhash2 users needs to iterate with sk_nulls_node instead of icsk_listen_portaddr_node. One case is in the inet[6]_lhash2_lookup(). Another case is the seq_file iterator in tcp_ipv4.c. One thing to note is sk_nulls_next() is needed because the old inet_lhash2_for_each_icsk_continue() does a "next" first before iterating. 4. Move the remaining listening_hash usage to lhash2 inet_reuseport_add_sock() which this series is trying to improve. inet_diag.c and mptcp_diag.c are the final two remaining use cases and is moved to lhash2 now also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2022-11-08 17:10:54 +01:00
Davide Caratti	c7ab33ab51	tcp: add a missing nf_reset_ct() in 3WHS handling Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491 Upstream Status: net-next.git commit 6f0012e35160 commit 6f0012e35160cd08a53e46e3b3bbf724b92dfe68 Author: Eric Dumazet <edumazet@google.com> Date: Thu Jun 23 05:04:36 2022 +0000 tcp: add a missing nf_reset_ct() in 3WHS handling When the third packet of 3WHS connection establishment contains payload, it is added into socket receive queue without the XFRM check and the drop of connection tracking context. This means that if the data is left unread in the socket receive queue, conntrack module can not be unloaded. As most applications usually reads the incoming data immediately after accept(), bug has been hiding for quite a long time. Commit 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") exposed this bug because even if the application reads this data, the skb with nfct state could stay in a per-cpu cache for an arbitrary time, if said cpu no longer process RX softirqs. Many thanks to Ilya Maximets for reporting this issue, and for testing various patches: https://lore.kernel.org/netdev/20220619003919.394622-1-i.maximets@ovn.org/ Note that I also added a missing xfrm4_policy_check() call, although this is probably not a big issue, as the SYN packet should have been dropped earlier. Fixes: `b59c270104` ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done") Reported-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Tested-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Link: https://lore.kernel.org/r/20220623050436.1290307-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2022-11-07 10:10:25 +01:00
Antoine Tenart	626c678449	net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit f8319dfd1b3b3be6c08795017fc30f880f8bc861 Author: Menglong Dong <imagedong@tencent.com> Date: Fri May 13 11:03:39 2022 +0800 net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv() The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the return value of tcp_inbound_md5_hash(). And it can panic the kernel with NULL pointer in net_dm_packet_report_size() if the reason is 0, as drop_reasons[0] is NULL. Fixes: 1330b6ef3313 ("skb: make drop reason booleanable") Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-14 17:40:26 +02:00
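The failure mode in the entry above (a reason of 0 indexing a NULL slot in a name table) can be modeled in a few lines. This is a hedged sketch, not the kernel fix: the enum values, the `dr_names` table and the `dr_*` helpers are hypothetical, and mapping reason 0 to a "not specified" value in a helper is just one way to express the reset the commit title describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the drop-reason enum; 0 means "not dropped yet". */
enum dr_reason {
	DR_NOT_DROPPED_YET = 0,
	DR_NOT_SPECIFIED,
	DR_TCP_MD5FAILURE,
	DR_MAX,
};

/* Name table with no entry for index 0, so that slot is implicitly NULL. */
static const char *const dr_names[DR_MAX] = {
	[DR_NOT_SPECIFIED] = "NOT_SPECIFIED",
	[DR_TCP_MD5FAILURE] = "TCP_MD5FAILURE",
};

/* Reset a "not dropped yet" value before it reaches any reporting path. */
static enum dr_reason dr_reset_if_unset(enum dr_reason reason)
{
	return reason == DR_NOT_DROPPED_YET ? DR_NOT_SPECIFIED : reason;
}

/* Safe lookup: never returns the NULL slot for reason 0. */
static const char *dr_name(enum dr_reason reason)
{
	return dr_names[dr_reset_if_unset(reason)];
}
```

A consumer that dereferences `dr_names[0]` directly would hit the NULL pointer; routing every reason through `dr_reset_if_unset()` first avoids that.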
Antoine Tenart	04f4917aca	skb: make drop reason booleanable Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 1330b6ef3313fcec577d2b020c290dc8b9f11f1a Author: Jakub Kicinski <kuba@kernel.org> Date: Mon Mar 7 16:44:21 2022 -0800 skb: make drop reason booleanable We have a number of cases where function returns drop/no drop decision as a boolean. Now that we want to report the reason code as well we have to pass extra output arguments. We can make the reason code evaluate correctly as bool. I believe we're good to reorder the reasons as they are reported to user space as strings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:24 +02:00
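A short sketch of what "booleanable" means in the entry above: if the "not dropped" value of the enum is 0, a function can return the reason directly and callers can test it like a boolean, without a separate output argument. The enum and the `bd_*` functions are hypothetical illustrations, not kernel identifiers.

```c
#include <assert.h>

/* Hypothetical reason codes; the key property is that "no drop" is 0. */
enum bd_reason {
	BD_NOT_DROPPED_YET = 0,
	BD_NOT_SPECIFIED,
	BD_SOCKET_FILTER,
};

/* Returns the drop reason, or 0 when the packet should be kept. */
static enum bd_reason bd_filter(int payload_len)
{
	if (payload_len < 0)
		return BD_SOCKET_FILTER;
	return BD_NOT_DROPPED_YET;
}

/*
 * The caller treats the reason as a drop / no-drop boolean and still has
 * the precise code available for reporting. Returns 1 if delivered.
 */
static int bd_receive(int payload_len, enum bd_reason *why)
{
	enum bd_reason reason = bd_filter(payload_len);

	if (reason) {
		*why = reason;
		return 0;
	}
	return 1;
}
```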
Antoine Tenart	997d93a49f	net/tcp: Merge TCP-MD5 inbound callbacks Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 7bbb765b73496699a165d505ecdce962f903b422 Author: Dmitry Safonov <0x7f454c46@gmail.com> Date: Wed Feb 23 17:57:40 2022 +0000 net/tcp: Merge TCP-MD5 inbound callbacks The functions do essentially the same work to verify TCP-MD5 sign. Code can be merged into one family-independent function in order to reduce copy'n'paste and generated code. Later with TCP-AO option added, this will allow to create one function that's responsible for segment verification, that will have all the different checks for MD5/AO/non-signed packets, which in turn will help to see checks for all corner-cases in one function, rather than spread around different families and functions. Cc: Eric Dumazet <edumazet@google.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: Dmitry Safonov <dima@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:24 +02:00
Antoine Tenart	7e7867a749	net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 8eba65fa5f06519042b98564089b942d795e3f8d Author: Menglong Dong <imagedong@tencent.com> Date: Sun Feb 20 15:06:34 2022 +0800 net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv() Replace kfree_skb() used in tcp_v4_do_rcv() and tcp_v6_do_rcv() with kfree_skb_reason(). Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:22 +02:00
Antoine Tenart	0b99c6c861	net: tcp: add skb drop reasons to tcp_add_backlog() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git Conflicts:\ - In tcp.h due to missing commit f35f821935d8 ("tcp: defer skb freeing after socket lock is released") in C9S; which is fine btw as the chunk in tcp.h was later removed upstream by commit 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists"). commit 7a26dc9e7b43f5a24c4b843713e728582adf1c38 Author: Menglong Dong <imagedong@tencent.com> Date: Sun Feb 20 15:06:33 2022 +0800 net: tcp: add skb drop reasons to tcp_add_backlog() Pass the address of drop_reason to tcp_add_backlog() to store the reasons for skb drops when fails. Following drop reasons are introduced: SKB_DROP_REASON_SOCKET_BACKLOG Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:22 +02:00
Antoine Tenart	de5f3d75e9	net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 643b622b51f1f0015e0a80f90b4ef9032e6ddb1b Author: Menglong Dong <imagedong@tencent.com> Date: Sun Feb 20 15:06:32 2022 +0800 net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash() Pass the address of drop reason to tcp_v4_inbound_md5_hash() and tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this function fails. Therefore, the drop reason can be passed to kfree_skb_reason() when the skb needs to be freed. Following drop reasons are added: SKB_DROP_REASON_TCP_MD5NOTFOUND SKB_DROP_REASON_TCP_MD5UNEXPECTED SKB_DROP_REASON_TCP_MD5FAILURE SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5* Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:22 +02:00
Antoine Tenart	21c3e93b20	net: tcp: add skb drop reasons to tcp_v4_rcv() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 255f9034d3050fb1d0691226712c6b7f1ca674cd Author: Menglong Dong <imagedong@tencent.com> Date: Sun Feb 20 15:06:30 2022 +0800 net: tcp: add skb drop reasons to tcp_v4_rcv() Use kfree_skb_reason() for some path in tcp_v4_rcv() that missed before, including: SKB_DROP_REASON_SOCKET_FILTER SKB_DROP_REASON_XFRM_POLICY Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:22 +02:00
Antoine Tenart	36fdf75633	net: socket: rename SKB_DROP_REASON_SOCKET_FILTER Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 364df53c081d93fcfd6b91085ff2650c7f17b3c7 Author: Menglong Dong <imagedong@tencent.com> Date: Thu Jan 27 17:13:01 2022 +0800 net: socket: rename SKB_DROP_REASON_SOCKET_FILTER Rename SKB_DROP_REASON_SOCKET_FILTER, which is used as the reason of skb drop out of socket filter before it's part of a released kernel. It will be used for more protocols than just TCP in future series. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/all/20220127091308.91401-2-imagedong@tencent.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:21 +02:00
Felix Maurer	de20724127	net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620 commit 91a760b26926265a60c77ddf016529bcf3e17a04 Author: Menglong Dong <imagedong@tencent.com> Date: Thu Jan 6 21:20:20 2022 +0800 net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2022-08-24 16:53:48 +02:00
Paolo Abeni	036c0e121e	tcp: add accessors to read/set tp->snd_cwnd Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465 Tested: LNST, Tier1 Upstream commit: commit 40570375356c874b1578e05c1dcc3ff7c1322dbe Author: Eric Dumazet <edumazet@google.com> Date: Tue Apr 5 16:35:38 2022 -0700 tcp: add accessors to read/set tp->snd_cwnd We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero. Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added in commit `8b8a321ff7` ("tcp: fix zero cwnd in tcp_cwnd_reduction") can trigger, and without a repro we would have to spend considerable time finding the bug. Instead of complaining too late, we want to catch where and when tp->snd_cwnd is set to an illegal value. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-06-27 16:43:55 +02:00
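The accessor idea in the entry above, catching an illegal `snd_cwnd` at the moment it is written, can be sketched in userspace as follows. The struct, the `demo_*` names and the counter are hypothetical; a real kernel helper would presumably warn through the kernel's own WARN machinery, which this sketch replaces with a message on stderr.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the one tcp_sock field the commit is about. */
struct demo_tcp_sock {
	uint32_t snd_cwnd;
};

/* Counts illegal writes so a test or a debugger can observe them. */
static unsigned int demo_illegal_cwnd_writes;

/* Read accessor: a single place through which every reader goes. */
static inline uint32_t demo_snd_cwnd(const struct demo_tcp_sock *tp)
{
	return tp->snd_cwnd;
}

/*
 * Write accessor: complain at the point where a non-positive value is
 * stored, not later when some other code trips over it.
 */
static inline void demo_snd_cwnd_set(struct demo_tcp_sock *tp, uint32_t val)
{
	if ((int32_t)val <= 0) {
		demo_illegal_cwnd_writes++;
		fprintf(stderr, "illegal snd_cwnd value %u\n", (unsigned int)val);
	}
	tp->snd_cwnd = val;
}
```

Routing all writes through one setter means the first bad assignment is reported together with its call site, which is the debugging benefit the commit message describes.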
Patrick Talbert	8c5b3f7fd9	Merge: XDP and networking eBPF rebase to v5.15 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 Depends: !572 Tested: Using bpf selftests, everything passes. This rebases XDP and networking eBPF to upstream kernel version 5.15. Signed-off-by: Jiri Benc <jbenc@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Toke Høiland-Jørgensen <toke@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-06-03 09:26:25 +02:00


889 Commits