JIRA: https://issues.redhat.com/browse/RHEL-65787
Conflicts: Context difference due to missing af9784d007d8 ("tcp: diag:
add support for TIME_WAIT sockets to tcp_abort()") and out-of-order
backport of bac76cf89816 ("tcp: fix forever orphan socket caused by
tcp_abort")
commit 4ddbcb886268af8d12a23e6640b39d1d9c652b1b
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date: Fri May 19 22:51:55 2023 +0000
bpf: Add bpf_sock_destroy kfunc
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends. The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated. The kfunc also allows terminating sockets that may
or may not be actively sending traffic.
The kfunc can currently be called only from BPF TCP and UDP iterators,
where users can filter and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of
protocol-specific `diag_destroy` handlers. The previous commit that
batches UDP sockets during iteration facilitated a synchronous
invocation of the UDP destroy callback from BPF context by skipping
socket locks in `udp_abort`. The TCP iterator already supported
batching of the sockets being iterated. To that end, a
`tracing_iter_filter` callback is added so that the verifier can
restrict the kfunc to programs with the `BPF_TRACE_ITER` attach type,
and reject other programs.
The kfunc takes a `sock_common` type argument, even though it expects,
and casts it to, a `sock` pointer. This enables the verifier to allow
the sock_destroy kfunc to be called for TCP with `sock_common` and UDP
with `sock` structs. Furthermore, as `sock_common` only has a subset of
the fields of `sock`, casting the pointer to the latter type might not
always be safe for certain sockets, such as request sockets, but these
have special handling in the diag_destroy handlers.
Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
e.g., getting a sk pointer (possibly even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.
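For illustration, a minimal BPF iterator sketch modeled on the kernel
selftests (progs/sock_destroy_prog.c); DST_PORT is a hypothetical
filter value:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define DST_PORT 4321    /* hypothetical: port whose clients we destroy */

/* kfunc exported by this patch */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int iter_tcp_destroy(struct bpf_iter__tcp *ctx)
{
    struct sock_common *sk_common = ctx->sk_common;

    if (!sk_common)
        return 0;

    /* filter: destroy only sockets connected to DST_PORT */
    if (sk_common->skc_dport == bpf_htons(DST_PORT))
        bpf_sock_destroy(sk_common);

    return 0;
}

char _license[] SEC("license") = "GPL";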
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e97ba552b8d3dd074a28b8600740b8bed42267b
commit 6e97ba552b8d3dd074a28b8600740b8bed42267b
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 4 14:46:16 2023 +0000
tcp: set TCP_DEFER_ACCEPT locklessly
The rskq_defer_accept field can be read/written without
holding the socket lock.
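This commit and the TCP_LINGER2, TCP_KEEPCNT, TCP_KEEPINTVL,
TCP_USER_TIMEOUT and TCP_SYNCNT commits below all rely on the same
pattern; a minimal sketch (field names per this patch):

    /* do_tcp_setsockopt(): the store no longer needs lock_sock() */
    case TCP_DEFER_ACCEPT:
        /* Translate value in seconds to number of retransmits */
        WRITE_ONCE(icsk->icsk_accept_queue.rskq_defer_accept,
                   secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
                                   TCP_RTO_MAX / HZ));
        return 0;

    /* every reader pairs with the lockless store via READ_ONCE() */
    val = READ_ONCE(icsk->icsk_accept_queue.rskq_defer_accept);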
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71
commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 4 14:46:15 2023 +0000
tcp: set TCP_LINGER2 locklessly
tp->linger2 can be set locklessly as long as readers
use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 84485080cbc1e5a011e7549966739df4cec158b1
commit 84485080cbc1e5a011e7549966739df4cec158b1
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 4 14:46:14 2023 +0000
tcp: set TCP_KEEPCNT locklessly
tp->keepalive_probes can be set locklessly, readers
are already taking care of this field being potentially
set by other threads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6fd70a6b4e6f58482a43da46d12323795bbc5f68
commit 6fd70a6b4e6f58482a43da46d12323795bbc5f68
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 4 14:46:13 2023 +0000
tcp: set TCP_KEEPINTVL locklessly
tp->keepalive_intvl can be set locklessly, readers
are already taking care of this field being potentially
set by other threads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit d58f2e15aa0c07f6f03ec71f64d7697ca43d04a1
commit d58f2e15aa0c07f6f03ec71f64d7697ca43d04a1
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 4 14:46:12 2023 +0000
tcp: set TCP_USER_TIMEOUT locklessly
icsk->icsk_user_timeout can be set locklessly,
if all read sides use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit d44fd4a767b3755899f8ad1df3e8eca3961ba708
commit d44fd4a767b3755899f8ad1df3e8eca3961ba708
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 4 14:46:11 2023 +0000
tcp: set TCP_SYNCNT locklessly
icsk->icsk_syn_retries can safely be set without locking the socket.
We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer()
and tcp_write_timeout().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 70f360dd7042cb843635ece9d28335a4addff9eb
commit 70f360dd7042cb843635ece9d28335a4addff9eb
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Jul 19 21:28:57 2023 +0000
tcp: annotate data-races around fastopenq.max_qlen
This field can be read locklessly.
Fixes: 1536e2857b ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948
Conflicts:
- net/ipv4/tcp_ipv4.c: keep using sock_net(sk) as we don't have
upstream commit 08eaef904031 ("tcp: Clean up some functions.")
commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Jul 19 21:28:48 2023 +0000
tcp: annotate data-races around tp->tsoffset
do_tcp_getsockopt() reads tp->tsoffset while another cpu
might change its value.
Fixes: 93be6ce0e9 ("tcp: set and get per-socket timestamp")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252
JIRA: https://issues.redhat.com/browse/RHEL-27743
JIRA: https://issues.redhat.com/browse/RHEL-59459
CVE: CVE-2024-46787
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.
This work follows up on the previous v6.5 update (RHEL-27742) and, as such,
the bulk of this changeset comprises refactoring and clean-ups of
the internal implementation of several APIs as it further advances the
conversion to folios, and follows up on the per-VMA locking changes.
Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,
and we add a potential extra level of protection (assessment pending) to help
mitigate the kernel heap exploit technique dubbed "SlubStick".
Follow-up fixes are omitted from this series either because they are irrelevant to
the bits we support on RHEL or because they depend on bigger changesets introduced
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.
Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")
Signed-off-by: Rafael Aquini <raquini@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context, as rhel-9 lacks the upstream \
commit 4ddbcb886268 ("bpf: Add bpf_sock_destroy kfunc"). Also no \
tcp_done_with_error() in rhel-9, since it lacks the upstream commit \
5ce4645c23cf ("tcp: fix races in tcp_abort()")
Upstream commit:
commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
Author: Xueming Feng <kuro@kuroa.me>
Date: Mon Aug 26 18:23:27 2024 +0800
tcp: fix forever orphan socket caused by tcp_abort
We have had some problems closing zero-window FIN-WAIT-1 TCP sockets in
our environment. This patch comes from that investigation.
Previously, tcp_abort only sent out a reset and called tcp_done when the
socket was not SOCK_DEAD, a.k.a. an orphan. For an orphan socket, it
would only purge the write queue, but not close the socket, leaving it
to the timers.
While purging the write queue, tp->packets_out and sk->sk_write_queue
are cleared along the way. However, tcp_retransmit_timer has an early
return based on !tp->packets_out and tcp_probe_timer has an early
return based on !sk->sk_write_queue.
This causes ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not to be rescheduled
and the socket not to be killed by the timers, converting a zero-window
orphan into a forever orphan.
This patch removes the SOCK_DEAD check in tcp_abort, making it send a
reset to the peer and close the socket accordingly, preventing the
timer-less orphan from happening.
According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is handled
by testing for TCP_CLOSE inside the lock, and returning -ENOENT if it is
already closed.
The -ENOENT code comes from the associated patch Lorenzo made for
iproute2-ss (link attached below), which also conforms to RFC 9293.
At the end of the patch, tcp_write_queue_purge(sk) is removed because it
was already called in tcp_done_with_error().
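A sketch of the resulting tcp_abort() tail (upstream shape; per the
conflicts notes above, the rhel-9 backport ends with tcp_done() since
it lacks tcp_done_with_error()):

    /* Don't race with userspace socket closes such as tcp_close. */
    lock_sock(sk);
    /* avoid killing the same socket twice */
    if (sk->sk_state == TCP_CLOSE) {
        release_sock(sk);
        return -ENOENT;
    }

    if (sk->sk_state == TCP_LISTEN) {
        tcp_set_state(sk, TCP_CLOSE);
        inet_csk_listen_stop(sk);
    }

    /* Don't race with BH socket closes such as inet_csk_listen_stop. */
    local_bh_disable();
    bh_lock_sock(sk);

    /* SOCK_DEAD check removed: orphans now also get the reset and
     * tcp_done(), so no timer-less orphan can be left behind.
     */
    if (tcp_need_reset(sk->sk_state))
        tcp_send_active_reset(sk, GFP_ATOMIC);    /* plus a reset reason
                                                   * on rstreason kernels */
    tcp_done_with_error(sk, err);

    bh_unlock_sock(sk);
    local_bh_enable();
    release_sock(sk);
    return 0;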
p.s. This is the same patch as v2. Resent due to a mis-labeled "changes
requested" state on patchwork.kernel.org.
Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro@kuroa.me>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
lacks the TCP AO support.
Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date: Sun Apr 7 09:33:22 2024 +0000
tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt:
tcp_timewait_state_process()
...
    u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
    if (isn == 0)
        isn++;
    TCP_SKB_CB(skb)->tcp_tw_isn = isn;
    return TCP_TW_SYN;
This SYN packet also bypasses normal checks against listen queue
being full or not.
tcp_conn_request()
...
    __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
...
    /* TW buckets are converted to open requests without
     * limitations, they conserve resources and peer is
     * evidently real one.
     */
    if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
        want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
        if (!want_cookie)
            goto drop;
    }
This was using the TCP_SKB_CB(skb)->tcp_tw_isn field in the skb.
Unfortunately this field has been accidentally cleared
after the call to tcp_timewait_state_process() returning
TCP_TW_SYN.
Using a field in TCP_SKB_CB(skb) for a temporary state
is overkill.
Switch instead to a per-cpu variable.
As a bonus, we do not have to clear tcp_tw_isn in TCP receive
fast path.
It is temporarily set then cleared only in the TCP_TW_SYN dance.
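A hedged sketch of the mechanism (simplified; the actual patch threads
the value from tcp_timewait_state_process() through the TCP_TW_SYN
paths of tcp_v4_rcv()/tcp_v6_rcv() into tcp_conn_request()):

    DEFINE_PER_CPU(u32, tcp_tw_isn);

    /* TIME_WAIT path: stash the ISN for the tcp_conn_request() call
     * that immediately follows on this cpu.
     */
    __this_cpu_write(tcp_tw_isn, isn);

    /* tcp_conn_request(): consume and clear the stashed value */
    isn = __this_cpu_read(tcp_tw_isn);
    if (isn)
        __this_cpu_write(tcp_tw_isn, 0);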
Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* MAINTAINERS: minor context difference due to backport of upstream commit
14006f1d8fa2 ("Documentations: Analyze heavily used Networking related structs")
This patch is a backport of the following upstream commit:
commit 350f6bbca1de515cd7519a33661cefc93ea06054
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Jul 24 19:54:02 2023 +0100
mm: allow per-VMA locks on file-backed VMAs
Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
The fault path will immediately fail in handle_mm_fault(). There may be a
small performance reduction from this patch as a little unnecessary work
will be done on each page fault. See later patches for the improvement.
Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts:
inet/tls/tls_sw.c: we already have:
* 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
net/ipv4/tcp_ipv4.c: we already have:
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
* 7a26dc9e7b43 net: tcp: add skb drop reasons to tcp_add_backlog()
commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 22 13:12:37 2022 -0700
net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before the recvmsg() thread. This issue is more visible
with BIG TCP, as more RPCs fit in one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead().
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
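A condensed sketch of the deferral path added by this patch (names
follow the patch; threshold and error handling simplified):

    void skb_attempt_defer_free(struct sk_buff *skb)
    {
        int cpu = skb->alloc_cpu;    /* cpu that allocated the skb */
        struct softnet_data *sd;
        unsigned long flags;
        bool kick;

        if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu) ||
            cpu == raw_smp_processor_id()) {
            __kfree_skb(skb);        /* free locally, nothing to defer */
            return;
        }

        sd = &per_cpu(softnet_data, cpu);
        spin_lock_irqsave(&sd->defer_lock, flags);
        /* single-linked list; the lockless reader uses READ_ONCE() */
        skb->next = sd->defer_list;
        WRITE_ONCE(sd->defer_list, skb);
        kick = ++sd->defer_count >= 128;    /* threshold illustrative */
        spin_unlock_irqrestore(&sd->defer_lock, flags);

        /* if the remote cpu is slow to run net_rx_action(), raise
         * NET_RX_SOFTIRQ there via an IPI
         */
        if (unlikely(kick))
            smp_call_function_single_async(cpu, &sd->defer_csd);
    }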
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() looks better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 29fbc26e6dfc7be351c23261938de3f93f5cde57
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Nov 15 11:02:48 2021 -0800
tcp: do not call tcp_cleanup_rbuf() if we have a backlog
Under pressure, tcp recvmsg() has logic to process the socket backlog,
but calls tcp_cleanup_rbuf() right before.
Avoiding sending ACK right before processing new segments makes
a lot of sense, as this decreases the number of ACK packets,
with no impact on effective ACK clocking.
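In tcp_recvmsg_locked(), the tcp_cleanup_rbuf() call, previously made
unconditionally right before this branch, moves into the sleeping path;
roughly (sketch):

    if (copied >= target) {
        /* Do not sleep, just process backlog. */
        __sk_flush_backlog(sk);
    } else {
        /* only ACK now if we are not about to process the backlog */
        tcp_cleanup_rbuf(sk, copied);
        sk_wait_data(sk, &timeo, last);
    }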
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit ebdc1a0309629e71e5910b353e6b005f022ce171
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jan 20 04:45:30 2022 -0800
tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
Without it, splice users can hit the warning
added in commit 79074a72d335 ("net: Flush deferred skb free on socket destroy")
Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
Fixes: 79074a72d335 ("net: Flush deferred skb free on socket destroy")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220120124530.925607-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts:
* include/net/tcp.h: we already have
7a26dc9e7b43 ("net: tcp: add skb drop reasons to tcp_add_backlog()")
* net/ipv4/tcp.c: we already have
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
commit f35f821935d8df76f9c92e2431a225bdff938169
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Nov 15 11:02:46 2021 -0800
tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free asymmetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
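A sketch of the drain helper this patch introduces (shape per the
upstream patch; the per-socket llist was later replaced by the per-cpu
list described elsewhere in this series):

    static inline void sk_defer_free_flush(struct sock *sk)
    {
        struct llist_node *head;
        struct sk_buff *skb, *next;

        if (llist_empty(&sk->defer_list))
            return;

        head = llist_del_all(&sk->defer_list);
        llist_for_each_entry_safe(skb, next, head, ll_node)
            __kfree_skb(skb);
    }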
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765
JIRA: https://issues.redhat.com/browse/RHEL-48648
Various visibility improvements, mainly around drop reasons, reset reasons and improved tracepoints this time.
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4307
JIRA: https://issues.redhat.com/browse/RHEL-30902
Tested: manual testing and preliminary LNST run show improvement in some
tests and no regressions.
The fields that the rx and tx paths use were placed all over the core
networking structs. Reorganize these structs so the fields of each
struct that are read/written in rx/tx paths are placed close to each
other to reduce the number of cache lines used.
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commit e13ec3da05d1 ("tcp:
annotate lockless access to sk->sk_err") in c9s.
commit 5691276b39daf90294c6a81fb6d62d667f634c92
Author: Jason Xing <kernelxing@tencent.com>
Date: Thu Apr 25 11:13:36 2024 +0800
rstreason: prepare for active reset
Like what we did to passive reset:
only passing possible reset reason in each active reset path.
No functional changes.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
commit 119ff04864a24470b1e531bb53e5c141aa8fefb0
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 8 14:43:21 2024 +0000
tcp: move tp->scaling_ratio to tcp_sock_read_txrx group
tp->scaling_ratio is a read-mostly field, used in rx and tx fast paths.
Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Wei Wang <weiwan@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/tcp.h: Context difference because tcp_usec_ts is missing
due to missing 614e8316aa4c ("tcp: add support for usec resolution in TCP
TS values")
Omitted-fix: 666a877deab2 ("tcp: move tp->tcp_usec_ts to tcp_sock_read_txrx group")
This field was never backported, see conflicts.
commit d5fed5addb2b6bc13035de4338b7ea2052a2e006
Author: Coco Li <lixiaoyan@google.com>
Date: Mon Dec 4 20:12:31 2023 +0000
tcp: reorganize tcp_sock fast path variables
The variables are organized in the following way:
- TX read-mostly hotpath cache lines
- TXRX read-mostly hotpath cache lines
- RX read-mostly hotpath cache lines
- TX read-write hotpath cache line
- TXRX read-write hotpath cache line
- RX read-write hotpath cache line
Fastpath cachelines end after rcvq_space.
Cache line boundaries are enforced only between read-mostly and
read-write. That is, if read-mostly tx cachelines bleed into
read-mostly txrx cachelines, we do not care. We care about the
boundaries between read and write cachelines because we want
to prevent false sharing.
Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 8
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231204201232.520025-3-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4047
The following MR is part of the MM SSTs plan to update the MM codebase to v6.4 for RHEL 9.5. As part of this effort, we have one MR for each v6.x upstream update, without the follow-up fixes.
This MR serves to ensure we are maintaining stability by utilizing available upstream fixes for the commits we have in the MM codebase.
The rough structure of this MR is as follows:
```
The first set of patches (1-28) are missing patches from <v6.1
The second set of patches (29-86) are the fixes that were omitted from the v6.1-v6.4 updates
The third set of patches (87-129) are fixes from v6.4+ that are marked as STABLE patches
The fourth set of patches (130-171) are other fixes that are not stable patches and affect previous RHEL releases, or are fixes that I missed from step 2
```
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3951
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3803
JIRA: https://issues.redhat.com/browse/RHEL-27740
Depends: !3738
Like the 6.1, 6.2 and 6.4 backports, we're not backporting fixes: instead
it'll be done as a follow-up MR in order to prevent wasting time merging
out-of-order fixes.
LTP failures on mremap06 and vma02 are fixed by a 6.4 patch
(7e7757876f25) that will be included by the 6.4 backport (!3951)
Omitted-fix: 443ed4c302fff6a26af980300463343a7adc9ee8
Omitted-fix: 10f4c9b9a33b7df000f74fa0d896351fb1a61e6a
Omitted-fix: 95a301eefa82057571207edd06ea36218985a75e
Omitted-fix: a101482421a318369eef2d0e03f2fcb40a47abad
Omitted-fix: 6c54312f9689fbe27c70db5d42eebd29d04b672e
Omitted-fix: 6f74c0ec2095335158015ce29b708e775b9cea3a
Omitted-fix: c643e6ebedb435bcf863001f5e69a578f2658055
Omitted-fix: 77795f900e2a07c1cbedc375789aefb43843b6c2
Omitted-fix: 2658f94d679243209889cdfa8de3743cde1abea9
Omitted-fix: 7e7757876f258d99266e7b3c559639289a2a45fe
Omitted-fix: 9425c591e06a9ab27a145ba655fb50532cf0bcc9
Omitted-fix: d1adb25df7111de83b64655a80b5a135adbded61
Omitted-fix: 4d4b6d66db63ceed399f1fb1a4b24081d2590eb1
Omitted-fix: a259945efe6ada94087ef666e9b38f8e34ea34ba
Omitted-fix: 00ca0f2e86bf40b016a646e6323a8941a09cf106
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Lucas Zampieri <lzampier@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
commit 7a7f094635349a7d0314364ad50bdeb770b6df4f
Author: Arjun Roy <arjunroy@google.com>
Date: Fri Jun 16 12:34:27 2023 -0700
tcp: Use per-vma locking for receive zerocopy
Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.
Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are periodically
blocked during that period of time - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.
Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.
However, with per-vma locking, both of these problems can be avoided.
As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.
When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.
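A sketch of the lookup helper the patch adds (shape per the upstream
patch; it falls back to the mmap read lock when the per-VMA lock cannot
be taken):

    static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
                                               unsigned long address,
                                               bool *mmap_locked)
    {
        struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);

        if (vma) {
            if (vma->vm_ops != &tcp_vm_ops) {
                vma_end_read(vma);
                return NULL;
            }
            *mmap_locked = false;
            return vma;
        }

        mmap_read_lock(mm);
        vma = vma_lookup(mm, address);
        if (!vma || vma->vm_ops != &tcp_vm_ops) {
            mmap_read_unlock(mm);
            return NULL;
        }
        *mmap_locked = true;
        return vma;
    }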
Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for the sake of saving work
commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date: Thu Jan 26 11:37:49 2023 -0800
mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
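The transformation is mechanical; for example (the wrapper helpers come
from a preceding upstream patch introducing vma->vm_flags wrapper
functions):

    /* before */
    vma->vm_flags |= VM_DONTEXPAND;
    vma->vm_flags &= ~VM_MAYWRITE;

    /* after: the helpers can assert and track vma locking */
    vm_flags_set(vma, VM_DONTEXPAND);
    vm_flags_clear(vma, VM_MAYWRITE);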
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped RISCV changes, and due to missing b59c9dc4d9d47b
commit e9adcfecf572fcfaa9f8525904cf49c709974f73
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue Jan 3 16:27:32 2023 -0800
mm: remove zap_page_range and create zap_vma_pages
zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas. While working on [1], it was
discovered that all callers of zap_page_range pass a range entirely within
a single vma. In addition, the mmu notification call within
zap_page_range does not correctly handle ranges that span multiple vmas.
When crossing a vma boundary, a new mmu_notifier_range_init/end call pair
with the new vma should be made.
Instead of fixing zap_page_range, do the following:
- Create a new routine zap_vma_pages() that will remove all pages within
the passed vma. Most users of zap_page_range pass the entire vma and
can use this new routine.
- For callers of zap_page_range not passing the entire vma, instead call
zap_page_range_single().
- Remove zap_page_range.
[1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
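For reference, the replacement helper is a thin wrapper (sketch
matching the upstream addition to include/linux/mm.h):

    static inline void zap_vma_pages(struct vm_area_struct *vma)
    {
        zap_page_range_single(vma, vma->vm_start,
                              vma->vm_end - vma->vm_start, NULL);
    }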
Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1
Upstream commit:
commit 151c9c724d05d5b0dd8acd3e11cb69ef1f2dbada
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Mar 22 13:57:32 2024 +0000
tcp: properly terminate timers for kernel sockets
We had various syzbot reports about tcp timers firing after
the corresponding netns has been dismantled.
Fortunately Josef Bacik could trigger the issue more often,
and could test a patch I wrote two years ago.
When TCP sockets are closed, we call inet_csk_clear_xmit_timers()
to 'stop' the timers.
inet_csk_clear_xmit_timers() can be called from any context,
including when socket lock is held.
This is the reason it uses sk_stop_timer(), aka del_timer().
This means that ongoing timers might finish much later.
For user sockets, this is fine because each running timer
holds a reference on the socket, and the user socket holds
a reference on the netns.
For kernel sockets, we risk that the netns is freed before
timer can complete, because kernel sockets do not hold
reference on the netns.
This patch adds an inet_csk_clear_xmit_timers_sync() function
that uses sk_stop_timer_sync() to make sure all timers
are terminated before the kernel socket is released.
Modules using kernel sockets close them in their netns exit()
handler.
Also add a sock_not_owned_by_me() helper to get LOCKDEP
support: inet_csk_clear_xmit_timers_sync() must not be called
while the socket lock is held.
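A sketch of the new helper (shape per the upstream patch):

    void inet_csk_clear_xmit_timers_sync(struct sock *sk)
    {
        struct inet_connection_sock *icsk = inet_csk(sk);

        /* ongoing timer handlers need to acquire the socket lock */
        sock_not_owned_by_me(sk);

        icsk->icsk_pending = icsk->icsk_ack.pending = 0;

        sk_stop_timer_sync(sk, &icsk->icsk_retransmit_timer);
        sk_stop_timer_sync(sk, &icsk->icsk_delack_timer);
        sk_stop_timer_sync(sk, &sk->sk_timer);
    }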
It is very possible we can, in the future, revert commit
3a58f13a881e ("net: rds: acquire refcount on TCP sockets"),
which attempted to solve the issue in rds only.
(net/smc/af_smc.c and net/mptcp/subflow.c have similar code)
We probably can remove the check_net() tests from
tcp_out_of_resources() and __tcp_close() in the future.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Closes: https://lore.kernel.org/netdev/20240314210740.GA2823176@perftesting/
Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Fixes: 8a68173691 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
Link: https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Josef Bacik <josef@toxicpanda.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/20240322135732.1535772-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32164
JIRA: https://issues.redhat.com/browse/RHEL-29496
Tested: LNST, Tier1
CVE: CVE-2024-26640
Upstream commit:
commit 577e4432f3ac810049cb7e6b71f4d96ec7c6e894
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jan 25 10:33:17 2024 +0000
tcp: add sanity checks to rx zerocopy
TCP rx zerocopy intent is to map pages initially allocated
from NIC drivers, not pages owned by a fs.
This patch adds to can_map_frag() these additional checks:
- Page must not be a compound one.
- page->mapping must be NULL.
This fixes the panic reported by ZhangPeng.
syzbot was able to loopback packets built with sendfile(),
mapping pages owned by an ext4 file to TCP rx zerocopy.
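The resulting helper looks roughly like this (sketch per the upstream
patch; the size/offset tests consolidate the pre-existing checks):

    static bool can_map_frag(const skb_frag_t *frag)
    {
        struct page *page;

        if (skb_frag_size(frag) != PAGE_SIZE || skb_frag_off(frag))
            return false;

        page = skb_frag_page(frag);

        /* new sanity checks: no compound pages, no fs-owned pages */
        if (PageCompound(page) || page->mapping)
            return false;

        return true;
    }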
r3 = socket$inet_tcp(0x2, 0x1, 0x0)
mmap(&(0x7f0000ff9000/0x4000)=nil, 0x4000, 0x0, 0x12, r3, 0x0)
r4 = socket$inet_tcp(0x2, 0x1, 0x0)
bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e24, @multicast1}, 0x10)
connect$inet(r4, &(0x7f00000006c0)={0x2, 0x4e24, @empty}, 0x10)
r5 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
0x181e42, 0x0)
fallocate(r5, 0x0, 0x0, 0x85b8)
sendfile(r4, r5, 0x0, 0x8ba0)
getsockopt$inet_tcp_TCP_ZEROCOPY_RECEIVE(r4, 0x6, 0x23,
&(0x7f00000001c0)={&(0x7f0000ffb000/0x3000)=nil, 0x3000, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0}, &(0x7f0000000440)=0x40)
r6 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
0x181e42, 0x0)
Fixes: 93ab6cc691 ("tcp: implement mmap() for zero copy receive")
Link: https://lore.kernel.org/netdev/5106a58e-04da-372a-b836-9d3d0bd2507b@huawei.com/T/
Reported-and-bisected-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1
Upstream commit:
commit 7267e8dcad6b2f9fce05a6a06335d7040acbc2b6
Author: Salvatore Dipietro <dipiets@amazon.com>
Date: Fri Jan 19 11:01:33 2024 -0800
tcp: Add memory barrier to tcp_push()
On CPUs with weak memory models, reads and updates performed by tcp_push
to the sk variables can get reordered, leaving the socket throttled when
it should not be. The tasklet running tcp_wfree() may also not observe the
memory updates in time and will skip flushing any packets throttled by
tcp_push(), delaying the sending. This can pathologically cause 40ms
extra latency due to bad interactions with delayed acks.
Adding a memory barrier in tcp_push removes the bug, similarly to the
previous commit bf06200e73 ("tcp: tsq: fix nonagle handling").
smp_mb__after_atomic() is used to avoid incurring unnecessary overhead
on x86, which is not affected.
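The fix lands in the autocork path of tcp_push(); roughly (sketch):

    if (tcp_should_autocork(sk, skb, size_goal)) {
        /* avoid atomic op if TSQ_THROTTLED bit is already set */
        if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
            NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
            set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
            /* order the TSQ_THROTTLED store against the
             * sk_wmem_alloc read below
             */
            smp_mb__after_atomic();
        }
        /* It is possible TX completion already happened
         * before we set TSQ_THROTTLED.
         */
        if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
            return;
    }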
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(), "UTF-8");
        String s = "a".repeat(3096);
        osw.write(s, 0, s.length());
        osw.flush();
    }
}
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.
No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
...
50.000% 0.91ms
75.000% 1.13ms
90.000% 1.46ms
99.000% 1.74ms
99.900% 1.89ms
99.990% 41.95ms <<< 40+ ms extra latency
99.999% 48.32ms
100.000% 48.96ms
With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
...
50.000% 0.90ms
75.000% 1.13ms
90.000% 1.45ms
99.000% 1.72ms
99.900% 1.83ms
99.990% 2.11ms <<< no 40+ ms extra latency
99.999% 2.53ms
100.000% 2.62ms
The patch has also been tested on x86 (m7i.2xlarge instance), which is
not affected by this issue, and the patch doesn't introduce any
additional delay.
Fixes: 7aa5470c2c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1
Conflicts: the out-of-order backport of upstream commit eb315a7d1396 \
("tcp: support externally provided ubufs") into rhel commit
abfc92436c needs some mangling for the first chunk in
tcp_sendmsg_locked(). The resulting code mirrors the current upstream
one.
Upstream commit:
commit 849b425cd091e1804af964b771761cfbefbafb43
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Jun 14 10:17:34 2022 -0700
tcp: fix possible freeze in tx path under memory pressure
The blamed commit only dealt with applications issuing small writes.
The issue here is that we allow forcing the memory schedule for the
sk_buff allocation, but we have no guarantee that sendmsg() is able to
copy some payload into it.
In this patch, I make sure the socket can use up to tcp_wmem[0] bytes.
For example, if we consider tcp_wmem[0] = 4096 (default on x86),
and initial skb->truesize being 1280, tcp_sendmsg() is able to
copy up to 2816 bytes under memory pressure.
Before this patch a sendmsg() sending more than 2816 bytes
would either block forever (if persistent memory pressure),
or return -EAGAIN.
For bigger MTU networks, it is advised to increase tcp_wmem[0]
to avoid sending too small packets.
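The core of the fix is a helper that force-charges up to tcp_wmem[0]
bytes when regular accounting fails (sketch per the upstream patch):

    static int tcp_wmem_schedule(struct sock *sk, int copy)
    {
        int left;

        if (likely(sk_wmem_schedule(sk, copy)))
            return copy;

        /* We could be in trouble if we have nothing queued.
         * Use whatever is left in sk->sk_forward_alloc and tcp_wmem[0]
         * to guarantee some progress.
         */
        left = sock_net(sk)->ipv4.sysctl_tcp_wmem[0] - sk->sk_wmem_queued;
        if (left > 0)
            sk_forced_mem_schedule(sk, min(left, copy));
        return min(copy, sk->sk_forward_alloc);
    }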
v2: deal with zero copy paths.
Fixes: 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
Conflicts:
* net/core/dev.c
context conflict due to missing commit 2b0cfa6e49566 ("net: add
generic percpu page_pool allocator")
* net/core/sysctl_net_core.c
context conflict due to missing commit 2658b5a8a4eee ("net: introduce
struct net_hotdata")
commit 490a79faf95e705ba0ffd9ebf04a624b379e53c9
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Mar 6 16:00:30 2024 +0000
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21223
Upstream Status: linux.git
Conflicts: Missing upstream commit 28044fc1d495 ("net: Add a bhash2
table hashed by port and address"):
CentOS Stream 9 doesn't have the ->bhash2 hash table.
Use ->bhash instead. Because ->bhash can also contain
time-wait sockets, we have to use sock_gen_put() instead of
plain sock_put().
commit 91051f003948432f83b5d2766eeb83b2b4993649
Author: Guillaume Nault <gnault@redhat.com>
Date: Fri Dec 1 15:49:52 2023 +0100
tcp: Dump bound-only sockets in inet_diag.
Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
that are bound but haven't yet called connect() or listen().
The code is inspired by the ->lhash2 loop. However there's no manual
test of the source port, since this kind of filtering is already
handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
at a time, to avoid running with bh disabled for too long.
There's no TCP state for bound but otherwise inactive sockets. Such
sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
to only dump listening sockets, actually requests the kernel to dump
sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
bound-only sockets with "ss -l", we therefore need to define a new
pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
explicitly.
With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
40000, 64000, 60000, an updated version of iproute2 could work as
follows:
$ ss -t state bound-inactive
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 0.0.0.0:40000 0.0.0.0:*
0 0 [::]:60000 [::]:*
0 0 *:64000 *:*
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/b3a84ae61e19c06806eea9c602b3b66e8f0cfc81.1701362867.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21432
Tested: LNST, Tier1
Conflicts: Different context around tcp_remove_empty_skb, as rhel lacks \
the upstream sendpage() refactor. Just add the relevant new info. \
tcp_remove_empty_skb() takes a skb as the 2nd argument, as rhel lacks the \
upstream commit 27728ba80f1e ("tcp: cleanup tcp_remove_empty_skb() use")
Upstream commit:
commit 72bf4f1767f0386970dc04726dc5bc2e3991dc19
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Oct 19 11:24:57 2023 +0000
net: do not leave an empty skb in write queue
Under memory stress conditions, tcp_sendmsg_locked()
might call sk_stream_wait_memory(), thus releasing the socket lock.
If a fresh skb has been allocated prior to this,
we should not leave it in the write queue otherwise
tcp_write_xmit() could panic.
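The fix drops such a fresh, empty skb before sleeping; roughly (sketch;
per the conflicts note above, the rhel backport passes the tail skb
explicitly since its tcp_remove_empty_skb() still takes it as the 2nd
argument):

    wait_for_space:
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
        /* do not leave a fresh, empty skb on the write queue while
         * sk_stream_wait_memory() releases the socket lock
         */
        tcp_remove_empty_skb(sk);
        if (copied)
            tcp_push(sk, flags & ~MSG_MORE, mss_now,
                     TCP_NAGLE_PUSH, size_goal);

        err = sk_stream_wait_memory(sk, &timeo);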
This apparently does not happen often, but a future change
in __sk_mem_raise_allocated() that Shakeel and others are
considering would increase chances of being hurt.
Under discussion is to remove this controversial part:
    /* Fail only if socket is _under_ its sndbuf.
     * In this case we cannot block, so that we have to fail.
     */
    if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
        /* Force charge with __GFP_NOFAIL */
        if (memcg_charge && !charged) {
            mem_cgroup_charge_skmem(sk->sk_memcg, amt,
                                    gfp_memcg_charge() | __GFP_NOFAIL);
        }
        return 1;
    }
Fixes: fdfc5c8594 ("tcp: remove empty skb from write queue in error cases")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20231019112457.1190114-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3318
Update io_uring and its dependencies to upstream kernel version 6.6.
JIRA: https://issues.redhat.com/browse/RHEL-12076
JIRA: https://issues.redhat.com/browse/RHEL-14998
JIRA: https://issues.redhat.com/browse/RHEL-4447
CVE: CVE-2023-46862
Omitted-Fix: ab69838e7c75 ("io_uring/kbuf: Fix check of BID wrapping in provided buffers")
Omitted-Fix: f74c746e476b ("io_uring/kbuf: Allow the full buffer id space for provided buffers")
This is the list of new features available (includes upstream kernel versions 6.3-6.6):
- User-specified ring buffer
- Provided buffers allocated by the kernel
- Ability to register the ring fd
- Multi-shot timeouts
- Ability to pass custom flags to the completion queue entry for ring messages
All of these features are covered by the liburing tests.
In my testing, no-mmap-inval.t failed because of a broken test. socket-uring-cmd.t also failed because of a missing SELinux policy rule. Try running audit2allow if you see a failure in that test.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>