Commit Graph

Toke Høiland-Jørgensen e9b6f5c14c bpf: Add bpf_sock_destroy kfunc
JIRA: https://issues.redhat.com/browse/RHEL-65787

Conflicts: Context difference due to missing af9784d007d8 ("tcp: diag:
add support for TIME_WAIT sockets to tcp_abort()") and out-of-order
backport of bac76cf89816 ("tcp: fix forever orphan socket caused by
tcp_abort")

commit 4ddbcb886268af8d12a23e6640b39d1d9c652b1b
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:55 2023 +0000

    bpf: Add bpf_sock_destroy kfunc

    The socket destroy kfunc is used to forcefully terminate sockets from
    certain BPF contexts. We plan to use the capability in Cilium
    load-balancing to terminate client sockets that continue to connect to
    deleted backends.  The other use case is on-the-fly policy enforcement
    where existing socket connections prevented by policies need to be
    forcefully terminated.  The kfunc also allows terminating sockets that may
    or may not be actively sending traffic.

    The kfunc can currently be called only from BPF TCP and UDP iterators,
    where users can filter and terminate selected sockets. More specifically,
    it can only be called from BPF contexts that ensure socket locking in
    order to allow synchronous execution of protocol-specific `diag_destroy`
    handlers. The previous commit that batches UDP sockets during iteration
    facilitated a synchronous invocation of the UDP destroy callback from BPF
    context by skipping socket locks in `udp_abort`. The TCP iterator already
    supported batching of the sockets being iterated. To that end, the
    `tracing_iter_filter` callback filter is added so that the verifier can
    restrict the kfunc to programs with the `BPF_TRACE_ITER` attach type, and
    reject other programs.

    The kfunc takes a `sock_common` type argument, even though it expects,
    and casts it to, a `sock` pointer. This enables the verifier to allow the
    sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
    `sock` structs. Furthermore, as `sock_common` only has a subset of
    certain fields of `sock`, casting a pointer to the latter type might not
    always be safe for certain sockets, such as request sockets, but these
    have special handling in the diag_destroy handlers.

    Additionally, the kfunc is defined with the `KF_TRUSTED_ARGS` flag to avoid
    cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer,
    e.g. getting an sk pointer (possibly even NULL) by following another sk
    pointer. The socket pointer argument passed in TCP and UDP iterators is
    tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
    were contributed by Martin KaFai Lau <martin.lau@kernel.org>.

    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
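
For illustration, a minimal BPF TCP iterator that calls the new kfunc could look like the sketch below. It is modeled on the selftests that accompany the upstream series; the `iter/tcp` section and the `bpf_iter__tcp` context are upstream BPF iterator conventions, while the program name and the local-port filter are assumptions made for this example.

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch of a BPF TCP iterator using bpf_sock_destroy(); modeled on the
 * kernel selftests for this series. Program name and port filter are
 * illustrative assumptions, not part of the backport. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* kfunc introduced by this commit; callable only from BPF_TRACE_ITER progs */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

const volatile __u16 victim_port = 8000;   /* hypothetical local port filter */

SEC("iter/tcp")
int destroy_tcp_by_port(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common)
		return 0;

	/* Only terminate sockets bound to the configured local port. */
	if (sk_common->skc_num != victim_port)
		return 0;

	/* The iterator holds the socket lock, so the protocol's
	 * diag_destroy handler runs synchronously here. */
	bpf_sock_destroy(sk_common);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Such a program would be attached as a BPF iterator and driven by reading its iterator link (e.g. via bpf_iter_create() in libbpf), which is when the filtering and destruction actually run.
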
Rado Vrbovsky 81ce48e690 Merge: mptcp: phase-1 backports for RHEL-9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449

JIRA: https://issues.redhat.com/browse/RHEL-62871  
JIRA: https://issues.redhat.com/browse/RHEL-58839  
JIRA: https://issues.redhat.com/browse/RHEL-66083  
JIRA: https://issues.redhat.com/browse/RHEL-66074  
CVE: CVE-2024-46711  
CVE: CVE-2024-45009  
CVE: CVE-2024-45010  
Upstream Status: All mainline in net.git  
Tested: kselftest  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:18:31 +00:00
Davide Caratti 6758e2bf77 tcp: set TCP_DEFER_ACCEPT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e97ba552b8d3dd074a28b8600740b8bed42267b

commit 6e97ba552b8d3dd074a28b8600740b8bed42267b
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:16 2023 +0000

    tcp: set TCP_DEFER_ACCEPT locklessly

    rskq_defer_accept field can be read/written without
    the need of holding the socket lock.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti a89122fa2a tcp: set TCP_LINGER2 locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71

commit a81722ddd7e4d76c9bbff078d29416e18c6d7f71
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:15 2023 +0000

    tcp: set TCP_LINGER2 locklessly

    tp->linger2 can be set locklessly as long as readers
    use READ_ONCE().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 6813538fe7 tcp: set TCP_KEEPCNT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 84485080cbc1e5a011e7549966739df4cec158b1

commit 84485080cbc1e5a011e7549966739df4cec158b1
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:14 2023 +0000

    tcp: set TCP_KEEPCNT locklessly

    tp->keepalive_probes can be set locklessly, readers
    are already taking care of this field being potentially
    set by other threads.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 555249fe86 tcp: set TCP_KEEPINTVL locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6fd70a6b4e6f58482a43da46d12323795bbc5f68

commit 6fd70a6b4e6f58482a43da46d12323795bbc5f68
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:13 2023 +0000

    tcp: set TCP_KEEPINTVL locklessly

    tp->keepalive_intvl can be set locklessly, readers
    are already taking care of this field being potentially
    set by other threads.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti dc3759782c tcp: set TCP_USER_TIMEOUT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit d58f2e15aa0c07f6f03ec71f64d7697ca43d04a1

commit d58f2e15aa0c07f6f03ec71f64d7697ca43d04a1
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:12 2023 +0000

    tcp: set TCP_USER_TIMEOUT locklessly

    icsk->icsk_user_timeout can be set locklessly,
    if all read sides use READ_ONCE().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti c312a246d9 tcp: set TCP_SYNCNT locklessly
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit d44fd4a767b3755899f8ad1df3e8eca3961ba708

commit d44fd4a767b3755899f8ad1df3e8eca3961ba708
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 4 14:46:11 2023 +0000

    tcp: set TCP_SYNCNT locklessly

    icsk->icsk_syn_retries can safely be set without locking the socket.

    We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer()
    and tcp_write_timeout().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
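
For context, the six options touched by this group (TCP_DEFER_ACCEPT, TCP_LINGER2, TCP_KEEPCNT, TCP_KEEPINTVL, TCP_USER_TIMEOUT, TCP_SYNCNT) are set from user space with ordinary setsockopt() calls; these backports only change how the kernel stores the values (now without taking the socket lock), not the user-visible API. A minimal usage sketch, with arbitrary example values and error checking mostly omitted:

```c
/* Sketch: setting the TCP options covered by this series from user space.
 * The option values below are arbitrary examples, not recommendations. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int defer = 5;                     /* TCP_DEFER_ACCEPT: seconds to wait for data */
	int linger2 = 30;                  /* TCP_LINGER2: FIN-WAIT-2 lifetime, seconds */
	int keepcnt = 4;                   /* TCP_KEEPCNT: unanswered probes before drop */
	int keepintvl = 15;                /* TCP_KEEPINTVL: seconds between probes */
	unsigned int user_timeout = 30000; /* TCP_USER_TIMEOUT: milliseconds */
	int syncnt = 3;                    /* TCP_SYNCNT: SYN retransmit attempts */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &defer, sizeof(defer));
	setsockopt(fd, IPPROTO_TCP, TCP_LINGER2, &linger2, sizeof(linger2));
	setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &keepcnt, sizeof(keepcnt));
	setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &keepintvl, sizeof(keepintvl));
	setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout, sizeof(user_timeout));
	setsockopt(fd, IPPROTO_TCP, TCP_SYNCNT, &syncnt, sizeof(syncnt));
	return 0;
}
```
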
Davide Caratti 2c5048dd03 tcp: annotate data-races around fastopenq.max_qlen
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 70f360dd7042cb843635ece9d28335a4addff9eb

commit 70f360dd7042cb843635ece9d28335a4addff9eb
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:57 2023 +0000

    tcp: annotate data-races around fastopenq.max_qlen

    This field can be read locklessly.

    Fixes: 1536e2857b ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-12-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 66355cbfde tcp: annotate data-races around icsk->icsk_user_timeout
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 26023e91e12c68669db416b97234328a03d8e499

commit 26023e91e12c68669db416b97234328a03d8e499
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:56 2023 +0000

    tcp: annotate data-races around icsk->icsk_user_timeout

    This field can be read locklessly from do_tcp_getsockopt()

    Fixes: dca43c75e7 ("tcp: Add TCP_USER_TIMEOUT socket option.")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-11-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 8abf5b2c25 tcp: annotate data-races around tp->notsent_lowat
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 1aeb87bc1440c5447a7fa2d6e3c2cca52cbd206b

commit 1aeb87bc1440c5447a7fa2d6e3c2cca52cbd206b
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:55 2023 +0000

    tcp: annotate data-races around tp->notsent_lowat

    tp->notsent_lowat can be read locklessly from do_tcp_getsockopt()
    and tcp_poll().

    Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-10-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti dded454961 tcp: annotate data-races around rskq_defer_accept
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit ae488c74422fb1dcd807c0201804b3b5e8a322a3

commit ae488c74422fb1dcd807c0201804b3b5e8a322a3
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:54 2023 +0000

    tcp: annotate data-races around rskq_defer_accept

    do_tcp_getsockopt() reads rskq_defer_accept while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-9-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti a0ed783738 tcp: annotate data-races around tp->linger2
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 9df5335ca974e688389c875546e5819778a80d59

commit 9df5335ca974e688389c875546e5819778a80d59
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:53 2023 +0000

    tcp: annotate data-races around tp->linger2

    do_tcp_getsockopt() reads tp->linger2 while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-8-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 1cac8d920d tcp: annotate data-races around icsk->icsk_syn_retries
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 3a037f0f3c4bfe44518f2fbb478aa2f99a9cd8bb

commit 3a037f0f3c4bfe44518f2fbb478aa2f99a9cd8bb
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:52 2023 +0000

    tcp: annotate data-races around icsk->icsk_syn_retries

    do_tcp_getsockopt() and reqsk_timer_handler() read
    icsk->icsk_syn_retries while another cpu might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-7-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti eddb55b003 tcp: annotate data-races around tp->keepalive_probes
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e5e1de616bf5f3df1769abc9292191dfad9110a

commit 6e5e1de616bf5f3df1769abc9292191dfad9110a
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:51 2023 +0000

    tcp: annotate data-races around tp->keepalive_probes

    do_tcp_getsockopt() reads tp->keepalive_probes while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-6-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 2f8df3a24e tcp: annotate data-races around tp->keepalive_intvl
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 5ecf9d4f52ff2f1d4d44c9b68bc75688e82f13b4

commit 5ecf9d4f52ff2f1d4d44c9b68bc75688e82f13b4
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:50 2023 +0000

    tcp: annotate data-races around tp->keepalive_intvl

    do_tcp_getsockopt() reads tp->keepalive_intvl while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-5-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 1a343a4516 tcp: annotate data-races around tp->keepalive_time
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 4164245c76ff906c9086758e1c3f87082a7f5ef5

commit 4164245c76ff906c9086758e1c3f87082a7f5ef5
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:49 2023 +0000

    tcp: annotate data-races around tp->keepalive_time

    do_tcp_getsockopt() reads tp->keepalive_time while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-4-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 7bb06cd72e tcp: annotate data-races around tp->tsoffset
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948
Conflicts:
  - net/ipv4/tcp_ipv4.c: keep using sock_net(sk) as we don't have
    upstream commit 08eaef904031 ("tcp: Clean up some functions.")

commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:48 2023 +0000

    tcp: annotate data-races around tp->tsoffset

    do_tcp_getsockopt() reads tp->tsoffset while another cpu
    might change its value.

    Fixes: 93be6ce0e9 ("tcp: set and get per-socket timestamp")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-3-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 988b363978 tcp: annotate data-races around tp->tcp_tx_delay
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 348b81b68b13ebd489a3e6a46aa1c384c731c919

commit 348b81b68b13ebd489a3e6a46aa1c384c731c919
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:47 2023 +0000

    tcp: annotate data-races around tp->tcp_tx_delay

    do_tcp_getsockopt() reads tp->tcp_tx_delay while another cpu
    might change its value.

    Fixes: a842fe1425 ("tcp: add optional per socket transmit delay")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-2-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:57 +01:00
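
Both this data-race annotation series and the lockless-set series above rely on the same idiom: the writer publishes a new value with WRITE_ONCE() and every lockless reader uses READ_ONCE(), so the compiler cannot tear, fuse, or re-load the accesses, while an aligned machine word remains naturally atomic in hardware. A minimal user-space illustration of the pairing (the struct and field are made up for the demo; the macro definitions mirror the kernel's simple-scalar case):

```c
/* Illustrative user-space analogue of the READ_ONCE()/WRITE_ONCE() pattern
 * used throughout this series; not kernel code. */
#include <pthread.h>
#include <stdio.h>

#define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

struct fake_sock {
	int keepalive_intvl;	/* stand-in for a tcp_sock tunable */
};

static struct fake_sock sk = { .keepalive_intvl = 75 };

/* "setsockopt" side: updates the field without holding any lock. */
static void *writer(void *arg)
{
	for (int v = 1; v <= 1000; v++)
		WRITE_ONCE(sk.keepalive_intvl, v);
	return NULL;
}

/* "getsockopt"/timer side: reads concurrently; READ_ONCE documents the
 * data race and stops the compiler from tearing or re-reading the load. */
static void *reader(void *arg)
{
	for (int i = 0; i < 1000; i++) {
		int v = READ_ONCE(sk.keepalive_intvl);
		(void)v;	/* each read observes some complete value */
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	printf("final keepalive_intvl = %d\n", READ_ONCE(sk.keepalive_intvl));
	return 0;
}
```
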
Rado Vrbovsky 384fd7eadc Merge: tcp: stable backports for 9.6 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5444

JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Several stable backports for the TCP protocol addressing sparse corner-case issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-01 08:13:57 +00:00
Rado Vrbovsky 570a71d7db Merge: mm: update core code to v6.6 upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252

JIRA: https://issues.redhat.com/browse/RHEL-27743  
JIRA: https://issues.redhat.com/browse/RHEL-59459    
CVE: CVE-2024-46787    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961  
  
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.
This work follows up on the previous v6.5 update (RHEL-27742) and, as such,
the bulk of this changeset comprises refactoring and clean-ups of
the internal implementation of several APIs as it further advances the
conversion to folios and follows up on the per-VMA locking changes.

Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,
and we add a potential extra level of protection (assessment pending) to help
mitigate kernel heap exploits dubbed "SlubStick".
    
Follow-up fixes are omitted from this series either because they are irrelevant to     
the bits we support on RHEL or because they depend on bigger changesets introduced     
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.    

Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")    
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")   
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")    
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")    
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")    
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")    
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")    
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")    
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")    
    
Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-30 07:22:28 +00:00
Paolo Abeni 1b55adcd2c tcp: fix forever orphan socket caused by tcp_abort
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context, as rhel-9 lacks the upstream \
  commit 4ddbcb886268 ("bpf: Add bpf_sock_destroy kfunc"). Also no \
  tcp_done_with_error() in rhel-9, since it lacks the upstream commit \
  5ce4645c23cf ("tcp: fix races in tcp_abort()")

Upstream commit:
commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
Author: Xueming Feng <kuro@kuroa.me>
Date:   Mon Aug 26 18:23:27 2024 +0800

    tcp: fix forever orphan socket caused by tcp_abort

    We have had problems closing zero-window fin-wait-1 tcp sockets in our
    environment. This patch comes from that investigation.

    Previously tcp_abort only sends out a reset and calls tcp_done when the
    socket is not SOCK_DEAD, aka an orphan. For an orphan socket, it only
    purges the write queue, but does not close the socket, leaving it to the
    timers.

    While purging the write queue, tp->packets_out and sk->sk_write_queue
    are cleared along the way. However tcp_retransmit_timer has an early
    return based on !tp->packets_out and tcp_probe_timer has an early
    return based on !sk->sk_write_queue.

    This causes ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not to be rescheduled
    and the socket not to be killed by the timers, converting a zero-windowed
    orphan into a forever orphan.

    This patch removes the SOCK_DEAD check in tcp_abort, making it send a
    reset to the peer and close the socket accordingly, preventing the
    timer-less orphan from happening.

    According to Lorenzo's email in the v1 thread, the check was there to
    prevent force-closing the same socket twice. That situation is handled
    by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
    already closed.

    The -ENOENT code comes from the associated patch Lorenzo made for
    iproute2-ss (link attached below), which also conforms to RFC 9293.

    At the end of the patch, tcp_write_queue_purge(sk) is removed because it
    was already called in tcp_done_with_error().

    p.s. This is the same patch with v2. Resent due to mis-labeled "changes
    requested" on patchwork.kernel.org.

    Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
    Fixes: c1e64e298b ("net: diag: Support destroying TCP sockets.")
    Signed-off-by: Xueming Feng <kuro@kuroa.me>
    Tested-by: Lorenzo Colitti <lorenzo@google.com>
    Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:16 +02:00
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
Rafael Aquini d755df6daa mm: allow per-VMA locks on file-backed VMAs
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * MAINTAINERS: minor context difference due to backport of upstream commit
      14006f1d8fa2 ("Documentations: Analyze heavily used Networking related structs")

This patch is a backport of the following upstream commit:
commit 350f6bbca1de515cd7519a33661cefc93ea06054
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:02 2023 +0100

    mm: allow per-VMA locks on file-backed VMAs

    Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
    The fault path will immediately fail in handle_mm_fault().  There may be a
    small performance reduction from this patch as a little unnecessary work
    will be done on each page fault.  See later patches for the improvement.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:35 -04:00
Wander Lairson Costa e5f704d611 net: generalize skb freeing deferral to per-cpu lists
JIRA: https://issues.redhat.com/browse/RHEL-9145

Conflicts:
net/tls/tls_sw.c: we already have:
* 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
net/ipv4/tcp_ipv4.c: we already have:
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
* 7a26dc9e7b43 net: tcp: add skb drop reasons to tcp_add_backlog()

commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 22 13:12:37 2022 -0700

    net: generalize skb freeing deferral to per-cpu lists

    Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
    lock is released") helped bulk TCP flows to move the cost of skbs
    frees outside of critical section where socket lock was held.

    But for RPC traffic, or hosts with RFS enabled, the solution is far from
    being ideal.

    For RPC traffic, recvmsg() has to return to user space right after
    skb payload has been consumed, meaning that BH handler has no chance
    to pick the skb before recvmsg() thread. This issue is more visible
    with BIG TCP, as more RPC fit one skb.

    For RFS, even if BH handler picks the skbs, they are still picked
    from the cpu on which user thread is running.

    Ideally, it is better to free the skbs (and associated page frags)
    on the cpu that originally allocated them.

    This patch removes the per socket anchor (sk->defer_list) and
    instead uses a per-cpu list, which will hold more skbs per round.

    This new per-cpu list is drained at the end of net_rx_action(),
    after incoming packets have been processed, to lower latencies.

    In normal conditions, skbs are added to the per-cpu list with
    no further action. In the (unlikely) cases where the cpu does not
    run the net_rx_action() handler fast enough, we use an IPI to raise
    NET_RX_SOFTIRQ on the remote cpu.

    Also, we do not bother draining the per-cpu list from dev_cpu_dead().
    This is because skbs in this list have no requirement on how fast
    they should be freed.

    Note that we can add in the future a small per-cpu cache
    if we see any contention on sd->defer_lock.

    Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
    and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
    page recycling strategy used by NIC driver (its page pool capacity
    being too small compared to number of skbs/pages held in sockets
    receive queues)

    Note that this tuning was only done to demonstrate worse
    conditions for skb freeing for this particular test.
    These conditions can happen in more general production workload.

    10 runs of one TCP_STREAM flow

    Before:
    Average throughput: 49685 Mbit.

    Kernel profiles on cpu running user thread recvmsg() show high cost for
    skb freeing related functions (*)

        57.81%  [kernel]       [k] copy_user_enhanced_fast_string
    (*) 12.87%  [kernel]       [k] skb_release_data
    (*)  4.25%  [kernel]       [k] __free_one_page
    (*)  3.57%  [kernel]       [k] __list_del_entry_valid
         1.85%  [kernel]       [k] __netif_receive_skb_core
         1.60%  [kernel]       [k] __skb_datagram_iter
    (*)  1.59%  [kernel]       [k] free_unref_page_commit
    (*)  1.16%  [kernel]       [k] __slab_free
         1.16%  [kernel]       [k] _copy_to_iter
    (*)  1.01%  [kernel]       [k] kfree
    (*)  0.88%  [kernel]       [k] free_unref_page
         0.57%  [kernel]       [k] ip6_rcv_core
         0.55%  [kernel]       [k] ip6t_do_table
         0.54%  [kernel]       [k] flush_smp_call_function_queue
    (*)  0.54%  [kernel]       [k] free_pcppages_bulk
         0.51%  [kernel]       [k] llist_reverse_order
         0.38%  [kernel]       [k] process_backlog
    (*)  0.38%  [kernel]       [k] free_pcp_prepare
         0.37%  [kernel]       [k] tcp_recvmsg_locked
    (*)  0.37%  [kernel]       [k] __list_add_valid
         0.34%  [kernel]       [k] sock_rfree
         0.34%  [kernel]       [k] _raw_spin_lock_irq
    (*)  0.33%  [kernel]       [k] __page_cache_release
         0.33%  [kernel]       [k] tcp_v6_rcv
    (*)  0.33%  [kernel]       [k] __put_page
    (*)  0.29%  [kernel]       [k] __mod_zone_page_state
         0.27%  [kernel]       [k] _raw_spin_lock

    After patch:
    Average throughput: 73076 Mbit.

    Kernel profiles on cpu running user thread recvmsg() looks better:

        81.35%  [kernel]       [k] copy_user_enhanced_fast_string
         1.95%  [kernel]       [k] _copy_to_iter
         1.95%  [kernel]       [k] __skb_datagram_iter
         1.27%  [kernel]       [k] __netif_receive_skb_core
         1.03%  [kernel]       [k] ip6t_do_table
         0.60%  [kernel]       [k] sock_rfree
         0.50%  [kernel]       [k] tcp_v6_rcv
         0.47%  [kernel]       [k] ip6_rcv_core
         0.45%  [kernel]       [k] read_tsc
         0.44%  [kernel]       [k] _raw_spin_lock_irqsave
         0.37%  [kernel]       [k] _raw_spin_lock
         0.37%  [kernel]       [k] native_irq_return_iret
         0.33%  [kernel]       [k] __inet6_lookup_established
         0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
         0.29%  [kernel]       [k] tcp_rcv_established
         0.29%  [kernel]       [k] llist_reverse_order

    v2: kdoc issue (kernel bots)
        do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
        replace the sk_buff_head with a single-linked list (Jakub)
        add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:25 -03:00
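
The scheme described above is easiest to see as a hand-off: a consumer does not free the buffer itself, it pushes it onto a small per-owner list, and the owner drains that list at a convenient point in its own loop. The sketch below is a user-space analogue under those assumptions (per-thread instead of per-cpu, a mutex instead of sd->defer_lock, and no IPI/NET_RX_SOFTIRQ kick); all names in it are invented for the example.

```c
/* User-space analogue of the per-owner deferred-free idea; not kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct buf {
	struct buf *next;
	int owner;		/* index of the thread that allocated it */
	char data[256];
};

struct defer_list {
	pthread_mutex_t lock;
	struct buf *head;	/* singly-linked list of buffers to free */
};

static struct defer_list deferred[2] = {
	{ PTHREAD_MUTEX_INITIALIZER, NULL },
	{ PTHREAD_MUTEX_INITIALIZER, NULL },
};

/* Consumer side: hand the buffer back to its owner's deferral list. */
static void buf_attempt_defer_free(struct buf *b)
{
	struct defer_list *dl = &deferred[b->owner];

	pthread_mutex_lock(&dl->lock);
	b->next = dl->head;
	dl->head = b;
	pthread_mutex_unlock(&dl->lock);
}

/* Owner side: drain and free everything handed back to us. */
static void drain_deferred(int owner)
{
	struct defer_list *dl = &deferred[owner];
	struct buf *b;

	pthread_mutex_lock(&dl->lock);
	b = dl->head;
	dl->head = NULL;
	pthread_mutex_unlock(&dl->lock);

	while (b) {
		struct buf *next = b->next;

		free(b);	/* freed on behalf of the thread that allocated it */
		b = next;
	}
}

int main(void)
{
	/* Single-threaded walk-through of the hand-off, for brevity. */
	struct buf *b = calloc(1, sizeof(*b));

	b->owner = 0;
	buf_attempt_defer_free(b);	/* "consumer" gives it back */
	drain_deferred(0);		/* "owner" frees it later */
	printf("deferred free drained\n");
	return 0;
}
```
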
Wander Lairson Costa 7442869e56 tcp: do not call tcp_cleanup_rbuf() if we have a backlog
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 29fbc26e6dfc7be351c23261938de3f93f5cde57
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:48 2021 -0800

    tcp: do not call tcp_cleanup_rbuf() if we have a backlog

    Under pressure, tcp recvmsg() has logic to process the socket backlog,
    but calls tcp_cleanup_rbuf() right before.

    Avoiding sending ACK right before processing new segments makes
    a lot of sense, as this decrease the number of ACK packets,
    with no impact on effective ACK clocking.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:25 -03:00
Wander Lairson Costa fd0c645742 tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit ebdc1a0309629e71e5910b353e6b005f022ce171
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jan 20 04:45:30 2022 -0800

    tcp: add a missing sk_defer_free_flush() in tcp_splice_read()

    Without it, splice users can hit the warning
    added in commit 79074a72d335 ("net: Flush deferred skb free on socket destroy")

    Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
    Fixes: 79074a72d335 ("net: Flush deferred skb free on socket destroy")
    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Gal Pressman <gal@nvidia.com>
    Link: https://lore.kernel.org/r/20220120124530.925607-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:25 -03:00
Wander Lairson Costa 19b7cb57b3 tcp: defer skb freeing after socket lock is released
JIRA: https://issues.redhat.com/browse/RHEL-9145

Conflicts:
* include/net/tcp.h: we already have
  7a26dc9e7b43 ("net: tcp: add skb drop reasons to tcp_add_backlog()")
* net/ipv4/tcp.c: we already have
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf

commit f35f821935d8df76f9c92e2431a225bdff938169
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:46 2021 -0800

    tcp: defer skb freeing after socket lock is released

    tcp recvmsg() (or rx zerocopy) spends a fair amount of time
    freeing skbs after their payload has been consumed.

    A typical ~64KB GRO packet has to release ~45 page
    references, eventually going to page allocator
    for each of them.

    Currently, this freeing is performed while socket lock
    is held, meaning that there is a high chance that
    BH handler has to queue incoming packets to tcp socket backlog.

    This can cause additional latencies, because the user
    thread has to process the backlog at release_sock() time,
    and while doing so, additional frames can be added
    by BH handler.

    This patch adds logic to defer these frees after socket
    lock is released, or directly from BH handler if possible.

    Being able to free these skbs from BH handler helps a lot,
    because this avoids the usual alloc/free asymmetry,
    when the BH handler and the user thread do not run on the same cpu or
    NUMA node.

    One cpu can now be fully utilized for the kernel->user copy,
    and another cpu is handling BH processing and skb/page
    allocs/frees (assuming RFS is not forcing use of a single CPU)

    Tested:
     100Gbit NIC
     Max throughput for one TCP_STREAM flow, over 10 runs

    MTU : 1500
    Before: 55 Gbit
    After:  66 Gbit

    MTU : 4096+(headers)
    Before: 82 Gbit
    After:  95 Gbit

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:20 -03:00
Lucas Zampieri 55f96777fb Merge: net: backport visibility improvements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765

JIRA: https://issues.redhat.com/browse/RHEL-48648  
  
Various visibility improvements; mainly around drop reasons, reset reasons and improved tracepoints this time.
  
Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-12 16:18:50 +00:00
Lucas Zampieri cb92e2e4c6 Merge: net: Optimize cacheline consumption of core networking structs
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4307

JIRA: https://issues.redhat.com/browse/RHEL-30902  
Tested: manual testing and preliminary LNST run show improvement in some  
tests and no regressions.  
  
The fields that the rx and tx paths use were placed all over the core  
networking structs. Reorganize these structs so the fields of each  
struct that are read/written in rx/tx paths are placed close to each  
other to reduce the number of cache lines used.  
  
Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-07 16:49:46 +00:00
Antoine Tenart aef83a52dd rstreason: prepare for active reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commit e13ec3da05d1 ("tcp:
  annotate lockless access to sk->sk_err") in c9s.

commit 5691276b39daf90294c6a81fb6d62d667f634c92
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:36 2024 +0800

    rstreason: prepare for active reset

    Like what we did to passive reset:
    only passing possible reset reason in each active reset path.

    No functional changes.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Felix Maurer d6374b3821 tcp: move tp->scaling_ratio to tcp_sock_read_txrx group
JIRA: https://issues.redhat.com/browse/RHEL-30902

commit 119ff04864a24470b1e531bb53e5c141aa8fefb0
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 8 14:43:21 2024 +0000

    tcp: move tp->scaling_ratio to tcp_sock_read_txrx group
    
    tp->scaling_ratio is a read mostly field, used in rx and tx fast paths.
    
    Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Cc: Wei Wang <weiwan@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:19 +02:00
Felix Maurer 98c09081cb tcp: reorganize tcp_sock fast path variables
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/tcp.h: Context difference because tcp_usec_ts is missing
  due to missing 614e8316aa4c ("tcp: add support for usec resolution in TCP
  TS values")
Omitted-fix: 666a877deab2 ("tcp: move tp->tcp_usec_ts to tcp_sock_read_txrx group")
  This field was never backported, see conflicts.

commit d5fed5addb2b6bc13035de4338b7ea2052a2e006
Author: Coco Li <lixiaoyan@google.com>
Date:   Mon Dec 4 20:12:31 2023 +0000

    tcp: reorganize tcp_sock fast path variables

    The variables are organized according in the following way:

    - TX read-mostly hotpath cache lines
    - TXRX read-mostly hotpath cache lines
    - RX read-mostly hotpath cache lines
    - TX read-write hotpath cache line
    - TXRX read-write hotpath cache line
    - RX read-write hotpath cache line

    Fastpath cachelines end after rcvq_space.

    Cache line boundaries are enforced only between read-mostly and
    read-write. That is, if read-mostly tx cachelines bleed into
    read-mostly txrx cachelines, we do not care. We care about the
    boundaries between read and write cachelines because we want
    to prevent false sharing.

    Fast path variables span cache lines before change: 12
    Fast path variables span cache lines after change: 8

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Coco Li <lixiaoyan@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20231204201232.520025-3-lixiaoyan@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:18 +02:00
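
The grouping rule in the commit above is the standard false-sharing precaution: keep read-mostly hot fields and read-write hot fields on separate cache lines, so that writers do not keep invalidating the lines readers depend on. A toy illustration of the principle (the struct is not the real tcp_sock layout; the field names merely echo it):

```c
/* Toy demonstration of cache-line grouping to avoid false sharing. */
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64

struct toy_sock {
	/* read-mostly hot fields: written rarely, read in every fast path */
	alignas(CACHELINE) unsigned int mss_cache;
	unsigned int window_clamp;

	/* read-write hot fields: updated on every packet; keeping them on
	 * their own cache line avoids false sharing with the group above */
	alignas(CACHELINE) unsigned long bytes_received;
	unsigned int rcv_nxt;
};

int main(void)
{
	printf("read-mostly group starts at offset %zu\n",
	       offsetof(struct toy_sock, mss_cache));
	printf("read-write  group starts at offset %zu\n",
	       offsetof(struct toy_sock, bytes_received));
	return 0;
}
```
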
Florian Westphal bd2a0fb2c5 tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
JIRA: https://issues.redhat.com/browse/RHEL-39833
Upstream Status: commit 94062790aedb
CVE: CVE-2024-36905

commit 94062790aedb505bdda209b10bea47b294d6394f
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 1 12:54:48 2024 +0000

    tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets

    TCP_SYN_RECV state is really special, it is only used by
    cross-syn connections, mostly used by fuzzers.

    In the following crash [1], syzbot managed to trigger a divide
    by zero in tcp_rcv_space_adjust()

    A socket makes the following state transitions,
    without ever calling tcp_init_transfer(),
    meaning tcp_init_buffer_space() is also not called.

             TCP_CLOSE
    connect()
             TCP_SYN_SENT
             TCP_SYN_RECV
    shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
             TCP_FIN_WAIT1

    To fix this issue, change tcp_shutdown() to not
    perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
    which makes no sense anyway.

    When tcp_rcv_state_process() later changes socket state
    from TCP_SYN_RECV to TCP_ESTABLISHED, then look at
    sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
    and send a FIN packet from a sane socket state.

    This means tcp_send_fin() can now be called from BH
    context, and must use GFP_ATOMIC allocations.

    [1]
    divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
    CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
     RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
    Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
    RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
    RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
    R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
    R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
    FS:  00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
    Call Trace:
     <TASK>
      tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
      tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
      inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
      sock_recvmsg_nosec net/socket.c:1046 [inline]
      sock_recvmsg+0x109/0x280 net/socket.c:1068
      ____sys_recvmsg+0x1db/0x470 net/socket.c:2803
      ___sys_recvmsg net/socket.c:2845 [inline]
      do_recvmmsg+0x474/0xae0 net/socket.c:2939
      __sys_recvmmsg net/socket.c:3018 [inline]
      __do_sys_recvmmsg net/socket.c:3041 [inline]
      __se_sys_recvmmsg net/socket.c:3034 [inline]
      __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7faeb6363db9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
    RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
    RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
    R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-06-07 15:21:41 +02:00
Lucas Zampieri 3dce9ca7d2 Merge: MM: proactive fixes for 9.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4047

The following MR is part of the MM SSTs plan for updating the MM codebase to 6.4 for RHEL9.5. As part of this effort, we have one MR for each v6.x upstream update without the follow-up fixes.

This MR serves to ensure we are maintaining stability by utilizing available upstream fixes for the commits we have in the MM codebase. 

The rough structure of this MR is as follows:
```
The first set of patches (1-28) are missing patches from <v6.1
The second set of patches (29-86) are the fixes that were omitted from the v6.1-v6.4 updates
The third set of patches (87-129) are fixes from v6.4+ that are marked as STABLE patches
The fourth set of patches (130-171) are other fixes that are not stable patches and affect previous RHEL releases, or are fixes that I missed from step 2
```
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3951

JIRA: https://issues.redhat.com/browse/RHEL-5619

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:16:58 +00:00
Lucas Zampieri 1eb3817020 Merge: mm: update to 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3803

JIRA: https://issues.redhat.com/browse/RHEL-27740    
Depends: !3738    
    
Like the 6.1, 6.2 and 6.4 backports, we're not backporting fixes: instead
it'll be done as a follow-up MR in order to prevent wasting time merging
out-of-order fixes.
    
LTP failures on mremap06 and vma02 are fixed by a 6.4 patch    
(7e7757876f25) that will be included by the 6.4 backport (!3951)    
    
Omitted-fix: 443ed4c302fff6a26af980300463343a7adc9ee8    
Omitted-fix: 10f4c9b9a33b7df000f74fa0d896351fb1a61e6a    
Omitted-fix: 95a301eefa82057571207edd06ea36218985a75e    
Omitted-fix: a101482421a318369eef2d0e03f2fcb40a47abad    
Omitted-fix: 6c54312f9689fbe27c70db5d42eebd29d04b672e    
Omitted-fix: 6f74c0ec2095335158015ce29b708e775b9cea3a    
Omitted-fix: c643e6ebedb435bcf863001f5e69a578f2658055    
Omitted-fix: 77795f900e2a07c1cbedc375789aefb43843b6c2    
Omitted-fix: 2658f94d679243209889cdfa8de3743cde1abea9    
Omitted-fix: 7e7757876f258d99266e7b3c559639289a2a45fe    
Omitted-fix: 9425c591e06a9ab27a145ba655fb50532cf0bcc9    
Omitted-fix: d1adb25df7111de83b64655a80b5a135adbded61    
Omitted-fix: 4d4b6d66db63ceed399f1fb1a4b24081d2590eb1    
Omitted-fix: a259945efe6ada94087ef666e9b38f8e34ea34ba    
Omitted-fix: 00ca0f2e86bf40b016a646e6323a8941a09cf106    
    
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Lucas Zampieri <lzampier@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-13 12:46:48 +00:00
Patrick Talbert 14d069c085 Merge: tcp: stable backport for 9.5 from phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4004

JIRA: https://issues.redhat.com/browse/RHEL-32164
JIRA: https://issues.redhat.com/browse/RHEL-29496
CVE: CVE-2024-26640
Upstream Status: All mainline in net.git.

A bunch of stable backports from upstream addressing races and
edge conditions, and adding missing sanity checks.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2024-05-03 12:43:30 +02:00
Nico Pache dc2f01811e tcp: Use per-vma locking for receive zerocopy
commit 7a7f094635349a7d0314364ad50bdeb770b6df4f
Author: Arjun Roy <arjunroy@google.com>
Date:   Fri Jun 16 12:34:27 2023 -0700

    tcp: Use per-vma locking for receive zerocopy

    Per-VMA locking allows us to lock a struct vm_area_struct without
    taking the process-wide mmap lock in read mode.

    Consider a process workload where the mmap lock is taken constantly in
    write mode. In this scenario, all zerocopy receives are periodically
    blocked during that period of time - though in principle, the memory
    ranges being used by TCP are not touched by the operations that need
    the mmap write lock. This results in performance degradation.

    Now consider another workload where the mmap lock is never taken in
    write mode, but there are many TCP connections using receive zerocopy
    that are concurrently receiving. These connections all take the mmap
    lock in read mode, but this does induce a lot of contention and atomic
    ops for this process-wide lock. This results in additional CPU
    overhead caused by contending on the cache line for this lock.

    However, with per-vma locking, both of these problems can be avoided.

    As a test, I ran an RPC-style request/response workload with 4KB
    payloads and receive zerocopy enabled, with 100 simultaneous TCP
    connections. I measured perf cycles within the
    find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
    without per-vma locking enabled.

    When using process-wide mmap semaphore read locking, about 1% of
    measured perf cycles were within this path. With per-VMA locking, this
    value dropped to about 0.45%.

    Signed-off-by: Arjun Roy <arjunroy@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:25 -06:00
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
Aristeu Rozanski 20dd56698e mm: remove zap_page_range and create zap_vma_pages
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped RISCV changes, and due to missing b59c9dc4d9d47b

commit e9adcfecf572fcfaa9f8525904cf49c709974f73
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Tue Jan 3 16:27:32 2023 -0800

    mm: remove zap_page_range and create zap_vma_pages

    zap_page_range was originally designed to unmap pages within an address
    range that could span multiple vmas.  While working on [1], it was
    discovered that all callers of zap_page_range pass a range entirely within
    a single vma.  In addition, the mmu notification call within zap_page
    range does not correctly handle ranges that span multiple vmas.  When
    crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
    the new vma should be made.

    Instead of fixing zap_page_range, do the following:
    - Create a new routine zap_vma_pages() that will remove all pages within
      the passed vma.  Most users of zap_page_range pass the entire vma and
      can use this new routine.
    - For callers of zap_page_range not passing the entire vma, instead call
      zap_page_range_single().
    - Remove zap_page_range.

    [1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
    Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Acked-by: Heiko Carstens <hca@linux.ibm.com>    [s390]
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Paolo Abeni a3a3caa33a tcp: properly terminate timers for kernel sockets
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1

Upstream commit:
commit 151c9c724d05d5b0dd8acd3e11cb69ef1f2dbada
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Mar 22 13:57:32 2024 +0000

    tcp: properly terminate timers for kernel sockets

    We had various syzbot reports about tcp timers firing after
    the corresponding netns has been dismantled.

    Fortunately Josef Bacik could trigger the issue more often,
    and could test a patch I wrote two years ago.

    When TCP sockets are closed, we call inet_csk_clear_xmit_timers()
    to 'stop' the timers.

    inet_csk_clear_xmit_timers() can be called from any context,
    including when socket lock is held.
    This is the reason it uses sk_stop_timer(), aka del_timer().
    This means that ongoing timers might finish much later.

    For user sockets, this is fine because each running timer
    holds a reference on the socket, and the user socket holds
    a reference on the netns.

    For kernel sockets, we risk that the netns is freed before
    timer can complete, because kernel sockets do not hold
    reference on the netns.

    This patch adds an inet_csk_clear_xmit_timers_sync() function that
    uses sk_stop_timer_sync() to make sure all timers
    are terminated before the kernel socket is released.
    Modules using kernel sockets close them in their netns exit()
    handler.

    Also add a sock_not_owned_by_me() helper to get LOCKDEP
    support: inet_csk_clear_xmit_timers_sync() must not be called
    while the socket lock is held.
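
    A sketch of how such a helper could look (the timer field names below
    are assumptions based on the existing inet_connection_sock timers):

        void inet_csk_clear_xmit_timers_sync(struct sock *sk)
        {
                struct inet_connection_sock *icsk = inet_csk(sk);

                /* Running timer handlers take the socket lock, so the
                 * caller must not hold it.
                 */
                sock_not_owned_by_me(sk);

                icsk->icsk_pending = icsk->icsk_ack.pending = 0;

                /* Unlike sk_stop_timer()/del_timer(), the _sync variants
                 * wait for any running handler to finish.
                 */
                sk_stop_timer_sync(sk, &icsk->icsk_retransmit_timer);
                sk_stop_timer_sync(sk, &icsk->icsk_delack_timer);
                sk_stop_timer_sync(sk, &sk->sk_timer);
        }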

    It is very possible we can later revert commit
    3a58f13a881e ("net: rds: acquire refcount on TCP sockets"),
    which attempted to solve the issue in rds only.
    (net/smc/af_smc.c and net/mptcp/subflow.c have similar code)

    We probably can remove the check_net() tests from
    tcp_out_of_resources() and __tcp_close() in the future.

    Reported-by: Josef Bacik <josef@toxicpanda.com>
    Closes: https://lore.kernel.org/netdev/20240314210740.GA2823176@perftesting/
    Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
    Fixes: 8a68173691 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
    Link: https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Tested-by: Josef Bacik <josef@toxicpanda.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Link: https://lore.kernel.org/r/20240322135732.1535772-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16 16:56:41 +02:00
Paolo Abeni db84bcbe01 tcp: add sanity checks to rx zerocopy
JIRA: https://issues.redhat.com/browse/RHEL-32164
JIRA: https://issues.redhat.com/browse/RHEL-29496
Tested: LNST, Tier1
CVE: CVE-2024-26640

Upstream commit:
commit 577e4432f3ac810049cb7e6b71f4d96ec7c6e894
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jan 25 10:33:17 2024 +0000

    tcp: add sanity checks to rx zerocopy

    The intent of TCP rx zerocopy is to map pages initially allocated
    by NIC drivers, not pages owned by a filesystem.

    This patch adds these additional checks to can_map_frag() (see the sketch below):

    - Page must not be a compound one.
    - page->mapping must be NULL.
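
    A sketch of the helper with the new checks folded in (the pre-existing
    size/offset checks are assumptions included for context):

        static bool can_map_frag(const skb_frag_t *frag)
        {
                struct page *page;

                /* Only full, page-aligned frags are mappable. */
                if (skb_frag_size(frag) != PAGE_SIZE || skb_frag_off(frag))
                        return false;

                page = skb_frag_page(frag);

                /* New checks: reject compound pages and pages owned by a
                 * fs (page->mapping != NULL), e.g. ext4 pages looped back
                 * via sendfile().
                 */
                if (PageCompound(page) || page->mapping)
                        return false;

                return true;
        }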

    This fixes the panic reported by ZhangPeng.

    syzbot was able to loopback packets built with sendfile(),
    mapping pages owned by an ext4 file to TCP rx zerocopy.

    r3 = socket$inet_tcp(0x2, 0x1, 0x0)
    mmap(&(0x7f0000ff9000/0x4000)=nil, 0x4000, 0x0, 0x12, r3, 0x0)
    r4 = socket$inet_tcp(0x2, 0x1, 0x0)
    bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e24, @multicast1}, 0x10)
    connect$inet(r4, &(0x7f00000006c0)={0x2, 0x4e24, @empty}, 0x10)
    r5 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
        0x181e42, 0x0)
    fallocate(r5, 0x0, 0x0, 0x85b8)
    sendfile(r4, r5, 0x0, 0x8ba0)
    getsockopt$inet_tcp_TCP_ZEROCOPY_RECEIVE(r4, 0x6, 0x23,
        &(0x7f00000001c0)={&(0x7f0000ffb000/0x3000)=nil, 0x3000, 0x0, 0x0, 0x0,
        0x0, 0x0, 0x0, 0x0}, &(0x7f0000000440)=0x40)
    r6 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
        0x181e42, 0x0)

    Fixes: 93ab6cc691 ("tcp: implement mmap() for zero copy receive")
    Link: https://lore.kernel.org/netdev/5106a58e-04da-372a-b836-9d3d0bd2507b@huawei.com/T/
    Reported-and-bisected-by: ZhangPeng <zhangpeng362@huawei.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: linux-mm@vger.kernel.org
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16 16:56:39 +02:00
Paolo Abeni 8af20bee5f tcp: Add memory barrier to tcp_push()
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1

Upstream commit:
commit 7267e8dcad6b2f9fce05a6a06335d7040acbc2b6
Author: Salvatore Dipietro <dipiets@amazon.com>
Date:   Fri Jan 19 11:01:33 2024 -0800

    tcp: Add memory barrier to tcp_push()

    On CPUs with weak memory models, reads and updates performed by tcp_push()
    on the sk variables can get reordered, leaving the socket throttled when
    it should not be. The tasklet running tcp_wfree() may also not observe the
    memory updates in time and will skip flushing any packets throttled by
    tcp_push(), delaying the send. This can pathologically cause 40ms of
    extra latency due to bad interactions with delayed acks.

    Adding a memory barrier in tcp_push() removes the bug, similarly to the
    previous commit bf06200e73 ("tcp: tsq: fix nonagle handling").
    smp_mb__after_atomic() is used to avoid incurring unnecessary overhead
    on x86, which is not affected by this issue.
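
    Roughly, the barrier sits between setting TSQ_THROTTLED and re-checking
    sk_wmem_alloc in the autocork path (a sketch, not the exact hunk):

        /* avoid atomic op if TSQ_THROTTLED bit is already set */
        if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
                set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
        }
        /* TX completion may already have run before TSQ_THROTTLED was set,
         * so order the flag update against the sk_wmem_alloc read below to
         * avoid missing the wakeup; smp_mb__after_atomic() is a no-op on x86.
         */
        smp_mb__after_atomic();
        if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
                return;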

    Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
    22.04 and Apache Tomcat 9.0.83 running the basic servlet below:

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class HelloWorldServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
          throws ServletException, IOException {
            response.setContentType("text/html;charset=utf-8");
            OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
            String s = "a".repeat(3096);
            osw.write(s,0,s.length());
            osw.flush();
        }
    }

    Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
    c6i.8xlarge instance. Before the patch, an additional 40ms of latency is
    observed at the P99.99+ percentiles; with the patch, the extra latency
    disappears.

    No patch and tcp_autocorking=1
    ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
      ...
     50.000%    0.91ms
     75.000%    1.13ms
     90.000%    1.46ms
     99.000%    1.74ms
     99.900%    1.89ms
     99.990%   41.95ms  <<< 40+ ms extra latency
     99.999%   48.32ms
    100.000%   48.96ms

    With patch and tcp_autocorking=1
    ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
      ...
     50.000%    0.90ms
     75.000%    1.13ms
     90.000%    1.45ms
     99.000%    1.72ms
     99.900%    1.83ms
     99.990%    2.11ms  <<< no 40+ ms extra latency
     99.999%    2.53ms
    100.000%    2.62ms

    The patch has also been tested on x86 (m7i.2xlarge instance), which is not
    affected by this issue; there, the patch doesn't introduce any additional
    delay.

    Fixes: 7aa5470c2c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
    Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-08 18:29:27 +02:00
Paolo Abeni 9bc87f74bb tcp: fix possible freeze in tx path under memory pressure
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1
Conflicts: the out-of-order backport of upstream commit eb315a7d1396 \
  ("tcp: support externally provided ubufs") into rhel commit
  abfc92436c needs some mangling for the first chunk in
  tcp_sendmsg_locked(). The resulting code mirrors the current upstream
  one.

Upstream commit:
commit 849b425cd091e1804af964b771761cfbefbafb43
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 14 10:17:34 2022 -0700

    tcp: fix possible freeze in tx path under memory pressure

    The blamed commit only dealt with applications issuing small writes.

    The issue here is that we allow forcing memory scheduling for the sk_buff
    allocation, but we have no guarantee that sendmsg() is able to
    copy any payload into it.

    In this patch, I make sure the socket can use up to tcp_wmem[0] bytes.

    For example, if we consider tcp_wmem[0] = 4096 (default on x86),
    and initial skb->truesize being 1280, tcp_sendmsg() is able to
    copy up to 2816 bytes under memory pressure.
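
    One way to express the idea (a sketch along the lines of a
    tcp_wmem_schedule() helper; names and exact accounting are assumptions):

        static int tcp_wmem_schedule(struct sock *sk, int copy)
        {
                int left;

                if (likely(sk_wmem_schedule(sk, copy)))
                        return copy;

                /* Under memory pressure, still let the socket use up to
                 * tcp_wmem[0] bytes so that sendmsg() can make progress.
                 */
                left = sock_net(sk)->ipv4.sysctl_tcp_wmem[0] - sk->sk_wmem_queued;
                if (left > 0)
                        sk_forced_mem_schedule(sk, min(left, copy));

                return min(copy, sk->sk_forward_alloc);
        }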

    Before this patch a sendmsg() sending more than 2816 bytes
    would either block forever (if persistent memory pressure),
    or return -EAGAIN.

    For bigger MTU networks, it is advised to increase tcp_wmem[0]
    to avoid sending too small packets.

    v2: deal with zero copy paths.

    Fixes: 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Reviewed-by: Wei Wang <weiwan@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-08 18:25:02 +02:00
Ivan Vecera 4ee448db07 net: introduce include/net/rps.h
JIRA: https://issues.redhat.com/browse/RHEL-31916

Conflicts:
* net/core/dev.c
  context conflict due to missing commit 2b0cfa6e49566 ("net: add
  generic percpu page_pool allocator")
* net/core/sysctl_net_core.c
  context conflict due to missing commit 2658b5a8a4eee ("net: introduce
  struct net_hotdata")

commit 490a79faf95e705ba0ffd9ebf04a624b379e53c9
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Mar 6 16:00:30 2024 +0000

    net: introduce include/net/rps.h

    Move RPS related structures and helpers from include/linux/netdevice.h
    and include/net/sock.h to a new include file.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-05 16:03:32 +02:00
Scott Weaver d007eb89da Merge: tcp: Dump bound-only sockets in inet_diag.
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3588

JIRA: https://issues.redhat.com/browse/RHEL-21223
Upstream Status: linux.git

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: John B. Wyatt IV <jwyatt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-02-01 11:34:44 -05:00
Guillaume Nault 4d99f185a2 tcp: Dump bound-only sockets in inet_diag.
JIRA: https://issues.redhat.com/browse/RHEL-21223
Upstream Status: linux.git
Conflicts: Missing upstream commit 28044fc1d495 ("net: Add a bhash2
           table hashed by port and address"):
           Centos Stream 9 doesn't have the ->bhash2 hash table.
           Use ->bhash instead. Because ->bhash can also contain
           time-wait sockets, we have to use sock_gen_put() instead of
           plain sock_put().

commit 91051f003948432f83b5d2766eeb83b2b4993649
Author: Guillaume Nault <gnault@redhat.com>
Date:   Fri Dec 1 15:49:52 2023 +0100

    tcp: Dump bound-only sockets in inet_diag.

    Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
    that are bound but haven't yet called connect() or listen().

    The code is inspired by the ->lhash2 loop. However there's no manual
    test of the source port, since this kind of filtering is already
    handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
    at a time, to avoid running with bh disabled for too long.

    There's no TCP state for bound but otherwise inactive sockets. Such
    sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
    to only dump listening sockets, actually requests the kernel to dump
    sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
    bound-only sockets with "ss -l", we therefore need to define a new
    pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
    explicitly.

    With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
    40000, 64000, 60000, an updated version of iproute2 could work as
    follow:

      $ ss -t state bound-inactive
      Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
      0        0                0.0.0.0:40000           0.0.0.0:*
      0        0                   [::]:60000              [::]:*
      0        0                      *:64000                 *:*
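
    For illustration, a user-space sketch of the corresponding sock_diag
    request (the TCP_BOUND_INACTIVE value is an assumption, since the
    pseudo-state is not part of the pre-existing uapi headers):

        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <linux/inet_diag.h>

        #ifndef TCP_BOUND_INACTIVE
        #define TCP_BOUND_INACTIVE 13   /* assumed: follows TCP_NEW_SYN_RECV */
        #endif

        /* Dump only bound-but-inactive TCP sockets, i.e. what
         * "ss -t state bound-inactive" would request.
         */
        struct inet_diag_req_v2 req = {
                .sdiag_family   = AF_INET,
                .sdiag_protocol = IPPROTO_TCP,
                .idiag_states   = 1 << TCP_BOUND_INACTIVE,
        };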

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/b3a84ae61e19c06806eea9c602b3b66e8f0cfc81.1701362867.git.gnault@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-01-24 17:24:46 +01:00
Paolo Abeni 29e5f409c6 net: do not leave an empty skb in write queue
JIRA: https://issues.redhat.com/browse/RHEL-21432
Tested: LNST, Tier1
Conflicts: Different context around tcp_remove_empty_skb, as rhel lacks \
  upstream sendpage() refactor. Just add the relevant new info. \
  tcp_remove_empty_skb() takes a skb as the 2nd argument, as rhel lacks the \
  upstream commit 27728ba80f1e ("tcp: cleanup tcp_remove_empty_skb() use")

Upstream commit:
commit 72bf4f1767f0386970dc04726dc5bc2e3991dc19
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Oct 19 11:24:57 2023 +0000

    net: do not leave an empty skb in write queue

    Under memory stress conditions, tcp_sendmsg_locked()
    might call sk_stream_wait_memory(), thus releasing the socket lock.

    If a fresh skb has been allocated prior to this,
    we should not leave it in the write queue; otherwise
    tcp_write_xmit() could panic.
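
    Roughly, the fix is to drop the still-empty skb before sleeping for
    memory (a sketch of the wait_for_space path; the two-argument
    tcp_remove_empty_skb() form follows the conflict note above):

        /* Sketch of the wait_for_space path in tcp_sendmsg_locked(): */
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
        /* Do not leave a freshly allocated, still-empty skb on the write
         * queue: sk_stream_wait_memory() drops the socket lock, and
         * tcp_write_xmit() could later trip over the empty skb.
         */
        tcp_remove_empty_skb(sk, tcp_write_queue_tail(sk));
        if (copied)
                tcp_push(sk, flags & ~MSG_MORE, mss_now,
                         TCP_NAGLE_PUSH, size_goal);
        err = sk_stream_wait_memory(sk, &timeo);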

    This apparently does not happen often, but a future change
    in __sk_mem_raise_allocated() that Shakeel and others are
    considering would increase chances of being hurt.

    Under discussion is to remove this controversial part:

        /* Fail only if socket is _under_ its sndbuf.
         * In this case we cannot block, so that we have to fail.
         */
        if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
            /* Force charge with __GFP_NOFAIL */
            if (memcg_charge && !charged) {
                mem_cgroup_charge_skmem(sk->sk_memcg, amt,
                    gfp_memcg_charge() | __GFP_NOFAIL);
            }
            return 1;
        }

    Fixes: fdfc5c8594 ("tcp: remove empty skb from write queue in error cases")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Link: https://lore.kernel.org/r/20231019112457.1190114-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 16:31:29 +01:00
Scott Weaver 8d95883db0 Merge: io_uring: update to upstream v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3318

Update io_uring and its dependencies to upstream kernel version 6.6.

JIRA: https://issues.redhat.com/browse/RHEL-12076
JIRA: https://issues.redhat.com/browse/RHEL-14998
JIRA: https://issues.redhat.com/browse/RHEL-4447
CVE: CVE-2023-46862

Omitted-Fix: ab69838e7c75 ("io_uring/kbuf: Fix check of BID wrapping in provided buffers")
Omitted-Fix: f74c746e476b ("io_uring/kbuf: Allow the full buffer id space for provided buffers")

This is the list of new features available (includes upstream kernel versions 6.3-6.6):

    User-specified ring buffer
    Provided Buffers allocated by the kernel
    Ability to register the ring fd
    Multi-shot timeouts
    Ability to pass custom flags to the completion queue entry for ring messages

All of these features are covered by the liburing tests.

In my testing, no-mmap-inval.t failed because of a broken test.  socket-uring-cmd.t also failed because of a missing selinux policy rule.  Try running audit2allow if you see a failure in that test.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-12-16 14:38:47 -05:00
Jan Stancek 4d6cc8878b Merge: tcp: allow again tcp_disconnect() when threads are waiting
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3266

JIRA: https://issues.redhat.com/browse/RHEL-12593
Tested: vs bz reproducer

Restore the ability to cancel a pending connect() via connect(AF_UNSPEC) in another thread
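
As a reminder of the user-visible pattern being restored, a minimal sketch
(hypothetical helper; error handling omitted): one thread blocks in connect(),
and another thread aborts it by re-issuing connect() on the same fd with an
AF_UNSPEC address.

    #include <string.h>
    #include <sys/socket.h>

    /* Hypothetical helper: abort a connect() pending on fd in another thread. */
    static int cancel_pending_connect(int fd)
    {
            struct sockaddr sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_family = AF_UNSPEC;   /* disconnect/abort the pending connect */
            return connect(fd, &sa, sizeof(sa));
    }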

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-24 07:31:07 +01:00