Commit Graph

Rado Vrbovsky 81ce48e690 Merge: mptcp: phase-1 backports for RHEL-9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449

JIRA: https://issues.redhat.com/browse/RHEL-62871  
JIRA: https://issues.redhat.com/browse/RHEL-58839  
JIRA: https://issues.redhat.com/browse/RHEL-66083  
JIRA: https://issues.redhat.com/browse/RHEL-66074  
CVE: CVE-2024-46711  
CVE: CVE-2024-45009  
CVE: CVE-2024-45010  
Upstream Status: All mainline in net.git  
Tested: kselftest  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:18:31 +00:00
Davide Caratti 8abf5b2c25 tcp: annotate data-races around tp->notsent_lowat
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 1aeb87bc1440c5447a7fa2d6e3c2cca52cbd206b

commit 1aeb87bc1440c5447a7fa2d6e3c2cca52cbd206b
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:55 2023 +0000

    tcp: annotate data-races around tp->notsent_lowat

    tp->notsent_lowat can be read locklessly from do_tcp_getsockopt()
    and tcp_poll().

    Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-10-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti eddb55b003 tcp: annotate data-races around tp->keepalive_probes
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 6e5e1de616bf5f3df1769abc9292191dfad9110a

commit 6e5e1de616bf5f3df1769abc9292191dfad9110a
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:51 2023 +0000

    tcp: annotate data-races around tp->keepalive_probes

    do_tcp_getsockopt() reads tp->keepalive_probes while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-6-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 2f8df3a24e tcp: annotate data-races around tp->keepalive_intvl
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 5ecf9d4f52ff2f1d4d44c9b68bc75688e82f13b4

commit 5ecf9d4f52ff2f1d4d44c9b68bc75688e82f13b4
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:50 2023 +0000

    tcp: annotate data-races around tp->keepalive_intvl

    do_tcp_getsockopt() reads tp->keepalive_intvl while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-5-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
Davide Caratti 1a343a4516 tcp: annotate data-races around tp->keepalive_time
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit 4164245c76ff906c9086758e1c3f87082a7f5ef5

commit 4164245c76ff906c9086758e1c3f87082a7f5ef5
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:49 2023 +0000

    tcp: annotate data-races around tp->keepalive_time

    do_tcp_getsockopt() reads tp->keepalive_time while another cpu
    might change its value.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-4-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
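
Editor's note: the four annotation backports above all apply the same READ_ONCE()/WRITE_ONCE() pattern. Below is a minimal userspace sketch of that pattern; the macros are simplified stand-ins for the kernel's, and the struct and function names are invented for illustration only.

```
#include <stdio.h>

/* Simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(): a
 * volatile access forces exactly one load/store and prevents tearing or
 * refetching, which is what the annotations guarantee for lockless
 * readers such as do_tcp_getsockopt() and tcp_poll(). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Illustrative only -- not the real struct tcp_sock layout. */
struct tcp_opts {
	unsigned int keepalive_time;
	unsigned int notsent_lowat;
};

/* setsockopt() path: the writer side gets a WRITE_ONCE() annotation. */
static void set_keepalive_time(struct tcp_opts *tp, unsigned int val)
{
	WRITE_ONCE(tp->keepalive_time, val);
}

/* do_tcp_getsockopt()-style lockless reader: annotated with READ_ONCE(). */
static unsigned int get_keepalive_time(const struct tcp_opts *tp)
{
	return READ_ONCE(tp->keepalive_time);
}

int main(void)
{
	struct tcp_opts tp = { .keepalive_time = 7200, .notsent_lowat = 0 };

	set_keepalive_time(&tp, 600);
	printf("keepalive_time=%u\n", get_keepalive_time(&tp));
	return 0;
}
```
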
Rado Vrbovsky 384fd7eadc Merge: tcp: stable backports for 9.6 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5444

JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Several stable backports for the TCP protocol, addressing sparse corner-case issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-01 08:13:57 +00:00
Rado Vrbovsky 570a71d7db Merge: mm: update core code to v6.6 upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252

JIRA: https://issues.redhat.com/browse/RHEL-27743  
JIRA: https://issues.redhat.com/browse/RHEL-59459    
CVE: CVE-2024-46787    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961  
  
This MR brings the RHEL9 core MM code up to upstream's v6.6 LTS level.
This work follows up on the previous v6.5 update (RHEL-27742); as such,
the bulk of this changeset comprises refactoring and clean-ups of
the internal implementation of several APIs, as it further advances the
conversion to folios and follows up on the per-VMA locking changes.

Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,
and we add a potential extra level of protection (assessment pending) to help
mitigate the kernel heap exploits dubbed "SlubStick".

Follow-up fixes are omitted from this series either because they are irrelevant to
the bits we support on RHEL or because they depend on bigger changesets introduced
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.

Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")    
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")   
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")    
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")    
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")    
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")    
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")    
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")    
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")    
    
Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-30 07:22:28 +00:00
Paolo Abeni 1758229dfc tcp: check skb is non-NULL in tcp_rto_delta_us()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit c8770db2d54437a5f49417ae7b46f7de23d14db6
Author: Josh Hunt <johunt@akamai.com>
Date:   Tue Sep 10 15:08:22 2024 -0400

    tcp: check skb is non-NULL in tcp_rto_delta_us()

    We have some machines running stock Ubuntu 20.04.6 (their 5.4.0-174-generic
    kernel) that are running ceph and recently hit a NULL pointer dereference in
    tcp_rearm_rto(). Initially we hit it from the TLP path, but later we also
    saw it getting hit from the RACK case as well. Here are examples of the oops
    messages we saw in each of those cases:

    Jul 26 15:05:02 rx [11061395.780353] BUG: kernel NULL pointer dereference, address: 0000000000000020
    Jul 26 15:05:02 rx [11061395.787572] #PF: supervisor read access in kernel mode
    Jul 26 15:05:02 rx [11061395.792971] #PF: error_code(0x0000) - not-present page
    Jul 26 15:05:02 rx [11061395.798362] PGD 0 P4D 0
    Jul 26 15:05:02 rx [11061395.801164] Oops: 0000 [#1] SMP NOPTI
    Jul 26 15:05:02 rx [11061395.805091] CPU: 0 PID: 9180 Comm: msgr-worker-1 Tainted: G W 5.4.0-174-generic #193-Ubuntu
    Jul 26 15:05:02 rx [11061395.814996] Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023
    Jul 26 15:05:02 rx [11061395.825952] RIP: 0010:tcp_rearm_rto+0xe4/0x160
    Jul 26 15:05:02 rx [11061395.830656] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3
    Jul 26 15:05:02 rx [11061395.849665] RSP: 0018:ffffb75d40003e08 EFLAGS: 00010246
    Jul 26 15:05:02 rx [11061395.855149] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000
    Jul 26 15:05:02 rx [11061395.862542] RDX: 0000000062177c30 RSI: 000000000000231c RDI: ffff9874ad283a60
    Jul 26 15:05:02 rx [11061395.869933] RBP: ffffb75d40003e20 R08: 0000000000000000 R09: ffff987605e20aa8
    Jul 26 15:05:02 rx [11061395.877318] R10: ffffb75d40003f00 R11: ffffb75d4460f740 R12: ffff9874ad283900
    Jul 26 15:05:02 rx [11061395.884710] R13: ffff9874ad283a60 R14: ffff9874ad283980 R15: ffff9874ad283d30
    Jul 26 15:05:02 rx [11061395.892095] FS: 00007f1ef4a2e700(0000) GS:ffff987605e00000(0000) knlGS:0000000000000000
    Jul 26 15:05:02 rx [11061395.900438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul 26 15:05:02 rx [11061395.906435] CR2: 0000000000000020 CR3: 0000003e450ba003 CR4: 0000000000760ef0
    Jul 26 15:05:02 rx [11061395.913822] PKRU: 55555554
    Jul 26 15:05:02 rx [11061395.916786] Call Trace:
    Jul 26 15:05:02 rx [11061395.919488]
    Jul 26 15:05:02 rx [11061395.921765] ? show_regs.cold+0x1a/0x1f
    Jul 26 15:05:02 rx [11061395.925859] ? __die+0x90/0xd9
    Jul 26 15:05:02 rx [11061395.929169] ? no_context+0x196/0x380
    Jul 26 15:05:02 rx [11061395.933088] ? ip6_protocol_deliver_rcu+0x4e0/0x4e0
    Jul 26 15:05:02 rx [11061395.938216] ? ip6_sublist_rcv_finish+0x3d/0x50
    Jul 26 15:05:02 rx [11061395.943000] ? __bad_area_nosemaphore+0x50/0x1a0
    Jul 26 15:05:02 rx [11061395.947873] ? bad_area_nosemaphore+0x16/0x20
    Jul 26 15:05:02 rx [11061395.952486] ? do_user_addr_fault+0x267/0x450
    Jul 26 15:05:02 rx [11061395.957104] ? ipv6_list_rcv+0x112/0x140
    Jul 26 15:05:02 rx [11061395.961279] ? __do_page_fault+0x58/0x90
    Jul 26 15:05:02 rx [11061395.965458] ? do_page_fault+0x2c/0xe0
    Jul 26 15:05:02 rx [11061395.969465] ? page_fault+0x34/0x40
    Jul 26 15:05:02 rx [11061395.973217] ? tcp_rearm_rto+0xe4/0x160
    Jul 26 15:05:02 rx [11061395.977313] ? tcp_rearm_rto+0xe4/0x160
    Jul 26 15:05:02 rx [11061395.981408] tcp_send_loss_probe+0x10b/0x220
    Jul 26 15:05:02 rx [11061395.985937] tcp_write_timer_handler+0x1b4/0x240
    Jul 26 15:05:02 rx [11061395.990809] tcp_write_timer+0x9e/0xe0
    Jul 26 15:05:02 rx [11061395.994814] ? tcp_write_timer_handler+0x240/0x240
    Jul 26 15:05:02 rx [11061395.999866] call_timer_fn+0x32/0x130
    Jul 26 15:05:02 rx [11061396.003782] __run_timers.part.0+0x180/0x280
    Jul 26 15:05:02 rx [11061396.008309] ? recalibrate_cpu_khz+0x10/0x10
    Jul 26 15:05:02 rx [11061396.012841] ? native_x2apic_icr_write+0x30/0x30
    Jul 26 15:05:02 rx [11061396.017718] ? lapic_next_event+0x21/0x30
    Jul 26 15:05:02 rx [11061396.021984] ? clockevents_program_event+0x8f/0xe0
    Jul 26 15:05:02 rx [11061396.027035] run_timer_softirq+0x2a/0x50
    Jul 26 15:05:02 rx [11061396.031212] __do_softirq+0xd1/0x2c1
    Jul 26 15:05:02 rx [11061396.035044] do_softirq_own_stack+0x2a/0x40
    Jul 26 15:05:02 rx [11061396.039480]
    Jul 26 15:05:02 rx [11061396.041840] do_softirq.part.0+0x46/0x50
    Jul 26 15:05:02 rx [11061396.046022] __local_bh_enable_ip+0x50/0x60
    Jul 26 15:05:02 rx [11061396.050460] _raw_spin_unlock_bh+0x1e/0x20
    Jul 26 15:05:02 rx [11061396.054817] nf_conntrack_tcp_packet+0x29e/0xbe0 [nf_conntrack]
    Jul 26 15:05:02 rx [11061396.060994] ? get_l4proto+0xe7/0x190 [nf_conntrack]
    Jul 26 15:05:02 rx [11061396.066220] nf_conntrack_in+0xe9/0x670 [nf_conntrack]
    Jul 26 15:05:02 rx [11061396.071618] ipv6_conntrack_local+0x14/0x20 [nf_conntrack]
    Jul 26 15:05:02 rx [11061396.077356] nf_hook_slow+0x45/0xb0
    Jul 26 15:05:02 rx [11061396.081098] ip6_xmit+0x3f0/0x5d0
    Jul 26 15:05:02 rx [11061396.084670] ? ipv6_anycast_cleanup+0x50/0x50
    Jul 26 15:05:02 rx [11061396.089282] ? __sk_dst_check+0x38/0x70
    Jul 26 15:05:02 rx [11061396.093381] ? inet6_csk_route_socket+0x13b/0x200
    Jul 26 15:05:02 rx [11061396.098346] inet6_csk_xmit+0xa7/0xf0
    Jul 26 15:05:02 rx [11061396.102263] __tcp_transmit_skb+0x550/0xb30
    Jul 26 15:05:02 rx [11061396.106701] tcp_write_xmit+0x3c6/0xc20
    Jul 26 15:05:02 rx [11061396.110792] ? __alloc_skb+0x98/0x1d0
    Jul 26 15:05:02 rx [11061396.114708] __tcp_push_pending_frames+0x37/0x100
    Jul 26 15:05:02 rx [11061396.119667] tcp_push+0xfd/0x100
    Jul 26 15:05:02 rx [11061396.123150] tcp_sendmsg_locked+0xc70/0xdd0
    Jul 26 15:05:02 rx [11061396.127588] tcp_sendmsg+0x2d/0x50
    Jul 26 15:05:02 rx [11061396.131245] inet6_sendmsg+0x43/0x70
    Jul 26 15:05:02 rx [11061396.135075] __sock_sendmsg+0x48/0x70
    Jul 26 15:05:02 rx [11061396.138994] ____sys_sendmsg+0x212/0x280
    Jul 26 15:05:02 rx [11061396.143172] ___sys_sendmsg+0x88/0xd0
    Jul 26 15:05:02 rx [11061396.147098] ? __seccomp_filter+0x7e/0x6b0
    Jul 26 15:05:02 rx [11061396.151446] ? __switch_to+0x39c/0x460
    Jul 26 15:05:02 rx [11061396.155453] ? __switch_to_asm+0x42/0x80
    Jul 26 15:05:02 rx [11061396.159636] ? __switch_to_asm+0x5a/0x80
    Jul 26 15:05:02 rx [11061396.163816] __sys_sendmsg+0x5c/0xa0
    Jul 26 15:05:02 rx [11061396.167647] __x64_sys_sendmsg+0x1f/0x30
    Jul 26 15:05:02 rx [11061396.171832] do_syscall_64+0x57/0x190
    Jul 26 15:05:02 rx [11061396.175748] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
    Jul 26 15:05:02 rx [11061396.181055] RIP: 0033:0x7f1ef692618d
    Jul 26 15:05:02 rx [11061396.184893] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ca ee ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 48 89 44 24 08 e8 fe ee ff ff 48
    Jul 26 15:05:02 rx [11061396.203889] RSP: 002b:00007f1ef4a26aa0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
    Jul 26 15:05:02 rx [11061396.211708] RAX: ffffffffffffffda RBX: 000000000000084b RCX: 00007f1ef692618d
    Jul 26 15:05:02 rx [11061396.219091] RDX: 0000000000004000 RSI: 00007f1ef4a26b10 RDI: 0000000000000275
    Jul 26 15:05:02 rx [11061396.226475] RBP: 0000000000004000 R08: 0000000000000000 R09: 0000000000000020
    Jul 26 15:05:02 rx [11061396.233859] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000084b
    Jul 26 15:05:02 rx [11061396.241243] R13: 00007f1ef4a26b10 R14: 0000000000000275 R15: 000055592030f1e8
    Jul 26 15:05:02 rx [11061396.248628] Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif input_leds joydev rndis_host cdc_ether usbnet mii ast drm_vram_helper ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp mac_hid ipmi_si ipmi_devintf ipmi_msghandler nft_ct sch_fq_codel nf_tables_set nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 mlx5_core hid_generic pci_hyperv_intf crc32_pclmul tls usbhid ahci mlxfw bnxt_en libahci hid nvme i2c_piix4 nvme_core wmi
    Jul 26 15:05:02 rx [11061396.324334] CR2: 0000000000000020
    Jul 26 15:05:02 rx [11061396.327944] ---[ end trace 68a2b679d1cfb4f1 ]---
    Jul 26 15:05:02 rx [11061396.433435] RIP: 0010:tcp_rearm_rto+0xe4/0x160
    Jul 26 15:05:02 rx [11061396.438137] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3
    Jul 26 15:05:02 rx [11061396.457144] RSP: 0018:ffffb75d40003e08 EFLAGS: 00010246
    Jul 26 15:05:02 rx [11061396.462629] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000
    Jul 26 15:05:02 rx [11061396.470012] RDX: 0000000062177c30 RSI: 000000000000231c RDI: ffff9874ad283a60
    Jul 26 15:05:02 rx [11061396.477396] RBP: ffffb75d40003e20 R08: 0000000000000000 R09: ffff987605e20aa8
    Jul 26 15:05:02 rx [11061396.484779] R10: ffffb75d40003f00 R11: ffffb75d4460f740 R12: ffff9874ad283900
    Jul 26 15:05:02 rx [11061396.492164] R13: ffff9874ad283a60 R14: ffff9874ad283980 R15: ffff9874ad283d30
    Jul 26 15:05:02 rx [11061396.499547] FS: 00007f1ef4a2e700(0000) GS:ffff987605e00000(0000) knlGS:0000000000000000
    Jul 26 15:05:02 rx [11061396.507886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul 26 15:05:02 rx [11061396.513884] CR2: 0000000000000020 CR3: 0000003e450ba003 CR4: 0000000000760ef0
    Jul 26 15:05:02 rx [11061396.521267] PKRU: 55555554
    Jul 26 15:05:02 rx [11061396.524230] Kernel panic - not syncing: Fatal exception in interrupt
    Jul 26 15:05:02 rx [11061396.530885] Kernel Offset: 0x1b200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    Jul 26 15:05:03 rx [11061396.660181] ---[ end Kernel panic - not syncing: Fatal
     exception in interrupt ]---

    After we hit this we disabled TLP by setting tcp_early_retrans to 0 and then hit the crash in the RACK case:

    Aug 7 07:26:16 rx [1006006.265582] BUG: kernel NULL pointer dereference, address: 0000000000000020
    Aug 7 07:26:16 rx [1006006.272719] #PF: supervisor read access in kernel mode
    Aug 7 07:26:16 rx [1006006.278030] #PF: error_code(0x0000) - not-present page
    Aug 7 07:26:16 rx [1006006.283343] PGD 0 P4D 0
    Aug 7 07:26:16 rx [1006006.286057] Oops: 0000 [#1] SMP NOPTI
    Aug 7 07:26:16 rx [1006006.289896] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.4.0-174-generic #193-Ubuntu
    Aug 7 07:26:16 rx [1006006.299107] Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023
    Aug 7 07:26:16 rx [1006006.309970] RIP: 0010:tcp_rearm_rto+0xe4/0x160
    Aug 7 07:26:16 rx [1006006.314584] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3
    Aug 7 07:26:16 rx [1006006.333499] RSP: 0018:ffffb42600a50960 EFLAGS: 00010246
    Aug 7 07:26:16 rx [1006006.338895] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000
    Aug 7 07:26:16 rx [1006006.346193] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff92d687ed8160
    Aug 7 07:26:16 rx [1006006.353489] RBP: ffffb42600a50978 R08: 0000000000000000 R09: 00000000cd896dcc
    Aug 7 07:26:16 rx [1006006.360786] R10: ffff92dc3404f400 R11: 0000000000000001 R12: ffff92d687ed8000
    Aug 7 07:26:16 rx [1006006.368084] R13: ffff92d687ed8160 R14: 00000000cd896dcc R15: 00000000cd8fca81
    Aug 7 07:26:16 rx [1006006.375381] FS: 0000000000000000(0000) GS:ffff93158ad40000(0000) knlGS:0000000000000000
    Aug 7 07:26:16 rx [1006006.383632] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Aug 7 07:26:16 rx [1006006.389544] CR2: 0000000000000020 CR3: 0000003e775ce006 CR4: 0000000000760ee0
    Aug 7 07:26:16 rx [1006006.396839] PKRU: 55555554
    Aug 7 07:26:16 rx [1006006.399717] Call Trace:
    Aug 7 07:26:16 rx [1006006.402335]
    Aug 7 07:26:16 rx [1006006.404525] ? show_regs.cold+0x1a/0x1f
    Aug 7 07:26:16 rx [1006006.408532] ? __die+0x90/0xd9
    Aug 7 07:26:16 rx [1006006.411760] ? no_context+0x196/0x380
    Aug 7 07:26:16 rx [1006006.415599] ? __bad_area_nosemaphore+0x50/0x1a0
    Aug 7 07:26:16 rx [1006006.420392] ? _raw_spin_lock+0x1e/0x30
    Aug 7 07:26:16 rx [1006006.424401] ? bad_area_nosemaphore+0x16/0x20
    Aug 7 07:26:16 rx [1006006.428927] ? do_user_addr_fault+0x267/0x450
    Aug 7 07:26:16 rx [1006006.433450] ? __do_page_fault+0x58/0x90
    Aug 7 07:26:16 rx [1006006.437542] ? do_page_fault+0x2c/0xe0
    Aug 7 07:26:16 rx [1006006.441470] ? page_fault+0x34/0x40
    Aug 7 07:26:16 rx [1006006.445134] ? tcp_rearm_rto+0xe4/0x160
    Aug 7 07:26:16 rx [1006006.449145] tcp_ack+0xa32/0xb30
    Aug 7 07:26:16 rx [1006006.452542] tcp_rcv_established+0x13c/0x670
    Aug 7 07:26:16 rx [1006006.456981] ? sk_filter_trim_cap+0x48/0x220
    Aug 7 07:26:16 rx [1006006.461419] tcp_v6_do_rcv+0xdb/0x450
    Aug 7 07:26:16 rx [1006006.465257] tcp_v6_rcv+0xc2b/0xd10
    Aug 7 07:26:16 rx [1006006.468918] ip6_protocol_deliver_rcu+0xd3/0x4e0
    Aug 7 07:26:16 rx [1006006.473706] ip6_input_finish+0x15/0x20
    Aug 7 07:26:16 rx [1006006.477710] ip6_input+0xa2/0xb0
    Aug 7 07:26:16 rx [1006006.481109] ? ip6_protocol_deliver_rcu+0x4e0/0x4e0
    Aug 7 07:26:16 rx [1006006.486151] ip6_sublist_rcv_finish+0x3d/0x50
    Aug 7 07:26:16 rx [1006006.490679] ip6_sublist_rcv+0x1aa/0x250
    Aug 7 07:26:16 rx [1006006.494779] ? ip6_rcv_finish_core.isra.0+0xa0/0xa0
    Aug 7 07:26:16 rx [1006006.499828] ipv6_list_rcv+0x112/0x140
    Aug 7 07:26:16 rx [1006006.503748] __netif_receive_skb_list_core+0x1a4/0x250
    Aug 7 07:26:16 rx [1006006.509057] netif_receive_skb_list_internal+0x1a1/0x2b0
    Aug 7 07:26:16 rx [1006006.514538] gro_normal_list.part.0+0x1e/0x40
    Aug 7 07:26:16 rx [1006006.519068] napi_complete_done+0x91/0x130
    Aug 7 07:26:16 rx [1006006.523352] mlx5e_napi_poll+0x18e/0x610 [mlx5_core]
    Aug 7 07:26:16 rx [1006006.528481] net_rx_action+0x142/0x390
    Aug 7 07:26:16 rx [1006006.532398] __do_softirq+0xd1/0x2c1
    Aug 7 07:26:16 rx [1006006.536142] irq_exit+0xae/0xb0
    Aug 7 07:26:16 rx [1006006.539452] do_IRQ+0x5a/0xf0
    Aug 7 07:26:16 rx [1006006.542590] common_interrupt+0xf/0xf
    Aug 7 07:26:16 rx [1006006.546421]
    Aug 7 07:26:16 rx [1006006.548695] RIP: 0010:native_safe_halt+0xe/0x10
    Aug 7 07:26:16 rx [1006006.553399] Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 2c 50 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 2c 50 00 fb f4 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 dd 5e 61 ff 65
    Aug 7 07:26:16 rx [1006006.572309] RSP: 0018:ffffb42600177e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffc2
    Aug 7 07:26:16 rx [1006006.580040] RAX: ffffffff8ed08b20 RBX: 0000000000000005 RCX: 0000000000000001
    Aug 7 07:26:16 rx [1006006.587337] RDX: 00000000f48eeca2 RSI: 0000000000000082 RDI: 0000000000000082
    Aug 7 07:26:16 rx [1006006.594635] RBP: ffffb42600177e90 R08: 0000000000000000 R09: 000000000000020f
    Aug 7 07:26:16 rx [1006006.601931] R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000000005
    Aug 7 07:26:16 rx [1006006.609229] R13: ffff93157deb5f00 R14: 0000000000000000 R15: 0000000000000000
    Aug 7 07:26:16 rx [1006006.616530] ? __cpuidle_text_start+0x8/0x8
    Aug 7 07:26:16 rx [1006006.620886] ? default_idle+0x20/0x140
    Aug 7 07:26:16 rx [1006006.624804] arch_cpu_idle+0x15/0x20
    Aug 7 07:26:16 rx [1006006.628545] default_idle_call+0x23/0x30
    Aug 7 07:26:16 rx [1006006.632640] do_idle+0x1fb/0x270
    Aug 7 07:26:16 rx [1006006.636035] cpu_startup_entry+0x20/0x30
    Aug 7 07:26:16 rx [1006006.640126] start_secondary+0x178/0x1d0
    Aug 7 07:26:16 rx [1006006.644218] secondary_startup_64+0xa4/0xb0
    Aug 7 07:26:17 rx [1006006.648568] Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 nft_ct amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif input_leds joydev rndis_host cdc_ether usbnet ast mii drm_vram_helper ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp mac_hid ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel nf_tables_set nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 hid_generic mlx5_core pci_hyperv_intf crc32_pclmul usbhid ahci tls mlxfw bnxt_en hid libahci nvme i2c_piix4 nvme_core wmi [last unloaded: cpuid]
    Aug 7 07:26:17 rx [1006006.726180] CR2: 0000000000000020
    Aug 7 07:26:17 rx [1006006.729718] ---[ end trace e0e2e37e4e612984 ]---

    Prior to seeing the first crash and on other machines we also see the warning in
    tcp_send_loss_probe() where packets_out is non-zero, but both transmit and retrans
    queues are empty so we know the box is seeing some accounting issue in this area:

    Jul 26 09:15:27 kernel: ------------[ cut here ]------------
    Jul 26 09:15:27 kernel: invalid inflight: 2 state 1 cwnd 68 mss 8988
    Jul 26 09:15:27 kernel: WARNING: CPU: 16 PID: 0 at net/ipv4/tcp_output.c:2605 tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 nft_ct amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif joydev input_leds rndis_host cdc_ether usbnet mii ast drm_vram_helper ttm drm_kms_he>
    Jul 26 09:15:27 kernel: CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.4.0-174-generic #193-Ubuntu
    Jul 26 09:15:27 kernel: Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023
    Jul 26 09:15:27 kernel: RIP: 0010:tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: Code: 08 26 01 00 75 e2 41 0f b6 54 24 12 41 8b 8c 24 c0 06 00 00 45 89 f0 48 c7 c7 e0 b4 20 a7 c6 05 8d 08 26 01 01 e8 4a c0 0f 00 <0f> 0b eb ba 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
    Jul 26 09:15:27 kernel: RSP: 0018:ffffb7838088ce00 EFLAGS: 00010286
    Jul 26 09:15:27 kernel: RAX: 0000000000000000 RBX: ffff9b84b5630430 RCX: 0000000000000006
    Jul 26 09:15:27 kernel: RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9b8e4621c8c0
    Jul 26 09:15:27 kernel: RBP: ffffb7838088ce18 R08: 0000000000000927 R09: 0000000000000004
    Jul 26 09:15:27 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b84b5630000
    Jul 26 09:15:27 kernel: R13: 0000000000000000 R14: 000000000000231c R15: ffff9b84b5630430
    Jul 26 09:15:27 kernel: FS: 0000000000000000(0000) GS:ffff9b8e46200000(0000) knlGS:0000000000000000
    Jul 26 09:15:27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul 26 09:15:27 kernel: CR2: 000056238cec2380 CR3: 0000003e49ede005 CR4: 0000000000760ee0
    Jul 26 09:15:27 kernel: PKRU: 55555554
    Jul 26 09:15:27 kernel: Call Trace:
    Jul 26 09:15:27 kernel: <IRQ>
    Jul 26 09:15:27 kernel: ? show_regs.cold+0x1a/0x1f
    Jul 26 09:15:27 kernel: ? __warn+0x98/0xe0
    Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: ? report_bug+0xd1/0x100
    Jul 26 09:15:27 kernel: ? do_error_trap+0x9b/0xc0
    Jul 26 09:15:27 kernel: ? do_invalid_op+0x3c/0x50
    Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: ? invalid_op+0x1e/0x30
    Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: tcp_write_timer_handler+0x1b4/0x240
    Jul 26 09:15:27 kernel: tcp_write_timer+0x9e/0xe0
    Jul 26 09:15:27 kernel: ? tcp_write_timer_handler+0x240/0x240
    Jul 26 09:15:27 kernel: call_timer_fn+0x32/0x130
    Jul 26 09:15:27 kernel: __run_timers.part.0+0x180/0x280
    Jul 26 09:15:27 kernel: ? timerqueue_add+0x9b/0xb0
    Jul 26 09:15:27 kernel: ? enqueue_hrtimer+0x3d/0x90
    Jul 26 09:15:27 kernel: ? do_error_trap+0x9b/0xc0
    Jul 26 09:15:27 kernel: ? do_invalid_op+0x3c/0x50
    Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: ? invalid_op+0x1e/0x30
    Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
    Jul 26 09:15:27 kernel: tcp_write_timer_handler+0x1b4/0x240
    Jul 26 09:15:27 kernel: tcp_write_timer+0x9e/0xe0
    Jul 26 09:15:27 kernel: ? tcp_write_timer_handler+0x240/0x240
    Jul 26 09:15:27 kernel: call_timer_fn+0x32/0x130
    Jul 26 09:15:27 kernel: __run_timers.part.0+0x180/0x280
    Jul 26 09:15:27 kernel: ? timerqueue_add+0x9b/0xb0
    Jul 26 09:15:27 kernel: ? enqueue_hrtimer+0x3d/0x90
    Jul 26 09:15:27 kernel: ? recalibrate_cpu_khz+0x10/0x10
    Jul 26 09:15:27 kernel: ? ktime_get+0x3e/0xa0
    Jul 26 09:15:27 kernel: ? native_x2apic_icr_write+0x30/0x30
    Jul 26 09:15:27 kernel: run_timer_softirq+0x2a/0x50
    Jul 26 09:15:27 kernel: __do_softirq+0xd1/0x2c1
    Jul 26 09:15:27 kernel: irq_exit+0xae/0xb0
    Jul 26 09:15:27 kernel: smp_apic_timer_interrupt+0x7b/0x140
    Jul 26 09:15:27 kernel: apic_timer_interrupt+0xf/0x20
    Jul 26 09:15:27 kernel: </IRQ>
    Jul 26 09:15:27 kernel: RIP: 0010:native_safe_halt+0xe/0x10
    Jul 26 09:15:27 kernel: Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 2c 50 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 2c 50 00 fb f4 <c3> 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 dd 5e 61 ff 65
    Jul 26 09:15:27 kernel: RSP: 0018:ffffb783801cfe70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    Jul 26 09:15:27 kernel: RAX: ffffffffa6908b20 RBX: 0000000000000010 RCX: 0000000000000001
    Jul 26 09:15:27 kernel: RDX: 000000006fc0c97e RSI: 0000000000000082 RDI: 0000000000000082
    Jul 26 09:15:27 kernel: RBP: ffffb783801cfe90 R08: 0000000000000000 R09: 0000000000000225
    Jul 26 09:15:27 kernel: R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000000010
    Jul 26 09:15:27 kernel: R13: ffff9b8e390b0000 R14: 0000000000000000 R15: 0000000000000000
    Jul 26 09:15:27 kernel: ? __cpuidle_text_start+0x8/0x8
    Jul 26 09:15:27 kernel: ? default_idle+0x20/0x140
    Jul 26 09:15:27 kernel: arch_cpu_idle+0x15/0x20
    Jul 26 09:15:27 kernel: default_idle_call+0x23/0x30
    Jul 26 09:15:27 kernel: do_idle+0x1fb/0x270
    Jul 26 09:15:27 kernel: cpu_startup_entry+0x20/0x30
    Jul 26 09:15:27 kernel: start_secondary+0x178/0x1d0
    Jul 26 09:15:27 kernel: secondary_startup_64+0xa4/0xb0
    Jul 26 09:15:27 kernel: ---[ end trace e7ac822987e33be1 ]---

    The NULL ptr deref is coming from tcp_rto_delta_us() attempting to pull an skb
    off the head of the retransmit queue and then dereferencing that skb to get the
    skb_mstamp_ns value via tcp_skb_timestamp_us(skb).

    The crash is the same one that was reported a number of years ago here:
    https://lore.kernel.org/netdev/86c0f836-9a7c-438b-d81a-839be45f1f58@gmail.com/T/#t

    and the kernel we're running has the fix which was added to resolve this issue.

    Unfortunately we've been unsuccessful so far in reproducing this problem in the
    lab and do not have the luxury of pushing out a new kernel to try and test if
    newer kernels resolve this issue at the moment. I realize this is a report
    against both an Ubuntu kernel and also an older 5.4 kernel. I have reported this
    issue to Ubuntu here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2077657
    however I feel like since this issue has possibly cropped up again it makes
    sense to build in some protection in this path (even on the latest kernel
    versions), since the code in question just blindly assumes there's a valid skb
    without testing whether it's NULL before it looks at the timestamp.

    Given we have seen crashes in this path before, and now this case, it seems
    we should protect ourselves against incorrect packets_out accounting.
    While we should fix that root cause, we should also just make sure the skb
    is not NULL before dereferencing it. Also add a warn-once here to capture
    some information if/when the problem case is hit again.

    Fixes: e1a10ef7fa ("tcp: introduce tcp_rto_delta_us() helper for xmit timer fix")
    Signed-off-by: Josh Hunt <johunt@akamai.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:16 +02:00
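
Editor's note: a kernel-style sketch of the defensive check described above, following the commit message; the guard shape and the WARN_ONCE format string are illustrative, and the exact upstream hunk in include/net/tcp.h may differ.

```
/* Sketch only: guard against a NULL rtx-queue head before reading its
 * timestamp, and warn once so the bad packets_out accounting described
 * above leaves a trace instead of a NULL pointer dereference. */
static inline s64 tcp_rto_delta_us(const struct sock *sk)
{
	const struct sk_buff *skb = tcp_rtx_queue_head(sk);
	u32 rto = inet_csk(sk)->icsk_rto;

	if (likely(skb)) {
		u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) +
					jiffies_to_usecs(rto);

		return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
	}

	/* Fall back to a plain RTO and leave a breadcrumb in the logs. */
	WARN_ONCE(1, "rtx queue empty: out=%u sacked=%u lost=%u retrans=%u",
		  tcp_sk(sk)->packets_out, tcp_sk(sk)->sacked_out,
		  tcp_sk(sk)->lost_out, tcp_sk(sk)->retrans_out);
	return jiffies_to_usecs(rto);
}
```
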
Paolo Abeni c29bc5e5ab tcp: increase the default TCP scaling ratio
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit 697a6c8cec03c2299f850fa50322641a8bf6b915
Author: Hechao Li <hli@netflix.com>
Date:   Tue Apr 9 09:43:55 2024 -0700

    tcp: increase the default TCP scaling ratio

    After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"),
    we noticed an application-level timeout due to reduced throughput.

    Before the commit, for a client that sets SO_RCVBUF to 65k, it takes
    around 22 seconds to transfer 10M data. After the commit, it takes 40
    seconds. Because our application has a 30-second timeout, this
    regression broke the application.

    The reason that it takes longer to transfer data is that
    tp->scaling_ratio is initialized to a value that results in ~0.25 of
    rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which
    translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k
    initial receive window.

    Later, even though the scaling_ratio is updated to a more accurate
    skb->len/skb->truesize, which is ~0.66 in our environment, the window
    stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not
    change together with the tp->scaling_ratio update when autotuning is
    disabled due to SO_RCVBUF. As a result, the window size is capped at the
    initial window_clamp, which is also ~0.25 * rcvbuf, and never grows
    bigger.

    Most modern applications let the kernel do autotuning, and benefit from
    the increased scaling_ratio. But there are applications such as kafka
    that have a default setting of SO_RCVBUF=64k.

    This patch increases the initial scaling_ratio from ~25% to 50% in order
    to make it backward compatible with the original default
    sysctl_tcp_adv_win_scale for applications setting SO_RCVBUF.

    Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    Signed-off-by: Hechao Li <hli@netflix.com>
    Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/netdev/20240402215405.432863-1-hli@netflix.com/
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:13 +02:00
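
Editor's note: a small, self-contained illustration of the figures quoted in the commit message above. The arithmetic is a simplification; the kernel's actual window computation involves additional rounding and header overhead, which is why the commit quotes "~28k" rather than 32k.

```
#include <stdio.h>

int main(void)
{
	/* The application sets SO_RCVBUF to 64 KiB; the kernel doubles it. */
	unsigned int rcvbuf = 2 * 65536;

	/* Initial receive window as a fraction of rcvbuf, before and after
	 * the default scaling_ratio change described above. */
	printf("~25%% scaling ratio: ~%u bytes of initial window\n", rcvbuf / 4);
	printf("~50%% scaling ratio: ~%u bytes of initial window\n", rcvbuf / 2);
	return 0;
}
```
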
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
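
Editor's note: a kernel-style sketch of the per-cpu replacement described above. The per-cpu variable name follows the commit message, but the two helper functions are invented here purely to show where the value is produced and consumed; they are not part of the upstream diff.

```
/* One per-cpu slot replaces the TCP_SKB_CB(skb)->tcp_tw_isn field; it is
 * only meaningful between tcp_timewait_state_process() returning
 * TCP_TW_SYN and the subsequent tcp_conn_request() on the same CPU. */
DEFINE_PER_CPU(u32, tcp_tw_isn);

/* TIMEWAIT side (hypothetical helper): stash the ISN instead of writing
 * it into the skb control block. */
static void stash_tw_isn(u32 isn)
{
	__this_cpu_write(tcp_tw_isn, isn);
}

/* tcp_conn_request() side (hypothetical helper): consume the ISN and
 * clear the slot, so the fast path never has to clear it. */
static u32 consume_tw_isn(void)
{
	u32 isn = __this_cpu_read(tcp_tw_isn);

	__this_cpu_write(tcp_tw_isn, 0);
	return isn;
}
```
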
Paolo Abeni 4cd846284a tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b9e810405880c99baafd550ada7043e86465396e
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:21 2024 +0000

    tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()

    tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
    out if the request socket is created by a SYN hitting a TIMEWAIT socket.

    This has been buggy for a decade; let's directly pass the information
    from tcp_conn_request().

    This is a preparatory patch to make the following one easier to review.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:07:53 +02:00
Rafael Aquini d755df6daa mm: allow per-VMA locks on file-backed VMAs
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * MAINTAINERS: minor context difference due to backport of upstream commit
      14006f1d8fa2 ("Documentations: Analyze heavily used Networking related structs")

This patch is a backport of the following upstream commit:
commit 350f6bbca1de515cd7519a33661cefc93ea06054
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jul 24 19:54:02 2023 +0100

    mm: allow per-VMA locks on file-backed VMAs

    Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
    The fault path will immediately fail in handle_mm_fault().  There may be a
    small performance reduction from this patch as a little unnecessary work
    will be done on each page fault.  See later patches for the improvement.

    Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:35 -04:00
Wander Lairson Costa e5f704d611 net: generalize skb freeing deferral to per-cpu lists
JIRA: https://issues.redhat.com/browse/RHEL-9145

Conflicts:
net/tls/tls_sw.c: we already have:
* 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
net/ipv4/tcp_ipv4.c: we already have:
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
* 7a26dc9e7b43 net: tcp: add skb drop reasons to tcp_add_backlog()

commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 22 13:12:37 2022 -0700

    net: generalize skb freeing deferral to per-cpu lists

    Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
    lock is released") helped bulk TCP flows to move the cost of skbs
    frees outside of critical section where socket lock was held.

    But for RPC traffic, or hosts with RFS enabled, the solution is far from
    being ideal.

    For RPC traffic, recvmsg() has to return to user space right after
    skb payload has been consumed, meaning that BH handler has no chance
    to pick the skb before recvmsg() thread. This issue is more visible
    with BIG TCP, as more RPC fit one skb.

    For RFS, even if BH handler picks the skbs, they are still picked
    from the cpu on which user thread is running.

    Ideally, it is better to free the skbs (and associated page frags)
    on the cpu that originally allocated them.

    This patch removes the per socket anchor (sk->defer_list) and
    instead uses a per-cpu list, which will hold more skbs per round.

    This new per-cpu list is drained at the end of net_rx_action(),
    after incoming packets have been processed, to lower latencies.

    In normal conditions, skbs are added to the per-cpu list with
    no further action. In the (unlikely) cases where the cpu does not
    run the net_rx_action() handler fast enough, we use an IPI to raise
    NET_RX_SOFTIRQ on the remote cpu.

    Also, we do not bother draining the per-cpu list from dev_cpu_dead()
    This is because skbs in this list have no requirement on how fast
    they should be freed.

    Note that we can add in the future a small per-cpu cache
    if we see any contention on sd->defer_lock.

    Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
    and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
    page recycling strategy used by NIC driver (its page pool capacity
    being too small compared to number of skbs/pages held in sockets
    receive queues)

    Note that this tuning was only done to demonstrate worse
    conditions for skb freeing for this particular test.
    These conditions can happen in more general production workload.

    10 runs of one TCP_STREAM flow

    Before:
    Average throughput: 49685 Mbit.

    Kernel profiles on cpu running user thread recvmsg() show high cost for
    skb freeing related functions (*)

        57.81%  [kernel]       [k] copy_user_enhanced_fast_string
    (*) 12.87%  [kernel]       [k] skb_release_data
    (*)  4.25%  [kernel]       [k] __free_one_page
    (*)  3.57%  [kernel]       [k] __list_del_entry_valid
         1.85%  [kernel]       [k] __netif_receive_skb_core
         1.60%  [kernel]       [k] __skb_datagram_iter
    (*)  1.59%  [kernel]       [k] free_unref_page_commit
    (*)  1.16%  [kernel]       [k] __slab_free
         1.16%  [kernel]       [k] _copy_to_iter
    (*)  1.01%  [kernel]       [k] kfree
    (*)  0.88%  [kernel]       [k] free_unref_page
         0.57%  [kernel]       [k] ip6_rcv_core
         0.55%  [kernel]       [k] ip6t_do_table
         0.54%  [kernel]       [k] flush_smp_call_function_queue
    (*)  0.54%  [kernel]       [k] free_pcppages_bulk
         0.51%  [kernel]       [k] llist_reverse_order
         0.38%  [kernel]       [k] process_backlog
    (*)  0.38%  [kernel]       [k] free_pcp_prepare
         0.37%  [kernel]       [k] tcp_recvmsg_locked
    (*)  0.37%  [kernel]       [k] __list_add_valid
         0.34%  [kernel]       [k] sock_rfree
         0.34%  [kernel]       [k] _raw_spin_lock_irq
    (*)  0.33%  [kernel]       [k] __page_cache_release
         0.33%  [kernel]       [k] tcp_v6_rcv
    (*)  0.33%  [kernel]       [k] __put_page
    (*)  0.29%  [kernel]       [k] __mod_zone_page_state
         0.27%  [kernel]       [k] _raw_spin_lock

    After patch:
    Average throughput: 73076 Mbit.

    Kernel profiles on cpu running user thread recvmsg() looks better:

        81.35%  [kernel]       [k] copy_user_enhanced_fast_string
         1.95%  [kernel]       [k] _copy_to_iter
         1.95%  [kernel]       [k] __skb_datagram_iter
         1.27%  [kernel]       [k] __netif_receive_skb_core
         1.03%  [kernel]       [k] ip6t_do_table
         0.60%  [kernel]       [k] sock_rfree
         0.50%  [kernel]       [k] tcp_v6_rcv
         0.47%  [kernel]       [k] ip6_rcv_core
         0.45%  [kernel]       [k] read_tsc
         0.44%  [kernel]       [k] _raw_spin_lock_irqsave
         0.37%  [kernel]       [k] _raw_spin_lock
         0.37%  [kernel]       [k] native_irq_return_iret
         0.33%  [kernel]       [k] __inet6_lookup_established
         0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
         0.29%  [kernel]       [k] tcp_rcv_established
         0.29%  [kernel]       [k] llist_reverse_order

    v2: kdoc issue (kernel bots)
        do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
        replace the sk_buff_head with a single-linked list (Jakub)
        add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:25 -03:00
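
Editor's note: a simplified kernel-style sketch of the deferral scheme described above. The function name matches the one the commit adds (skb_attempt_defer_free()), but the softnet_data field names, the locking, and the kick condition are assumptions based on the commit message, not the exact upstream code.

```
/* Sketch only: free locally when cheap, otherwise hand the skb back to
 * the CPU that allocated it and make sure that CPU drains its list. */
void skb_attempt_defer_free(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;	/* recorded at allocation time */
	struct softnet_data *sd;
	bool kick;

	/* No benefit in deferring to ourselves (or to an offline CPU). */
	if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
		__kfree_skb(skb);
		return;
	}

	/* Queue the skb on the allocating CPU's per-cpu defer list. */
	sd = &per_cpu(softnet_data, cpu);
	spin_lock_bh(&sd->defer_lock);
	skb->next = sd->defer_list;
	sd->defer_list = skb;
	kick = sd->defer_count++ == 0;
	spin_unlock_bh(&sd->defer_lock);

	/* If the list was empty, the remote CPU may not have NET_RX_SOFTIRQ
	 * pending: raise it there via an IPI so its next net_rx_action()
	 * run drains the list soon. */
	if (kick)
		smp_call_function_single_async(cpu, &sd->defer_csd);
}
```
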
Wander Lairson Costa 14019aa268 tcp: Add a stub for sk_defer_free_flush()
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 48cec899e357cfb92d022a9c0df6bbe72a7f6951
Author: Gal Pressman <gal@nvidia.com>
Date:   Thu Jan 20 14:34:40 2022 +0200

    tcp: Add a stub for sk_defer_free_flush()

    When compiling the kernel with CONFIG_INET disabled, the
    sk_defer_free_flush() should be defined as a nop.

    This resolves the following compilation error:
      ld: net/core/sock.o: in function `sk_defer_free_flush':
      ./include/net/tcp.h:1378: undefined reference to `__sk_defer_free_flush'

    Fixes: 79074a72d335 ("net: Flush deferred skb free on socket destroy")
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Signed-off-by: Gal Pressman <gal@nvidia.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220120123440.9088-1-gal@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:25 -03:00
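
Editor's note: the stub follows the usual CONFIG_-guarded static-inline pattern. Below is a sketch with the CONFIG_INET branch reconstructed from the surrounding commits (the defer_list field and the flush helper come from the "defer skb freeing" patch in the next entry), not a verbatim copy of the upstream hunk.

```
#ifdef CONFIG_INET
void __sk_defer_free_flush(struct sock *sk);

static inline void sk_defer_free_flush(struct sock *sk)
{
	if (llist_empty(&sk->defer_list))
		return;
	__sk_defer_free_flush(sk);
}
#else
/* With CONFIG_INET disabled there is nothing to flush: define a nop so
 * callers such as net/core/sock.c still compile and link. */
static inline void sk_defer_free_flush(struct sock *sk)
{
}
#endif
```
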
Wander Lairson Costa 19b7cb57b3 tcp: defer skb freeing after socket lock is released
JIRA: https://issues.redhat.com/browse/RHEL-9145

Conflicts:
* include/net/tcp.h: we already have
  7a26dc9e7b43 ("net: tcp: add skb drop reasons to tcp_add_backlog()")
* net/ipv4/tcp.c: we already have
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf

commit f35f821935d8df76f9c92e2431a225bdff938169
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:46 2021 -0800

    tcp: defer skb freeing after socket lock is released

    tcp recvmsg() (or rx zerocopy) spends a fair amount of time
    freeing skbs after their payload has been consumed.

    A typical ~64KB GRO packet has to release ~45 page
    references, eventually going to page allocator
    for each of them.

    Currently, this freeing is performed while socket lock
    is held, meaning that there is a high chance that
    BH handler has to queue incoming packets to tcp socket backlog.

    This can cause additional latencies, because the user
    thread has to process the backlog at release_sock() time,
    and while doing so, additional frames can be added
    by BH handler.

    This patch adds logic to defer these frees after socket
    lock is released, or directly from BH handler if possible.

    Being able to free these skbs from BH handler helps a lot,
    because this avoids the usual alloc/free asymmetry,
    when BH handler and user thread do not run on same cpu or
    NUMA node.

    One cpu can now be fully utilized for the kernel->user copy,
    and another cpu is handling BH processing and skb/page
    allocs/frees (assuming RFS is not forcing use of a single CPU)

    Tested:
     100Gbit NIC
     Max throughput for one TCP_STREAM flow, over 10 runs

    MTU : 1500
    Before: 55 Gbit
    After:  66 Gbit

    MTU : 4096+(headers)
    Before: 82 Gbit
    After:  95 Gbit

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:20 -03:00
Antoine Tenart aef83a52dd rstreason: prepare for active reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commit e13ec3da05d1 ("tcp:
  annotate lockless access to sk->sk_err") in c9s.

commit 5691276b39daf90294c6a81fb6d62d667f634c92
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:36 2024 +0800

    rstreason: prepare for active reset

    Like what we did to passive reset:
    only passing possible reset reason in each active reset path.

    No functional changes.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 8f346a11e7 tcp: make the dropreason really work when calling tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b9825695930546af725b1e686b8eaf4c71201728
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:26 2024 +0800

    tcp: make the dropreason really work when calling tcp_rcv_state_process()

    Update three callers including both ipv4 and ipv6 and let the dropreason
    mechanism work in reality.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 206757f0ed tcp: add dropreasons in tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Code difference in tcp_rcv_state_process due to missing upstream
  commit 795a7dfbc3d9 ("net: tcp: accept old ack during closing").

commit 7d6ed9afde8547723f6f96f81f984cc6c48eef52
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:25 2024 +0800

    tcp: add dropreasons in tcp_rcv_state_process()

    In this patch, I equipped this function with more dropreasons, but
    it still doesn't work yet, which I will do later.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Lucas Zampieri 3dce9ca7d2 Merge: MM: proactive fixes for 9.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4047

The following MR is part of the MM SST's plan to update the MM codebase to v6.4 for RHEL 9.5. As part of this effort, we have one MR for each v6.x upstream update, without the follow-up fixes.

This MR serves to ensure we are maintaining stability by utilizing available upstream fixes for the commits we have in the MM codebase. 

The rough structure of this MR is as follows:
```
The first set of patches (1-28) are missing patches from <v6.1
The second set of patches (29-86) are the fixes that were omitted from the v6.1-v6.4 updates
The third set of patches (87-129) are fixes from v6.4+ that are marked as STABLE patches
The fourth set of patches (130-171) are other fixes that are not stable patches and affect previous RHEL releases, or are fixes that I missed from step 2
```
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3951

JIRA: https://issues.redhat.com/browse/RHEL-5619

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:16:58 +00:00
Patrick Talbert 994dcfa46c Merge: mptcp: phase-1 backports for RHEL-9.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4034

JIRA: https://issues.redhat.com/browse/RHEL-32669
JIRA: https://issues.redhat.com/browse/RHEL-31604
CVE: CVE-2024-26708
CVE: CVE-2024-26781
Upstream Status: All mainline in net.git.
Tested: boot-tested only
Conflicts: see individual patches

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2024-05-07 08:23:25 +02:00
Nico Pache dc2f01811e tcp: Use per-vma locking for receive zerocopy
commit 7a7f094635349a7d0314364ad50bdeb770b6df4f
Author: Arjun Roy <arjunroy@google.com>
Date:   Fri Jun 16 12:34:27 2023 -0700

    tcp: Use per-vma locking for receive zerocopy

    Per-VMA locking allows us to lock a struct vm_area_struct without
    taking the process-wide mmap lock in read mode.

    Consider a process workload where the mmap lock is taken constantly in
    write mode. In this scenario, all zerocopy receives are periodically
    blocked during that period of time - though in principle, the memory
    ranges being used by TCP are not touched by the operations that need
    the mmap write lock. This results in performance degradation.

    Now consider another workload where the mmap lock is never taken in
    write mode, but there are many TCP connections using receive zerocopy
    that are concurrently receiving. These connections all take the mmap
    lock in read mode, but this does induce a lot of contention and atomic
    ops for this process-wide lock. This results in additional CPU
    overhead caused by contending on the cache line for this lock.

    However, with per-vma locking, both of these problems can be avoided.

    As a test, I ran an RPC-style request/response workload with 4KB
    payloads and receive zerocopy enabled, with 100 simultaneous TCP
    connections. I measured perf cycles within the
    find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
    without per-vma locking enabled.

    When using process-wide mmap semaphore read locking, about 1% of
    measured perf cycles were within this path. With per-VMA locking, this
    value dropped to about 0.45%.

    Signed-off-by: Arjun Roy <arjunroy@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:25 -06:00
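
Editor's note: a kernel-style sketch of the VMA lookup described above: try the per-VMA lock first, and fall back to the process-wide mmap read lock only when that fails. The helper shape is an assumption based on the commit message, not necessarily the exact upstream hunk in net/ipv4/tcp.c.

```
/* Sketch: prefer the per-VMA read lock; fall back to mmap_read_lock()
 * only when the lockless attempt fails.  *mmap_locked tells the caller
 * which unlock path to use afterwards. */
static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
					   unsigned long address,
					   bool *mmap_locked)
{
	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);

	if (vma) {
		if (vma->vm_ops != &tcp_vm_ops) {
			vma_end_read(vma);
			return NULL;
		}
		*mmap_locked = false;	/* only this VMA is read-locked */
		return vma;
	}

	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	if (!vma || vma->vm_ops != &tcp_vm_ops) {
		mmap_read_unlock(mm);
		return NULL;
	}
	*mmap_locked = true;		/* whole mmap is read-locked */
	return vma;
}
```
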
Davide Caratti d9c18bd98a mptcp: fix lockless access in subflow ULP diag
JIRA: https://issues.redhat.com/browse/RHEL-32669
Upstream Status: net.git commit b8adb69a7d29c2d33eb327bca66476fb6066516b

commit b8adb69a7d29c2d33eb327bca66476fb6066516b
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Thu Feb 15 19:25:30 2024 +0100

    mptcp: fix lockless access in subflow ULP diag

    Since the introduction of the subflow ULP diag interface, the
    dump callback has accessed all the subflow data locklessly.

    We need either to annotate all the read and write operations accordingly,
    or acquire the subflow socket lock. Let's do the latter, even if slower, to
    avoid a diffstat havoc.

    Fixes: 5147dfb508 ("mptcp: allow dumping subflow context to userspace")
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Mat Martineau <martineau@kernel.org>
    Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-04-18 17:25:35 +02:00
Paolo Abeni 322ce8ba2c tcp: fix cookie_init_timestamp() overflows
JIRA: https://issues.redhat.com/browse/RHEL-32164
Tested: LNST, Tier1

Upstream commit:
commit 73ed8e03388d16c12fc577e5c700b58a29045a15
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Oct 20 12:57:37 2023 +0000

    tcp: fix cookie_init_timestamp() overflows

    cookie_init_timestamp() is supposed to return a 64bit timestamp
    suitable for both TSval determination and setting of skb->tstamp.

    Unfortunately it uses 32bit fields and overflows after
    2^32 * 10^6 nsec (~49 days) of uptime.

    Generated TSval are still correct, but skb->tstamp might be set
    far away in the past, potentially confusing other layers.

    tcp_ns_to_ts() is changed to return a full 64bit value,
    ts and ts_now variables are changed to u64 type,
    and TSMASK is removed in favor of shift operations.

    While we are at it, change this sequence:
                    ts >>= TSBITS;
                    ts--;
                    ts <<= TSBITS;
                    ts |= options;
    to:
                    ts -= (1UL << TSBITS);

    Fixes: 9a568de481 ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-08 18:29:00 +02:00
Scott Weaver c6519990cd Merge: net: visibility patches
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3447

JIRA: https://issues.redhat.com/browse/RHEL-17413

A set of various visibility / debuggability improvements related to the net stack.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-02 10:35:00 -05:00
Scott Weaver 8d95883db0 Merge: io_uring: update to upstream v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3318

Update io_uring and its dependencies to upstream kernel version 6.6.

JIRA: https://issues.redhat.com/browse/RHEL-12076
JIRA: https://issues.redhat.com/browse/RHEL-14998
JIRA: https://issues.redhat.com/browse/RHEL-4447
CVE: CVE-2023-46862

Omitted-Fix: ab69838e7c75 ("io_uring/kbuf: Fix check of BID wrapping in provided buffers")
Omitted-Fix: f74c746e476b ("io_uring/kbuf: Allow the full buffer id space for provided buffers")

This is the list of new features available (includes upstream kernel versions 6.3-6.6):

    User-specified ring buffer
    Provided buffers allocated by the kernel
    Ability to register the ring fd
    Multi-shot timeouts
    Ability to pass custom flags to the completion queue entry for ring messages

All of these features are covered by the liburing tests.

In my testing, no-mmap-inval.t failed because of a broken test.  socket-uring-cmd.t also failed because of a missing selinux policy rule.  Try running audit2allow if you see a failure in that test.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-12-16 14:38:47 -05:00
Antoine Tenart 128059bf48 tcp: Add tracepoint for tcp_set_ca_state
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git

commit 15fcdf6ae116d1e8da4ff76c6fd82514f5ea501b
Author: Ping Gan <jacky_gam_2001@163.com>
Date:   Wed Apr 6 09:09:56 2022 +0800

    tcp: Add tracepoint for tcp_set_ca_state

    The congestion state of a tcp flow may change when there is
    congestion between the tcp sender and receiver. It makes sense to
    add a tracepoint to the congestion-state set function so that tools
    can sum up the time spent in each cc state and evaluate the
    performance of the network and the congestion algorithm. The
    background of this patch is linked below.

    Link: https://github.com/iovisor/bcc/pull/3899

    Signed-off-by: Ping Gan <jacky_gam_2001@163.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220406010956.19656-1-jacky_gam_2001@163.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-12-11 11:12:03 +01:00
Jan Stancek 063f72e7e5 Merge: mptcp: rebase to Linux 6.7
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3305

JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1, selftests, packetdrill

Rebase to the current upstream to bring in new features and
a lot of fixes. A good half of the long commit list touches
the self-tests only, and the remainder is self-contained in mptcp.

The only notable exception is:

tcp: get rid of sysctl_tcp_adv_win_scale

that is a prerequisite to a bunch of mptcp changes included here
and also uncontroversially a good thing (TM) for TCP.

Wider-scope data-race related changesets are included (possibly as
partial backports) only if they help to reduce conflicts with later
changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:53 +01:00
Jan Stancek 3c8d3e2d4a Merge: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3301

JIRA: https://issues.redhat.com/browse/RHEL-11592

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date:   Sun Jun 11 22:05:24 2023 -0500

    tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

    Under certain circumstances, the tcp receive buffer memory limit
    set by autotuning (sk_rcvbuf) is increased due to incoming data
    packets as a result of the window not closing when it should be.
    This can result in the receive buffer growing all the way up to
    tcp_rmem[2], even for tcp sessions with a low BDP.

    To reproduce:  Connect a TCP session with the receiver doing
    nothing and the sender sending small packets (an infinite loop
    of socket send() with 4 bytes of payload with a sleep of 1 ms
    in between each send()).  This will cause the tcp receive buffer
    to grow all the way up to tcp_rmem[2].
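
    A minimal sender loop matching that description (userspace sketch;
    socket setup and error handling beyond the loop are omitted):

        #include <unistd.h>
        #include <sys/socket.h>

        static void slow_drip(int fd)
        {
                const char payload[] = "ping";

                for (;;) {
                        /* 4-byte payload, 1 ms apart, receiver never reads */
                        if (send(fd, payload, 4, 0) < 0)
                                break;
                        usleep(1000);
                }
        }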

    As a result, a host can have individual tcp sessions with receive
    buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
    limits, causing the host to go into tcp memory pressure mode.

    The fundamental issue is the relationship between the granularity
    of the window scaling factor and the number of bytes ACKed back
    to the sender.  This problem has previously been identified in
    RFC 7323, appendix F [1].

    The Linux kernel currently adheres to never shrinking the window.

    In addition to the overallocation of memory mentioned above, the
    current behavior is functionally incorrect, because once tcp_rmem[2]
    is reached when no remediations remain (i.e. tcp collapse fails to
    free up any more memory and there are no packets to prune from the
    out-of-order queue), the receiver will drop in-window packets
    resulting in retransmissions and an eventual timeout of the tcp
    session.  A receive buffer full condition should instead result
    in a zero window and an indefinite wait.

    In practice, this problem is largely hidden for most flows.  It
    is not applicable to mice flows.  Elephant flows can send data
    fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
    triggering a zero window.

    But this problem does show up for other types of flows.  Examples
    are websockets and other types of flows that send small amounts of
    data spaced apart slightly in time.  In these cases, we directly
    encounter the problem described in [1].

    RFC 7323, section 2.4 [2], says there are instances when a retracted
    window can be offered, and that TCP implementations MUST ensure
    that they handle a shrinking window, as specified in RFC 1122,
    section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
    management have made clear that the sender must accept a shrunk window
    from the receiver, including RFC 793 [4] and RFC 1323 [5].

    This patch implements the functionality to shrink the tcp window
    when necessary to keep the right edge within the memory limit by
    autotuning (sk_rcvbuf).  This new functionality is enabled with
    the new sysctl: net.ipv4.tcp_shrink_window

    Additional information can be found at:
    https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

    [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
    [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
    [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
    [4] https://www.rfc-editor.org/rfc/rfc793
    [5] https://www.rfc-editor.org/rfc/rfc1323

    Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:50:41 +01:00
Jeff Moyer 092f5d645a net: ioctl: Use kernel memory on protocol ioctl callbacks
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
  559260fd9d9a ("ipmr: do not acquire mrt_lock in
  ioctl(SIOCGETVIFCNT)").  I also pulled in header changes from commit
  949d6b405e61 ("net: add missing includes and forward declarations
  under net/") to address a build failure with this patch applied.

commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Jun 9 08:27:42 2023 -0700

    net: ioctl: Use kernel memory on protocol ioctl callbacks
    
    Most of the ioctls to net protocols operate directly on the userspace
    argument (arg), usually doing get_user()/put_user() directly in the
    ioctl callback.  This is not flexible, because it is hard to reuse these
    functions without passing userspace buffers.
    
    Change the "struct proto" ioctls to avoid touching userspace memory and
    operate on kernel buffers, i.e., all protocol ioctl callbacks are
    adapted to operate on kernel memory rather than on userspace memory (so,
    no more {put,get}_user() and friends are called in the ioctl callback).
    
    This changes the "struct proto" ioctl format in the following way:
    
        int                     (*ioctl)(struct sock *sk, int cmd,
    -                                        unsigned long arg);
    +                                        int *karg);
    
    (Important to say that this patch does not touch the "struct proto_ops"
    protocols)
    
    So, the "karg" argument, which is passed to the ioctl callback, is a
    pointer allocated to kernel space memory (inside a function wrapper).
    This buffer (karg) may contain input argument (copied from userspace in
    a prep function) and it might return a value/buffer, which is copied
    back to userspace if necessary. There is not one-size-fits-all format
    (that is I am using 'may' above), but basically, there are three type of
    ioctls:
    
    1) Do not read from userspace, returns a result to userspace
    2) Read an input parameter from userspace, and does not return anything
      to userspace
    3) Read an input from userspace, and return a buffer to userspace.
    
    The default case (1) (where no input parameter is given, and an "int" is
    returned to userspace) encompasses more than 90% of the cases, but there
    are two other exceptions. Here is a list of exceptions:
    
    * Protocol RAW:
       * cmd = SIOCGETVIFCNT:
         * input and output = struct sioc_vif_req
       * cmd = SIOCGETSGCNT
         * input and output = struct sioc_sg_req
       * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
         argument, which is struct sioc_vif_req. Then the callback populates
         the struct, which is copied back to userspace.
    
    * Protocol RAW6:
       * cmd = SIOCGETMIFCNT_IN6
         * input and output = struct sioc_mif_req6
       * cmd = SIOCGETSGCNT_IN6
         * input and output = struct sioc_sg_req6
    
    * Protocol PHONET:
      * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
         * input int (4 bytes)
      * Nothing is copied back to userspace.
    
    For the exception cases, the sock_sk_ioctl_inout() helpers copy the
    userspace input into kernel space, and copy the result back to userspace.
    
    The wrapper that prepares the buffer and puts the buffer back to
    userspace is sk_ioctl(); so, instead of calling sk->sk_prot->ioctl(),
    callers now call sk_ioctl(), which will handle all cases.
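
    A hypothetical protocol callback after this conversion (the names
    foo_ioctl() and foo_unread_bytes() are made up for illustration): it
    only ever sees kernel memory, and the sk_ioctl() wrapper performs the
    copy_{from,to}_user() around it:

        static int foo_ioctl(struct sock *sk, int cmd, int *karg)
        {
                switch (cmd) {
                case SIOCINQ:
                        /* The wrapper copies *karg back to userspace. */
                        *karg = foo_unread_bytes(sk);
                        return 0;
                default:
                        return -ENOIOCTLCMD;
                }
        }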
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:32:16 -04:00
Paolo Abeni 5df45b6e33 tcp: define initial scaling factor value as a macro
JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1

Upstream commit:
commit 849ee75a38b297187c760bb1d23d8f2a7b1fc73e
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Mon Oct 23 13:44:37 2023 -0700

    tcp: define initial scaling factor value as a macro

    So that other users could access it. Notably MPTCP will use
    it in the next patch.

    No functional change intended.

    Acked-by: Matthieu Baerts <matttbe@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Mat Martineau <martineau@kernel.org>
    Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-4-9dc60939d371@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-31 21:50:28 +01:00
Paolo Abeni 2ff38709c3 mptcp: fix rcv buffer auto-tuning
JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1

Upstream commit:
commit b8dc6d6ce93142ccd4c976003bb6c25d63aac2ce
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Thu Jul 20 20:47:50 2023 +0200

    mptcp: fix rcv buffer auto-tuning

    The MPTCP code uses the assumption that the tcp_win_from_space() helper
    does not use any TCP-specific field, and thus works correctly operating
    on an MPTCP socket.

    The commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    broke such assumption, and as a consequence most MPTCP connections stall
    on zero-window event due to auto-tuning changing the rcv buffer size
    quite randomly.

    Address the issue by syncing the MPTCP auto-tuning code with the TCP
    one again. To achieve that, factor out the window size logic into
    socket-independent helpers, and reuse them in mptcp_rcv_space_adjust().
    The MPTCP-level scaling_ratio is selected as the minimum one from all
    the subflows, as a worst-case estimate.
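
    The per-subflow scan amounts to something like the sketch below (names
    taken from the MPTCP helpers, simplified from the actual hunk):

        struct mptcp_subflow_context *subflow;
        u8 scaling_ratio = U8_MAX;

        mptcp_for_each_subflow(msk, subflow) {
                const struct tcp_sock *tp =
                        tcp_sk(mptcp_subflow_tcp_sock(subflow));

                /* Worst-case estimate: keep the smallest per-subflow ratio. */
                scaling_ratio = min(scaling_ratio, tp->scaling_ratio);
        }
        msk->scaling_ratio = scaling_ratio;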

    Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20230720-upstream-net-next-20230720-mptcp-fix-rcv-buffer-auto-tuning-v1-1-175ef12b8380@tessares.net
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-31 21:50:01 +01:00
Paolo Abeni 8ce7b9e432 tcp: get rid of sysctl_tcp_adv_win_scale
JIRA: https://issues.redhat.com/browse/RHEL-15036
Tested: LNST, Tier1

Upstream commit:
commit dfa2f0483360d4d6f2324405464c9f281156bd87
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jul 17 15:29:17 2023 +0000

    tcp: get rid of sysctl_tcp_adv_win_scale

    With modern NIC drivers shifting to full page allocations per
    received frame, we face the following issue:

    TCP has one per-netns sysctl used to tweak how to translate
    a memory use into an expected payload (RWIN), in RX path.

    The tcp_win_from_space() implementation is limited to a few cases.
    
    For hosts dealing with various MSS values, we either underestimate
    or overestimate the RWIN we send to the remote peers.

    For instance with the default sysctl_tcp_adv_win_scale value,
    we expect to store 50% of payload per allocated chunk of memory.

    For the typical use of MTU=1500 traffic, and order-0 pages allocations
    by NIC drivers, we are sending too big RWIN, leading to potential
    tcp collapse operations, which are extremely expensive and source
    of latency spikes.

    This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
    uses a per socket scaling factor, so that we can precisely
    adjust the RWIN based on effective skb->len/skb->truesize ratio.

    This patch alone can double TCP receive performance when receivers
    are too slow to drain their receive queue, or by allowing
    a bigger RWIN when MSS is close to PAGE_SIZE.
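
    Conceptually, instead of dividing the receive space by a fixed
    2^tcp_adv_win_scale, each socket now remembers how much of an allocated
    byte actually carried payload and scales the advertised window by that
    ratio. A fixed-point sketch of the idea (not the exact upstream
    arithmetic):

        /* learn the ratio from a received skb; 256 means 100% payload */
        u64 ratio = ((u64)skb->len << 8) / skb->truesize;

        /* later, translate socket memory into an advertised window */
        u32 rwin = (space * ratio) >> 8;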

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-31 21:50:01 +01:00
Felix Maurer 2261d33599 tcp: adjust rcv_ssthresh according to sk_reserved_mem
JIRA: https://issues.redhat.com/browse/RHEL-11592
Conflicts:
- net/ipv4/tcp_input.c: context difference due to missing 240bfd134c59
  ("tcp: tweak len/truesize ratio for coalesce candidates")

commit 053f368412c9a7bfce2befec8c795113c8cfb0b1
Author: Wei Wang <weiwan@google.com>
Date:   Wed Sep 29 10:25:13 2021 -0700

    tcp: adjust rcv_ssthresh according to sk_reserved_mem

    When the user sets the SO_RESERVE_MEM socket option, in order to utilize
    the reserved memory when in the memory pressure state, we adjust
    rcv_ssthresh according to the available reserved memory for the socket,
    instead of always using 4 * advmss.
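
    For reference, the reservation this adjusts for is requested from
    userspace roughly like this (minimal sketch; SO_RESERVE_MEM takes a byte
    count and the accounting typically relies on memory cgroups):

        int reserve = 1 << 20;      /* ask for ~1 MiB of reserved memory */

        if (setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
                       &reserve, sizeof(reserve)) < 0)
                perror("setsockopt(SO_RESERVE_MEM)");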

    Signed-off-by: Wei Wang <weiwan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-31 16:20:06 +01:00
Paolo Abeni bdc0298e9a tcp: fix quick-ack counting to count actual ACKs of new data
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1
Conflicts: different context, as rhel lacks the upstream commit \
  03b123debcbc ("tcp: tcp_enter_quickack_mode() should be static")

Upstream commit:
commit 059217c18be6757b95bfd77ba53fb50b48b8a816
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Oct 1 11:12:38 2023 -0400

    tcp: fix quick-ack counting to count actual ACKs of new data

    This commit fixes quick-ack counting so that it only considers that a
    quick-ack has been provided if we are sending an ACK that newly
    acknowledges data.

    The code was erroneously using the number of data segments in outgoing
    skbs when deciding how many quick-ack credits to remove. This logic
    does not make sense, and could cause poor performance in
    request-response workloads, like RPC traffic, where requests or
    responses can be multi-segment skbs.

    When a TCP connection decides to send N quick-acks, that is to
    accelerate the cwnd growth of the congestion control module
    controlling the remote endpoint of the TCP connection. That quick-ack
    decision is purely about the incoming data and outgoing ACKs. It has
    nothing to do with the outgoing data or the size of outgoing data.

    And in particular, an ACK only serves the intended purpose of allowing
    the remote congestion control to grow the congestion window quickly if
    the ACK is ACKing or SACKing new data.

    The fix is simple: only count packets as serving the goal of the
    quickack mechanism if they are ACKing/SACKing new data. We can tell
    whether this is the case by checking inet_csk_ack_scheduled(), since
    we schedule an ACK exactly when we are ACKing/SACKing new data.
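
    In other words, the credit accounting roughly becomes (simplified sketch
    of the check, not the full upstream hunk):

        /* Spend a quick-ack credit only when this ACK acknowledges or
         * SACKs new data, i.e. an ACK was actually scheduled. */
        if (inet_csk_ack_scheduled(sk) && inet_csk(sk)->icsk_ack.quick)
                inet_csk(sk)->icsk_ack.quick--;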

    Fixes: fc6415bcb0 ("[TCP]: Fix quick-ack decrementing with TSO.")
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reviewed-by: Yuchung Cheng <ycheng@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:53:04 +02:00
Paolo Abeni 318329e4ce tcp: fix mishandling when the sack compression is deferred.
JIRA: https://issues.redhat.com/browse/RHEL-14348
Tested: LNST, Tier1

Upstream commit:
commit 30c6f0bf9579debce27e45fac34fdc97e46acacc
Author: fuyuanli <fuyuanli@didiglobal.com>
Date:   Wed May 31 16:01:50 2023 +0800

    tcp: fix mishandling when the sack compression is deferred.

    In this patch, we mainly try to handle sending a compressed ack
    correctly if it's deferred.

    Here are more details on the old logic:
    When sack compression is triggered in tcp_compressed_ack_kick(),
    if the sock is owned by the user, it will set TCP_DELACK_TIMER_DEFERRED
    and then defer to the release cb phase. Later, once the user releases
    the sock, tcp_delack_timer_handler() should send an ack as expected,
    which, however, cannot happen due to the lack of the ICSK_ACK_TIMER flag.
    Therefore, the receiver would not send an ack until the sender's
    retransmission timeout. It definitely adds unnecessary latency.

    Fixes: 5d9f4262b7 ("tcp: add SACK compression")
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: fuyuanli <fuyuanli@didiglobal.com>
    Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
    Link: https://lore.kernel.org/netdev/20230529113804.GA20300@didi-ThinkCentre-M920t-N000/
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230531080150.GA20424@didi-ThinkCentre-M920t-N000
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 11:46:08 +02:00
Artem Savkov b85fd5b987 net: Update an existing TCP congestion control algorithm.
Bugzilla: https://bugzilla.redhat.com/2221599

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 8fb1a76a0f35c45a424c9eb84b0f97ffd51e6052
Author: Kui-Feng Lee <kuifeng@meta.com>
Date:   Wed Mar 22 20:23:59 2023 -0700

    net: Update an existing TCP congestion control algorithm.

    This feature lets you immediately transition to another congestion
    control algorithm or implementation with the same name.  Once a name
    is updated, new connections will apply this new algorithm.

    The purpose is to update a customized algorithm implemented in BPF
    struct_ops with a new version on the fly.  The following is an
    example of using the userspace API implemented in later BPF patches.

       link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
       .......
       err = bpf_link__update_map(link, skel->maps.ca_update_2);

    We first load and register an algorithm implemented in BPF struct_ops,
    then swap it out with a new one using the same name. After that, newly
    created connections will apply the updated algorithm, while older ones
    retain the previous version already applied.

    This patch also takes this chance to refactor the ca validation into
    the new tcp_validate_congestion_control() function.
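
    Filled out slightly (the skeleton and map names here are hypothetical,
    error handling trimmed), the userspace flow looks like:

        struct ca_update_bpf *skel = ca_update_bpf__open_and_load();
        struct bpf_link *link;

        /* Register version 1 of the algorithm (same name as version 2). */
        link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);

        /* Swap in version 2: new connections use it, existing connections
         * keep whatever version was already applied to them. */
        if (bpf_link__update_map(link, skel->maps.ca_update_2))
                fprintf(stderr, "update failed\n");

        bpf_link__destroy(link);
        ca_update_bpf__destroy(skel);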

    Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
    Link: https://lore.kernel.org/r/20230323032405.3735486-3-kuifeng@meta.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:37 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer d19688b83d net: avoid double accounting for pure zerocopy skbs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 9b65b17db72313b7a4fe9bc9502928c88be57986
Author: Talal Ahmad <talalahmad@google.com>
Date:   Tue Nov 2 22:58:44 2021 -0400

    net: avoid double accounting for pure zerocopy skbs
    
    Track skbs containing only zerocopy data and avoid charging them to
    kernel memory to correctly account the memory utilization for
    msg_zerocopy. All of the data in such skbs is held in user pages which
    are already accounted to user. Before this change, they are charged
    again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
    excessive because data is not being copied into skb frags. This
    excessive charging can lead to the kernel going into a memory pressure
    state, which impacts all sockets in the system adversely. Mark pure
    zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
    charge/uncharge for data in such skbs.
    
    Initially, an skb is marked pure zerocopy when it is empty and in
    zerocopy path. skb can then change from a pure zerocopy skb to mixed
    data skb (zerocopy and copy data) if it is at tail of write queue and
    there is room available in it and non-zerocopy data is being sent in
    the next sendmsg call. At this time sk_mem_charge is done for the pure
    zerocopied data and the pure zerocopy flag is unmarked. We found that
    this happens very rarely on workloads that pass MSG_ZEROCOPY.
    
    A pure zerocopy skb can later be coalesced into normal skb if they are
    next to each other in queue but this patch prevents coalescing from
    happening. This avoids complexity of charging when skb downgrades from
    pure zerocopy to mixed. This is also rare.
    
    In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
    for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
    tcp_skb_entail for an skb without data.
    
    Testing with the msg_zerocopy.c benchmark between two hosts (100G nics)
    with zerocopy showed that before this patch the 'sock' variable in
    memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
    sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
    change it is 0. This is due to no charge to sk_forward_alloc for
    zerocopy data and shows that memory utilization for the kernel is lowered.
    
    With this commit we don't see the warning we saw in previous commit
    which resulted in commit 84882cf72cd774cf16fd338bdbf00f69ac9f9194.
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:03:02 -04:00
Jeff Moyer 474b0b4e6c tcp: rename sk_wmem_free_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 03271f3a3594c0e88f68d8cfbec0ba250b2c538a
Author: Talal Ahmad <talalahmad@google.com>
Date:   Fri Oct 29 22:05:41 2021 -0400

    tcp: rename sk_wmem_free_skb
    
    sk_wmem_free_skb() is only used by TCP.
    
    Rename it to make this clear, and move its declaration to
    include/net/tcp.h
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:02:02 -04:00
Jerome Marchand f79873d0b0 bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes
Bugzilla: https://bugzilla.redhat.com/2177177

commit a351d6087bf7d3d8440d58d3bf244ec64b89394a
Author: Pengcheng Yang <yangpc@wangsu.com>
Date:   Tue Nov 29 18:40:39 2022 +0800

    bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes

    When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
    flag from the msg->flags. If apply_bytes is used and it is larger than
    the current data being processed, sk_psock_msg_verdict() will not be
    called when sendmsg() is called again. At this time, the msg->flags is 0,
    and we lost the BPF_F_INGRESS flag.

    So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
    redirecting.

    Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
    Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:13 +02:00
Guillaume Nault 9194a37d24 tcp/udp: Make early_demux back namespacified.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186795
Upstream Status: linux.git

commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 10:52:07 2022 -0700

    tcp/udp: Make early_demux back namespacified.

    Commit e21145a987 ("ipv4: namespacify ip_early_demux sysctl knob") made
    it possible to enable/disable early_demux on a per-netns basis.  Then, we
    introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
    TCP/UDP in commit dddb64bcb3 ("net: Add sysctl to toggle early demux for
    tcp and udp").  However, the .proc_handler() was wrong and actually
    disabled us from changing the behaviour in each netns.

    We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
    .early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
    the change itself is saved in each netns variable, but the .early_demux()
    handler is a global variable, so the handler is switched based on the
    init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
    nothing to do with the logic.  Whether we CAN execute proto .early_demux()
    is always decided by init_net's sysctl knob, and whether we DO it or not
    is decided by each netns ip_early_demux knob.

    This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
    of the .early_demux() handler are TCP and UDP only, and they are called
    directly to avoid retpoline.  So, we can remove the .early_demux() handler
    from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
    If another proto needs .early_demux(), we can restore it at that time.

    Fixes: dddb64bcb3 ("net: Add sysctl to toggle early demux for tcp and udp")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-04-14 16:19:36 +02:00
Felix Maurer ee5a18be02 bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 273b7f0fb44847c41814a59901977be284daa447
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Thu Sep 1 17:29:18 2022 -0700

    bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
    
    This patch changes bpf_getsockopt(SOL_TCP) to reuse
    do_tcp_getsockopt().  It removes the duplicated code from
    bpf_getsockopt(SOL_TCP).
    
    Before this patch, there were some optnames available to
    bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP).
    For example, TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
    and a few more.  It surprises users from time to time.  This patch
    automatically closes this gap without duplicating more code.
    
    bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn,
    so it stays in sol_tcp_sockopt().
    
    For a string name value like TCP_CONGESTION, bpf expects it to
    always be null terminated, so sol_tcp_sockopt() decrements
    optlen by one before calling do_tcp_getsockopt() and
    the 'if (optlen < saved_optlen) memset(..,0,..);'
    in __bpf_getsockopt() will always do a null termination.
    
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:33 +01:00
Felix Maurer 4eef215a5c bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
Bugzilla: https://bugzilla.redhat.com/2166911

commit 0c751f7071ef98d334ed06ca3f8f4cc1f7458cf5
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Tue Aug 16 23:18:19 2022 -0700

    bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
    
    After the prep work in the previous patches,
    this patch removes all the dup code from bpf_setsockopt(SOL_TCP)
    and reuses the do_tcp_setsockopt().
    
    The existing optname white-list is refactored into a new
    function sol_tcp_setsockopt().  The sol_tcp_setsockopt()
    also calls the bpf_sol_tcp_setsockopt() to handle
    the TCP_BPF_XXX specific optnames.
    
    bpf_setsockopt(TCP_SAVE_SYN) now also allows a value of 2 to
    save the eth header as well, and it comes for free from
    do_tcp_setsockopt().
    
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:30 +01:00
Guillaume Nault bd9db43134 tcp: Fix a data-race around sysctl_tcp_adv_win_scale.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 36eeee75ef0157e42fb6593dcc65daab289b559e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 20 09:50:14 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_adv_win_scale.

    While reading sysctl_tcp_adv_win_scale, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.
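
    The annotation follows the usual pattern for lockless sysctl access
    (illustrative pairing, not the exact hunk):

        /* sysctl write side */
        WRITE_ONCE(net->ipv4.sysctl_tcp_adv_win_scale, val);

        /* lockless reader, e.g. tcp_win_from_space() */
        int scale = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_adv_win_scale);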

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Guillaume Nault d895c95ec3 tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 4845b5713ab18a1bb6e31d1fbb4d600240b8b691
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:48 2022 -0700

    tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.

    While reading sysctl_tcp_slow_start_after_idle, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: 35089bb203 ("[TCP]: Add tcp_slow_start_after_idle sysctl.")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:14 +01:00
Herton R. Krzesinski ee17c5d305 Merge: bpf, xdp: update to 6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1742

bpf, xdp: update to 6.0

Bugzilla: https://bugzilla.redhat.com/2137876

Signed-off-by: Artem Savkov <asavkov@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Michael Petlan <mpetlan@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-12 16:01:19 +00:00
Herton R. Krzesinski 621a3b0cfb Merge: net: Backport data race annotations in the networking stack (part 1).
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1722

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
Conflicts: Few minor conflicts, see description in affected commits.

Properly mark concurrent reads and writes with READ_ONCE() and
WRITE_ONCE() in various parts of the networking stack. This is a
backport of the following upstream patch series:
  
  * Patch set A: merge commit e97e68b56e78 ("Merge branch 'sk_bound_dev_if-annotations'")
  * Patch set B: merge commit 32b3ad1418ea ("Merge branch 'sysctl-data-races'")
  * Patch set C: merge commit 7d5424b26f17 ("Merge branch 'net-sysctl-races'")
  * Patch set D: merge commit 782d86fe44e3 ("Merge branch 'net-sysctl-races-round2'")
  * Patch set E: merge commit c9f21106d97b ("Merge branch 'net-ipv4-sysctl-races-part-3'")

Patch 1 is a standalone READ_ONCE() annotation for sk->sk_bound_dev_if.
It's a prerequisite for correctly backporting patch set A.

Patches 2-9 are backports of patch set A. The following upstream
patches have been omitted since they're already in Centos Stream:
  
  * Upstream commit a20ea298071f ("sctp: read sk->sk_bound_dev_if once
    in sctp_rcv()"), backported by Centos Stream commit 5d539b8523.

  * Upstream commit 70f87de9fa0d ("net_sched: em_meta: add READ_ONCE()
    in var_sk_bound_if()"), backported by Centos Stream commit
    866ca288f3.

Patch 10 was in the original upstream series of patch set B, but was
resubmitted independently as that series was reworked before being
applied. Therefore, it doesn't strictly belong to patch set B, but is
closely related to it and is thus backported here.

Patches 11-21 are backports of patch set B. The following upstream
patch has been omitted since it's already in Centos Stream:
  
  * Upstream commit 310731e2f161 ("net: Fix data-races around
    sysctl_mem."), backported by Centos Stream commit a99b2cb4eb.

Patches 22-36 are backports corresponding to patch set C.

Patches 37-51 are backports corresponding to patch set D.

Patches 52-66 are backports corresponding to patch set E.

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-09 15:37:27 +00:00
Felix Maurer 09faf01cb9 net: Introduce a new proto_ops ->read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: Context difference due to not yet applied 314001f0bf927
("af_unix: Add OOB support") and already applied 3f92a64e44e5 ("tcp:
allow tls to decrypt directly from the tcp rcv queue")

commit 965b57b469a589d64d81b1688b38dcb537011bb0
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:12 2022 -0700

    net: Introduce a new proto_ops ->read_skb()

    Currently both splice() and sockmap use ->read_sock() to
    read skbs from the receive queue, but for sockmap we only read
    one entire skb at a time, so ->read_sock() is too conservative
    to use. Introduce a new proto_ops ->read_skb() which supports
    this semantic; with this we can finally pass the ownership of
    the skb to recv actors.

    For non-TCP protocols, all ->read_sock() can be simply
    converted to ->read_skb().

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer a4ae9a073c tcp: Introduce tcp_read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876
Conflicts: 3f92a64e44e5 "tcp: allow tls to decrypt directly from the
           tcp rcv queue" already backported.

commit 04919bed948dc22a0032a9da867b7dcb8aece4ca
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:11 2022 -0700

    tcp: Introduce tcp_read_skb()

    This patch introduces tcp_read_skb() based on tcp_read_sock(),
    a preparation for the next patch which actually introduces
    a new sock op.
    
    TCP is special here, because it has tcp_read_sock() which is
    mainly used by splice(). tcp_read_sock() supports partial reads
    and arbitrary offsets, neither of which is needed for sockmap.

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Artem Savkov 56229aa3cb bpf: Add helpers to issue and check SYN cookies in XDP
Bugzilla: https://bugzilla.redhat.com/2137876

commit 33bf9885040c399cf6a95bd33216644126728e14
Author: Maxim Mikityanskiy <maximmi@nvidia.com>
Date:   Wed Jun 15 16:48:44 2022 +0300

    bpf: Add helpers to issue and check SYN cookies in XDP
    
    The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP
    program to generate SYN cookies in response to TCP SYN packets and to
    check those cookies upon receiving the first ACK packet (the final
    packet of the TCP handshake).
    
    Unlike bpf_tcp_{gen,check}_syncookie, these new helpers don't need a
    listening socket on the local machine, which makes it possible to use
    them together with synproxy to accelerate SYN cookie generation.
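
    A rough sketch of the intended XDP usage (assuming the usual
    vmlinux.h/bpf_helpers.h includes; header parsing is abbreviated, the
    verifier-required checks on the full TCP header length are omitted, and
    the helper names are taken from the description above):

        SEC("xdp")
        int synproxy_sketch(struct xdp_md *ctx)
        {
                void *data = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;
                struct ethhdr *eth = data;
                struct iphdr *iph = (void *)(eth + 1);
                struct tcphdr *th;

                if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
                        return XDP_PASS;
                th = (void *)iph + iph->ihl * 4;
                if ((void *)(th + 1) > data_end)
                        return XDP_PASS;

                if (th->syn && !th->ack) {
                        /* No local listening socket is needed to mint this. */
                        __s64 cookie = bpf_tcp_raw_gen_syncookie_ipv4(iph, th,
                                                                      th->doff * 4);
                        if (cookie < 0)
                                return XDP_PASS;
                        /* reflect the cookie in a crafted SYN-ACK (omitted) */
                } else if (th->ack && !th->syn &&
                           bpf_tcp_raw_check_syncookie_ipv4(iph, th) < 0) {
                        return XDP_DROP;        /* stale or forged cookie */
                }
                return XDP_PASS;
        }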
    
    Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:36 +01:00