JIRA: https://issues.redhat.com/browse/RHEL-64867
commit 9e2db9d3993e270b24fbc4ce1ca7e09756e8df25
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Thu Jun 27 13:59:41 2024 +0100
net: always try to set ubuf in skb_zerocopy_iter_stream
skb_zcopy_set() does nothing if there is already a ubuf_info associated
with an skb, and since ->link_skb should have set it several lines above,
the check here essentially does nothing and can be removed. It's also
safer this way, because even if the callback is faulty we'll
have it set.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
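A minimal sketch of the resulting flow in skb_zerocopy_iter_stream(), using the ->link_skb callback named in the entries below (illustrative, not the verbatim diff):
```
	if (uarg->ops->link_skb) {
		err = uarg->ops->link_skb(skb, uarg);
		if (err)
			return err;
	}
	/* No longer guarded by a "no link_skb" check: skb_zcopy_set() is a
	 * no-op when a ubuf_info is already attached, so calling it
	 * unconditionally is safe even if the callback misbehaves. */
	skb_zcopy_set(skb, uarg, NULL);
```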
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit 65bada80dec1f2108a751644773b2120bd789934
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Fri Apr 19 12:08:40 2024 +0100
net: add callback for setting a ubuf_info to skb
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb; this
way we will implement smarter assignment later, like linking ubuf_info
together.
Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
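A hedged sketch of the optional callback added here; the member name follows the commit text, the exact prototypes are assumptions:
```
struct ubuf_info_ops {
	void (*complete)(struct sk_buff *skb, struct ubuf_info *uarg,
			 bool zerocopy_success);
	/* Optional: bind uarg to skb. Must cope with an skb that has no
	 * prior ubuf_info, otherwise a send could fail to progress. */
	int (*link_skb)(struct sk_buff *skb, struct ubuf_info *uarg);
};
```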
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: The conflicts here existed upstream, and were resolved by
merge commit 3830fff39941 ("Merge branch 'for-uring-ubufops' of
git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into
for-6.10/io_uring"). This patch incorporates the fixes from that
merge commit.
commit 7ab4f16f9e2440e797eae88812f800458e5879d2
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Fri Apr 19 12:08:39 2024 +0100
net: extend ubuf_info callback to ops structure
We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from the smarter
io_uring notification management introduced in the next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
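A hedged sketch of the described migration from a per-ubuf_info callback pointer to a shared ops table (the instance and callback names are assumptions):
```
/* Before: each ubuf_info carried a bare completion callback. */
struct ubuf_info {
	void (*callback)(struct sk_buff *, struct ubuf_info *, bool);
	refcount_t refcnt;
	u8 flags;
};

/* After: a const ops table referenced from ubuf_info, e.g. */
static const struct ubuf_info_ops msg_zerocopy_ubuf_ops = {
	.complete = msg_zerocopy_complete,	/* assumed name */
};
```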
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit 74e831af165acc968418a4d9fde8c2e099f3e8bf
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date: Tue Dec 19 23:29:04 2023 +0100
skbuff: use mempool KASAN hooks
Instead of using slab-internal KASAN hooks for poisoning and unpoisoning
cached objects, use the proper mempool KASAN hooks.
Also check the return value of kasan_mempool_poison_object to prevent
double-free and invalid-free bugs.
Link: https://lkml.kernel.org/r/a3482c41395c69baa80eb59dbb06beef213d2a14.1703024586.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
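A minimal sketch of the hooks as used on the skb NAPI cache paths described above (call sites simplified; the cache name follows the later "drop the word head" entry):
```
	/* Put path: poison the cached skb; bail out if KASAN detects a
	 * double-free or invalid-free. */
	if (!kasan_mempool_poison_object(skb))
		return;

	/* Get path: unpoison before handing the cached skb back out. */
	kasan_mempool_unpoison_object(skb, kmem_cache_size(skbuff_cache));
```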
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit 1ce9a0523938f87dd8505233cc3445f8e2d8dcee
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date: Tue Dec 19 23:29:03 2023 +0100
kasan: rename and document kasan_(un)poison_object_data
Rename kasan_unpoison_object_data to kasan_unpoison_new_object and add a
documentation comment. Do the same for kasan_poison_object_data.
The new names and the comments should suggest to users that these hooks
are intended for internal use by the slab allocator.
The following patch will remove non-slab-internal uses of these hooks.
No functional changes.
[andreyknvl@google.com: update references to renamed functions in comments]
Link: https://lkml.kernel.org/r/20231221180637.105098-1-andrey.konovalov@linux.dev
Link: https://lkml.kernel.org/r/eab156ebbd635f9635ef67d1a4271f716994e628.1703024586.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
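For reference, a sketch of the renamed slab-internal hooks (prototypes assumed to be unchanged by the rename):
```
/* Called by the slab allocator when an object is handed out / returned. */
void kasan_unpoison_new_object(struct kmem_cache *cache, void *object);
void kasan_poison_new_object(struct kmem_cache *cache, void *object);
```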
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit 025a785ff083729819dc82ac81baf190cb4aee5c
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Feb 8 22:06:42 2023 -0800
net: skbuff: drop the word head from skb cache
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers); let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57768
commit 938863727076f684abb39d1d0f9dce1924e9028e
Author: Boris Sukholitko <boris.sukholitko@broadcom.com>
Date: Thu Aug 22 13:35:08 2024 +0300
tc: adjust network header after 2nd vlan push
<tldr>
skb network header of the single-tagged vlan packet continues to point to the
vlan payload (e.g. IP) after the second vlan tag is pushed by tc act_vlan. This
causes problem at the dissector which expects double-tagged packet network
header to point to the inner vlan.
The fix is to adjust network header in tcf_act_vlan.c but requires
refactoring of skb_vlan_push function.
</tldr>
Consider the following shell script snippet configuring TC rules on the
veth interface:
ip link add veth0 type veth peer veth1
ip link set veth0 up
ip link set veth1 up
tc qdisc add dev veth0 clsact
tc filter add dev veth0 ingress pref 10 chain 0 flower \
num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5
tc filter add dev veth0 ingress pref 20 chain 0 flower \
num_of_vlans 1 action vlan push id 100 \
protocol 0x8100 action goto chain 5
tc filter add dev veth0 ingress pref 30 chain 5 flower \
num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"
Sending double-tagged vlan packet with the IP payload inside:
cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d
0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&......
0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................
0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............
ENDS
will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to
the dmesg.
OTOH, sending single-tagged vlan packet:
cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ..........."....
0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*..........
0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................
0030 00 00 00 00 00 00 00 00 00 00 00 00 ............
ENDS
will match rule 20, will push the second vlan tag but will *not* match
rule 30. IOW, the match at rule 30 fails if the second vlan was freshly
pushed by the kernel.
Lets look at __skb_flow_dissect working on the double-tagged vlan packet.
Here is the relevant code from around net/core/flow_dissector.c:1277
copy-pasted here for convenience:
	if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &&
	    skb && skb_vlan_tag_present(skb)) {
		proto = skb->protocol;
	} else {
		vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
					    data, hlen, &_vlan);
		if (!vlan) {
			fdret = FLOW_DISSECT_RET_OUT_BAD;
			break;
		}
		proto = vlan->h_vlan_encapsulated_proto;
		nhoff += sizeof(*vlan);
	}
The "else" clause above gets the protocol of the encapsulated packet from
the skb data at the network header location. printk debugging has shown
that in the good double-tagged packet case proto is
htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet
case proto is garbage leading to the failure to match tc filter 30.
proto is being set from the skb header pointed by nhoff parameter which is
defined at the beginning of __skb_flow_dissect
(net/core/flow_dissector.c:1055 in the current version):
nhoff = skb_network_offset(skb);
Therefore the culprit seems to be that the skb network offset is different
between double-tagged packet received from the interface and single-tagged
packet having its vlan tag pushed by TC.
Lets look at the interesting points of the lifetime of the single/double
tagged packets as they traverse our packet flow.
Both of them will start at __netif_receive_skb_core where the first vlan
tag will be stripped:
	if (eth_type_vlan(skb->protocol)) {
		skb = skb_vlan_untag(skb);
		if (unlikely(!skb))
			goto out;
	}
At this stage in double-tagged case skb->data points to the second vlan tag
while in single-tagged case skb->data points to the network (eg. IP)
header.
Looking at TC vlan push action (net/sched/act_vlan.c) we have the following
code at tcf_vlan_act (interesting points are in square brackets):
	if (skb_at_tc_ingress(skb))
[1]		skb_push_rcsum(skb, skb->mac_len);
	....
	case TCA_VLAN_ACT_PUSH:
		err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
				    (p->tcfv_push_prio << VLAN_PRIO_SHIFT),
				    0);
		if (err)
			goto drop;
		break;
	....
out:
	if (skb_at_tc_ingress(skb))
[3]		skb_pull_rcsum(skb, skb->mac_len);
And skb_vlan_push (net/core/skbuff.c:6204) function does:
	err = __vlan_insert_tag(skb, skb->vlan_proto,
				skb_vlan_tag_get(skb));
	if (err)
		return err;
	skb->protocol = skb->vlan_proto;
[2]	skb->mac_len += VLAN_HLEN;
in the case of pushing the second tag. Lets look at what happens with
skb->data of the single-tagged packet at each of the above points:
1. As a result of the skb_push_rcsum, skb->data is moved back to the start
of the packet.
2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is
incremented, skb->data still points to the start of the packet.
3. As a result of the skb_pull_rcsum, skb->data is moved forward by the
modified skb->mac_len, thus pointing to the network header again.
Then __skb_flow_dissect will get confused by having double-tagged vlan
packet with the skb->data at the network header.
The solution for the bug is to preserve "skb->data at second vlan header"
semantics in the skb_vlan_push function. We do this by manipulating
skb->network_header rather than skb->mac_len. skb_vlan_push callers are
updated to do skb_reset_mac_len.
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
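A conceptual sketch of the fix described above (not the verbatim diff): the freshly inserted tag sits VLAN_HLEN bytes before the old network header, so skb_vlan_push() moves the network header instead of growing mac_len, and its callers refresh mac_len afterwards.
```
	/* In skb_vlan_push(), after __vlan_insert_tag() succeeded: */
	skb->protocol = skb->vlan_proto;
	skb->network_header -= VLAN_HLEN;	/* keep pointing at the inner tag */

	/* In callers such as tcf_vlan_act(), once the push is done: */
	skb_reset_mac_len(skb);
```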
JIRA: https://issues.redhat.com/browse/RHEL-57765
Conflicts:
- Context differences (missing skb_cow_data_for_xdp) due to missing
e6d5dbdd20aa ("xdp: add multi-buff support for xdp running in generic
mode")
- net/core/skbuff.c: context difference (condition moved to function) due
to missing 8cfa2dee325f ("skbuff: Add a function to check if a page
belongs to page_pool") with no functional changes
- net/core/skbuff.c: context difference (missing skb_kfree_head) due to
missing bf9f1baa279f ("net: add dedicated kmem_cache for typical/small
skb->head"); this can appear in revumatic as if skb_free_head was moved
but that isn't true, the hunks are just reordered (check the line nums)
commit 4a96a4e807c390a9d91b450ebe04eeb2e0ecc076
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Fri Mar 29 17:55:06 2024 +0100
page_pool: check for PP direct cache locality later
Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check
whether it's safe to use direct recycling, we can use both globally for
each page instead of relying solely on @allow_direct argument.
Let's assume that @allow_direct means "I'm sure it's local, don't waste
time rechecking this" and when it's false, try the mentioned params to
still recycle the page directly. If neither is true, we'll lose some
CPU cycles, but then it surely won't be hotpath. On the other hand,
paths where it's possible to use direct cache, but not possible to
safely set @allow_direct, will benefit from this move.
The whole propagation of @napi_safe through a dozen of skb freeing
functions can now go away, which saves us some stack space.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
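A hedged sketch of the locality test this results in (helper name and exact ordering are illustrative):
```
static bool page_pool_napi_local(const struct page_pool *pool)
{
	const struct napi_struct *napi;
	u32 cpuid;

	/* Never recycle directly from hardirq or preemptible context. */
	if (unlikely(!in_softirq()))
		return false;

	cpuid = smp_processor_id();
	if (READ_ONCE(pool->cpuid) == cpuid)	/* percpu page_pool */
		return true;

	napi = READ_ONCE(pool->p.napi);		/* pool linked to a NAPI */
	return napi && READ_ONCE(napi->list_owner) == cpuid;
}
```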
JIRA: https://issues.redhat.com/browse/RHEL-57765
commit 56ef27e3abe6d6453b1f4f6127041f3a65d7cbc9
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu Feb 15 12:39:05 2024 +0100
page_pool: disable direct recycling based on pool->cpuid on destroy
Now that direct recycling is performed based on pool->cpuid when set,
memory leaks are possible:
1. A pool is destroyed.
2. Alloc cache is emptied (it's done only once).
3. pool->cpuid is still set.
4. napi_pp_put_page() does direct recycling based on pool->cpuid.
5. Now alloc cache is not empty, but it won't ever be freed.
In order to avoid that, rewrite pool->cpuid to -1 when unlinking NAPI to
make sure no direct recycling will be possible after emptying the cache.
This involves a bit of overhead as pool->cpuid now must be accessed
via READ_ONCE() to avoid partial reads.
Rename page_pool_unlink_napi() -> page_pool_disable_direct_recycling()
to reflect what it actually does and unexport it.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240215113905.96817-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
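A minimal sketch of the renamed helper as described above (assumed shape):
```
static void page_pool_disable_direct_recycling(struct page_pool *pool)
{
	/* Paired with the READ_ONCE(pool->cpuid) on the recycling fast path,
	 * so no direct recycling can happen after the cache is emptied. */
	WRITE_ONCE(pool->cpuid, -1);

	if (pool->p.napi)
		WRITE_ONCE(pool->p.napi, NULL);
}
```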
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit e8e1ce8454c9cc8ad2e4422bef346428e52455e3
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 21 09:43:53 2023 +0000
net: add debugging checks in skb_attempt_defer_free()
Make sure skbs that are stored in softnet_data.defer_list
do not have a dst attached.
Also make sure that the skb was orphaned.
Link: https://lore.kernel.org/netdev/CANn89iJuEVe72bPmEftyEJHLzzN=QNR2yueFjTxYXCEpS5S8HQ@mail.gmail.com/T/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
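The checks described above boil down to something like this (sketch):
```
	/* skbs parked on softnet_data.defer_list must carry no dst and must
	 * already be orphaned (no destructor / accounted socket memory). */
	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
	DEBUG_NET_WARN_ON_ONCE(skb->destructor);
```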
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 931e93bdf8ca71cef1f8759c43bc2c5385392b8b
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 21 09:43:54 2023 +0000
net: do not provide hard irq safety for sd->defer_lock
kfree_skb() can be called from hard irq handlers,
but skb_attempt_defer_free() is meant to be used
from process or BH contexts, and skb_defer_free_flush()
is meant to be called from BH contexts.
Not having to mask hard irq can save some cycles.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
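A sketch of the resulting locking, assuming the callers stay in process/BH context as described:
```
	/* skb_attempt_defer_free(): process or BH context, exclude BH only. */
	spin_lock_bh(&sd->defer_lock);
	/* queue the skb onto sd->defer_list */
	spin_unlock_bh(&sd->defer_lock);

	/* skb_defer_free_flush(): already runs in BH, plain spin_lock is enough. */
	spin_lock(&sd->defer_lock);
	/* splice and free sd->defer_list */
	spin_unlock(&sd->defer_lock);
```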
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit c09b0cd2cc6c3f91988a20d45fa45c889f72c56c
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed May 18 11:55:22 2022 -0700
net: avoid strange behavior with skb_defer_max == 1
When user sets skb_defer_max to 1 the kick threshold is 0
(half of 1). If we increment queue length before the check
the kick will never happen, and the skb may get stranded.
This is likely harmless but can be avoided by moving the
increment after the check. This way skb_defer_max == 1
will always kick. Still a silly config to have, but
somehow that feels more correct.
While at it drop a comment which seems to be outdated
or confusing, and wrap the defer_count write with
a WRITE_ONCE() since it's read on the fast path
that avoids taking the lock.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220518185522.2038683-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
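A sketch of the reordered queueing step (variable names assumed):
```
	defer_max = READ_ONCE(sysctl_skb_defer_max);
	/* Check before incrementing: with skb_defer_max == 1 the threshold
	 * (defer_max / 2) is 0, so the very first queued skb still kicks. */
	kick = sd->defer_count == (defer_max >> 1);
	/* Written under defer_lock, read locklessly on the fast path. */
	WRITE_ONCE(sd->defer_count, sd->defer_count + 1);
```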
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 39564c3fdc6684c6726b63e131d2a9f3809811cb
Author: Eric Dumazet <edumazet@google.com>
Date: Sun May 15 21:24:55 2022 -0700
net: add skb_defer_max sysctl
commit 68822bdf76f1 ("net: generalize skb freeing
deferral to per-cpu lists") added another per-cpu
cache of skbs. It was expected to be small,
and an IPI was forced whenever the list reached 128
skbs.
We might need to be able to control more precisely
queue capacity and added latency.
An IPI is generated whenever queue reaches half capacity.
Default value of the new limit is 64.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 80d2eefcb4c84aa9018b2a997ab3a4c567bc821a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon Mar 25 08:40:30 2024 +0100
net: Use backlog-NAPI to clean up the defer_list.
The defer_list is a per-CPU list which is used to free skbs outside of
the socket lock and on the CPU on which they have been allocated.
The list is processed during NAPI callbacks so ideally the list is
cleaned up.
Should the amount of skbs on the list exceed a certain watermark then
the softirq is triggered remotely on the target CPU by invoking a remote
function call. Raising the softirq via a remote function call
leads to waking ksoftirqd on PREEMPT_RT, which is undesired.
The backlog-NAPI threads already provide the infrastructure which can be
utilized to perform the cleanup of the defer_list.
The NAPI state is updated with the input_pkt_queue.lock acquired. In
order not to break the state, the backlog-NAPI thread also needs to be
woken with the lock held. This requires taking the lock in
rps_lock_irq*() if the backlog-NAPI threads are used, even with RPS
disabled.
Move the logic of remotely starting softirqs to clean up the defer_list
into kick_defer_list_purge(). Make sure a lock is held in
rps_lock_irq*() if backlog-NAPI threads are used. Schedule backlog-NAPI
for defer_list cleanup if backlog-NAPI is available.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts: we already have 490a79faf95e ("net: introduce include/net/rps.h")
commit 2b0cfa6e49566c8fa6759734cf821aa6e8271a9e
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon Feb 12 10:50:54 2024 +0100
net: add generic percpu page_pool allocator
Introduce generic percpu page_pools allocator.
Moreover add page_pool_create_percpu() and a cpuid field in the page_pool struct
in order to recycle the page in the page_pool "hot" cache if
napi_pp_put_page() is running on the same cpu.
This is a preliminary patch to add xdp multi-buff support for xdp running
in generic mode.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/80bc4285228b6f4220cd03de1999d86e46e3fcbd.1707729884.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 97e719a82b43c6c2bb5eebdb3c5d479a332ac2ac
Author: Eric Dumazet <edumazet@google.com>
Date: Sun May 15 21:24:53 2022 -0700
net: fix possible race in skb_attempt_defer_free()
A cpu can observe sd->defer_count reaching 128,
and call smp_call_function_single_async()
Problem is that the remote CPU can clear sd->defer_count
before the IPI is run/acknowledged.
Other cpus can queue more packets and also decide
to call smp_call_function_single_async() while the pending
IPI was not yet delivered.
This is a common issue with smp_call_function_single_async().
Callers must ensure correct synchronization and serialization.
I triggered this issue while experimenting with a smaller threshold.
Performing the call to smp_call_function_single_async()
under sd->defer_lock protection did not solve the problem.
Commit 5a18ceca63 ("smp: Allow smp_call_function_single_async()
to insert locked csd") replaced an informative WARN_ON_ONCE()
with a return of -EBUSY, which is often ignored.
Test of CSD_FLAG_LOCK presence is racy anyway.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
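A hedged sketch of the serialization this introduces: at most one pending IPI per target CPU, re-armed from the IPI handler itself (the flag name is an assumption):
```
static void trigger_rx_softirq(void *data)
{
	struct softnet_data *sd = data;

	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	smp_store_release(&sd->defer_ipi_scheduled, 0);
}

	/* Queueing side, only one caller wins the cmpxchg: */
	if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
		smp_call_function_single_async(cpu, &sd->defer_csd);
```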
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts:
net/tls/tls_sw.c: we already have:
* 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
net/ipv4/tcp_ipv4.c: we already have:
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
* 7a26dc9e7b43 net: tcp: add skb drop reasons to tcp_add_backlog()
commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 22 13:12:37 2022 -0700
net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() looks better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
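A simplified sketch of the deferral described above (not the verbatim upstream function; the threshold and locking follow the other entries in this series):
```
void skb_attempt_defer_free(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;
	struct softnet_data *sd;
	bool kick;

	/* Nothing to gain if the skb is already local (or the CPU is gone). */
	if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
		__kfree_skb(skb);
		return;
	}

	sd = &per_cpu(softnet_data, cpu);
	spin_lock_bh(&sd->defer_lock);
	skb->next = sd->defer_list;
	WRITE_ONCE(sd->defer_list, skb);	/* read locklessly by the flusher */
	kick = ++sd->defer_count == 128;	/* original fixed threshold */
	spin_unlock_bh(&sd->defer_lock);

	/* Raise NET_RX_SOFTIRQ remotely if the target CPU is not draining. */
	if (unlikely(kick))
		smp_call_function_single_async(cpu, &sd->defer_csd);
}
```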
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git
Conflicts:
- Context difference due to missing upstream commit 1cface552a5b ("net:
add skb_data_unref() helper") in c9s.
commit ba8de796baf4bdc03530774fb284fe3c97875566
Author: Yan Zhai <yan@cloudflare.com>
Date: Mon Jun 17 11:09:09 2024 -0700
net: introduce sk_skb_reason_drop function
Long used destructors kfree_skb and kfree_skb_reason do not pass
receiving socket to packet drop tracepoints trace_kfree_skb.
This makes it hard to track packet drops of a certain netns (container)
or a socket (user application).
The naming of these destructors is also not consistent with most sk/skb
operating functions, i.e. functions named "sk_xxx" or "skb_xxx".
Introduce a new function, sk_skb_reason_drop, as a drop-in replacement for
kfree_skb_reason on local receiving path. Callers can now pass receiving
sockets to the tracepoints.
kfree_skb and kfree_skb_reason are still usable but they are now just
inline helpers that call sk_skb_reason_drop.
Note it is not feasible to do the same to consume_skb. Packets not
dropped can flow through multiple receive handlers, and have multiple
receiving sockets. Leave it untouched for now.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
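A minimal sketch of the wrapper relationship described above:
```
static inline void
kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
{
	/* No receiving socket is known on this path. */
	sk_skb_reason_drop(NULL, skb, reason);
}

/* Local receive paths can instead pass the receiving socket, e.g.
 *	sk_skb_reason_drop(sk, skb, reason);
 * so that trace_kfree_skb() can report it. */
```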
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git
commit c53795d48ee8f385c6a9e394651e7ee914baaeba
Author: Yan Zhai <yan@cloudflare.com>
Date: Mon Jun 17 11:09:04 2024 -0700
net: add rx_sk to trace_kfree_skb
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.
Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-39781
CVE: CVE-2024-36929
Tested: compile only
commit d091e579b864fa790dd6a0cd537a22c383126681
Author: Felix Fietkau <nbd@nbd.name>
Date: Sat Apr 27 20:24:19 2024 +0200
net: core: reject skb_copy(_expand) for fraglist GSO skbs
SKB_GSO_FRAGLIST skbs must not be linearized, otherwise they become
invalid. Return NULL if such an skb is passed to skb_copy or
skb_copy_expand, in order to prevent a crash on a potential later
call to skb_gso_segment.
Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Xin Long <lxin@redhat.com>
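The guard amounts to something like the following at the top of skb_copy()/skb_copy_expand() (sketch):
```
	/* Fraglist GSO skbs must not be linearized; refuse to copy them so a
	 * later skb_gso_segment() cannot crash on an invalid skb. */
	if (skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST)
		return NULL;
```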
JIRA: https://issues.redhat.com/browse/RHEL-31941
Conflicts:
- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c: chunk skipped due to
missing 93188e9642c3ce ("net: hns3: support skb's frag page recycling
based on page pool")
- drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c chunk skipped
due to missing b2e3406a38f0f4 ("octeontx2-pf: Add support for page
pool")
Upstream commit(s):
commit 09d96ee5674a0eaa800c664353756ecc45c4a87f
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date: Fri Oct 20 17:59:49 2023 +0800
page_pool: remove PP_FLAG_PAGE_FRAG
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count
handling is unified and page_pool_alloc_frag() is supported
in 32-bit arch with 64-bit DMA, so remove it.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31941
Upstream commit(s):
commit 4a36d0180c452c3482792e0ff14e2bcf536a9284
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Fri Aug 4 20:05:29 2023 +0200
net: skbuff: always try to recycle PP pages directly when in softirq
Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple of other majors.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through ptr_ring and locks, although most of time we're actually inside
the NAPI polling this PP is linked with, so that it would be perfectly
safe to recycle pages directly.
Let's address such cases. If @napi_safe is true, we're fine, don't change
anything for this path. But if it's false, check whether we are in the
softirq context. It will most likely be so and then if ->list_owner
is our current CPU, we're good to use direct recycling, even though
@napi_safe is false -- concurrent access is excluded. in_softirq()
protection is needed mostly because we can hit this place in
process context (not in hardirq, though).
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
alloc_slow counter doesn't change most of time, which means the
MM layer is not even called to allocate any new pages.
Suggested-by: Jakub Kicinski <kuba@kernel.org> # in_softirq()
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
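A hedged sketch of the relaxed condition in napi_pp_put_page() (shape assumed):
```
	if (napi_safe || in_softirq()) {
		const struct napi_struct *napi = READ_ONCE(pp->p.napi);

		/* Direct recycling is allowed if the pool's NAPI is being
		 * polled on this very CPU, even though @napi_safe is false. */
		allow_direct = napi &&
			READ_ONCE(napi->list_owner) == smp_processor_id();
	}
```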
JIRA: https://issues.redhat.com/browse/RHEL-31941
Upstream commit(s):
commit 5b899c33b3b852b9559b724cfee67801324a0886
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Fri Aug 4 20:05:27 2023 +0200
net: skbuff: avoid accessing page_pool if !napi_safe when returning page
Currently, pp->p.napi is always read, but the actual variable it gets
assigned to is read-only when @napi_safe is true. For the !napi_safe
cases, which are still plentiful, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline access. For example, when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the condition block itself. Constify it while we are
here, because why not.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32108
__napi_alloc_skb() is napi_alloc_skb() with the added flexibility
of choosing gfp_mask. This is a NAPI function, so GFP_ATOMIC is
implied. The only practical choice the caller has is whether to
set __GFP_NOWARN. But that's a false choice, too, allocation failures
in atomic context will happen, and printing warnings in logs,
effectively for a packet drop, is both too much and very likely
non-actionable.
This leads me to a conclusion that most uses of napi_alloc_skb()
are simply misguided, and should use __GFP_NOWARN in the first
place. We also have a "standard" way of reporting allocation
failures via the queue stat API (qstats::rx-alloc-fail).
The direct motivation for this patch is that one of the drivers
used at Meta calls napi_alloc_skb() (so prior to this patch without
__GFP_NOWARN), and the resulting OOM warning is the top networking
warning in our fleet.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240327040213.3153864-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 6e9b01909a811555ff3326cf80a5847169c57806)
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
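In caller terms the change amounts to this (sketch; the failure is meant to be accounted via queue stats rather than warned about):
```
	/* GFP_ATOMIC | __GFP_NOWARN are implied by the NAPI helper now. */
	skb = napi_alloc_skb(napi, len);
	if (unlikely(!skb)) {
		/* e.g. bump the driver's rx-alloc-fail qstat and drop */
		return NULL;
	}
```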
JIRA: https://issues.redhat.com/browse/RHEL-31751
Conflicts: context around #include in net/core/skbuff.c
commit 75eaf63ea7afeafd026ffef03bdc69e31f10829b
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Fri Aug 4 20:05:25 2023 +0200
net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
Currently, touching <net/page_pool/types.h> triggers a rebuild of more
than half of the kernel. That's because it's included in
<linux/skbuff.h>. And each new include to page_pool/types.h adds more
[useless] data for the toolchain to process per each source file from
that pile.
In commit 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as
long as page pool owns the page") one of the calls was removed, so only
one was left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having very niche page_pool_types.h included only for that
looks like overkill.
As %PP_SIGNATURE is not local to page_pool.c (was only in the
early submissions), nothing holds this function there. Teleport
page_pool_return_skb_page() to skbuff.c, just next to the main consumer,
skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't
work with skbs at all and the former name tells nothing. The #if guards
here are only there so it is not compiled into the vmlinux when not needed
-- both call sites are already guarded.
Now, touching page_pool_types.h only triggers rebuilding of the drivers
using it and a couple of core networking files.
Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
Conflicts:
- Context difference due to missing upstream commit bf9f1baa279f ("net:
add dedicated kmem_cache for typical/small skb->head") and follow-ups
in c9s.
commit 915d975b2ffa58a14bfcf16fafe00c41315949ff
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Aug 31 18:37:50 2023 +0000
net: deal with integer overflows in kmalloc_reserve()
Blamed commit changed:
ptr = kmalloc(size);
if (ptr)
size = ksize(ptr);
to:
size = kmalloc_size_roundup(size);
ptr = kmalloc(size);
This allowed various crash as reported by syzbot [1]
and Kyle Zeng.
Problem is that if @size is bigger than 0x80000001,
kmalloc_size_roundup(size) returns 2^32.
kmalloc_reserve() uses a 32bit variable (obj_size),
so 2^32 is truncated to 0.
kmalloc(0) returns ZERO_SIZE_PTR which is not handled by
skb allocations.
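An illustration of that truncation (not the fix itself; values from the description above):
```
	size_t request = 0x80000001;			/* > 2 GiB head size       */
	size_t rounded = kmalloc_size_roundup(request);	/* returns 2^32 here       */
	unsigned int obj_size = rounded;		/* 32-bit var: becomes 0   */
	void *ptr = kmalloc(obj_size, GFP_ATOMIC);	/* ZERO_SIZE_PTR, not NULL */
```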
Following trace can be triggered if a netdev->mtu is set
close to 0x7fffffff
We might in the future limit netdev->mtu to more sensible
limit (like KMALLOC_MAX_SIZE).
This patch is based on a syzbot report, and also a report
and tentative fix from Kyle Zeng.
[1]
BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554
CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call trace:
dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
print_report+0xe4/0x4b4 mm/kasan/report.c:398
kasan_report+0x150/0x1ac mm/kasan/report.c:495
kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189
memset+0x40/0x70 mm/kasan/shadow.c:44
__build_skb_around net/core/skbuff.c:294 [inline]
__alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
alloc_skb include/linux/skbuff.h:1316 [inline]
igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359
add_grec+0x81c/0x1124 net/ipv4/igmp.c:534
igmpv3_send_cr net/ipv4/igmp.c:667 [inline]
igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810
call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers+0x54c/0x710 kernel/time/timer.c:1790
run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803
_stext+0x380/0xfbc
____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79
call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891
do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84
invoke_softirq kernel/softirq.c:437 [inline]
__irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683
irq_exit_rcu+0x14/0x78 kernel/softirq.c:695
el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717
__el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724
el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729
el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584
Fixes: 12d6c1d3a2ad ("skbuff: Proactively round up to kmalloc bucket size")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
commit 5c0e820cbbbe2d1c4cea5cd2bfc1302c123436df
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 6 17:31:02 2023 +0000
net: factorize code in kmalloc_reserve()
All kmalloc_reserve() callers have to make the same computation,
we can factorize them, to prepare following patch in the series.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
commit 65998d2bf857b9ae5acc1f3b70892bd1b429ccab
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 6 17:31:01 2023 +0000
net: remove osize variable in __alloc_skb()
This is a cleanup patch, to prepare following change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
commit 115f1a5c42bdad9a9ea356fc0b4a39ec7537947f
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 6 17:31:00 2023 +0000
net: add SKB_HEAD_ALIGN() helper
We have many places using this expression:
SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
Use of SKB_HEAD_ALIGN() will allow us to clean them up.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
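The helper presumably boils down to the following (sketch based on the description above):
```
#define SKB_HEAD_ALIGN(X) (SKB_DATA_ALIGN(X) + \
			   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
```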
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
commit 12d6c1d3a2ad0c199ec57c201cdc71e8e157a232
Author: Kees Cook <keescook@chromium.org>
Date: Tue Oct 25 15:39:35 2022 -0700
skbuff: Proactively round up to kmalloc bucket size
Instead of discovering the kmalloc bucket size _after_ allocation, round
up proactively so the allocation is explicitly made for the full size,
allowing the compiler to correctly reason about the resulting size of
the buffer through the existing __alloc_size() hint.
This will allow for kernels built with CONFIG_UBSAN_BOUNDS or the
coming dynamic bounds checking under CONFIG_FORTIFY_SOURCE to gain
back the __alloc_size() hints that were temporarily reverted in commit
93dd04ab0b2b ("slab: remove __alloc_size attribute from __kmalloc_track_caller")
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20221021234713.you.031-kees@kernel.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221025223811.up.360-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
commit a5df6333f1a08380c3b94a02105482263711ed3a
Author: Li RongQing <lirongqing@baidu.com>
Date: Wed Sep 22 14:17:19 2021 +0800
skbuff: pass the result of data ksize to __build_skb_around
Avoid calling ksize again in __build_skb_around by passing
the result of the data ksize to __build_skb_around.
nginx stress test shows this change can reduce ksize cpu usage,
and give a little performance boost
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3584
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: mostly RHEL-only patches
This series adds reserved fields to networking structs, and excludes
some areas of networking from the kABI guarantee. These reserved
fields are only needed during backports to z-stream.
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: RHEL-only
Add 16 bytes of reserved space to sk_buff, using the same mechanism as
in RHEL8 to help make its future usage less error-prone.
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-16983
Conflicts:
- net/core/skbuff.c:
adjusted context conflict due to missing 78476d315e1905 ("mctp: Add flow
extension to skb")
- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h:
adjusted context conflict due to missing 87a9b2fd9288c5 ("net: hns3: add
support for TX push mode")
- drivers/net/ethernet/mediatek/mtk_eth_soc.[c|h]
Chunks omitted due to lack of page_pool support in the driver. Missing
upstream commit 23233e577ef973 ("net: ethernet: mtk_eth_soc: rely on
page_pool for single page buffers")
- drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
adjusted context conflict due to missing 67f245c2ec0af1 ("mlx5:
bpf_xdp_metadata_rx_hash add xdp rss hash type")
- drivers/net/ethernet/microsoft/mana/mana_en.c
adjusted context conflict due to missing 92272ec4107ef4 ("eth: add
missing xdp.h includes in drivers")
- drivers/net/veth.c
Chunks omitted due to missing 0ebab78cbcbfd6 ("net: veth: add page_pool
for page recycling")
- Unmerged paths (missing in RHEL):
drivers/net/ethernet/engleder/tsnep_main.c,
drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c,
drivers/net/ethernet/microchip/lan966x/lan966x_main.h,
drivers/net/ethernet/wangxun/libwx/wx_lib.c
Upstream commit(s):
commit a9ca9f9ceff382b58b488248f0c0da9e157f5d06
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date: Fri Aug 4 20:05:24 2023 +0200
page_pool: split types and declarations from page_pool.h
Split types and pure function declarations from page_pool.h
and add them in page_pool/types.h, so that C sources can
include page_pool.h and headers should generally only include
page_pool/types.h as suggested by Jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com
[Jakub: change microsoft/mana, fix kdoc paths in Documentation]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
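The intended include pattern after the split looks roughly like this:
```
/* Headers (and most code that only needs the types): */
#include <net/page_pool/types.h>

/* C sources that actually call page_pool helpers: */
#include <net/page_pool/helpers.h>
```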
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: net-next.git
commit d86e5fbd4c965fdda72f99ccd54a1031ea4df51d
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Oct 3 18:19:20 2023 +0000
net: skb_queue_purge_reason() optimizations
1) Exit early if the list is empty.
2) splice the list into a local list,
so that we block hard irqs only once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231003181920.3280453-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
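A sketch of the two optimizations combined (not the verbatim diff):
```
void skb_queue_purge_reason(struct sk_buff_head *list,
			    enum skb_drop_reason reason)
{
	struct sk_buff_head tmp;
	unsigned long flags;
	struct sk_buff *skb;

	if (skb_queue_empty_lockless(list))	/* 1) exit early if empty  */
		return;

	__skb_queue_head_init(&tmp);
	spin_lock_irqsave(&list->lock, flags);	/* 2) block hard irqs once */
	skb_queue_splice_init(list, &tmp);
	spin_unlock_irqrestore(&list->lock, flags);

	while ((skb = __skb_dequeue(&tmp)) != NULL)
		kfree_skb_reason(skb, reason);
}
```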
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git
commit 4025d3e73abde4f65f4b04d4b1d8449b00e31473
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Aug 18 09:40:39 2023 +0000
net: add skb_queue_purge_reason and __skb_queue_purge_reason
skb_queue_purge() and __skb_queue_purge() become wrappers
around the new generic functions.
New SKB_DROP_REASON_QUEUE_PURGE drop reason is added,
but users can start adding more specific reasons.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
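The wrappers mentioned above reduce to (sketch):
```
static inline void skb_queue_purge(struct sk_buff_head *list)
{
	skb_queue_purge_reason(list, SKB_DROP_REASON_QUEUE_PURGE);
}

static inline void __skb_queue_purge(struct sk_buff_head *list)
{
	__skb_queue_purge_reason(list, SKB_DROP_REASON_QUEUE_PURGE);
}
```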
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git
commit 8fa66e4a1bdd41d55d7842928e60a40fed65715d
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 19 19:00:05 2023 -0700
net: skbuff: update and rename __kfree_skb_defer()
__kfree_skb_defer() uses the old naming where "defer" meant
slab bulk free/alloc APIs. In the meantime we also made
__kfree_skb_defer() feed the per-NAPI skb cache, which
implies bulk APIs. So take away the 'defer' and add 'napi'.
While at it add a drop reason. This only matters on the
tx_action path, if the skb has a frag_list. But getting
rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
benefit so why not.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3196
JIRA: https://issues.redhat.com/browse/RHEL-12613
Tested: Using LNST net-driver test-suite on i40e, bnxt_en, ice and mlx5_core [http://dashboard.lnst.anl.lab.eng.bos.redhat.com/pipeline/3644]
Commits:
```
4727bab4e9bb ("net: skb: move skb_pp_recycle() to skbuff.c")
eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list")
9dde0cd3b10f ("net: introduce skb_poison_list and use in kfree_skb_list")
b07a2d97ba5e ("net: skb: plumb napi state thru skb freeing paths")
8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI")
dd64b232deb8 ("page_pool: unlink from napi during destroy")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-12613
Conflicts:
- simple context conflict in net/core/dev.c due to absence of commit
8b43fd3d1d7d8 ("net: optimize ____napi_schedule() to avoid extra
NET_RX_SOFTIRQ") that is out of scope of this series
commit 8c48eea3adf3119e0a3fc57bd31f6966f26ee784
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 12 21:26:04 2023 -0700
page_pool: allow caching from safely localized NAPI
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.
Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).
If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.
Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).
The CPU use of relevant functions decreases as expected:
page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%
Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.
The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
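A hedged sketch of how the locality proof described above fits together (field name follows the commit; the exact set/clear points are assumptions):
```
	/* NAPI side: record which CPU currently owns (polls) this NAPI. */
	WRITE_ONCE(napi->list_owner, smp_processor_id());	/* when scheduled */
	WRITE_ONCE(napi->list_owner, -1);			/* when complete  */

	/* Recycling side: pages may go to the lockless direct cache only when
	 * the pool's linked NAPI is owned by the current CPU. */
	allow_direct = napi_safe && napi &&
		       READ_ONCE(napi->list_owner) == smp_processor_id();
```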
JIRA: https://issues.redhat.com/browse/RHEL-12613
Conflicts:
- adjusted due to lack of commit bf9f1baa279f ("net: add dedicated
kmem_cache for typical/small skb->head")
commit b07a2d97ba5ef154fe736aa510e43a3299eee5f8
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 12 21:26:03 2023 -0700
net: skb: plumb napi state thru skb freeing paths
We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.
Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14364
Tested: LNST, Tier1
Upstream commit:
commit 2ea35288c83b3d501a88bc17f2df8f176b5cc96f
Author: Mohamed Khalfella <mkhalfella@purestorage.com>
Date: Thu Aug 31 02:17:02 2023 -0600
skbuff: skb_segment, Call zero copy functions before using skbuff frags
Commit bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may change the number of fragments or allocate new fragments
altogether, leaving nrfrags and frag pointing to stale values. This can
cause a panic with stacktrace like the one below.
[ 193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[ 193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G O 5.15.123+ #26
[ 193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[ 194.021892] Call Trace:
[ 194.027422] <TASK>
[ 194.072861] tcp_gso_segment+0x107/0x540
[ 194.082031] inet_gso_segment+0x15c/0x3d0
[ 194.090783] skb_mac_gso_segment+0x9f/0x110
[ 194.095016] __skb_gso_segment+0xc1/0x190
[ 194.103131] netem_enqueue+0x290/0xb10 [sch_netem]
[ 194.107071] dev_qdisc_enqueue+0x16/0x70
[ 194.110884] __dev_queue_xmit+0x63b/0xb30
[ 194.121670] bond_start_xmit+0x159/0x380 [bonding]
[ 194.128506] dev_hard_start_xmit+0xc3/0x1e0
[ 194.131787] __dev_queue_xmit+0x8a0/0xb30
[ 194.138225] macvlan_start_xmit+0x4f/0x100 [macvlan]
[ 194.141477] dev_hard_start_xmit+0xc3/0x1e0
[ 194.144622] sch_direct_xmit+0xe3/0x280
[ 194.147748] __dev_queue_xmit+0x54a/0xb30
[ 194.154131] tap_get_user+0x2a8/0x9c0 [tap]
[ 194.157358] tap_sendmsg+0x52/0x8e0 [tap]
[ 194.167049] handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[ 194.173631] handle_tx+0xcd/0xe0 [vhost_net]
[ 194.176959] vhost_worker+0x76/0xb0 [vhost]
[ 194.183667] kthread+0x118/0x140
[ 194.190358] ret_from_fork+0x1f/0x30
[ 194.193670] </TASK>
In this case calling skb_orphan_frags() updated nr_frags, leaving the nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.
Fixes: bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
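A sketch of the reordering in skb_segment() (simplified):
```
		/* Orphan/clone the zerocopy frags first: skb_orphan_frags()
		 * may reallocate the frag array or change nr_frags. */
		if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
		    skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
			goto err;

		/* Only then snapshot the (now stable) frag state. */
		nfrags = skb_shinfo(frag_skb)->nr_frags;
		frag = skb_shinfo(frag_skb)->frags;
```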